Overview: a plugin that extracts part of Plagger::Entry->body.
When P::P::Filter::EntryFullText stuffs an entire HTML page into Plagger::Entry->body, I wanted to keep only what sits under the body element, so I wrote this.
deps/Filter-ExtractBody.yaml
name: Filter::ExtractBody
author: Naoki Okamura
depends:
  HTML::TreeBuilder::XPath: 0
lib/Plagger/Plugin/Filter/ExtractBody.pm
package Plagger::Plugin::Filter::ExtractBody;
use strict;
use warnings;

use Plagger::Util;
use Plagger::Text;
use HTML::TreeBuilder::XPath;

use base qw( Plagger::Plugin );

sub register {
    my ( $self, $c ) = @_;

    # Call update() for each entry at the update.entry.fixup hook.
    $c->register_hook(
        $self,
        'update.entry.fixup' => $self->can('update'),
    );
}

sub update {
    my ( $self, $c, $args ) = @_;

    my $entry = $args->{'entry'};

    # Leave non-HTML bodies untouched.
    return if ( ! $entry->body->is_html );

    my $body = $entry->body->data;
    $body = $self->extract( $body );
    $body = Plagger::Text->new( type => 'html', data => $body );

    $entry->body( $body );

    return 1;
}

sub extract {
    my ( $self, $text ) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse( $text );
    $tree->eof;

    # Fall back to the body element when no xpath is configured.
    my $xpath = $self->conf->{'xpath'} || '//body';

    # Swap HTML::Element's internal XML escaper for escape_xml()
    # until the end of this sub.
    no warnings 'redefine';
    local *HTML::Element::_xml_escape = $self->can('escape_xml');
    use warnings;

    # Concatenate every matched node: elements as XML, others by value.
    my $body = q{};
    for my $node ( $tree->findnodes( $xpath ) ) {
        $body .= ( $node->isElementNode ) ? $node->as_XML : $node->getValue ;
    }

    return $body;
}

# Escapes its arguments in place; $x aliases the elements of @_.
sub escape_xml {
    for my $x ( @_ ) {
        $x = Plagger::Util::encode_xml( $x );
    }
}

1;
__END__
t/plugins/Filter-ExtractBody/base.t
use strict;
use t::TestPlagger;

test_plugin_deps;
plan 'no_plan';
run_eval_expected;

__END__

=== Loading Filter::ExtractBody
--- input config
plugins:
  - module: CustomFeed::Debug
    config:
      title: title
      entry:
        - link: file://$t::TestPlagger::BaseDirURI/t/samples/xoxo.html
  - module: Filter::EntryFullText
    config:
      store_html_on_failure: 1
  - module: Filter::ExtractBody
    config:
      xpath: //ul/li/a[@title="blog.bulknews.net"]
--- expected
my $content = q{<a href="http://blog.bulknews.net/mt/" title="blog.bulknews.net">blog.bulknews.net</a>
};

is(
    $context->update->feeds->[0]->entries->[0]->body->html,
    $content,
);
The license is the same as Perl's. Usage looks like this.
plugins:
  - module: Filter::EntryFullText
    config:
      store_html_on_failure: 1
  - module: Filter::ExtractBody
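The plugin also reads an optional xpath key from its config and falls back to //body when the key is missing, so the extracted region can be narrowed down. A minimal sketch, assuming the fetched page wraps its article in an element with id "content" (that id is made up for illustration, not taken from any real page):

```yaml
plugins:
  - module: Filter::EntryFullText
    config:
      store_html_on_failure: 1
  - module: Filter::ExtractBody
    config:
      # Hypothetical expression; replace it with one that matches your page.
      xpath: //div[@id="content"]
```

Every node the expression matches is appended to the new body: element nodes are serialized with as_XML, and anything else contributes its value.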
It might be worth combining this with something like P::P::Filter::Diff.
By the way, I haven't written the POD yet.