カラクリスタ

The blog of someone who never had a "glorious youth"

Plagger::Plugin::Filter::ExtractBody

Summary: a plugin that extracts part of Plagger::Entry->body.


When P::P::Filter::EntryFullText stuffs an entire HTML page into Plagger::Entry->body, I wanted to keep only the body element and everything under it, so I wrote this.

deps/Filter-ExtractBody.yaml

name: Filter::ExtractBody
author: Naoki Okamura
depends:
  HTML::TreeBuilder::XPath: 0

lib/Plagger/Plugin/Filter/ExtractBody.pm

package Plagger::Plugin::Filter::ExtractBody;
use strict;
use warnings;
use Plagger::Util;
use Plagger::Text;
use HTML::TreeBuilder::XPath;
use base qw( Plagger::Plugin );
sub register {
    my ( $self, $c ) = @_;
    $c->register_hook(
        $self,
        'update.entry.fixup' => $self->can('update'),
    );
}
sub update {
    my ( $self, $c, $args ) = @_;
    my $entry = $args->{'entry'};
    # Leave entries whose body is not HTML untouched.
    return if ( ! $entry->body->is_html );
    my $body = $entry->body->data;
       $body = $self->extract( $body );
       $body = Plagger::Text->new( type => 'html', data => $body );
    $entry->body( $body );
    return 1;
}
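sub extract {
    my ( $self, $text ) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse( $text );
    $tree->eof;
    # Fall back to the body element when no xpath is configured.
    my $xpath = $self->conf->{'xpath'} || '//body';
    # Temporarily make as_XML escape text with Plagger::Util::encode_xml.
    no warnings 'redefine';
    local *HTML::Element::_xml_escape = $self->can('escape_xml');
    use warnings;
    my $body = q{};
    for my $node ( $tree->findnodes( $xpath ) ) {
        # Serialize elements as XML; use the string value of other nodes.
        $body .= ( $node->isElementNode ) ? $node->as_XML : $node->getValue ;
    }
    $tree->delete;    # free the parse tree explicitly
    return $body;
}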
# Stand-in for HTML::Element::_xml_escape: escapes each argument in place.
sub escape_xml {
    for my $x ( @_ ) {
        $x = Plagger::Util::encode_xml( $x );
    }
}
1;
__END__
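
To try XPath expressions without running Plagger, the core of the extract method can be sketched as a standalone script. This is only a rough sketch: extract_fragment and the sample HTML are made up for illustration, and it uses HTML::Element's default escaping instead of swapping in Plagger::Util::encode_xml as the plugin does.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper (not part of the plugin): returns the concatenated
# serialization of every node in $html that matches $xpath.
sub extract_fragment {
    my ( $html, $xpath ) = @_;
    $xpath = '//body' unless defined $xpath && length $xpath;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse( $html );
    $tree->eof;

    my $out = q{};
    for my $node ( $tree->findnodes( $xpath ) ) {
        # Elements are serialized as XML; text and attribute nodes
        # contribute their string value instead.
        $out .= $node->isElementNode ? $node->as_XML : $node->getValue;
    }

    $tree->delete;    # release the parse tree
    return $out;
}

# Made-up sample input, for illustration only.
my $html = <<'HTML';
<html><head><title>sample</title></head>
<body><ul><li><a href="http://example.com/" title="example">Example</a></li></ul></body></html>
HTML

# An element match prints the serialized <a> element.
print extract_fragment( $html, '//ul/li/a[@title="example"]' );

# An attribute match prints the attribute value.
print extract_fragment( $html, '//a/@href' ), "\n";
```

Calling extract_fragment( $html ) with no XPath falls back to //body, mirroring the plugin's default.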

t/plugins/Filter-ExtractBody/base.t

use strict;
use t::TestPlagger;
test_plugin_deps;
plan 'no_plan';
run_eval_expected;
__END__
=== Loading Filter::ExtractBody
--- input config
plugins:
  - module: CustomFeed::Debug
    config:
      title: title
      entry:
        - link: file://$t::TestPlagger::BaseDirURI/t/samples/xoxo.html
  - module: Filter::EntryFullText
    config:
      store_html_on_failure: 1
  - module: Filter::ExtractBody
    config:
      xpath: //ul/li/a[@title="blog.bulknews.net"]
--- expected
my $content = q{<a href="http://blog.bulknews.net/mt/" title="blog.bulknews.net">blog.bulknews.net</a>
};
is(
    $context->update->feeds->[0]->entries->[0]->body->html,
    $content,
);

The license is the same as Perl's. Usage looks like this.

plugins:
  - module: Filter::EntryFullText
    config:
      store_html_on_failure: 1
  - module: Filter::ExtractBody
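
The plugin also reads an optional xpath setting, which defaults to //body when omitted (the test above uses it to pick out a single link). A configuration that narrows the extraction might look like the following; the XPath expression here is only an illustrative assumption about the target page.

```yaml
plugins:
  - module: Filter::EntryFullText
    config:
      store_html_on_failure: 1
  - module: Filter::ExtractBody
    config:
      # Optional; the plugin falls back to //body when xpath is not set.
      xpath: //div[@id="content"]
```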

It might work well combined with something like P::P::Filter::Diff.

By the way, I haven't written the POD yet.