Plagger::Plugin::Filter::ExtractBody

Summary: A plugin that extracts part of Plagger::Entry->body.


When P::P::Filter::EntryFullText stuffs a whole HTML page into Plagger::Entry->body, I wanted to keep only the body element and what is under it, so I wrote this.
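The core idea can be tried outside Plagger. The following is only a minimal sketch under assumptions: `extract_fragment` and the sample `$page` are hypothetical names made up for illustration, and the default `//body` mirrors the plugin's fallback. It parses a full page with HTML::TreeBuilder::XPath and keeps only the nodes the XPath selects.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper (not part of the plugin): serialize the nodes that
# $xpath selects in $html. Falls back to //body, as the plugin does.
sub extract_fragment {
    my ( $html, $xpath ) = @_;
    $xpath ||= '//body';

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my $out = q{};
    for my $node ( $tree->findnodes($xpath) ) {
        # Elements are serialized as markup; text and attribute nodes
        # contribute their string value instead.
        $out .= $node->isElementNode ? $node->as_XML : $node->getValue;
    }

    $tree->delete;    # release the parse tree
    return $out;
}

# Made-up sample page: the <head> part should be dropped.
my $page = '<html><head><title>t</title></head>'
         . '<body><p>Hello</p></body></html>';

print extract_fragment($page);
```

Passing a second argument such as `'//p'` would narrow the result further, which is roughly what the plugin's `xpath` setting does.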

deps/Filter-ExtractBody.yaml

name: Filter::ExtractBody
author: Naoki Okamura
depends:
  HTML::TreeBuilder::XPath: 0

lib/Plagger/Plugin/Filter/ExtractBody.pm

package Plagger::Plugin::Filter::ExtractBody;
use strict;
use warnings;
use Plagger::Util;
use Plagger::Text;
use HTML::TreeBuilder::XPath;
use base qw( Plagger::Plugin );

sub register {
    my ( $self, $c ) = @_;
    $c->register_hook(
        $self,
        'update.entry.fixup' => $self->can('update'),
    );
}

# Replace an HTML entry body with the fragment selected by the XPath.
sub update {
    my ( $self, $c, $args ) = @_;
    my $entry = $args->{'entry'};

    # Leave non-HTML bodies untouched.
    return if ( ! $entry->body->is_html );

    my $body = $entry->body->data;
    $body = $self->extract( $body );
    $body = Plagger::Text->new( type => 'html', data => $body );
    $entry->body( $body );
    return 1;
}

# Parse $text and return the serialized nodes matching the configured
# XPath (defaults to //body).
sub extract {
    my ( $self, $text ) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse( $text );
    $tree->eof;

    my $xpath = $self->conf->{'xpath'} || '//body';

    # Temporarily swap HTML::Element's escaper for Plagger's encode_xml.
    no warnings 'redefine';
    local *HTML::Element::_xml_escape = $self->can('escape_xml');
    use warnings;

    my $body = q{};
    for my $node ( $tree->findnodes( $xpath ) ) {
        $body .= ( $node->isElementNode ) ? $node->as_XML : $node->getValue ;
    }

    $tree->delete;    # release the parse tree
    return $body;
}

# Escapes its arguments in place (they are aliases into @_).
sub escape_xml {
    for my $x ( @_ ) {
        $x = Plagger::Util::encode_xml( $x );
    }
}

1;
__END__

t/plugins/Filter-ExtractBody/base.t

use strict;
use t::TestPlagger;
test_plugin_deps;
plan 'no_plan';
run_eval_expected;
__END__
=== Loading Filter::ExtractBody
--- input config
plugins:
  - module: CustomFeed::Debug
    config:
      title: title
      entry:
        - link: file://$t::TestPlagger::BaseDirURI/t/samples/xoxo.html
  - module: Filter::EntryFullText
    config:
      store_html_on_failure: 1
  - module: Filter::ExtractBody
    config:
      xpath: //ul/li/a[@title="blog.bulknews.net"]
--- expected
my $content = q{<a href="http://blog.bulknews.net/mt/" title="blog.bulknews.net">blog.bulknews.net</a>
};
is(
    $context->update->feeds->[0]->entries->[0]->body->html,
    $content,
);

The license is the same as Perl's. Usage looks like this (with no xpath setting, the plugin falls back to //body):

plugins:
  - module: Filter::EntryFullText
    config:
      store_html_on_failure: 1
  - module: Filter::ExtractBody

By the way, I haven't written the POD yet.

#FIXME

