
How Can I Parse Only Part Of An Html File And Ignore The Rest?

In each of 5,000 HTML files I need only one line of text: line 999. How can I tell HTML::Parser to fetch just line 999?

Solution 1:

Do you mean the 999th line or the 999th table row?

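The former might be:

perl -ne 'print if $. == 999; close ARGV if eof' /path/to/*.dat

The "close ARGV if eof" part resets the line counter $. at the end of each file. Without it, $. keeps counting across all the files and only line 999 of the combined input is printed.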
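If a script is easier to maintain than a one-liner, the same idea could be sketched as below. This is only an illustration: nth_line is a hypothetical helper name, and it opens each file separately so the line count starts over for every file and reading stops as soon as the wanted line is found.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return line number $n (1-based) of $path, or undef if the file is shorter.
sub nth_line {
    my ($path, $n) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        # $. is the line number of the filehandle read most recently
        if ($. == $n) {
            close $fh;
            return $line;
        }
    }
    close $fh;
    return;
}

# Print line 999 of every file named on the command line.
for my $path (@ARGV) {
    my $line = nth_line($path, 999);
    print defined $line ? $line : "$path: fewer than 999 lines\n";
}
```

Run it as, for example, perl nth_line.pl /path/to/*.dat (the script name is made up).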

The latter would involve an HTML parser and some selection logic. A SAX-style (event-driven) parser might be better for fast processing of a large number of files, because it does not have to build a tree for the whole document. How well that works probably depends on which version of HTML is used and whether it is "well-formed".
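For the table-row reading of the question, an event-driven approach with HTML::Parser could look roughly like this. It is a sketch under assumptions, not a tested solution for the real pages: nth_row_text is a hypothetical helper, it counts every tr start tag in the document, and it treats a nested table's row as the end of the current row. Calling eof from inside a handler is what stops the parser early, so the rest of the file is ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Return the text of the $n-th <tr> in $html, or undef if there are fewer rows.
sub nth_row_text {
    my ($html, $n) = @_;
    my ($count, $in_row, $text) = (0, 0, '');

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $self) = @_;
            return unless $tag eq 'tr';
            # A new <tr> while collecting means row $n is over (missing </tr>).
            if ($in_row) { $self->eof; return }
            $in_row = (++$count == $n);
        }, 'tagname,self' ],
        # Collect entity-decoded text only while inside the wanted row.
        text_h => [ sub { $text .= $_[0] if $in_row }, 'dtext' ],
        end_h => [ sub {
            my ($tag, $self) = @_;
            # Stop parsing here; everything after the row is skipped.
            $self->eof if $in_row && $tag eq 'tr';
        }, 'tagname,self' ],
    );

    # parse() returns false when a handler aborted it with eof.
    $p->eof if $p->parse($html);

    return undef unless $in_row;
    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return $text;
}
```

For files on disk, parse_file can be used in place of parse; according to the HTML::Parser documentation, eof called from a handler terminates parse_file as well.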

Perl has many XML and HTML parsers - did you have any particular module in mind?


EDIT:

Your problem seems to be your XPath expression. The actual HTML is much more complex than your XPath suggests. The following expression works better:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

## replace this with a loop over 5000 existing files
my $url = 'http://www.kultusportal-bw.de/'.
          'servlet/PB/menu/1188427/index.html'.
          '?COMPLETEHREF='.
          'http://www.kultus-bw.de/'.
          'did_abfrage/detail.php?id=04313488';
my $html = get $url;
die "Could not fetch $url\n" unless defined $html;

my $tree = HTML::TreeBuilder::XPath->new();
## within the loop process the html like this
$tree->parse($html);
$tree->eof;
print $tree->findvalue('//table[@bgcolor]/tr[1]');

Try copying the code above into a file and running it with Perl.
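The loop over the 5,000 local files mentioned in the comment might be sketched as follows. The directory /path/to and the helper name first_row_text are placeholders, and the XPath expression is simply the one from the code above; whether it matches depends on the real pages.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Return the text of the first row of each table that has a bgcolor attribute.
sub first_row_text {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file($fh);    # parses the whole file and calls eof itself
    close $fh;
    my $value = $tree->findvalue('//table[@bgcolor]/tr[1]');
    $tree->delete;             # free the tree before the next file
    return $value;
}

# Placeholder glob; point it at the directory holding the saved pages.
for my $file (glob '/path/to/*.html') {
    print "$file: ", first_row_text($file), "\n";
}
```

Deleting each tree after use keeps memory flat across thousands of files, since HTML::TreeBuilder trees contain circular references.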
