Python method to extract content (excluding navigation) from an HTML page


无人及你 2021-01-31 23:13

Of course an HTML page can be parsed using any number of Python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content

5 Answers

不思量自难忘° (OP)

2021-02-01 00:07

Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/

It's written by one of the founders of Django. Basically, you feed it a few example HTML files and it generates a "template" that you can then use to extract just the bits that differ between pages (which is usually the meaningful content).

Here's an example from the Google Code page:


# Import the Template class.
>>> from templatemaker import Template

# Create a Template instance.
>>> t = Template()

# Learn a Sample String.
>>> t.learn('this and that')

# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'this and that'

# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('alex and sue')
True

# Sure enough, the template now has some holes.
>>> t.as_text('!')
'! and !'
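
If you can't install templatemaker, the same idea can be roughed out with the standard library. The following is only a minimal sketch, not templatemaker's actual algorithm or API: it swaps in difflib.SequenceMatcher to find the text two samples share, treats everything else as a hole, and turns the result into a regular expression. The function names learn_template, as_text and extract are made up for this illustration, and it learns from exactly two samples.

```python
import difflib
import re


def learn_template(sample_a, sample_b):
    """Return a list of literal strings and None values (None marks a hole)."""
    # Split on whitespace but keep the whitespace runs as tokens.
    tokens_a = re.split(r"(\s+)", sample_a)
    tokens_b = re.split(r"(\s+)", sample_b)
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)

    template, pos_a, pos_b = [], 0, 0
    for block in matcher.get_matching_blocks():
        # Unmatched tokens before this block in either sample become a hole.
        if block.a > pos_a or block.b > pos_b:
            template.append(None)
        if block.size:
            template.append("".join(tokens_a[block.a:block.a + block.size]))
        pos_a, pos_b = block.a + block.size, block.b + block.size
    return template


def as_text(template, marker="!"):
    """Render the template, using `marker` for each hole."""
    return "".join(marker if part is None else part for part in template)


def extract(template, text):
    """Return the hole contents of `text`, or None if it doesn't fit."""
    pattern = "".join(
        "(.*?)" if part is None else re.escape(part) for part in template
    )
    match = re.fullmatch(pattern, text, flags=re.DOTALL)
    return match.groups() if match else None


template = learn_template("this and that", "alex and sue")
print(as_text(template))                      # ! and !
print(extract(template, "larry and curly"))   # ('larry', 'curly')
```

On real pages you would pass in the HTML of two pages from the same site; the shared navigation and boilerplate end up as literals, and the holes hold the page-specific content. Token-level diffing like this is crude, so expect to tune the tokenizer for real markup.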
