Python method to extract content (excluding navigation) from an HTML page

无人及你 2021-01-31 23:13

Of course an HTML page can be parsed using any number of Python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content.

5 answers
  •  星月不相逢
    2021-01-31 23:45

    Try the Beautiful Soup library for Python. It has very simple methods for extracting information from an HTML file.

    Extracting data generically from web pages would require people to write their pages in a similar way, but there's an almost infinite number of ways to build a page that looks identical, let alone all the combinations you can use to convey the same information.

    Was there a particular type of information you were trying to extract or some other end goal?

    You could try extracting any content in 'div' and 'p' elements and comparing the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed HTML!).
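The size comparison described above could be sketched roughly as follows. This is only an illustration, using the standard library's `html.parser` rather than Beautiful Soup so that it runs without extra installs. The names `BlockSizeParser` and `block_sizes` are made up for this example, and it assumes every 'div' and 'p' is explicitly closed.

```python
from html.parser import HTMLParser


class BlockSizeParser(HTMLParser):
    """Count the text characters inside every <div> and <p> element."""

    BLOCK_TAGS = {"div", "p"}

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of [tag, char_count] for blocks still open
        self.sizes = []        # (tag, char_count) for each closed block
        self.total = 0         # characters of text seen anywhere in the page

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.open_blocks.append([tag, 0])

    def handle_data(self, data):
        length = len(data.strip())
        self.total += length
        # Nested text counts toward every enclosing block.
        for block in self.open_blocks:
            block[1] += length

    def handle_endtag(self, tag):
        # Assumes well-formed markup: a closing tag matches the innermost open block.
        if self.open_blocks and self.open_blocks[-1][0] == tag:
            closed_tag, count = self.open_blocks.pop()
            self.sizes.append((closed_tag, count))


def block_sizes(html):
    """Return (tag, chars, share_of_page) tuples, largest block first."""
    parser = BlockSizeParser()
    parser.feed(html)
    parser.close()
    total = parser.total or 1  # avoid dividing by zero on an empty page
    rows = [(tag, count, count / total) for tag, count in parser.sizes]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Because nested text counts toward every enclosing block, the outermost 'div' will usually come out on top, which is exactly the grouping problem mentioned above.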

    Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text), you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?
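One possible reading of the tree idea, as a rough sketch: build a tree of 'div' and 'p' nodes, then walk down from the root for as long as a single child still holds more than half of the page's words. `Node`, `TreeBuilder`, `main_block` and the 0.5 `share` threshold are all assumptions made for this example, not part of any library, and the parser again assumes explicitly closed tags.

```python
from html.parser import HTMLParser


class Node:
    """One <div> or <p> (or the synthetic root) with its word counts."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_words = 0  # words directly inside this element

    def total_words(self):
        return self.own_words + sum(c.total_words() for c in self.children)


class TreeBuilder(HTMLParser):
    BLOCK_TAGS = {"div", "p"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_data(self, data):
        self.current.own_words += len(data.split())

    def handle_endtag(self, tag):
        # Step back up only when the closing tag matches the open node.
        if tag == self.current.tag and self.current.parent is not None:
            self.current = self.current.parent


def smallest_majority_node(node, threshold):
    """Descend while a single child alone holds more than `threshold` words."""
    for child in node.children:
        if child.total_words() > threshold:
            return smallest_majority_node(child, threshold)
    return node


def main_block(html, share=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    threshold = builder.root.total_words() * share
    return smallest_majority_node(builder.root, threshold)
```

For a page with a short navigation 'div' next to a 'div' holding two paragraphs, `main_block` stops at the second 'div', since neither paragraph alone carries the majority of the words.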

    [EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information. Some examples:

    +1 point for every 100 words
    +1 point for every child element that has > 100 words
    -1 points if the section name contains the word 'nav'
    -2 points if the section name contains the word 'advert'
    

    If you have lots of low-scoring rules which add up when you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
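The four example rules above might be written down like this. It is a minimal sketch under a few assumptions: `Section` and `score` are hypothetical names, the "section name" is taken to be something like the element's id or class text, and "every 100 words" is interpreted as counting the words of the section plus all of its descendants.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    name: str   # assumed to be the element's id/class text
    words: int  # words directly inside this section
    children: list = field(default_factory=list)

    def total_words(self):
        return self.words + sum(c.total_words() for c in self.children)


def score(section):
    """Apply the four example rules and return the summed points."""
    name = section.name.lower()
    points = section.total_words() // 100  # +1 point for every 100 words
    # +1 point for every child element that has > 100 words
    points += sum(1 for c in section.children if c.total_words() > 100)
    if "nav" in name:
        points -= 1
    if "advert" in name:
        points -= 2
    return points


article = Section("article-body", 50, [Section("para", 150), Section("para", 120)])
sidebar = Section("top-nav", 30)
banner = Section("advert-banner", 10)
ranked = sorted([sidebar, article, banner], key=score, reverse=True)
```

Here `ranked` puts the article section first. Each rule only nudges the total a little, so adding more rules is a matter of appending further small adjustments to `score`.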

    [EDIT2] Looking at Readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to try to understand tables better?
