I'm trying to parse some HTML in Python. There were some methods that actually worked before, but nowadays there's nothing I can use without workarounds.
I've used pyparsing for a number of HTML page scraping projects. It is a sort of middle ground between BeautifulSoup and the full HTML parsers on one end, and the too-low-level approach of regular expressions on the other (that way lies madness).
With pyparsing, you can often get good HTML scraping results by identifying the specific subset of the page or data that you are trying to extract. This approach avoids the issues of trying to parse everything on the page, since some problematic HTML outside of your region of interest could throw off a comprehensive HTML parser.
While this sounds like just a glorified regex approach, pyparsing offers built-ins for working with HTML- or XML-tagged text. Pyparsing avoids many of the pitfalls that frustrate regex-based solutions: tags in upper or lower case, attributes appearing in any order, attribute values in single quotes, double quotes, or no quotes at all, and unpredictable whitespace inside the tag.
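As a quick illustration, here is a minimal sketch (the tag fragments are made-up inputs) showing that a single makeHTMLTags expression matches several messy renderings of the same opening tag:

from pyparsing import makeHTMLTags

aStart, aEnd = makeHTMLTags("a")

# each of these variants matches the same expression, and the href
# attribute is available as a named result regardless of case or quoting
for fragment in ['<a href="top.html">',
                 "<A HREF = 'top.html' >",
                 '<a class=nav href=top.html>']:
    print(aStart.parseString(fragment).href)   # prints top.html each time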
Here's a simple example from the pyparsing wiki that gets <a> tags from a web page:
import urllib

from pyparsing import makeHTMLTags, SkipTo

# read HTML from a web page
page = urllib.urlopen("http://www.yahoo.com")
htmlText = page.read()
page.close()

# define a pyparsing expression to search for within the HTML:
# an opening <a> tag, the text up to the matching </a>, and the closing tag
anchorStart, anchorEnd = makeHTMLTags("a")
anchor = anchorStart + SkipTo(anchorEnd).setResultsName("body") + anchorEnd

# scanString finds every match in the text, skipping over everything else
for tokens, start, end in anchor.scanString(htmlText):
    print tokens.body, '->', tokens.href
This will pull out the <a> tags, even if there are other portions of the page containing problematic HTML. There are other HTML examples at the pyparsing wiki.
Pyparsing is not a totally foolproof solution to this problem, but because it exposes the parsing process to you, you can better control which pieces of the HTML you are specifically interested in, process them, and skip the rest.
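To show that kind of selective control, here is a sketch (the "headline" class and the sample snippet are made-up assumptions) that uses pyparsing's withAttribute helper so that only anchors with a particular class attribute are matched:

from pyparsing import makeHTMLTags, SkipTo, withAttribute

anchorStart, anchorEnd = makeHTMLTags("a")
# reject any opening <a> tag whose class attribute is not "headline";
# class is a Python keyword, hence the **{...} form
anchorStart.setParseAction(withAttribute(**{"class": "headline"}))
anchor = anchorStart + SkipTo(anchorEnd).setResultsName("body") + anchorEnd

htmlText = '''
<a class="headline" href="/story1">Big story</a>
<a href="/about">About us</a>
<a class="headline" href="/story2">Another story</a>
'''

for tokens, start, end in anchor.scanString(htmlText):
    print(tokens.href)   # prints /story1 and /story2 only

Tags that fail the attribute test just raise an internal parse exception, and scanString moves on to the next candidate match.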