Using Beautiful Soup to find specific class

不要未来只要你来 2021-02-02 02:50

I am trying to use Beautiful Soup to scrape housing price data from Zillow.

I fetch the web page by property ID, e.g. http://www.zillow.com/homes/for_sale/18429834_zpid/

3 Answers

夕颜 (original poster)

2021-02-02 03:29
The HTML you are parsing is not well-formed, and in cases like this choosing the right parser is crucial. BeautifulSoup currently supports three HTML parsers, each of which handles broken HTML differently:
- html.parser (built-in, no additional modules needed)
- lxml (the fastest, requires lxml to be installed)
- html5lib (the most lenient, requires html5lib to be installed)
The "Differences between parsers" section of the Beautiful Soup documentation describes these differences in more detail. To demonstrate the difference in your case:
```
>>> from bs4 import BeautifulSoup
>>> import requests
>>> 
>>> zpid = "18429834"
>>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
>>> response = requests.get(url)
>>> html = response.content
>>> 
>>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
0
>>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
3
>>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
3
```
As you can see, in your case, both html.parser and lxml do the job, but html5lib does not.
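Since the live page may change or block automated requests, the same comparison can be tried offline. The following is only a minimal sketch: `SAMPLE_HTML` is a made-up fragment of broken markup (not taken from Zillow), and `count_matches` is a hypothetical helper name. Parsers that are not installed are reported as `None` instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical broken markup for illustration only: the first <div> and the
# <p> are never closed. This is NOT the real Zillow page.
SAMPLE_HTML = """
<div class="home-summary-row"><span>row one</span>
<div class="home-summary-row extra"><span>row two</span></div>
<p>unclosed paragraph
"""


def count_matches(html, css_class, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <div> elements carrying css_class}.

    A parser that is not installed maps to None.
    """
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages
            counts[parser] = None
            continue
        counts[parser] = len(soup.find_all("div", class_=css_class))
    return counts


counts = count_matches(SAMPLE_HTML, "home-summary-row")
for parser, n in counts.items():
    print(parser, "->", n if n is not None else "not installed")
```

Running a check like this against the HTML you actually downloaded shows quickly which parser keeps the elements you need, before you commit to one in the scraper.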