I am trying to use Beautiful Soup to scrape housing price data from Zillow.
I get the web page by property id, eg. http://www.zillow.com/homes/for_sale/18429834_zpid/>
Your HTML is non-well-formed and in cases like this, choosing the right parser is crucial. In BeautifulSoup, there are currently 3 available HTML parsers which work and handle broken HTML differently:
html.parser
(built-in, no additional modules needed)lxml
(the fastest, requires lxml
to be installed)html5lib
(the most lenient, requires html5lib
to be installed)The Differences between parsers documentation page describes the differences in more detail. In your case, to demonstrate the difference:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> zpid = "18429834"
>>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
>>> response = requests.get(url)
>>> html = response.content
>>>
>>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
0
>>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
3
>>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
3
As you can see, in your case, both html.parser
and lxml
do the job, but html5lib
does not.