Using Beautiful Soup to find a specific class


I am trying to use Beautiful Soup to scrape housing price data from Zillow.

I fetch the web page by property ID, e.g. http://www.zillow.com/homes/for_sale/18429834_zpid/

3 answers
  •  夕颜 (original poster)
    2021-02-02 03:29

    Your HTML is not well-formed, and in cases like this, choosing the right parser is crucial. Beautiful Soup currently supports three HTML parsers, each of which handles broken HTML differently:

    • html.parser (built-in, no additional modules needed)
    • lxml (the fastest, requires lxml to be installed)
    • html5lib (the most lenient, requires html5lib to be installed)

    The "Differences between parsers" section of the Beautiful Soup documentation describes these differences in more detail. To demonstrate the difference in your case:

    >>> from bs4 import BeautifulSoup
    >>> import requests
    >>> 
    >>> zpid = "18429834"
    >>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
    >>> response = requests.get(url)
    >>> html = response.content
    >>> 
    >>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
    0
    >>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
    3
    >>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
    3
    

    As you can see, in your case both html.parser and lxml find the three matching div elements, but html5lib finds none.
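
To run the same comparison without hitting the live site, here is a minimal sketch of a reusable check. The helper name `count_divs_by_class` and the sample markup are made up for illustration, and the sample is well-formed, so the installed parsers should agree on it; the point is only to show how to try each parser in turn and skip the ones that are not installed. It uses the `class_` keyword, which is equivalent to the `attrs={"class": ...}` form used above.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The three parser names accepted by Beautiful Soup.
PARSERS = ("html.parser", "lxml", "html5lib")


def count_divs_by_class(html, class_name, parsers=PARSERS):
    """Return {parser_name: match_count}; the count is None if the parser is not installed."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            counts[parser] = None
            continue
        # class_ matches any one of an element's space-separated classes.
        counts[parser] = len(soup.find_all("div", class_=class_name))
    return counts


# Hypothetical sample markup, not taken from the real page.
sample_html = """
<div class="home-summary-row">row 1</div>
<div class="home-summary-row highlighted">row 2</div>
<div class="other-row">ignored</div>
"""

results = count_divs_by_class(sample_html, "home-summary-row")
for parser, count in results.items():
    print(parser, "not installed" if count is None else count)
```

Passing `response.content` from the session above instead of `sample_html` would show which parsers disagree on the real page.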
