Python - beautifulsoup - how to deal with missing closing tags

深忆病人 2021-01-15 13:25

I would like to scrape the table from HTML code using BeautifulSoup. A snippet of the HTML is shown below. When using table.findAll('tr') I get the entire tab

1 answer
  •  伪装坚强ぢ
    2021-01-15 13:51

    As stated in the BeautifulSoup documentation, html5lib parses the document the way a web browser does (as lxml does in this case). It tries to fix your document tree by adding or closing tags where needed.

    For your example I used lxml as the parser, and the following code worked:

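    from bs4 import BeautifulSoup

    # data holds the HTML string from the question
    soup = BeautifulSoup(data, "lxml")
    table = soup.find_all("table")[0]  # first table in the document
    rows = table.find_all("tr")
    for tr in rows:
        print(tr.get_text(strip=True))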
    

    Note that lxml added html and body tags because they weren't present in the source (it tries to create a well-formed document, as previously stated).
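
To make the repair behaviour concrete, here is a minimal sketch. The `SAMPLE` markup and the helper names `parse_leniently` and `table_rows` are made up for illustration and are not from the question; the sketch assumes at least one of lxml or html5lib is installed. It parses a table whose `td` and `tr` tags are never closed and collects the cell text row by row.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: no </td> or </tr> closing tags, no <html>/<body> wrapper.
SAMPLE = "<table><tr><td>a<td>b<tr><td>c<td>d</table>"


def parse_leniently(html):
    """Parse with the first installed parser that repairs missing closing tags."""
    for parser in ("lxml", "html5lib"):
        try:
            return BeautifulSoup(html, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("install lxml or html5lib to repair the markup")


def table_rows(html):
    """Return the cell texts of the first table, one list per row."""
    soup = parse_leniently(html)
    table = soup.find("table")
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in table.find_all("tr")
    ]


if __name__ == "__main__":
    for row in table_rows(SAMPLE):
        print(row)
```

Both parsers close each open cell and row when the next one starts, so every `tr` ends up holding only its own cells. The parsed tree also gains a `body` element even though the sample has none.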
