Beautiful Soup findAll doesn't find them all

廉价感情. Submitted on 2019-11-26 04:43:17

Question


I'm trying to parse a website and extract some information with BeautifulSoup.findAll, but it doesn't find all of the matching elements. I'm using Python 3.

The code is this:

#!/usr/bin/python3

from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen ("http://mangafox.me/directory/")
# print (page.read ())
soup = BeautifulSoup (page.read ())

manga_img = soup.findAll ('a', {'class' : 'manga_img'}, limit=None)

for manga in manga_img:
    print (manga['href'])

It only prints half of them.


Answer 1:


Different HTML parsers deal differently with broken HTML. That page serves broken HTML, and the lxml parser is not dealing very well with it:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18

The standard library html.parser has less trouble with this specific page:

>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
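To see this kind of disagreement without depending on the live site, here is a minimal sketch that runs the same query through every parser that happens to be installed. The BROKEN_HTML snippet and the count_by_parser helper are made up for illustration and are not taken from that page; the counts you get depend on which parsers and versions you have.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, deliberately malformed snippet: an unclosed <p>,
# and a stray </span> that was never opened.
BROKEN_HTML = """
<div><a class="manga_img" href="/manga/one/">One</a>
<p><a class="manga_img" href="/manga/two/">Two</a>
</div></span>
<a class="manga_img" href="/manga/three/">Three</a>
"""


def count_by_parser(markup, css_class,
                    parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of matching <a> tags} for installed parsers."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; skip it.
            continue
        counts[name] = len(soup.find_all("a", class_=css_class))
    return counts


for name, count in count_by_parser(BROKEN_HTML, "manga_img").items():
    print(f"{name}: {count}")
```

If the numbers differ between parsers on your own page, the markup is probably malformed, and the parser reporting the most matches is usually the one to try first.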

Translating that to your specific code sample using urllib, you would specify the parser thus:

soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can do the reading
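Put together, the whole script from the question might look like the sketch below. The extract_links and main names are my own, not part of Beautiful Soup, and the URL is the one from the question, which may no longer serve the same page.

```python
#!/usr/bin/python3
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_links(markup, css_class="manga_img"):
    """Return the href of every <a> tag that carries css_class."""
    # Naming the parser explicitly keeps the result the same on every
    # machine, whether or not lxml happens to be installed.
    soup = BeautifulSoup(markup, "html.parser")
    return [a["href"]
            for a in soup.find_all("a", class_=css_class)
            if a.has_attr("href")]


def main():
    # URL taken from the question; BeautifulSoup reads the response itself.
    with urlopen("http://mangafox.me/directory/") as page:
        for href in extract_links(page):
            print(href)
```

Calling main() fetches the page and prints each link. extract_links also accepts a plain string or bytes, which makes it easy to try on a saved copy of the page.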


Source: https://stackoverflow.com/questions/16322862/beautiful-soup-findall-doesnt-find-them-all
