Question
I'm trying to parse a website and get some info with BeautifulSoup.findAll, but it doesn't find them all. I'm using Python 3.
The code is this:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# print(page.read())
soup = BeautifulSoup(page.read())
manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)
for manga in manga_img:
    print(manga['href'])
It only prints half of them...
Answer 1:
Different HTML parsers deal differently with broken HTML. That page serves broken HTML, and the lxml
parser is not dealing very well with it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18
The standard library html.parser has less trouble with this specific page:
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
Translating that to your specific code sample using urllib, you would specify the parser like this:
soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can do the reading
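As a self-contained sketch of the corrected approach, here is `find_all` with an explicit parser on an inline snippet. The HTML below is made up for illustration; it is not the actual markup served by mangafox.me:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the directory page's markup.
html = """
<ul>
  <li><a class="manga_img" href="/manga/one/">One</a></li>
  <li><a class="manga_img" href="/manga/two/">Two</a></li>
  <li><a href="/other/">Not a manga link</a></li>
</ul>
"""

# Naming the parser explicitly ('html.parser' here) silences bs4's
# "no parser specified" warning and keeps results consistent across
# machines, since the auto-picked parser depends on what is installed.
soup = BeautifulSoup(html, 'html.parser')
links = [a['href'] for a in soup.find_all('a', class_='manga_img')]
print(links)  # ['/manga/one/', '/manga/two/']
```

Note that `find_all('a', class_='manga_img')` is the modern spelling of the question's `findAll('a', {'class': 'manga_img'})`; both work in bs4.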
Source: https://stackoverflow.com/questions/16322862/beautiful-soup-findall-doesnt-find-them-all