问题
I am trying to scrape with soup and am obtaining an empty set when I call findAll
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url='https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151&krypto=70KutR16JmLgr7Ka%2F385RFXrzDpOkSqx%2FRC3DnlU09%2BYcw0pR5cfIfC0kOlQywiD%2BTEe7ppq8ENXglbpqA8sDUtif1h3ZjrEoQkV29%2B90iqljHi2gm2T%2BDZHH2%2FCNeKB%2BkVglbz%2BNx1bKsSfE5L6SVtckHxg%2FM%2F%2FVieWp8vgaJTan0k1WrPjCrVuDs5WnbRN#langId=44&storeId=10151&catalogId=10123&categoryId=&parent_category_rn=&top_category=&pageSize=60&orderBy=RELEVANCE&searchTerm=milk&beginIndex=0&hideFilters=true&categoryFacetId1='
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,'html.parser')
containers = page_soup.findAll("div",{"class":"product"})
containers
I also got empty datasets from these articles: findAll returning empty for html
and BeautifulSoup find_all() returns no data
Can anyone offer any help?
回答1:
The page content is loaded with javascript, so you can't just use BeautifulSoup
to parse it. You have to use another module like selenium
to simulate javacript execution.
Here is an exemple:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
url='https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151&krypto=70KutR16JmLgr7Ka%2F385RFXrzDpOkSqx%2FRC3DnlU09%2BYcw0pR5cfIfC0kOlQywiD%2BTEe7ppq8ENXglbpqA8sDUtif1h3ZjrEoQkV29%2B90iqljHi2gm2T%2BDZHH2%2FCNeKB%2BkVglbz%2BNx1bKsSfE5L6SVtckHxg%2FM%2F%2FVieWp8vgaJTan0k1WrPjCrVuDs5WnbRN#langId=44&storeId=10151&catalogId=10123&categoryId=&parent_category_rn=&top_category=&pageSize=60&orderBy=RELEVANCE&searchTerm=milk&beginIndex=0&hideFilters=true&categoryFacetId1='
driver = webdriver.Firefox()
driver.get(url)
page = driver.page_source
page_soup = soup(page,'html.parser')
containers = page_soup.findAll("div",{"class":"product"})
print(containers)
print(len(containers))
OUTPUT:
[
<div class="product "> ...
...,
<div class="product hl-product hookLogic highlighted straplineRow" ...
]
64
来源:https://stackoverflow.com/questions/55387710/soup-findall-returning-empty-list