BeautifulSoup returns incomplete HTML

Question

I am reading a book about Python right now. It sets a small project as homework: "Write a program that goes to a photo-sharing site like Flickr or Imgur, searches for a category of photos, and then downloads all the resulting images." It suggests using only the webbrowser, requests and bs4 libraries.

I cannot do it for Flickr. I found that the parser cannot see inside the element <div class="interaction-view">. Using "Inspect element" in Chrome, I can see that it contains a few "div" elements and an "a" element. However, when I use the bs4 library, it cannot see them.

My code looks like this:

#!/usr/bin/env python3
# To download photos from Flickr

import requests, bs4

search_name = "spam"
website_name = requests.get('https://www.flickr.com/search/?text='
                       + search_name)
website_name.raise_for_status()
parse_obj = bs4.BeautifulSoup(website_name.text, "html.parser")
elements = parse_obj.select('body #content main .main.search-photos-results \
                .view.photo-list-view.requiredToShowOnServer \
                .view.photo-list-photo-view.requiredToShowOnServer.awake \
                .interaction-view')
print(elements)

It only prints:

[<div class="interaction-view"></div>, <div class="interaction-view"></div>...]

There are no nested elements, and I do not understand why. Thank you!

Answer 1:

The issue is that the content of <div class="interaction-view"></div> on Flickr is loaded only via JavaScript. You can verify this by viewing the page source, where you will find <div class="interaction-view"></div> with nothing inside the div tag.
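The same check can be done from Python instead of the browser's "View source". The following is a minimal sketch, not part of the original answer: RAW_HTML is a made-up stand-in for what requests.get(...).text returns, and count_empty is a hypothetical helper name. It counts how many elements matching a selector have no child tags and no text.

```python
import bs4

# Hypothetical stand-in for the raw HTML returned by requests.get(...).text:
# the container exists, but JavaScript has not filled it in.
RAW_HTML = """
<div class="photo-list-photo-view">
  <div class="interaction-view"></div>
</div>
"""


def count_empty(html, selector):
    """Return (total, empty) counts for elements matching a CSS selector."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    matches = soup.select(selector)
    # "Empty" here means: no child tags and no non-whitespace text.
    empty = [
        el for el in matches
        if el.find(True) is None and not el.get_text(strip=True)
    ]
    return len(matches), len(empty)


total, empty = count_empty(RAW_HTML, ".interaction-view")
print(total, empty)
```

If every match comes back empty for the HTML fetched with requests, while the browser's inspector shows children, the content is most likely being added by JavaScript after the page loads.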

You need to execute the JavaScript somehow. Since BeautifulSoup cannot do this, one solution is to use Selenium: run pip install selenium and install geckodriver for Firefox (on OSX: brew install geckodriver). Then change your code to use Selenium to load the page:

#!/usr/bin/env python3

import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

search_name = "spam"
url = 'https://www.flickr.com/search/?text=%s' % search_name

browser = webdriver.Firefox()
browser.get(url)
delay = 3  # seconds to wait before giving up
# presence_of_element_located expects a (By, value) locator tuple, not an element
WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, '...')))

# Parse the HTML as rendered by the browser, then close the browser
soup = bs4.BeautifulSoup(browser.page_source, "html.parser")
browser.quit()

elements = soup.select('body #content main .main.search-photos-results \
                .view.photo-list-view.requiredToShowOnServer \
                .view.photo-list-photo-view.requiredToShowOnServer.awake \
                .interaction-view')
print(elements)

The WebDriverWait part is needed so that Selenium waits until a certain element has loaded before the page is parsed. You need to change ... to an id that you know will be present. The same kind of wait can also be done with classes instead of an id.

Source: https://stackoverflow.com/questions/41706274/beautifulsoup-returns-incomplete-html

Tags

python

parsing

beautifulsoup

flickr