Can't parse a Google search result page using BeautifulSoup

点点圈 提交于 2020-03-23 08:03:52

问题


I'm parsing webpages using BeautifulSoup from bs4 in python. When I inspected the elements of a google search page, this was the division having the 1st result:

Image

and since it had class = 'r' I wrote this code:

import requests
site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%\22scams%22+%\22frauds%22+%\22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%\22scams%22+%\22frauds%22+%\22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5')
from bs4 import BeautifulSoup
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)

But the command prompt returned just []

What could've gone wrong and how to correct it?

Also, Here's the webpage.

EDIT 1: I edited my code accordingly by adding the dictionary for headers, yet the result is the same []. Here's the new code:

import requests
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%22cams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5', headers = headers)
from bs4 import BeautifulSoup
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)

NOTE: When I tell it to print the entire page, there's no problem, or when I take list(page.children) , it works fine.


回答1:


Some website requires User-Agent header to be set to prevent fake request from non-browser. But, fortunately there's a way to pass headers to the request as such

# Define a dictionary of http request headers
headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
} 

# Pass in the headers as a parameterized argument
requests.get(url, headers=headers)

Note: List of user agents can be found here



来源:https://stackoverflow.com/questions/60575036/cant-parse-a-google-search-result-page-using-beautifulsoup

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!