Scrape Google Search Result Description Using BeautifulSoup

Question

I want to scrape the descriptions of Google Search results using BeautifulSoup, but I am not able to select the tag that contains the description.

Ancestors:

html
body#gsr.srp.vasq.wf-b
div#main
div#cnt.big
div.mw
div#rcnt
div.col
div#center_col
div#res.med
div#search
div
div#rso
div.g
div.rc
div.IsZvec
div
span.aCOpRe

Children:

em

Python Code:

from bs4 import BeautifulSoup
import requests
import re

search = input("Enter the search term: ")
param = {"q": search}

# Let requests build the query string; the URL itself must not already end in "?q=".
r = requests.get("https://google.com/search", params=param)

soup = BeautifulSoup(r.content, "lxml")

# Result titles
titles = soup.find_all("div", class_="BNeawe vvjwJb AP7Wnd")

for t in titles:
    print(t.get_text())

# Result descriptions (the lookup that fails to find the tag)
descriptions = soup.find_all("span", class_="aCOpRe")

for d in descriptions:
    print(d.get_text())

print("\n")

# Result links: hrefs of the form /url?q=<target>
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

Answer 1:

You might want to try a CSS selector and then pull the text out of the matched elements.

For example:

import requests
from bs4 import BeautifulSoup


page = requests.get("https://www.google.com/search?q=scrap").text
# select() returns a list of matching tags, so name it accordingly
descriptions = BeautifulSoup(page, "html.parser").select(".s3v9rd.AP7Wnd")

for item in descriptions:
    print(item.get_text(strip=True))

Sample output for scrap:

discard or remove from service (a redundant, old, or inoperative vehicle, vessel, or machine), especially so as to convert it to scrap metal.
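To illustrate how a compound class selector such as .s3v9rd.AP7Wnd behaves, here is a minimal offline sketch. The HTML string is hand-written and hypothetical; it only imitates the class names used above and is not real Google markup, which can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates the class names above;
# real Google markup differs and changes over time.
html = """
<div class="s3v9rd AP7Wnd">  First description.  </div>
<div class="s3v9rd">Only one of the two classes.</div>
<div class="AP7Wnd s3v9rd">Second description.</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ".s3v9rd.AP7Wnd" (no space) matches elements carrying BOTH classes, in any order.
descriptions = [tag.get_text(strip=True) for tag in soup.select(".s3v9rd.AP7Wnd")]
print(descriptions)
```

The element with only one of the two classes is skipped, and strip=True trims the surrounding whitespace from each match.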

Answer 2:

The proper CSS selector for snippets (descriptions) of Google Search results is .aCOpRe span:not(.f).
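As a rough illustration of what the :not(.f) part does, the sketch below runs the selector against a small hand-written snippet. The markup is hypothetical, and the idea that .f wraps a leading fragment such as a date is an assumption made for this example, not something confirmed by Google.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described above; not real Google output.
# Assumption: the span with class "f" holds a prefix (here a date) that we want to skip.
html = """
<span class="aCOpRe">
  <span class="f">Nov 17, 2020 - </span>
  <span>Coffee is a brewed drink.</span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant spans of .aCOpRe, excluding any span that carries the class "f".
snippets = [s.get_text() for s in soup.select(".aCOpRe span:not(.f)")]
print(snippets)
```

Only the inner span without the f class is returned; the outer .aCOpRe span is not a descendant of itself, so it is not matched either.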

Here's a full example:

from bs4 import BeautifulSoup
import requests
import re

param = {"q": "coffee"}
# Browser-like User-Agent header sent with the request
headers = {
    "User-Agent":
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15"
}

r = requests.get("https://google.com/search", params=param, headers=headers)

soup = BeautifulSoup(r.content, "lxml")

# Result titles
titles = soup.select(".DKV0Md span")

for t in titles:
    print(f"Title: {t.get_text()}\n")

# Result snippets (descriptions)
snippets = soup.select(".aCOpRe span:not(.f)")

for d in snippets:
    print(f"Snippet: {d.get_text()}\n")

# Result links: hrefs of the form /url?q=<target>
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

Output

Title: Coffee - Wikipedia

Title: Coffee: Benefits, nutrition, and risks - Medical News Today

...

Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red in color – indicating ripeness – they are picked, processed, and dried.

Snippet: When people think of coffee, they usually think of its ability to provide an energy boost. ... This article looks at the health benefits of drinking coffee, the evidence ...

...
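The link loop above relies on a regular expression to pick the destination out of hrefs of the form /url?q=<target>. As a possible alternative, the sketch below parses the same format with the standard library's urllib.parse. The helper name extract_target and the sample hrefs are hypothetical, and it assumes the hrefs really do follow that redirect format.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the destination URL from a '/url?q=...' redirect href, or None."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None  # not a redirect link
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None


# Hypothetical hrefs in the redirect format that the regex above targets.
print(extract_target("/url?q=https://en.wikipedia.org/wiki/Coffee&sa=U&ved=abc"))
print(extract_target("/search?q=coffee"))
```

Parsing the query string this way also drops the extra parameters that follow the target, which the regex version leaves attached.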

Alternatively, you can extract data from Google Search via SerpApi.

curl example

curl -s 'https://serpapi.com/search?q=coffee&location=Sweden&google_domain=google.se&gl=se&hl=sv&num=100'

Python example

from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "coffee",
    "location": "Sweden",
    "google_domain": "google.se",
    "gl": "se",
    "hl": "sv",
    "num": 100,
    "api_key": os.getenv("API_KEY")
}

client = GoogleSearch(params)
data = client.get_dict()

print("Organic results")

for result in data['organic_results']:
    print(f"""
Title: {result['title']}
Link: {result['link']}
Position: {result['position']}
Snippet: {result['snippet']}
""")

Output

Organic results

Title: Coffee - Wikipedia
Link: https://en.wikipedia.org/wiki/Coffee
Position: 1
Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red ...


Title: Drop Coffee
Link: https://www.dropcoffee.com/
Position: 2
Snippet: Drop Coffee is an award winning roastery in Stockholm, representing Sweden four times in the World Coffee Roasting Championship, placing second, third and ...

...
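The loop above indexes every key directly, so it would raise a KeyError if a result lacked one of them. Whether that happens in practice is an assumption here; the sketch below only shows one defensive way to read the fields. It uses a hand-written dictionary shaped like the output above and makes no API call; format_result is a hypothetical helper.

```python
# Hypothetical response fragment shaped like the output above; no API call is made.
data = {
    "organic_results": [
        {
            "position": 1,
            "title": "Coffee - Wikipedia",
            "link": "https://en.wikipedia.org/wiki/Coffee",
            "snippet": "Coffee is a brewed drink ...",
        },
        {
            "position": 2,
            "title": "Drop Coffee",
            "link": "https://www.dropcoffee.com/",
            # snippet deliberately left out to exercise the fallback
        },
    ]
}


def format_result(result):
    """Format one organic result, tolerating a missing snippet."""
    snippet = result.get("snippet", "(no snippet)")
    return f"{result.get('position')}. {result.get('title')} | {result.get('link')} | {snippet}"


lines = [format_result(r) for r in data.get("organic_results", [])]
for line in lines:
    print(line)
```

Using dict.get with a default keeps the loop running when a field is absent, at the cost of printing a placeholder instead of failing loudly.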

Disclaimer: I work at SerpApi.

Source: https://stackoverflow.com/questions/64880683/scrape-google-search-result-description-using-beautifulsoup

Tags

python

beautifulsoup

google-search