Question
I am trying to scrape Google search results, specifically the "People also search for" links.
For example, when you search Google for Christopher Nolan, Google also shows a "People also search for" panel with images of people related to the search. In this case the panel lists Christian Bale, Emma Thomas, Zack Snyder, etc. I am interested in scraping this data.
I am using the Scrapy framework and wrote a simple scraper, but it returns an empty CSV file. Below is the code I have so far; your help is appreciated. I hope it is clear what I want to achieve. I used XPath Helper (a Chrome extension) to help find the XPath.
My code:
# PyGSSpider (spiders folder)
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from PyGoogleSearch.items import PyGSItem
import sys

class PyGSSpider(CrawlSpider):
    name = "google"
    allowed_domains = ["www.google.com"]
    start_urls = ["https://www.google.com/#q=christopher+nolan"]

    #Extracts Christopher Nolan link
    rules = [
        Rule(SgmlLinkExtractor(allow=("https://www.google.com/search?q=christpher+noaln&oq=christpher+noaln&aqs")), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    ]

    #Parse function for extracting the people also search link.
    def parse_item(self,response):
        self.log('Hi, this is an item page! %s' % response.url)
        sel=Selector(response)
        item=PyGSItem()
        item['peoplealsosearchfor'] = sel.xpath('//div[@id="cnt"]/@href').extract()
        return item
items.py:
from scrapy.item import Item, Field

class PyGSItem(Item):
    peoplealsosearchfor = Field()
Answer 1:
The reason this won't work is that Google enforces anti-bot measures that prevent automated clients from using its search.
However, using Selenium, which drives a real browser, might do the trick.
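A minimal sketch of the Selenium route, under stated assumptions: the helper names (build_search_url, extract_links, fetch_rendered_html) are made up for illustration, Chrome and the selenium package are assumed to be installed, and the code simply collects every link on the rendered page because the markup of the "People also search for" panel is not known here and changes over time. It also puts the query in the query string rather than after "#", since a URL fragment is never sent to the server. Google may still answer with a CAPTCHA or block the request, so treat this as a starting point rather than a guaranteed fix.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_search_url(query):
    """Build a results URL with the query in the query string, not the #fragment."""
    return "https://www.google.com/search?" + urlencode({"q": query})


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all anchor hrefs found in html, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the DOM after JavaScript has run."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if the page load fails.
        driver.quit()


# Example usage (needs Selenium, Chrome, and network access):
# html = fetch_rendered_html(build_search_url("christopher nolan"))
# for link in extract_links(html):
#     print(link)
```

Narrowing the collected links down to the "People also search for" entries would still need a selector found by inspecting the rendered page, which is left out here on purpose.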
Source: https://stackoverflow.com/questions/23840059/scrapy-google-search