Question
I would like to crawl/check multiple pages (on the same domain) for a specific keyword. I have found this script, but I can't work out how to add the specific keyword to be searched for. What the script needs to do is find the keyword and report which link it was found in. Could anyone point me to where I could read more about this? I have been reading Scrapy's documentation, but I can't seem to find it there.
Thank you.
import scrapy
from scrapy import Request

# URL, starting_number and number_of_pages are assumed to be
# defined elsewhere in the original script.

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self):
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # **parsing data from the webpage**
        pass
Answer 1:
You'll need to use a parser or a regular expression to find the text you are looking for inside the response body.
Every Scrapy callback method receives the page content inside the response object, which you can inspect with response.body (for example inside the parse method). From there you'll have to use a regex, or better, XPath or CSS selectors, to navigate to your text using the HTML structure of the page you crawled.
Scrapy lets you use the response object as a Selector, so for example you can get the title of the page with response.xpath('//head/title/text()').
Hope it helped.
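To make this concrete, here is a minimal sketch of the keyword check itself, kept independent of Scrapy so it works on any response body string. The function name page_contains_keyword and the commented parse integration are illustrative assumptions, not part of the original script; the keyword itself could be supplied at the command line with scrapy crawl final -a keyword=..., since Scrapy turns -a arguments into spider attributes.

```python
import re

def page_contains_keyword(body: str, keyword: str) -> bool:
    # Case-insensitive search for the literal keyword anywhere
    # in the HTML body; re.escape treats it as plain text, not a pattern.
    return re.search(re.escape(keyword), body, re.IGNORECASE) is not None

# Inside the spider it might be wired up like this (hypothetical names):
#
#     def parse(self, response):
#         if page_contains_keyword(response.text, self.keyword):
#             # yield the link the keyword was found in
#             yield {"url": response.url, "keyword": self.keyword}
```

Yielding a dict from parse lets Scrapy's feed exports (-o results.json) collect the matching links for you.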
Source: https://stackoverflow.com/questions/33989925/using-scrapy-to-find-specific-text-from-multiple-websites