scrapy-spider

Scrape multiple accounts, aka multiple logins

匆匆过客 submitted on 2019-12-07 09:42:33
Question: I successfully scrape data for a single account. I want to scrape multiple accounts on a single website; multiple accounts need multiple logins, and I want a way to manage login/logout.

Answer 1: You can scrape multiple accounts in parallel by using one cookiejar per account session; see the "cookiejar" request meta key at http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=cookiejar#std:reqmeta-cookiejar To clarify: suppose we have an array of accounts in settings.py: MY…
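Below is a minimal sketch of the cookiejar approach described in the answer; the account list, login URL, and form field names are placeholder assumptions, not part of the original answer.

    import scrapy

    class MultiLoginSpider(scrapy.Spider):
        name = "multi_login"
        # Placeholder accounts; in practice these might come from settings.py.
        accounts = [
            {"user": "user1", "pass": "secret1"},
            {"user": "user2", "pass": "secret2"},
        ]

        def start_requests(self):
            # One cookiejar per account keeps each login session separate.
            for i, account in enumerate(self.accounts):
                yield scrapy.FormRequest(
                    "http://www.example.com/login",  # hypothetical login URL
                    formdata={"username": account["user"], "password": account["pass"]},
                    meta={"cookiejar": i},
                    callback=self.after_login,
                )

        def after_login(self, response):
            # Reuse the same cookiejar id so follow-up requests stay in this account's session.
            yield scrapy.Request(
                "http://www.example.com/account/data",  # hypothetical page behind the login
                meta={"cookiejar": response.meta["cookiejar"]},
                callback=self.parse_account,
            )

        def parse_account(self, response):
            yield {"account_index": response.meta["cookiejar"], "url": response.url}

With this pattern an explicit logout is usually unnecessary: each cookiejar simply holds its own session cookies until the crawl ends.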

How to pass custom settings through CrawlerProcess in scrapy?

被刻印的时光 ゝ submitted on 2019-12-07 08:53:59
Question: I have two CrawlerProcesses, each calling a different spider. I want to pass custom settings to one of these processes to save the spider's output to CSV. I thought I could do this:

    storage_settings = {'FEED_FORMAT': 'csv', 'FEED_URI': 'foo.csv'}
    process = CrawlerProcess(get_project_settings())
    process.crawl('ABC', crawl_links=main_links, custom_settings=storage_settings)
    process.start()

and in my spider I read them as an argument:

    def __init__(self, crawl_links=None, allowed_domains…
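The excerpt cuts off before the answer, but one way to achieve per-run feed settings (a sketch, not necessarily the accepted answer) is to apply the overrides to the Settings object before creating the CrawlerProcess, rather than passing them to crawl(), since keyword arguments to crawl() become spider constructor arguments instead of settings. 'ABC' and main_links below are taken from the question code.

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    settings.set('FEED_FORMAT', 'csv')   # feed settings applied at the process level
    settings.set('FEED_URI', 'foo.csv')

    process = CrawlerProcess(settings)
    process.crawl('ABC', crawl_links=main_links)  # crawl_links is still a spider argument
    process.start()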

How to collect stats from within scrapy spider callback?

倖福魔咒の submitted on 2019-12-07 00:23:07
Question: How can I collect stats from within a spider callback? Example:

    class MySpider(Spider):
        name = "myspider"
        start_urls = ["http://example.com"]

        def parse(self, response):
            stats.set_value('foo', 'bar')

I am not sure what to import or how to make stats available in general.

Answer 1: Check out the stats page in the Scrapy documentation. The documentation covers the Stats Collector, but it may be necessary to add from scrapy.stats import stats to your spider code to be able to do stuff with it. EDIT:…
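The import suggested in that old answer predates current Scrapy releases. As a minimal sketch of the approach in recent versions (assuming a reasonably modern Scrapy), the running spider already has its crawler attached, so the stats collector is available as self.crawler.stats inside callbacks:

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["http://example.com"]

        def parse(self, response):
            # Set or increment custom stats directly from the callback.
            self.crawler.stats.set_value('foo', 'bar')
            self.crawler.stats.inc_value('pages_seen')
            yield {"url": response.url}

The collected values show up in the crawl stats dumped at the end of the run, and can also be read back with self.crawler.stats.get_stats().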

How to handle connection or download error in Scrapy?

别来无恙 submitted on 2019-12-06 16:26:26
I am using the following to check for (internet) connection errors in my spider.py:

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def handle_error(self, failure):
        if failure.check(DNSLookupError):  # or failure.check(UnknownHostError):
            request = failure.request
            self.logger.error('DNSLookupError on: %s', request.url)
            print("\nDNS Error! Please check your internet connection!\n")
        elif failure.check(HttpError):
            response = failure.value.response
            self.logger.error('HttpError on: %s', response.url)
            print('\nSpider…
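The excerpt above is cut off, so here is a self-contained version of the same errback pattern, following the error-handling example in the Scrapy documentation; the spider name and the timeout branch are additions for completeness, not from the original question.

    import scrapy
    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

    class ErrbackSpider(scrapy.Spider):
        name = "errback_example"
        start_urls = ["http://example.com/"]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

        def parse(self, response):
            self.logger.info('Got successful response from %s', response.url)

        def handle_error(self, failure):
            if failure.check(DNSLookupError):
                request = failure.request
                self.logger.error('DNSLookupError on: %s', request.url)
            elif failure.check(HttpError):
                response = failure.value.response
                self.logger.error('HttpError %s on: %s', response.status, response.url)
            elif failure.check(TimeoutError, TCPTimedOutError):
                request = failure.request
                self.logger.error('TimeoutError on: %s', request.url)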

Scrapy, Celery and multiple spiders

我怕爱的太早我们不能终老 submitted on 2019-12-06 15:41:10
I'm using Scrapy and I'm trying to use Celery to manage multiple spiders on one machine. The problem I have (a bit difficult to explain) is that the spiders get multiplied: if my first spider starts and I then start a second spider, the first spider executes twice. See my code here:

ProcessJob.py

    class ProcessJob():
        def processJob(self, job):
            # update job
            mysql = MysqlConnector.Mysql()
            db = mysql.getConnection()
            cur = db.cursor()
            job.status = 1
            update = "UPDATE job SET status=1 WHERE id=" + str(job.id)
            cur.execute(update)
            db.commit()
            db.close()
            # Start new crawler
            configure_logging()…
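The excerpt ends before the crawler is actually started, but one commonly suggested pattern for this multiplied-spider problem (an assumption here, not necessarily the fix the original poster used) is to run each crawl in its own child process from the Celery task, so Twisted reactor and crawler state can never leak between jobs:

    from multiprocessing import Process

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def _run_spider(spider_name, **spider_kwargs):
        # Runs inside a fresh child process with its own reactor.
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_name, **spider_kwargs)
        process.start()  # blocks until the crawl finishes

    def run_spider_in_subprocess(spider_name, **spider_kwargs):
        p = Process(target=_run_spider, args=(spider_name,), kwargs=spider_kwargs)
        p.start()
        p.join()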

Pyinstaller scrapy error:

旧城冷巷雨未停 submitted on 2019-12-06 13:57:41
Question: After installing all dependencies for Scrapy on Windows 32-bit, I tried to build an executable from my Scrapy spider. The spider script "runspider.py" works fine when run as "python runspider.py". Building the executable with "pyinstaller --onefile runspider.py":

    C:\Users\username\Documents\scrapyexe>pyinstaller --onefile runspider.py
    19 INFO: wrote C:\Users\username\Documents\scrapyexe\runspider.spec
    49 INFO: Testing for ability to set icons, version resources...
    59 INFO: ... resource update available…

Scrapy store returned items in variables to use in main script

筅森魡賤 submitted on 2019-12-06 13:38:14
I am quite new to Scrapy and want to try the following: extract some values from a web page, store them in a variable, and use them in my main script. I therefore followed the tutorial and changed the code for my purposes:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/'
        ]
        custom_settings = {
            'LOG_ENABLED': 'False',
        }

        def parse(self, response):
            global title  # This would work, but there should be a better way
            title = response.css('title::text').extract_first()

    process = CrawlerProcess({
        'USER…
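A commonly suggested alternative to the global-variable workaround (a sketch, not the answer from the original thread, which is cut off above) is to collect the scraped items through the item_scraped signal and read them in the main script once the crawl finishes:

    import scrapy
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://quotes.toscrape.com/page/1/']
        custom_settings = {'LOG_ENABLED': False}

        def parse(self, response):
            yield {'title': response.css('title::text').extract_first()}

    results = []

    def collect_item(item, response, spider):
        results.append(item)

    process = CrawlerProcess()
    crawler = process.create_crawler(QuotesSpider)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()

    print(results[0]['title'])  # usable in the main script after the crawl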

How to scrape a website with Sucuri protection

女生的网名这么多〃 submitted on 2019-12-06 12:23:31
Question: Following the Scrapy documentation, I want to crawl and scrape data from several sites. My code works correctly with ordinary websites, but when I want to crawl a website protected by Sucuri I don't get any data; it seems the Sucuri firewall prevents me from accessing the website's markup. The target website is http://www.dwarozh.net/ and this is my spider snippet:

    from scrapy import Spider
    from scrapy.selector import Selector
    import scrapy
    from Stack.items import StackItem
    from bs4 import BeautifulSoup
    from scrapy import…

ModuleNotFoundError: No module named 'Scrapy'

旧巷老猫 submitted on 2019-12-06 11:54:59
    import Scrapy

    class NgaSpider(Scrapy.Spider):
        name = "NgaSpider"
        host = "http://bbs.ngacn.cc/"
        start_urls = [
            "http://bbs.ngacn.cc/thread.php?fid=406",
        ]

        def parse(self, response):
            print("response.body")

Error:

    ModuleNotFoundError: No module named 'Scrapy'

What is going on, and how do I fix this issue?

You are importing the scrapy module incorrectly; the module name is lowercase. You have to make the following changes:

    import scrapy  # Change here

    class NgaSpider(scrapy.Spider):  # Change here too
        name = "NgaSpider"
        host = "http://bbs.ngacn.cc/"
        start_urls = [
            "http://bbs.ngacn.cc/thread…
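For completeness, here is the corrected spider reassembled from the question code with the lowercase scrapy references the answer recommends (the original print statement outputs the literal string "response.body"; printing the actual body is presumably what was intended):

    import scrapy

    class NgaSpider(scrapy.Spider):
        name = "NgaSpider"
        host = "http://bbs.ngacn.cc/"
        start_urls = [
            "http://bbs.ngacn.cc/thread.php?fid=406",
        ]

        def parse(self, response):
            print(response.body)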

Parsing the URLs in a sitemap with a different URL format using the sitemap spider in Scrapy (Python)

隐身守侯 submitted on 2019-12-06 11:11:35
I am using the sitemap spider in Scrapy (Python). The sitemap seems to have an unusual format, with '//' in front of the URLs:

    <url>
        <loc>//www.example.com/10/20-baby-names</loc>
    </url>
    <url>
        <loc>//www.example.com/elizabeth/christmas</loc>
    </url>

myspider.py

    from scrapy.contrib.spiders import SitemapSpider
    from myspider.items import *

    class MySpider(SitemapSpider):
        name = "myspider"
        sitemap_urls = ["http://www.example.com/robots.txt"]

        def parse(self, response):
            item = PostItem()
            item['url'] = response.url
            item['title'] = response.xpath('//title/text()').extract()
            return item

I am getting this error: raise…
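The error message itself is cut off, but the protocol-relative <loc> values are the likely culprit. One way to handle them (a sketch assuming Scrapy 1.6 or later, which added sitemap_filter; not necessarily the answer from the original thread) is to rewrite each entry's loc to an absolute URL before the SitemapSpider schedules it:

    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = "myspider"
        sitemap_urls = ["http://www.example.com/robots.txt"]

        def sitemap_filter(self, entries):
            # Prepend a scheme to protocol-relative URLs like //www.example.com/...
            for entry in entries:
                if entry['loc'].startswith('//'):
                    entry['loc'] = 'http:' + entry['loc']
                yield entry

        def parse(self, response):
            yield {
                'url': response.url,
                'title': response.xpath('//title/text()').extract(),
            }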