scrapy-spider

Scrape multiple accounts, aka multiple logins

匆匆过客 submitted on 2019-12-07 09:42:33
Question: I successfully scrape data for a single account. I want to scrape multiple accounts on a single website; multiple accounts need multiple logins, and I want a way to manage login/logout.

Answer 1: You can scrape multiple accounts in parallel by using one cookiejar per account session; see the "cookiejar" request meta key at http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=cookiejar#std:reqmeta-cookiejar To clarify: suppose we have an array of accounts in settings.py: MY…
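Below is a minimal sketch of the cookiejar approach described in the answer; the account list, login URL, and form field names are placeholder assumptions, not part of the original answer.

    import scrapy

    class MultiLoginSpider(scrapy.Spider):
        name = "multi_login"
        # Placeholder accounts; in practice these might come from settings.py.
        accounts = [
            {"user": "user1", "pass": "secret1"},
            {"user": "user2", "pass": "secret2"},
        ]

        def start_requests(self):
            # One cookiejar per account keeps each login session separate.
            for i, account in enumerate(self.accounts):
                yield scrapy.FormRequest(
                    "http://www.example.com/login",  # hypothetical login URL
                    formdata={"username": account["user"], "password": account["pass"]},
                    meta={"cookiejar": i},
                    callback=self.after_login,
                )

        def after_login(self, response):
            # Reuse the same cookiejar id so follow-up requests stay in this account's session.
            yield scrapy.Request(
                "http://www.example.com/account/data",  # hypothetical page behind the login
                meta={"cookiejar": response.meta["cookiejar"]},
                callback=self.parse_account,
            )

        def parse_account(self, response):
            yield {"account_index": response.meta["cookiejar"], "url": response.url}

With this pattern an explicit logout is usually unnecessary: each cookiejar simply holds its own session cookies until the crawl ends.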

How to pass custom settings through CrawlerProcess in scrapy?

被刻印的时光 ゝ submitted on 2019-12-07 08:53:59
Question: I have two CrawlerProcesses, each calling a different spider. I want to pass custom settings to one of these processes to save the spider's output to CSV. I thought I could do this:

    storage_settings = {'FEED_FORMAT': 'csv', 'FEED_URI': 'foo.csv'}
    process = CrawlerProcess(get_project_settings())
    process.crawl('ABC', crawl_links=main_links, custom_settings=storage_settings)
    process.start()

and in my spider I read them as an argument:

    def __init__(self, crawl_links=None, allowed_domains…
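The excerpt cuts off before the answer, but one way to achieve per-run feed settings (a sketch, not necessarily the accepted answer) is to apply the overrides to the Settings object before creating the CrawlerProcess, rather than passing them to crawl(), since keyword arguments to crawl() become spider constructor arguments instead of settings. 'ABC' and main_links below are taken from the question code.

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    settings.set('FEED_FORMAT', 'csv')   # feed settings applied at the process level
    settings.set('FEED_URI', 'foo.csv')

    process = CrawlerProcess(settings)
    process.crawl('ABC', crawl_links=main_links)  # crawl_links is still a spider argument
    process.start()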

How to collect stats from within scrapy spider callback?

倖福魔咒の submitted on 2019-12-07 00:23:07
Question: How can I collect stats from within a spider callback? Example:

    class MySpider(Spider):
        name = "myspider"
        start_urls = ["http://example.com"]

        def parse(self, response):
            stats.set_value('foo', 'bar')

I am not sure what to import or how to make stats available in general.

Answer 1: Check out the stats page in the Scrapy documentation. The documentation covers the Stats Collector, but it may be necessary to add from scrapy.stats import stats to your spider code to be able to do stuff with it. EDIT:…
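The import suggested in that old answer predates current Scrapy releases. As a minimal sketch of the approach in recent versions (assuming a reasonably modern Scrapy), the running spider already has its crawler attached, so the stats collector is available as self.crawler.stats inside callbacks:

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["http://example.com"]

        def parse(self, response):
            # Set or increment custom stats directly from the callback.
            self.crawler.stats.set_value('foo', 'bar')
            self.crawler.stats.inc_value('pages_seen')
            yield {"url": response.url}

The collected values show up in the crawl stats dumped at the end of the run, and can also be read back with self.crawler.stats.get_stats().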

How to handle connection or download error in Scrapy?

别来无恙 submitted on 2019-12-06 16:26:26
I am using the following to check for (internet) connection errors in my spider.py:

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def handle_error(self, failure):
        if failure.check(DNSLookupError):  # or failure.check(UnknownHostError):
            request = failure.request
            self.logger.error('DNSLookupError on: %s', request.url)
            print("\nDNS Error! Please check your internet connection!\n")
        elif failure.check(HttpError):
            response = failure.value.response
            self.logger.error('HttpError on: %s', response.url)
            print('\nSpider…
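The excerpt above is cut off, so here is a self-contained version of the same errback pattern, following the error-handling example in the Scrapy documentation; the spider name and the timeout branch are additions for completeness, not from the original question.

    import scrapy
    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

    class ErrbackSpider(scrapy.Spider):
        name = "errback_example"
        start_urls = ["http://example.com/"]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

        def parse(self, response):
            self.logger.info('Got successful response from %s', response.url)

        def handle_error(self, failure):
            if failure.check(DNSLookupError):
                request = failure.request
                self.logger.error('DNSLookupError on: %s', request.url)
            elif failure.check(HttpError):
                response = failure.value.response
                self.logger.error('HttpError %s on: %s', response.status, response.url)
            elif failure.check(TimeoutError, TCPTimedOutError):
                request = failure.request
                self.logger.error('TimeoutError on: %s', request.url)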

Scrapy, Celery and multiple spiders

我怕爱的太早我们不能终老 submitted on 2019-12-06 15:41:10
I'm using Scrapy and I'm trying to use Celery to manage multiple spiders on one machine. The problem I have (a bit difficult to explain) is that the spiders get multiplied: if my first spider starts and I then start a second spider, the first spider executes twice. See my code here:

ProcessJob.py

    class ProcessJob():
        def processJob(self, job):
            # update job
            mysql = MysqlConnector.Mysql()
            db = mysql.getConnection()
            cur = db.cursor()
            job.status = 1
            update = "UPDATE job SET status=1 WHERE id=" + str(job.id)
            cur.execute(update)
            db.commit()
            db.close()
            # Start new crawler
            configure_logging()…
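The excerpt ends before the crawler is actually started, but one commonly suggested pattern for this multiplied-spider problem (an assumption here, not necessarily the fix the original poster used) is to run each crawl in its own child process from the Celery task, so Twisted reactor and crawler state can never leak between jobs:

    from multiprocessing import Process

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def _run_spider(spider_name, **spider_kwargs):
        # Runs inside a fresh child process with its own reactor.
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_name, **spider_kwargs)
        process.start()  # blocks until the crawl finishes

    def run_spider_in_subprocess(spider_name, **spider_kwargs):
        p = Process(target=_run_spider, args=(spider_name,), kwargs=spider_kwargs)
        p.start()
        p.join()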

Pyinstaller scrapy error:

旧城冷巷雨未停 submitted on 2019-12-06 13:57:41
Question: After installing all dependencies for Scrapy on Windows 32-bit, I tried to build an executable from my Scrapy spider. The spider script "runspider.py" works fine when run as "python runspider.py". Building the executable with "pyinstaller --onefile runspider.py":

    C:\Users\username\Documents\scrapyexe>pyinstaller --onefile runspider.py
    19 INFO: wrote C:\Users\username\Documents\scrapyexe\runspider.spec
    49 INFO: Testing for ability to set icons, version resources...
    59 INFO: ... resource update available…

Scrapy store returned items in variables to use in main script

筅森魡賤 submitted on 2019-12-06 13:38:14
I am quite new to Scrapy and want to try the following: extract some values from a web page, store them in a variable, and use them in my main script. I therefore followed the tutorial and changed the code for my purposes:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/'
        ]
        custom_settings = {
            'LOG_ENABLED': 'False',
        }

        def parse(self, response):
            global title  # This would work, but there should be a better way
            title = response.css('title::text').extract_first()

    process = CrawlerProcess({
        'USER…
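A commonly suggested alternative to the global-variable workaround (a sketch, not the answer from the original thread, which is cut off above) is to collect the scraped items through the item_scraped signal and read them in the main script once the crawl finishes:

    import scrapy
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://quotes.toscrape.com/page/1/']
        custom_settings = {'LOG_ENABLED': False}

        def parse(self, response):
            yield {'title': response.css('title::text').extract_first()}

    results = []

    def collect_item(item, response, spider):
        results.append(item)

    process = CrawlerProcess()
    crawler = process.create_crawler(QuotesSpider)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()

    print(results[0]['title'])  # usable in the main script after the crawl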

How to scrape a website with Sucuri protection

女生的网名这么多〃 submitted on 2019-12-06 12:23:31
Question: Following the Scrapy documentation, I want to crawl and scrape data from several sites. My code works correctly with ordinary websites, but when I want to crawl a website protected by Sucuri I don't get any data; it seems the Sucuri firewall prevents me from accessing the website's markup. The target website is http://www.dwarozh.net/ and this is my spider snippet:

    from scrapy import Spider
    from scrapy.selector import Selector
    import scrapy
    from Stack.items import StackItem
    from bs4 import BeautifulSoup
    from scrapy import…

ModuleNotFoundError: No module named 'Scrapy'

旧巷老猫 submitted on 2019-12-06 11:54:59
    import Scrapy

    class NgaSpider(Scrapy.Spider):
        name = "NgaSpider"
        host = "http://bbs.ngacn.cc/"
        start_urls = [
            "http://bbs.ngacn.cc/thread.php?fid=406",
        ]

        def parse(self, response):
            print("response.body")

Error:

    ModuleNotFoundError: No module named 'Scrapy'

What is going on, and how do I fix this issue?

You are importing the scrapy module incorrectly; the module name is lowercase. You have to make the following changes:

    import scrapy  # Change here

    class NgaSpider(scrapy.Spider):  # Change here too
        name = "NgaSpider"
        host = "http://bbs.ngacn.cc/"
        start_urls = [
            "http://bbs.ngacn.cc/thread…
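For completeness, here is the corrected spider reassembled from the question code with the lowercase scrapy references the answer recommends (the original print statement outputs the literal string "response.body"; printing the actual body is presumably what was intended):

    import scrapy

    class NgaSpider(scrapy.Spider):
        name = "NgaSpider"
        host = "http://bbs.ngacn.cc/"
        start_urls = [
            "http://bbs.ngacn.cc/thread.php?fid=406",
        ]

        def parse(self, response):
            print(response.body)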

Parsing the URLs in a sitemap with a different URL format using the sitemap spider in Scrapy (Python)

隐身守侯 submitted on 2019-12-06 11:11:35
I am using the sitemap spider in Scrapy (Python). The sitemap seems to have an unusual format, with '//' in front of the URLs:

    <url>
        <loc>//www.example.com/10/20-baby-names</loc>
    </url>
    <url>
        <loc>//www.example.com/elizabeth/christmas</loc>
    </url>

myspider.py

    from scrapy.contrib.spiders import SitemapSpider
    from myspider.items import *

    class MySpider(SitemapSpider):
        name = "myspider"
        sitemap_urls = ["http://www.example.com/robots.txt"]

        def parse(self, response):
            item = PostItem()
            item['url'] = response.url
            item['title'] = response.xpath('//title/text()').extract()
            return item

I am getting this error: raise…
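The error message itself is cut off, but the protocol-relative <loc> values are the likely culprit. One way to handle them (a sketch assuming Scrapy 1.6 or later, which added sitemap_filter; not necessarily the answer from the original thread) is to rewrite each entry's loc to an absolute URL before the SitemapSpider schedules it:

    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = "myspider"
        sitemap_urls = ["http://www.example.com/robots.txt"]

        def sitemap_filter(self, entries):
            # Prepend a scheme to protocol-relative URLs like //www.example.com/...
            for entry in entries:
                if entry['loc'].startswith('//'):
                    entry['loc'] = 'http:' + entry['loc']
                yield entry

        def parse(self, response):
            yield {
                'url': response.url,
                'title': response.xpath('//title/text()').extract(),
            }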