scrapy-spider

Scrapy, Celery and multiple spiders

∥☆過路亽.° submitted on 2019-12-23 02:55:09
Question: I'm using Scrapy and I'm trying to use Celery to manage multiple spiders on one machine. The problem I have (a bit difficult to explain) is that the spiders get multiplied: if my first spider starts and I then start a second spider, the first spider executes twice. See my code here:

ProcessJob.py

    class ProcessJob():
        def processJob(self, job):
            # update job
            mysql = MysqlConnector.Mysql()
            db = mysql.getConnection()
            cur = db.cursor()
            job.status = 1
            update = "UPDATE job SET status=1 WHERE id=
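A minimal sketch of one common workaround for this kind of spider duplication (an assumption, not the poster's actual fix): run each Scrapy crawl in its own child process from the Celery task, so every job gets a fresh Twisted reactor and spiders from earlier jobs cannot be triggered again. The spider below is a placeholder, not code from the question.

    from multiprocessing import Process

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Placeholder spider used only to illustrate the pattern.
        name = "my_spider"
        start_urls = ["http://example.com"]

        def parse(self, response):
            yield {"url": response.url}

    def _run_spider():
        # Runs entirely inside the child process, with its own reactor.
        process = CrawlerProcess()
        process.crawl(MySpider)
        process.start()  # blocks until the crawl finishes

    def run_spider_in_subprocess():
        # Call this from the Celery task instead of starting the crawl directly.
        p = Process(target=_run_spider)
        p.start()
        p.join()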

Scrapy getting data from links within table

喜欢而已 submitted on 2019-12-23 02:46:13
Question: I am trying to scrape data from an HTML table, Texas Death Row. I am able to pull the existing data from the table using the spider script below:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from texasdeath.items import DeathItem

    class DeathSpider(BaseSpider):
        name = "death"
        allowed_domains = ["tdcj.state.tx.us"]
        start_urls = [
            "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
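A sketch of the usual pattern for pulling data from the links inside a table rather than only the table cells (the selectors are illustrative assumptions about the page, not the asker's final code): keep the cell values from each row and pass them along in request.meta while following the row's link.

    import scrapy

    class DeathRowSpider(scrapy.Spider):
        name = "death_row_sketch"
        allowed_domains = ["tdcj.state.tx.us"]
        start_urls = ["https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"]

        def parse(self, response):
            for row in response.xpath("//table//tr[td]"):
                item = {
                    "last_name": row.xpath("td[4]/text()").get(),
                    "first_name": row.xpath("td[5]/text()").get(),
                }
                # Follow the row's detail link and carry the partial item along.
                detail_href = row.xpath("td[3]/a/@href").get()
                if detail_href:
                    yield response.follow(detail_href, callback=self.parse_detail,
                                          meta={"item": item})

        def parse_detail(self, response):
            item = response.meta["item"]
            # Add whatever is needed from the linked page to the row's item.
            item["statement"] = " ".join(response.xpath("//p/text()").getall()).strip()
            yield item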

How to enable overwriting a file every time in Scrapy item export?

最后都变了- submitted on 2019-12-23 01:12:15
Question: I am scraping a website which returns a list of URLs. Example:

    scrapy crawl xyz_spider -o urls.csv

It is working absolutely fine; now what I want is for it to create a new urls.csv instead of appending data to the file. Is there any parameter I can pass to enable this?

Answer 1: Unfortunately Scrapy can't do this at the moment. There is a proposed enhancement on GitHub though: https://github.com/scrapy/scrapy/issues/547 However, you can easily redirect the output to stdout and then redirect that to a file:
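The answer is cut off at the colon above; the redirect it describes would look roughly like the following (my reconstruction, not the answer's exact command). Here `-o -` writes the feed to stdout and the shell's `>` truncates urls.csv on every run. Much later Scrapy releases (2.0+) also added an uppercase `-O` option that overwrites the output file directly.

    scrapy crawl xyz_spider -t csv --nolog -o - > urls.csv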

How to recursively crawl subpages with Scrapy

徘徊边缘 submitted on 2019-12-22 18:36:27
Question: So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow a sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

    Category 1 name
        Subcategory 1 name
            data from this subcategory's page
        Subcategory n name
            data from this page
    Category n name
        Subcategory 1 name
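A sketch of the usual way to carry a nested structure like this through a Scrapy crawl (placeholder selectors and URLs, not the asker's site): pass the partially built item down through the callbacks via request.meta and yield one flat item per subcategory; grouping those items into the nested JSON shape is then typically done in a pipeline or after the crawl.

    import scrapy

    class CategorySpider(scrapy.Spider):
        name = "categories_sketch"
        start_urls = ["http://example.com/categories"]  # placeholder URL

        def parse(self, response):
            for cat in response.css("div.category"):
                yield response.follow(cat.css("a::attr(href)").get(),
                                      callback=self.parse_category,
                                      meta={"category": cat.css("a::text").get()})

        def parse_category(self, response):
            for sub in response.css("div.subcategory"):
                yield response.follow(sub.css("a::attr(href)").get(),
                                      callback=self.parse_subcategory,
                                      meta={"category": response.meta["category"],
                                            "subcategory": sub.css("a::text").get()})

        def parse_subcategory(self, response):
            # One flat item per subcategory page; nest them later if needed.
            yield {
                "category": response.meta["category"],
                "subcategory": response.meta["subcategory"],
                "data": " ".join(response.css("p::text").getall()),
            }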

Scrapy store returned items in variables to use in main script

江枫思渺然 submitted on 2019-12-22 14:05:03
Question: I am quite new to Scrapy and want to try the following: extract some values from a webpage, store them in a variable, and use them in my main script. Therefore I followed the tutorial and changed the code for my purposes:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/'
        ]
        custom_settings = {
            'LOG_ENABLED': 'False',
        }

        def parse(self, response):
            global title  # This would work, but there should
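A minimal sketch of one way to get the scraped values back into the main script without a global variable (an assumption, not the tutorial's code): connect a callback to the item_scraped signal and collect the items into a plain list.

    import scrapy
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/page/1/"]
        custom_settings = {"LOG_ENABLED": False}

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {"title": quote.css("span.text::text").get()}

    if __name__ == "__main__":
        collected = []

        def item_scraped(item, response, spider):
            # Called once per scraped item; stash it for the main script.
            collected.append(item)

        process = CrawlerProcess()
        crawler = process.create_crawler(QuotesSpider)
        crawler.signals.connect(item_scraped, signal=signals.item_scraped)
        process.crawl(crawler)
        process.start()

        print(collected)  # the items are now ordinary Python objects in this script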

CrawlSpider with Splash getting stuck after first URL

妖精的绣舞 submitted on 2019-12-22 10:55:04
Question: I'm writing a Scrapy spider where I need to render some of the responses with Splash. My spider is based on CrawlSpider. I need to render my start_url responses to feed my crawl spider. Unfortunately my crawl spider stops after rendering the first response. Any idea what is going wrong?

    class VideoSpider(CrawlSpider):
        start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']

        rules = (
            Rule(LinkExtractor(allow=()), callback='parse_items', process_request="use_splash",),
        )

        def use
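One workaround commonly suggested for this symptom, shown as a sketch rather than a confirmed fix for the poster's exact setup: instead of returning a brand-new SplashRequest from process_request, keep the Request that the Rule produced and attach Splash metadata to it, so CrawlSpider's link-following still sees the kind of responses it expects. This assumes the scrapy-splash middlewares are enabled in settings.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class VideoSpider(CrawlSpider):
        name = "video_sketch"
        start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']

        rules = (
            Rule(LinkExtractor(allow=()), callback='parse_items',
                 process_request='use_splash'),
        )

        def use_splash(self, request):
            # Render this request through Splash instead of replacing it
            # (older Rule API: process_request receives just the request).
            request.meta['splash'] = {
                'endpoint': 'render.html',
                'args': {'wait': 1.0},
            }
            return request

        def parse_items(self, response):
            yield {'url': response.url, 'title': response.css('title::text').get()}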

How many items have been scraped per start_url

ぃ、小莉子 submitted on 2019-12-22 08:50:19
Question: I use Scrapy to crawl 1000 URLs and store the scraped items in MongoDB. I'd like to know how many items have been found for each URL. From the Scrapy stats I can see 'item_scraped_count': 3500. However, I need this count for each start_url separately. There is also a referer field for each item that I might use to count each URL's items manually:

    2016-05-24 15:15:10 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=6w-_ucPV674> (referer: https://www.youtube.com/results?q=billys&sp=EgQIAhAB)
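A sketch of one way to get that per-start_url count without parsing referers by hand (an assumption, not the asker's solution): tag every request with its originating start_url in meta, propagate the tag to follow-up requests, and bump a per-URL stats key whenever an item is produced.

    import scrapy

    class CountingSpider(scrapy.Spider):
        name = "counting_sketch"
        start_urls = ["https://www.youtube.com/results?q=billys&sp=EgQIAhAB"]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse, meta={"start_url": url})

        def parse(self, response):
            start_url = response.meta["start_url"]
            for href in response.css("a::attr(href)").getall():
                # Keep propagating the originating start_url to every follow-up request.
                yield response.follow(href, callback=self.parse_item,
                                      meta={"start_url": start_url})

        def parse_item(self, response):
            start_url = response.meta["start_url"]
            # One counter per start_url shows up in the final stats dump.
            self.crawler.stats.inc_value("item_scraped_count/" + start_url)
            yield {"url": response.url, "start_url": start_url}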

How to get the followers of a person as well as comments under the photos on Instagram using Scrapy?

牧云@^-^@ submitted on 2019-12-21 22:40:27
Question: As you can see, the following JSON has the number of followers as well as the number of comments, but how can I access the data within each comment as well as the IDs of followers so I could crawl into them?

    {
        "logging_page_id": "profilePage_20327023",
        "user": {
            "biography": null,
            "blocked_by_viewer": false,
            "connected_fb_page": null,
            "country_block": false,
            "external_url": null,
            "external_url_linkshimmed": null,
            "followed_by": {
                "count": 2585
            },
            "followed_by_viewer": false,
            "follows": {
                "count": 561
            },
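A minimal sketch of reading the counts that actually appear in this JSON blob. Note that the blob only carries the follower/following counts; the individual follower IDs and comment texts are not in it, so crawling them would require Instagram's separate paginated endpoints, which are not shown here.

    import json

    def extract_counts(raw_json: str) -> dict:
        data = json.loads(raw_json)
        user = data["user"]
        return {
            "followers": user["followed_by"]["count"],
            "following": user["follows"]["count"],
        }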

Logging to a specific error log file in Scrapy

瘦欲@ submitted on 2019-12-20 13:15:49
Question: I am running a Scrapy log by doing this:

    from scrapy import log

    class MySpider(BaseSpider):
        name = "myspider"

        def __init__(self, name=None, **kwargs):
            LOG_FILE = "logs/spider.log"
            log.log.defaultObserver = log.log.DefaultObserver()
            log.log.defaultObserver.start()
            log.started = False
            log.start(LOG_FILE, loglevel=log.INFO)
            super(MySpider, self).__init__(name, **kwargs)

        def parse(self, response):
            ....
            raise Exception("Something went wrong!")
            log.msg('Something went wrong!', log.ERROR) #
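The scrapy.log API used above belongs to old Scrapy versions; current Scrapy logs through the standard logging module. A sketch, assuming the goal is simply to send ERROR-and-above messages to their own file (and that the logs/ directory already exists):

    import logging

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["http://example.com"]  # placeholder

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # A handler that only accepts ERROR and above, writing to its own file.
            error_handler = logging.FileHandler("logs/spider_errors.log")
            error_handler.setLevel(logging.ERROR)
            error_handler.setFormatter(
                logging.Formatter("%(asctime)s [%(name)s] %(levelname)s: %(message)s"))
            # Scrapy's log records propagate to the root logger, so attach it there.
            logging.getLogger().addHandler(error_handler)

        def parse(self, response):
            self.logger.error("Something went wrong!")  # lands in logs/spider_errors.log
            yield {"url": response.url}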

Get Scrapy spider to crawl entire site

拜拜、爱过 submitted on 2019-12-20 10:42:32
Question: I am using Scrapy to crawl old sites that I own, and I am using the code below as my spider. I don't mind having files output for each webpage, or a database with all the content in it. But I do need the spider to crawl the whole thing without me having to put in every single URL, as I currently have to do.

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["www.example.com"]
        start_urls = [
            "http://www.example.com/contactus"
        ]

        def parse
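A sketch of the usual way to crawl a whole site without listing every URL (not necessarily the asker's final code): a CrawlSpider with a Rule whose LinkExtractor follows every link that stays inside allowed_domains.

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class SiteSpider(CrawlSpider):
        name = "whole_site_sketch"
        allowed_domains = ["www.example.com"]
        start_urls = ["http://www.example.com/"]

        rules = (
            # No allow pattern: follow every link that stays within allowed_domains.
            Rule(LinkExtractor(), callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }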