scrapy-spider

How to control the order of yield in Scrapy

Submitted by 牧云@^-^@ on 2020-01-01 12:01:37
Question: Help! Please read the following Scrapy code and the crawler's output. I want to crawl some data from http://china.fathom.info/data/data.json, and only Scrapy is allowed. But I don't know how to control the order of yield: I expect all the parse_member requests in the loop to be processed before the group_item is returned, but yield item always seems to execute before yield request.

    start_urls = [
        "http://china.fathom.info/data/data.json"
    ]

    def parse(self, response):
        groups = json.loads(response.body)
    …
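
The excerpt cuts off above, but the usual pattern for this (a minimal sketch, not the poster's full spider; the member_urls field name is an assumption) is to carry the group item through request.meta and chain one member request to the next, yielding the item only from the last callback:

    import json
    import scrapy

    class GroupSpider(scrapy.Spider):
        name = "groups"
        start_urls = ["http://china.fathom.info/data/data.json"]

        def parse(self, response):
            groups = json.loads(response.body)
            for group in groups:
                item = {"group": group.get("name"), "members": []}
                member_urls = group.get("member_urls", [])  # assumed field name
                if not member_urls:
                    yield item
                    continue
                # Start the chain with the first member page; the rest follow in parse_member.
                yield scrapy.Request(
                    member_urls[0],
                    callback=self.parse_member,
                    meta={"item": item, "pending": member_urls[1:]},
                )

        def parse_member(self, response):
            item = response.meta["item"]
            item["members"].append(response.url)  # placeholder for real member parsing
            pending = response.meta["pending"]
            if pending:
                # Chain the next member request, carrying the same item along.
                yield scrapy.Request(
                    pending[0],
                    callback=self.parse_member,
                    meta={"item": item, "pending": pending[1:]},
                )
            else:
                # Every member page has been processed, so the group item is emitted last.
                yield item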

scraping multiple pages with scrapy

Submitted by 醉酒当歌 on 2020-01-01 03:28:11
Question: I am trying to use Scrapy to scrape a website that has several pages of information. My code is:

    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    from tcgplayer1.items import Tcgplayer1Item

    class MySpider(BaseSpider):
        name = "tcg"
        allowed_domains = ["http://www.tcgplayer.com/"]
        start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

        def parse(self, response):
            hxs = Selector(response)
            titles = hxs.xpath("//div[@class='magicCard']")
            for title in …
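
A minimal sketch of one common way to walk the remaining pages is to follow the next-page link from each response with response.follow; the link selector and the extracted field are assumptions about the site markup rather than the poster's code, and note that allowed_domains should contain a bare domain, not a URL:

    import scrapy

    class TcgSpider(scrapy.Spider):
        name = "tcg"
        allowed_domains = ["tcgplayer.com"]  # bare domain, not a URL
        start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

        def parse(self, response):
            for card in response.xpath("//div[@class='magicCard']"):
                # Placeholder extraction; the real fields depend on the card markup.
                yield {"title": card.xpath(".//text()").get()}
            # Follow the next page, if any; the selector is an assumption about the site.
            next_page = response.xpath("//a[contains(text(), 'Next')]/@href").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)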

Scrapy CrawlSpider retry scrape

Submitted by 我的未来我决定 on 2019-12-30 11:24:08
Question: For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some JavaScript that auto-reloads until it gets the real page. I can detect when this happens, and I want to retry downloading and scraping the page. The logic I use in my CrawlSpider is something like:

    def parse_page(self, response):
        url = response.url
        # Check to make sure the page is loaded
        if 'var PageIsLoaded = false;' in response.body:
            self.logger.warning('parse_page …
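
A rough sketch of the common retry approach: re-issue the same request with dont_filter=True so the dupefilter does not swallow it, and cap the retries with a counter carried in meta (the cap of 3 and the placeholder start URL are arbitrary choices, not from the question):

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class RetryingSpider(CrawlSpider):
        name = "retry_placeholder"
        start_urls = ["http://example.com"]  # placeholder
        rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

        def parse_page(self, response):
            # A placeholder page was served instead of the real content.
            if b'var PageIsLoaded = false;' in response.body:
                retries = response.meta.get("placeholder_retries", 0)
                if retries < 3:  # retry cap is an arbitrary choice
                    self.logger.warning("Placeholder page at %s, retrying", response.url)
                    # dont_filter=True bypasses the dupefilter so the same URL is downloaded again.
                    yield response.request.replace(
                        dont_filter=True,
                        meta={**response.meta, "placeholder_retries": retries + 1},
                    )
                return
            yield {"url": response.url}  # real extraction would go here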

Scrapy upload file

Submitted by 半腔热情 on 2019-12-29 09:19:10
Question: I am making a form request to a website using Scrapy. The form requires uploading a PDF file; how can I do that in Scrapy? I am trying something like this:

    FormRequest(url, callback=self.parseSearchResponse, method="POST",
                formdata={'filename': 'abc.xyz', 'file': 'path to file/abc.xyz'})

Answer 1: At this very moment Scrapy has no built-in support for uploading files. File uploading via forms in HTTP was specified in RFC 1867. According to the spec, an HTTP request with Content-Type: multipart/form-data is …
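
The answer excerpt cuts off above. Purely as an illustration of the idea, here is a hedged sketch of building a multipart/form-data body by hand and sending it with a plain scrapy.Request; the boundary, the form field name "file", and the file name "abc.xyz" are placeholders, not taken from the original answer:

    import scrapy

    def multipart_upload_request(url, file_path, callback):
        # Hand-built multipart/form-data body; the boundary, field name and
        # file name are placeholders and must match what the target form expects.
        boundary = "----scrapy-upload-boundary"
        with open(file_path, "rb") as f:
            file_data = f.read()
        parts = [
            b"--" + boundary.encode(),
            b'Content-Disposition: form-data; name="file"; filename="abc.xyz"',
            b"Content-Type: application/octet-stream",
            b"",
            file_data,
            b"--" + boundary.encode() + b"--",
            b"",
        ]
        return scrapy.Request(
            url,
            method="POST",
            headers={"Content-Type": "multipart/form-data; boundary=" + boundary},
            body=b"\r\n".join(parts),
            callback=callback,
        )

The returned request would then be yielded from a spider callback like any other; whether a given server accepts it depends on the exact field names its form expects.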

Scrapy: Extracting data from source and its links

Submitted by 我们两清 on 2019-12-25 17:19:41
Question: Edited question to link to the original: Scrapy getting data from links within table. From the link https://www.tdcj.state.tx.us/death_row/dr_info/trottiewillielast.html I am trying to get info from the main table as well as the data within the other two links inside the table. I managed to pull from one; the question is how to go to the other link and append the data to one record.

    from urlparse import urljoin
    import scrapy
    from texasdeath.items import DeathItem

    class DeathItem(Item):
        firstName = …
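
A minimal sketch of the usual way to combine data from a page and its linked pages into one record: carry a partially filled item through request.meta and yield it only from the final callback. All selectors and field names below are placeholders, not the poster's actual ones:

    import scrapy

    class DeathRowSpider(scrapy.Spider):
        name = "deathrow"
        start_urls = ["https://www.tdcj.state.tx.us/death_row/dr_info/trottiewillielast.html"]

        def parse(self, response):
            # Data from the first page; selector and field name are placeholders.
            item = {"last_statement": " ".join(response.xpath("//p/text()").getall())}
            links = response.xpath("//a/@href").getall()[:2]  # the two links to follow (assumed selector)
            if len(links) < 2:
                yield item
                return
            yield response.follow(
                links[0],
                callback=self.parse_first_link,
                meta={"item": item, "next_link": links[1]},
            )

        def parse_first_link(self, response):
            item = response.meta["item"]
            item["first_link_data"] = response.xpath("//title/text()").get()  # placeholder field
            yield response.follow(
                response.meta["next_link"],
                callback=self.parse_second_link,
                meta={"item": item},
            )

        def parse_second_link(self, response):
            item = response.meta["item"]
            item["second_link_data"] = response.xpath("//title/text()").get()  # placeholder field
            # One combined record containing data from all three pages.
            yield item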

scrapy-splash usage for rendering javascript

Submitted by 回眸只為那壹抹淺笑 on 2019-12-25 08:24:45
Question: This is a follow-up to my previous question. I installed Splash and scrapy-splash, and also followed the instructions for scrapy-splash. I edited my code as follows:

    import scrapy
    from scrapy_splash import SplashRequest

    class CityDataSpider(scrapy.Spider):
        name = "citydata"

        def start_requests(self):
            urls = [
                'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p …
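
A hedged sketch of how scrapy-splash is typically wired up: the settings below follow the scrapy-splash README, while the wait time, the shortened URL (the long query string is omitted), and the extracted fields are assumptions:

    import scrapy
    from scrapy_splash import SplashRequest

    class CityDataSpider(scrapy.Spider):
        name = "citydata"

        # Settings as documented in the scrapy-splash README; Splash is assumed
        # to be listening on localhost:8050.
        custom_settings = {
            "SPLASH_URL": "http://localhost:8050",
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_splash.SplashCookiesMiddleware": 723,
                "scrapy_splash.SplashMiddleware": 725,
                "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
            },
            "SPIDER_MIDDLEWARES": {
                "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
            },
            "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
        }

        def start_requests(self):
            urls = ["http://www.city-data.com/advanced/search.php"]  # query string omitted here
            for url in urls:
                # Render the page in Splash and give its JavaScript a moment to run.
                yield SplashRequest(url, callback=self.parse, args={"wait": 2})

        def parse(self, response):
            # Placeholder extraction from the rendered HTML.
            yield {"url": response.url, "title": response.xpath("//title/text()").get()}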

How to add instance variable to Scrapy CrawlSpider?

Submitted by 落爺英雄遲暮 on 2019-12-25 07:24:47
Question: I am running a CrawlSpider and I want to implement some logic to stop following some of the links mid-run, by passing a function to process_request. This function uses the spider's class variables to keep track of the current state; depending on that state (and on the referrer URL), links are either dropped or processed further:

    class BroadCrawlSpider(CrawlSpider):
        name = 'bitsy'
        start_urls = ['http://scrapy.org']
        foo = 5
        rules = (
            Rule(LinkExtractor(), callback='parse_item', process …
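
A minimal sketch of one way this is usually handled: pass process_request as a method name (a string), so Scrapy looks it up on the spider instance and the bound method can read and update instance state such as self.foo. The drop condition below is a placeholder, and the optional second argument accommodates newer Scrapy versions, which also pass the originating response to process_request:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class BroadCrawlSpider(CrawlSpider):
        name = "bitsy"
        start_urls = ["http://scrapy.org"]
        foo = 5

        rules = (
            # Passing the method name as a string makes Scrapy resolve it on the
            # spider instance, so it runs as a bound method with access to self.
            Rule(LinkExtractor(), callback="parse_item",
                 process_request="filter_request", follow=True),
        )

        def filter_request(self, request, response=None):
            # Placeholder logic: stop following links once the counter runs out.
            if self.foo <= 0:
                return None  # returning None drops the request
            self.foo -= 1
            return request

        def parse_item(self, response):
            yield {"url": response.url}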

Scrapy pipeline extracting in the wrong csv format

Submitted by 坚强是说给别人听的谎言 on 2019-12-25 03:43:08
Question: My Hacker News spider outputs all the results on one line instead of one per line, as can be seen here. Here is my code:

    import scrapy
    import string
    import urlparse
    from scrapy.selector import Selector
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors import LinkExtractor

    class HnItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        score = scrapy.Field()

    class HnSpider(scrapy.Spider):
        name = 'hackernews'
        allowed_domains = [ …
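
The usual cause of everything landing on one CSV line is yielding a single item whose fields are whole lists. A minimal sketch of the fix is to yield one item per story so the feed exporter writes one row each; the XPath selectors are assumptions about the Hacker News markup, not the poster's code:

    import scrapy

    class HnSpider(scrapy.Spider):
        name = "hackernews"
        allowed_domains = ["news.ycombinator.com"]
        start_urls = ["https://news.ycombinator.com/"]

        def parse(self, response):
            # One item per story row, so the CSV feed exporter writes one line each.
            for row in response.xpath("//tr[@class='athing']"):
                yield {
                    "title": row.xpath(".//span[@class='titleline']/a/text()").get(),
                    "link": row.xpath(".//span[@class='titleline']/a/@href").get(),
                    # The score sits in the sibling row that follows each story row.
                    "score": row.xpath("following-sibling::tr[1]//span[@class='score']/text()").get(),
                }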

Unable to rename downloaded images through pipelines without the usage of item.py

Submitted by 半城伤御伤魂 on 2019-12-25 00:20:03
Question: I've created a script using Python's Scrapy module to download and rename movie images from multiple pages of a torrent site and store them in a desktop folder. Downloading and storing those images in a desktop folder works without errors. However, what I'm struggling with now is renaming those files on the fly. As I didn't make use of an items.py file and don't wish to, I hardly understand how the logic in the pipelines.py file should handle the …
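
A minimal sketch of a custom ImagesPipeline that renames files without any items.py: it assumes the spider yields plain dicts carrying image_urls and a movie_title key (both names are placeholders), and the pipeline still needs to be enabled via ITEM_PIPELINES and IMAGES_STORE in settings:

    import os
    import scrapy
    from scrapy.pipelines.images import ImagesPipeline

    class RenamingImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # Carry the desired file name along with each image download request.
            for url in item.get("image_urls", []):
                yield scrapy.Request(url, meta={"title": item.get("movie_title", "image")})

        def file_path(self, request, response=None, info=None, *, item=None):
            # Name the stored file after the title passed via meta, keeping the
            # original extension (defaulting to .jpg if none can be derived).
            ext = os.path.splitext(request.url)[1] or ".jpg"
            return "full/{}{}".format(request.meta["title"], ext)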

I want to add item class within an item class

Submitted by 风流意气都作罢 on 2019-12-24 20:14:18
Question: The final JSON should look like:

    "address": ----,
    "state": ----,
    year: {
        "first": ----,
        "second": {
            "basic": ----,
            "Information": ----,
        }
    },

I want to write my items.py like this (just an example):

    class Item(scrapy.Item):
        address = scrapy.Field()
        state = scrapy.Field()
        year = scrapy.Field(first), scrapy.Field(second)

    class first(scrapy.Item):
        amounts = scrapy.Field()

    class second(scrapy.Item):
        basic = scrapy.Field()
        information = scrapy.Field()

How can I implement this? I have already checked https://doc.scrapy …
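
One way this is commonly handled, as a minimal sketch: a scrapy.Field can store any Python object, so the nesting is built by assigning inner items (or plain dicts) to fields of the outer item rather than by parameterizing scrapy.Field. The class and value placeholders below are illustrative, not the poster's real names:

    import scrapy

    class FirstItem(scrapy.Item):
        amounts = scrapy.Field()

    class SecondItem(scrapy.Item):
        basic = scrapy.Field()
        information = scrapy.Field()

    class YearItem(scrapy.Item):
        first = scrapy.Field()
        second = scrapy.Field()

    class AddressItem(scrapy.Item):
        address = scrapy.Field()
        state = scrapy.Field()
        year = scrapy.Field()

    # Usage: build the inner items first, then assign them to the outer fields.
    item = AddressItem(
        address="----",
        state="----",
        year=YearItem(
            first=FirstItem(amounts="----"),
            second=SecondItem(basic="----", information="----"),
        ),
    )

When exported as JSON, this should produce the nested structure shown in the question.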