scrapy-spider

Pass Selenium HTML string to Scrapy to add URLs to Scrapy's list of URLs to scrape

Submitted on 2019-12-14 03:55:40
Question: I'm very new to Python, Scrapy and Selenium, so any help you could provide would be most appreciated. I'd like to be able to take the HTML I've obtained from Selenium as the page source and process it into a Scrapy Response object. The main reason is to be able to add the URLs in the Selenium WebDriver page source to the list of URLs Scrapy will parse. Again, any help would be appreciated. As a quick second question, does anyone know how to view the list of URLs that are in or were in the …
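A minimal sketch of the usual approach, assuming a working Firefox WebDriver: wrap driver.page_source in a scrapy.http.HtmlResponse so the normal selectors work, then queue every extracted link back into Scrapy. The spider name and the link XPath here are generic placeholders, not taken from the question.

```python
import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumBridgeSpider(scrapy.Spider):
    name = "selenium_bridge"              # hypothetical spider name
    start_urls = ["http://example.com"]   # placeholder start URL

    def parse(self, response):
        driver = webdriver.Firefox()
        driver.get(response.url)
        # Build a Scrapy response object from the Selenium-rendered HTML
        rendered = HtmlResponse(url=driver.current_url,
                                body=driver.page_source,
                                encoding="utf-8")
        driver.quit()
        # Add every link found in the rendered page to Scrapy's crawl queue
        for href in rendered.xpath("//a/@href").extract():
            yield scrapy.Request(rendered.urljoin(href), callback=self.parse)
```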

From scraper_user.items import UserItem ImportError: No module named scraper_user.items

Submitted on 2019-12-14 03:35:59
Question: I am following this guide for scraping data from Instagram: http://www.spataru.at/scraping-instagram-scrapy/ but I get this error: mona@pascal:~/computer_vision/instagram/instagram$ ls instagram scrapy.cfg mona@pascal:~/computer_vision/instagram/instagram$ scrapy crawl instagramspider 2017-03-01 15:30:10-0600 [scrapy] INFO: Scrapy 0.14.4 started (bot: instagram) 2017-03-01 15:30:10-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, …
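This ImportError usually just means the import path does not match the Scrapy project's package name. From the shell prompt in the log the package appears to be `instagram`, so a hedged sketch of a matching items module and import might look like this (the fields are illustrative, not taken from the guide):

```python
# instagram/items.py
import scrapy

class UserItem(scrapy.Item):
    username = scrapy.Field()    # example field, not from the question
    followers = scrapy.Field()   # example field, not from the question

# In the spider, import from the actual project package instead of "scraper_user":
# from instagram.items import UserItem
```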

Trouble getting the correct XPath

Submitted on 2019-12-13 22:22:49
Question: I am trying to pull all product links and image links out of a shopping widget using general XPaths. This is the site: http://www.stopitrightnow.com/ This is the XPath I have: xpath('.//*[@class="shopthepost-widget"]/a/@href').extract() I would have thought this would pull all the links, but it does nothing. Following is the beginning of the widget source for reference: class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1"><div id="stp-55d44feabd0eb" class="stp-outer stp-no-controls …
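A hedged guess at a fix: the anchors sit several levels below the widget `<div>`, so a descendant step (`//a`) is needed instead of the direct-child step (`/a`). If even that returns nothing, the widget is most likely filled in by JavaScript after page load and the links never appear in the HTML Scrapy downloads.

```python
# Descendant axis instead of direct child:
product_links = response.xpath('//*[@class="shopthepost-widget"]//a/@href').extract()
image_links = response.xpath('//*[@class="shopthepost-widget"]//img/@src').extract()
# Empty results here usually mean the widget content is rendered client-side,
# so a headless browser (or the widget's own data endpoint) is required.
```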

Calling the same spider programmatically

Submitted on 2019-12-13 21:52:33
Question: I have a spider which crawls links for the websites passed to it. I want to start the same spider again when its execution has finished, with a different set of data. How can I restart the same crawler? The websites are passed through a database. I want the crawler to run in an unlimited loop until all the websites are crawled. Currently I have to start the crawler with scrapy crawl every time. Is there any way to start the crawler once and have it stop when all the websites are crawled? I searched …
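One way to do this is with CrawlerRunner, which lets the same spider be scheduled repeatedly inside a single Twisted reactor. A minimal sketch, assuming a hypothetical MySpider class and a hypothetical get_unprocessed_urls() helper that reads the remaining websites from the database:

```python
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_until_done():
    while True:
        urls = get_unprocessed_urls()          # hypothetical DB lookup
        if not urls:
            break                              # nothing left to crawl
        yield runner.crawl(MySpider, start_urls=urls)
    reactor.stop()

crawl_until_done()
reactor.run()   # the reactor is started only once, runs until reactor.stop()
```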

Unable to scrape Myntra API data using the Scrapy framework: 307 redirect error

Submitted on 2019-12-13 10:07:15
Question: Below is the spider code: import scrapy class MyntraSpider(scrapy.Spider): custom_settings = { 'HTTPCACHE_ENABLED': False, 'dont_redirect': True, #'handle_httpstatus_list' : [302,307], #'CRAWLERA_ENABLED': False, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36', } name = "heytest" allowed_domains = ["www.myntra.com"] start_urls = ["https://www.myntra.com/web/v2/search/data/duke"] def parse(self, response): self …
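One likely issue: `dont_redirect` and `handle_httpstatus_list` are per-request `meta` keys, not settings, so placing them in `custom_settings` has no effect. A hedged sketch of passing them on the Request itself (the extra Accept header is an assumption about what the API expects, not something confirmed by the question):

```python
import scrapy

class MyntraSpider(scrapy.Spider):
    name = "heytest"
    allowed_domains = ["www.myntra.com"]

    def start_requests(self):
        yield scrapy.Request(
            "https://www.myntra.com/web/v2/search/data/duke",
            meta={"dont_redirect": True,
                  "handle_httpstatus_list": [301, 302, 307]},
            headers={"Accept": "application/json"},   # assumption about the API
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("status=%s body_length=%s", response.status, len(response.body))
```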

Webpage access while using Scrapy

Submitted on 2019-12-13 04:01:00
Question: I am new to Python and Scrapy. I followed the tutorial and tried to crawl a few webpages. I used the code in the tutorial and replaced the URLs with http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0 and http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819 …
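A hedged observation that may explain the behaviour: everything after `#` is a URL fragment that is never sent to the server, so both start URLs fetch exactly the same page. If the site also accepts these parameters as an ordinary query string, rewriting the URL without the `#body` fragment would give Scrapy genuinely different requests:

```python
# Sketch only: whether search.php honours these parameters as a query string
# (rather than via the JavaScript that reads the fragment) is an assumption.
start_urls = [
    "http://www.city-data.com/advanced/search.php"
    "?fips=0&csize=a&sc=2&sd=0&states=ALL&near="
    "&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1"
    "&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0",
]
```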

IMDB web crawler - Scrapy - Python

Submitted on 2019-12-13 03:48:36
Question: import scrapy from imdbscrape.items import MovieItem class MovieSpider(scrapy.Spider): name = 'movie' allowed_domains = ['imdb.com'] start_urls = ['https://www.imdb.com/search/title?year=2017,2018&title_type=feature&sort=moviemeter,asc'] def parse(self, response): urls = response.css('h3.lister-item-header > a::attr(href)').extract() for url in urls: yield scrapy.Request(url=response.urljoin(url),callback=self.parse_movie) nextpg = response.css('div.desc > a::attr(href)').extract_first() if …
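For reference, a cleaned-up sketch of the pagination pattern the truncated parse method appears to be building; parse_movie is assumed to be defined elsewhere in the spider:

```python
def parse(self, response):
    # Follow every movie link on the current result page
    for url in response.css('h3.lister-item-header > a::attr(href)').extract():
        yield scrapy.Request(response.urljoin(url), callback=self.parse_movie)

    # Follow the "Next" link if one exists
    nextpg = response.css('div.desc > a::attr(href)').extract_first()
    if nextpg is not None:
        yield scrapy.Request(response.urljoin(nextpg), callback=self.parse)
```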

Change the number of running spiders in scrapyd

Submitted on 2019-12-13 00:35:27
Question: Hey, so I have about 50 spiders in my project and I'm currently running them via a scrapyd server. I'm running into an issue where some of the resources I use get locked and make my spiders fail or go really slowly. I was hoping there was some way to tell scrapyd to only have one running spider at a time and leave the rest in the pending queue. I didn't see a configuration option for this in the docs. Any help would be much appreciated! Answer 1: This can be controlled by scrapyd settings. Set max_proc …
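A minimal sketch of the relevant scrapyd.conf section: max_proc caps the total number of spiders running at once, and anything beyond that cap waits in the pending queue.

```ini
[scrapyd]
max_proc = 1
max_proc_per_cpu = 1
```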

Scraping Ajax based Review Page with Scrapy

Submitted on 2019-12-13 00:32:22
Question: Hi there. I am trying to scrape a website. Everything is working fine; the problem is that I cannot figure out how to scrape the AJAX content. The website I am scraping loads its review pages over AJAX using a POST request. Here is what the Chrome dev tools show. [screenshot: Chrome Dev Tools] I researched a lot but I cannot understand how to scrape AJAX content. I know about form data and POST or GET requests, but I cannot use them. Moreover, I don't know how to scrape the content I need. I guess it cannot be …
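A hedged sketch of replaying that POST from Scrapy with FormRequest; the endpoint URL and the form fields below are placeholders that would have to be copied from the Form Data section of the request shown in the Network tab.

```python
import scrapy

class ReviewSpider(scrapy.Spider):
    name = "reviews"                                # hypothetical spider name
    start_urls = ["https://example.com/product"]    # placeholder product page

    def parse(self, response):
        yield scrapy.FormRequest(
            url="https://example.com/ajax/reviews",        # placeholder AJAX endpoint
            formdata={"product_id": "123", "page": "2"},   # placeholder form data
            callback=self.parse_reviews,
        )

    def parse_reviews(self, response):
        # The endpoint typically returns an HTML fragment or JSON;
        # inspect response.text to decide how to extract the reviews.
        self.logger.info(response.text[:200])
```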

Need help simulating an XHR request

Submitted on 2019-12-12 18:51:22
Question: I need to scrape a website with a "load more" button. This is my spider code, written in Python: import scrapy import json import requests import re from parsel import Selector from scrapy.selector import Selector from scrapy.http import HtmlResponse headers = { 'origin': 'https://www.tayara.tn', 'accept-encoding': 'gzip, deflate, br', 'accept-language': 'en-US,en;q=0.9', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari …
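A minimal sketch of simulating the load-more XHR directly from Scrapy, reusing headers like those in the question; the endpoint URL and the payload are placeholders that would need to be copied from the request Chrome's Network tab shows when the button is clicked.

```python
import json
import scrapy

class TayaraSpider(scrapy.Spider):
    name = "tayara"                                   # hypothetical spider name
    start_urls = ["https://www.tayara.tn"]

    def parse(self, response):
        payload = {"page": 2}                         # placeholder request payload
        yield scrapy.Request(
            url="https://www.tayara.tn/api/listings", # placeholder XHR endpoint
            method="POST",
            body=json.dumps(payload),
            headers={"Content-Type": "application/json",
                     "origin": "https://www.tayara.tn"},
            callback=self.parse_more,
        )

    def parse_more(self, response):
        # Parse the JSON the endpoint returns and yield items / further requests.
        data = json.loads(response.text)
        self.logger.info("got %s top-level keys", len(data))
```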