scrapy-spider

Pass Selenium HTML string to Scrapy to add URLs to Scrapy's list of URLs to scrape

Submitted on 2019-12-14 03:55:40
Question: I'm very new to Python, Scrapy and Selenium, so any help you could provide would be most appreciated. I'd like to be able to take the HTML I've obtained from Selenium as the page source and process it into a Scrapy Response object. The main reason is to be able to add the URLs in the Selenium WebDriver page source to the list of URLs Scrapy will parse. Again, any help would be appreciated. As a quick second question, does anyone know how to view the list of URLs that are in or were in the …
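A minimal sketch of the usual approach, assuming a working Firefox WebDriver: wrap driver.page_source in a scrapy.http.HtmlResponse so the normal selectors work, then queue every extracted link back into Scrapy. The spider name and the link XPath here are generic placeholders, not taken from the question.

```python
import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumBridgeSpider(scrapy.Spider):
    name = "selenium_bridge"              # hypothetical spider name
    start_urls = ["http://example.com"]   # placeholder start URL

    def parse(self, response):
        driver = webdriver.Firefox()
        driver.get(response.url)
        # Build a Scrapy response object from the Selenium-rendered HTML
        rendered = HtmlResponse(url=driver.current_url,
                                body=driver.page_source,
                                encoding="utf-8")
        driver.quit()
        # Add every link found in the rendered page to Scrapy's crawl queue
        for href in rendered.xpath("//a/@href").extract():
            yield scrapy.Request(rendered.urljoin(href), callback=self.parse)
```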

From scraper_user.items import UserItem ImportError: No module named scraper_user.items

Submitted on 2019-12-14 03:35:59
Question: I am following this guide for scraping data from Instagram: http://www.spataru.at/scraping-instagram-scrapy/ but I get this error: mona@pascal:~/computer_vision/instagram/instagram$ ls instagram scrapy.cfg mona@pascal:~/computer_vision/instagram/instagram$ scrapy crawl instagramspider 2017-03-01 15:30:10-0600 [scrapy] INFO: Scrapy 0.14.4 started (bot: instagram) 2017-03-01 15:30:10-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, …
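This ImportError usually just means the import path does not match the Scrapy project's package name. From the shell prompt in the log the package appears to be `instagram`, so a hedged sketch of a matching items module and import might look like this (the fields are illustrative, not taken from the guide):

```python
# instagram/items.py
import scrapy

class UserItem(scrapy.Item):
    username = scrapy.Field()    # example field, not from the question
    followers = scrapy.Field()   # example field, not from the question

# In the spider, import from the actual project package instead of "scraper_user":
# from instagram.items import UserItem
```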

Trouble getting the correct XPath

Submitted on 2019-12-13 22:22:49
Question: I am trying to pull all product links and image links out of a shopping widget using general XPaths. This is the site: http://www.stopitrightnow.com/ This is the XPath I have: xpath('.//*[@class="shopthepost-widget"]/a/@href').extract() I would have thought this would pull all the links, but it does nothing. Following is the beginning of the widget source for reference: class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1"><div id="stp-55d44feabd0eb" class="stp-outer stp-no-controls …
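A hedged guess at a fix: the anchors sit several levels below the widget `<div>`, so a descendant step (`//a`) is needed instead of the direct-child step (`/a`). If even that returns nothing, the widget is most likely filled in by JavaScript after page load and the links never appear in the HTML Scrapy downloads.

```python
# Descendant axis instead of direct child:
product_links = response.xpath('//*[@class="shopthepost-widget"]//a/@href').extract()
image_links = response.xpath('//*[@class="shopthepost-widget"]//img/@src').extract()
# Empty results here usually mean the widget content is rendered client-side,
# so a headless browser (or the widget's own data endpoint) is required.
```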

Calling the same spider programmatically

Submitted on 2019-12-13 21:52:33
Question: I have a spider which crawls links for the websites passed to it. I want to start the same spider again when its execution has finished, with a different set of data. How can I restart the same crawler? The websites are passed through a database. I want the crawler to run in an unlimited loop until all the websites are crawled. Currently I have to start the crawler with scrapy crawl every time. Is there any way to start the crawler once and have it stop when all the websites are crawled? I searched …
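One way to do this is with CrawlerRunner, which lets the same spider be scheduled repeatedly inside a single Twisted reactor. A minimal sketch, assuming a hypothetical MySpider class and a hypothetical get_unprocessed_urls() helper that reads the remaining websites from the database:

```python
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_until_done():
    while True:
        urls = get_unprocessed_urls()          # hypothetical DB lookup
        if not urls:
            break                              # nothing left to crawl
        yield runner.crawl(MySpider, start_urls=urls)
    reactor.stop()

crawl_until_done()
reactor.run()   # the reactor is started only once, runs until reactor.stop()
```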

Unable to scrape Myntra API data using the Scrapy framework: 307 redirect error

Submitted on 2019-12-13 10:07:15
Question: Below is the spider code: import scrapy class MyntraSpider(scrapy.Spider): custom_settings = { 'HTTPCACHE_ENABLED': False, 'dont_redirect': True, #'handle_httpstatus_list' : [302,307], #'CRAWLERA_ENABLED': False, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36', } name = "heytest" allowed_domains = ["www.myntra.com"] start_urls = ["https://www.myntra.com/web/v2/search/data/duke"] def parse(self, response): self …
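One likely issue: `dont_redirect` and `handle_httpstatus_list` are per-request `meta` keys, not settings, so placing them in `custom_settings` has no effect. A hedged sketch of passing them on the Request itself (the extra Accept header is an assumption about what the API expects, not something confirmed by the question):

```python
import scrapy

class MyntraSpider(scrapy.Spider):
    name = "heytest"
    allowed_domains = ["www.myntra.com"]

    def start_requests(self):
        yield scrapy.Request(
            "https://www.myntra.com/web/v2/search/data/duke",
            meta={"dont_redirect": True,
                  "handle_httpstatus_list": [301, 302, 307]},
            headers={"Accept": "application/json"},   # assumption about the API
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("status=%s body_length=%s", response.status, len(response.body))
```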

Webpage access while using Scrapy

Submitted on 2019-12-13 04:01:00
Question: I am new to Python and Scrapy. I followed the tutorial and tried to crawl a few webpages. I used the code in the tutorial and replaced the URLs with http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0 and http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819 …
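A hedged observation that may explain the behaviour: everything after `#` is a URL fragment that is never sent to the server, so both start URLs fetch exactly the same page. If the site also accepts these parameters as an ordinary query string, rewriting the URL without the `#body` fragment would give Scrapy genuinely different requests:

```python
# Sketch only: whether search.php honours these parameters as a query string
# (rather than via the JavaScript that reads the fragment) is an assumption.
start_urls = [
    "http://www.city-data.com/advanced/search.php"
    "?fips=0&csize=a&sc=2&sd=0&states=ALL&near="
    "&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1"
    "&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0",
]
```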

IMDB web crawler - Scrapy - Python

Submitted on 2019-12-13 03:48:36
Question: import scrapy from imdbscrape.items import MovieItem class MovieSpider(scrapy.Spider): name = 'movie' allowed_domains = ['imdb.com'] start_urls = ['https://www.imdb.com/search/title?year=2017,2018&title_type=feature&sort=moviemeter,asc'] def parse(self, response): urls = response.css('h3.lister-item-header > a::attr(href)').extract() for url in urls: yield scrapy.Request(url=response.urljoin(url),callback=self.parse_movie) nextpg = response.css('div.desc > a::attr(href)').extract_first() if …
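For reference, a cleaned-up sketch of the pagination pattern the truncated parse method appears to be building; parse_movie is assumed to be defined elsewhere in the spider:

```python
def parse(self, response):
    # Follow every movie link on the current result page
    for url in response.css('h3.lister-item-header > a::attr(href)').extract():
        yield scrapy.Request(response.urljoin(url), callback=self.parse_movie)

    # Follow the "Next" link if one exists
    nextpg = response.css('div.desc > a::attr(href)').extract_first()
    if nextpg is not None:
        yield scrapy.Request(response.urljoin(nextpg), callback=self.parse)
```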

Change the number of running spiders in scrapyd

Submitted on 2019-12-13 00:35:27
Question: Hey, so I have about 50 spiders in my project and I'm currently running them via a scrapyd server. I'm running into an issue where some of the resources I use get locked and make my spiders fail or go really slowly. I was hoping there was some way to tell scrapyd to only have one running spider at a time and leave the rest in the pending queue. I didn't see a configuration option for this in the docs. Any help would be much appreciated! Answer 1: This can be controlled by scrapyd settings. Set max_proc …
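A minimal sketch of the relevant scrapyd.conf section: max_proc caps the total number of spiders running at once, and anything beyond that cap waits in the pending queue.

```ini
[scrapyd]
max_proc = 1
max_proc_per_cpu = 1
```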

Scraping Ajax based Review Page with Scrapy

Submitted on 2019-12-13 00:32:22
Question: Hi there. I am trying to scrape a website. Everything is working fine; the problem is that I cannot figure out how to scrape the AJAX content. The website I am scraping loads its review pages over AJAX using a POST request. Here is what the Chrome dev tools show. [screenshot: Chrome Dev Tools] I researched a lot but I cannot understand how to scrape AJAX content. I know about form data and POST or GET requests, but I cannot use them. Moreover, I don't know how to scrape the content I need. I guess it cannot be …
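A hedged sketch of replaying that POST from Scrapy with FormRequest; the endpoint URL and the form fields below are placeholders that would have to be copied from the Form Data section of the request shown in the Network tab.

```python
import scrapy

class ReviewSpider(scrapy.Spider):
    name = "reviews"                                # hypothetical spider name
    start_urls = ["https://example.com/product"]    # placeholder product page

    def parse(self, response):
        yield scrapy.FormRequest(
            url="https://example.com/ajax/reviews",        # placeholder AJAX endpoint
            formdata={"product_id": "123", "page": "2"},   # placeholder form data
            callback=self.parse_reviews,
        )

    def parse_reviews(self, response):
        # The endpoint typically returns an HTML fragment or JSON;
        # inspect response.text to decide how to extract the reviews.
        self.logger.info(response.text[:200])
```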

Need help simulating an XHR request

Submitted on 2019-12-12 18:51:22
Question: I need to scrape a website with a "load more" button. This is my spider code, written in Python: import scrapy import json import requests import re from parsel import Selector from scrapy.selector import Selector from scrapy.http import HtmlResponse headers = { 'origin': 'https://www.tayara.tn', 'accept-encoding': 'gzip, deflate, br', 'accept-language': 'en-US,en;q=0.9', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari …
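A minimal sketch of simulating the load-more XHR directly from Scrapy, reusing headers like those in the question; the endpoint URL and the payload are placeholders that would need to be copied from the request Chrome's Network tab shows when the button is clicked.

```python
import json
import scrapy

class TayaraSpider(scrapy.Spider):
    name = "tayara"                                   # hypothetical spider name
    start_urls = ["https://www.tayara.tn"]

    def parse(self, response):
        payload = {"page": 2}                         # placeholder request payload
        yield scrapy.Request(
            url="https://www.tayara.tn/api/listings", # placeholder XHR endpoint
            method="POST",
            body=json.dumps(payload),
            headers={"Content-Type": "application/json",
                     "origin": "https://www.tayara.tn"},
            callback=self.parse_more,
        )

    def parse_more(self, response):
        # Parse the JSON the endpoint returns and yield items / further requests.
        data = json.loads(response.text)
        self.logger.info("got %s top-level keys", len(data))
```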