scrapy-spider

ImportError: No module named win32api while using Scrapy

半城伤御伤魂 submitted on 2019-12-20 09:15:53
Question: I am a new learner of Scrapy. I installed Python 2.7 and everything else it needs. Then I tried to build a Scrapy project following the tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html. In the crawling step, after I typed scrapy crawl dmoz, it generated this error:

    ImportError: No module named win32api
    [twisted] CRITICAL: Unhandled error in deferred

I am using Windows.

Answer 1: Try this:

    pip install pypiwin32

Answer 2: If you search a bit along the
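A quick sanity check after installing, as a sketch (run it with the same Python 2.7 interpreter that Scrapy uses):

    # check_win32.py -- if this prints a path, the win32api binding that
    # Twisted needs on Windows is importable and the crawl error should go away.
    import win32api
    print(win32api.__file__)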

Scrapy: non-blocking pause

隐身守侯 submitted on 2019-12-20 08:59:59
Question: I have a problem. I need to stop the execution of a function for a while, but not stop the parsing as a whole; that is, I need a non-blocking pause. It looks like:

    class ScrapySpider(Spider):
        name = 'live_function'

        def start_requests(self):
            yield Request('some url', callback=self.non_stop_function)

        def non_stop_function(self, response):
            for url in ['url1', 'url2', 'url3', 'more urls']:
                yield Request(url, callback=self.second_parse_function)
            # Here I need some function for
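One commonly suggested approach, sketched below, is a downloader middleware that delays individual requests by returning a Deferred that fires later. The meta key 'delay_request_by' is an invented name, and returning a Deferred from process_request is undocumented behaviour, so treat this as a workaround rather than a stable API:

    from twisted.internet import reactor
    from twisted.internet.defer import Deferred

    class DelayedRequestsMiddleware(object):
        def process_request(self, request, spider):
            delay = request.meta.get('delay_request_by')
            if delay:
                # the engine resumes processing this request once the
                # Deferred fires, without blocking any other requests
                d = Deferred()
                reactor.callLater(delay, d.callback, None)
                return d

The spider would then mark the requests that should wait, e.g. yield Request(url, callback=self.second_parse_function, meta={'delay_request_by': 5}).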

Scrapy: how to use items in spider and how to send items to pipelines?

自作多情 submitted on 2019-12-20 08:49:37
Question: I am new to Scrapy and my task is simple. For a given e-commerce website:

- crawl all website pages
- look for product pages
- if the URL points to a product page, create an Item
- process the item to store it in a database

I created the spider, but products are just printed to a plain file. My question is about the project structure: how do I use items in the spider, and how do I send items to pipelines? I can't find a simple example of a project using items and pipelines.

Answer 1: How to use items in my spider?
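A minimal sketch of the usual wiring, with all names (ProductItem, StoreProductPipeline, myproject) invented for illustration:

    # items.py
    import scrapy

    class ProductItem(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        url = scrapy.Field()

    # In a spider callback, yield populated items instead of printing:
    #     yield ProductItem(name=..., price=..., url=response.url)

    # pipelines.py -- process_item is called for every item the spider yields
    class StoreProductPipeline(object):
        def process_item(self, item, spider):
            # insert the item into your database here
            return item

Enabling the pipeline happens in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.StoreProductPipeline': 300}.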

Speed up web scraper

穿精又带淫゛_ submitted on 2019-12-20 08:15:10
Question: I am scraping 23770 webpages with a pretty simple web scraper using Scrapy. I am quite new to Scrapy and even Python, but I managed to write a spider that does the job. It is, however, really slow: it takes approximately 28 hours to crawl the 23770 pages. I have looked at the Scrapy webpage, the mailing lists, and Stack Overflow, but I can't seem to find general, beginner-friendly recommendations for writing fast crawlers. Maybe my problem is not the spider itself, but the way I run it.
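The settings below are the usual starting points for throughput; the values are illustrative, not prescriptive:

    # settings.py
    CONCURRENT_REQUESTS = 64            # default is 16
    CONCURRENT_REQUESTS_PER_DOMAIN = 16
    DOWNLOAD_DELAY = 0                  # make sure no per-request delay is set
    COOKIES_ENABLED = False             # skip cookie handling if not needed
    LOG_LEVEL = 'INFO'                  # verbose DEBUG logging slows long runs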

Python Scrapy Get HTML <script> tag

蓝咒 submitted on 2019-12-20 07:46:08
Question: I have a project and I need to get at a script inside the HTML code:

    <script>
    (function() {
    ... / More Code
    Level.grade = "2";
    Level.level = "1";
    Level.max_line = "5";
    Level.cozum = 'adım 12\ndön sağ\nadım 13\ndön sol\nadım 11';
    ... / More Code
    </script>

How do I get only "adım 12\ndön sağ\nadım 13\ndön sol\nadım 11" out of this code? Thanks for the help.

Answer 1: Use a regex for that. First grab the content of the script tag, e.g. response.css("script").extract_first(), and then use this regex: (Level\.cozum = )(.*?)(\;)
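As a sketch, the two steps can be combined with the selector's built-in regex helper; the pattern assumes the value is always wrapped in single quotes:

    # inside a spider callback
    raw = response.css("script").re_first(r"Level\.cozum = '(.*?)';")
    # raw -> "adım 12\ndön sağ\nadım 13\ndön sol\nadım 11"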

Order a json by field using scrapy

倾然丶 夕夏残阳落幕 submitted on 2019-12-20 05:53:42
Question: I have created a spider to scrape problems from projecteuler.net (I have concluded my answer to a related question with it). I launch it with the command scrapy crawl euler -o euler.json, and it outputs an array of unordered JSON objects, each one corresponding to a single problem. This is fine for me, because I'm going to process it with JavaScript, even though I think resolving the ordering problem via Scrapy could be very simple. But unfortunately, ordering the items Scrapy writes to JSON (I
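One simple route is to sort the exported file afterwards; the sketch below assumes each object carries an integer 'id' field, which is an assumption to adjust to the actual item schema:

    # sort_euler.py
    import json

    with open('euler.json') as f:
        problems = json.load(f)

    problems.sort(key=lambda p: int(p['id']))  # 'id' is an assumed field name

    with open('euler_sorted.json', 'w') as f:
        json.dump(problems, f, indent=2)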

Scrapy installed, but not recognized in the command line

元气小坏坏 submitted on 2019-12-20 03:46:09
Question: I installed Scrapy in my Python 2.7 environment on Windows 7, but when I try to start a new Scrapy project using scrapy startproject newProject, the command prompt shows this message: 'scrapy' is not recognized as an internal or external command, operable program or batch file. Note: I also have Python 3.5, but that one does not have Scrapy. This question is not a duplicate of this one.

Answer 1: See the official documentation: set the environment variable and install pywin32.

Answer 2: Scrapy should be in your environment
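If the root cause is that the Python 2.7 Scripts directory is missing from PATH, this small sketch prints the directory to add (run it with the interpreter that has Scrapy installed):

    import os
    import sys

    # scrapy.exe lives in the interpreter's Scripts folder on Windows;
    # adding this folder to PATH makes the command resolvable everywhere.
    print(os.path.join(sys.prefix, "Scripts"))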

Force Python Scrapy not to encode URL

百般思念 submitted on 2019-12-19 18:23:22
Question: There are some URLs with [] in them, like http://www.website.com/CN.html?value_ids[]=33&value_ids[]=5007. But when I try scraping such a URL with Scrapy, it makes the request to this URL instead: http://www.website.com/CN.html?value_ids%5B%5D=33&value_ids%5B%5D=5007. How can I force Scrapy not to url-encode my URLs?

Answer 1: When creating a Request object, Scrapy applies some URL-encoding methods. To revert these you can utilize a custom middleware and change the URL to your needs. You could use a Downloader
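A hedged sketch of that middleware idea: undo the bracket escaping by assigning the request's private _url attribute, which bypasses Scrapy's URL-encoding setter. Because _url is private, this may break across Scrapy versions:

    # middlewares.py
    class RawUrlMiddleware(object):
        def process_request(self, request, spider):
            # revert the percent-encoding of [ and ] applied at Request creation
            request._url = request.url.replace('%5B', '[').replace('%5D', ']')

Remember to enable it via DOWNLOADER_MIDDLEWARES in settings.py.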

Dynamic rules based on start_urls for Scrapy CrawlSpider?

梦想与她 submitted on 2019-12-19 05:09:04
Question: I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links whose domain differs from the original domain). I managed to do that with two rules, but they are based on the domain of the site being crawled. If I want to run this on multiple websites, I run into a problem because I don't know which "start_url" I'm currently on, so I can't change the rule appropriately. Here's what I came up with so far; it
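One workable pattern, sketched under the assumption that the start URL is passed as a spider argument, is to build the rules in __init__ before calling the parent constructor (CrawlSpider compiles self.rules there); the spider and callback names are invented:

    from urlparse import urlparse  # Python 2; use urllib.parse on Python 3

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class ExternalLinkSpider(CrawlSpider):
        name = 'external_links'

        def __init__(self, start_url=None, *args, **kwargs):
            self.start_urls = [start_url]
            domain = urlparse(start_url).netloc
            self.rules = (
                # follow internal links without scraping them
                Rule(LinkExtractor(allow_domains=[domain]), follow=True),
                # scrape anything that points off the start domain
                Rule(LinkExtractor(deny_domains=[domain]),
                     callback='parse_external'),
            )
            super(ExternalLinkSpider, self).__init__(*args, **kwargs)

        def parse_external(self, response):
            yield {'url': response.url}

Run it per site with scrapy crawl external_links -a start_url=http://example.com.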

How to prevent a twisted.internet.error.ConnectionLost error when using Scrapy?

孤街浪徒 submitted on 2019-12-18 13:02:55
Question: I'm scraping some pages with Scrapy and get the following error: twisted.internet.error.ConnectionLost. My command-line output:

    2015-05-04 18:40:32+0800 [cnproxy] INFO: Spider opened
    2015-05-04 18:40:32+0800 [cnproxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-05-04 18:40:32+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2015-05-04 18:40:32+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2015-05-04 18:40:32+0800 [cnproxy] DEBUG:
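ConnectionLost is among the failures Scrapy's built-in RetryMiddleware retries by default, so tightening the retry and timeout settings is a common first step; the values below are illustrative, not prescriptive:

    # settings.py
    RETRY_ENABLED = True
    RETRY_TIMES = 5                      # retry dropped connections harder
    DOWNLOAD_TIMEOUT = 60                # allow slow proxies more time
    CONCURRENT_REQUESTS_PER_DOMAIN = 4   # reduce pressure on the remote host
    USER_AGENT = 'Mozilla/5.0 (compatible; mybot)'  # some hosts drop the default UA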