scrapy-spider

How to handle connection or download error in Scrapy?

Submitted by 风流意气都作罢 on 2019-12-08 08:09:48
Question: I am using the following to check for (internet) connection errors in my spider.py:

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def handle_error(self, failure):
        if failure.check(DNSLookupError):  # or failure.check(UnknownHostError):
            request = failure.request
            self.logger.error('DNSLookupError on: %s', request.url)
            print("\nDNS Error! Please check your internet connection!\n")
        elif failure.check(HttpError):
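For reference, a minimal self-contained sketch of this errback pattern; the spider name and start URL are placeholders, and the imports are the standard Scrapy/Twisted locations for these exception types:

    import scrapy
    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

    class ErrorAwareSpider(scrapy.Spider):
        name = "error_aware"                    # hypothetical spider name
        start_urls = ["http://example.com/"]    # placeholder URL

        def start_requests(self):
            for url in self.start_urls:
                # errback is called for DNS failures, timeouts and non-2xx responses
                yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

        def parse(self, response):
            self.logger.info("Got response from %s", response.url)

        def handle_error(self, failure):
            if failure.check(DNSLookupError):
                self.logger.error("DNSLookupError on: %s", failure.request.url)
            elif failure.check(HttpError):
                # HttpError carries the response that triggered it
                response = failure.value.response
                self.logger.error("HttpError %s on: %s", response.status, response.url)
            elif failure.check(TimeoutError, TCPTimedOutError):
                self.logger.error("Timeout on: %s", failure.request.url)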

How do you open a file stream for reading using Scrapy?

Submitted by 拥有回忆 on 2019-12-08 07:16:31
Question: Using Scrapy, I want to use my extracted URL to read a binary file into memory and extract its contents. Currently, I can find the URL on the page using a selector, e.g.

    myFile = response.xpath('//a[contains(@href,".interestingfileextension")]/@href').extract()

How do I then read that file into memory so that I can look for content in it? Many thanks.

Answer 1: Make a request and explore the content in the callback:

    def parse(self, response):
        url = response.xpath('//a[contains(@href,"
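A sketch of how that callback chain can look end to end, assuming the file extension above is a placeholder and that the file is small enough to hold in memory (Scrapy buffers the whole response as bytes in response.body):

    import scrapy

    class FileSpider(scrapy.Spider):
        name = "file_spider"                       # hypothetical name
        start_urls = ["http://example.com/page"]   # placeholder page

        def parse(self, response):
            # Extract the first matching link and request the file itself
            href = response.xpath(
                '//a[contains(@href, ".interestingfileextension")]/@href').get()
            if href:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_file)

        def parse_file(self, response):
            # response.body holds the raw bytes of the downloaded file
            data = response.body
            if b"needle" in data:   # placeholder marker to search for
                self.logger.info("Found the marker in %s", response.url)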

Scrapy can not scrape a second page using itemloader

Submitted by 北城以北 on 2019-12-08 06:51:01
Question:

Update (7/29, 9:29pm): After reading this post, I updated my code.

Update (7/28/15, 7:35pm): Following Martin's suggestion, the message changed, but there is still no listing of items and no writing to the database.

Original: I can successfully scrape a single page (the base page). Now I am trying to scrape one of the items from another URL found on the "base" page, using a Request with a callback, but it does not work. The spider is here:

    from scrapy.spider import Spider
    from scrapy.selector import
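For context, a minimal sketch of the usual way to carry a partially built item to a second page via the request's meta; the item class, field names, and XPaths below are placeholders rather than the asker's actual code:

    import scrapy
    from scrapy.loader import ItemLoader
    from myproject.items import MyItem   # hypothetical item class

    class TwoPageSpider(scrapy.Spider):
        name = "two_page"                         # hypothetical name
        start_urls = ["http://example.com/base"]  # placeholder base page

        def parse(self, response):
            loader = ItemLoader(item=MyItem(), response=response)
            loader.add_xpath("title", "//h1/text()")   # placeholder field/XPath
            detail_url = response.xpath("//a[@class='detail']/@href").get()
            # Hand the half-built item to the callback for the second page
            yield scrapy.Request(response.urljoin(detail_url),
                                 callback=self.parse_detail,
                                 meta={"item": loader.load_item()})

        def parse_detail(self, response):
            loader = ItemLoader(item=response.meta["item"], response=response)
            loader.add_xpath("description", "//div[@id='desc']//text()")  # placeholder
            yield loader.load_item()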

Why does my Scrapy code return an empty array?

Submitted by 感情迁移 on 2019-12-08 04:50:40
Question: I am building a web scraper for wunderground.com, but my code returns "[]" for inches_rain and humidity. Can anyone see why this is happening?

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.selector import Selector
    import time
    from wunderground_scraper.items import WundergroundScraperItem

    class WundergroundComSpider(scrapy.Spider):
        name = "wunderground"
        allowed_domains = ["www.wunderground.com"]
        start_urls = (
            'http://www.wunderground.com/q/zmw:10001.5.99999',
        )

        def parse
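An empty list from an XPath query usually means the selector matches nothing in the HTML that Scrapy actually downloaded (for example, because the value is filled in by JavaScript). A quick way to check is to open the page with the interactive shell (scrapy shell 'http://www.wunderground.com/q/zmw:10001.5.99999') and run a few lines there; the XPath below is a placeholder, not the asker's actual expression:

    # run inside the Scrapy shell session opened on the target URL
    response.xpath('//span[@id="humidity"]/text()').extract()   # placeholder XPath
    'humidity' in response.text   # does the value even appear in the raw HTML?
    view(response)                # open the downloaded page in a real browser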

How to get the job description using scrapy?

Submitted by 妖精的绣舞 on 2019-12-08 04:23:41
Question: I'm new to Scrapy and XPath but have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using Scrapy. As you can see, the email and phone number are provided as text inside the <p> tag, which makes them hard to extract. My idea is to first get the text inside the Job Overview, or at least all the text talking about this
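A common approach for values buried in free text is to join all the text nodes of the relevant block and apply regular expressions to the result. A sketch, where the container XPath is a guess about the page layout and the patterns are deliberately simple:

    import re
    import scrapy

    class JobSpider(scrapy.Spider):
        name = "job_detail"   # hypothetical name
        start_urls = [
            "https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-"
            "working-student-offpage-seo-french-market/"
        ]

        def parse(self, response):
            # Join every text node of the description block into one string
            text = " ".join(
                response.xpath('//div[contains(@class, "job")]//p//text()').getall())

            emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
            phones = re.findall(r"\+?\d[\d ()/-]{6,}\d", text)

            yield {"email": emails, "phone": phones}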

Parsing the urls in sitemap with different url format using sitemap spider in scrapy, python

Submitted by 三世轮回 on 2019-12-07 20:35:16
Question: I am using the sitemap spider in Scrapy (Python). The sitemap seems to have an unusual format, with '//' in front of the URLs:

    <url>
        <loc>//www.example.com/10/20-baby-names</loc>
    </url>
    <url>
        <loc>//www.example.com/elizabeth/christmas</loc>
    </url>

myspider.py:

    from scrapy.contrib.spiders import SitemapSpider
    from myspider.items import *

    class MySpider(SitemapSpider):
        name = "myspider"
        sitemap_urls = ["http://www.example.com/robots.txt"]

        def parse(self, response):
            item = PostItem()
            item['url'] = response
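One way to handle such protocol-relative locations, assuming a Scrapy version new enough to provide SitemapSpider.sitemap_filter (2.0+; the sketch therefore also uses the modern scrapy.spiders import path rather than scrapy.contrib.spiders), is to rewrite each entry's loc before the spider requests it:

    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = "myspider"
        sitemap_urls = ["http://www.example.com/robots.txt"]

        def sitemap_filter(self, entries):
            # Prefix protocol-relative locations so Scrapy can request them
            for entry in entries:
                loc = entry["loc"]
                if loc.startswith("//"):
                    entry["loc"] = "http:" + loc
                yield entry

        def parse(self, response):
            yield {"url": response.url}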

Write functions for all scrapy spiders

Submitted by 本小妞迷上赌 on 2019-12-07 20:08:23
Question: I'm trying to write functions that can be called from all of my Scrapy spiders. Is there one place in my project where I can just define these functions, or do I need to import them into each spider? Thanks.

Answer 1: You can't implicitly import code in Python (at least not without hacking around); after all, explicit is better than implicit, so it's not a good idea. However, in Scrapy it's very common to have a base Spider class with common functions and methods. Let's assume you have this tree:

    ├──
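Since the answer's project tree is cut off above, here is a minimal sketch of the same idea with purely illustrative module and spider names:

    # myproject/spiders/base.py  (hypothetical path)
    import scrapy

    class BaseSpider(scrapy.Spider):
        """Shared helpers available to every spider that inherits from this class."""

        def clean_text(self, value):
            # Example shared utility: strip and collapse whitespace
            return " ".join(value.split()) if value else value


    # myproject/spiders/example.py  (hypothetical path)
    from myproject.spiders.base import BaseSpider

    class ExampleSpider(BaseSpider):
        name = "example"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            yield {"title": self.clean_text(response.xpath("//h1/text()").get())}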

Scrapy- How to extract all blog posts from a category?

Submitted by 江枫思渺然 on 2019-12-07 12:33:45
Question: I am using Scrapy to extract all the posts of my blog. The problem is that I cannot figure out how to create a rule that reads all the posts in any given blog category. Example: on my blog, the category "Environment setup" has 17 posts. In the Scrapy code I can hard-code the pages as shown below, but this is not a very practical approach:

    start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
                  "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2"
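A sketch of the usual alternative: a CrawlSpider whose rules follow the category's pagination links instead of hard-coding each page. The allow patterns below are assumptions about the blog's URL scheme:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class BlogCategorySpider(CrawlSpider):
        name = "blog_category"   # hypothetical name
        allowed_domains = ["edumine.wordpress.com"]
        start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/"]

        rules = (
            # Follow the numbered pages of the category listing
            Rule(LinkExtractor(allow=r"/environment-setup/page/\d+/?$"), follow=True),
            # Parse individual posts (pattern is a guess at the permalink format)
            Rule(LinkExtractor(allow=r"/\d{4}/\d{2}/\d{2}/"), callback="parse_post"),
        )

        def parse_post(self, response):
            yield {
                "url": response.url,
                "title": response.xpath("//h1/text()").get(),
            }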

Scrapy get website with error “DNS lookup failed”

Submitted by 孤街醉人 on 2019-12-07 11:41:58
Question: I'm trying to use Scrapy to collect all the links on websites where the "DNS lookup failed". The problem is that every website without errors is printed in the parse_obj method, but when a URL fails with "DNS lookup failed", the parse_obj callback is not called. I want to get every domain that fails with "DNS lookup failed"; how can I do that?

Logs:

    2016-03-08 12:55:12 [scrapy] INFO: Spider opened
    2016-03-08 12:55:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-03
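Callbacks only fire for successful responses; failed lookups have to be caught through the request's errback instead. A compact sketch that records failing domains, with placeholder start URLs:

    from urllib.parse import urlparse

    import scrapy
    from twisted.internet.error import DNSLookupError

    class DnsFailureSpider(scrapy.Spider):
        name = "dns_failures"   # hypothetical name
        start_urls = [
            "http://nonexistent-domain-example.invalid/",
            "http://example.com/",
        ]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse_obj, errback=self.on_error)

        def parse_obj(self, response):
            self.logger.info("OK: %s", response.url)

        def on_error(self, failure):
            if failure.check(DNSLookupError):
                domain = urlparse(failure.request.url).netloc
                self.logger.error("DNS lookup failed for %s", domain)
                # collect the domain here, e.g. append it to a list or a file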

AttributeError: 'module' object has no attribute 'Spider'

Submitted by 对着背影说爱祢 on 2019-12-07 09:46:31
Question: I just started learning Scrapy, so I followed the Scrapy documentation and wrote the first spider mentioned on that site:

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            filename = response.url.split("/")[-2]
            with open(filename, 'wb') as f:
                f.write(response.body
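This AttributeError typically means that the name scrapy being imported is not the installed library that provides Spider, either because the installed release is too old for the scrapy.Spider shortcut or because a local file named scrapy.py shadows the package. A quick diagnostic, run from the same environment as the spider:

    # Confirm which Scrapy is being imported; if __file__ points at a local
    # scrapy.py instead of site-packages, the spider script is shadowing the
    # real package and should be renamed.
    import scrapy
    print(scrapy.__version__)   # the scrapy.Spider shortcut needs a recent release
    print(scrapy.__file__)      # should point into site-packages, not the project dir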