scrapy-spider

How to handle connection or download error in Scrapy?

Submitted by 风流意气都作罢 on 2019-12-08 08:09:48
Question: I am using the following to check for (internet) connection errors in my spider.py:

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def handle_error(self, failure):
        if failure.check(DNSLookupError):  # or failure.check(UnknownHostError):
            request = failure.request
            self.logger.error('DNSLookupError on: %s', request.url)
            print("\nDNS Error! Please check your internet connection!\n")
        elif failure.check(HttpError):
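For reference, a minimal self-contained sketch of this errback pattern; the spider name and start URL are placeholders, and the imports are the standard Scrapy/Twisted locations for these exception types:

    import scrapy
    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

    class ErrorAwareSpider(scrapy.Spider):
        name = "error_aware"                    # hypothetical spider name
        start_urls = ["http://example.com/"]    # placeholder URL

        def start_requests(self):
            for url in self.start_urls:
                # errback is called for DNS failures, timeouts and non-2xx responses
                yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

        def parse(self, response):
            self.logger.info("Got response from %s", response.url)

        def handle_error(self, failure):
            if failure.check(DNSLookupError):
                self.logger.error("DNSLookupError on: %s", failure.request.url)
            elif failure.check(HttpError):
                # HttpError carries the response that triggered it
                response = failure.value.response
                self.logger.error("HttpError %s on: %s", response.status, response.url)
            elif failure.check(TimeoutError, TCPTimedOutError):
                self.logger.error("Timeout on: %s", failure.request.url)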

How do you open a file stream for reading using Scrapy?

Submitted by 拥有回忆 on 2019-12-08 07:16:31
Question: Using Scrapy, I want to use my extracted URL to read a binary file into memory and extract its contents. Currently, I can find the URL on the page using a selector, e.g.

    myFile = response.xpath('//a[contains(@href,".interestingfileextension")]/@href').extract()

How do I then read that file into memory so that I can look for content in it? Many thanks.

Answer 1: Make a request and explore the content in the callback:

    def parse(self, response):
        url = response.xpath('//a[contains(@href,"
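A sketch of how that callback chain can look end to end, assuming the file extension above is a placeholder and that the file is small enough to hold in memory (Scrapy buffers the whole response as bytes in response.body):

    import scrapy

    class FileSpider(scrapy.Spider):
        name = "file_spider"                       # hypothetical name
        start_urls = ["http://example.com/page"]   # placeholder page

        def parse(self, response):
            # Extract the first matching link and request the file itself
            href = response.xpath(
                '//a[contains(@href, ".interestingfileextension")]/@href').get()
            if href:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_file)

        def parse_file(self, response):
            # response.body holds the raw bytes of the downloaded file
            data = response.body
            if b"needle" in data:   # placeholder marker to search for
                self.logger.info("Found the marker in %s", response.url)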

Scrapy can not scrape a second page using itemloader

Submitted by 北城以北 on 2019-12-08 06:51:01
Question:

Update (7/29, 9:29pm): After reading this post, I updated my code.

Update (7/28/15, 7:35pm): Following Martin's suggestion, the message changed, but there is still no listing of items and no writing to the database.

Original: I can successfully scrape a single page (the base page). Now I am trying to scrape one of the items from another URL found on the "base" page, using a Request with a callback, but it does not work. The spider is here:

    from scrapy.spider import Spider
    from scrapy.selector import
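For context, a minimal sketch of the usual way to carry a partially built item to a second page via the request's meta; the item class, field names, and XPaths below are placeholders rather than the asker's actual code:

    import scrapy
    from scrapy.loader import ItemLoader
    from myproject.items import MyItem   # hypothetical item class

    class TwoPageSpider(scrapy.Spider):
        name = "two_page"                         # hypothetical name
        start_urls = ["http://example.com/base"]  # placeholder base page

        def parse(self, response):
            loader = ItemLoader(item=MyItem(), response=response)
            loader.add_xpath("title", "//h1/text()")   # placeholder field/XPath
            detail_url = response.xpath("//a[@class='detail']/@href").get()
            # Hand the half-built item to the callback for the second page
            yield scrapy.Request(response.urljoin(detail_url),
                                 callback=self.parse_detail,
                                 meta={"item": loader.load_item()})

        def parse_detail(self, response):
            loader = ItemLoader(item=response.meta["item"], response=response)
            loader.add_xpath("description", "//div[@id='desc']//text()")  # placeholder
            yield loader.load_item()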

Why does my Scrapy code return an empty array?

Submitted by 感情迁移 on 2019-12-08 04:50:40
Question: I am building a web scraper for wunderground.com, but my code returns "[]" for inches_rain and humidity. Can anyone see why this is happening?

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.selector import Selector
    import time
    from wunderground_scraper.items import WundergroundScraperItem

    class WundergroundComSpider(scrapy.Spider):
        name = "wunderground"
        allowed_domains = ["www.wunderground.com"]
        start_urls = (
            'http://www.wunderground.com/q/zmw:10001.5.99999',
        )

        def parse
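An empty list from an XPath query usually means the selector matches nothing in the HTML that Scrapy actually downloaded (for example, because the value is filled in by JavaScript). A quick way to check is to open the page with the interactive shell (scrapy shell 'http://www.wunderground.com/q/zmw:10001.5.99999') and run a few lines there; the XPath below is a placeholder, not the asker's actual expression:

    # run inside the Scrapy shell session opened on the target URL
    response.xpath('//span[@id="humidity"]/text()').extract()   # placeholder XPath
    'humidity' in response.text   # does the value even appear in the raw HTML?
    view(response)                # open the downloaded page in a real browser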

How to get the job description using scrapy?

Submitted by 妖精的绣舞 on 2019-12-08 04:23:41
Question: I'm new to Scrapy and XPath but have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using Scrapy. As you can see, the email and phone number are provided as text inside the <p> tag, which makes them hard to extract. My idea is to first get the text inside the Job Overview, or at least all the text talking about this
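A common approach for values buried in free text is to join all the text nodes of the relevant block and apply regular expressions to the result. A sketch, where the container XPath is a guess about the page layout and the patterns are deliberately simple:

    import re
    import scrapy

    class JobSpider(scrapy.Spider):
        name = "job_detail"   # hypothetical name
        start_urls = [
            "https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-"
            "working-student-offpage-seo-french-market/"
        ]

        def parse(self, response):
            # Join every text node of the description block into one string
            text = " ".join(
                response.xpath('//div[contains(@class, "job")]//p//text()').getall())

            emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
            phones = re.findall(r"\+?\d[\d ()/-]{6,}\d", text)

            yield {"email": emails, "phone": phones}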

Parsing the urls in sitemap with different url format using sitemap spider in scrapy, python

Submitted by 三世轮回 on 2019-12-07 20:35:16
Question: I am using the sitemap spider in Scrapy (Python). The sitemap seems to have an unusual format, with '//' in front of the URLs:

    <url>
        <loc>//www.example.com/10/20-baby-names</loc>
    </url>
    <url>
        <loc>//www.example.com/elizabeth/christmas</loc>
    </url>

myspider.py:

    from scrapy.contrib.spiders import SitemapSpider
    from myspider.items import *

    class MySpider(SitemapSpider):
        name = "myspider"
        sitemap_urls = ["http://www.example.com/robots.txt"]

        def parse(self, response):
            item = PostItem()
            item['url'] = response
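One way to handle such protocol-relative locations, assuming a Scrapy version new enough to provide SitemapSpider.sitemap_filter (2.0+; the sketch therefore also uses the modern scrapy.spiders import path rather than scrapy.contrib.spiders), is to rewrite each entry's loc before the spider requests it:

    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = "myspider"
        sitemap_urls = ["http://www.example.com/robots.txt"]

        def sitemap_filter(self, entries):
            # Prefix protocol-relative locations so Scrapy can request them
            for entry in entries:
                loc = entry["loc"]
                if loc.startswith("//"):
                    entry["loc"] = "http:" + loc
                yield entry

        def parse(self, response):
            yield {"url": response.url}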

Write functions for all scrapy spiders

Submitted by 本小妞迷上赌 on 2019-12-07 20:08:23
Question: I'm trying to write functions that can be called from all of my Scrapy spiders. Is there one place in my project where I can just define these functions, or do I need to import them into each spider? Thanks.

Answer 1: You can't implicitly import code in Python (at least not without hacking around); after all, explicit is better than implicit, so it's not a good idea. However, in Scrapy it's very common to have a base Spider class with common functions and methods. Let's assume you have this tree:

    ├──
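Since the answer's project tree is cut off above, here is a minimal sketch of the same idea with purely illustrative module and spider names:

    # myproject/spiders/base.py  (hypothetical path)
    import scrapy

    class BaseSpider(scrapy.Spider):
        """Shared helpers available to every spider that inherits from this class."""

        def clean_text(self, value):
            # Example shared utility: strip and collapse whitespace
            return " ".join(value.split()) if value else value


    # myproject/spiders/example.py  (hypothetical path)
    from myproject.spiders.base import BaseSpider

    class ExampleSpider(BaseSpider):
        name = "example"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            yield {"title": self.clean_text(response.xpath("//h1/text()").get())}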

Scrapy- How to extract all blog posts from a category?

Submitted by 江枫思渺然 on 2019-12-07 12:33:45
Question: I am using Scrapy to extract all the posts of my blog. The problem is that I cannot figure out how to create a rule that reads all the posts in any given blog category. Example: on my blog, the category "Environment setup" has 17 posts. In the Scrapy code I can hard-code the pages as shown below, but this is not a very practical approach:

    start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
                  "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2"
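A sketch of the usual alternative: a CrawlSpider whose rules follow the category's pagination links instead of hard-coding each page. The allow patterns below are assumptions about the blog's URL scheme:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class BlogCategorySpider(CrawlSpider):
        name = "blog_category"   # hypothetical name
        allowed_domains = ["edumine.wordpress.com"]
        start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/"]

        rules = (
            # Follow the numbered pages of the category listing
            Rule(LinkExtractor(allow=r"/environment-setup/page/\d+/?$"), follow=True),
            # Parse individual posts (pattern is a guess at the permalink format)
            Rule(LinkExtractor(allow=r"/\d{4}/\d{2}/\d{2}/"), callback="parse_post"),
        )

        def parse_post(self, response):
            yield {
                "url": response.url,
                "title": response.xpath("//h1/text()").get(),
            }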

Scrapy get website with error “DNS lookup failed”

Submitted by 孤街醉人 on 2019-12-07 11:41:58
Question: I'm trying to use Scrapy to collect all the links on websites where the "DNS lookup failed". The problem is that every website without errors is printed in the parse_obj method, but when a URL fails with "DNS lookup failed", the parse_obj callback is not called. I want to get every domain that fails with "DNS lookup failed"; how can I do that?

Logs:

    2016-03-08 12:55:12 [scrapy] INFO: Spider opened
    2016-03-08 12:55:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-03
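Callbacks only fire for successful responses; failed lookups have to be caught through the request's errback instead. A compact sketch that records failing domains, with placeholder start URLs:

    from urllib.parse import urlparse

    import scrapy
    from twisted.internet.error import DNSLookupError

    class DnsFailureSpider(scrapy.Spider):
        name = "dns_failures"   # hypothetical name
        start_urls = [
            "http://nonexistent-domain-example.invalid/",
            "http://example.com/",
        ]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse_obj, errback=self.on_error)

        def parse_obj(self, response):
            self.logger.info("OK: %s", response.url)

        def on_error(self, failure):
            if failure.check(DNSLookupError):
                domain = urlparse(failure.request.url).netloc
                self.logger.error("DNS lookup failed for %s", domain)
                # collect the domain here, e.g. append it to a list or a file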

AttributeError: 'module' object has no attribute 'Spider'

Submitted by 对着背影说爱祢 on 2019-12-07 09:46:31
Question: I just started learning Scrapy, so I followed the Scrapy documentation and wrote the first spider mentioned on that site:

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            filename = response.url.split("/")[-2]
            with open(filename, 'wb') as f:
                f.write(response.body
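This AttributeError typically means that the name scrapy being imported is not the installed library that provides Spider, either because the installed release is too old for the scrapy.Spider shortcut or because a local file named scrapy.py shadows the package. A quick diagnostic, run from the same environment as the spider:

    # Confirm which Scrapy is being imported; if __file__ points at a local
    # scrapy.py instead of site-packages, the spider script is shadowing the
    # real package and should be renamed.
    import scrapy
    print(scrapy.__version__)   # the scrapy.Spider shortcut needs a recent release
    print(scrapy.__file__)      # should point into site-packages, not the project dir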