scrapy-spider

Parsing stray text with Scrapy

旧时模样 submitted on 2019-12-24 19:08:10
Question: Any idea how to extract 'TEXT TO GRAB' from this piece of markup?

    <span class="navigation_page">
        <span>
            <a itemprop="url" href="http://www.example.com">
                <span itemprop="title">LINK</span>
            </a>
        </span>
        <span class="navigation-pipe">></span>
        TEXT TO GRAB
    </span>

Answer 1: It's not an ideal solution, but it should do the trick:

    from scrapy import Selector

    content = """
    <span class="navigation_page">
        <span>
            <a itemprop="url" href="http://www.example.com">
                <span itemprop="title">LINK</span>
            </a>
        </span>
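A more direct route, shown here as a minimal sketch against the markup from the question, is to select only the direct text children of the outer span and drop the whitespace-only nodes (the expected output is an assumption based on the snippet above):

    from scrapy.selector import Selector

    html = """
    <span class="navigation_page">
        <span>
            <a itemprop="url" href="http://www.example.com">
                <span itemprop="title">LINK</span>
            </a>
        </span>
        <span class="navigation-pipe">></span>
        TEXT TO GRAB
    </span>
    """

    sel = Selector(text=html)
    # text() returns only the direct text children of the outer span,
    # so the text inside the nested <a> and pipe spans is skipped.
    texts = sel.xpath('//span[@class="navigation_page"]/text()').extract()
    stray = [t.strip() for t in texts if t.strip()]
    print(stray)  # expected: ['TEXT TO GRAB']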

How to bypass a 'cookiewall' when using scrapy?

时光毁灭记忆、已成空白 submitted on 2019-12-24 18:55:06
Question: I'm a new user of Scrapy. After following the tutorials on extracting data from websites, I am trying to accomplish something similar on forums. What I want is to extract all posts on a forum page (to start with). However, this particular forum has a 'cookie wall': each session, before I can scrape http://forum.fok.nl/topic/2413069, I first need to click the "Yes, I accept cookies" button. My very basic scraper currently looks like this:

    class FokSpider(scrapy.Spider):
        name = 'fok'
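One common way past such a wall is to detect the consent page and submit its form with FormRequest, letting Scrapy's cookie middleware carry the resulting session cookie into every later request. A sketch only: the form selector, field name, and field value below are assumptions, not the forum's real markup.

    import scrapy

    class FokSpider(scrapy.Spider):
        name = 'fok'
        allowed_domains = ['fok.nl']
        start_urls = ['http://forum.fok.nl/topic/2413069']

        def parse(self, response):
            # Hypothetical check: did we land on the cookie wall?
            if response.xpath('//form[@id="cookiewall"]'):
                # Submit the consent form; the cookie it sets is kept
                # automatically for the rest of the session.
                yield scrapy.FormRequest.from_response(
                    response,
                    formdata={'accept': 'yes'},  # hypothetical field
                    callback=self.parse,
                )
                return
            # Otherwise we are on the real topic page.
            for post in response.css('div.post'):
                yield {'text': post.css('::text').extract()}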

Why is XMLFeedSpider failing to iterate through the designated nodes?

偶尔善良 submitted on 2019-12-24 13:40:27
Question: I'm trying to parse PLoS's RSS feed to pick up new publications. The RSS feed is located here. Below is my spider:

    from scrapy.contrib.spiders import XMLFeedSpider

    class PLoSSpider(XMLFeedSpider):
        name = "plos"
        itertag = 'entry'
        allowed_domains = ["plosone.org"]
        start_urls = [
            ('http://www.plosone.org/article/feed/search'
             '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
        ]

        def parse_node(self, response, node):
            pass

This configuration produces the following log output (note the
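If the feed is an Atom feed (an assumption worth checking here), the <entry> elements live in the Atom namespace and a bare itertag of 'entry' will never match. XMLFeedSpider can register the namespace and use a qualified tag, as in this sketch:

    from scrapy.contrib.spiders import XMLFeedSpider

    class PLoSSpider(XMLFeedSpider):
        name = "plos"
        allowed_domains = ["plosone.org"]
        start_urls = [
            ('http://www.plosone.org/article/feed/search'
             '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
        ]
        iterator = 'xml'  # the namespace-aware iterator (default is 'iternodes')
        namespaces = [('atom', 'http://www.w3.org/2005/Atom')]
        itertag = 'atom:entry'

        def parse_node(self, response, node):
            # The 'atom' prefix registered above also works in node XPaths.
            title = node.xpath('atom:title/text()').extract()
            self.log('entry title: %s' % title)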

How to extract data from dynamic websites like Flipkart using selenium and Scrapy?

China☆狼群 submitted on 2019-12-24 12:51:51
Question: Flipkart.com shows only 15 to 20 results on the first page and loads more results when scrolled. Scrapy extracts the results of the first page successfully, but not those of the next pages. I tried using Selenium for it, but couldn't find success. Here is my code:

    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from flipkart.items import FlipkartItem
    from scrapy.spider import BaseSpider
    from selenium import webdriver

    class FlipkartSpider(BaseSpider):
        name = "flip1"
        allowed_domains = [
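A common pattern for infinite-scroll pages is to let Selenium do the scrolling until the page stops growing, and only then hand the rendered HTML to Scrapy's Selector. A sketch: the search URL and the product XPath below are placeholders, not Flipkart's actual markup.

    import time

    from scrapy.selector import Selector
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('http://www.flipkart.com/search?q=laptop')  # placeholder URL

    # Keep scrolling until the document height stops increasing,
    # i.e. no more results are being loaded.
    last_height = driver.execute_script('return document.body.scrollHeight')
    while True:
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)  # crude wait for the AJAX-loaded results
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break
        last_height = new_height

    sel = Selector(text=driver.page_source)
    titles = sel.xpath('//div[@class="product"]//a/text()').extract()  # placeholder XPath
    driver.quit()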

Scrapy needs to crawl all the 'next' links on the website and move on to the next page

允我心安 submitted on 2019-12-24 12:22:11
Question: I need my Scrapy spider to move on to the next page. Please give me the correct code for the rule; how do I write it?

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import Selector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from delh.items import DelhItem

    class criticspider(CrawlSpider):
        name = "delh"
        allowed_domains = ["consumercomplaints.in"]
        #start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/
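A pagination rule generally looks like the sketch below; the allow pattern is an assumption inferred from the ?search=delhivery&page=N URLs visible in the commented-out start_urls:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class CriticSpider(CrawlSpider):
        name = "delh"
        allowed_domains = ["consumercomplaints.in"]
        start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

        rules = (
            # Follow every link whose URL carries a page parameter, and keep
            # following the pagination found on each new page (follow=True).
            Rule(SgmlLinkExtractor(allow=(r'search=delhivery&page=\d+',)),
                 callback='parse_item',
                 follow=True),
        )

        def parse_item(self, response):
            pass  # per-page extraction goes here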

Scrapy: crawl 1 level deep on offsite links

[亡魂溺海] submitted on 2019-12-24 00:06:53
Question: In Scrapy, how would I go about having the crawler go only one level deep for all links outside the allowed domains? Within the crawl, I want to be able to make sure all outbound links on the site are working and not 404'd. I do not want it to crawl the whole site of a non-allowed domain. I am currently processing allowed-domain 404s. I know that I can set a DEPTH_LIMIT of 1, but that will affect the allowed domain as well. My code:

    from scrapy.selector import Selector
    from scrapy.spiders
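One way to get this asymmetric depth, sketched below under the assumption that a status check is all the offsite links need: follow internal links normally, and send external links to a callback that records the status and extracts nothing, so the crawl never goes deeper offsite. Requests marked dont_filter=True are let through by the offsite middleware.

    import scrapy
    from urllib.parse import urlparse  # urlparse.urlparse on Python 2

    class LinkCheckSpider(scrapy.Spider):
        name = 'linkcheck'
        allowed_domains = ['example.com']  # placeholder domain
        start_urls = ['http://example.com/']
        handle_httpstatus_list = [404]  # let 404 responses reach callbacks

        def parse(self, response):
            for href in response.css('a::attr(href)').extract():
                url = response.urljoin(href)
                if urlparse(url).netloc.endswith('example.com'):
                    yield scrapy.Request(url, callback=self.parse)
                else:
                    # dont_filter bypasses the offsite filtering for this one
                    # request; check_external follows no further links.
                    yield scrapy.Request(url, callback=self.check_external,
                                         dont_filter=True)

        def check_external(self, response):
            if response.status == 404:
                self.logger.warning('broken outbound link: %s', response.url)

One side effect to be aware of: dont_filter also disables duplicate filtering, so an external URL linked from many pages will be re-checked each time.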

scrapy “Missing scheme in request url”

允我心安 submitted on 2019-12-23 18:15:50
Question: Here's my code below:

    import scrapy
    from scrapy.http import Request

    class lyricsFetch(scrapy.Spider):
        name = "lyricsFetch"
        allowed_domains = ["metrolyrics.com"]
        print "\nEnter the name of the ARTIST of the song for which you want the lyrics for. Minimise the spelling mistakes, if possible."
        artist_name = raw_input('>')
        print "\nNow comes the main part. Enter the NAME of the song itself now. Again, try not to have any spelling mistakes."
        song_name = raw_input('>')
        artist_name = artist_name
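This error is almost always raised because the string handed to Request (or placed in start_urls) lacks the http:// scheme. A sketch of one fix, passing the names in as spider arguments instead of prompting at class-definition time; the MetroLyrics URL pattern and the lyrics XPath are assumptions for illustration:

    import scrapy

    class LyricsFetch(scrapy.Spider):
        name = 'lyricsFetch'
        allowed_domains = ['metrolyrics.com']

        def __init__(self, artist='', song='', *args, **kwargs):
            super(LyricsFetch, self).__init__(*args, **kwargs)
            artist_slug = artist.strip().lower().replace(' ', '-')
            song_slug = song.strip().lower().replace(' ', '-')
            # The scheme must be part of the URL; a bare
            # 'www.metrolyrics.com/...' string is exactly what triggers
            # "Missing scheme in request url".
            self.start_urls = ['http://www.metrolyrics.com/%s-lyrics-%s.html'
                               % (song_slug, artist_slug)]

        def parse(self, response):
            yield {'lyrics': response.xpath('//p[@class="verse"]/text()').extract()}

Run it with, for example: scrapy crawl lyricsFetch -a artist="some artist" -a song="some song".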

What is the correct way to work with cookies in Scrapy?

三世轮回 submitted on 2019-12-23 17:09:12
Question: I'm a complete newbie. I am working with Scrapy on a site that uses cookies, and this is a problem for me: I can obtain data from a site without cookies, but obtaining the data from a site with cookies is difficult for me. I have this code structure:

    class mySpider(BaseSpider):
        name = 'data'
        allowed_domains = []
        start_urls = ["http://...."]

        def parse(self, response):
            sel = HtmlXPathSelector(response)
            items = sel.xpath('//*[@id=..............')
            vlrs = []
            for item in items:
                myItem['img'] = item.xpath('....')
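For most sites no special handling is needed: Scrapy's cookie middleware is enabled by default (COOKIES_ENABLED = True) and carries any cookies the server sets across the session's requests. When a specific cookie must be supplied up front, it can be attached per request, as in this sketch with a hypothetical cookie name:

    import scrapy

    class DataSpider(scrapy.Spider):
        name = 'data'
        start_urls = ['http://example.com/']  # placeholder URL

        def start_requests(self):
            for url in self.start_urls:
                # 'sessionid' is a hypothetical name; any cookies the server
                # sets in its responses are then kept automatically.
                yield scrapy.Request(url, cookies={'sessionid': 'value'})

        def parse(self, response):
            self.logger.info('Set-Cookie headers: %s',
                             response.headers.getlist('Set-Cookie'))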

Extracting Images in Scrapy

核能气质少年 submitted on 2019-12-23 12:07:48
Question: I've read through a few other answers here, but I'm missing something fundamental. I'm trying to extract the images from a website with a CrawlSpider.

settings.py:

    BOT_NAME = 'healthycomm'
    SPIDER_MODULES = ['healthycomm.spiders']
    NEWSPIDER_MODULE = 'healthycomm.spiders'
    ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
    IMAGES_STORE = '~/Desktop/scrapy_nsml/healthycomm/images'

items.py:

    class HealthycommItem(scrapy.Item):
        page_heading = scrapy.Field()
        page_title = scrapy.Field
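Whatever else is wrong, ImagesPipeline only acts on items that expose an image_urls field (results land in an images field), so the item and callback need roughly the shape sketched below; the XPath is an assumption, and note that a '~' in IMAGES_STORE may not be expanded, so an absolute path is safer:

    import scrapy

    class HealthycommImageItem(scrapy.Item):
        image_urls = scrapy.Field()  # the pipeline reads download URLs from here
        images = scrapy.Field()      # the pipeline writes download results here

    # Inside a spider callback:
    def parse_item(self, response):
        item = HealthycommImageItem()
        # Absolute URLs are required; relative src values won't download.
        item['image_urls'] = [response.urljoin(src)
                              for src in response.xpath('//img/@src').extract()]
        yield item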

Scrapy CrawlSpider Crawls Nothing

天大地大妈咪最大 submitted on 2019-12-23 04:53:25
Question: I am trying to crawl Booking.com. The spider opens and closes without crawling the URL (output: https://i.stack.imgur.com/9hDt6.png). I am new to Python and Scrapy. Here is the code I have written so far. Please point out what I am doing wrong.

    import scrapy
    import urllib
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import Selector
    from scrapy.item import Item
    from scrapy.loader import ItemLoader
    from CinemaScraper.items import CinemascraperItem
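When a CrawlSpider opens and closes immediately, the usual suspects are rules that match no links, an allowed_domains value that filters out the start URL, or an overridden parse() method, a name CrawlSpider reserves for its own link-following logic. A minimal working shape, with placeholder URL patterns:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor

    class BookingSpider(CrawlSpider):
        name = 'booking'
        allowed_domains = ['booking.com']  # must cover the start URL's domain
        start_urls = ['http://www.booking.com/']

        rules = (
            # '/hotel/' is a guess at the site's URL scheme, not verified.
            Rule(LinkExtractor(allow=(r'/hotel/',)),
                 callback='parse_hotel',  # not 'parse', which CrawlSpider uses itself
                 follow=True),
        )

        def parse_hotel(self, response):
            yield {'name': response.xpath('//h1/text()').extract()}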