scrapy-spider

Parsing stray text with Scrapy

旧时模样 submitted on 2019-12-24 19:08:10
Question: Any idea how to extract 'TEXT TO GRAB' from this piece of markup?

    <span class="navigation_page">
        <span>
            <a itemprop="url" href="http://www.example.com">
                <span itemprop="title">LINK</span>
            </a>
        </span>
        <span class="navigation-pipe">></span>
        TEXT TO GRAB
    </span>

Answer 1: It's not an ideal solution, but it should do the trick:

    from scrapy import Selector

    content = """
    <span class="navigation_page">
        <span>
            <a itemprop="url" href="http://www.example.com">
                <span itemprop="title">LINK</span>
            </a>
        </span>
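A more direct route, shown here as a minimal sketch against the markup from the question, is to select only the direct text children of the outer span and drop the whitespace-only nodes (the expected output is an assumption based on the snippet above):

    from scrapy.selector import Selector

    html = """
    <span class="navigation_page">
        <span>
            <a itemprop="url" href="http://www.example.com">
                <span itemprop="title">LINK</span>
            </a>
        </span>
        <span class="navigation-pipe">></span>
        TEXT TO GRAB
    </span>
    """

    sel = Selector(text=html)
    # text() returns only the direct text children of the outer span,
    # so the text inside the nested <a> and pipe spans is skipped.
    texts = sel.xpath('//span[@class="navigation_page"]/text()').extract()
    stray = [t.strip() for t in texts if t.strip()]
    print(stray)  # expected: ['TEXT TO GRAB']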

How to bypass a 'cookiewall' when using scrapy?

时光毁灭记忆、已成空白 submitted on 2019-12-24 18:55:06
Question: I'm a new user of Scrapy. After following the tutorials on extracting data from websites, I am trying to accomplish something similar on forums. What I want is to extract all posts on a forum page (to start with). However, this particular forum has a 'cookie wall': each session, before I can scrape http://forum.fok.nl/topic/2413069, I first need to click the "Yes, I accept cookies" button. My very basic scraper currently looks like this:

    class FokSpider(scrapy.Spider):
        name = 'fok'
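One common way past such a wall is to detect the consent page and submit its form with FormRequest, letting Scrapy's cookie middleware carry the resulting session cookie into every later request. A sketch only: the form selector, field name, and field value below are assumptions, not the forum's real markup.

    import scrapy

    class FokSpider(scrapy.Spider):
        name = 'fok'
        allowed_domains = ['fok.nl']
        start_urls = ['http://forum.fok.nl/topic/2413069']

        def parse(self, response):
            # Hypothetical check: did we land on the cookie wall?
            if response.xpath('//form[@id="cookiewall"]'):
                # Submit the consent form; the cookie it sets is kept
                # automatically for the rest of the session.
                yield scrapy.FormRequest.from_response(
                    response,
                    formdata={'accept': 'yes'},  # hypothetical field
                    callback=self.parse,
                )
                return
            # Otherwise we are on the real topic page.
            for post in response.css('div.post'):
                yield {'text': post.css('::text').extract()}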

Why is XMLFeedSpider failing to iterate through the designated nodes?

偶尔善良 submitted on 2019-12-24 13:40:27
Question: I'm trying to parse PLoS's RSS feed to pick up new publications. The RSS feed is located here. Below is my spider:

    from scrapy.contrib.spiders import XMLFeedSpider

    class PLoSSpider(XMLFeedSpider):
        name = "plos"
        itertag = 'entry'
        allowed_domains = ["plosone.org"]
        start_urls = [
            ('http://www.plosone.org/article/feed/search'
             '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
        ]

        def parse_node(self, response, node):
            pass

This configuration produces the following log output (note the
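If the feed is an Atom feed (an assumption worth checking here), the <entry> elements live in the Atom namespace and a bare itertag of 'entry' will never match. XMLFeedSpider can register the namespace and use a qualified tag, as in this sketch:

    from scrapy.contrib.spiders import XMLFeedSpider

    class PLoSSpider(XMLFeedSpider):
        name = "plos"
        allowed_domains = ["plosone.org"]
        start_urls = [
            ('http://www.plosone.org/article/feed/search'
             '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
        ]
        iterator = 'xml'  # the namespace-aware iterator (default is 'iternodes')
        namespaces = [('atom', 'http://www.w3.org/2005/Atom')]
        itertag = 'atom:entry'

        def parse_node(self, response, node):
            # The 'atom' prefix registered above also works in node XPaths.
            title = node.xpath('atom:title/text()').extract()
            self.log('entry title: %s' % title)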

How to extract data from dynamic websites like Flipkart using selenium and Scrapy?

China☆狼群 submitted on 2019-12-24 12:51:51
Question: Flipkart.com shows only 15 to 20 results on the first page and loads more results when scrolled. Scrapy extracts the results of the first page successfully, but not those of the next pages. I tried using Selenium for it, but couldn't find success. Here is my code:

    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from flipkart.items import FlipkartItem
    from scrapy.spider import BaseSpider
    from selenium import webdriver

    class FlipkartSpider(BaseSpider):
        name = "flip1"
        allowed_domains = [
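A common pattern for infinite-scroll pages is to let Selenium do the scrolling until the page stops growing, and only then hand the rendered HTML to Scrapy's Selector. A sketch: the search URL and the product XPath below are placeholders, not Flipkart's actual markup.

    import time

    from scrapy.selector import Selector
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('http://www.flipkart.com/search?q=laptop')  # placeholder URL

    # Keep scrolling until the document height stops increasing,
    # i.e. no more results are being loaded.
    last_height = driver.execute_script('return document.body.scrollHeight')
    while True:
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)  # crude wait for the AJAX-loaded results
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break
        last_height = new_height

    sel = Selector(text=driver.page_source)
    titles = sel.xpath('//div[@class="product"]//a/text()').extract()  # placeholder XPath
    driver.quit()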

Scrapy needs to crawl all the 'next' links on the website and move on to the next page

允我心安 submitted on 2019-12-24 12:22:11
Question: I need my Scrapy spider to move on to the next page. Please give me the correct code for the rule; how do I write it?

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import Selector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from delh.items import DelhItem

    class criticspider(CrawlSpider):
        name = "delh"
        allowed_domains = ["consumercomplaints.in"]
        #start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/
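A pagination rule generally looks like the sketch below; the allow pattern is an assumption inferred from the ?search=delhivery&page=N URLs visible in the commented-out start_urls:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class CriticSpider(CrawlSpider):
        name = "delh"
        allowed_domains = ["consumercomplaints.in"]
        start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

        rules = (
            # Follow every link whose URL carries a page parameter, and keep
            # following the pagination found on each new page (follow=True).
            Rule(SgmlLinkExtractor(allow=(r'search=delhivery&page=\d+',)),
                 callback='parse_item',
                 follow=True),
        )

        def parse_item(self, response):
            pass  # per-page extraction goes here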

Scrapy: crawl 1 level deep on offsite links

[亡魂溺海] submitted on 2019-12-24 00:06:53
Question: In Scrapy, how would I go about having the crawler go only one level deep for all links outside the allowed domains? Within the crawl, I want to be able to make sure all outbound links on the site are working and not 404'd. I do not want it to crawl the whole site of a non-allowed domain. I am currently processing allowed-domain 404s. I know that I can set a DEPTH_LIMIT of 1, but that will affect the allowed domain as well. My code:

    from scrapy.selector import Selector
    from scrapy.spiders
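One way to get this asymmetric depth, sketched below under the assumption that a status check is all the offsite links need: follow internal links normally, and send external links to a callback that records the status and extracts nothing, so the crawl never goes deeper offsite. Requests marked dont_filter=True are let through by the offsite middleware.

    import scrapy
    from urllib.parse import urlparse  # urlparse.urlparse on Python 2

    class LinkCheckSpider(scrapy.Spider):
        name = 'linkcheck'
        allowed_domains = ['example.com']  # placeholder domain
        start_urls = ['http://example.com/']
        handle_httpstatus_list = [404]  # let 404 responses reach callbacks

        def parse(self, response):
            for href in response.css('a::attr(href)').extract():
                url = response.urljoin(href)
                if urlparse(url).netloc.endswith('example.com'):
                    yield scrapy.Request(url, callback=self.parse)
                else:
                    # dont_filter bypasses the offsite filtering for this one
                    # request; check_external follows no further links.
                    yield scrapy.Request(url, callback=self.check_external,
                                         dont_filter=True)

        def check_external(self, response):
            if response.status == 404:
                self.logger.warning('broken outbound link: %s', response.url)

One side effect to be aware of: dont_filter also disables duplicate filtering, so an external URL linked from many pages will be re-checked each time.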

scrapy “Missing scheme in request url”

允我心安 submitted on 2019-12-23 18:15:50
Question: Here's my code below:

    import scrapy
    from scrapy.http import Request

    class lyricsFetch(scrapy.Spider):
        name = "lyricsFetch"
        allowed_domains = ["metrolyrics.com"]
        print "\nEnter the name of the ARTIST of the song for which you want the lyrics for. Minimise the spelling mistakes, if possible."
        artist_name = raw_input('>')
        print "\nNow comes the main part. Enter the NAME of the song itself now. Again, try not to have any spelling mistakes."
        song_name = raw_input('>')
        artist_name = artist_name
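This error is almost always raised because the string handed to Request (or placed in start_urls) lacks the http:// scheme. A sketch of one fix, passing the names in as spider arguments instead of prompting at class-definition time; the MetroLyrics URL pattern and the lyrics XPath are assumptions for illustration:

    import scrapy

    class LyricsFetch(scrapy.Spider):
        name = 'lyricsFetch'
        allowed_domains = ['metrolyrics.com']

        def __init__(self, artist='', song='', *args, **kwargs):
            super(LyricsFetch, self).__init__(*args, **kwargs)
            artist_slug = artist.strip().lower().replace(' ', '-')
            song_slug = song.strip().lower().replace(' ', '-')
            # The scheme must be part of the URL; a bare
            # 'www.metrolyrics.com/...' string is exactly what triggers
            # "Missing scheme in request url".
            self.start_urls = ['http://www.metrolyrics.com/%s-lyrics-%s.html'
                               % (song_slug, artist_slug)]

        def parse(self, response):
            yield {'lyrics': response.xpath('//p[@class="verse"]/text()').extract()}

Run it with, for example: scrapy crawl lyricsFetch -a artist="some artist" -a song="some song".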

What is the correct way to work with cookies in Scrapy?

三世轮回 submitted on 2019-12-23 17:09:12
Question: I'm a complete newbie. I am working with Scrapy on a site that uses cookies, and this is a problem for me: I can obtain data from a site without cookies, but obtaining the data from a site with cookies is difficult for me. I have this code structure:

    class mySpider(BaseSpider):
        name = 'data'
        allowed_domains = []
        start_urls = ["http://...."]

        def parse(self, response):
            sel = HtmlXPathSelector(response)
            items = sel.xpath('//*[@id=..............')
            vlrs = []
            for item in items:
                myItem['img'] = item.xpath('....')
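For most sites no special handling is needed: Scrapy's cookie middleware is enabled by default (COOKIES_ENABLED = True) and carries any cookies the server sets across the session's requests. When a specific cookie must be supplied up front, it can be attached per request, as in this sketch with a hypothetical cookie name:

    import scrapy

    class DataSpider(scrapy.Spider):
        name = 'data'
        start_urls = ['http://example.com/']  # placeholder URL

        def start_requests(self):
            for url in self.start_urls:
                # 'sessionid' is a hypothetical name; any cookies the server
                # sets in its responses are then kept automatically.
                yield scrapy.Request(url, cookies={'sessionid': 'value'})

        def parse(self, response):
            self.logger.info('Set-Cookie headers: %s',
                             response.headers.getlist('Set-Cookie'))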

Extracting Images in Scrapy

核能气质少年 submitted on 2019-12-23 12:07:48
Question: I've read through a few other answers here, but I'm missing something fundamental. I'm trying to extract the images from a website with a CrawlSpider.

settings.py:

    BOT_NAME = 'healthycomm'
    SPIDER_MODULES = ['healthycomm.spiders']
    NEWSPIDER_MODULE = 'healthycomm.spiders'
    ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
    IMAGES_STORE = '~/Desktop/scrapy_nsml/healthycomm/images'

items.py:

    class HealthycommItem(scrapy.Item):
        page_heading = scrapy.Field()
        page_title = scrapy.Field
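Whatever else is wrong, ImagesPipeline only acts on items that expose an image_urls field (results land in an images field), so the item and callback need roughly the shape sketched below; the XPath is an assumption, and note that a '~' in IMAGES_STORE may not be expanded, so an absolute path is safer:

    import scrapy

    class HealthycommImageItem(scrapy.Item):
        image_urls = scrapy.Field()  # the pipeline reads download URLs from here
        images = scrapy.Field()      # the pipeline writes download results here

    # Inside a spider callback:
    def parse_item(self, response):
        item = HealthycommImageItem()
        # Absolute URLs are required; relative src values won't download.
        item['image_urls'] = [response.urljoin(src)
                              for src in response.xpath('//img/@src').extract()]
        yield item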

Scrapy CrawlSpider Crawls Nothing

天大地大妈咪最大 submitted on 2019-12-23 04:53:25
Question: I am trying to crawl Booking.com. The spider opens and closes without crawling the URL (output: https://i.stack.imgur.com/9hDt6.png). I am new to Python and Scrapy. Here is the code I have written so far. Please point out what I am doing wrong.

    import scrapy
    import urllib
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import Selector
    from scrapy.item import Item
    from scrapy.loader import ItemLoader
    from CinemaScraper.items import CinemascraperItem
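When a CrawlSpider opens and closes immediately, the usual suspects are rules that match no links, an allowed_domains value that filters out the start URL, or an overridden parse() method, a name CrawlSpider reserves for its own link-following logic. A minimal working shape, with placeholder URL patterns:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor

    class BookingSpider(CrawlSpider):
        name = 'booking'
        allowed_domains = ['booking.com']  # must cover the start URL's domain
        start_urls = ['http://www.booking.com/']

        rules = (
            # '/hotel/' is a guess at the site's URL scheme, not verified.
            Rule(LinkExtractor(allow=(r'/hotel/',)),
                 callback='parse_hotel',  # not 'parse', which CrawlSpider uses itself
                 follow=True),
        )

        def parse_hotel(self, response):
            yield {'name': response.xpath('//h1/text()').extract()}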