scrapy-spider

How to get contents of HTML Script tag

旧街凉风 submitted on 2019-12-11 07:24:17
Question: I'm trying to scrape geo data from a URL for scraping practice, but I'm having trouble handling the contents of a script tag. The script tag looks like this:

```
<script type="application/ld+json">
{
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "5080 Riverside Drive",
    "addressLocality": "Macon",
    "addressRegion": "GA",
    "postalCode": "31210-1100",
    "addressCountry": "US"
  },
  "telephone": "478-471-0171",
  "geo": {
    "@type": "GeoCoordinates",
    "latitude": "32.9252435",
```
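Once the raw text of that tag is selected, it can be decoded with the json module instead of being picked apart with selectors. A minimal sketch, assuming the page exposes a single JSON-LD block (the spider name and URL are placeholders):

```python
import json

import scrapy


class GeoSpider(scrapy.Spider):
    name = "geo"
    start_urls = ["https://example.com/some-listing"]  # placeholder URL

    def parse(self, response):
        # Grab the raw text of the JSON-LD script tag and decode it as JSON.
        raw = response.xpath('//script[@type="application/ld+json"]/text()').get()
        data = json.loads(raw)
        geo = data.get("geo", {})
        yield {
            "latitude": geo.get("latitude"),
            "longitude": geo.get("longitude"),
        }
```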

Scrapy CSS Selector ignore tags and get text only

纵然是瞬间 submitted on 2019-12-11 06:13:28
Question: I have the following HTML:

```
<li class="last">
  <span>SKU:</span>
  483151
</li>
```

I was able to select it using:

```python
SKU_SELECTOR = '.aaa .bbb .last ::text'
sku = response.css(SKU_SELECTOR).extract_first().strip()
```

How can I get the number only and ignore the span?

Answer 1: Your CSS selector has an unnecessary space before ::text:

```
SKU_SELECTOR = '.aaa .bbb .last ::text'
                               ^
```

The space indicates that any descendant-or-self node qualifies for this selector, whereas you want to select only the text directly under the element itself. I got it
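A sketch of the corrected extraction, assuming the same class names. Note that the bare whitespace between the tags is also a direct text child of .last, so joining all matches before stripping is more robust than taking only the first:

```python
# Only text nodes that are direct children of .last (space before ::text removed).
sku = ''.join(response.css('.aaa .bbb .last::text').getall()).strip()
# sku == '483151'
```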

Scrapy Google Search

二次信任 submitted on 2019-12-11 05:26:01
Question: I am trying to scrape Google search results, including the "people also search for" links. For example, when you search Google for Christopher Nolan, it also produces a "people also search for" panel with images of people related to the search, in this case Christian Bale, Emma Thomas, Zack Snyder, etc. I am interested in scraping this data. I am using the Scrapy framework and wrote a simple scraper, but it returns an empty CSV data file. Below is
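An empty output file from a Google scrape usually means the selectors matched nothing: Google serves different, frequently changing markup to non-browser clients, and scraping it may violate its terms of service. The sketch below shows only the general shape; the CSS selector is a placeholder, not a stable Google class:

```python
import scrapy


class GoogleSearchSpider(scrapy.Spider):
    name = "google_search"
    start_urls = ["https://www.google.com/search?q=christopher+nolan"]

    def parse(self, response):
        # Placeholder selector: inspect the HTML Google actually returns to
        # the crawler (it differs from what a browser sees) before relying on it.
        for link in response.css('a.people-also-search'):
            yield {
                'name': link.css('::text').get(),
                'url': response.urljoin(link.attrib.get('href', '')),
            }
```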

Extract data from a gsmarena page using scrapy

拥有回忆 submitted on 2019-12-11 03:18:38
Question: I'm trying to download data from a gsmarena page: "http://www.gsmarena.com/htc_one_me-7275.php". However, the data is organized into tables and table rows, in the format: table header > td[@class='ttl'] > td[@class='nfo']

Edited code: thanks to the help of community members at Stack Exchange, I've reformatted the code as follows.

items.py file:

```python
import scrapy

class gsmArenaDataItem(scrapy.Item):
    phoneName = scrapy.Field()
    phoneDetails = scrapy.Field()
```

Spider file:

```python
from scrapy
```
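A sketch of a spider for the ttl/nfo layout described above. The XPaths follow the classes named in the question but should be verified against the live page, and the assumptions are flagged in comments (the project module path and the <h1> for the phone name are guesses):

```python
import scrapy

from gsmarena.items import gsmArenaDataItem  # hypothetical project module


class GsmArenaSpider(scrapy.Spider):
    name = 'gsmarena'
    start_urls = ['http://www.gsmarena.com/htc_one_me-7275.php']

    def parse(self, response):
        item = gsmArenaDataItem()
        # Assumption: the phone name lives in the page's <h1>.
        item['phoneName'] = response.xpath('//h1//text()').get(default='').strip()
        details = {}
        # Pair each spec label (td.ttl) with its value (td.nfo), row by row.
        for row in response.xpath('//tr[td[@class="ttl"]]'):
            label = ' '.join(row.xpath('td[@class="ttl"]//text()').getall()).strip()
            value = ' '.join(row.xpath('td[@class="nfo"]//text()').getall()).strip()
            if label:
                details[label] = value
        item['phoneDetails'] = details
        yield item
```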

Scrapy Limit Requests For Testing

本小妞迷上赌 submitted on 2019-12-11 02:44:56
Question: I've been searching the Scrapy documentation for a way to limit the number of requests my spiders are allowed to make. During development I don't want to sit here and wait for my spiders to finish an entire crawl; even though the crawls are pretty focused, they can still take quite a while. I want the ability to say, "After x requests to the site I'm scraping, stop generating new requests." I was wondering if there is a setting for this I may have missed or some other way to do it using the
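Scrapy's built-in CloseSpider extension covers exactly this case: the CLOSESPIDER_PAGECOUNT setting stops the crawl after a given number of responses (requests already in flight still complete, so the cutoff is approximate). It can live in the spider for dev runs:

```python
# Inside the spider class: cap development crawls at roughly 50 responses.
custom_settings = {
    'CLOSESPIDER_PAGECOUNT': 50,
}
```

Or it can be passed per run without touching code: scrapy crawl myspider -s CLOSESPIDER_PAGECOUNT=50. Related settings such as CLOSESPIDER_ITEMCOUNT and CLOSESPIDER_TIMEOUT offer item-based and time-based cutoffs.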

Scrapy outputs [ into my .json file

别等时光非礼了梦想. submitted on 2019-12-11 02:05:29
Question: A genuine Scrapy and Python noob here, so please be patient with any silly mistakes. I'm trying to write a spider to recursively crawl a news site and return the headline, date, and first paragraph of each article. I managed to crawl a single page for one item, but the moment I try to expand beyond that it all goes wrong.

My spider:

```python
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import
```
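Two notes that may help with this pattern. First, the stray [ in the .json output mentioned in the title usually means the feed exporter appended to an existing file; deleting the old file before each run (or, on Scrapy 2.0+, exporting with -O instead of -o to overwrite) avoids that. Second, a minimal CrawlSpider sketch for the recursive crawl described above, where the domain, link pattern, and XPaths are placeholders and the imports use the modern paths that replaced scrapy.contrib:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NewsSpider(CrawlSpider):
    name = 'news'
    allowed_domains = ['example.com']            # placeholder domain
    start_urls = ['https://example.com/news/']   # placeholder index page
    rules = (
        # Follow article links and parse each; '/article/' is a placeholder pattern.
        Rule(LinkExtractor(allow=r'/article/'), callback='parse_article', follow=True),
    )

    def parse_article(self, response):
        yield {
            'headline': response.xpath('//h1/text()').get(),
            'date': response.xpath('//time/@datetime').get(),
            'first_paragraph': response.xpath('//article//p[1]//text()').get(),
        }
```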

Scrapy crawl spider does not download files?

坚强是说给别人听的谎言 submitted on 2018-12-10 23:54:38
Question: So I made a crawl spider which crawls this website (https://minerals.usgs.gov/science/mineral-deposit-database/#products), follows every link on that web page, and scrapes the title; it is supposed to download the files as well. However, this does not happen, and there is no error indication in the log!

LOG SAMPLE:

```
2018-11-19 18:20:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sciencebase.gov/catalog/item/5a1492c3e4b09fc93dcfd574>
{'date': [datetime.datetime(2018,
```
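Silent non-downloads usually mean the files pipeline never ran. With Scrapy's standard FilesPipeline, both the pipeline and a storage path must be enabled in settings, and the item must expose the download URLs under the file_urls field, which is the field name FilesPipeline looks for (the store path below is a placeholder):

```python
# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/path/to/downloads'  # placeholder path

# in the spider callback: collect absolute URLs into file_urls
yield {
    'title': response.xpath('//title/text()').get(),
    'file_urls': [response.urljoin(u) for u in
                  response.xpath('//a[contains(@href, ".zip")]/@href').getall()],
}
```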

Running more than one spider in a for loop

余生颓废 submitted on 2019-12-10 17:37:17
Question: I am trying to instantiate multiple spiders. The first one works fine, but the second one gives me a ReactorNotRestartable error.

```python
feeds = {
    'nasa': {
        'name': 'nasa',
        'url': 'https://www.nasa.gov/rss/dyn/breaking_news.rss',
        'start_urls': ['https://www.nasa.gov/rss/dyn/breaking_news.rss']
    },
    'xkcd': {
        'name': 'xkcd',
        'url': 'http://xkcd.com/rss.xml',
        'start_urls': ['http://xkcd.com/rss.xml']
    }
}
```

With the items above, I try to run two spiders in a loop, like this:

```python
from scrapy.crawler import
```
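ReactorNotRestartable is raised because Twisted's reactor can be started only once per process, so starting a fresh crawl after the first finishes fails. A common pattern is to schedule every crawl on one CrawlerProcess and call start() a single time. A sketch, where RSSSpider is a hypothetical spider that takes name and start_urls as constructor arguments:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class RSSSpider(scrapy.Spider):
    # Hypothetical spider; name and start_urls arrive via constructor kwargs.
    name = 'rss'

    def parse(self, response):
        yield {'url': response.url}


process = CrawlerProcess()
for feed in feeds.values():  # `feeds` is the dict from the question above
    # Keyword arguments are forwarded to the spider's constructor.
    process.crawl(RSSSpider, name=feed['name'], start_urls=feed['start_urls'])
process.start()  # starts the reactor once and runs every scheduled spider
```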

scrapy crawler to pass multiple item classes to pipeline

為{幸葍}努か submitted on 2019-12-10 12:02:40
Question: Hi, I am very new to Python and Scrapy; this is my first code and I can't solve a problem that looks pretty basic. I have the crawler set up to do two things:

1. Find all pagination URLs, visit them, and get some data from each page.
2. Get all links listed on the results pages, visit them, and crawl each location's data.

I decide which parser handles each item using rules with callbacks, and I created two classes inside items.py, one for each parser. The second rule is processing perfectly, but the first
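On the pipeline side, a single pipeline can receive both item classes and branch on the type. A minimal sketch, where PaginationItem and LocationItem stand in for the two items.py classes mentioned above (both names and the module path are hypothetical):

```python
from myproject.items import LocationItem, PaginationItem  # hypothetical module


class MultiItemPipeline:
    def process_item(self, item, spider):
        # Branch on the concrete item class yielded by the spider.
        if isinstance(item, PaginationItem):
            pass  # handle page-level data here
        elif isinstance(item, LocationItem):
            pass  # handle per-location data here
        return item
```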

Scrapy: Rules set inside __init__ are ignored by CrawlSpider

情到浓时终转凉″ submitted on 2019-12-10 11:38:52
Question: I've been stuck on this for a few days, and it's making me go crazy. I call my Scrapy spider like this:

```
scrapy crawl example -a follow_links="True"
```

I pass in the "follow_links" flag to determine whether the entire website should be scraped, or just the index page I have defined in the spider. This flag is checked in the spider's constructor to see which rule should be set:

```python
def __init__(self, *args, **kwargs):
    super(ExampleSpider, self).__init__(*args, **kwargs)
    self.follow_links = kwargs.get(
```
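The likely cause, given CrawlSpider's behavior: the rules are compiled inside CrawlSpider.__init__, so any rules assigned after the super() call are never picked up. A sketch of the usual fix, setting self.rules before delegating to the parent constructor (the single catch-all rule and parse_item callback are illustrative):

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'

    def __init__(self, *args, **kwargs):
        follow = kwargs.pop('follow_links', 'False') == 'True'
        # Rules must exist before CrawlSpider.__init__ compiles them.
        self.rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=follow),
        )
        super(ExampleSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        yield {'url': response.url}
```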