Scrapy not following pagination properly, catches the first link in the pagination

蹲街弑〆低调 提交于 2019-12-11 10:37:08

问题


Yesterday I started learning Scrapy to extract some information but I can't seem to get the pagination right. I followed the tutorial here but I think the site has a different pagination system.

Most pagination's have a class="next" but this one doesn't have that. It only has a list where the current page is listed as a span with the class current:

<div class="pagination">
    <ul class="page-numbers">
        <li><span class='page-numbers current'>1</span></li>
        <li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/2/'>2</a></li>
        <li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/3/'>3</a></li>
        <li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/4/'>4</a></li>
        <li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/5/'>5</a></li>
    </ul>
</div>

And here is my scraper:

import scrapy


class MfwspiderSpider(scrapy.Spider):
    name = 'mfwspider'
    allowed_domains = ['www.musicfestivalwizard.com']
    start_urls = ['https://www.musicfestivalwizard.com/all-festivals/',]



def parse(self, response):
    pagenumber = 1
    for festival in response.css("span.festivalleft"):
        print("-------")
        yield {
            'date' : festival.css(".festivaldate::text").extract(),
            'location' : festival.css(".festivallocation::text").extract_first(),
            'title' : festival.css(".festivaltitle > a::text").extract_first(),
            }

    next_page = start_urls[0] + str(pagenumber) + "/"
    print(next_page)
    print("^^^^^^^^^^^^^^^^^^")

    if next_page is not None:
        yield response.follow(next_page, callback=self.parse,)

As you can see I added some print() statements for debugging. and here is my console output:

    scrapy crawl mfwspider
2018-05-06 00:21:45 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: lineups)
2018-05-06 00:21:45 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-05-06 00:21:45 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'lineups', 'NEWSPIDER_MODULE': 'lineups.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['lineups.spiders']}
2018-05-06 00:21:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-05-06 00:21:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-06 00:21:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-06 00:21:46 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-06 00:21:46 [scrapy.core.engine] INFO: Spider opened
2018-05-06 00:21:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-06 00:21:46 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-05-06 00:21:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/robots.txt> (referer: None)
2018-05-06 00:21:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/all-festivals/> (referer: None)
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'Numero Uno, Malta', 'title': 'Lost And Found Malta 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 6, 2018'], 'location': 'New Orleans, LA', 'title': 'New Orleans Jazz Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 2-May 6, 2018'], 'location': 'West Palm Beach, FL', 'title': 'Sunfest 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Memphis, TN', 'title': 'Beale Street Music Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 5-6, 2018'], 'location': 'Liverpool, UK', 'title': 'Liverpool Sound City 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4–6, 2018'], 'location': 'Atlanta, GA', 'title': 'Shaky Knees Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Concord, NC', 'title': 'Carolina Rebellion 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Winooski, VT', 'title': 'Waking Windows 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Texas Tour', 'title': 'JMBLYA 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'San Diego, CA', 'title': 'West Coast Weekender 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 12, 2017'], 'location': 'Australia Tour', 'title': 'Groovin’ The Moo 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 7-13. 2018'], 'location': 'Toronto, ON', 'title': 'Canadian Music Week 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 11-13, 2018'], 'location': 'London, UK', 'title': 'Peckham Rye 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 12-13, 2018'], 'location': 'Somerset, WI', 'title': 'Northern Invasion 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 6-13, 2018'], 'location': 'Lyon, France', 'title': 'Nuits Sonores 2018'}
https://www.musicfestivalwizard.com/all-festivals/page/2/
^^^^^^^^^^^^^^^^^^
2018-05-06 00:21:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/all-festivals/page/2/> (referer: https://www.musicfestivalwizard.com/all-festivals/)
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 12-13, 2018'], 'location': 'Chiba, Japan', 'title': 'Electric Daisy Carnival Japan 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 11-13, 2018'], 'location': 'Arcosanti, AZ', 'title': 'FORM Arcosanti Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 11-13, 2018'], 'location': 'Atlanta, GA', 'title': 'Shaky Beats Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 11-13, 2018'], 'location': 'Miami, FL', 'title': 'Rolling Loud Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 17-19, 2018'], 'location': 'Brighton, UK', 'title': 'The Great Escape 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Gulf Shores, AL', 'title': 'Hangout Fest 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Saint-Laurent-de-Cuves, France', 'title': 'Papillons De Nuit 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['June 19-20, 2018'], 'location': 'Margny-lès-Compiègne, France', 'title': 'Imaginarium Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': [' May 18-20, 2018'], 'location': 'Columbus, OH', 'title': 'Rock on the Range 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 17-20, 2018'], 'location': 'Durham, NC', 'title': 'Moogfest 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 19-20, 2018'], 'location': 'Paris, France', 'title': 'Marvellous Island Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Montreal, QC', 'title': 'Pouzza Fest 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Houthalen-Helchteren, Belgium', 'title': 'Extrema Outdoor Belgium 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 17-20, 2018'], 'location': 'Joshua Tree, CA', 'title': 'Joshua Tree Festival Spring 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-21, 2018'], 'location': 'Las Vegas, NV', 'title': 'Electric Daisy Carnival Vegas 2018'}
https://www.musicfestivalwizard.com/all-festivals/
^^^^^^^^^^^^^^^^^^
2018-05-06 00:21:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/all-festivals/> (referer: https://www.musicfestivalwizard.com/all-festivals/page/2/)
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'Numero Uno, Malta', 'title': 'Lost And Found Malta 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 6, 2018'], 'location': 'New Orleans, LA', 'title': 'New Orleans Jazz Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 2-May 6, 2018'], 'location': 'West Palm Beach, FL', 'title': 'Sunfest 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Memphis, TN', 'title': 'Beale Street Music Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 5-6, 2018'], 'location': 'Liverpool, UK', 'title': 'Liverpool Sound City 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4–6, 2018'], 'location': 'Atlanta, GA', 'title': 'Shaky Knees Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Concord, NC', 'title': 'Carolina Rebellion 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Winooski, VT', 'title': 'Waking Windows 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Texas Tour', 'title': 'JMBLYA 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'San Diego, CA', 'title': 'West Coast Weekender 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 12, 2017'], 'location': 'Australia Tour', 'title': 'Groovin’ The Moo 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 7-13. 2018'], 'location': 'Toronto, ON', 'title': 'Canadian Music Week 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 11-13, 2018'], 'location': 'London, UK', 'title': 'Peckham Rye 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 12-13, 2018'], 'location': 'Somerset, WI', 'title': 'Northern Invasion 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 6-13, 2018'], 'location': 'Lyon, France', 'title': 'Nuits Sonores 2018'}
https://www.musicfestivalwizard.com/all-festivals/page/2/
^^^^^^^^^^^^^^^^^^
2018-05-06 00:21:47 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.musicfestivalwizard.com/all-festivals/page/2/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2018-05-06 00:21:47 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-06 00:21:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1092,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 48590,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 4,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 5, 5, 22, 21, 47, 746610),
 'item_scraped_count': 45,
 'log_count/DEBUG': 51,
 'log_count/INFO': 7,
 'memusage/max': 66899968,
 'memusage/startup': 66899968,
 'request_depth_max': 3,
 'response_received_count': 4,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2018, 5, 5, 22, 21, 46, 20038)}
2018-05-06 00:21:47 [scrapy.core.engine] INFO: Spider closed (finished)

I think what I need something to select the li after the . How can I do this in scrapy? Is there a better way to do it?


回答1:


You can use a XPath statement to extract the next page.

The following XPath looks for the li element of the current page which is indicated by the class. Then it takes the next li elements href.

    xpath_next_page = ' .//li/*[@class="page-numbers current"]/parent::li/following-sibling::li[1]/a/@href'
    next_page = response.xpath(xpath_next_page).extract_first()

I tested this with the site and it seems to work pretty well. But I needed to add some DOWNLOAD_DELAY for not being denied to scrape through all pages.



来源:https://stackoverflow.com/questions/50194545/scrapy-not-following-pagination-properly-catches-the-first-link-in-the-paginati

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!