Scrapy parse pagination without next link


Question


I'm trying to parse pagination that has no "next" link. The HTML is below:

<div id="pagination" class="pagination">
    <ul>
        <li>
            <a href="//www.demopage.com/category_product_seo_name" class="page-1 ">1</a>
        </li>
        <li>
            <a href="//www.demopage.com/category_product_seo_name?page=2" class="page-2 ">2</a>
        </li>
        <li>
            <a href="//www.demopage.com/category_product_seo_name?page=3" class="page-3 ">3</a>
        </li>
        <li>
            <a href="//www.demopage.com/category_product_seo_name?page=4" class="page-4 active">4</a>
        </li>
        <li>
            <a href="//www.demopage.com/category_product_seo_name?page=5" class="page-5">5</a>
        </li>
        <li>
            <a href="//www.demopage.com/category_product_seo_name?page=6" class="page-6 ">6</a>
        </li>
        <li>
                <span class="page-... three-dots">...</span>
        </li>
        <li>
           <a href="//www.demopage.com/category_product_seo_name?page=50" class="page-50 ">50</a>
        </li>
    </ul>   
</div>

For this HTML, I have tried these XPath expressions:

response.xpath('//div[@class="pagination"]/ul/li/a/@href').extract()
or 
response.xpath('//div[@class="pagination"]/ul/li/a/@href/following-sibling::a[1]/@href').extract()
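(Note that the second expression is not valid XPath: an attribute node such as @href has no following siblings. A syntactically valid sibling-style attempt, assuming the active class always marks the current page, would be something like the sketch below, but it still only yields the single next link rather than all pages.)

# Sketch: select the link that follows the currently "active" page link.
# The "following" axis also skips over the "..." <span>.
next_href = response.xpath(
    '//div[@id="pagination"]//a[contains(@class, "active")]/following::a[1]/@href'
).get()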

Is there a good way to parse this pagination? Thanks to all.

PS: I have checked these answers too:

Answer 1

Answer 2


Answer 1:


One solution is to scrape a fixed number of pages, but this isn't a good solution when the total number of pages isn't constant:

import scrapy

class MySpider(scrapy.Spider):
    name = 'demo'
    num_pages = 10

    def start_requests(self):
        requests = []
        for i in range(1, self.num_pages + 1):
            requests.append(scrapy.Request(
                url='https://www.demopage.com/category_product_seo_name?page={0}'.format(i)
            ))
        return requests

    def parse(self, response):
        # parse pages here.
        pass
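If the total number of pages varies per category, one alternative (a sketch, not part of the original answer) is to read the highest page number from the pagination block on the first page and generate the remaining requests from it. The selectors and URL below assume the markup shown in the question:

import scrapy

class MySpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://www.demopage.com/category_product_seo_name']

    def parse(self, response):
        # parse the items on page 1 here.

        # The last <li> in the pagination holds the highest page number
        # ("50" in the question's markup).
        last_page = response.xpath('//div[@id="pagination"]//li[last()]/a/text()').get()
        if last_page:
            for page in range(2, int(last_page) + 1):
                yield scrapy.Request(
                    url='https://www.demopage.com/category_product_seo_name?page={0}'.format(page),
                    callback=self.parse_page,
                )

    def parse_page(self, response):
        # parse the items on pages 2..N here.
        pass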

Update

You can also keep track of the page count and do something like this. a[href*="?page=2"]::attr(href) will target a elements whose href attribute contains the specified substring (note that *= is the substring form; ~= only matches whole space-separated words, so it would not match here). I'm not currently able to test this code, but something along these lines should work:

import scrapy

class MySpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://demopage.com/search?p=1']
    page_count = 1

    def parse(self, response):
        self.page_count += 1
        # parse response here.

        next_url = response.css(
            '#pagination > ul > li > a[href*="?page={0}"]::attr(href)'.format(self.page_count)
        ).get()
        if next_url:
            yield scrapy.Request(
                url=response.urljoin(next_url)
            )
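One caveat with the sketch above: page_count is a single attribute shared by all concurrent callbacks, so with more than one request in flight the counter can drift out of sync with the response being parsed. A variant (my sketch, with the same hypothetical URLs) that carries the page number on the request itself via cb_kwargs (available since Scrapy 1.7) avoids this:

import scrapy

class MySpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        # Start at page 1 and pass the page number along with the request.
        yield scrapy.Request('https://demopage.com/search?page=1',
                             cb_kwargs={'page': 1})

    def parse(self, response, page):
        # parse response here.

        next_url = response.css(
            '#pagination > ul > li > a[href*="?page={0}"]::attr(href)'.format(page + 1)
        ).get()
        if next_url:
            yield response.follow(next_url, cb_kwargs={'page': page + 1})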



Answer 2:


You can simply get all the pagination links and follow them in a loop: every time a page is parsed, run the selector below and it will return whatever pagination links are available. You don't need to worry about duplicate URLs, as Scrapy's duplicate filter handles that for you. You can also use Scrapy Rules instead.

response.css('.pagination ::attr(href)').getall()
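As a sketch of how that selector might be used in practice (the spider name and start URL below are placeholders, not from the original answer):

import scrapy

class MySpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://www.demopage.com/category_product_seo_name']

    def parse(self, response):
        # parse the items on this page here.

        # Follow every pagination link; Scrapy's duplicate filter
        # silently drops URLs that have already been scheduled.
        for href in response.css('.pagination ::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

The Rule-based equivalent would be a CrawlSpider with Rule(LinkExtractor(restrict_css='.pagination'), follow=True), so the framework follows the pagination links automatically.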


Source: https://stackoverflow.com/questions/63244175/scrapy-parse-pagination-without-next-link
