Scrapyjs + Splash click controller button

前端 未结 1 735
伪装坚强ぢ
伪装坚强ぢ 2021-01-02 20:07

Hello I have installed Scrapyjs + Splash and I use the following code

import json

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.         


        
相关标签:
1条回答
  • 2021-01-02 20:50

    Here's a basic (yet working) example of how to pass JavaScript code in a lua script for Splash, using the /execute endpoint

    # -*- coding: utf-8 -*-
    import json
    from six.moves.urllib.parse import urljoin
    
    import scrapy
    
    
    class WhoscoredspiderSpider(scrapy.Spider):
        name = "whoscoredspider"
        allowed_domains = ["whoscored.com"]
        start_urls = (
            'http://www.whoscored.com/Regions/81/Tournaments/3/Seasons/4336/Stages/9192/Fixtures/Germany-Bundesliga-2014-2015',
        )
    
        def start_requests(self):
            script = """
            function main(splash)
                local url = splash.args.url
                assert(splash:go(url))
                assert(splash:wait(1))
    
                -- go back 1 month in time and wait a little (1 second)
                assert(splash:runjs("$('#date-controller > a:first-child').click()"))
                assert(splash:wait(1))
    
                -- return result as a JSON object
                return {
                    html = splash:html(),
                    -- we don't need screenshot or network activity
                    --png = splash:png(),
                    --har = splash:har(),
                }
            end
            """
            for url in self.start_urls:
                yield scrapy.Request(url, self.parse_result, meta={
                    'splash': {
                        'args': {'lua_source': script},
                        'endpoint': 'execute',
                    }
                })
    
        def parse_result(self, response):
    
            # fetch base URL because response url is the Splash endpoint
            baseurl = response.meta["splash"]["args"]["url"]
    
            # decode JSON response
            splash_json = json.loads(response.body_as_unicode())
    
            # and build a new selector from the response "html" key from that object
            selector = scrapy.Selector(text=splash_json["html"], type="html")
    
            # loop on the table row
            for table in selector.css('table#tournament-fixture'):
    
                # seperating on each date (<tr> elements with a <th>)
                for cnt, header in enumerate(table.css('tr.rowgroupheader'), start=1):
                    self.logger.info("date: %s" % header.xpath('string()').extract_first())
    
                    # after each date, look for sibling <tr> elements
                    # that have only N preceding tr/th,
                    # N being the number of headers seen so far
                    for row in header.xpath('''
                            ./following-sibling::tr[not(th/@colspan)]
                                                   [count(preceding-sibling::tr[th/@colspan])=%d]''' % cnt):
                        self.logger.info("record: %s" % row.xpath('string()').extract_first())
                        match_report_href = row.css('td > a.match-report::attr(href)').extract_first()
                        if match_report_href:
                            self.logger.info("match report: %s" % urljoin(baseurl, match_report_href))
    

    Sample logs:

    $ scrapy crawl whoscoredspider 
    2016-03-07 19:21:38 [scrapy] INFO: Scrapy 1.0.5 started (bot: whoscored)
    (...stripped...)
    2016-03-07 19:21:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, SplashMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2016-03-07 19:21:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2016-03-07 19:21:38 [scrapy] INFO: Enabled item pipelines: 
    2016-03-07 19:21:38 [scrapy] INFO: Spider opened
    2016-03-07 19:21:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-03-07 19:21:43 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/execute> (referer: None)
    2016-03-07 19:21:43 [whoscoredspider] INFO: date: Saturday, Apr 4 2015
    2016-03-07 19:21:43 [whoscoredspider] INFO: record: 14:30FTWerder Bremen0 : 0Mainz 05Match Report2
    2016-03-07 19:21:43 [whoscoredspider] INFO: match report: http://www.whoscored.com/Matches/834843/MatchReport
    2016-03-07 19:21:43 [whoscoredspider] INFO: record: 14:30FTEintracht Frankfurt2 : 2Hannover 96Match Report1
    2016-03-07 19:21:43 [whoscoredspider] INFO: match report: http://www.whoscored.com/Matches/834847/MatchReport
    (...stripped...)
    2016-03-07 19:21:43 [whoscoredspider] INFO: date: Sunday, Apr 26 2015
    2016-03-07 19:21:43 [whoscoredspider] INFO: record: 14:30FT1Paderborn2 : 2Werder BremenMatch Report2
    2016-03-07 19:21:43 [whoscoredspider] INFO: match report: http://www.whoscored.com/Matches/834837/MatchReport
    2016-03-07 19:21:43 [whoscoredspider] INFO: record: 16:30FTBorussia M.Gladbach1 : 0WolfsburgMatch Report12
    2016-03-07 19:21:43 [whoscoredspider] INFO: match report: http://www.whoscored.com/Matches/834809/MatchReport
    2016-03-07 19:21:43 [scrapy] INFO: Closing spider (finished)
    2016-03-07 19:21:43 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1015,
     'downloader/request_count': 1,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 143049,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2016, 3, 7, 18, 21, 43, 662973),
     'log_count/DEBUG': 2,
     'log_count/INFO': 90,
     'log_count/WARNING': 3,
     'response_received_count': 1,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'splash/execute/request_count': 1,
     'splash/execute/response_count/200': 1,
     'start_time': datetime.datetime(2016, 3, 7, 18, 21, 38, 772848)}
    2016-03-07 19:21:43 [scrapy] INFO: Spider closed (finished)
    
    0 讨论(0)
自定义标题
段落格式
字体
字号
代码语言
提交回复
热议问题