Splash lua script to do multiple clicks and visits

隐瞒了意图╮ 2020-12-30 14:20

I'm trying to crawl Google Scholar search results and get the BibTeX entry for each result matching the search. Right now I have a Scrapy crawler with Splash. I have a

1 answer
  • 2020-12-30 15:03

    Okay, so I hacked up a solution that works. First, the Lua script has to vary with the index of the link to click, so we generate it from a Python function:

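    def script(n):
        """Return a Lua script that clicks the n-th "Cite" link (0-based)."""
        # Doubled braces come out as literal braces after str.format().
        return """
            function main(splash)
              local url = splash.args.url
              assert(splash:go(url))
              assert(splash:wait(0.5))
              -- open the citation popup of the n-th result
              splash:runjs('document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[{index}].click()')
              -- wait for the popup to load
              assert(splash:wait(3))
              -- follow the first export link in the popup
              local href = splash:evaljs('document.querySelectorAll("a.gs_citi")[0].href')
              assert(splash:go(href))
              return {{html = splash:html(), png = splash:png(), href = href}}
            end
            """.format(index=n)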
    

    I then had to modify the parse function so that it handles every "Cite" link on the page. It counts the matching links and yields one Splash request per link index. Each request makes the Lua script load the page again (which is dirty, but I can't think of any other way) and click the "Cite" link at that index. Because every request targets the same URL, Scrapy's duplicate filter would otherwise drop all but the first, which is why dont_filter=True is there:

    from scrapy_splash import SplashRequest

    # method of the spider class
    def parse(self, response):
        # one request per "Cite" link on the results page
        n = len(response.css("a.gs_nph[aria-controls=gs_cit]"))
        for i in range(n):
            yield SplashRequest(response.url, self.parse_bib,
                                endpoint="execute",
                                args={"lua_source": script(i)},
                                dont_filter=True)
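
    As a possible variation on the approach above: instead of formatting the index into the Lua source, the index can be sent as an extra request argument, since Splash exposes request arguments to the script through splash.args. The sketch below is untested against Google Scholar and only illustrates the idea. The names LUA_CLICK_CITE and cite_request_args, and the "index" argument key, are assumptions made up for this example and are not part of the original solution.

```python
# Hypothetical names throughout: LUA_CLICK_CITE, cite_request_args and the
# "index" argument key are illustrative, not part of the original answer.
LUA_CLICK_CITE = """
function main(splash)
  assert(splash:go(splash.args.url))
  assert(splash:wait(0.5))
  -- splash.args.index is supplied per request from the Python side
  local js = string.format(
    'document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[%d].click()',
    splash.args.index)
  splash:runjs(js)
  assert(splash:wait(3))
  local href = splash:evaljs('document.querySelectorAll("a.gs_citi")[0].href')
  assert(splash:go(href))
  return {html = splash:html(), png = splash:png(), href = href}
end
"""


def cite_request_args(index):
    """Build the args dict for one request; only the index varies."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return {"lua_source": LUA_CLICK_CITE, "index": index}
```

    In parse, the request would then take args=cite_request_args(i) in place of args={"lua_source": script(i)}. The requests still share one URL, so dont_filter=True stays.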
    

    Hope this helps.
