splash-js-render

Splash Lua script to do multiple clicks and visits

隐身守侯 submitted on 2019-12-30 03:27:08
Question: I'm trying to crawl Google Scholar search results and get the BibTeX entry for each result matching the search. Right now I have a Scrapy crawler with Splash. I have a Lua script which clicks the "Cite" link and loads the modal window before getting the href of the BibTeX format of the citation. But since there are multiple search results and hence multiple "Cite" links, I need to click them all and load the individual BibTeX pages. Here's what I have: import scrapy from …
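A rough sketch of the kind of Lua script the question is after, assuming Splash 2.3+ (for splash:select_all and element:mouse_click). The CSS selectors for the "Cite" links ('a.gs_or_cit'), the import links inside the modal ('#gs_citi a'), the modal close button id ('gs_cit-x'), the example query URL, and the spider/class names are all guesses for illustration, not verified values:

```python
import scrapy
from scrapy_splash import SplashRequest

# Clicks every "Cite" link in turn and collects the hrefs of the import links
# (BibTeX, EndNote, ...) from the modal each click opens. Selectors are assumptions.
lua_click_all_cites = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(1.0))
  local hrefs = {}
  local cite_links = splash:select_all('a.gs_or_cit')   -- assumed selector
  for _, link in ipairs(cite_links) do
    link:mouse_click()
    assert(splash:wait(1.0))
    -- collect the import-format links shown in the citation modal
    local found = splash:evaljs([[
      Array.prototype.map.call(
        document.querySelectorAll('#gs_citi a'),
        function(a) { return a.href; })
    ]])
    for _, href in ipairs(found) do
      hrefs[#hrefs + 1] = href
    end
    -- close the modal before clicking the next "Cite" link (assumed button id)
    splash:runjs("var x = document.getElementById('gs_cit-x'); if (x) x.click();")
    assert(splash:wait(0.5))
  end
  return {hrefs = hrefs}
end
"""

class ScholarCitesSpider(scrapy.Spider):
    name = 'scholar_cites'

    def start_requests(self):
        url = 'https://scholar.google.com/scholar?q=scrapy'  # example query
        yield SplashRequest(url, self.parse_cites,
                            endpoint='execute',
                            args={'lua_source': lua_click_all_cites})

    def parse_cites(self, response):
        # response.data is the table returned by the Lua script
        for href in response.data.get('hrefs', []):
            yield {'citation_link': href}
```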

Scrapy Shell and Scrapy Splash

拈花ヽ惹草 submitted on 2019-12-17 17:31:16
Question: We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container. If we want to use Splash in the spider, we configure several required project settings and yield a Request with specific meta arguments: yield Request(url, self.parse_result, meta={ 'splash': { 'args': { # set rendering arguments here 'html': 1, 'png': 1, # 'url' is prefilled from request url }, # optional parameters 'endpoint': 'render.json', …
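For reference, the "required project settings" the excerpt mentions are the ones documented in the scrapy-splash README; the SPLASH_URL value assumes the Splash Docker container is listening locally on port 8050:

```python
# settings.py -- scrapy-splash wiring as documented in the scrapy-splash README
SPLASH_URL = 'http://localhost:8050'  # assumes a local Splash Docker container

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

In the shell itself, the simplest equivalent is to point scrapy shell at the rendering endpoint directly, e.g. scrapy shell 'http://localhost:8050/render.html?url=<target>&wait=0.5', as the next excerpt does.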

scrapy-splash active content selector works in shell but not with spider

柔情痞子 submitted on 2019-12-11 00:47:00
Question: I just started using scrapy-splash to retrieve the number of bookings from opentable.com. The following works fine in the shell: $ scrapy shell 'http://localhost:8050/render.html?url=https://www.opentable.com/new-york-restaurant-listings&timeout=10&wait=0.5' ... In [1]: response.css('div.booking::text').extract() Out[1]: ['Booked 59 times today', 'Booked 20 times today', 'Booked 17 times today', 'Booked 29 times today', 'Booked 29 times today', ... ] However, this simple spider returns an …
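A minimal sketch of the spider-side equivalent of that shell command, assuming the scrapy-splash project settings are configured; routing the page through SplashRequest (rather than a plain Request) is usually what makes the spider see the same rendered HTML that the shell saw. The spider and class names are illustrative:

```python
import scrapy
from scrapy_splash import SplashRequest

class OpenTableSpider(scrapy.Spider):
    name = 'opentable'

    def start_requests(self):
        # route the page through Splash with the same wait the shell URL used
        yield SplashRequest(
            'https://www.opentable.com/new-york-restaurant-listings',
            self.parse,
            args={'wait': 0.5},
        )

    def parse(self, response):
        for text in response.css('div.booking::text').extract():
            yield {'booking': text}
```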

Read cookies from Splash request

元气小坏坏 submitted on 2019-12-07 17:17:08
Question: I'm trying to access cookies after I've made a request using Splash. Below is how I've built the request. script = """ function main(splash) splash:init_cookies(splash.args.cookies) assert(splash:go{ splash.args.url, headers=splash.args.headers, http_method=splash.args.http_method, body=splash.args.body, }) assert(splash:wait(0.5)) local entries = splash:history() local last_response = entries[#entries].response return { url = splash:url(), headers = last_response.headers, http_status = last …
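A minimal sketch of one way to read the cookies, assuming the request goes through the execute endpoint: have the Lua script return splash:get_cookies() alongside the other fields, then read it from response.data in the callback. The callback name and the commented usage are illustrative, not taken from the question:

```python
from scrapy_splash import SplashRequest

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(0.5))
  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),  -- expose Splash's cookie jar to the callback
    html = splash:html(),
  }
end
"""

# inside the spider (illustrative):
# yield SplashRequest(url, self.parse_result, endpoint='execute',
#                     args={'lua_source': script})
#
# def parse_result(self, response):
#     cookies = response.data['cookies']   # list of cookie dicts returned by the script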

Scrapy Splash won't execute Lua script

半腔热情 submitted on 2019-12-07 08:00:52
Question: I have run across an issue in which my Lua script refuses to execute. The response returned from the ScrapyRequest call seems to be an HTML body, while I'm expecting a document title. I am assuming the Lua script is never being called, since it seems to have no apparent effect on the response. I have dug through the documentation a lot and can't quite figure out what is missing here. Does anyone have any suggestions? from urlparse import urljoin import scrapy from scrapy_splash …
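A hedged sketch of the usual culprit in this situation: if the request does not use the execute endpoint with the script passed as lua_source, Splash falls back to plain rendering and returns the page HTML instead of the script's result. Assuming that is what is happening here, the request would look roughly like this (spider name and URL are placeholders):

```python
import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
  assert(splash:go(splash.args.url))
  assert(splash:wait(0.5))
  return {title = splash:evaljs("document.title")}
end
"""

class TitleSpider(scrapy.Spider):
    name = 'title'
    start_urls = ['http://example.com']  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='execute',           # not the default render.html
                                args={'lua_source': script})

    def parse(self, response):
        # with endpoint='execute', the Lua return table is available as response.data
        yield {'title': response.data.get('title')}
```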

How to install python-gtk2, python-webkit and python-jswebkit on OSX

纵饮孤独 submitted on 2019-12-06 02:49:56
Question: I've read through many of the related questions but am still unclear how to do this, as there are many software combinations available and many solutions seem outdated. What is the best way to install the following in my virtual environment on OSX: python-gtk2, python-webkit, python-jswebkit? Do I also have to install GTK+ and WebKit? If so, how? I would also appreciate a simple explanation of how these pieces of software work together. (I'm trying to use scrapyjs, which requires these libraries.)

Adding a wait-for-element while performing a SplashRequest in Python Scrapy

走远了吗. submitted on 2019-11-29 07:52:34
Question: I am trying to scrape a few dynamic websites using Splash for Scrapy in Python. However, I see that Splash fails to wait for the complete page to load in certain cases. A brute-force way to tackle this problem was to add a large wait time (e.g. 5 seconds in the snippet below). However, this is extremely inefficient and still fails to load certain data (sometimes it takes longer than 5 seconds for the content to load). Is there some sort of a wait-for-element condition that can be put through these …
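Splash has no built-in wait-for-element argument, but a small Lua polling loop gives roughly that behaviour. A sketch, with the selector, poll interval, and retry count as placeholder values passed in through args:

```python
from scrapy_splash import SplashRequest

wait_for_element = """
function main(splash, args)
  assert(splash:go(args.url))
  -- poll for the element instead of using one large fixed wait
  for _ = 1, args.max_tries do
    local found = splash:evaljs(
      "document.querySelector('" .. args.css .. "') !== null")
    if found then break end
    assert(splash:wait(args.poll))
  end
  return splash:html()
end
"""

# usage inside start_requests (selector and timings are placeholders):
# yield SplashRequest(url, self.parse, endpoint='execute',
#                     args={'lua_source': wait_for_element,
#                           'css': 'div.content', 'max_tries': 20, 'poll': 0.25})
```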

How does scrapy-splash handle infinite scrolling?

隐身守侯 submitted on 2019-11-27 16:47:57
Question: I want to reverse engineer the content generated by scrolling down in the webpage. The problem is the URL: https://www.crowdfunder.com/user/following_page/80159?user_id=80159&limit=0&per_page=20&screwrand=933 . The screwrand parameter doesn't seem to follow any pattern, so reversing the URLs doesn't work. I'm considering automatic rendering using Splash. How can I use Splash to scroll like a browser does? Thanks a lot! Here is the code for the two requests: request1 = scrapy_splash.SplashRequest('https:/ …
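A hedged sketch of driving the scrolling from Lua instead of reverse engineering screwrand: scroll to the bottom repeatedly, waiting after each scroll so the page's scroll handler can load the next batch. The number of scrolls and the wait time are values to tune per site, and the commented request is illustrative:

```python
from scrapy_splash import SplashRequest

scroll_script = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(1.0))
  local scroll_to_bottom = splash:jsfunc(
    "function() { window.scrollTo(0, document.body.scrollHeight); }")
  for _ = 1, args.num_scrolls do
    scroll_to_bottom()
    assert(splash:wait(args.scroll_wait))
  end
  return splash:html()
end
"""

# e.g. for the page in the question:
# yield SplashRequest('https://www.crowdfunder.com/user/following_page/80159',
#                     self.parse, endpoint='execute',
#                     args={'lua_source': scroll_script,
#                           'num_scrolls': 10, 'scroll_wait': 1.0})
```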