scrapy-splash

Scrapy-splash - does splash:go(url) in lua_script perform GET request again?

Question: I'm new to scrapy-splash and I'm trying to scrape a lazy datatable, i.e. a table with AJAX pagination. I need to load the website, wait until the JavaScript has executed, get the HTML of the table, and then click the "Next" button in the pagination. My approach works, but I'm afraid I'm requesting the website twice: first when I yield the SplashRequest, and again when lua_script is executed. Is that true? If so, how can I make it perform the request just once?

    class JSSpider(scrapy.Spider):
        name = 'js…
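
A hedged sketch of the single-fetch pattern this question is after, assuming the execute endpoint, a placeholder URL, and a hypothetical a.next selector. The SplashRequest itself only talks to the Splash HTTP API; splash:go() inside the script is the one and only GET of the target site, so the page is not fetched twice.

    import scrapy
    from scrapy_splash import SplashRequest

    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))   -- the single GET of the target page
        assert(splash:wait(1.0))      -- let the AJAX table render
        local next_button = splash:select('a.next')  -- hypothetical selector
        if next_button then
            next_button:mouse_click()
            assert(splash:wait(1.0))
        end
        return {html = splash:html()}
    end
    """

    class JSSpider(scrapy.Spider):
        name = 'js_spider'  # hypothetical; the excerpt is truncated here

        def start_requests(self):
            yield SplashRequest(
                'https://example.com/table',  # placeholder URL
                self.parse,
                endpoint='execute',
                args={'lua_source': lua_script},
            )

        def parse(self, response):
            # With the default magic response, the 'html' key returned by the
            # script becomes the response body, so CSS selectors work as usual.
            yield {'rows': response.css('table tr').getall()}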

Scrapy-Splash Session Handling

Question: I have been trying to log in to a website and then crawl some URLs that are only accessible after signing in.

    def start_requests(self):
        script = """
        function main(splash)
            splash:init_cookies(splash.args.cookies)
            assert(splash:go(splash.args.url))
            splash:set_viewport_full()
            local search_input = splash:select('input[name=username]')
            search_input:send_text("MY_USERNAME")
            splash:evaljs("document.getElementById('password').value = 'MY_PASSWORD';")
            local submit_button = splash:select('input[name=signin]')…
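
One way to complete this pattern, sketched under assumptions (placeholder URLs, an assumed two-second post-login wait): the script must also return splash:get_cookies() so that scrapy-splash's SplashCookiesMiddleware can store the session and replay it, via splash.args.cookies, into every later request.

    import scrapy
    from scrapy_splash import SplashRequest

    login_script = """
    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(splash.args.url))
        splash:set_viewport_full()
        local search_input = splash:select('input[name=username]')
        search_input:send_text("MY_USERNAME")
        splash:evaljs("document.getElementById('password').value = 'MY_PASSWORD';")
        local submit_button = splash:select('input[name=signin]')
        submit_button:mouse_click()
        assert(splash:wait(2.0))  -- assumed time for the login redirect
        return {html = splash:html(), cookies = splash:get_cookies()}
    end
    """

    # A minimal script for the pages behind the login wall.
    fetch_script = """
    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(splash.args.url))
        return {html = splash:html(), cookies = splash:get_cookies()}
    end
    """

    class LoginSpider(scrapy.Spider):  # hypothetical spider name
        name = 'login'

        def start_requests(self):
            yield SplashRequest('https://example.com/login',  # placeholder URL
                                self.after_login, endpoint='execute',
                                args={'lua_source': login_script})

        def after_login(self, response):
            # The cookiejar kept by SplashCookiesMiddleware now holds the
            # session, so protected pages render as a signed-in user.
            yield SplashRequest('https://example.com/members-only',  # placeholder URL
                                self.parse_member, endpoint='execute',
                                args={'lua_source': fetch_script})

        def parse_member(self, response):
            yield {'title': response.css('title::text').get()}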

Splash lua script to do multiple clicks and visits

Question: I'm trying to crawl Google Scholar search results and get the BibTeX format of each result matching the search. Right now I have a Scrapy crawler with Splash. I have a Lua script which will click the "Cite" link and load up the modal window before getting the href of the BibTeX format of the citation. But seeing that there are multiple search results, and hence multiple "Cite" links, I need to click them all and load up the individual BibTeX pages. Here's what I have:

    import scrapy
    from…
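
The multi-click part can be sketched with splash:select_all(), which returns every matching element, so each "Cite" link can be clicked in turn and the modal's BibTeX href collected before moving on. The CSS selectors and the modal-closing call below are assumptions, not Google Scholar's actual markup.

    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(1.0))
        local hrefs = {}
        local cite_links = splash:select_all('a.cite-link')  -- hypothetical selector
        for _, link in ipairs(cite_links) do
            link:mouse_click()
            assert(splash:wait(1.0))  -- wait for the modal to load
            -- hypothetical selector for the BibTeX link inside the modal
            local href = splash:evaljs(
                "document.querySelector('a.bibtex-link') && document.querySelector('a.bibtex-link').href")
            if href then
                hrefs[#hrefs + 1] = href
            end
            -- close the modal before clicking the next "Cite" link (hypothetical selector)
            splash:runjs("document.querySelector('.modal-close').click()")
        end
        return {hrefs = hrefs}
    end
    """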

CrawlSpider with Splash getting stuck after first URL

Question: I'm writing a Scrapy spider where I need to render some of the responses with Splash. My spider is based on CrawlSpider. I need to render my start_url responses to feed my crawl spider. Unfortunately my crawl spider stops after rendering the first response. Any idea what is going wrong?

    class VideoSpider(CrawlSpider):
        start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']
        rules = (
            Rule(LinkExtractor(allow=()), callback='parse_items', process_request='use_splash'),
        )
        def use…
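
A likely culprit, offered as a hedged diagnosis: CrawlSpider._requests_to_follow() silently ignores any response that is not an HtmlResponse, and pages rendered through Splash come back as SplashJsonResponse or SplashTextResponse, so no links are extracted after the first rendered page. A common workaround is to override the method with a relaxed type check; the body below mirrors the CrawlSpider implementation in recent Scrapy releases, and older versions differ slightly.

    from scrapy.http import HtmlResponse
    from scrapy.spiders import CrawlSpider
    from scrapy_splash import SplashJsonResponse, SplashTextResponse

    class VideoSpider(CrawlSpider):
        # ... start_urls, rules and use_splash as in the question ...

        def _requests_to_follow(self, response):
            # Accept Splash responses in addition to plain HtmlResponse.
            if not isinstance(response, (HtmlResponse, SplashJsonResponse,
                                         SplashTextResponse)):
                return
            seen = set()
            for rule_index, rule in enumerate(self._rules):
                links = [lnk for lnk in rule.link_extractor.extract_links(response)
                         if lnk not in seen]
                for link in rule.process_links(links):
                    seen.add(link)
                    request = self._build_request(rule_index, link)
                    yield rule.process_request(request, response)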

Scrapy Shell and Scrapy Splash

Question: We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container. If we want to use Splash in the spider, we configure several required project settings and yield a Request specifying specific meta arguments:

    yield Request(url, self.parse_result, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,
                # 'url' is prefilled from request url
            },
            # optional parameters
            'endpoint': 'render.json',…
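
In the shell, the simplest way to get a Splash-rendered response is to fetch the Splash HTTP endpoint directly, with the target page wrapped as a URL parameter (a hedged sketch; localhost:8050 assumes the default Splash Docker port and example.com is a placeholder):

    $ scrapy shell 'http://localhost:8050/render.html?url=https://example.com&timeout=10&wait=0.5'
    ...
    >>> response.css('title::text').get()

Inside an already-open shell, the same round trip works with fetch(); quoting the target URL keeps its own query string from being parsed by Splash:

    >>> from urllib.parse import quote
    >>> fetch('http://localhost:8050/render.html?url=%s&wait=0.5' % quote('https://example.com', safe=''))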

Splash do not render the whole page

Question: I'd like to use Scrapy and Splash to grab some data, but unfortunately Splash does not seem to render the whole page. The page should look like this: (first screenshot), but instead it looks like this: (second screenshot), so some of the more important information is missing. I already tried increasing the waiting time, but this had no positive effect. Does anyone have an idea what I could do to make this work?

Answer 1: Take a look at the Splash FAQ, where common problems with page rendering are discussed. In particular, I've often seen problems with…
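
The FAQ remedies most often suggested, sketched as SplashRequest arguments (which one applies to this particular page is an assumption):

    yield SplashRequest(
        url,
        self.parse,
        args={
            'wait': 5,               # give late AJAX requests time to finish
            'viewport': 'full',      # render below-the-fold content too
            'timeout': 90,           # overall render budget
            'resource_timeout': 20,  # drop slow third-party resources
        },
    )

If the page relies on localStorage, another common FAQ fix is to start the Splash container with --disable-private-mode, or to set splash.private_mode_enabled = false at the top of a Lua script.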

Scrapy-splash not allowing infinite scroll to complete

Question: I am scraping a used-car dealer website that has some JavaScript on the car listing pages, hence using scrapy-splash. The car dealer webpages also have infinite scroll until all their cars are listed. The problem I am having is that on some occasions the code below does not let the infinite scroll continue to the end, and I am not sure why, so I miss some of the cars. I have reduced the concurrent requests right back to 1 in the settings file, and therefore I know that I at least start to…
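
One hedged explanation for a scroll that stops early is a loop that scrolls a fixed number of times instead of until the page stops growing. A sketch of a height-driven loop follows; the two-second waits are assumptions about how fast new listings load:

    lua_scroll = """
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(2.0))
        local get_height = "document.body.scrollHeight"
        local previous = 0
        local height = splash:evaljs(get_height)
        -- keep scrolling until the document height stops increasing
        while height > previous do
            previous = height
            splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
            assert(splash:wait(2.0))  -- let the next batch of cars load
            height = splash:evaljs(get_height)
        end
        return {html = splash:html()}
    end
    """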

Scrapy/Splash Click on a button then get content from new page in new window

Question: I'm facing a problem: when I click on a button, JavaScript handles the action and then redirects to a new page in a new window (similar to clicking an <a> with target="_blank"). In Scrapy/Splash I don't know how to get the content from the new page (that is, I don't know how to control that new window). Can anyone help?

    script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(0.5)
        local element = splash:select('div.result-content-columns div.result-title'…
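
Splash drives a single browser window, so a page opened via window.open or target="_blank" is never rendered. A common workaround, sketched here with an assumed one-second wait, is to force the navigation into the current window before clicking:

    script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(0.5)
        -- redirect window.open into the current window
        splash:runjs("window.open = function(url) { window.location = url; };")
        -- neutralise target=_blank in case the click follows a link
        splash:runjs([[
            var links = document.querySelectorAll('a[target=_blank]');
            for (var i = 0; i < links.length; i++) { links[i].removeAttribute('target'); }
        ]])
        local element = splash:select('div.result-content-columns div.result-title')
        if element then
            element:mouse_click()
            assert(splash:wait(1.0))  -- assumed time for the new page to load
        end
        return {html = splash:html(), url = splash:url()}
    end
    """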

scrapy-splash active content selector works in shell but not with spider

Question: I just started using scrapy-splash to retrieve the number of bookings from opentable.com. The following works fine in the shell:

    $ scrapy shell 'http://localhost:8050/render.html?url=https://www.opentable.com/new-york-restaurant-listings&timeout=10&wait=0.5'
    ...
    In [1]: response.css('div.booking::text').extract()
    Out[1]: ['Booked 59 times today', 'Booked 20 times today', 'Booked 17 times today',
             'Booked 29 times today', 'Booked 29 times today', ...]

However, this simple spider returns an…
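
A hedged sketch of the spider-side equivalent: the shell request baked wait=0.5 and timeout=10 into the render.html URL, so the spider has to pass the same rendering args explicitly, otherwise the booking counts may not have rendered by the time the response arrives.

    import scrapy
    from scrapy_splash import SplashRequest

    class OpentableSpider(scrapy.Spider):  # hypothetical spider name
        name = 'opentable'

        def start_requests(self):
            yield SplashRequest(
                'https://www.opentable.com/new-york-restaurant-listings',
                self.parse,
                endpoint='render.html',
                args={'wait': 0.5, 'timeout': 10},
            )

        def parse(self, response):
            yield {'bookings': response.css('div.booking::text').extract()}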