scrapy-splash

Scrapy-splash - does splash:go(url) in lua_script perform GET request again?

Question: I'm new to scrapy-splash and I'm trying to scrape a lazy datatable, i.e. a table with AJAX pagination. I need to load the website, wait until the JavaScript has executed, get the HTML of the table, and then click the "Next" button in the pagination. My approach works, but I'm afraid I'm requesting the website twice: first when I yield the SplashRequest, and again when lua_script is executed. Is that true? If so, how can I make it perform the request just once?

    class JSSpider(scrapy.Spider):
        name = 'js…
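
A hedged sketch of the single-fetch pattern this question is after, assuming the execute endpoint, a placeholder URL, and a hypothetical a.next selector. The SplashRequest itself only talks to the Splash HTTP API; splash:go() inside the script is the one and only GET of the target site, so the page is not fetched twice.

    import scrapy
    from scrapy_splash import SplashRequest

    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))   -- the single GET of the target page
        assert(splash:wait(1.0))      -- let the AJAX table render
        local next_button = splash:select('a.next')  -- hypothetical selector
        if next_button then
            next_button:mouse_click()
            assert(splash:wait(1.0))
        end
        return {html = splash:html()}
    end
    """

    class JSSpider(scrapy.Spider):
        name = 'js_spider'  # hypothetical; the excerpt is truncated here

        def start_requests(self):
            yield SplashRequest(
                'https://example.com/table',  # placeholder URL
                self.parse,
                endpoint='execute',
                args={'lua_source': lua_script},
            )

        def parse(self, response):
            # With the default magic response, the 'html' key returned by the
            # script becomes the response body, so CSS selectors work as usual.
            yield {'rows': response.css('table tr').getall()}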

Scrapy-Splash Session Handling

Question: I have been trying to log in to a website and then crawl some URLs that are only accessible after signing in.

    def start_requests(self):
        script = """
        function main(splash)
            splash:init_cookies(splash.args.cookies)
            assert(splash:go(splash.args.url))
            splash:set_viewport_full()
            local search_input = splash:select('input[name=username]')
            search_input:send_text("MY_USERNAME")
            splash:evaljs("document.getElementById('password').value = 'MY_PASSWORD';")
            local submit_button = splash:select('input[name=signin]')…
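
One way to complete this pattern, sketched under assumptions (placeholder URLs, an assumed two-second post-login wait): the script must also return splash:get_cookies() so that scrapy-splash's SplashCookiesMiddleware can store the session and replay it, via splash.args.cookies, into every later request.

    import scrapy
    from scrapy_splash import SplashRequest

    login_script = """
    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(splash.args.url))
        splash:set_viewport_full()
        local search_input = splash:select('input[name=username]')
        search_input:send_text("MY_USERNAME")
        splash:evaljs("document.getElementById('password').value = 'MY_PASSWORD';")
        local submit_button = splash:select('input[name=signin]')
        submit_button:mouse_click()
        assert(splash:wait(2.0))  -- assumed time for the login redirect
        return {html = splash:html(), cookies = splash:get_cookies()}
    end
    """

    # A minimal script for the pages behind the login wall.
    fetch_script = """
    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(splash.args.url))
        return {html = splash:html(), cookies = splash:get_cookies()}
    end
    """

    class LoginSpider(scrapy.Spider):  # hypothetical spider name
        name = 'login'

        def start_requests(self):
            yield SplashRequest('https://example.com/login',  # placeholder URL
                                self.after_login, endpoint='execute',
                                args={'lua_source': login_script})

        def after_login(self, response):
            # The cookiejar kept by SplashCookiesMiddleware now holds the
            # session, so protected pages render as a signed-in user.
            yield SplashRequest('https://example.com/members-only',  # placeholder URL
                                self.parse_member, endpoint='execute',
                                args={'lua_source': fetch_script})

        def parse_member(self, response):
            yield {'title': response.css('title::text').get()}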

Splash lua script to do multiple clicks and visits

Question: I'm trying to crawl Google Scholar search results and get the BibTeX format of each result matching the search. Right now I have a Scrapy crawler with Splash. I have a Lua script which will click the "Cite" link and load up the modal window before getting the href of the BibTeX format of the citation. But seeing that there are multiple search results, and hence multiple "Cite" links, I need to click them all and load up the individual BibTeX pages. Here's what I have:

    import scrapy
    from…
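
The multi-click part can be sketched with splash:select_all(), which returns every matching element, so each "Cite" link can be clicked in turn and the modal's BibTeX href collected before moving on. The CSS selectors and the modal-closing call below are assumptions, not Google Scholar's actual markup.

    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(1.0))
        local hrefs = {}
        local cite_links = splash:select_all('a.cite-link')  -- hypothetical selector
        for _, link in ipairs(cite_links) do
            link:mouse_click()
            assert(splash:wait(1.0))  -- wait for the modal to load
            -- hypothetical selector for the BibTeX link inside the modal
            local href = splash:evaljs(
                "document.querySelector('a.bibtex-link') && document.querySelector('a.bibtex-link').href")
            if href then
                hrefs[#hrefs + 1] = href
            end
            -- close the modal before clicking the next "Cite" link (hypothetical selector)
            splash:runjs("document.querySelector('.modal-close').click()")
        end
        return {hrefs = hrefs}
    end
    """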

CrawlSpider with Splash getting stuck after first URL

Question: I'm writing a Scrapy spider where I need to render some of the responses with Splash. My spider is based on CrawlSpider. I need to render my start_url responses to feed my crawl spider. Unfortunately my crawl spider stops after rendering the first response. Any idea what is going wrong?

    class VideoSpider(CrawlSpider):
        start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']
        rules = (
            Rule(LinkExtractor(allow=()), callback='parse_items', process_request='use_splash'),
        )
        def use…
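
A likely culprit, offered as a hedged diagnosis: CrawlSpider._requests_to_follow() silently ignores any response that is not an HtmlResponse, and pages rendered through Splash come back as SplashJsonResponse or SplashTextResponse, so no links are extracted after the first rendered page. A common workaround is to override the method with a relaxed type check; the body below mirrors the CrawlSpider implementation in recent Scrapy releases, and older versions differ slightly.

    from scrapy.http import HtmlResponse
    from scrapy.spiders import CrawlSpider
    from scrapy_splash import SplashJsonResponse, SplashTextResponse

    class VideoSpider(CrawlSpider):
        # ... start_urls, rules and use_splash as in the question ...

        def _requests_to_follow(self, response):
            # Accept Splash responses in addition to plain HtmlResponse.
            if not isinstance(response, (HtmlResponse, SplashJsonResponse,
                                         SplashTextResponse)):
                return
            seen = set()
            for rule_index, rule in enumerate(self._rules):
                links = [lnk for lnk in rule.link_extractor.extract_links(response)
                         if lnk not in seen]
                for link in rule.process_links(links):
                    seen.add(link)
                    request = self._build_request(rule_index, link)
                    yield rule.process_request(request, response)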

Scrapy Shell and Scrapy Splash

Question: We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container. If we want to use Splash in the spider, we configure several required project settings and yield a Request specifying specific meta arguments:

    yield Request(url, self.parse_result, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,
                # 'url' is prefilled from request url
            },
            # optional parameters
            'endpoint': 'render.json',…
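
In the shell, the simplest way to get a Splash-rendered response is to fetch the Splash HTTP endpoint directly, with the target page wrapped as a URL parameter (a hedged sketch; localhost:8050 assumes the default Splash Docker port and example.com is a placeholder):

    $ scrapy shell 'http://localhost:8050/render.html?url=https://example.com&timeout=10&wait=0.5'
    ...
    >>> response.css('title::text').get()

Inside an already-open shell, the same round trip works with fetch(); quoting the target URL keeps its own query string from being parsed by Splash:

    >>> from urllib.parse import quote
    >>> fetch('http://localhost:8050/render.html?url=%s&wait=0.5' % quote('https://example.com', safe=''))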

Splash do not render the whole page

Question: I'd like to use Scrapy and Splash to grab some data, but unfortunately Splash does not seem to render the whole page. The page should look like this: (first screenshot), but instead it looks like this: (second screenshot), so some of the more important information is missing. I already tried increasing the waiting time, but this had no positive effect. Does anyone have an idea what I could do to make this work?

Answer 1: Take a look at the Splash FAQ, where common problems with page rendering are discussed. In particular, I've often seen problems with…
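
The FAQ remedies most often suggested, sketched as SplashRequest arguments (which one applies to this particular page is an assumption):

    yield SplashRequest(
        url,
        self.parse,
        args={
            'wait': 5,               # give late AJAX requests time to finish
            'viewport': 'full',      # render below-the-fold content too
            'timeout': 90,           # overall render budget
            'resource_timeout': 20,  # drop slow third-party resources
        },
    )

If the page relies on localStorage, another common FAQ fix is to start the Splash container with --disable-private-mode, or to set splash.private_mode_enabled = false at the top of a Lua script.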

Scrapy-splash not allowing infinite scroll to complete

Question: I am scraping a used-car dealer website that has some JavaScript on the car listing pages, hence using scrapy-splash. The car dealer webpages also have infinite scroll until all their cars are listed. The problem I am having is that on some occasions the code below does not let the infinite scroll continue to the end, and I am not sure why, so I miss some of the cars. I have reduced the concurrent requests right back to 1 in the settings file, and therefore I know that I at least start to…
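
One hedged explanation for a scroll that stops early is a loop that scrolls a fixed number of times instead of until the page stops growing. A sketch of a height-driven loop follows; the two-second waits are assumptions about how fast new listings load:

    lua_scroll = """
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(2.0))
        local get_height = "document.body.scrollHeight"
        local previous = 0
        local height = splash:evaljs(get_height)
        -- keep scrolling until the document height stops increasing
        while height > previous do
            previous = height
            splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
            assert(splash:wait(2.0))  -- let the next batch of cars load
            height = splash:evaljs(get_height)
        end
        return {html = splash:html()}
    end
    """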

Scrapy/Splash Click on a button then get content from new page in new window

Question: I'm facing a problem: when I click on a button, JavaScript handles the action and then redirects to a new page in a new window (similar to clicking an <a> with target="_blank"). In Scrapy/Splash I don't know how to get the content from the new page (that is, I don't know how to control that new window). Can anyone help?

    script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(0.5)
        local element = splash:select('div.result-content-columns div.result-title'…
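
Splash drives a single browser window, so a page opened via window.open or target="_blank" is never rendered. A common workaround, sketched here with an assumed one-second wait, is to force the navigation into the current window before clicking:

    script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(0.5)
        -- redirect window.open into the current window
        splash:runjs("window.open = function(url) { window.location = url; };")
        -- neutralise target=_blank in case the click follows a link
        splash:runjs([[
            var links = document.querySelectorAll('a[target=_blank]');
            for (var i = 0; i < links.length; i++) { links[i].removeAttribute('target'); }
        ]])
        local element = splash:select('div.result-content-columns div.result-title')
        if element then
            element:mouse_click()
            assert(splash:wait(1.0))  -- assumed time for the new page to load
        end
        return {html = splash:html(), url = splash:url()}
    end
    """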

scrapy-splash active content selector works in shell but not with spider

Question: I just started using scrapy-splash to retrieve the number of bookings from opentable.com. The following works fine in the shell:

    $ scrapy shell 'http://localhost:8050/render.html?url=https://www.opentable.com/new-york-restaurant-listings&timeout=10&wait=0.5'
    ...
    In [1]: response.css('div.booking::text').extract()
    Out[1]: ['Booked 59 times today', 'Booked 20 times today', 'Booked 17 times today',
             'Booked 29 times today', 'Booked 29 times today', ...]

However, this simple spider returns an…
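
A hedged sketch of the spider-side equivalent: the shell request baked wait=0.5 and timeout=10 into the render.html URL, so the spider has to pass the same rendering args explicitly, otherwise the booking counts may not have rendered by the time the response arrives.

    import scrapy
    from scrapy_splash import SplashRequest

    class OpentableSpider(scrapy.Spider):  # hypothetical spider name
        name = 'opentable'

        def start_requests(self):
            yield SplashRequest(
                'https://www.opentable.com/new-york-restaurant-listings',
                self.parse,
                endpoint='render.html',
                args={'wait': 0.5, 'timeout': 10},
            )

        def parse(self, response):
            yield {'bookings': response.css('div.booking::text').extract()}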