headless-browser

How to scrape javascript injected image src and alt with phantom.js?

£可爱£侵袭症+ submitted on 2019-12-14 03:12:31
Question: I'm using the following script to scrape images using PhantomJS: var page = require('webpage').create(); var url = 'https://www.everlane.com/collections/mens-luxury-tees/products/mens-crew-antique'; page.open(url, function(status) { if (status !== 'success') { console.log('error'); phantom.exit(); return; } var a = page.evaluate(function() { return document.getElementsByTagName('img'); }); var SrcAlt = []; for (var i=0; i<a.length; i++){ var src = a[i].getAttribute('src'); var alt = a[i].getAttribute
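One likely cause, offered as an assumption because the question is cut off above: PhantomJS only passes JSON-serializable values out of page.evaluate, so the collection returned by getElementsByTagName does not arrive as usable DOM elements and a[i].getAttribute fails. A minimal sketch that reads src and alt inside the page and returns plain objects instead. The helper name collectSrcAlt and the fixed 3-second wait for injected images are illustrative choices, not part of the original script.

```javascript
// Runs inside the page: copies src/alt into plain objects, because only
// JSON-serializable values survive the trip out of page.evaluate().
function collectSrcAlt() {
  var imgs = document.getElementsByTagName('img');
  var out = [];
  for (var i = 0; i < imgs.length; i++) {
    out.push({
      src: imgs[i].getAttribute('src'),
      alt: imgs[i].getAttribute('alt')
    });
  }
  return out;
}

// Only executes under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var url = 'https://www.everlane.com/collections/mens-luxury-tees/products/mens-crew-antique';
  page.open(url, function (status) {
    if (status !== 'success') {
      console.log('error');
      phantom.exit(1);
      return;
    }
    // Assumption: a fixed 3 s delay gives script-injected images time to appear.
    setTimeout(function () {
      var srcAlt = page.evaluate(collectSrcAlt);
      console.log(JSON.stringify(srcAlt, null, 2));
      phantom.exit();
    }, 3000);
  });
}
```

If the images still come back without a src, the delay may simply be too short for this page; polling until the list stops growing would be a sturdier variant.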

CasperJS cannot set window.navigator object

谁说胖子不能爱 submitted on 2019-12-13 20:13:49
Question: I'm trying to scrape a web page with CasperJS. The page checks whether the browser is IE 6/7. Passing a userAgent with CasperJS doesn't seem to satisfy its condition. UserAgent passed: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1). The following is the check the page makes to determine the browser: agt = navigator.userAgent.toLowerCase(); browserType = navigator.appName; if( ((browserType.indexOf("xplorer") != -1) && (agt.indexOf("msie 6.") != -1)) || ((browserType.indexOf("xplorer")
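The visible part of the check reads navigator.appName as well as navigator.userAgent, so changing only the user agent is probably not enough: WebKit-based engines such as the one behind CasperJS typically report appName as "Netscape". A small sketch of the first visible clause, run against two navigator-like plain objects. The function name passesFirstClause and both sample objects are illustrative, and the rest of the page's condition is truncated in the excerpt, so it is not reproduced here.

```javascript
// First clause of the page's check, applied to any navigator-like object.
function passesFirstClause(nav) {
  var agt = nav.userAgent.toLowerCase();
  var browserType = nav.appName;
  return browserType.indexOf('xplorer') !== -1 && agt.indexOf('msie 6.') !== -1;
}

var ua = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)';

// Assumption: appName stays at the WebKit default when only the UA is changed.
var onlyUserAgentSpoofed = { userAgent: ua, appName: 'Netscape' };
var appNameAlsoSpoofed = { userAgent: ua, appName: 'Microsoft Internet Explorer' };

console.log(passesFirstClause(onlyUserAgentSpoofed)); // false
console.log(passesFirstClause(appNameAlsoSpoofed));   // true
```

Under that assumption, the page's condition can only pass if navigator.appName is also overridden inside the page before its own scripts run.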

How to upload file in headless browser using robot class in selenium java

浪尽此生 submitted on 2019-12-13 18:31:46
Question: How can I upload a file in a headless browser using the Robot class in Selenium Java? The sendKeys() method is not working in my case. I am using Firefox and Selenium WebDriver (Java) for my script. Answer 1: There is no need to use the Robot class to upload a file with Selenium Java. First, place your files in the /tmp folder on Linux or the temp folder on Windows, then use the code below to upload them: String path = FILE_UPLOAD_PATH; /* full path, including the file name, from the /tmp folder */ driver.findElement

Need headless browser for Armv7 linux processor

点点圈 submitted on 2019-12-13 03:54:41
Question: I need a headless browser for web scraping. Recently I tried three different headless browsers (PhantomJS, Firefox, Chrome). PhantomJS gives an error saying that the Armv7 processor needs a GUI. Firefox with geckodriver then showed errors about the path and a refused connection. So I moved to headless Chrome with chromedriver, but it shows the same errors as Firefox. I need a working headless browser for an Armv7 processor. Can anyone suggest a solution for that or any

How can I pause and wait for user input with Puppeteer?

人盡茶涼 submitted on 2019-12-12 18:31:07
Question: I need to make Puppeteer pause and wait for the user to enter a username and password before continuing. It is a Node.js 8.12.0 app. (async () => { const browser = await puppeteer.launch({headless: false}); const page = await browser.newPage(); await page.goto('https://www.myweb.com/login/'); //code to wait for user to enter username and password, and click `login` const first_page = await page.content(); //do something await browser.close(); })(); Basically the program is halted and waits until the
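One possible approach, assuming the login click triggers a page navigation: with headless: false the user can type into the visible browser window while the script waits on page.waitForNavigation with its timeout disabled (timeout: 0). The names waitForManualLogin and main are hypothetical; main takes the puppeteer module as a parameter so the sketch has no hard dependency when it is loaded.

```javascript
// Resolves with the page HTML once the user-triggered navigation has finished.
async function waitForManualLogin(page) {
  // timeout: 0 disables Puppeteer's default navigation timeout,
  // so the script waits for as long as the user needs.
  await page.waitForNavigation({ timeout: 0 });
  return page.content();
}

async function main(puppeteer) {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://www.myweb.com/login/');

  // The user fills in the form and clicks `login` in the browser window.
  const firstPage = await waitForManualLogin(page);
  console.log('page length after login:', firstPage.length);

  await browser.close();
}

// Usage (requires the puppeteer package to be installed):
// main(require('puppeteer')).catch(console.error);
```

If the site logs in without a full navigation (for example through an XHR), waiting for a post-login element with page.waitForSelector and timeout: 0 would be the analogous choice.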

Running chrome headless on linux without xorg

只愿长相守 submitted on 2019-12-12 10:42:28
Question: Is it possible to install and run headless Chrome on a headless Linux box without installing the audio and Xorg dependencies? If not, is there a special headless build of Chrome/Chromium that doesn't pull in the Xorg and audio libs? Answer 1: This troubleshooting doc for Puppeteer should be of some help (https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md); it outlines all the packages necessary for running Chrome on a Linux machine (more specifically, for web servers).

How can I get Firebug to match HtmlUnitDriver's pageSource report?

…衆ロ難τιáo~ submitted on 2019-12-12 03:14:21
Question: I'm using Java with the Selenium library to scrape a webpage. When I use Firebug on the page in Firefox, I can see that the page's source contains the following HTML structure: <div> <div> <table> <caption /> <thead /> <tbody /> </table> </div> </div> However, when I programmatically download the page's source using HtmlUnitDriver and then call driver.getPageSource(), I see that the corresponding HTML structure has changed to: <div> <table> <caption /> <tbody /> </table> </div> Why does the

Limit chrome headless CPU and memory usage

≡放荡痞女 submitted on 2019-12-11 17:19:26
Question: I am using Selenium to run headless Chrome with the following command: system "LC_ALL=C google-chrome --headless --enable-logging --hide-scrollbars --remote-debugging-port=#{debug_port} --remote-debugging-address=0.0.0.0 --disable-gpu --no-sandbox --ignore-certificate-errors &" However, headless Chrome appears to consume too much memory and CPU. Does anyone know how to limit its CPU/memory usage, or is there a workaround? Thanks in advance. Answer 1: There had been a

Developing scraping script on docker image - how to overcome lack of visual browser?

倖福魔咒の submitted on 2019-12-11 15:29:42
Question: I want to scrape info from the web. A previous attempt taught me that Docker would have been useful for running my script: I develop the script on Mac OS X and then run it on a VM, often Ubuntu, where it often won't run because the dependencies don't exist on Ubuntu and have proven difficult to build. Docker overcomes the dependency issue, but this leads me to a different problem: I need to develop the script in non-headless mode on the Docker image to see what it's doing (or at

Watir-Webdriver Frame Attributes Not Congruent with Other Sources

微笑、不失礼 submitted on 2019-12-11 14:32:35
Question: I have an issue where, if I return some attributes of a frame, they do not match those shown in Firebug, for example. This matters because I am looking for a way to identify the purpose of a frame. For example, www.cnet.com loads 19 frames in total, and some of these are HTML with JavaScript. I want to inspect some of the frames but not all. Using Firebug I see some interesting attributes of the frame, and I want to filter the frames based on some of these attributes. I have the following