screen-scraping

What's a good tool to screen-scrape with JavaScript support? [closed]

左心房为你撑大大i · submitted on 2020-03-05 21:09:01
Question: Is there a good test suite or tool set that can automate website navigation -- with JavaScript support -- and collect the HTML from the pages? Of course I can scrape straight HTML with BeautifulSoup, but that does me no good for sites that require JavaScript. :)

Answer 1: You could use Selenium or Watir to drive a …
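A minimal sketch in Python of the Selenium route the answer suggests. The URL, and the presence of a matching chromedriver on the PATH, are assumptions; BeautifulSoup handles the parsing step, as in the question:

```python
# Sketch: drive a real (headless) browser so the page's JavaScript executes,
# then parse the rendered HTML with BeautifulSoup. The example URL and the
# availability of selenium + chromedriver are assumptions.
from bs4 import BeautifulSoup

def parse_title(html):
    """Parse already-rendered HTML with BeautifulSoup."""
    return BeautifulSoup(html, "html.parser").title.get_text()

def rendered_html(url):
    """Fetch a page with a headless browser so its JavaScript runs first."""
    from selenium import webdriver   # imported here: needs selenium installed
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source    # HTML *after* JavaScript has executed
    finally:
        driver.quit()

# The parsing step is the same whether the HTML came from a browser or not:
print(parse_title("<html><head><title>Demo</title></head></html>"))
```

The point of the browser detour is that `driver.page_source` returns the DOM after scripts have run, which a plain HTTP fetch never sees.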

How can I retrieve and parse just the HTML returned from a URL?

眉间皱痕 · submitted on 2020-02-25 05:15:25
Question: I want to be able to programmatically (without it displaying in the browser) send a URL such as http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=platypi&sprefix=platypi%2Caps&rh=i%3Aaps%2Ck%3Aplatypi and get back in a string (or some more appropriate data type?) the HTML results of the page (the interesting part, anyway), so that I could parse it and reformat selected parts as matched text and images (which link to the appropriate page). I want to do …
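Fetching a page's HTML into a string is a one-function job in Python's standard library. A sketch (the browser-like User-Agent is an assumption; some sites reject the default one):

```python
# Sketch: fetch a URL's HTML into a plain string, no browser involved.
# Standard library only; the User-Agent header is an assumption, since
# some sites refuse requests that don't look like a browser.
from urllib.request import Request, urlopen

def fetch_html(url, timeout=10):
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        # Decode using the charset the server declared, if any.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

# html = fetch_html("https://example.com/")  # the whole page as one str
```

Once the HTML is in a string, it can be handed to any parser (BeautifulSoup, lxml, an HTMLParser subclass) to pick out the interesting parts.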

Image scraping in Ruby

一曲冷凌霜 · submitted on 2020-02-22 07:03:14
Question: How do I scrape an image present at a particular URL using Nokogiri? If there are better options than Nokogiri, please suggest them. The CSS selector for the image is .profilePic img.

Answer 1: If it is just an <img> with a URL:

```ruby
require 'nokogiri'
require 'open-uri'

PAGE = "http://site.com/page.html"

html = Nokogiri.HTML(open(PAGE))
src  = html.at('.profilePic img')['src']

File.open("foo.png", "wb") do |f|
  f.write(open(src).read)
end
```

If you need to turn a relative image path into an absolute one, see: https:/ …

Rvest html_table error - Error in out[j + k, ] : subscript out of bounds

 ̄綄美尐妖づ · submitted on 2020-02-02 03:17:12
Question: I'm somewhat new to scraping with R, but I'm getting an error message that I can't make sense of. My code:

```r
url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
leg <- read_html(url)

testdata <- leg %>%
  html_nodes('table') %>%
  .[6] %>%
  html_table()
```

To which I get the response:

Error in out[j + k, ] : subscript out of bounds

When I swap out html_table with html_text I don't get the error. Any idea what I'm doing wrong? Thanks!

Answer 1: Hope this helps! …

Interpreting JavaScript in PHP

ぐ巨炮叔叔 · submitted on 2020-01-31 22:59:50
Question: I'd like to be able to run JavaScript and get the results with PHP, and am wondering if there is a library for PHP that allows me to parse it out. My first thought was to use node.js, but since node.js has access to sockets, files, and so on, I think I'd prefer to avoid it. Rationale: I'm doing screen scraping in PHP and have encountered many scenarios where the data is produced by JavaScript on the frontend, and I would like to avoid writing specialized filtering functions to act on …

Scraping in Python - Preventing IP ban

我的未来我决定 · submitted on 2020-01-30 13:50:07
Question: I am using Python to scrape pages. Until now I haven't had any complicated issues. The site that I'm trying to scrape uses a lot of security checks and has some mechanism to prevent scraping. Using Requests and lxml I was able to scrape about 100-150 pages before getting banned by IP. Sometimes I even get banned on the first request (new IP, not used before, different C block). I have tried spoofing headers and randomizing the time between requests, still the same. I have tried with Selenium and I got …
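The question already covers the two usual first steps, header spoofing and randomized delays. A sketch of those mechanics in Python, standard library only (the header values, delay bounds, and URL are assumptions, and they are no guarantee against a ban; sites that fingerprint more aggressively need proxies or a real browser):

```python
# Sketch of "polite scraping" mechanics: browser-like headers plus a
# randomized pause between requests so timing doesn't look machine-regular.
# Header values, delay bounds, and the URL are assumptions for illustration.
import random
import time
import urllib.request

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept-Language": "en-US,en;q=0.9",
}

def polite_delay(base=2.0, jitter=3.0):
    """Return a wait of base..base+jitter seconds, uniformly random."""
    return base + random.uniform(0, jitter)

def fetch(url):
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req, timeout=15) as resp:
        return resp.read()

# Typical loop:
# for url in urls:
#     page = fetch(url)
#     time.sleep(polite_delay())
```

When this still gets the IP banned on the first request, as the question describes, the block is usually keyed on something other than headers or timing (TLS fingerprint, missing cookies, IP reputation), which throttling alone cannot fix.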