What's a good tool to screen-scrape with Javascript support? [closed]

Submitted by 左心房为你撑大大i on 2020-03-05 21:09:01

Question


Is there a good test suite or tool set that can automate website navigation -- with Javascript support -- and collect the HTML from the pages?

Of course I can scrape straight HTML with BeautifulSoup. But this does me no good for sites that require Javascript. :)


Answer 1:


You could use Selenium or Watir to drive a real browser.

There are also some JavaScript-based headless browsers:

  • PhantomJS is a headless WebKit browser.
    • pjscrape is a scraping framework based on PhantomJS and jQuery.
    • CasperJS is a navigation scripting & testing utility based on PhantomJS, if you need to do a little more than point at URLs to be scraped.
  • Zombie for Node.js

Personally, I'm most familiar with Selenium, which supports writing automation scripts in a good number of languages and has more mature tooling, such as the excellent Selenium IDE extension for Firefox, which can be used to write and run test cases and can export test scripts to many languages.
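
As a rough illustration of the Selenium route, here is a minimal Python sketch. It assumes the third-party `selenium` package plus Firefox and its driver are installed. `fetch_rendered_html` and its `driver` parameter are hypothetical names chosen for this example, not part of Selenium itself.

```python
def fetch_rendered_html(url, driver=None):
    """Load `url` in a browser and return the page HTML as the browser sees it.

    `driver` may be any object exposing Selenium's WebDriver interface
    (get, page_source, quit). When omitted, a Firefox instance is started,
    which assumes selenium and a Firefox driver are available.
    """
    owns_driver = driver is None
    if owns_driver:
        # Imported lazily so the helper can be tried with a stand-in driver.
        from selenium import webdriver
        driver = webdriver.Firefox()
    try:
        driver.get(url)            # navigate to the page
        return driver.page_source  # HTML of the current page, as reported by the browser
    finally:
        if owns_driver:
            driver.quit()          # close only the browser we started ourselves
```

Calling `fetch_rendered_html("http://example.com/")` should hand back markup you can pass to BeautifulSoup. Pages that fill in content asynchronously may need an explicit wait (for example Selenium's `WebDriverWait`) before `page_source` is read.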




Answer 2:


Using HtmlUnit is also a possibility.

HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.

It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.

It is typically used for testing purposes or to retrieve information from web sites.




Answer 3:


Selenium now wraps HtmlUnit, so you don't need to start a browser anymore. The new WebDriver API is very easy to use too. The first example uses the HtmlUnit driver.




Answer 4:


It would be very difficult to code a solution that would work with any arbitrary site out there. Each navigation menu implementation can be quite unique. I've worked a great deal with scrapers, and, provided you know the site you wish to target, here is how I'd approach it.

Usually, if you analyze the particular javascript used in a nav menu, it is fairly easy to use regular expressions to pull out the entire set of variables that are used to build the navmenu. I have never used Beautiful Soup, but from your description it sounds like it may only work on HTML elements and not be able to work inside the script tags.
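
To make the regular-expression idea concrete, here is a small Python sketch using only the standard library. The page content, the variable name `navItems`, and the helper `extract_nav_items` are all invented for illustration. It also assumes the JavaScript literal happens to be valid JSON, which will not hold for every site.

```python
import json
import re

# Hypothetical page: the nav menu is built from an array assigned in a script tag.
SAMPLE_HTML = """
<html><body>
<script type="text/javascript">
var navItems = [{"title": "Home", "url": "/"}, {"title": "Docs", "url": "/docs"}];
buildMenu(navItems);
</script>
</body></html>
"""

# Capture the array literal assigned to `navItems`, up to the closing "];".
NAV_PATTERN = re.compile(r"var\s+navItems\s*=\s*(\[.*?\]);", re.DOTALL)


def extract_nav_items(html):
    """Return the menu entries embedded in the page's script, or [] if absent."""
    match = NAV_PATTERN.search(html)
    if match is None:
        return []
    # Only works when the JS literal is also valid JSON (an assumption here).
    return json.loads(match.group(1))


for item in extract_nav_items(SAMPLE_HTML):
    print(item["title"], item["url"])
```

The pattern has to be tailored to each target site, which is why this approach only pays off when you know the site in advance.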

If you're still having problems, or need to emulate some form POSTs or AJAX, get Firefox and install the LiveHttpHeaders plugin. This plugin will allow you to manually browse the site and capture the URLs being navigated, along with any cookies that are being passed during your manual browsing. That is what your scraper bot needs to send in a request to get a valid response from the target web server(s). This will also capture any AJAX calls being made, and in many cases the same AJAX calls must be implemented in your scraper to get your desired responses.
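
Once the URLs, form fields and cookies have been captured, replaying them might look roughly like the sketch below. It uses only Python's standard library. The helper name `build_replay_request`, the example URL, field names and cookie values are all placeholders, and the `X-Requested-With` header is an assumption that applies only if the captured request carried it.

```python
import urllib.parse
import urllib.request


def build_replay_request(url, form_fields, cookies, user_agent="Mozilla/5.0"):
    """Build (but do not send) a POST request mirroring a captured browser request."""
    # Encode the form fields the same way a browser form submission would.
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    # Join the captured cookies into a single Cookie header value.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    headers = {
        "Cookie": cookie_header,
        "User-Agent": user_agent,
        # Many AJAX libraries send this; keep it only if the capture showed it.
        "X-Requested-With": "XMLHttpRequest",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder values standing in for what the header capture showed.
request = build_replay_request(
    "http://example.com/search",
    {"q": "widgets", "page": "2"},
    {"sessionid": "abc123"},
)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen(request)` would actually send it. Building the request separately makes it easy to compare against the captured headers first.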




Answer 5:


Mozenda is a great tool to use as well.




Answer 6:


You can try the open source screen scraper from Scrape.it.

Update: As of April 4th, 2013, the Scrape.it Screen Scraper is open source on GitHub.




Answer 7:


Keep in mind that any JavaScript fanciness is messing with the browser's internal DOM model of the page, and does nothing to the raw HTML.
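
A small standard-library Python sketch of that distinction follows. The page, the `results` id, and the `DivTextCollector` and `results_text` names are made up for illustration. A plain HTML parser sees only the markup as served, so an element that a script would fill in comes back empty.

```python
from html.parser import HTMLParser

# Hypothetical page as served: the div is empty until the script runs in a browser.
RAW_HTML = """
<html><body>
<div id="results"></div>
<script>
document.getElementById("results").innerHTML = "<p>42 items</p>";
</script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collect text found inside <div id="results"> in the markup itself."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "results":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text.append(data)


def results_text(html):
    """Return the text inside the results div, based on the raw markup only."""
    parser = DivTextCollector()
    parser.feed(html)
    return "".join(parser.text).strip()


print(repr(results_text(RAW_HTML)))
```

The script is never executed here, so the text it would insert never appears. Getting at it requires something that runs the JavaScript, such as a real or headless browser.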




Answer 8:


I've been using Selenium for this and I find that it works great. Selenium runs in the browser and works with Firefox, WebKit and IE. http://selenium.openqa.org/




Answer 9:


@insin Watir is not IE only.

https://stackoverflow.com/questions/81566#83387



Source: https://stackoverflow.com/questions/5272338/is-there-a-python-library-that-allows-you-to-screen-scrape-a-web-site-that-relie
