Capybara + PhantomJS
My favorite Ruby-controlled headless browser is PhantomJS, a headless WebKit-based browser. It is driven from Ruby through Poltergeist, a Capybara driver that ships as a separate gem.
In summary, the stack looks like this:
Capybara -> Poltergeist -> PhantomJS -> WebKit
Note that you can use PhantomJS directly with selenium-webdriver, but the Capybara API is nicer (IMHO).
Being a minimal WebKit implementation, PhantomJS has a faster startup time than a full browser like Chrome or IE.
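Sample code to scrape Google result links:
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'

Capybara.default_driver = :poltergeist
Capybara.run_server = false   # scraping a remote site, so no local app server
Capybara.app_host = 'http://www.google.com'

module Test
  class Google
    include Capybara::DSL

    # Searches for "Capybara" and prints each result link (selectors match Google's markup at the time of writing).
    def get_results
      visit('/')
      fill_in "q", :with => "Capybara"
      click_button "Google Search"
      all(:xpath, "//li[@class='g']/h3/a").each { |a| puts a[:href] }
    end
  end
end

scraper = Test::Google.new
scraper.get_results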
In addition to the standard Capybara features, Poltergeist can do some very convenient things:
- Inject and run your own JavaScript with page.evaluate_script and page.execute_script
- page.within_frame and page.within_window
- page.status_code and page.response_headers
- page.save_screenshot <- This is really nice when things go wrong!
- page.driver.render_base64(format, options)
- page.driver.scroll_to(left, top)
- page.driver.basic_authorize(user, password)
- element.native.send_keys(*keys)
- cookie handling
- drag-and-drop
These features are listed on the Poltergeist GitHub page: https://github.com/teampoltergeist/poltergeist.
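As a rough sketch of how a few of these calls fit together, the helper below gathers some diagnostics from a session when something goes wrong. The method name page_diagnostics, the hash keys, and the default file name are my own choices for illustration, not part of Capybara or Poltergeist; it only relies on status_code, response_headers, evaluate_script and save_screenshot being available on the object you pass in.

```ruby
# Hypothetical helper (not part of Capybara or Poltergeist): collects a few
# diagnostics from any object that responds to the four session methods below.
def page_diagnostics(session, screenshot_path = "failure.png")
  {
    status:       session.status_code,                       # HTTP status of the last response
    content_type: session.response_headers["Content-Type"],  # one response header
    title:        session.evaluate_script("document.title"), # value computed inside the page
    screenshot:   session.save_screenshot(screenshot_path)   # writes an image of the page
  }
end
```

Inside a class that includes Capybara::DSL, you would call it as page_diagnostics(page), for example from a rescue block around a scraping step.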
Celerity
If you really want to eke out as much performance as possible, and don't mind switching to JRuby to do so, I have found Celerity to be super fast.
Celerity is a wrapper around Java's HtmlUnit. It is speedy because HtmlUnit is not a full browser; it is more of an emulator that executes JavaScript. The downside is that it doesn't support all the JavaScript that a full browser does, so it won't handle very JS-heavy sites, but it is sufficient for most sites and getting better all the time.
Another advantage is the multithreaded nature of JRuby. With the Peach (parallel each) gem, you can fire off many browsers in parallel. I have done this with a test suite in the past and drastically reduced the time it took to finish. In fact, we made a load tester using Celerity + Peach that was much more sophisticated than your typical JMeter, Grinder, ApacheBench, etc. It could really exercise our site!
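To give an idea of the parallel approach without depending on Peach or Celerity, here is a minimal sketch using plain Ruby threads. The names parallel_map and simulated_visit are hypothetical; simulated_visit just sleeps where a real script would drive a browser. On JRuby these threads run on native threads in parallel.

```ruby
# Runs the block once per item, each on its own thread, and returns the
# results in the same order as the input. A stand-in for a parallel each.
def parallel_map(items, &work)
  threads = items.map { |item| Thread.new(item) { |i| work.call(i) } }
  threads.map(&:value) # value waits for the thread and returns its result
end

# Hypothetical placeholder for "open a browser and load this path".
def simulated_visit(path)
  sleep 0.1
  "visited #{path}"
end

paths = ["/home", "/login", "/search"]
results = parallel_map(paths) { |path| simulated_visit(path) }
results.each { |line| puts line }
```

With Peach installed, the equivalent loop would be written with its parallel iterators on the array itself rather than a hand-rolled helper like this one.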