mechanize-ruby

Getting, visiting and limiting the number of links using Nokogiri and Mechanize?

╄→尐↘猪︶ㄣ submitted on 2020-01-06 08:18:13
Question: I am trying to scrape the five latest stories from CNN.com and retrieve their links along with the first paragraph of each story. I have this simple script:

    url = "http://edition.cnn.com/?refresh=1"
    agent = Mechanize.new
    agent.get("http://edition.cnn.com/?refresh=1").search("//div[@id='cnn_maintt2bul']/div/div/ul/li[count(*)=3]/a").each do |headline|
      article = headline.text
      link = URI.join(url, headline[:href]).to_s
      page = headline.click(link)
      paragraph1 = page.at_css(".adtag15090+ p").text
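
The link-resolution and "first five" parts of the question can be sketched without any network access. This is only a minimal illustration: the `hrefs` array below is hypothetical sample data standing in for whatever the XPath query returns, and only `URI.join` and the base URL come from the question.

```ruby
require "uri"

# Base URL from the question; the hrefs are made-up sample values.
base_url = "http://edition.cnn.com/?refresh=1"
hrefs = [
  "/2020/01/05/world/story-one/index.html",
  "/2020/01/05/world/story-two/index.html",
  "http://edition.cnn.com/2020/01/05/us/story-three/index.html",
  "/2020/01/05/us/story-four/index.html",
  "/2020/01/05/tech/story-five/index.html",
  "/2020/01/05/tech/story-six/index.html"
]

# Keep only the first five hrefs, then resolve each against the base URL.
latest_links = hrefs.first(5).map { |href| URI.join(base_url, href).to_s }

latest_links.each { |link| puts link }
```

With Mechanize, each resolved URL could then be fetched with `agent.get(link)`; the `headline` objects returned by `search` are Nokogiri nodes, which do not provide a `click` method.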

Mechanize on Ruby 1.9.3 encoding issue

﹥>﹥吖頭↗ submitted on 2019-12-25 04:19:52
Question: Using the following code (from the Mechanize site, but in a rake task):

    namespace :ans do
      task :grab => :environment do
        a = Mechanize.new { |agent| agent.user_agent_alias = 'Mac Safari' }
        begin
          a.get('http://google.com/') do |page|
            search_result = page.form_with(:name => 'f') do |search|
              search.q = 'Hello world'
            end.submit
            search_result.links.each do |link|
              puts link.text
            end
          end
        end
      end
    end

I get an encoding error:

    rake aborted!
    "\x8B" from ASCII-8BIT to UTF-8

This is whilst using the
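
The error class behind that message can be reproduced in isolation. The sketch below is not a fix for the rake task (the question is cut off before its environment is described); it only shows, on a made-up byte string, how transcoding binary data to UTF-8 raises and how replacement options avoid the exception.

```ruby
# Binary bytes such as "\x8B" have no UTF-8 mapping, so a plain
# encode from ASCII-8BIT raises Encoding::UndefinedConversionError.
raw = "caf\x8B".b # .b returns a copy tagged as ASCII-8BIT (binary)

error_message = nil
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  error_message = e.message
end

# Alternative: replace unconvertible bytes instead of raising.
safe_text = raw.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")

puts error_message
puts safe_text
```

Replacing bytes hides the symptom only; whether that is acceptable depends on why the response body arrived as binary data in the first place.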

Is there any way to search and get the value of <a>.. </a>?

人走茶凉 submitted on 2019-12-24 12:44:28
Question: Suppose a webpage contains the values below:

    <td> <a href="https://www.test.com/test123/a.html"> test11 </a> </td>
    <td> <a href="https://www.test.com/test12333/r.html"> test12 </a> </td>
    <td> <a href="https://www.test.com/testaa123/t.html"> test21 </a> </td>
    <td> <a href="https://www.test.com/test123123/b.html"> test31 </a> </td>

Is there any way to find the value test21 using Ruby? Or is there any way to find the href values that contain the substring /testaa123/t.html?

Answer 1: Try out this tutorial
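
Both lookups in the question (by link text and by href substring) come down to filtering a list of links. The sketch below uses a hypothetical `Link` struct as a stand-in for scraped links so that it runs without Mechanize; the texts and URLs are the ones from the question.

```ruby
# Stand-in for scraped links (an assumption for this sketch); Mechanize's
# page.links yields objects that also respond to #text and #href.
Link = Struct.new(:text, :href)

links = [
  Link.new(" test11 ", "https://www.test.com/test123/a.html"),
  Link.new(" test12 ", "https://www.test.com/test12333/r.html"),
  Link.new(" test21 ", "https://www.test.com/testaa123/t.html"),
  Link.new(" test31 ", "https://www.test.com/test123123/b.html")
]

# Find by visible text (stripped, since the markup pads it with spaces).
by_text = links.find { |link| link.text.strip == "test21" }

# Find every link whose href contains the given substring.
by_href = links.select { |link| link.href.include?("/testaa123/t.html") }

puts by_text.href
puts by_href.map { |link| link.text.strip }.inspect
```

On a real Mechanize page, `link_with` and `links_with` accept `:text` and `:href` criteria (strings or regular expressions) and may make the manual filtering unnecessary.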

Installing mechanize gem on Mac OS X 10.4.11 gives 'Failed to build gem native extension'

房东的猫 submitted on 2019-12-24 07:38:58
Question: I'm trying to install the mechanize gem on Mac OS X, but I keep getting the following error:

    ERROR: Error installing mechanize:
    ERROR: Failed to build gem native extension.

    /usr/local/bin/ruby extconf.rb install mechanize
    checking for #include ... yes
    checking for #include ... yes
    checking for #include ... yes
    checking for #include ... yes
    checking for xmlParseDoc() in -lxml2... yes
    checking for xsltParseStylesheetDoc() in -lxslt... yes
    checking for exsltFuncRegister() in -lexslt... yes

Ruby Mechanize: Follow a Link

我只是一个虾纸丫 submitted on 2019-12-22 06:48:21
Question: In Mechanize on Ruby, I have to assign a new variable to every new page I come to. For example:

    page2 = page1.link_with(:text => "Continue").click
    page3 = page2.link_with(:text => "About").click
    ...etc

Is there a way to run Mechanize without a variable holding every page state? Something like:

    my_only_page.link_with(:text => "Continue").click!
    my_only_page.link_with(:text => "About").click!

Answer 1: I don't know if I understand your question correctly, but if it's a matter of looping through a lot of pages
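
One way to avoid a variable per page is to fold over the link texts, letting each click result become the next page. The sketch below is only an illustration of that idea: `FakePage` and `FakeLink` are hypothetical stand-ins that mimic `link_with` and `click` so the example runs without network access.

```ruby
# Stand-in objects (assumptions for this sketch, not Mechanize classes).
# They mimic only the two calls used below: link_with(text:) and click.
FakeLink = Struct.new(:target) do
  def click
    target
  end
end

class FakePage
  attr_reader :title

  def initialize(title, links = {})
    @title = title
    @links = links # maps link text to the destination FakePage
  end

  def link_with(text:)
    FakeLink.new(@links.fetch(text))
  end
end

about  = FakePage.new("About")
second = FakePage.new("Second", "About" => about)
start  = FakePage.new("Start", "Continue" => second)

# Fold over the link texts: each click result becomes the next page.
final_page = ["Continue", "About"].reduce(start) do |page, text|
  page.link_with(text: text).click
end

puts final_page.title
```

With a real agent, `agent.page` returns the most recently loaded page, so `agent.page.link_with(:text => "Continue").click` can be repeated without keeping intermediate variables.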

Regulating / rate limiting ruby mechanize

主宰稳场 submitted on 2019-12-21 17:15:06
Question: I need to regulate how often a Mechanize instance connects with an API (once every 2 seconds, so connections should be at least that far apart). So I tried this:

    instance.pre_connect_hooks << Proc.new { sleep 2 }

I had thought this would work, and it sort of does, but now every method in that class sleeps for 2 seconds, as if the Mechanize instance is touched and told to hold for 2 seconds. I'm going to try a post-connect hook, but it is obvious I need something a bit more elaborate, but I don't know what at this
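
A fixed `sleep 2` delays every connection, even when the previous one happened long ago. A slightly more elaborate approach is to remember when the last connection was made and sleep only for the time still remaining. The `Throttle` class below is a hypothetical helper written for this sketch, not part of Mechanize, and the interval is shortened so the example finishes quickly.

```ruby
# Enforces a minimum gap between calls, sleeping only for the time remaining.
class Throttle
  def initialize(min_interval)
    @min_interval = min_interval
    @last_call = nil
  end

  # Blocks until at least min_interval seconds have passed since the last wait.
  def wait
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if @last_call
      remaining = @min_interval - (now - @last_call)
      sleep(remaining) if remaining > 0
    end
    @last_call = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

throttle = Throttle.new(0.2) # the question asks for 2 seconds; shortened here

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
3.times { throttle.wait } # the first call returns immediately
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts format("three calls took %.2f seconds", elapsed)
```

Mirroring the hook from the question, this could be registered as `instance.pre_connect_hooks << Proc.new { throttle.wait }`, so that requests spaced more than the interval apart are not delayed at all.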

Clicking link with JavaScript in Mechanize

纵饮孤独 submitted on 2019-12-18 17:05:05
Question: I have this:

    <a class="top_level_active" href="javascript:Submit('menu_home')">Account Summary</a>

I want to click that link, but I get an error when using link_to. I've tried:

    bot.click(page.link_with(:href => /menu_home/))
    bot.click(page.link_with(:class => 'top_level_active'))
    bot.click(page.link_with(:href => /Account Summary/))

The error I get is:

    NoMethodError: undefined method `[]' for nil:NilClass

Answer 1: That's a JavaScript link. Mechanize will not be able to click it, since it does not
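
Although the link cannot be clicked, the argument passed to the page's `Submit` function can still be read out of the href. The sketch below covers only that extraction step, using the href string from the question; `menu_target` is a name introduced here for illustration.

```ruby
# The href from the question; Submit(...) is the page's own JavaScript function.
href = "javascript:Submit('menu_home')"

# Capture the single-quoted argument passed to Submit.
match = href.match(/\Ajavascript:Submit\('([^']+)'\)\z/)
menu_target = match && match[1]

puts menu_target
```

What `Submit` does with that value is not shown in the question, so the page's script would need to be inspected before the equivalent request could be reproduced by hand.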

getaddrinfo error with Mechanize

自作多情 submitted on 2019-12-18 16:50:03
Question: I wrote a script that goes through all of the customers in our database, verifies that their website URL works, and tries to find a Twitter link on their homepage. We have a little over 10,000 URLs to verify. After a fraction of the URLs are verified, we start getting getaddrinfo errors for every URL. Here's a copy of the code that scrapes a single URL:

    def scrape_url(url)
      url_found = false
      twitter_name = nil
      begin
        agent = Mechanize.new do |a|
          a.follow_meta_refresh = true
        end
        agent.get
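
Ruby reports getaddrinfo failures as `SocketError`, so one defensive measure is to retry a failed lookup a few times with an increasing pause. The sketch below does not address the root cause, which the truncated question does not reveal; `with_dns_retries`, `max_attempts`, and `base_delay` are hypothetical names, and the failure is simulated.

```ruby
require "socket"

# Retries the block when name resolution fails, waiting longer each time.
# max_attempts and base_delay are illustrative parameters, not Mechanize options.
def with_dns_retries(max_attempts: 3, base_delay: 0.1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue SocketError
    raise if attempts >= max_attempts # give up and re-raise the last error
    sleep(base_delay * attempts)
    retry
  end
end

# Simulated lookup that fails twice before succeeding.
calls = 0
result = with_dns_retries do
  calls += 1
  raise SocketError, "getaddrinfo: simulated failure" if calls < 3
  "ok after #{calls} attempts"
end

puts result
```

In the scraper, the block would wrap the `agent.get` call for a single URL, so one unresolvable host does not abort the whole run.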

Is it possible to find the <td> .. </td> text, when any of the <td>..</td> value is known?

给你一囗甜甜゛ submitted on 2019-12-13 03:39:24
Question: I have a webpage with HTML similar to the following:

    <form name="test">
    <td> .... </td>
    .
    .
    .
    <td> <A HREF="http://www.edu/st/file.html">alo</A> </td>
    <td> <A HREF="http://www.dom/st/file.html">foo</A> </td>
    <td> bla bla </td>
    </form>

Now, I know only the value bla bla. Based on that value, can we track or find the 3rd-last <td>..</td> value (which here is alo)? I can track those with the help of the HREF values, but the HREF values are not always fixed; they can be anything at any time.

Answer 1:
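
Once the cell texts have been collected in document order, the lookup is a matter of locating the known value and stepping back two positions. The sketch below assumes the texts are already extracted into an array (the `cells` values mirror the question's markup) and leaves the HTML parsing itself aside.

```ruby
# Cell texts in document order, as they might be collected from the <td> elements.
cells = ["....", "alo", "foo", "bla bla"]

# Locate the known value, then step back two cells.
known_index = cells.index("bla bla")
target = cells[known_index - 2] if known_index && known_index >= 2

puts target
```

The guard on `known_index` keeps the lookup from wrapping around to the end of the array when the known value is missing or sits in the first two cells.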