getaddrinfo error with Mechanize

Submitted by 和自甴很熟 on 2019-12-18 16:49:26

Question


I wrote a script that goes through all of the customers in our database, verifies that their website URL works, and tries to find a Twitter link on their homepage. We have a little over 10,000 URLs to verify. After a fraction of the URLs have been verified, we start getting getaddrinfo errors for every URL.

Here's a copy of the code that scrapes a single URL:

def scrape_url(url)
  url_found = false
  twitter_name = nil

  begin
    # A fresh agent per URL; follow <meta http-equiv="refresh"> redirects.
    agent = Mechanize.new do |a|
      a.follow_meta_refresh = true
    end

    agent.get(normalize_url(url)) do |page|
      url_found = true
      twitter_name = find_twitter_name(page)
    end

    @err << "[#{@current_record}] SUCCESS\n"
  rescue Exception => e
    @err << "[#{@current_record}] ERROR (#{url}): #{e.message}\n"
  end

  [url_found, twitter_name]
end
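The snippet calls two helpers whose definitions aren't shown. Purely as hypothetical, stdlib-only sketches (the names and behavior are assumptions, not the asker's actual code), normalize_url might just ensure the stored URL has a scheme, and the Twitter lookup might scan anchor hrefs for a profile link:

```ruby
# Hypothetical sketch: prepend a scheme when the stored URL lacks one,
# so Mechanize#get can resolve it.
def normalize_url(url)
  url = url.to_s.strip
  url =~ %r{\Ahttps?://}i ? url : "http://#{url}"
end

# Hypothetical sketch: extract a Twitter handle from a single href.
# find_twitter_name(page) could map the page's links to their hrefs
# and return the first handle this yields.
TWITTER_HANDLE = %r{twitter\.com/(?:#!/)?@?([A-Za-z0-9_]{1,15})}

def twitter_name_from_href(href)
  m = TWITTER_HANDLE.match(href.to_s)
  m && m[1]
end
```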

Note: I've also run a version of this code that creates a single Mechanize instance shared across all calls to scrape_url. It failed in exactly the same fashion.

When I run this on EC2, it gets through almost exactly 1,000 URLs, then returns this error for the remaining 9,000+:

getaddrinfo: Temporary failure in name resolution

Note, I've tried using both Amazon's DNS servers and Google's DNS servers, thinking it might be a legitimate DNS issue. I got exactly the same result in both cases.

Then, I tried running it on my local MacBook Pro. It only got through about 250 before returning this error for the remainder of the records:

getaddrinfo: nodename nor servname provided, or not known

Does anyone know how I can get the script to make it through all of the records?


Answer 1:


I found the solution. Mechanize was leaving connections open and relying on GC to clean them up. Past a certain point, enough connections (and their file descriptors) were being held open that no new outbound socket could be created, not even for a DNS lookup. Here's the code that fixed it:

agent = Mechanize.new do |a|
  a.follow_meta_refresh = true
  a.keep_alive = false
end

With keep_alive set to false, each connection is closed and cleaned up as soon as the request completes, instead of lingering until garbage collection.
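The failure mode can be illustrated with nothing but the Ruby standard library (no Mechanize involved): every connection left open holds a file descriptor until it is explicitly closed or finalized, so a long scraping loop can exhaust the per-process descriptor limit, at which point even the socket needed for a DNS lookup cannot be opened. Closing eagerly, which is what keep_alive = false makes Mechanize do, releases each descriptor immediately:

```ruby
require 'socket'

server = TCPServer.new('127.0.0.1', 0)   # throwaway local listener
port   = server.addr[1]

# Open several connections and never close them: descriptors pile up,
# just like Mechanize's pooled keep-alive connections did.
leaky = Array.new(5) { TCPSocket.new('127.0.0.1', port) }
open_before = leaky.count { |s| !s.closed? }   # => 5

# Eager cleanup, analogous to keep_alive = false: every descriptor
# is released immediately rather than waiting for GC.
leaky.each(&:close)
open_after = leaky.count { |s| !s.closed? }    # => 0

server.close
```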




Answer 2:


See if this helps:

agent.history.max_size = 10

It keeps the page history from consuming too much memory.



Source: https://stackoverflow.com/questions/13186289/getaddrinfo-error-with-mechanize
