Question
I'm trying to do a little web scraping, but the WWW::Mechanize gem doesn't seem to like the page's encoding and crashes.
The POST request results in a 302 redirect (which Mechanize follows, so far so good), and the resulting page seems to crash it.
I googled quite a bit but haven't found a way to solve this. Does anyone have an idea?
Code:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
answer = agent.post('https://www.budget.de/de/reservierung/privatkunden/step1/schnellbuchung',
  {"Country" => "Deutschland",
   "Abholstation" => "Aalen",
   "Abgabestation" => "Aalen",
   "Abholdatum" => "26.02.2009",
   "Abholzeit_stunde" => "13",
   "Abholzeit_minute" => "30",
   "Abgabedatum" => "28.02.2009",
   "Abgabezeit_stunde" => "13",
   "Abgabezeit_minute" => "30",
   "CountryID" => "DE",
   "AbholstationID" => "AA1",
   "AbgabestationID" => "AA1"
  }
)
puts answer.body
Error:
D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `iconv': "\204nderungen vorbe"... (Iconv::IllegalSequence)
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `to_native_charset'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_header_handler.rb:29:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_body_parser.rb:35:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/pre_connect_hook.rb:14:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:25:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:494:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:545:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:403:in `post_form'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:322:in `post'
from test.rb:7
Answer 1:
That page is almost certainly UTF-8. However, Mechanize uses NKF (a library that ships with Ruby) to guess the encoding, and for some reason the guess comes back as Shift JIS. The quickest workaround is to override Mechanize's encoding mapping, so that when it converts the body to UTF-8 with Iconv it also passes UTF-8 as the source encoding. You can do it like this:
WWW::Mechanize::Util::CODE_DIC[:SJIS] = "UTF-8"
Place that just after the line where you require the Mechanize library. You may want to set the value back immediately afterwards, or even better, find the root cause of the problem and submit a patch if necessary.
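Setting the value back can be wrapped in a small helper so the override cannot leak if the request raises. This is only a sketch: with_charset_override is a hypothetical name, and the CODE_DIC hash below is a stand-in with assumed values so the example runs without Mechanize. With mechanize 0.9.1 you would pass WWW::Mechanize::Util::CODE_DIC instead.

```ruby
# Stand-in for WWW::Mechanize::Util::CODE_DIC (values here are assumed,
# not copied from the library) so the sketch runs without Mechanize.
CODE_DIC = { :SJIS => "SHIFT_JIS", :UTF8 => "UTF-8" }

# Hypothetical helper: temporarily remap one key, run the block,
# then restore the previous state even if the block raises.
def with_charset_override(dic, key, value)
  had_key  = dic.key?(key)
  original = dic[key]
  dic[key] = value
  yield
ensure
  if had_key
    dic[key] = original
  else
    dic.delete(key)
  end
end

# In real use the block would contain the agent.post(...) call.
result = with_charset_override(CODE_DIC, :SJIS, "UTF-8") do
  CODE_DIC[:SJIS] # the override is visible only inside the block
end
```

After the block returns, CODE_DIC[:SJIS] holds its earlier value again, so other pages that really are Shift JIS are not affected.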
Note: I found this by following the backtrace into the Mechanize library. The to_native_charset method calls detect_charset, which is where the problem was.
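The error message itself is consistent with the page being UTF-8: Iconv stops at "\204nderungen", and octal 204 (0x84) is the second byte of "Ä" in UTF-8, so the text is most likely "Änderungen vorbehalten". A small check, assuming that reading of the message (the variable names are illustrative):

```ruby
# "Änderungen" written with an escape so the source file's encoding doesn't matter.
word = "\u00C4nderungen"

# In UTF-8, "Ä" is the two bytes 0xC3 0x84.
utf8_bytes = word.encode("UTF-8").bytes
first_two  = utf8_bytes.first(2).map { |b| format("\\%o", b) }

# In ISO-8859-1 the same letter is the single byte 0xC4, so the \204 byte
# in the error would not appear there.
latin1_bytes = word.encode("ISO-8859-1").bytes

puts first_two.join(" ")
puts format("\\%o", latin1_bytes.first)
```

The first puts shows the octal escapes for the two UTF-8 bytes, the second of which matches the byte Iconv reported.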
Answer 2:
In my case the get method returned a Mechanize::File, which doesn't do any encoding conversion at all.
I was able to fix it by converting manually with Iconv, but this only works if you already know the encoding.
require 'iconv'
result = @agent.get uri
# Mechanize::File instead of Mechanize::Page is returned,
# so we have to convert manually (Iconv.conv takes the target encoding first, then the source)
result = Iconv.conv("utf-8", "iso-8859-1", result.body)
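Iconv was removed from Ruby's standard library in Ruby 2.0, so on newer Rubies the same manual conversion can be done with String#encode. A minimal sketch, assuming the body really is ISO-8859-1; to_utf8 is a hypothetical helper name and the sample bytes stand in for result.body:

```ruby
# Hypothetical helper: treat the raw bytes as `from` and transcode them to UTF-8.
def to_utf8(raw_body, from = "ISO-8859-1")
  raw_body.dup.force_encoding(from).encode("UTF-8")
end

# Sample ISO-8859-1 bytes standing in for result.body (0xC4 is "Ä" in Latin-1).
latin1_body = "\xC4nderungen vorbehalten".b

converted = to_utf8(latin1_body)
puts converted
```

As with the Iconv version, this only gives correct text when the source encoding passed in is the right one.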
Source: https://stackoverflow.com/questions/586163/iconvillegalsequence-when-using-wwwmechanize