Parsing a CSV file using different encodings and libraries

好久不见. 提交于 2019-12-03 15:13:51

Looking at the file in question:

 $ curl -s http://jamesabbottdd.com/examples/testfile.csv | xxd | head -n3
0000000: fffe 4300 6100 6d00 7000 6100 6900 6700  ..C.a.m.p.a.i.g.
0000010: 6e00 0900 4300 7500 7200 7200 6500 6e00  n...C.u.r.r.e.n.
0000020: 6300 7900 0900 4200 7500 6400 6700 6500  c.y...B.u.d.g.e.

The byte order markffee at the start suggests the file encoding is little endian UTF-16, and the 00 bytes at every other position back this up.

This would suggest that you should be able to do this:

CSV.foreach('./testfile.csv', :encoding => 'utf-16le') do |row| ...

However that gives me invalid byte sequence in UTF-16LE (ArgumentError) coming from inside the CSV library. I think this is due to IO#gets only returning a single byte for some reason when faced with the BOM when called in CSV, resulting in the invalid UTF-16.

You can get CSV to strip of the BOM, by using bom|utf-16-le as the encoding:

CSV.foreach('./testfile.csv', :encoding => 'bom|utf-16le') do |row| ...

You might prefer to convert the string to a more familiar encoding instead, in which case you could do:

CSV.foreach('./testfile.csv', :encoding => 'utf-16le:utf-8') do |row| ...

Both of these appear to work okay.

Converting the file to UTF8 first and then reading it also works nicely:

iconv -f utf-16 -t utf8 testfile.csv | ruby -rcsv -e 'CSV(STDIN).each {|row| puts row}'

Iconv seems to understand correctly that the file has a BOM at the start and strips it off when converting.

There are two things to solve here when dealing with the AdWords Keyword Planner download. For one is the encoding.

$ file Keyword\ Stats\ 2019-02-12\ at\ 19_04_53.csv
Keyword Stats 2019-02-12 at 19_04_53.csv: Little-endian UTF-16 Unicode text, with very long lines

And the fact that the delimiters are tabs and not commas!

So stepping over the CSV file is as simple as this:

CSV.foreach('Keyword Stats 2019-02-12 at 19_04_53.csv', col_sep: "\t", encoding: 'utf-16le:utf-8') do |row|
  puts row
end

FYI: The \t must be in double quotes so it would be interpreted as a tab and not the string \t.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!