Ruby read CSV file as UTF-8 and/or convert ASCII-8Bit encoding to UTF-8

醉酒成梦 2020-12-04 15:39

I'm using Ruby 1.9.2.

I'm trying to parse a CSV file that contains some French words (e.g. spécifié) and place the contents in a MySQL database.

3 answers
  • 2020-12-04 15:44

    With Ruby >= 1.9 you can use:

    file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1:utf-8")
    

    The encoding string ISO8859-1:utf-8 means: the CSV file is ISO8859-1 encoded, but its content is converted to UTF-8 as it is read.

    If you prefer more explicit code, you can use:

    file_contents = CSV.read("csvfile.csv", col_sep: "$", 
        external_encoding: "ISO8859-1", 
        internal_encoding: "utf-8"
      )
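
    To check the effect of the external:internal pair, here is a minimal sketch. The helper name read_latin1_csv and the sample data are made up for illustration; it writes ISO-8859-1 bytes to a temporary file and reads them back with the same encoding option as above (spelled with the canonical names ISO-8859-1 and UTF-8). It assumes Ruby 2.1 or later for Tempfile.create.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: writes raw bytes to a temporary file, then reads it
# back as "$"-separated CSV, transcoding ISO-8859-1 to UTF-8 on the way in.
def read_latin1_csv(raw_bytes)
  Tempfile.create(["latin1_sample", ".csv"]) do |file|
    file.binmode
    file.write(raw_bytes)
    file.close
    CSV.read(file.path, col_sep: "$", encoding: "ISO-8859-1:UTF-8")
  end
end

# Made-up sample data; \xE9 is "é" in ISO-8859-1.
rows = read_latin1_csv("id$label\n1$Non sp\xE9cifi\xE9\n".b)

label = rows[1][1]
puts label.encoding         # UTF-8
puts label.valid_encoding?  # true
puts label.bytesize         # 14 ("é" takes two bytes in UTF-8)
```

    Every field comes back as a valid UTF-8 string, so it can be handed to a database connection that expects UTF-8 without further conversion.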
    
  • 2020-12-04 15:55

    I have been dealing with this issue for a while, and none of the other solutions worked for me.

    What did the trick was writing the problematic string to a file in binary mode, reading the file back normally, and feeding the resulting string to the CSV module:

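    require 'csv'
    require 'tempfile'

    # Write the raw bytes to a temporary file in binary mode.
    tempfile = Tempfile.new("conflictive_string")
    tempfile.binmode
    tempfile.write(conflictive_string)
    tempfile.close
    # Read them back in text mode, tagged with the default external encoding.
    cleaned_string = File.read(tempfile.path)
    tempfile.unlink
    csv = CSV.new(cleaned_string)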
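
    The round trip through a file does not change any bytes. File.read without an encoding argument tags what it reads with Encoding.default_external (when no default internal encoding is set), so the same relabeling can probably be done in memory. A sketch under that assumption, with a hypothetical helper named relabel and made-up sample data that is assumed to really be UTF-8:

```ruby
require "csv"

# Hypothetical helper: returns a copy of the string with the same bytes,
# tagged with the given encoding.
def relabel(string, encoding = Encoding.default_external)
  string.dup.force_encoding(encoding)
end

# Made-up sample: UTF-8 bytes that arrived tagged as binary (ASCII-8BIT).
conflictive_string = "name,city\nRen\u00E9,Orl\u00E9ans\n".b

cleaned_string = relabel(conflictive_string, Encoding::UTF_8)
rows = CSV.new(cleaned_string).read

puts cleaned_string.valid_encoding?  # true
puts rows[1][1].encoding             # UTF-8
```

    This only helps when the bytes really are in the target encoding; if valid_encoding? returns false, the data needs transcoding rather than relabeling.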
    
  • 2020-12-04 15:59

    deceze is right, that is ISO8859-1 (AKA Latin-1) encoded text. Try this:

    file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")
    

    If that doesn't work, you can use Iconv (part of the standard library in Ruby 1.9, removed from it in Ruby 2.0) to fix up the individual strings with something like this:

    require 'iconv'
    utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first
    

    If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:

    utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)
    

    With newer Rubies, you can do things like this:

    utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')
    

    where latin1_string is tagged as ASCII-8BIT but really holds ISO-8859-1 bytes.
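
    A small self-contained sketch of that conversion, using the same sample text. The helper name latin1_to_utf8 is made up for illustration, and dup is added so the original string is not modified in place:

```ruby
# Hypothetical helper: treats the bytes as ISO-8859-1 and returns a UTF-8 copy.
def latin1_to_utf8(bytes)
  # force_encoding only relabels the bytes; encode actually transcodes them.
  # dup leaves the caller's string untouched.
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

latin1_string = "Non sp\xE9cifi\xE9".b   # tagged as binary (ASCII-8BIT)
utf8_string = latin1_to_utf8(latin1_string)

puts latin1_string.bytesize       # 12
puts utf8_string.bytesize         # 14
puts utf8_string.encoding         # UTF-8
puts utf8_string.valid_encoding?  # true

# Whole arrays can be converted with map.
utf8_strings = ["caf\xE9".b, "na\xEFve".b].map { |s| latin1_to_utf8(s) }
```

    Calling encode directly on an ASCII-8BIT string would raise for bytes above 0x7F, because Ruby has no character mapping for binary data; the relabeling step is what tells it how to interpret them.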
