Delete non-UTF characters from a string in Ruby?

不思量自难忘° 2021-02-05 01:24

How do I delete non-UTF-8 characters from a Ruby string? I have a string that contains, for example, "\xC2". I want to remove that character from the string so that it becomes a valid UTF-8 string.

7 answers
  • 2021-02-05 01:54

    You can use encode for that:

    text.encode('UTF-8', :invalid => :replace, :undef => :replace)

    For more information, see the Ruby docs for String#encode.
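As a minimal sketch of the encode approach: by default, :replace substitutes U+FFFD for each invalid byte, so passing an empty replacement string drops the bytes instead. The sample string is borrowed from another answer in this thread, and the sketch assumes a reasonably recent Ruby, where encode with an explicit :invalid => :replace also cleans a string that is already tagged as UTF-8.

```ruby
# Sample input: a UTF-8 string containing a stray \xC2 byte.
text = "testing\xC2 a non UTF-8 string"

# Default behaviour: each invalid byte becomes U+FFFD (the replacement character).
marked = text.encode('UTF-8', invalid: :replace, undef: :replace)

# With replace: '' the invalid bytes are removed instead.
cleaned = text.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')

puts marked.valid_encoding?
puts cleaned
```

If the visible U+FFFD marker is acceptable, the first form is enough; the second matches the "delete" wording of the question.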

  • 2021-02-05 01:54

    You can use /n, as in

    text.gsub!(/\xC2/n, '')
    

    to force the Regexp to operate on bytes.

    Are you sure this is what you want, though? Any Unicode character in the range [U+80, U+BF] will have a \xC2 in its UTF-8 encoded form.
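To illustrate that warning, the sketch below prints the UTF-8 bytes of two perfectly valid characters picked as examples (the copyright sign U+00A9 and the no-break space U+00A0). Both start with 0xC2, so stripping every \xC2 byte would corrupt them. The helper name hex_bytes is made up for this illustration.

```ruby
# Hex dump of a string's bytes (hypothetical helper, for illustration only).
def hex_bytes(str)
  str.bytes.map { |b| format('%02X', b) }
end

copyright = hex_bytes("\u00A9") # copyright sign
nbsp      = hex_bytes("\u00A0") # no-break space

p copyright
p nbsp
```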

  • 2021-02-05 01:56

    The best solution to this problem that I've found is this answer to the same question: https://stackoverflow.com/a/8711118/363293.

    In short: "€foo\xA0".chars.select(&:valid_encoding?).join
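A short sketch of what that one-liner does, using the same sample string: chars yields each invalid byte as its own one-character string, and valid_encoding? is false only for those pieces, so valid multibyte characters such as € survive. The variable names are illustrative.

```ruby
raw = "€foo\xA0"

# Each element is one character; the stray \xA0 byte comes out on its own.
pieces = raw.chars

# Keep only the well-formed characters and glue them back together.
clean = pieces.select(&:valid_encoding?).join

puts clean
```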

  • 2021-02-05 02:15

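    Try Iconv (note that it was deprecated in Ruby 1.9.3 and is no longer part of the standard library as of Ruby 2.0, where it is available as the iconv gem):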

    1.9.3p194 :001 > require 'iconv'
    # => true 
    1.9.3p194 :002 > string = "testing\xC2 a non UTF-8 string"
    # => "testing\xC2 a non UTF-8 string" 
    1.9.3p194 :003 > ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
    # => #<Iconv:0x000000026c9290> 
    1.9.3p194 :004 > ic.iconv string
    # => "testing a non UTF-8 string" 
    
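On newer Rubies, where Iconv is no longer bundled, a comparable result can be had without it. This is a sketch using String#scrub (available since Ruby 2.1), which is a different technique from the Iconv call but produces the same output for this sample string.

```ruby
string = "testing\xC2 a non UTF-8 string"

# scrub replaces invalid byte sequences; an empty replacement deletes them.
cleaned = string.scrub('')

puts cleaned
```

Calling scrub with no argument keeps a U+FFFD marker where each bad byte was, which may be preferable when silent data loss is a concern.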
  • 2021-02-05 02:15
    # Discards the entire string (not just the bad bytes) if it is not valid UTF-8.
    data = '' unless data.force_encoding('UTF-8').valid_encoding?
    
  • 2021-02-05 02:17

    Your text has ASCII-8BIT encoding, so you can use this instead:

    text.delete!("^\u{0000}-\u{007F}")
    

    It serves the same purpose, but note that it removes every non-ASCII byte, including those that belong to valid UTF-8 characters.
