How to remove all non-ASCII characters from a string in Ruby

小蘑菇 2021-02-01 20:38

It seems to be a very simple and much-needed method. I need to remove all non-ASCII characters from a string, e.g. ©. See the following example.

#coding: utf-         


        
3 answers
  • 2021-02-01 21:10

    You can just literally translate what you asked into a Regexp. You wrote:

    I want to get rid of all non ASCII characters

    We can rephrase that a little bit:

    I want to substitute all characters which don't have the ASCII property with nothing

    And that's a statement that can be directly expressed in a Regexp:

    s.gsub!(/\P{ASCII}/, '')
    

    As an alternative, you could also use String#delete!:

    s.delete!("^\u{0000}-\u{007F}")
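    One thing to watch with the two calls above: the bang methods return nil when nothing was changed, so their return value should not be chained or assigned. A minimal sketch of the non-mutating variants follows; the sample strings and variable names are made up for illustration.

    ```ruby
    # Illustrative input: "Café © 2021", written with escapes so the source file stays ASCII.
    s = "Caf\u00E9 \u00A9 2021"

    # gsub and delete (no bang) return a new string and leave s untouched.
    by_regexp = s.gsub(/\P{ASCII}/, '')
    by_delete = s.delete("^\u{0000}-\u{007F}")

    puts by_regexp
    puts by_delete

    # The bang versions return nil when the string needed no change.
    plain = +"plain ASCII"
    p plain.gsub!(/\P{ASCII}/, '')
    ```

    Both variants should give the same result here; which one reads better is mostly a matter of taste.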
    
  • 2021-02-01 21:20

    Strip out the characters using regex. This example is in C# but the regex should be the same: How can you strip non-ASCII characters from a string? (in C#)

    Translating it into Ruby using gsub should not be difficult.
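    As a rough sketch of that translation, assuming the linked C# answer relies on a negated character class covering the ASCII range (0x00 to 0x7F), the Ruby version could look like this. The helper name strip_non_ascii is hypothetical.

    ```ruby
    # Hypothetical helper: removes every character outside the range 0x00-0x7F.
    def strip_non_ascii(str)
      # Negated character class: match anything that is NOT in the ASCII range.
      str.gsub(/[^\x00-\x7F]/, '')
    end

    # "naïve résumé", written with escapes so the source file stays ASCII.
    puts strip_non_ascii("na\u00EFve r\u00E9sum\u00E9")
    ```

    gsub returns a new string, so the original is left unchanged.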

  • 2021-02-01 21:36

    UTF-8 is a variable-length encoding. When a character occupies one byte, its value coincides with 7-bit ASCII. So why don't you just look for bytes with a '1' in the MSB, and then remove both them and their trailers? A byte beginning with '110' will be followed by one additional byte. A byte beginning with '1110' will be followed by two. And a byte beginning with '11110' will be followed by three, the maximum supported by UTF-8.

    This is all just off the top of my head. I could be wrong.
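    A byte-level sketch of this idea, assuming the input is valid UTF-8: in valid UTF-8 the trailing bytes of a multi-byte character also have the most significant bit set (they start with '10'), so dropping every byte that is 0x80 or higher removes lead bytes and trailers together without counting them. The helper name strip_high_bytes is hypothetical.

    ```ruby
    # Hypothetical helper: keeps only bytes whose most significant bit is 0.
    # Assumes str is valid UTF-8; other encodings are not handled.
    def strip_high_bytes(str)
      kept = str.bytes.select { |b| b < 0x80 }
      # Rebuild a string from the remaining bytes and tag it as ASCII.
      kept.pack('C*').force_encoding(Encoding::US_ASCII)
    end

    # "Café © 2021": two bytes each for the accented letter and the copyright sign.
    puts strip_high_bytes("Caf\u00E9 \u00A9 2021")
    ```

    In practice the regexp-based approaches are shorter, but this shows that the byte-level reasoning holds.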
