发表新帖

发表新帖

Equivalent of Iconv.conv(“UTF-8//IGNORE”,…) in Ruby 1.9.X?

后端未结

关注

 6  1619

情歌与酒 2021-02-04 09:47

I\'m reading data from a remote source, and occassionally get some characters in another encoding. They\'re not important.

I\'d like to get get a \"best guess\" utf-8 st

6条回答

深忆病人 (楼主)

2021-02-04 10:29
With a bit of help from @masakielastic I have solved this problem for my personal purposes using the #chars method.

The trick is to break down each character into its own separate block so that ruby can fail.

Ruby needs to fail when it confronts binary code etc. If you don't allow ruby to go ahead and fail its a tough road when it comes to this stuff. So I use the String#chars method to break the given string into an array of characters. Then I pass that code into a sanitizing method that allows the code to have "microfailures" (my coinage) within the string.

So, given a "dirty" string, lets say you used File#read on a picture. (my case)
```
dirty = File.open(filepath).read    
clean_chars = dirty.chars.select do |c|
  begin
    num_or_letter?(c)
  rescue ArgumentError
    next
  end
end
clean = clean_chars.join("")

def num_or_letter?(char)
  if char =~ /[a-zA-Z0-9]/
    true
  elsif char =~ Regexp.union(" ", ".", "?", "-", "+", "/", ",", "(", ")")
    true
  end
end
```
allowing the code to fail somewhere along in the process seems to be the best way to move through it. So long as you contain those failures within blocks you can grab what is readable by the UTF-8-only-accepting parts of ruby
0 讨论(0)

查看其它6个回答
发布评论:

提交评论
- 加载中...

热议问题