Is there a way in ruby 1.9 to remove invalid byte sequences from strings?

前提是你 提交于 2019-12-17 18:26:16

问题


Suppose you have a string like "€foo\xA0", encoded UTF-8, Is there a way to remove invalid byte sequences from this string? ( so you get "€foo" )

In ruby-1.8 you could use Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "€foo\xA0") but that is now deprecated. "€foo\xA0".encode('UTF-8') doesn't do anything, since it is already UTF-8. I tried:

"€foo\xA0".force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')

which yields

"foo"

But that also loses the valid multibyte character €


回答1:


"€foo\xA0".chars.select(&:valid_encoding?).join



回答2:


"€foo\xA0".encode('UTF-16le', invalid: :replace, replace: '').encode('UTF-8')



回答3:


Ruby 2.0 and 1.9.3

"€foo\xA0".encode(Encoding::UTF_8, Encoding::UTF_8, :invalid => :replace)

Ruby 2.1+

"€foo\xA0".scrub



回答4:


    data = '' if not (data.force_encoding("UTF-8").valid_encoding?)


来源:https://stackoverflow.com/questions/8710444/is-there-a-way-in-ruby-1-9-to-remove-invalid-byte-sequences-from-strings

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!