I have a Ruby CGI script (not Rails) that collects photos and captions from a web form. My users are very keen on smart quotes and ligatures, which they paste in from other sources.
Here's my suggestion using Iconv.
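require 'iconv' # not in the standard library since Ruby 2.0; needs the iconv gem there

class String
  # Returns a copy in which every character without an ASCII equivalent is dropped.
  def remove_non_ascii
    Iconv.conv('ASCII//IGNORE', 'UTF-8', self)
  end
end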
class String
  # Matches every character outside the 7-bit ASCII range,
  # including smart quotes and ligatures.
  def remove_non_ascii(replacement = "")
    gsub(/[^\x00-\x7F]/, replacement)
  end
end
No, there isn't, short of removing every character outside the basic set (as recommended above). The best solution would be to handle these names properly, since most filesystems today have no problems with Unicode names. If your users paste in ligatures, they sure as hell will want to get them back too. If the filesystem is your problem, abstract it away and set the filename to an MD5 hash (this also lets you easily shard uploads into buckets that scan very quickly, since no bucket ever holds too many entries).
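The MD5 idea could be sketched roughly as follows. The helper name hashed_upload_path, the "uploads" root, and the two-character bucket prefix are all assumptions made for illustration, not something prescribed above.

```ruby
require 'digest/md5'

# Hypothetical helper: derives a storage path from the uploaded bytes.
# The first two hex digits of the digest pick one of 256 bucket
# directories, so no single directory collects every upload.
def hashed_upload_path(data, extension, root = "uploads")
  digest = Digest::MD5.hexdigest(data)
  File.join(root, digest[0, 2], "#{digest}#{extension}")
end

# The result has the form uploads/<2 hex digits>/<32 hex digits>.jpg
path = hashed_upload_path("hello", ".jpg")
puts path
```

The caption the user typed, ligatures and all, would then be stored next to this path (in a database row, say) rather than in the filename itself.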
A quick Google search turned up this discussion, which suggests the following method:
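class String
  # Replaces, in place, every character whose code falls outside 33..127
  # with +replacement+. Note that spaces and control characters are
  # replaced as well.
  def remove_nonascii(replacement)
    chars = split("")
    slice!(0..size) # empty the receiver, then rebuild it below
    chars.each do |c|
      code = c.ord # code point of the character (Ruby 1.9+)
      if code < 33 || code > 127
        concat(replacement)
      else
        concat(c)
      end
    end
    to_s
  end
end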
The official way to convert between string encodings as of Ruby 1.9 is to use String#encode.
To simply remove non-ASCII characters, you could do this:
some_ascii   = "abc"
some_unicode = "áëëçüñżλφθΩ"
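As a minimal sketch of that approach, String#encode can be told to replace anything undefined in ASCII with an empty string. The sample caption and the variable names here are assumptions made for illustration.

```ruby
# Sample input mixing ASCII with smart quotes, an accented letter,
# and the "fi" ligature (illustrative only).
caption = "\u201Cna\u00EFve\u201D \uFB01sh"

# :undef   => characters that have no ASCII equivalent
# :invalid => byte sequences that are not valid in the source encoding
# :replace => the string substituted for either of the above
ascii = caption.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")

puts ascii # → nave sh
```

Passing a visible placeholder such as replace: "?" instead of "" makes it easier to spot where characters were dropped.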
class String
  # Drops ASCII control characters (codes 0-31 and 127, DEL); everything else is kept.
  def strip_control_characters
    chars.reject { |char| char.ascii_only? && (char.ord < 32 || char.ord == 127) }.join
  end
end
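A quick way to see what this does in practice is sketched below. The method is repeated so the snippet runs on its own, and the sample string is made up.

```ruby
class String
  # Same method as above, repeated so this snippet is self-contained.
  def strip_control_characters
    chars.reject { |char| char.ascii_only? && (char.ord < 32 || char.ord == 127) }.join
  end
end

# Tab, newline and DEL are control characters; the accented "é" is not.
raw = "caf\u00E9\tline one\nline two\x7F"
cleaned = raw.strip_control_characters

puts cleaned # → caféline oneline two
```

Note that tabs and newlines are removed too, so multi-line captions collapse into a single line, while non-ASCII characters pass through untouched.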