Sample input:
\"I was 09809 home -- Yes! yes! You was\"
and output:
{ \'yes\' => 2, \'was\' => 2, \'i\' => 1, \'home\
You can look at my code that splits the text into words. The basic code would look as follows:
sentence = "Ala ma kota za 5zł i 10$."
splitter = SRX::Polish::WordSplitter.new(sentence)
histogram = Hash.new(0)
splitter.each do |word,type|
histogram[word.downcase] += 1 if type == :word
end
p histogram
You should be careful if you wish to work with languages other than English, since in Ruby 1.9 the downcase won't work as you expected for letters such as 'Ł'.