Counting words in Ruby with some exceptions

Submitted by 放荡痞女 on 2019-12-24 13:42:34

Question


Say that we want to count the number of words in a document. I know we can do the following:

text.each_line(){ |line| totalWords = totalWords + line.split.size }

Now say I want to add some exceptions, so that the following are not counted as words:

(1) numbers

(2) standalone letters

(3) email addresses

How can we do that?

Thanks.


Answer 1:


You can wrap this up pretty neatly:

total_words = 0 # the counter must exist before += is used
text.each_line do |line|
  total_words += line.split.reject do |word|
    word.match(/\A(\d+|\w|\S*\@\S+\.\S+)\z/)
  end.length
end

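The last alternative in that pattern is only a rough approximation of an email address: some optional non-space characters, an @, and then at least one dot somewhere in what follows.

Remember that Ruby convention strongly favors snake_case variable names such as total_words over camelCase names such as totalWords.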
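
As a rough check of the pattern above, here is a small self-contained sketch. The helper name count_words and the sample text are made up for illustration; only the regular expression comes from this answer.

```ruby
# Hypothetical helper: counts whitespace-separated tokens that are not
# integers, single word characters, or rough email look-alikes.
def count_words(text)
  excluded = /\A(\d+|\w|\S*@\S+\.\S+)\z/
  total_words = 0
  text.each_line do |line|
    total_words += line.split.reject { |word| word.match?(excluded) }.length
  end
  total_words
end

sample = "I have 2 cats\nwrite to me at cats@example.com today\n"
puts count_words(sample) # prints 7 ("I", "2" and the address are skipped)
```

Note that tokens are split on whitespace only, so a number with attached punctuation such as "2," does not match \d+ and is still counted.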




Answer 2:


Assuming you can represent all the exceptions in a single regular expression regex_variable, you could do:

text.each_line { |line| totalWords = totalWords + line.split.count { |wrd| wrd !~ regex_variable } }

Your regular expression could look something like:

regex_variable = /\A\d+\z|\A[a-z]\z|\A[^@\s]+@(?:[-a-z0-9]+\.)+[a-z]{2,}\z/i

I don't claim to be a regex expert, so you may want to double-check that, particularly the email validation part.
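
One way to double-check a pattern like this is to run it against a handful of tokens. The sketch below assumes a variant in which every alternative is anchored to the whole token; the token list is invented for illustration.

```ruby
# Assumed pattern: whole-token integers, single letters, or a simple
# email shape (local part, @, dotted domain).
regex_variable = /\A\d+\z|\A[a-z]\z|\A[^@\s]+@(?:[-a-z0-9]+\.)+[a-z]{2,}\z/i

tokens = %w[hello 42 x fido@dogs.org a1b 3.14]
tokens.each do |token|
  status = (token =~ regex_variable) ? "skipped" : "counted"
  puts "#{token}: #{status}"
end

counted = tokens.count { |token| token !~ regex_variable }
puts counted # prints 3 ("hello", "a1b" and "3.14" are counted)
```

With this variant a decimal such as 3.14 is still counted as a word; the number alternative would have to be extended if decimals should be skipped too.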




Answer 3:


In addition to the other answers, a little gem hunting came up with this:

WordsCounted Gem

Get the following data from any string or readable file:

  • Word count
  • Unique word count
  • Word density
  • Character count
  • Average characters per word
  • A hash map of words and the number of times they occur
  • A hash map of words and their lengths
  • The longest word(s) and its length
  • The most occurring word(s) and its number of occurrences.
  • Count individual strings for occurrences.
  • A flexible way to exclude words (or anything) from the count. You can pass a string, a regexp, an array, or a lambda.
  • Customisable criteria. Pass your own regexp rules to split strings if you prefer. The default regexp has two features:
      • Filters special characters but respects hyphens and apostrophes.
      • Plays nicely with diacritics (UTF and unicode characters): "São Paulo" is treated as ["São", "Paulo"] and not ["S", "", "o", "Paulo"].
  • Opens and reads files. Pass in a file path or a url instead of a string.



Answer 4:


Have you ever started answering a question and found yourself wandering, exploring interesting, but tangential issues, or concepts you didn't fully understand? That's what happened to me here. Perhaps some of the ideas might prove useful in other settings, if not for the problem at hand.

For readability, we might define some helpers in the class String, but to avoid contamination, I'll use Refinements.

Code

module StringHelpers
  refine String do
    def count_words
      remove_punctuation.split.count { |w|
        !(w.is_number? || w.size == 1 || w.is_email_address?) }
    end

    def remove_punctuation
      gsub(/[.!?,;:)](?:\s|$)|(?:^|\s)\(|\-|\n/,' ')
    end

    def is_number?
      self =~ /\A-?\d+(?:\.\d+)?\z/
    end

    def is_email_address?
      include?('@') # for testing only
    end
  end
end

module CountWords
   using StringHelpers

   def self.count_words_in_file(fname)
     IO.foreach(fname).reduce(0) { |t,l| t+l.count_words }
   end
end

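Note that here using is called inside a module (it could equally be a class). It did not work for me in main, which I first put down to the methods then becoming available in the class self.class #=> Object, defeating the purpose of Refinements. That explanation is probably wrong: using is also permitted at the top level of a source file, where it stays active only until the end of that file, so the failure more likely came from calling it at an IRB prompt, which older versions reject. (Readers: please correct me if I've got this wrong.)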

Example

Let's first informally check that the helpers are working correctly:

module CheckHelpers
  using StringHelpers

  s = "You can reach my dog, a 10-year-old golden, at fido@dogs.org."
  p s = s.remove_punctuation
    #=> "You can reach my dog a 10 year old golden at fido@dogs.org "

  p words = s.split
    #=> ["You", "can", "reach", "my", "dog", "a", "10",
    #    "year", "old", "golden", "at", "fido@dogs.org"]

  p '123'.is_number?  #=> 0
  p '-123'.is_number? #=> 0
  p '1.23'.is_number? #=> 0
  p '123.'.is_number? #=> nil

  p "fido@dogs.org".is_email_address?    #=> true
  p "fido(at)dogs.org".is_email_address? #=> false 

  p s.count_words     #=> 9 (`'a'`, `'10'` and "fido@dogs.org" excluded)

  s = "My cat, who has 4 lives remaining, is at abbie(at)felines.org."
  p s = s.remove_punctuation
  p s.count_words

end

All looks OK. Next, I'll put some text in a file:

FName = "pets"

text =<<_
My cat, who has 4 lives remaining, is at abbie(at)felines.org.
You can reach my dog, a 10-year-old golden, at fido@dogs.org.
_


File.write(FName, text)
  #=> 125

and confirm the file contents:

File.read(FName)
  #=> "My cat, who has 4 lives remaining, is at abbie(at)felines.org.\n
  #   You can reach my dog, a 10-year-old golden, at fido@dogs.org.\n"

Now, count the words:

CountWords.count_words_in_file(FName)
  #=> 18 (9 in each line)

Note that there is at least one problem with the removal of punctuation. It has to do with the hyphen. Any idea what that might be?
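
One plausible reading of the hyphen problem is that replacing every hyphen with a space splits a hyphenated compound into several tokens, each of which is then counted separately. The sketch below applies the same substitution pattern directly with gsub, outside the refinement; the sample string is made up, and this may not be the only issue the author had in mind.

```ruby
# Same pattern as remove_punctuation in the answer, used without the refinement.
pattern = /[.!?,;:)](?:\s|$)|(?:^|\s)\(|\-|\n/

compound = "a well-known state-of-the-art method"

# With hyphens replaced by spaces, each part of a compound becomes its own word.
stripped = compound.gsub(pattern, ' ')
words = stripped.split.reject { |w| w.size == 1 }
puts words.size # prints 7

# With hyphens left alone, each compound stays a single token.
kept = compound.split.reject { |w| w.size == 1 }
puts kept.size # prints 3
```

Whether "state-of-the-art" should count as one word or four depends on the definition of a word being used, so neither result is wrong in itself; the point is only that the substitution silently picks one.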




Answer 5:


Something like...?

def is_countable(word)
  return false if word.size < 2              # standalone letters
  return false if word =~ /^[0-9]+$/         # numbers
  return false if is_an_email_address(word)  # you need a gem for this...
  true
end

# The block must always return the running count, even for skipped words.
wordCount = text.split.inject(0) { |count, word| is_countable(word) ? count + 1 : count }

Or, since I am jumping to the conclusion that you can just split your entire text into an array with split(), you might need:

wordCount = 0
text.each_line do |line|
  line.split.each{|word| wordCount += 1 if is_countable(word) }
end
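
The helper is_an_email_address is left undefined above. As a minimal sketch, assuming the pattern bundled with Ruby's standard uri library is strict enough for your purposes, it could be written without an extra gem:

```ruby
require 'uri'

# Hypothetical implementation of the helper named in the answer.
# Assumption: URI::MailTo::EMAIL_REGEXP is an acceptable definition of
# "email address" for this count; it is anchored to the whole string.
def is_an_email_address(word)
  word.match?(URI::MailTo::EMAIL_REGEXP)
end

p is_an_email_address("fido@dogs.org")    # true
p is_an_email_address("fido(at)dogs.org") # false
p is_an_email_address("fido@dogs.org.")   # false, because of the trailing period
```

Because the pattern is anchored, an address followed directly by sentence punctuation is not recognised, so you may want to strip trailing punctuation from each token before testing it.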


Source: https://stackoverflow.com/questions/31146079/counting-words-in-ruby-with-some-exceptions
