Word counts in Python using regular expression

Submitted by 人走茶凉 on 2019-12-10 17:44:11

Question


What is the correct way to count English words in a document using a regular expression?

I tried with:

import re
words = re.findall(r'\w+', open('text.txt').read().lower())
len(words)

but it seems I am missing a few words (compared to the word count in gedit). Am I doing it right?

Thanks a lot!


Answer 1:


Using \w+ won't correctly count words containing apostrophes or hyphens; e.g. "can't" will be counted as 2 words ("can" and "t"). It will also count numbers (strings of digits), and "12,345" and "6.7" will each count as 2 words ("12" and "345", "6" and "7").
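
To illustrate the point above, here is a minimal sketch of a pattern that keeps such tokens together. The names WORD_RE and count_words are hypothetical, and the rule itself is an assumption about what should count as one word: apostrophes and hyphens are allowed inside a word, and a comma or period is allowed only between two digits. It does not handle typographic (curly) apostrophes.

```python
import re

# Hypothetical pattern (an assumption, not a standard definition of "word"):
# a run of word characters, optionally continued by
#   - an apostrophe or hyphen followed by more word characters, or
#   - a comma or period that sits between two digits.
WORD_RE = re.compile(r"\w+(?:(?:['-]|(?<=\d)[.,](?=\d))\w+)*")

def count_words(text):
    """Count tokens matched by WORD_RE in text."""
    return len(WORD_RE.findall(text))

sample = "I can't find the well-known 12,345 items or 6.7 more."

print(len(re.findall(r"\w+", sample)))  # plain \w+ splits can't, well-known, 12,345, 6.7
print(count_words(sample))              # keeps each of those as one token
```

On this sample the plain \w+ pattern finds 14 matches, while the sketch finds 10. Whether a number such as "12,345" should count as a word at all depends on the tool you are comparing against.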




Answer 2:


This seems to work as expected.

>>> import re
>>> words=re.findall('\w+', open('/usr/share/dict/words').read().lower())
>>> len(words)
234936
>>> 
bash-3.2$ wc /usr/share/dict/words
  234936  234936 2486813 /usr/share/dict/words

Why are you lowercasing your words? What does that have to do with the count?

Lowercasing does not change the number of matches, so I'd submit that the following would be more efficient:

words=re.findall(r'\w+', open('/usr/share/dict/words').read())
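
The regex count agrees with wc for a word list like /usr/share/dict/words, but the two can differ on ordinary prose, because wc counts whitespace-separated tokens. A rough sketch of the comparison follows; the function names are hypothetical, and str.split() is only an approximation of what wc does (how gedit counts words is not covered here).

```python
import re

def regex_count(text):
    """Count runs of word characters, as in the question."""
    return len(re.findall(r"\w+", text))

def whitespace_count(text):
    """Count whitespace-separated tokens, roughly what wc -w reports."""
    return len(text.split())

def count_file(path, counter=whitespace_count):
    """Hypothetical helper: read a text file and apply a counter to it."""
    with open(path, encoding="utf-8") as f:  # encoding is an assumption
        return counter(f.read())

sample = "can't stop -- won't stop\n"
print(regex_count(sample))       # "can't" and "won't" each split in two; "--" is ignored
print(whitespace_count(sample))  # "--" counts as a token; contractions stay whole
```

For this sample the regex count is 6 and the whitespace count is 5, so the two methods can disagree in either direction depending on the punctuation in the text.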


Source: https://stackoverflow.com/questions/6017948/word-counts-in-python-using-regular-expression
