NLTK - nltk.tokenize.RegexpTokenizer - regex not working as expected

Submitted by 梦想的初衷 on 2019-12-05 10:44:14

The point is that your \b was being interpreted as a backspace character because the pattern was written as a regular string literal; you need to use a raw string literal (r'...'). Also, you have literal pipes inside the character classes, which would also mess up your output: inside [...], | is just a literal | character, not alternation.
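
For example, the difference the raw string makes is easy to see with plain re (a quick illustration, independent of NLTK):

>>> import re
>>> print(repr('\b'), repr(r'\b'))
'\x08' '\\b'
>>> re.findall('[A-Z][.A-Z]+\b', 'U.S.A')   # '\b' here is a backspace character, matches nothing
[]
>>> re.findall(r'[A-Z][.A-Z]+\b', 'U.S.A')  # raw string: \b is a word boundary
['U.S.A']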

This works as expected:

>>> from nltk.tokenize import RegexpTokenizer
>>> pattern = r'[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S'
>>> tokenizer = RegexpTokenizer(pattern)
>>> print(tokenizer.tokenize(line))  # 'line' is the sample string from the question

['U.S.A', 'Count', 'U.S.A.', 'Sec', '.', 'of', 'U.S.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J.', 'Doe', '1.11', '1,000', '10', '-', '-', '20', '10', '-', '20']

Note that putting a single \w into a character class is pointless ([\w] is the same as \w). Also, you do not need to escape every non-word character (like a dot) inside a character class, as most are treated as literal characters there (only ^, ], - and \ require special attention).
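
For instance, with plain re (again just an illustration, not NLTK-specific):

>>> import re
>>> re.findall(r'[\w]+', 'foo bar') == re.findall(r'\w+', 'foo bar')  # [\w] is just \w
True
>>> re.findall(r'[.,]+', 'a.b,c')  # unescaped . and , are literal inside a character class
['.', ',']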

If you modify your regex to

pattern = r'[USA\.]{4,}|[\w]+|[\S]'

and run it through the tokenizer

tokenizer = RegexpTokenizer(pattern)
print(tokenizer.tokenize(line))

you get the output that you wanted:

['U.S.A', 'Count', 'U.S.A.', 'Sec', '.', 'of', 'U.S.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J', '.', 'Doe', '1', '.', '11', '1', ',', '000', '10', '-', '-', '20', '10', '-', '20']