NLTK regexp tokenizer not playing nice with decimal point in regex

泄露秘密 submitted on 2019-11-30 23:08:56

The culprit is:

\w+([-']\w+)*

\w+ also matches digits, and since that alternative contains no ., it matches only the 3 in 3.14. Alternatives are tried left to right and the first one that succeeds wins, so reorder them so that \$?\d+(\.\d+)?%? comes before the part above and the number format is attempted first:

(?x)([A-Z]\.)+|\$?\d+(\.\d+)?%?|\w+([-']\w+)*|[+/\-@&*]


Or in expanded form:

pattern = r'''(?x)               # set flag to allow verbose regexps
              ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
              | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
              | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
              | [+/\-@&*]        # special characters with meanings
            '''
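
The effect of the ordering can be checked with a small sketch. It uses Python's re.findall directly instead of NLTK so that it runs without extra installs, and it rewrites the groups as non-capturing (?:...), because findall returns group contents rather than whole matches when capturing groups are present. Both changes are adaptations for this demo, as are the sample sentence and the variable names; the NLTK documentation for RegexpTokenizer also asks for non-capturing parentheses, so the same adjustment may be needed there, depending on the installed version.

```python
import re

# Words before numbers: \w+ grabs the "3" of "3.14" first.
WORDS_FIRST = r'''(?x)
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(?:[-']\w+)*         # words w/ optional internal hyphens/apostrophe
    | \$?\d+(?:\.\d+)?%?      # numbers, incl. currency and percentages
    | [+/\-@&*]               # special characters with meanings
'''

# Numbers before words: the full number format is attempted first.
NUMBERS_FIRST = r'''(?x)
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?      # numbers, incl. currency and percentages
    | \w+(?:[-']\w+)*         # words w/ optional internal hyphens/apostrophe
    | [+/\-@&*]               # special characters with meanings
'''

# Hypothetical sample sentence, chosen only to exercise each alternative.
text = "U.S.A. growth hit 3.14% on a so-called up-tick"

words_first_tokens = re.findall(WORDS_FIRST, text)
numbers_first_tokens = re.findall(NUMBERS_FIRST, text)

print(words_first_tokens)
# ['U.S.A.', 'growth', 'hit', '3', '14', 'on', 'a', 'so-called', 'up-tick']
print(numbers_first_tokens)
# ['U.S.A.', 'growth', 'hit', '3.14%', 'on', 'a', 'so-called', 'up-tick']
```

With the word alternative first, 3.14% is split into 3 and 14, and the . and % are dropped because no alternative matches them on their own. With the number alternative first, it stays one token.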

Try this regex:

\b\$?\d+(\.\d+)?%?\b

This wraps the original number pattern in word boundaries (\b), so a match has to start and end at the edge of a word.
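
It is worth checking what this pattern matches on its own before using it. The sketch below runs it through re.findall on a made-up sentence; the non-capturing group and the names BOUNDED and sample are assumptions for this demo. \b only holds between a word character and a non-word character, so a $ preceded by a space, or a % followed by a space or the end of the string, does not sit on a boundary and is left out of the match.

```python
import re

# The number pattern wrapped in \b, with a non-capturing group so that
# findall returns the whole match (a demo adaptation).
BOUNDED = r'\b\$?\d+(?:\.\d+)?%?\b'

# Hypothetical sample text.
sample = "pi is 3.14, lunch was $5.50, tip 20%"

bounded_matches = re.findall(BOUNDED, sample)
print(bounded_matches)
# ['3.14', '5.50', '20']
```

The decimal numbers stay in one piece, but the currency sign and the percent sign are not part of the matches here. If those symbols should stay attached to the number, the reordered pattern from the first answer may be the better fit.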
