NLTK regexp tokenizer not playing nice with decimal point in regex

Submitted by 自作多情 on 2019-12-19 03:22:42

Question


I'm trying to write a text normalizer, and one of the basic cases that needs to be handled is turning something like 3.14 to three point one four or three point fourteen.

I'm currently using the pattern \$?\d+(\.\d+)?%? with nltk.regexp_tokenize, which I believe should handle numbers as well as currency and percentages. However, at the moment, something like $23.50 is handled perfectly (it parses to ['$23.50']), but 3.14 parses to ['3', '14']: the decimal point is being dropped.

I've tried adding a separate pattern \d+.\d+ to my regexp, but that didn't help (and shouldn't my current pattern match that already?)

Edit 2: I also just discovered that the % part doesn't seem to be working correctly either: 20% returns just ['20']. I feel like there must be something wrong with my regexp, but I've tested it in Pythex and it seems fine?

Edit: Here is my code.

import nltk

pattern = r'''(?x)          # set flag to allow verbose regexps
    ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
    | \w+([-']\w+)*         # words w/ optional internal hyphens/apostrophes
    | \$?\d+(\.\d+)?%?      # numbers, incl. currency and percentages
    | [+/\-@&*]             # special characters with meanings
    '''

words = nltk.regexp_tokenize(line, pattern)
words = [w.lower() for w in words]
print(words)

Here are some of my test strings:

32188
2598473
26 letters from A to Z
3.14 is pi.                         <-- ['3', '14', 'is', 'pi']
My weight is about 68 kg, +/- 10 grams.
Good muffins cost $3.88 in New York <-- ['good', 'muffins', 'cost', '$3.88', 'in', 'new', 'york']
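The failing cases above can be reproduced with the plain re module (using non-capturing groups so re.findall returns whole matches, which is effectively what nltk.regexp_tokenize does). This also shows why the $3.88 case works: $ is not a word character, so the \w+ alternative can't claim it.

```python
import re

# Same alternatives, same order, as the question's pattern.
pattern = r"(?x)(?:[A-Z]\.)+|\w+(?:[-']\w+)*|\$?\d+(?:\.\d+)?%?|[+/\-@&*]"

print(re.findall(pattern, '3.14 is pi.'))  # ['3', '14', 'is', 'pi']
print(re.findall(pattern, '20%'))          # ['20'] - '%' is dropped too
print(re.findall(pattern, '$3.88'))        # ['$3.88'] - '$' blocks \w+
```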

Answer 1:


The culprit is:

\w+([-']\w+)*

\w+ matches digits too, and since . is not a word character, it matches only the 3 in 3.14. Alternation tries options left to right, so move the options around a bit so that \$?\d+(\.\d+)?%? comes before the \w+ part (that way the number format is attempted first):

(?x)([A-Z]\.)+|\$?\d+(\.\d+)?%?|\w+([-']\w+)*|[+/\-@&*]

regex101 demo

Or in expanded form:

pattern = r'''(?x)               # set flag to allow verbose regexps
              ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
              | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
              | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
              | [+/\-@&*]        # special characters with meanings
            '''
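A quick sanity check of the reordered alternatives with plain re (non-capturing groups so re.findall returns whole matches, mirroring nltk.regexp_tokenize):

```python
import re

# Number alternative now comes before the word alternative.
pattern = r"(?x)(?:[A-Z]\.)+|\$?\d+(?:\.\d+)?%?|\w+(?:[-']\w+)*|[+/\-@&*]"

print(re.findall(pattern, '3.14 is pi.'))  # ['3.14', 'is', 'pi']
print(re.findall(pattern, '20%'))          # ['20%']
print(re.findall(pattern, 'Good muffins cost $3.88 in New York'))
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York']
```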



Answer 2:


Try this regex:

\b\$?\d+(\.\d+)?%?\b

I surrounded the original regex with word-boundary assertions: \b.
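A quick check with plain re suggests the word boundaries fix the decimal case, but note that \b interacts with the non-word characters $ and %, since a boundary needs a word character on one side:

```python
import re

pattern = r'\b\$?\d+(?:\.\d+)?%?\b'

print(re.findall(pattern, '3.14 is pi.'))  # ['3.14'] - decimal kept
print(re.findall(pattern, '$23.50'))       # ['23.50'] - leading \b skips '$'
print(re.findall(pattern, '20%'))          # ['20'] - trailing \b drops '%'
```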



Source: https://stackoverflow.com/questions/22175923/nltk-regexp-tokenizer-not-playing-nice-with-decimal-point-in-regex
