I'm trying to write a text normalizer, and one of the basic cases that needs to be handled is turning something like 3.14 into "three point one four" or "three point fourteen".
I'm currently using the pattern \$?\d+(\.\d+)?%? with nltk.regexp_tokenize, which I believe should handle numbers as well as currency and percentages. However, at the moment, something like $23.50 is handled perfectly (it parses to ['$23.50']), but 3.14 is parsing to ['3', '14'] - the decimal point is being dropped.
I've tried adding a separate pattern \d+.\d+ to my regexp, but that didn't help (and shouldn't my current pattern match that already?)
Edit 2: I also just discovered that the % part doesn't seem to be working correctly either - 20% returns just ['20']. I feel like there must be something wrong with my regexp, but I've tested it in Pythex and it seems fine?
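Tested on its own with plain re (independent of NLTK), the number pattern does seem to match these strings in full - a minimal check in Python 3:

import re

num = re.compile(r'\$?\d+(\.\d+)?%?')
print(num.fullmatch('3.14'))    # matches the whole string
print(num.fullmatch('20%'))     # matches
print(num.fullmatch('$23.50'))  # matches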
Edit: Here is my code.
import nltk

line = '3.14 is pi.'            # one input line (see test strings below)

pattern = r'''(?x)              # set flag to allow verbose regexps
      ([A-Z]\.)+                # abbreviations, e.g. U.S.A.
    | \w+([-']\w+)*             # words w/ optional internal hyphens/apostrophes
    | \$?\d+(\.\d+)?%?          # numbers, incl. currency and percentages
    | [+/\-@&*]                 # special characters with meanings
    '''

words = nltk.regexp_tokenize(line, pattern)
words = [w.lower() for w in words]
print(words)
Here are some of my test strings:
32188
2598473
26 letters from A to Z
3.14 is pi. <-- ['3', '14', 'is', 'pi']
My weight is about 68 kg, +/- 10 grams.
Good muffins cost $3.88 in New York <-- ['good', 'muffins', 'cost', '$3.88', 'in', 'new', 'york']
The culprit is:
\w+([-']\w+)*
\w+ will match numbers, and since there's no . there, it will match only the 3 in 3.14.
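You can see this with plain re - a quick check showing that the word alternative stops at the dot:

import re

print(re.match(r"\w+([-']\w+)*", '3.14').group())  # prints '3'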
Move the options around a bit so that \$?\d+(\.\d+)?%? comes before that part (so that the match is attempted on the number format first):
(?x)([A-Z]\.)+|\$?\d+(\.\d+)?%?|\w+([-']\w+)*|[+/\-@&*]
Or in expanded form:
pattern = r'''(?x)              # set flag to allow verbose regexps
      ([A-Z]\.)+                # abbreviations, e.g. U.S.A.
    | \$?\d+(\.\d+)?%?          # numbers, incl. currency and percentages
    | \w+([-']\w+)*             # words w/ optional internal hyphens/apostrophes
    | [+/\-@&*]                 # special characters with meanings
    '''
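As a quick sanity check of the reordered pattern (a minimal sketch; depending on your NLTK version, capturing groups can make regexp_tokenize return group contents rather than whole tokens, so this version uses non-capturing (?:...) groups):

import nltk

# Number alternative first; (?:...) keeps findall returning whole matches
pattern = r'''(?x)
      (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?        # numbers, incl. currency and percentages
    | \w+(?:[-']\w+)*           # words w/ optional internal hyphens/apostrophes
    | [+/\-@&*]                 # special characters with meanings
    '''

print(nltk.regexp_tokenize('3.14 is pi.', pattern))
# ['3.14', 'is', 'pi']
print(nltk.regexp_tokenize('Good muffins cost $3.88, about a 20% tip', pattern))
# ['Good', 'muffins', 'cost', '$3.88', 'about', 'a', '20%', 'tip']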
Try this regex:
\b\$?\d+(\.\d+)?%?\b
I surrounded the initial regex with word-boundary matches (\b).
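A quick check of just this number pattern with plain re (using a non-capturing group here so findall returns whole matches):

import re

bounded = r'\b\$?\d+(?:\.\d+)?%?\b'
print(re.findall(bounded, '3.14 is pi.'))  # ['3.14'] - the decimal point is kept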
Source: https://stackoverflow.com/questions/22175923/nltk-regexp-tokenizer-not-playing-nice-with-decimal-point-in-regex