Question
I wish to parse decimal numbers regardless of their format, which is unknown. The language of the original text is unknown and may vary. In addition, the source string can contain extra text before or after the number, such as a currency or units.
I'm using the following:
import re

# NOTE: Do not use, this algorithm is buggy. See below.
def extractnumber(value):
    if (isinstance(value, int)): return value
    if (isinstance(value, float)): return value

    result = re.sub(r'&#\d+', '', value)
    result = re.sub(r'[^0-9\,\.]', '', result)

    if (len(result) == 0): return None

    numPoints = result.count('.')
    numCommas = result.count(',')

    result = result.replace(",", ".")

    if ((numPoints > 0 and numCommas > 0) or (numPoints == 1) or (numCommas == 1)):
        decimalPart = result.split(".")[-1]
        integerPart = "".join ( result.split(".")[0:-1] )
    else:
        integerPart = result.replace(".", "")

    result = int(integerPart) + (float(decimalPart) / pow(10, len(decimalPart) ))

    return result
This kind of works...
>>> extractnumber("2")
2
>>> extractnumber("2.3")
2.3
>>> extractnumber("2,35")
2.35
>>> extractnumber("-2 000,5")
-2000.5
>>> extractnumber("EUR 1.000,74 €")
1000.74
>>> extractnumber("20,5 20,8") # Testing failure...
ValueError: invalid literal for int() with base 10: '205 208'
>>> extractnumber("20.345.32.231,50") # Returns false positive
2034532231.5
So my method seems very fragile to me, and returns lots of false positives.
Is there any library or smart function that can handle this? Ideally, 20.345.32.231,50 should not pass, but numbers in other languages, like 1.200,50 or 1 200'50, should be extracted regardless of the amount of other text and characters (including newlines) around them.
(Updated implementation according to accepted answer: https://github.com/jjmontesl/cubetl/blob/master/cubetl/text/functions.py#L91)
Answer 1:
You can do this with a suitably fancy regular expression. Here's my best attempt at one. I use named capturing groups, as with a pattern this complex, numeric ones would be much more confusing to use in backreferences.
First, the regexp pattern:
import re

_pattern = r"""(?x)       # enable verbose mode (which ignores whitespace and comments)
    ^                     # start of the input
    [^\d+\-.,]*           # prefixed junk (anything except digits, signs and separators)
    (?P<number>           # capturing group for the whole number
        (?P<sign>[+-])?       # sign group (optional)
        (?P<integer_part>     # capturing group for the integer part
            \d{1,3}               # leading digits in an int with a thousands separator
            (?P<sep>              # capturing group for the thousands separator
                [ ,.]                 # the allowed separator characters
            )
            \d{3}                 # exactly three digits after the separator
            (?:                   # non-capturing group
                (?P=sep)              # the same separator again (a backreference)
                \d{3}                 # exactly three more digits
            )*                    # repeated 0 or more times
        |                     # or
            \d+                   # simple integer (just digits with no separator)
        )?                    # integer part is optional, to allow numbers like ".5"
        (?P<decimal_part>     # capturing group for the decimal part of the number
            (?P<point>            # capturing group for the decimal point
                (?(sep)               # conditional pattern, only tested if sep matched
                    (?!                   # a negative lookahead
                        (?P=sep)              # backreference to the separator
                    )
                )
                [.,]                  # the accepted decimal point characters
            )
            \d+                   # one or more digits after the decimal point
        )?                    # the whole decimal part is optional
    )
    [^\d]*                # suffixed junk
    $                     # end of the input
"""
And here's a function to use it:
def parse_number(text):
    match = re.match(_pattern, text)
    if match is None or not (match.group("integer_part") or
                             match.group("decimal_part")):    # failed to match
        return None                      # consider raising an exception instead

    num_str = match.group("number")      # get all of the number, without the junk
    sep = match.group("sep")
    if sep:
        num_str = num_str.replace(sep, "")     # remove thousands separators

    if match.group("decimal_part"):
        point = match.group("point")
        if point != ".":
            num_str = num_str.replace(point, ".")  # regularize the decimal point
        return float(num_str)

    return int(num_str)
Some numeric strings with exactly one comma or period and exactly three digits following it (like "1,234" and "1.234") are ambiguous. This code will parse both of them as integers with a thousand separator (1234), rather than floating point values (1.234), regardless of the actual separator character used. It's possible you could handle this with a special case if you want a different outcome for those numbers (e.g. if you'd prefer to make a float out of 1.234).
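One way such a special case could look is a small pre-check that runs before parse_number. This is only a sketch: the helper name parse_ambiguous and its prefer_decimal flag are assumptions made for illustration and are not part of the answer's code.

```python
import re

# Hypothetical helper (not part of the answer above). It matches strings of
# the exact shape "<1-3 digits><one comma or period><3 digits>", optionally
# signed, e.g. "1,234" or "-1.234".
_AMBIGUOUS = re.compile(r"([+-]?\d{1,3})[.,](\d{3})")

def parse_ambiguous(text, prefer_decimal=True):
    """Return a number for an ambiguous string, or None if the shape differs."""
    match = _AMBIGUOUS.fullmatch(text.strip())
    if match is None:
        return None  # not ambiguous; the caller can fall back to parse_number
    integer, fraction = match.groups()
    if prefer_decimal:
        return float(integer + "." + fraction)  # read the separator as a decimal point
    return int(integer + fraction)              # read it as a thousands separator
```

A caller could try parse_ambiguous first and only call parse_number when it returns None. Note that this sketch expects the bare number, without surrounding junk text.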
Some test output:
>>> test_cases = ["2", "2.3", "2,35", "-2 000,5", "EUR 1.000,74 €",
...               "20,5 20,8", "20.345.32.231,50", "1.234"]
>>> for s in test_cases:
...     print("{!r:20}: {}".format(s, parse_number(s)))
...
'2'                 : 2
'2.3'               : 2.3
'2,35'              : 2.35
'-2 000,5'          : -2000.5
'EUR 1.000,74 €'    : 1000.74
'20,5 20,8'         : None
'20.345.32.231,50'  : None
'1.234'             : 1234
Answer 2:
I refactored your code a bit. This, together with the valid_number function below, should do the trick.
The main reason I took the time to write this awful piece of code, though, is to show future readers how awful parsing can get if you don't know how to use regular expressions (like me, for instance).
Hopefully, someone who knows regexps better than I do can show us how it should be done :)
Constraints
- . , and ' are each accepted as both thousand separator and decimal separator
- Not more than two different separators
- At most one separator with more than one occurrence
- A separator is treated as a decimal separator if it is the only separator present and occurs only once (i.e. 123,456 is interpreted as 123.456, not 123456)
- The string is split into a list of numbers by a double space ('  ')
- All parts of a thousand-separated number, except for the first part, are required to be 3 digits long (123,456.00 and 1,345.00 are both considered valid, but 2345,11.00 is not considered valid)
Code
import re
from itertools import combinations

def extract_number(value):
    if isinstance(value, (int, float)):
        yield float(value)
    else:
        # Strip the string for leading and trailing whitespace
        value = value.strip()
        if len(value) == 0:
            # A plain return ends the generator; raising StopIteration here
            # becomes a RuntimeError since Python 3.7 (PEP 479)
            return
        # Candidate numbers are separated by a double space
        for s in value.split('  '):
            s = re.sub(r'&#\d+', '', s)
            # Blank out everything except digits, minus signs and the three separators
            s = re.sub(r"[^\-\s0-9,.']", ' ', s)
            s = s.replace(' ', '')
            if len(s) == 0:
                continue
            if not valid_number(s):
                continue
            if not sum(s.count(sep) for sep in [',', '.', '\'']):
                yield float(s)
            else:
                # The last separator is taken as the decimal point
                s = s.replace('.', '@').replace('\'', '@').replace(',', '@')
                integer, decimal = s.rsplit('@', 1)
                integer = integer.replace('@', '')
                s = '.'.join([integer, decimal])
                yield float(s)
Well - here comes the code that could probably be replaced by a couple of regexp statements.
def valid_number(s):
    def _correct_integer(integer):
        # First number should have length of 1-3
        if not (0 < len(integer[0].replace('-', '')) < 4):
            return False
        # All the rest of the integers should be of length 3
        for num in integer[1:]:
            if len(num) != 3:
                return False
        return True

    seps = ['.', ',', '\'']
    n_seps = [s.count(k) for k in seps]

    # If no separator is present
    if sum(n_seps) == 0:
        return True
    # If all separators are present
    elif all(n_seps):
        return False
    # If two separators are present
    elif any(all(c) for c in combinations(n_seps, 2)):
        # Find thousand separator
        for c in s:
            if c in seps:
                tho_sep = c
                break
        # Find decimal separator:
        for c in reversed(s):
            if c in seps:
                dec_sep = c
                break
        s = s.split(dec_sep)
        # If it is more than one decimal separator
        if len(s) != 2:
            return False
        integer = s[0].split(tho_sep)
        return _correct_integer(integer)
    # If one separator is present, and it is more than one of it
    elif sum(n_seps) > 1:
        for sep in seps:
            if sep in s:
                s = s.split(sep)
                break
        return _correct_integer(s)
    # Otherwise, this is a regular decimal number
    else:
        return True
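As a rough idea of what the regexp version might look like, the constraints listed earlier can be written as one verbose pattern. This is a sketch only: the names _VALID_NUMBER and valid_number_re are made up for illustration, it assumes the input has already been stripped down to digits, an optional leading minus sign and separators, and it is not guaranteed to agree with valid_number on every input.

```python
import re

# Hypothetical regexp counterpart to valid_number (names are assumptions).
_VALID_NUMBER = re.compile(r"""(?x)
    -?                            # optional leading minus sign
    (?:
        \d{1,3}                   # first group: one to three digits
        (?P<tho>[.,'])  \d{3}     # thousand separator, then exactly three digits
        (?: (?P=tho) \d{3} )*     # the same separator again, zero or more times
        (?:                       # optional decimal part
            (?! (?P=tho) )        # its separator must differ from the thousand one
            [.,'] \d+
        )?
      |
        \d* [.,'] \d+             # a single separator, read as the decimal point
      |
        \d+                       # plain digits without any separator
    )
""")

def valid_number_re(s):
    """Return True if s fits the separator constraints, else False."""
    return _VALID_NUMBER.fullmatch(s) is not None
```

The named backreference (?P=tho) is what enforces "at most one separator with more than one occurrence", and the negative lookahead keeps the decimal separator different from the thousand separator.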
Output
extract_number('2'                  ): [2.0]
extract_number('.2'                 ): [0.2]
extract_number(2                    ): [2.0]
extract_number(0.2                  ): [0.2]
extract_number('EUR 200'            ): [200.0]
extract_number('EUR 200.00  -11.2'  ): [200.0, -11.2]
extract_number('EUR 200  EUR 300'   ): [200.0, 300.0]
extract_number('$ -1.000,22'        ): [-1000.22]
extract_number('EUR 100.2345,3443'  ): []
extract_number('111,145,234.345.345'): []
extract_number('20,5  20,8'         ): [20.5, 20.8]
extract_number('20.345.32.231,50'   ): []
Source: https://stackoverflow.com/questions/20157375/fuzzy-smart-number-parsing-in-python