I'm trying to write a function in Python that determines what type of value a string holds; for example,
if the string is 1 or 0 or True or False, the value is BIT.
For example, it doesn't work for 2010-00-10, which should be TEXT but comes out as INT, or for 20.90, which should be FLOAT but comes out as INT.
>>> import re
>>> patternINT=re.compile('[0-9]+')
>>> print patternINT.match('2010-00-10')
<_sre.SRE_Match object at 0x7fa17bc69850>
>>> patternINT=re.compile('[0-9]+$')
>>> print patternINT.match('2010-00-10')
None
>>> print patternINT.match('2010')
<_sre.SRE_Match object at 0x7fa17bc69850>
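Don't forget the $ to anchor the pattern at the end of the string. Without it, match() is satisfied as soon as the beginning of the string fits the pattern.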
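To make the difference concrete, here is a small sketch. It assumes Python 3 (print as a function, plus re.fullmatch, which exists from Python 3.4 on); the names unanchored and anchored are just illustrative, and the sample strings are the ones from the question.

```python
import re

unanchored = re.compile(r'[0-9]+')
anchored = re.compile(r'[0-9]+$')

for s in ['2010-00-10', '2010', '20.90']:
    m = unanchored.match(s)
    # match() only anchors at the start, so it stops after the leading digits.
    prefix = m.group() if m else None
    print(s,
          '| prefix matched:', prefix,
          '| anchored:', bool(anchored.match(s)),
          '| fullmatch:', bool(unanchored.fullmatch(s)))
```

One caveat: $ also matches just before a trailing newline, so anchored.match('2010\n') still succeeds. If that matters for your data, fullmatch() (or \Z instead of $) is the stricter choice.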
Before you go too far down the regex route, have you considered using ast.literal_eval?
Examples:
In [35]: ast.literal_eval('1')
Out[35]: 1
In [36]: type(ast.literal_eval('1'))
Out[36]: int
In [38]: type(ast.literal_eval('1.0'))
Out[38]: float
In [40]: type(ast.literal_eval('[1,2,3]'))
Out[40]: list
May as well use Python to parse it for you!
OK, here is a bigger example:
import ast, re

def dataType(str):
    str=str.strip()
    if len(str) == 0: return 'BLANK'
    try:
        t=ast.literal_eval(str)
    except ValueError:
        return 'TEXT'
    except SyntaxError:
        return 'TEXT'
    else:
        if type(t) in [int, long, float, bool]:
            if t in set((True,False)):
                return 'BIT'
            if type(t) is int or type(t) is long:
                return 'INT'
            if type(t) is float:
                return 'FLOAT'
        else:
            return 'TEXT'

testSet=[' 1 ', ' 0 ', 'True', 'False', #should all be BIT
         '12', '34l', '-3','03', #should all be INT
         '1.2', '-20.4', '1e66', '35.','- .2','-.2e6', #should all be FLOAT
         '10-1', 'def', '10,2', '[1,2]','35.9.6','35..','.'] #should all be TEXT

for t in testSet:
    print "{:10}:{}".format(t,dataType(t))
Output:
 1        :BIT
 0        :BIT
True      :BIT
False     :BIT
12        :INT
34l       :INT
-3        :INT
03        :INT
1.2       :FLOAT
-20.4     :FLOAT
1e66      :FLOAT
35.       :FLOAT
- .2      :FLOAT
-.2e6     :FLOAT
10-1      :TEXT
def       :TEXT
10,2      :TEXT
[1,2]     :TEXT
35.9.6    :TEXT
35..      :TEXT
.         :TEXT
And if you positively MUST have a regex solution, which produces the same results, here it is:
def regDataType(str):
    str=str.strip()
    if len(str) == 0: return 'BLANK'
    if re.match(r'True$|^False$|^0$|^1$', str):
        return 'BIT'
    if re.match(r'([-+]\s*)?\d+[lL]?$', str):
        return 'INT'
    if re.match(r'([-+]\s*)?[1-9][0-9]*\.?[0-9]*([Ee][+-]?[0-9]+)?$', str):
        return 'FLOAT'
    if re.match(r'([-+]\s*)?[0-9]*\.?[0-9][0-9]*([Ee][+-]?[0-9]+)?$', str):
        return 'FLOAT'
    return 'TEXT'
I cannot recommend the regex over the ast version however; just let Python do the interpretation of what it thinks these data types are rather than interpret them with a regex...
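For readers on Python 3, here is a minimal sketch of the ast approach. It is a port under my own assumptions, not a drop-in copy: the name data_type is hypothetical, long no longer exists, and Python 3 rejects literals such as '34l' and '03', so those come out as TEXT. It also assumes Python 3.7 or newer, where literal_eval refuses expressions like '10-1'. I check for bool explicitly so that '1.0' is reported as FLOAT; with a plain membership test against True/False it would count as BIT, because 1.0 == True.

```python
import ast

def data_type(value):
    """Classify a string as BLANK, BIT, INT, FLOAT or TEXT (Python 3 sketch)."""
    value = value.strip()
    if not value:
        return 'BLANK'
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Anything Python cannot read as a literal is treated as text.
        return 'TEXT'
    # bool is a subclass of int, so test for it first.
    if isinstance(parsed, bool):
        return 'BIT'
    if isinstance(parsed, int):
        return 'BIT' if parsed in (0, 1) else 'INT'
    if isinstance(parsed, float):
        return 'FLOAT'
    # Lists, tuples, quoted strings, None and so on.
    return 'TEXT'

for sample in [' 1 ', 'True', '12', '-3', '1.2', '1e66', '10-1', 'def', '[1,2]']:
    print('{:10}:{}'.format(sample, data_type(sample)))
```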
You could also use json.
import json
converted_val = json.loads('32.45')
type(converted_val)
Outputs
<type 'float'>
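If you go the json route you will want to catch parse failures. A rough sketch, assuming Python 3 (the helper name json_type is made up for illustration):

```python
import json

def json_type(text):
    """Return the name of the type json.loads produces, or 'TEXT' on failure."""
    try:
        value = json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return 'TEXT'
    return type(value).__name__

for sample in ['32.45', '12', 'true', 'True', '2010-00-10', '[1, 2]']:
    print(sample, '->', json_type(sample))
```

Keep in mind that JSON is stricter than Python literals: booleans are spelled true and false in lowercase, so 'True' falls through to TEXT, and number forms like '35.' are rejected as well.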
EDIT
To answer your question, however: re.match() returns partial matches, starting from the beginning of the string.
Since you keep evaluating every pattern match, the sequence for "2010-00-10" goes like this:
if patternTEXT.match(str_obj): #don't use 'string' as a variable name.
it matches, so odp is set to "text"
then, your script does:
if patternFLOAT.match(str_obj):
no match, odp still equals "text"
if patternINT.match(str_obj):
partial match, odp is set to "INT"
Because match returns partial matches and every if statement is evaluated, the last one that matches determines which string is returned in odp.
You can do one of several things:
1. rearrange the order of your if statements so that the last one to match is the correct one.
2. use if and elif for the rest of your if statements so that only the first statement to match is evaluated.
3. check to make sure the match object is matching the entire string:
...
match = patternINT.match(str_obj)
if match:
    if match.end() == match.endpos:
        #do stuff
...
You said that you used these for input:
Your original code:
def dataType(string):
    odp=''
    patternBIT=re.compile('[01]')
    patternINT=re.compile('[0-9]+')
    patternFLOAT=re.compile('[0-9]+\.[0-9]+')
    patternTEXT=re.compile('[a-zA-Z0-9]+')
    if patternTEXT.match(string):
        odp= "text"
    if patternFLOAT.match(string):
        odp= "FLOAT"
    if patternINT.match(string):
        odp= "INT"
    if patternBIT.match(string):
        odp= "BIT"
    return odp
Your if statements would be executed sequentially, that is:
if patternTEXT.match(string):
    odp= "text"
if patternFLOAT.match(string):
    odp= "FLOAT"
if patternINT.match(string):
    odp= "INT"
if patternBIT.match(string):
    odp= "BIT"
"2010-00-10" matches your text pattern, but the code then goes on to try your float pattern (which fails because there is no . in the string) and then your int pattern, which succeeds because the string starts with something that fits [0-9]+.
You should use:
if patternTEXT.match(string):
    odp = "text"
elif patternFLOAT.match(string):
    ...
Though for your situation, you probably want to go from more specific to less specific, because as you've seen, stuff that is text might also match as int (and vice versa). You would need to improve your regular expressions too: your 'text' pattern only matches alphanumeric input and never matches special symbols.
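A sketch of the "more specific to less specific" ordering, assuming Python 3.4+ for fullmatch(). The PATTERNS table and the classify name are mine, and the loop is just a compact stand-in for an if/elif chain; the patterns themselves are the ones from the question.

```python
import re

# Hypothetical table built from the question's patterns, most specific first.
PATTERNS = [
    ('BIT', re.compile(r'[01]')),
    ('INT', re.compile(r'[0-9]+')),
    ('FLOAT', re.compile(r'[0-9]+\.[0-9]+')),
]

def classify(string):
    # fullmatch() rejects partial matches, and the first pattern
    # that fits the whole string wins (like if/elif).
    for label, pattern in PATTERNS:
        if pattern.fullmatch(string):
            return label
    # Anything else, including special symbols, is text.
    return 'TEXT'

for sample in ['1', '2010', '20.90', '2010-00-10', 'abc!']:
    print(sample, '->', classify(sample))
```

Falling back to TEXT at the end sidesteps the problem of writing a text pattern that covers every special symbol.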
I will offer my own suggestion, though I do like the AST solution more:
def get_type(string):
    if len(string) == 1 and string in ['0', '1']:
        return "BIT"
    # int has to come before float, because integers can be
    # floats.
    try:
        long(string)
        return "INT"
    except ValueError:
        pass
    try:
        float(string)
        return "FLOAT"
    except ValueError:
        pass
    return "TEXT"
Run example:
In [27]: get_type("034")
Out[27]: 'INT'
In [28]: get_type("3-4")
Out[28]: 'TEXT'
In [29]: get_type("20.90")
Out[29]: 'FLOAT'
In [30]: get_type("u09pweur909ru20")
Out[30]: 'TEXT'
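If you are on Python 3, where long is gone, the same idea might look roughly like this. Treat it as a sketch under that assumption rather than a tested replacement:

```python
def get_type(string):
    if string in ('0', '1'):
        return 'BIT'
    # int() has to be tried before float(), because every integer
    # string is also a valid float string.
    for label, convert in (('INT', int), ('FLOAT', float)):
        try:
            convert(string)
            return label
        except ValueError:
            pass
    return 'TEXT'

for sample in ['034', '3-4', '20.90', 'u09pweur909ru20', 'nan']:
    print(sample, '->', get_type(sample))
```

Be aware that int() and float() are more lenient than you might expect: surrounding whitespace is accepted, and float() also takes strings such as 'nan' and 'inf', so those come back as FLOAT.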