Determine the “type of value” from a string in Python

难免孤独 2021-01-14 02:39

I'm trying to write a function in Python that determines what type of value a string holds; for example:

If the string is 1, 0, True or False, the value is BIT

4 answers
  • 2021-01-14 03:11

    In reply to:

    For example, it doesn't work for 2010-00-10, which should be Text but is reported as INT, or for 20.90, which should be float but is reported as int.

    >>> import re
    >>> patternINT=re.compile('[0-9]+')
    >>> print patternINT.match('2010-00-10')
    <_sre.SRE_Match object at 0x7fa17bc69850>
    >>> patternINT=re.compile('[0-9]+$')
    >>> print patternINT.match('2010-00-10')
    None
    >>> print patternINT.match('2010')
    <_sre.SRE_Match object at 0x7fa17bc69850>
    

    Don't forget the $ at the end of the pattern: it anchors the match to the end of the string, so trailing characters are no longer ignored.
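
    The same point can be sketched in Python 3 (the session above is Python 2). This assumes Python 3.4 or later, where re.fullmatch requires the whole string to match, so no explicit $ is needed. The helper name is_int_string is only for illustration.

```python
import re

PATTERN_INT = re.compile(r'[0-9]+')

def is_int_string(text):
    """Return True only if the whole string consists of digits."""
    # fullmatch() anchors at both ends; match() only anchors at the start.
    return PATTERN_INT.fullmatch(text) is not None

# match() accepts a digit prefix, so the date-like string slips through.
print(PATTERN_INT.match('2010-00-10') is not None)   # True
print(is_int_string('2010-00-10'))                   # False
print(is_int_string('2010'))                         # True
```

    One difference worth knowing: $ also matches just before a trailing newline, so '2010\n' passes the $ version but not fullmatch.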

  • 2021-01-14 03:15

    Before you go too far down the regex route, have you considered using ast.literal_eval?

    Examples:

    In [35]: ast.literal_eval('1')
    Out[35]: 1
    
    In [36]: type(ast.literal_eval('1'))
    Out[36]: int
    
    In [38]: type(ast.literal_eval('1.0'))
    Out[38]: float
    
    In [40]: type(ast.literal_eval('[1,2,3]'))
    Out[40]: list
    

    May as well use Python to parse it for you!

    OK, here is a bigger example:

    import ast, re
    def dataType(str):
        str=str.strip()
        if len(str) == 0: return 'BLANK'
        try:
            t=ast.literal_eval(str)
    
        except ValueError:
            return 'TEXT'
        except SyntaxError:
            return 'TEXT'
    
        else:
            if type(t) in [int, long, float, bool]:
                if t in set((True,False)):
                    return 'BIT'
                if type(t) is int or type(t) is long:
                    return 'INT'
                if type(t) is float:
                    return 'FLOAT'
            else:
                return 'TEXT' 
    
    
    
    testSet=['   1  ', ' 0 ', 'True', 'False',   #should all be BIT
             '12', '34l', '-3','03',              #should all be INT
             '1.2', '-20.4', '1e66', '35.','-   .2','-.2e6',      #should all be FLOAT
             '10-1', 'def', '10,2', '[1,2]','35.9.6','35..','.']
    
    for t in testSet:
        print "{:10}:{}".format(t,dataType(t))
    

    Output:

       1      :BIT
     0        :BIT
    True      :BIT
    False     :BIT
    12        :INT
    34l       :INT
    -3        :INT
    03        :INT
    1.2       :FLOAT
    -20.4     :FLOAT
    1e66      :FLOAT
    35.       :FLOAT
    -   .2    :FLOAT
    -.2e6     :FLOAT
    10-1      :TEXT
    def       :TEXT
    10,2      :TEXT
    [1,2]     :TEXT
    35.9.6    :TEXT
    35..      :TEXT
    .         :TEXT
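
    The code above is Python 2 (long, the print statement). What follows is a rough Python 3 sketch of the same idea, not a drop-in replacement: the name data_type is an assumption, and the results differ in places. In Python 3, '34l' and '03' are no longer valid literals, so they come back as TEXT, and this version reports 1.0 as FLOAT rather than BIT.

```python
import ast

def data_type(text):
    """Classify a string as BLANK, BIT, INT, FLOAT or TEXT."""
    text = text.strip()
    if not text:
        return 'BLANK'
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        # Anything Python cannot read as a literal is treated as text.
        return 'TEXT'
    # bool is checked first because bool is a subclass of int.
    if isinstance(value, bool):
        return 'BIT'
    if isinstance(value, int):
        return 'BIT' if value in (0, 1) else 'INT'
    if isinstance(value, float):
        return 'FLOAT'
    # Lists, tuples, strings and other literals are not numeric.
    return 'TEXT'

for sample in ['   1  ', 'True', '12', '-3', '1.2', '1e66', '10-1', '[1,2]']:
    print("{:10}:{}".format(sample, data_type(sample)))
```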
    

    And if you positively MUST have a regex solution, which produces the same results, here it is:

    def regDataType(str):
        str=str.strip()
        if len(str) == 0: return 'BLANK'
    
        if re.match(r'True$|^False$|^0$|^1$', str):
            return 'BIT'
        if re.match(r'([-+]\s*)?\d+[lL]?$', str): 
            return 'INT'
        if re.match(r'([-+]\s*)?[1-9][0-9]*\.?[0-9]*([Ee][+-]?[0-9]+)?$', str): 
            return 'FLOAT'
        if re.match(r'([-+]\s*)?[0-9]*\.?[0-9][0-9]*([Ee][+-]?[0-9]+)?$', str): 
            return 'FLOAT'
    
        return 'TEXT' 
    

    I cannot recommend the regex over the ast version, however: let Python decide what these data types are rather than re-implementing its rules with a regex.

  • 2021-01-14 03:21

    You could also use json.

    import json
    converted_val = json.loads('32.45')
    type(converted_val)
    

    Output (in Python 2; Python 3 shows <class 'float'>):

    <type 'float'>
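
    As a sketch of how that could become a classifier (json_type is a hypothetical name, and mapping 0 and 1 to BIT is an assumption taken from the question): note that JSON is stricter than Python literals, so it accepts true/false but not True/False, and it rejects forms such as 35. or 03.

```python
import json

def json_type(text):
    """Classify a string by parsing it as a JSON scalar."""
    try:
        value = json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return 'TEXT'
    # bool is checked first because bool is a subclass of int.
    if isinstance(value, bool):
        return 'BIT'
    if isinstance(value, int):
        return 'BIT' if value in (0, 1) else 'INT'
    if isinstance(value, float):
        return 'FLOAT'
    # JSON strings, arrays, objects and null are not numeric.
    return 'TEXT'

print(json_type('32.45'))        # FLOAT
print(json_type('12'))           # INT
print(json_type('true'))         # BIT
print(json_type('True'))         # TEXT
print(json_type('2010-00-10'))   # TEXT
```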
    

    EDIT

    To answer your question, however:

    re.match() succeeds as soon as the pattern matches a prefix of the string; it does not require the whole string to match. Since your code evaluates every pattern in turn, the sequence for "2010-00-10" goes like this:

    if patternTEXT.match(str_obj): #don't use 'string' as a variable name.
    

    It matches, so odp is set to "text".

    Then your script does:

    if patternFLOAT.match(str_obj):
    

    No match, so odp still equals "text".

    if patternINT.match(str_obj):
    

    This is a partial match, so odp is set to "INT".

    Because match() accepts partial matches and every if statement is evaluated, the last pattern that matches determines the string returned in odp.

    You can do one of several things:

    1. rearrange the order of your if statements so that the last one to match is the correct one.

    2. use if and elif for the rest of your if statements so that only the first statement to match is evaluated.

    3. check to make sure the match object is matching the entire string:

      ...
      match = patternINT.match(str_obj)
      if match:
          if match.end() == match.endpos:
              #do stuff
      ...
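
    Options 2 and 3 can be combined. Here is a minimal sketch, assuming Python 3.4+ for re.fullmatch and reusing the patterns from the question. Returning on the first hit plays the role of elif. The PATTERNS table and the fallback to 'TEXT' for anything unmatched are assumptions, not part of the original code.

```python
import re

# Most specific pattern first; the regexes are the ones from the question.
PATTERNS = [
    ('BIT', re.compile(r'[01]')),
    ('INT', re.compile(r'[0-9]+')),
    ('FLOAT', re.compile(r'[0-9]+\.[0-9]+')),
]

def data_type(text):
    """Return the label of the first pattern that matches the whole string."""
    for label, pattern in PATTERNS:
        # fullmatch() rejects partial matches such as '2010' in '2010-00-10'.
        if pattern.fullmatch(text):
            return label
    return 'TEXT'

print(data_type('2010-00-10'))   # TEXT
print(data_type('20.90'))        # FLOAT
print(data_type('2010'))         # INT
print(data_type('1'))            # BIT
```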
      
  • 2021-01-14 03:24

    You said that you used these for input:

    • 2010-00-10 (was int, not text)
    • 20.90 (was int, not float)

    Your original code:

    def dataType(string):
    
     odp=''
     patternBIT=re.compile('[01]')
     patternINT=re.compile('[0-9]+')
     patternFLOAT=re.compile('[0-9]+\.[0-9]+')
     patternTEXT=re.compile('[a-zA-Z0-9]+')
     if patternTEXT.match(string):
         odp= "text"
     if patternFLOAT.match(string):
         odp= "FLOAT"
     if patternINT.match(string):
         odp= "INT"
     if patternBIT.match(string):
         odp= "BIT"
    
     return odp 
    

    The Problem

    Your if statements would be sequentially executed - that is:

    if patternTEXT.match(string):
        odp= "text"
    if patternFLOAT.match(string):
        odp= "FLOAT"
    if patternINT.match(string):
        odp= "INT"
    if patternBIT.match(string):
        odp= "BIT"
    

    "2010-00-10" matches your text pattern, but the code then goes on to try the float pattern (which fails because there is no . after the leading digits) and then the int pattern, which matches because the string starts with [0-9]+.

    You should use:

    if patternTEXT.match(string):
        odp = "text"
    elif patternFLOAT.match(string):
        ...
    

    Though for your situation, you probably want to go from more specific to less specific because, as you've seen, input that is text may also match int (and vice versa). You would need to improve your regular expressions too: your 'text' pattern only matches alphanumeric input and does not cover special symbols.

    I will offer my own suggestion, though I do like the AST solution more:

    def get_type(string):
    
        # A single '0' or '1' is treated as a bit flag.
        if string in ('0', '1'):
            return "BIT"
    
        # int has to come before float, because integer strings
        # also parse as floats.
        try:
            int(string)  # in Python 2, int() promotes large values to long
            return "INT"
        except ValueError:
            pass
    
        try:
            float(string)
            return "FLOAT"
        except ValueError:
            pass
    
        return "TEXT"
    
    

    Run example:

    In [27]: get_type("034")
    Out[27]: 'INT'
    
    In [28]: get_type("3-4")
    Out[28]: 'TEXT'
    
    
    In [29]: get_type("20.90")
    Out[29]: 'FLOAT'
    
    In [30]: get_type("u09pweur909ru20")
    Out[30]: 'TEXT'
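
    One thing to keep in mind with the int()/float() approach is that the constructors accept more than plain digits. The check below is a sketch written for Python 3; parses_as is a hypothetical helper, not part of the answer above.

```python
def parses_as(convert, text):
    """Return True if convert(text) succeeds without raising ValueError."""
    try:
        convert(text)
        return True
    except ValueError:
        return False

# Surrounding whitespace is accepted by the constructors.
print(parses_as(int, ' 12 '))     # True
# Special float spellings parse, so they would be reported as FLOAT.
print(parses_as(float, 'nan'))    # True
print(parses_as(float, 'inf'))    # True
# Exponent notation is a float, not an int.
print(parses_as(int, '1e5'))      # False
print(parses_as(float, '1e5'))    # True
```

    If inputs such as nan or inf should count as TEXT, they need an explicit check before the float() call.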
    