In Python, how to check if a string only contains certain characters?

悲&欢浪女 2020-12-02 12:36

I need to check that a string contains only a..z, 0..9, and . (period), and no other characters.

7 answers
  • 2020-12-02 13:16

    Use Python sets when you need to compare sets of data. A string can be converted to a set of its characters quite quickly. Here I test whether a string is an allowed phone number: the first string is allowed, the second is not. It is fast and simple.

    In [17]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(898) 64-901-63 ');p.issubset(allowed)").timeit()
    
    Out[17]: 0.8106249139964348
    
    In [18]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(950) 64-901-63 фыв');p.issubset(allowed)").timeit()
    
    Out[18]: 0.9240323599951807
    

    Never use regexps if you can avoid them.
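
    Applied to the alphabet in the question (a-z, 0-9 and the period), the set approach might look like the sketch below. The names ALLOWED and only_allowed are hypothetical, chosen for illustration, and the empty string is treated as valid here, which may or may not be what you want.

```python
import string

# Allowed alphabet from the question: a-z, 0-9 and the period.
# ALLOWED and only_allowed are illustrative names, not part of any library.
ALLOWED = frozenset(string.ascii_lowercase + string.digits + ".")


def only_allowed(text, allowed=ALLOWED):
    """Return True if every character of text is in allowed (True for '')."""
    return set(text) <= allowed


print(only_allowed("abc.123"))  # True
print(only_allowed("abc-123"))  # False
```

    Building the allowed set once, outside the function, avoids rebuilding it on every call.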

  • 2020-12-02 13:21

    EDIT: Changed the regular expression to exclude A-Z

    The regular expression solution is the fastest pure-Python solution so far:

    >>> import re
    >>> reg = re.compile(r'^[a-z0-9\.]+$')
    >>> bool(reg.match('jsdlfjdsf12324..3432jsdflsdf'))
    True
    >>> timeit.Timer("reg.match('jsdlfjdsf12324..3432jsdflsdf')", "import re; reg=re.compile('^[a-z0-9\.]+$')").timeit()
    0.70509696006774902
    

    Compared to other solutions:

    >>> timeit.Timer("set('jsdlfjdsf12324..3432jsdflsdf') <= allowed", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
    3.2119350433349609
    >>> timeit.Timer("all(c in allowed for c in 'jsdlfjdsf12324..3432jsdflsdf')", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
    6.7066690921783447
    

    If you want to allow empty strings then change it to:

    >>> reg = re.compile(r'^[a-z0-9\.]*$')
    >>> bool(reg.match(''))
    True
    

    By request, I'm restoring the other part of the answer. Please note that the following also accepts the A-Z range.

    You can use isalnum

    test_str.replace('.', '').isalnum()
    
    >>> 'test123.3'.replace('.', '').isalnum()
    True
    >>> 'test123-3'.replace('.', '').isalnum()
    False
    

    EDIT Using isalnum is much more efficient than the set solution

    >>> timeit.Timer("'jsdlfjdsf12324..3432jsdflsdf'.replace('.', '').isalnum()").timeit()
    0.63245487213134766
    

    EDIT2: John gave an example where the above doesn't work (non-ASCII letters and digits also count as alphanumeric). I changed the solution to handle this special case by using encode:

    test_str.replace('.', '').encode('ascii', 'replace').isalnum()
    

    And it is still about twice as fast as the set solution:

    >>> timeit.Timer("u'ABC\u0131\u0661'.replace('.', '').encode('ascii', 'replace').isalnum()", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
    1.5719811916351318
    

    In my opinion, a regular expression is the best way to solve this problem.
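
    The Unicode caveat behind EDIT2 can be seen directly in Python 3. This is only a sketch: is_ascii_alnum_or_dot is a hypothetical helper name, str.isascii() requires Python 3.7 or later, and, like the isalnum variant above, it still accepts A-Z and rejects strings with nothing left after the periods are removed.

```python
def is_ascii_alnum_or_dot(text):
    """True if text, with periods removed, is non-empty ASCII letters/digits.

    Hypothetical helper for illustration; needs Python 3.7+ for str.isascii().
    """
    stripped = text.replace(".", "")
    return stripped.isascii() and stripped.isalnum()


sample = "ABC\u0131\u0661"  # dotless i and an Arabic-Indic digit

print(sample.isalnum())                             # True
print(sample.encode("ascii", "replace").isalnum())  # False
print(is_ascii_alnum_or_dot(sample))                # False
print(is_ascii_alnum_or_dot("test123.3"))           # True
```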

  • 2020-12-02 13:24

    A different approach, because in my case I needed to also check whether it contained certain words (like 'test' in this example), not characters alone:

    input_string = 'abc test'
    input_string_test = input_string
    allowed_list = ['a', 'b', 'c', 'test', ' ']
    
    for allowed_list_item in allowed_list:
        input_string_test = input_string_test.replace(allowed_list_item, '')
    
    if not input_string_test:
        print('test passed')  # only allowed items were present
    

    So, the allowed strings (characters or words) are removed from the input string. If the input contained only allowed strings, an empty string is left, and the check if not input_string_test passes.
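
    One caveat worth checking with repeated replace() is that removing one item can join the text around it into another allowed item. The sketch below shows such a case and one possible alternative, a regular expression alternation matched against the whole string. The names by_replacing and only_tokens are hypothetical, and re.fullmatch needs Python 3.4 or later.

```python
import re


def by_replacing(text, tokens):
    """The approach described above: strip each allowed item in turn."""
    for token in tokens:
        text = text.replace(token, "")
    return not text


def only_tokens(text, tokens):
    """Return True if text is a concatenation of items from tokens (True for '').

    Hypothetical helper: builds an alternation such as (?:test|a|b|c|\\ )*
    and requires it to match the entire string.
    """
    alternatives = "|".join(
        re.escape(t) for t in sorted(tokens, key=len, reverse=True)
    )
    return re.fullmatch(f"(?:{alternatives})*", text) is not None


allowed_list = ["a", "b", "c", "test", " "]

print(by_replacing("abc test", allowed_list))  # True
print(only_tokens("abc test", allowed_list))   # True

# "teast" is not made of allowed items, but removing "a" leaves "test".
print(by_replacing("teast", allowed_list))     # True (false positive)
print(only_tokens("teast", allowed_list))      # False
```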

  • 2020-12-02 13:27

    This has already been answered satisfactorily, but for people coming across this after the fact, I have done some profiling of several different methods of accomplishing this. In my case I wanted uppercase hex digits, so modify as necessary to suit your needs.

    Here are my test implementations:

    import re
    
    hex_digits = set("ABCDEF1234567890")
    hex_match = re.compile(r'^[A-F0-9]+\Z')
    hex_search = re.compile(r'[^A-F0-9]')
    
    def test_set(input):
        return set(input) <= hex_digits
    
    def test_not_any(input):
        return not any(c not in hex_digits for c in input)
    
    def test_re_match1(input):
        return bool(re.compile(r'^[A-F0-9]+\Z').match(input))
    
    def test_re_match2(input):
        return bool(hex_match.match(input))
    
    def test_re_match3(input):
        return bool(re.match(r'^[A-F0-9]+\Z', input))
    
    def test_re_search1(input):
        return not bool(re.compile(r'[^A-F0-9]').search(input))
    
    def test_re_search2(input):
        return not bool(hex_search.search(input))
    
    def test_re_search3(input):
        return not bool(re.search(r'[^A-F0-9]', input))
    

    And the tests, in Python 3.4.0 on Mac OS X:

    import cProfile
    import pstats
    import random
    
    # generate a list of 10000 random hex strings between 10 and 10009 characters long
    # this takes a little time; be patient
    tests = [ ''.join(random.choice("ABCDEF1234567890") for _ in range(l)) for l in range(10, 10010) ]
    
    # set up profiling, then start collecting stats
    test_pr = cProfile.Profile(timeunit=0.000001)
    test_pr.enable()
    
    # run the test functions against each item in tests. 
    # this takes a little time; be patient
    for t in tests:
        for tf in [test_set, test_not_any, 
                   test_re_match1, test_re_match2, test_re_match3,
                   test_re_search1, test_re_search2, test_re_search3]:
            _ = tf(t)
    
    # stop collecting stats
    test_pr.disable()
    
    # we create our own pstats.Stats object to filter 
    # out some stuff we don't care about seeing
    test_stats = pstats.Stats(test_pr)
    
    # normally, stats are printed with the format %8.3f, 
    # but I want more significant digits
    # so this monkey patch handles that
    def _f8(x):
        return "%11.6f" % x
    
    def _print_title(self):
        print('   ncalls     tottime     percall     cumtime     percall', end=' ', file=self.stream)
        print('filename:lineno(function)', file=self.stream)
    
    pstats.f8 = _f8
    pstats.Stats.print_title = _print_title
    
    # sort by cumulative time (then secondary sort by name), ascending
    # then print only our test implementation function calls:
    test_stats.sort_stats('cumtime', 'name').reverse_order().print_stats("test_*")
    

    which gave the following results:

             50335004 function calls in 13.428 seconds
    
       Ordered by: cumulative time, function name
       List reduced from 20 to 8 due to restriction 
    
       ncalls     tottime     percall     cumtime     percall filename:lineno(function)
        10000    0.005233    0.000001    0.367360    0.000037 :1(test_re_match2)
        10000    0.006248    0.000001    0.378853    0.000038 :1(test_re_match3)
        10000    0.010710    0.000001    0.395770    0.000040 :1(test_re_match1)
        10000    0.004578    0.000000    0.467386    0.000047 :1(test_re_search2)
        10000    0.005994    0.000001    0.475329    0.000048 :1(test_re_search3)
        10000    0.008100    0.000001    0.482209    0.000048 :1(test_re_search1)
        10000    0.863139    0.000086    0.863139    0.000086 :1(test_set)
        10000    0.007414    0.000001    9.962580    0.000996 :1(test_not_any)
    

    where:

    ncalls: the number of times the function was called
    tottime: the total time spent in the given function, excluding time spent in calls to sub-functions
    percall: the quotient of tottime divided by ncalls
    cumtime: the cumulative time spent in this function and all sub-functions
    percall: the quotient of cumtime divided by primitive calls

    The columns we actually care about are cumtime and percall, as that shows us the actual time taken from function entry to exit. As we can see, regex match and search are not massively different.

    If you would otherwise compile the regex on every call, it is faster not to compile it at all and to call re.match directly. Compiling once is about 7.5% faster than compiling on every call, but only about 2.5% faster than not compiling.

    test_set was about 1.8 times as slow as re_search and about 2.3 times as slow as re_match

    test_not_any was a full order of magnitude slower than test_set

    TL;DR: Use re.match or re.search
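
    Since these figures come from one machine and one Python version, it may be worth repeating the comparison locally. A much smaller sketch using timeit instead of cProfile follows; the function names, the sample string and the repetition count are all arbitrary choices made for illustration, and no particular timing is implied.

```python
import re
import timeit

HEX_DIGITS = frozenset("ABCDEF0123456789")
HEX_MATCH = re.compile(r"[A-F0-9]+\Z")   # whole string must be hex digits
HEX_SEARCH = re.compile(r"[^A-F0-9]")    # finds the first non-hex character


def by_set(s):
    return set(s) <= HEX_DIGITS


def by_match(s):
    return bool(HEX_MATCH.match(s))


def by_search(s):
    return not HEX_SEARCH.search(s)


# Arbitrary sample: 900 valid upper-case hex characters.
sample = "DEADBEEF0123456789" * 50

for func in (by_set, by_match, by_search):
    seconds = timeit.timeit(lambda: func(sample), number=2000)
    print(f"{func.__name__:10s} {seconds:.4f} s")
```

    Note that the three functions disagree on the empty string: by_set and by_search accept it, while by_match does not.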

  • 2020-12-02 13:30

    Simpler approach? A little more Pythonic?

    >>> ok = "0123456789abcdef"
    >>> all(c in ok for c in "123456abc")
    True
    >>> all(c in ok for c in "hello world")
    False
    

    It certainly isn't the most efficient, but it's sure readable.
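
    A small variation, offered as a sketch: testing c in ok against a string scans the string, whereas a frozenset gives hash-based lookups, and next() can report which character was rejected. The names OK and first_disallowed are hypothetical.

```python
# Same alphabet as above, stored as a frozenset for hash-based membership tests.
OK = frozenset("0123456789abcdef")


def first_disallowed(text, allowed=OK):
    """Return the first character of text not in allowed, or None if there is none."""
    return next((c for c in text if c not in allowed), None)


print(first_disallowed("123456abc"))    # None
print(first_disallowed("hello world"))  # h
```

    The check itself is then first_disallowed(text) is None, and the returned character is handy for error messages.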

  • 2020-12-02 13:35

    Final(?) edit

    Answer, wrapped up in a function, with annotated interactive session:

    >>> import re
    >>> def special_match(strg, search=re.compile(r'[^a-z0-9.]').search):
    ...     return not bool(search(strg))
    ...
    >>> special_match("")
    True
    >>> special_match("az09.")
    True
    >>> special_match("az09.\n")
    False
    # The above test case is to catch out any attempt to use re.match()
    # with a `$` instead of `\Z` -- see point (6) below.
    >>> special_match("az09.#")
    False
    >>> special_match("az09.X")
    False
    >>>
    

    Note: There is a comparison with using re.match() further down in this answer. Further timings show that match() would win with much longer strings; match() seems to have a much larger overhead than search() when the final answer is True; this is puzzling (perhaps it's the cost of returning a MatchObject instead of None) and may warrant further rummaging.

    ==== Earlier text ====
    

    The [previously] accepted answer could use a few improvements:

    (1) Presentation gives the appearance of being the result of an interactive Python session:

    reg=re.compile('^[a-z0-9\.]+$')
    >>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
    True
    

    but match() doesn't return True; it returns a match object, or None if there is no match

    (2) For use with match(), the ^ at the start of the pattern is redundant, and appears to be slightly slower than the same pattern without the ^

    (3) Should encourage using a raw string, automatically and without thinking, for any re pattern

    (4) The backslash in front of the dot/period is redundant

    (5) Slower than the OP's code!

    prompt>rem OP's version -- NOTE: OP used raw string!
    
    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[^a-z0-9\.]')" "not bool(reg.search(t))"
    1000000 loops, best of 3: 1.43 usec per loop
    
    prompt>rem OP's version w/o backslash
    
    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[^a-z0-9.]')" "not bool(reg.search(t))"
    1000000 loops, best of 3: 1.44 usec per loop
    
    prompt>rem cleaned-up version of accepted answer
    
    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[a-z0-9.]+\Z')" "bool(reg.match(t))"
    100000 loops, best of 3: 2.07 usec per loop
    
    prompt>rem accepted answer
    
    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile('^[a-z0-9\.]+$')" "bool(reg.match(t))"
    100000 loops, best of 3: 2.08 usec per loop
    

    (6) Can produce the wrong answer!!

    >>> import re
    >>> bool(re.compile(r'^[a-z0-9\.]+$').match('1234\n'))
    True # uh-oh
    >>> bool(re.compile(r'^[a-z0-9\.]+\Z').match('1234\n'))
    False
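
    Point (6) can be reproduced on Python 3 as well. The sketch below also shows re.fullmatch (available since Python 3.4), which sidesteps the question of `$` versus `\Z` because it requires the pattern to cover the entire string. The variable names are illustrative only.

```python
import re

dollar = re.compile(r"[a-z0-9.]+$")   # $ also matches just before a trailing newline
big_z = re.compile(r"[a-z0-9.]+\Z")   # \Z matches only at the very end
plain = re.compile(r"[a-z0-9.]+")     # used with fullmatch(), no anchor needed

text = "1234\n"

print(bool(dollar.match(text)))     # True
print(bool(big_z.match(text)))      # False
print(bool(plain.fullmatch(text)))  # False
```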
    