In Python, how to check if a string only contains certain characters?
I need to check a string containing only a..z, 0..9, and . (period) and no other character.
Use python Sets when you need to compare hm... sets of data. Strings can be represented as sets of characters quite fast. Here I test if string is allowed phone number. First string is allowed, second not. Works fast and simple.
In [17]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(898) 64-901-63 ');p.issubset(allowed)").timeit()
Out[17]: 0.8106249139964348
In [18]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(950) 64-901-63 фыв');p.issubset(allowed)").timeit()
Out[18]: 0.9240323599951807
Never use regexps if you can avoid them.
EDIT: Changed the regular expression to exclude A-Z
Regular expression solution is the fastest pure python solution so far
reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True
>>> timeit.Timer("reg.match('jsdlfjdsf12324..3432jsdflsdf')", "import re; reg=re.compile('^[a-z0-9\.]+$')").timeit()
0.70509696006774902
Compared to other solutions:
>>> timeit.Timer("set('jsdlfjdsf12324..3432jsdflsdf') <= allowed", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
3.2119350433349609
>>> timeit.Timer("all(c in allowed for c in 'jsdlfjdsf12324..3432jsdflsdf')", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
6.7066690921783447
If you want to allow empty strings then change it to:
reg=re.compile('^[a-z0-9\.]*$')
>>>reg.match('')
False
Under request I'm going to return the other part of the answer. But please note that the following accept A-Z range.
You can use isalnum
test_str.replace('.', '').isalnum()
>>> 'test123.3'.replace('.', '').isalnum()
True
>>> 'test123-3'.replace('.', '').isalnum()
False
EDIT Using isalnum is much more efficient than the set solution
>>> timeit.Timer("'jsdlfjdsf12324..3432jsdflsdf'.replace('.', '').isalnum()").timeit()
0.63245487213134766
EDIT2 John gave an example where the above doesn't work. I changed the solution to overcome this special case by using encode
test_str.replace('.', '').encode('ascii', 'replace').isalnum()
And it is still almost 3 times faster than the set solution
timeit.Timer("u'ABC\u0131\u0661'.encode('ascii', 'replace').replace('.','').isalnum()", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
1.5719811916351318
In my opinion using regular expressions is the best to solve this problem
A different approach, because in my case I needed to also check whether it contained certain words (like 'test' in this example), not characters alone:
input_string = 'abc test'
input_string_test = input_string
allowed_list = ['a', 'b', 'c', 'test', ' ']
for allowed_list_item in allowed_list:
input_string_test = input_string_test.replace(allowed_list_item, '')
if not input_string_test:
# test passed
So, the allowed strings (char or word) are cut from the input string. If the input string only contained strings that were allowed, it should leave an empty string and therefore should pass if not input_string
.
This has already been answered satisfactorily, but for people coming across this after the fact, I have done some profiling of several different methods of accomplishing this. In my case I wanted uppercase hex digits, so modify as necessary to suit your needs.
Here are my test implementations:
import re
hex_digits = set("ABCDEF1234567890")
hex_match = re.compile(r'^[A-F0-9]+\Z')
hex_search = re.compile(r'[^A-F0-9]')
def test_set(input):
return set(input) <= hex_digits
def test_not_any(input):
return not any(c not in hex_digits for c in input)
def test_re_match1(input):
return bool(re.compile(r'^[A-F0-9]+\Z').match(input))
def test_re_match2(input):
return bool(hex_match.match(input))
def test_re_match3(input):
return bool(re.match(r'^[A-F0-9]+\Z', input))
def test_re_search1(input):
return not bool(re.compile(r'[^A-F0-9]').search(input))
def test_re_search2(input):
return not bool(hex_search.search(input))
def test_re_search3(input):
return not bool(re.match(r'[^A-F0-9]', input))
And the tests, in Python 3.4.0 on Mac OS X:
import cProfile
import pstats
import random
# generate a list of 10000 random hex strings between 10 and 10009 characters long
# this takes a little time; be patient
tests = [ ''.join(random.choice("ABCDEF1234567890") for _ in range(l)) for l in range(10, 10010) ]
# set up profiling, then start collecting stats
test_pr = cProfile.Profile(timeunit=0.000001)
test_pr.enable()
# run the test functions against each item in tests.
# this takes a little time; be patient
for t in tests:
for tf in [test_set, test_not_any,
test_re_match1, test_re_match2, test_re_match3,
test_re_search1, test_re_search2, test_re_search3]:
_ = tf(t)
# stop collecting stats
test_pr.disable()
# we create our own pstats.Stats object to filter
# out some stuff we don't care about seeing
test_stats = pstats.Stats(test_pr)
# normally, stats are printed with the format %8.3f,
# but I want more significant digits
# so this monkey patch handles that
def _f8(x):
return "%11.6f" % x
def _print_title(self):
print(' ncalls tottime percall cumtime percall', end=' ', file=self.stream)
print('filename:lineno(function)', file=self.stream)
pstats.f8 = _f8
pstats.Stats.print_title = _print_title
# sort by cumulative time (then secondary sort by name), ascending
# then print only our test implementation function calls:
test_stats.sort_stats('cumtime', 'name').reverse_order().print_stats("test_*")
which gave the following results:
50335004 function calls in 13.428 seconds Ordered by: cumulative time, function name List reduced from 20 to 8 due to restriction ncalls tottime percall cumtime percall filename:lineno(function) 10000 0.005233 0.000001 0.367360 0.000037 :1(test_re_match2) 10000 0.006248 0.000001 0.378853 0.000038 :1(test_re_match3) 10000 0.010710 0.000001 0.395770 0.000040 :1(test_re_match1) 10000 0.004578 0.000000 0.467386 0.000047 :1(test_re_search2) 10000 0.005994 0.000001 0.475329 0.000048 :1(test_re_search3) 10000 0.008100 0.000001 0.482209 0.000048 :1(test_re_search1) 10000 0.863139 0.000086 0.863139 0.000086 :1(test_set) 10000 0.007414 0.000001 9.962580 0.000996 :1(test_not_any)
where:
The columns we actually care about are cumtime and percall, as that shows us the actual time taken from function entry to exit. As we can see, regex match and search are not massively different.
It is faster not to bother compiling the regex if you would have compiled it every time. It is about 7.5% faster to compile once than every time, but only 2.5% faster to compile than to not compile.
test_set was twice as slow as re_search and thrice as slow as re_match
test_not_any was a full order of magnitude slower than test_set
TL;DR: Use re.match or re.search
Simpler approach? A little more Pythonic?
>>> ok = "0123456789abcdef"
>>> all(c in ok for c in "123456abc")
True
>>> all(c in ok for c in "hello world")
False
It certainly isn't the most efficient, but it's sure readable.
Final(?) edit
Answer, wrapped up in a function, with annotated interactive session:
>>> import re
>>> def special_match(strg, search=re.compile(r'[^a-z0-9.]').search):
... return not bool(search(strg))
...
>>> special_match("")
True
>>> special_match("az09.")
True
>>> special_match("az09.\n")
False
# The above test case is to catch out any attempt to use re.match()
# with a `$` instead of `\Z` -- see point (6) below.
>>> special_match("az09.#")
False
>>> special_match("az09.X")
False
>>>
Note: There is a comparison with using re.match() further down in this answer. Further timings show that match() would win with much longer strings; match() seems to have a much larger overhead than search() when the final answer is True; this is puzzling (perhaps it's the cost of returning a MatchObject instead of None) and may warrant further rummaging.
==== Earlier text ====
The [previously] accepted answer could use a few improvements:
(1) Presentation gives the appearance of being the result of an interactive Python session:
reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True
but match() doesn't return True
(2) For use with match(), the ^
at the start of the pattern is redundant, and appears to be slightly slower than the same pattern without the ^
(3) Should foster the use of raw string automatically unthinkingly for any re pattern
(4) The backslash in front of the dot/period is redundant
(5) Slower than the OP's code!
prompt>rem OP's version -- NOTE: OP used raw string!
prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[^a-z0-9\.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.43 usec per loop
prompt>rem OP's version w/o backslash
prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[^a-z0-9.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.44 usec per loop
prompt>rem cleaned-up version of accepted answer
prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[a-z0-9.]+\Z')" "bool(reg.match(t))"
100000 loops, best of 3: 2.07 usec per loop
prompt>rem accepted answer
prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile('^[a-z0-9\.]+$')" "bool(reg.match(t))"
100000 loops, best of 3: 2.08 usec per loop
(6) Can produce the wrong answer!!
>>> import re
>>> bool(re.compile('^[a-z0-9\.]+$').match('1234\n'))
True # uh-oh
>>> bool(re.compile('^[a-z0-9\.]+\Z').match('1234\n'))
False