Strip all non-numeric characters (except for “.”) from a string in Python

死守一世寂寞 2020-12-07 22:26

I've got a pretty good working snippet of code, but I was wondering if anyone has any better suggestions on how to do this:

val = ''.join([c for c in val


        
6 answers
  • 2020-12-07 22:54

    Here's some sample code:

    $ cat a.py
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join([c for c in a if c in '1234567890.'])
    

    $ cat b.py
    import re
    
    non_decimal = re.compile(r'[^\d.]+')
    
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        non_decimal.sub('', a)
    

    $ cat c.py
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join([c for c in a if c.isdigit() or c == '.'])
    

    $ cat d.py
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        b = []
        for c in a:
            # keep only digits and '.', matching a.py through c.py
            if c.isdigit() or c == '.':
                b.append(c)
    
        ''.join(b)
    

    And the timing results:


    $ time python a.py
    real    0m24.735s
    user    0m21.049s
    sys     0m0.456s
    
    $ time python b.py
    real    0m10.775s
    user    0m9.817s
    sys     0m0.236s
    
    $ time python c.py
    real    0m38.255s
    user    0m32.718s
    sys     0m0.724s
    
    $ time python d.py
    real    0m46.040s
    user    0m41.515s
    sys     0m0.832s
    

    Looks like the regex is the winner so far.

    Personally, I find the regex just as readable as the list comprehension. If you're only doing this a few times, the one-time cost of compiling the regex will probably outweigh what you save per call. Do what jibes with your code and coding style.
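
    The scripts above are timed as whole processes, so interpreter startup is included in each figure. As a rough alternative, here is a minimal sketch that times the same three expressions in-process with the standard `timeit` module. It assumes Python 3; the function names, `CANDIDATES`, and `benchmark` are made up for this illustration, and the numbers will differ from machine to machine.

```python
import re
import timeit

SAMPLE = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
NON_DECIMAL = re.compile(r'[^\d.]+')  # compiled once, outside the timed code


def strip_listcomp(s):
    """Same idea as a.py: membership test against a literal string."""
    return ''.join([c for c in s if c in '1234567890.'])


def strip_regex(s):
    """Same idea as b.py: precompiled regex substitution."""
    return NON_DECIMAL.sub('', s)


def strip_isdigit(s):
    """Same idea as c.py: str.isdigit() plus an explicit check for '.'."""
    return ''.join([c for c in s if c.isdigit() or c == '.'])


CANDIDATES = (strip_listcomp, strip_regex, strip_isdigit)


def benchmark(number=100_000):
    """Return {function name: seconds taken for `number` calls on SAMPLE}."""
    return {
        func.__name__: timeit.timeit(lambda: func(SAMPLE), number=number)
        for func in CANDIDATES
    }
```

    Calling `benchmark()` returns a dict of timings that can be printed or sorted; a smaller `number` gives a quicker but noisier comparison.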

  • 2020-12-07 22:59

    You can use a regular expression (using the re module) to accomplish the same thing. The example below matches runs of [^\d.] (any character that's not a decimal digit or a period) and replaces them with the empty string. Note that if the pattern is compiled with the UNICODE flag the resulting string could still include non-ASCII numbers. Also, the result after removing "non-numeric" characters is not necessarily a valid number.

    >>> import re
    >>> non_decimal = re.compile(r'[^\d.]+')
    >>> non_decimal.sub('', '12.34fe4e')
    '12.344'
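
    Both caveats can be seen in a short sketch. It assumes Python 3, where `\d` in a `str` pattern is Unicode-aware by default and `re.ASCII` restricts it to `[0-9]`. The sample strings and the helper `to_float_or_none` are invented for this illustration.

```python
import re

unicode_aware = re.compile(r'[^\d.]+')          # \d matches any Unicode decimal digit
ascii_only = re.compile(r'[^\d.]+', re.ASCII)   # \d matches only 0-9

# 'abc', ARABIC-INDIC DIGIT ONE and TWO, then '.5xyz'
text = 'abc\u0661\u0662.5xyz'

kept_unicode = unicode_aware.sub('', text)  # the two non-ASCII digits survive
kept_ascii = ascii_only.sub('', text)       # only '.5' is left


def to_float_or_none(cleaned):
    """Return float(cleaned), or None if the cleaned text is not a number."""
    try:
        return float(cleaned)
    except ValueError:
        return None


# Stripping does not validate: 'v1.2.3' becomes '1.2.3', which is not a float.
version_like = to_float_or_none(unicode_aware.sub('', 'v1.2.3'))
plain = to_float_or_none(unicode_aware.sub('', '12.34fe4e'))
```

    If only ASCII digits should survive, writing the class as `[^0-9.]` has the same effect as passing `re.ASCII`.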
    
  • 2020-12-07 23:04

    If the set of characters were larger, using sets as below might be faster. As it is, this is a bit slower than a.py.

    dec = set('1234567890.')
    
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join(ch for ch in a if ch in dec)

    At least on my system, you can save a tiny bit of time (and memory if your string were long enough to matter) by using a generator expression instead of a list comprehension in a.py:

    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join(c for c in a if c in '1234567890.')

    Oh, and here's the fastest way I've found by far on this test string (much faster than regex) if you are doing this many, many times and are willing to put up with the overhead of building a couple of character tables.

    chrs = ''.join(chr(i) for i in xrange(256))
    deletable = ''.join(ch for ch in chrs if ch not in '1234567890.')
    
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        a.translate(chrs, deletable)

    On my system, that runs in ~1.0 seconds where the regex b.py runs in ~4.3 seconds.
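
    The two-argument `translate` call above is the Python 2 `str` form. On Python 3, `str.translate` takes a single mapping, and the table-plus-delete form survives only on `bytes`. Here is a minimal sketch of both, assuming Python 3; the `KeepOnly` class and the variable names are hypothetical, and no timings are claimed for them.

```python
KEEP = '0123456789.'


class KeepOnly(dict):
    """Translation table that deletes every code point not listed in it."""

    def __missing__(self, codepoint):
        # str.translate drops a character when its table entry is None
        return None


# Listed characters map to themselves; everything else hits __missing__.
str_table = KeepOnly((ord(ch), ch) for ch in KEEP)

# bytes version: no remapping table (None), just the set of bytes to delete.
keep_bytes = KEEP.encode('ascii')
deletable = bytes(b for b in range(256) if b not in keep_bytes)

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
from_str = a.translate(str_table)
from_bytes = a.encode('ascii').translate(None, deletable).decode('ascii')
```

    The `bytes` variant only applies when the input is known to be ASCII (or another single-byte encoding); the `str` variant handles arbitrary text.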

  • 2020-12-07 23:06

    A simple solution is to use regular expressions:

    import re
    # remove every character that is not a digit or a period
    re.sub("[^0-9.]", "", data)
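
    A detail worth checking when writing such a character class: a caret negates the class only when it is the first character inside the brackets. Anywhere else it is a literal `^`, so a second caret adds `^` to the set of characters the negated class leaves alone. A small sketch, with a made-up input string:

```python
import re

data = '12^34.5abc'

# The second caret is literal, so '^' is not removed.
with_literal_caret = re.sub(r'[^0-9^.]', '', data)

# Only digits and '.' are left.
digits_and_dot = re.sub(r'[^0-9.]', '', data)
```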
    
  • 2020-12-07 23:11
    import string
    filter(lambda c: c in string.digits + '.', s)
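
    On Python 2, `filter` applied to a `str` returns a `str`. On Python 3 it returns a lazy iterator, so the result has to be joined back together. A minimal sketch assuming Python 3; `ALLOWED` and `keep_numeric` are names chosen for this illustration.

```python
import string

# Build the allowed set once instead of on every character test.
ALLOWED = frozenset(string.digits + '.')


def keep_numeric(s):
    """Return s with everything except ASCII digits and '.' removed."""
    return ''.join(filter(ALLOWED.__contains__, s))
```

    For example, `keep_numeric('12.34fe4e')` gives `'12.344'`.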
    
  • 2020-12-07 23:20

    Another 'pythonic' approach:

    filter(lambda x: x in '0123456789.', s)

    But the regex is faster.
