Is there a simple way to remove multiple spaces in a string?

星月不相逢 2020-11-22 08:17

Suppose this string:

The   fox jumped   over    the log.

Turning into:

The fox jumped over the log.

29 answers
  • 2020-11-22 08:58

    The fastest you can get for user-generated strings is:

    if '  ' in text:
        while '  ' in text:
            text = text.replace('  ', ' ')
    

    The short-circuiting if check makes it slightly faster than pythonlarry's comprehensive answer. Go for this if you are after efficiency and strictly want to collapse runs of the plain space character, leaving tabs and newlines alone.
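
    A minimal sketch of the loop above wrapped in a function (the name collapse_spaces is hypothetical, not from the answer). It shows that only runs of the space character are collapsed, while tabs and newlines are left untouched:

```python
def collapse_spaces(text):
    # Hypothetical helper name; the body is the loop from the answer above.
    if '  ' in text:
        while '  ' in text:
            text = text.replace('  ', ' ')
    return text

print(collapse_spaces('The   fox jumped   over    the log.'))
# The fox jumped over the log.

# Tabs and newlines are not spaces, so they survive unchanged.
print(repr(collapse_spaces('a  \t  b\n\nc')))
# 'a \t b\n\nc'
```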

  • 2020-11-22 09:00
    def unPretty(S):
        # Accepts a dictionary, JSON, list, float, int, or even a string.
        # Converts it to a string, replaces LF with a space, drops CR, and
        # reduces every run of whitespace to a single space.
        return ' '.join(str(S).replace('\n', ' ').replace('\r', '').split())
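
    A short usage sketch of unPretty, assuming the input is pretty-printed JSON produced with the standard json module (the sample data is made up for illustration). Note that a lone CR is removed outright instead of being turned into a space:

```python
import json

def unPretty(S):
    # Same body as in the answer above.
    return ' '.join(str(S).replace('\n', ' ').replace('\r', '').split())

pretty = json.dumps({'a': [1, 2]}, indent=2)  # multi-line, indented JSON text
print(unPretty(pretty))
# { "a": [ 1, 2 ] }

print(unPretty('line one\r\nline   two'))
# line one line two

# A bare carriage return is dropped, so the two words are joined.
print(unPretty('a\rb'))
# ab
```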
    
  • 2020-11-22 09:01

    This works and will keep working: :)

    # python... 3.x
    import operator
    ...
    # line: line of text
    return " ".join(filter(lambda a: operator.is_not(a, ""), line.strip().split(" ")))
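
    For reference, a self-contained sketch of the same idea. The function name single_spaces is hypothetical, and filter(None, ...) is swapped in for the operator.is_not(a, "") test, since it drops empty strings without relying on object identity:

```python
def single_spaces(line):
    # split(" ") yields an empty string for every extra space;
    # filter(None, ...) discards those empty strings before joining.
    return " ".join(filter(None, line.strip().split(" ")))

print(single_spaces("  The   fox jumped   over    the log.  "))
# The fox jumped over the log.
```

    Because the split is on " " only, tabs and newlines inside the line are kept as they are.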
    
  • 2020-11-22 09:02
    import re
    s = "The   fox jumped   over    the log."
    re.sub(r"\s\s+" , " ", s)
    

    or

    re.sub(r"\s\s+", " ", s)
    

    since the space before comma is listed as a pet peeve in PEP 8, as mentioned by user Martin Thoma in the comments.
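
    A small sketch of what this pattern does and does not touch. The second and third inputs are made up for illustration, and the \s+ variant is an alternative shown for contrast, not part of the answer:

```python
import re

s = "The   fox jumped   over    the log."
print(re.sub(r"\s\s+", " ", s))
# The fox jumped over the log.

# A single tab or newline is not a run of two or more, so it is left alone.
print(repr(re.sub(r"\s\s+", " ", "a\tb\nc")))
# 'a\tb\nc'

# r"\s+" instead turns every whitespace run, even a single tab, into one space.
print(repr(re.sub(r"\s+", " ", "a\tb\nc")))
# 'a b c'
```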

  • 2020-11-22 09:03

    Regexes that use "\s" and plain string.split() calls also remove other whitespace, such as newlines, carriage returns, and tabs. If that is not what you want and you only need to collapse runs of spaces, I present these examples.
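
    To make the difference concrete, here is a tiny sketch comparing the two split variants on a made-up string:

```python
text = "a  b\tc\nd"

# split() with no argument splits on any run of whitespace.
print(text.split())
# ['a', 'b', 'c', 'd']

# split(' ') splits on each single space and leaves tabs and newlines alone.
print(text.split(' '))
# ['a', '', 'b\tc\nd']
```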

    I used 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum to get realistic time tests and used random-length extra spaces throughout:

    original_string = ''.join(word + (' ' * random.randint(1, 10)) for word in lorem_ipsum.split(' '))
    

    The simple one-liner join essentially strips any leading/trailing spaces, whereas proper_join below preserves a leading/trailing space (but only ONE ;-).

    # setup = '''
    
    import re
    
    def while_replace(string):
        while '  ' in string:
            string = string.replace('  ', ' ')
    
        return string
    
    def re_replace(string):
        return re.sub(r' {2,}', ' ', string)
    
    def proper_join(string):
        split_string = string.split(' ')
    
        # To account for leading/trailing spaces that would simply be removed
        beg = ' ' if not split_string[ 0] else ''
        end = ' ' if not split_string[-1] else ''
    
        # versus simply ' '.join(item for item in string.split(' ') if item)
        return beg + ' '.join(item for item in split_string if item) + end
    
    original_string = """Lorem    ipsum        ... no, really, it kept going...          malesuada enim feugiat.         Integer imperdiet    erat."""
    
    assert while_replace(original_string) == re_replace(original_string) == proper_join(original_string)
    
    #'''
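
    A quick sketch of the leading/trailing-space behaviour, using a made-up padded string. simple_join is a hypothetical name for the one-liner mentioned in the comment; proper_join is copied from the setup code:

```python
def simple_join(string):
    # The one-liner: empty items are dropped, so edge spaces vanish entirely.
    return ' '.join(item for item in string.split(' ') if item)

def proper_join(string):
    # Copied from the setup code: keeps one leading/trailing space if present.
    split_string = string.split(' ')
    beg = ' ' if not split_string[0] else ''
    end = ' ' if not split_string[-1] else ''
    return beg + ' '.join(item for item in split_string if item) + end

padded = '   fox   jumped   '
print(repr(simple_join(padded)))
# 'fox jumped'
print(repr(proper_join(padded)))
# ' fox jumped '
```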
    

    # while_replace_test
    new_string = original_string[:]
    
    new_string = while_replace(new_string)
    
    assert new_string != original_string
    

    # re_replace_test
    new_string = original_string[:]
    
    new_string = re_replace(new_string)
    
    assert new_string != original_string
    

    # proper_join_test
    new_string = original_string[:]
    
    new_string = proper_join(new_string)
    
    assert new_string != original_string
    

    NOTE: Each test copies original_string before working on it. The "while version" needed the copy, because once the string is modified on the first run, successive runs would be faster (if only by a bit). As the copy adds time, I added it to the other two tests as well, so that the times show the difference only in the logic. Keep in mind that the setup of a timeit instance is executed only once per timing run, while the main stmt is repeated; the original way I did this, the while loop worked on the same label, original_string, so on the second iteration there would be nothing to do. The way it's set up now, calling a function and using two different labels, that isn't a problem. I've added assert statements to all the workers to verify that something changes on every iteration (for those who may be dubious). E.g., change to this and it breaks:

    # while_replace_test
    new_string = original_string[:]
    
    new_string = while_replace(new_string)
    
    assert new_string != original_string # will break the 2nd iteration
    
    while '  ' in original_string:
        original_string = original_string.replace('  ', ' ')
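
    The effect can be observed directly. The sketch below is not from the answer: it assumes Python 3.5+ (for the globals argument of timeit.Timer) and uses two hypothetical lists to record, per iteration, whether any double space was left to collapse:

```python
import timeit

setup = "original_string = 'a    b'"  # executed once per timeit() call

# Variant 1: rebinds original_string, so only the first iteration has work to do.
changed = []
stmt = """
changed.append('  ' in original_string)
while '  ' in original_string:
    original_string = original_string.replace('  ', ' ')
"""
timeit.Timer(stmt=stmt, setup=setup, globals={'changed': changed}).timeit(number=3)
print(changed)
# [True, False, False]

# Variant 2: works on a second label, so every iteration does the same work.
changed_copy = []
stmt_copy = """
new_string = original_string[:]
changed_copy.append('  ' in new_string)
while '  ' in new_string:
    new_string = new_string.replace('  ', ' ')
"""
timeit.Timer(stmt=stmt_copy, setup=setup,
             globals={'changed_copy': changed_copy}).timeit(number=3)
print(changed_copy)
# [True, True, True]
```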
    

    Tests run on a laptop with an i5 processor running Windows 7 (64-bit).
    
    timeit.Timer(stmt = test, setup = setup).repeat(7, 1000)
    
    test_string = 'The   fox jumped   over\n\t    the log.' # trivial
    
    Python 2.7.3, 32-bit, Windows
                    test |    minimum |    maximum |    average |     median
    ---------------------+------------+------------+------------+-----------
      while_replace_test |   0.001066 |   0.001260 |   0.001128 |   0.001092
         re_replace_test |   0.003074 |   0.003941 |   0.003357 |   0.003349
        proper_join_test |   0.002783 |   0.004829 |   0.003554 |   0.003035
    
    Python 2.7.3, 64-bit, Windows
                    test |    minimum |    maximum |    average |     median
    ---------------------+------------+------------+------------+-----------
      while_replace_test |   0.001025 |   0.001079 |   0.001052 |   0.001051
         re_replace_test |   0.003213 |   0.004512 |   0.003656 |   0.003504
        proper_join_test |   0.002760 |   0.006361 |   0.004626 |   0.004600
    
    Python 3.2.3, 32-bit, Windows
                    test |    minimum |    maximum |    average |     median
    ---------------------+------------+------------+------------+-----------
      while_replace_test |   0.001350 |   0.002302 |   0.001639 |   0.001357
         re_replace_test |   0.006797 |   0.008107 |   0.007319 |   0.007440
        proper_join_test |   0.002863 |   0.003356 |   0.003026 |   0.002975
    
    Python 3.3.3, 64-bit, Windows
                    test |    minimum |    maximum |    average |     median
    ---------------------+------------+------------+------------+-----------
      while_replace_test |   0.001444 |   0.001490 |   0.001460 |   0.001459
         re_replace_test |   0.011771 |   0.012598 |   0.012082 |   0.011910
        proper_join_test |   0.003741 |   0.005933 |   0.004341 |   0.004009
    

    test_string = lorem_ipsum
    # Thanks to http://www.lipsum.com/
    # "Generated 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum"
    
    Python 2.7.3, 32-bit
                    test |    minimum |    maximum |    average |     median
    ---------------------+------------+------------+------------+-----------
      while_replace_test |   0.342602 |   0.387803 |   0.359319 |   0.356284
         re_replace_test |   0.337571 |   0.359821 |   0.348876 |   0.348006
        proper_join_test |   0.381654 |   0.395349 |   0.388304 |   0.388193    
    
    Python 2.7.3, 64-bit
                    test |    minimum |    maximum |    average |     median
    ---------------------+------------+------------+------------+-----------
      while_replace_test |   0.227471 |   0.268340 |   0.240884 |   0.236776
         re_replace_test |   0.301516 |   0.325730 |   0.308626 |   0.307852
        proper_join_test |   0.358766 |   0.383736 |   0.370958 |   0.371866    
    
    Python 3.2.3, 32-bit
                    test |    minimum |    maximum |    average |     median
    ---------------------+------------+------------+------------+-----------
      while_replace_test |   0.438480 |   0.463380 |   0.447953 |   0.446646
         re_replace_test |   0.463729 |   0.490947 |   0.472496 |   0.468778
        proper_join_test |   0.397022 |   0.427817 |   0.406612 |   0.402053    
    
    Python 3.3.3, 64-bit
                    test |    minimum |    maximum |    average |     median
    ---------------------+------------+------------+------------+-----------
      while_replace_test |   0.284495 |   0.294025 |   0.288735 |   0.289153
         re_replace_test |   0.501351 |   0.525673 |   0.511347 |   0.508467
        proper_join_test |   0.422011 |   0.448736 |   0.436196 |   0.440318
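
    The answer does not show how the minimum/maximum/average/median columns were computed. One plausible way, sketched here for a single row, is to feed the list returned by repeat() to the statistics module. Passing a lambda instead of a stmt string and the row formatting are assumptions made for this illustration:

```python
import re
import statistics
import timeit

def re_replace(string):
    return re.sub(r' {2,}', ' ', string)

test_string = 'The   fox jumped   over\n\t    the log.'  # the "trivial" string

# 7 timing runs of 1000 calls each, matching repeat(7, 1000) in the answer.
times = timeit.Timer(lambda: re_replace(test_string)).repeat(7, 1000)

row = (min(times), max(times), statistics.mean(times), statistics.median(times))
print('%24s | %10.6f | %10.6f | %10.6f | %10.6f' % (('re_replace_test',) + row))
```

    The absolute numbers depend on the machine and interpreter, so they will not match the tables above.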
    

    For the trivial string, a while-loop is the fastest on every interpreter tested. On Python 3.x the Pythonic string-split/join comes second, with regex bringing up the rear; on Python 2.7 those two are close, and which one wins depends on whether you look at the minimum or the average.

    For non-trivial strings, there's a bit more to consider. 32-bit 2.7? It's regex to the rescue! 2.7 64-bit? A while loop is best, by a decent margin. 32-bit 3.2, go with the "proper" join. 64-bit 3.3, go for a while loop. Again.

    In the end, one can improve performance if/where/when needed, but it's always best to remember the mantra:

    1. Make It Work
    2. Make It Right
    3. Make It Fast

    IANAL, YMMV, Caveat Emptor!

  • 2020-11-22 09:03

    You can also use the string splitting technique in a Pandas DataFrame without needing to use .apply(..), which is useful if you need to perform the operation quickly on a large number of strings. Here it is on one line:

    df['message'] = (df['message'].str.split()).str.join(' ')
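
    A minimal sketch of that one-liner in action, assuming pandas is installed. The sample rows are made up; only the 'message' column name comes from the answer:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'message': ['The   fox jumped   over    the log.',
                               '  a \t b\n c  ']})

# str.split() with no pattern splits on any whitespace run, like str.split().
df['message'] = df['message'].str.split().str.join(' ')
print(df['message'].tolist())
# ['The fox jumped over the log.', 'a b c']
```

    As with plain str.split(), this also normalizes tabs and newlines, not just spaces.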
    