Is str.replace(..).replace(..) ad nauseam a standard idiom in Python?

前端未结

关注

 9  1367

For instance, say I wanted a function to escape a string for use in HTML (as in Django\'s escape filter):

    def escape(string):
        \"\"\"
        Retu


                      
              相关标签:


      
      
        
          9条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  名媛妹妹        
                
              
                            
                2020-12-20 11:22
              
            
            
                                                                       
That's what Django does:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  有刺的猬        
                
              
                            
                2020-12-20 11:23
              
            
            
                                                                       
In accordance with bebraw's suggestion, here is what I ended up using (in a separate module, of course):

import re

class Subs(object):
    """
    A container holding strings to be searched for and replaced in
    replace_multi().

    Holds little relation to the sandwich.
    """
    def __init__(self, needles_and_replacements):
        """
        Returns a new instance of the Subs class, given a dictionary holding 
        the keys to be searched for and the values to be used as replacements.
        """
        self.lookup = needles_and_replacements
        self.regex = re.compile('|'.join(map(re.escape,
                                             needles_and_replacements)))

def replace_multi(string, subs):
    """
    Replaces given items in string efficiently in a single-pass.

    "string" should be the string to be searched.
    "subs" can be either:
        A.) a dictionary containing as its keys the items to be
            searched for and as its values the items to be replaced.
        or B.) a pre-compiled instance of the Subs class from this module
               (which may have slightly better performance if this is
                called often).
    """
    if not isinstance(subs, Subs): # Assume dictionary if not our class.
        subs = Subs(subs)
    lookup = subs.lookup
    return subs.regex.sub(lambda match: lookup[match.group(0)], string)


Example usage:

def escape(string):
    """
    Returns the given string with ampersands, quotes and angle 
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in 
    # any order.
    escape.subs = Subs({'<': '&lt;', '>': '&gt;', "'": '&#39;', '"': '&quot;'})
    return replace_multi(string.replace('&', '&amp;'), escape.subs)


Much better :). Thanks for the help.

Edit

Nevermind, Mike Graham was right. I benchmarked it and the replacement ends up actually being much slower.

Code:

from urllib2 import urlopen
import timeit

def escape1(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    return string.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

def escape2(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in
    # any order.
    escape2.subs = Subs({'<': '&lt;', '>': '&gt;', "'": '&#39;', '"': '&quot;'})
    return replace_multi(string.replace('&', '&amp;'), escape2.subs)

# An example test on the stackoverflow homepage.
request = urlopen('http://stackoverflow.com')
test_string = request.read()
request.close()

test1 = timeit.Timer('escape1(test_string)',
                     setup='from __main__ import escape1, test_string')
test2 = timeit.Timer('escape2(test_string)',
                     setup='from __main__ import escape2, test_string')
print 'multi-pass:', test1.timeit(2000)
print 'single-pass:', test2.timeit(2000)


Output: 

multi-pass: 15.9897229671
single-pass: 66.5422530174


So much for that.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  遥遥无期        
                
              
                            
                2020-12-20 11:26
              
            
            
                                                                       
How about we just test various ways of doing this and see which comes out faster (assuming we are only caring about the fastest way to do it).

def escape1(input):
        return input.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

translation_table = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    "'": '&#39;',
    '"': '&quot;',
}

def escape2(input):
        return ''.join(translation_table.get(char, char) for char in input)

import re
_escape3_re = re.compile(r'[&<>\'"]')
def _escape3_repl(x):
    s = x.group(0)
    return translation_table.get(s, s)
def escape3(x):
    return _escape3_re.sub(_escape3_repl, x)

def escape4(x):
    return unicode(x).translate(translation_table)

test_strings = (
    'Nothing in there.',
    '<this is="not" a="tag" />',
    'Something & Something else',
    'This one is pretty long. ' * 50
)

import time

for test_i, test_string in enumerate(test_strings):
    print repr(test_string)
    for func in escape1, escape2, escape3, escape4:
        start_time = time.time()
        for i in xrange(1000):
            x = func(test_string)
        print '\t%s done in %.3fms' % (func.__name__, (time.time() - start_time))
    print


Running this gives you:

'Nothing in there.'
    escape1 done in 0.002ms
    escape2 done in 0.009ms
    escape3 done in 0.001ms
    escape4 done in 0.005ms

'<this is="not" a="tag" />'
    escape1 done in 0.002ms
    escape2 done in 0.012ms
    escape3 done in 0.009ms
    escape4 done in 0.007ms

'Something & Something else'
    escape1 done in 0.002ms
    escape2 done in 0.012ms
    escape3 done in 0.003ms
    escape4 done in 0.007ms

'This one is pretty long. <snip>'
    escape1 done in 0.008ms
    escape2 done in 0.386ms
    escape3 done in 0.011ms
    escape4 done in 0.310ms


Looks like just replacing them one after another goes the fastest.

Edit: Running the tests again with 1000000 iterations gives the following for the first three strings (the fourth would take too long on my machine for me to wait =P):

'Nothing in there.'
    escape1 done in 0.001ms
    escape2 done in 0.008ms
    escape3 done in 0.002ms
    escape4 done in 0.005ms

'<this is="not" a="tag" />'
    escape1 done in 0.002ms
    escape2 done in 0.011ms
    escape3 done in 0.009ms
    escape4 done in 0.007ms

'Something & Something else'
    escape1 done in 0.002ms
    escape2 done in 0.011ms
    escape3 done in 0.003ms
    escape4 done in 0.007ms


The numbers are pretty much the same. In the first case they are actually even more consistent as the direct string replacement is fastest now.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  粉色の甜心        
                
              
                            
                2020-12-20 11:28
              
            
            
                                                                       
ok so i sat down and did the math. pls do not get mad at me i answer specifically discussing ΤΖΩΤΖΙΟΥ’s solution, but this would be somewhat hard to shoehorn inside a comment, so let me do it this way. i will, in fact, also air some considerations that are relevant to the OP’s question. 

first up, i have been discussing with ΤΖΩΤΖΙΟΥ the elegance, correctness, and viability of his approach. turns out it looks like the proposal, while it does use an (inherently unordered) dictionary as a register to store the substitution pairs, does in fact consistently return correct results, where i had claimed it wouldn’t. this is because the call to itertools.starmap() in line 11, below, gets as its second argument an iterator over pairs of single characters/bytes (more on that later) that looks like [ ( 'h', 'h', ), ( 'e', 'e', ), ( 'l', 'l', ), ... ]. these pairs of characters/bytes is what the first argument, replacer.get, is repeatedly called with. there is not a chance to run into a situation where first '>' is transformed into '&gt;' and then inadvertently again into '&amp;gt;', because each character/byte is considered only once for substitution. so this part is in principle fine and algorithmically correct.

the next question is viability, and that would include a look at performance. if a vital task gets correctly completed in 0.01s using an awkward code but 1s using awesome code, then awkward might be considered preferable in practice (but only if the 1 second loss is in fact intolerable). here is the code i used for testing; it includes a number of different implementations. it is written in python 3.1 so we can use unicode greek letters for identifiers which in itself is awesome (zip in py3k returns the same as itertools.izip in py2):

import itertools                                                                  #01
                                                                                  #02
_replacements = {                                                                 #03
  '&': '&amp;',                                                                   #04
  '<': '&lt;',                                                                    #05
  '>': '&gt;',                                                                    #06
  '"': '&quot;',                                                                  #07
  "'": '&#39;', }                                                                 #08
                                                                                  #09
def escape_ΤΖΩΤΖΙΟΥ( a_string ):                                                  #10
  return ''.join(                                                                 #11
    itertools.starmap(                                                            #12
      _replacements.get,                                                          #13
      zip( a_string, a_string ) ) )                                               #14
                                                                                  #15
def escape_SIMPLE( text ):                                                        #16
  return ''.join( _replacements.get( chr, chr ) for chr in text )                 #17
                                                                                  #18
def escape_SIMPLE_optimized( text ):                                              #19
  get = _replacements.get                                                         #20
  return ''.join( get( chr, chr ) for chr in text )                               #21
                                                                                  #22
def escape_TRADITIONAL( text ):                                                   #23
  return text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')\    #24
    .replace("'", '&#39;').replace('"', '&quot;')                                 #25


these are the timing results:

escaping with SIMPLE            took 5.74664253sec for 100000 items
escaping with SIMPLE_optimized  took 5.11457801sec for 100000 items
escaping TRADITIONAL in-situ    took 0.57543013sec for 100000 items
escaping with TRADITIONAL       took 0.62347413sec for 100000 items
escaping a la ΤΖΩΤΖΙΟΥ          took 2.66592320sec for 100000 items


turns out the original poster’s concern that the ‘traditional’ method gets ‘ugly quickly and appears to have poor algorithmic performance’ appears partially unwarranted when put into this context. it actually performs best; when stashed away into a function call, we do get to see a 8% performance penalty (‘calling methods is expensive’, but in general you should still do it). in comparison, ΤΖΩΤΖΙΟΥ’s implementation takes around 5 times as long as the traditional method, which, given it’s higher complexity that has to compete with python’s long-honed, optimized string methods is no surprise. 

there is yet another algorithm here, the SIMPLE one. as far as i can see, this very much does exactly what ΤΖΩΤΖΙΟΥ’s method does: it iterates over the characters/bytes in the text and performs a lookup for each, then joins all the characters/bytes together and returns the resulting escaped text. you can see that where one way to do that involves a fairly lengthy and myterious formulation, the SIMPLE implementation is actually understandable at a glance. 

what really trips me up here, though, is how badly the SIMPLE approach is in performance: it is around 10 times as slow as the traditional one, and also twice as slow as ΤΖΩΤΖΙΟΥ’s method. i am completely at a loss here, maybe someone can come up with an idea why this should be so. it uses only the most basic building blocks of python and works with two implicit iterations, so it avoids to build throw-away lists and everything, but it still slow, and i don’t know why.

let me conclude this code review with a remark on the merit of ΤΖΩΤΖΙΟΥ’s solution. i have made it sufficiently clear i find the code hard to read and too overblown for the task at hand. more critical than that, however, i find the way he treats characters and makes sure that for a given small range of characters they will behave in a byte-like fashion a little irritating. sure it works for the task at hand, but as soon as i iterate e.g. over the bytestring 'ΤΖΩΤΖΙΟΥ' what i do is iterate over adjacent bytes representing single characters. in most situations this is exactly what you should avoid; this is precisely the reason why in py3k ‘strings’ are now the ‘unicode objects’ of old, and the ‘strings’ of old have become ‘bytes’ and ‘bytearray’. if i was to nominate the one feature of py3k that could warrant a possibly expensive migration of code from the 2 series to the 3 series, it would be this single property of py3k. 98% of all my encoding issues have just dissolved ever since, period, and there is no clever hack that could have me seriously doubt my move. said algorithm is not ‘conceptually 8bit-clean and unicode safe’, which to me is a seriously shortcome, given this is 2010.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  礼貌的吻别        
                
              
                            
                2020-12-20 11:35
              
            
            
                                                                       
I prefer something clean like:

substitutions = [
    ('<', '&lt;'),
    ('>', '&gt;'),
    ...]

for search, replacement in substitutions:
    string = string.replace(search, replacement)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  闹比i        
                
              
                            
                2020-12-20 11:37
              
            
            
                                                                       
If you work with non-Unicode strings and Python < 3.0, try an alternate translate method:
# Python < 3.0
import itertools

def escape(a_string):
    replacer= dict( (chr(c),chr(c)) for c in xrange(256))
    replacer.update(
        {'&': '&amp;',
         '<': '&lt;',
         '>': '&gt;',
         '"': '&quot;',
         "'": '&#39;'}
    )
    return ''.join(itertools.imap(replacer.__getitem__, a_string))

if __name__ == "__main__":
    print escape('''"Hello"<i> to George's friend&co.''')

$ python so2484156.py 
&quot;Hello&quot;&lt;i&gt; to George&#39;s friend&amp;co.

This is closer to a "single scan" of the input string, as per your wish.
EDIT
My intention was to create a unicode.translate equivalent that was not restricted to single-character replacements, so I came up with the answer above; I got a comment by user "flow" that was almost completely out of context, with a single correct point: the code above, as is, is intended to work with byte strings and not unicode strings. There is an obvious update (i.e. unichr() … xrange(sys.maxunicode+1)) which I strongly dislike, so I came up with another function that works on both unicode and byte strings, given that Python guarantees:
all( (chr(i)==unichr(i) and hash(chr(i))==hash(unichr(i)))
    for i in xrange(128)) is True

The new function follows:
def escape(a_string):
    replacer= {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;',
    }
    return ''.join(
        itertools.starmap(
            replacer.get, # .setdefault *might* be faster
            itertools.izip(a_string, a_string)
        )
    )

Notice the use of starmap with a sequence of tuples: for any character not in the replacer dict, return said character.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     1
2
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复