Mass string replace in python?

我寻月下人不归 2020-11-28 20:14

Say I have a string that looks like this:

str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

You'll notic

13 Answers
  • 2020-11-28 20:52

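    Try this. Note that str.replace returns a new string rather than modifying the original, so the result has to be assigned back each time:

    str = str.replace("&y", dict["&y"])
    str = str.replace("&c", dict["&c"])
    str = str.replace("&b", dict["&b"])
    str = str.replace("&Y", dict["&Y"])
    str = str.replace("&u", dict["&u"])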

  • 2020-11-28 20:55

    Not sure about the speed of this solution either, but you could just loop through your dictionary and repeatedly call the built-in

    str.replace(old, new)

    This might perform decently well if the original string isn't too long, but it would obviously suffer as the string got longer.
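    A minimal sketch of that loop follows. The `codes` mapping and the `replace_all` helper are hypothetical names used only for illustration, since the question's full dictionary is not shown here.

```python
# Hypothetical subset of the mapping; the question's real dictionary is not shown.
codes = {
    "&y": "\033[0;30m",
    "&c": "\033[0;31m",
}


def replace_all(text, mapping):
    """Call str.replace once per mapping entry.

    Each call scans the whole string and builds a new one, which is why
    the cost grows with both the string length and the number of entries.
    """
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


result = replace_all("The &yquick &cbrown fox", codes)
print(repr(result))
```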

  • 2020-11-28 21:00

    If you really want to dig into the topic, take a look at this: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

    The obvious solution, iterating over the dictionary and replacing each key in the string, takes O(n*m) time, where n is the size of the dictionary and m is the length of the string.

    The Aho-Corasick algorithm, by contrast, finds all entries of the dictionary in O(n+m+f) time, where f is the number of matches found.
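    For illustration, here is a compact sketch of the matching half of Aho-Corasick: it builds a trie with failure links and reports every match in one pass over the text. The function names `build_automaton` and `find_all` are made up for this example, and the sketch only locates matches; turning them into replacements is left out.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie (goto), failure links (fail) and match lists (out)."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            nxt = goto[node].get(ch)
            if nxt is None:
                goto.append({})
                fail.append(0)
                out.append([])
                nxt = len(goto) - 1
                goto[node][ch] = nxt
            node = nxt
        out[node].append(pat)

    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence, in scan order."""
    goto, fail, out = build_automaton(patterns)
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches


matches = find_all("The &yquick &cbrown", ["&y", "&c"])
print(matches)
```

    With two-character codes like these the automaton is overkill, but the same code handles overlapping patterns of different lengths.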

  • 2020-11-28 21:01

    Since someone mentioned using a simple parser, I thought I'd cook one up using pyparsing. By using pyparsing's transformString method, pyparsing internally scans through the source string, and builds a list of the matching text and intervening text. When all is done, transformString then ''.join's this list, so there is no performance problem in building up strings by increments. (The parse action defined for ANSIreplacer does the conversion from the matched &_ characters to the desired escape sequence, and replaces the matched text with the output of the parse action. Since only matching sequences will satisfy the parser expression, there is no need for the parse action to handle undefined &_ sequences.)

    The FollowedBy('&') is not strictly necessary, but it shortcuts the parsing process by verifying that the parser is actually positioned at an ampersand before doing the more expensive checking of all of the markup options.

    from pyparsing import FollowedBy, oneOf
    
    escLookup = {"&y":"\033[0;30m",
                "&c":"\033[0;31m",
                "&b":"\033[0;32m",
                "&Y":"\033[0;33m",
                "&u":"\033[0;34m"}
    
    # make a single expression that will look for a leading '&', then try to 
    # match each of the escape expressions
    ANSIreplacer = FollowedBy('&') + oneOf(escLookup.keys())
    
    # add a parse action that will replace the matched text with the 
    # corresponding ANSI sequence
    ANSIreplacer.setParseAction(lambda toks: escLookup[toks[0]])
    
    # now use the replacer to transform the test string; throw in some extra
    # ampersands to show what happens with non-matching sequences
    src = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog & &Zjumps back"
    out = ANSIreplacer.transformString(src)
    print(repr(out))
    

    Prints:

    'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox \x1b[0;33mjumps over 
     the \x1b[0;34mlazy dog & &Zjumps back'
    

    This will certainly not win any performance contests, but if your markup starts to get more complicated, then having a parser foundation will make it easier to extend.

  • 2020-11-28 21:04

    Here is a version using split/join

    mydict = {"y":"\033[0;30m",
              "c":"\033[0;31m",
              "b":"\033[0;32m",
              "Y":"\033[0;33m",
              "u":"\033[0;34m"}
    mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
    
    myparts = mystr.split("&")
    myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]]
    print("".join(myparts))
    

    In case there are ampersands with invalid codes you can use this to preserve them

    myparts[1:]=[mydict.get(x[0],"&"+x[0])+x[1:] for x in myparts[1:]]
    

    Peter Hansen pointed out that this fails when there is double ampersand. In that case use this version

    mystr = "The &yquick &cbrown &bfox &Yjumps over the &&ulazy dog"
    myparts = mystr.split("&")
    myparts[1:]=[mydict.get(x[:1],"&"+x[:1])+x[1:] for x in myparts[1:]]
    print("".join(myparts))
    
  • 2020-11-28 21:05

    Try this, making use of regular expression substitution, and standard string formatting:

    # using your stated values for str and dict:
    >>> import re
    >>> str = re.sub(r'(&[a-zA-Z])', r'%(\1)s', str)
    >>> str % dict
    'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox \x1b[0;33mjumps over the \x1b[0;34mlazy dog'
    

    The re.sub() call replaces all sequences of ampersand followed by single letter with the pattern %(..)s containing the same pattern.

    The % formatting takes advantage of a feature of string formatting that can take a dictionary to specify the substitution, rather than the more commonly occurring positional arguments.
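    One caveat with this approach is that any literal percent sign already in the text would be misread by the % operator. The sketch below escapes those first; `codes` and `format_codes` are hypothetical names, and it assumes every &-code in the text has an entry in the mapping (an unknown code would raise KeyError).

```python
import re

# Hypothetical subset of the mapping; the question's real dictionary is not shown.
codes = {"&y": "\033[0;30m", "&c": "\033[0;31m"}


def format_codes(text, mapping):
    """Turn each &x into %(&x)s, then let % formatting do the lookups."""
    # Double any literal percent signs so % formatting leaves them alone.
    escaped = text.replace("%", "%%")
    template = re.sub(r"(&[a-zA-Z])", r"%(\1)s", escaped)
    return template % mapping


result = format_codes("100% &yquick &cbrown", codes)
print(repr(result))
```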

    An alternative can do this directly in the re.sub, using a callback:

    >>> import re
    >>> def dictsub(m):
    ...     return dict[m.group()]
    ...
    >>> str = re.sub(r'(&[a-zA-Z])', dictsub, str)
    

    This time I'm using a closure to reference the dictionary from inside the callback function. This approach could give you a little more flexibility. For example, you could use something like dict.get(m.group(), '??') to avoid raising exceptions if you had strings with unrecognized code sequences.
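    As a rough sketch of that more forgiving variant, the callback below uses a dictionary lookup with a default. Here the default is the matched text itself, so unrecognized codes pass through unchanged instead of becoming '??'. The names `codes` and `substitute` are hypothetical.

```python
import re

# Hypothetical subset of the mapping; the question's real dictionary is not shown.
codes = {"&y": "\033[0;30m", "&c": "\033[0;31m"}


def substitute(text, mapping):
    """Replace known &-codes in one pass; leave unknown ones untouched."""
    return re.sub(
        r"&[a-zA-Z]",
        lambda m: mapping.get(m.group(), m.group()),
        text,
    )


result = substitute("The &yquick &cbrown &Zfox", codes)
print(repr(result))
```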

    (By the way, both "dict" and "str" are names of built-in types, and you'll get into trouble if you reuse them for your own variables, because doing so hides the built-ins. Just in case you didn't know that. They're fine for a question like this of course.)

    Edit: I decided to check Tor's test code, and concluded that it's nowhere near representative, and in fact buggy. The string generated doesn't even have ampersands in it (!). The revised code below generates a representative dictionary and string, similar to the OP's example inputs.

    I also wanted to verify that each algorithm's output was the same. Below is a revised test program, with only Tor's, mine, and Claudiu's code -- because the others were breaking on the sample input. (I think they're all brittle unless the dictionary maps basically all possible ampersand sequences, which Tor's test code was doing.) This one properly seeds the random number generator so each run is the same. Finally, I added a minor variation using a generator which avoids some function call overhead, for a minor performance improvement.

    from time import time
    import string
    import random
    import re
    
    random.seed(1919096)  # ensure consistent runs
    
    # build dictionary with 40 mappings, representative of original question
    mydict = dict(('&' + random.choice(string.letters), '\x1b[0;%sm' % (30+i)) for i in range(40))
    # build simulated input, with mix of text, spaces, ampersands in reasonable proportions
    letters = string.letters + ' ' * 12 + '&' * 6
    mystr = ''.join(random.choice(letters) for i in range(1000))
    
    # How many times to run each solution
    rep = 10000
    
    print('Running %d times with string length %d and %d ampersands'
        % (rep, len(mystr), mystr.count('&')))
    
    # Tor Valamo
    # fixed from Tor's test, so it actually builds up the final string properly
    t = time()
    for x in range(rep):
        output = mystr
        for k, v in mydict.items():
            output = output.replace(k, v)
    print('%-30s' % 'Tor fixed & variable dict', time() - t)
    # capture "known good" output as expected, to verify others
    expected = output
    
    # Peter Hansen
    
    # build charset to use in regex for safe dict lookup
    charset = ''.join(x[1] for x in mydict.keys())
    # grab reference to method on regex, for speed
    patsub = re.compile(r'(&[%s])' % charset).sub
    
    t = time()
    for x in range(rep):
        output = patsub(r'%(\1)s', mystr) % mydict
    print('%-30s' % 'Peter fixed & variable dict', time()-t)
    assert output == expected
    
    # Peter 2
    def dictsub(m):
        return mydict[m.group()]
    
    t = time()
    for x in range(rep):
        output = patsub(dictsub, mystr)
    print('%-30s' % 'Peter fixed dict', time() - t)
    assert output == expected
    
    # Peter 3 - freaky generator version, to avoid function call overhead
    def dictsub(d):
        m = yield None
        while 1:
            m = yield d[m.group()]
    
    dictsub = dictsub(mydict).send
    dictsub(None)   # "prime" it
    t = time()
    for x in range(rep):
        output = patsub(dictsub, mystr)
    print('%-30s' % 'Peter generator', time() - t)
    assert output == expected
    
    # Claudiu - Precompiled
    regex_sub = re.compile("(%s)" % "|".join(mydict.keys())).sub
    
    t = time()
    for x in range(rep):
        output = regex_sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
    print('%-30s' % 'Claudio fixed dict', time() - t)
    assert output == expected
    

    I forgot to include benchmark results before:

        Running 10000 times with string length 1000 and 96 ampersands
        ('Tor fixed & variable dict     ', 2.9890000820159912)
        ('Peter fixed & variable dict   ', 2.6659998893737793)
        ('Peter fixed dict              ', 1.0920000076293945)
        ('Peter generator               ', 1.0460000038146973)
        ('Claudio fixed dict            ', 1.562000036239624)
    

    Also, snippets of the inputs and correct output:

    mystr = 'lTEQDMAPvksk k&z Txp vrnhQ GHaO&GNFY&&a...'
    mydict = {'&p': '\x1b[0;37m', '&q': '\x1b[0;66m', '&v': ...}
    output = 'lTEQDMAPvksk k←[0;57m Txp vrnhQ GHaO←[0;67mNFY&&a P...'
    

    Comparing with what I saw from Tor's test code output:

    mystr = 'VVVVVVVPPPPPPPPPPPPPPPXXXXXXXXYYYFFFFFFFFFFFFEEEEEEEEEEE...'
    mydict = {'&p': '112', '&q': '113', '&r': '114', '&s': '115', ...}
    output = # same as mystr since there were no ampersands inside
    