Mass string replace in python?

我寻月下人不归 2020-11-28 20:14

Say I have a string that looks like this:

str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

You'll notic

13 Answers
  • 2020-11-28 21:05

    If the number of keys is large and the number of occurrences in the string is low (often zero), you could iterate over the ampersands in the string and look each one up in a dictionary keyed by the first character of the substring that follows. I don't code in Python often, so the style might be a bit off, but here is my take on it:

    str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
    
    dict = {"&y":"\033[0;30m",
            "&c":"\033[0;31m",
            "&b":"\033[0;32m",
            "&Y":"\033[0;33m",
            "&u":"\033[0;34m"}
    
    def rep(s):
      # s is the text that followed an "&"; its first character selects the code
      return dict["&"+s[0:1]] + s[1:]
    
    subs = str.split("&")
    res = subs[0] + "".join(map(rep, subs[1:]))
    
    print(res)
    

    Of course, there is the question of what happens when an ampersand comes from the string itself: you would need to escape it in some way before feeding the string through this process, and unescape it afterwards.
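
One possible way to do that escape/unescape step is sketched below. It is only an illustration: the `&&` convention for a literal ampersand, the `PLACEHOLDER` character, the name `mysubst_escaped`, and the choice to leave unknown codes untouched are all assumptions, not part of the code above.

```python
# Assumed to never occur in the input text.
PLACEHOLDER = "\0"

def mysubst_escaped(somestr, somedict):
    """Split-based replacement where '&&' stands for a literal '&'."""
    # Escape: hide doubled ampersands before splitting.
    protected = somestr.replace("&&", PLACEHOLDER)
    subs = protected.split("&")
    # Unknown codes (and a trailing '&') are put back unchanged.
    subs[1:] = [somedict.get("&" + part[:1], "&" + part[:1]) + part[1:]
                for part in subs[1:]]
    # Unescape: restore the hidden ampersands.
    return "".join(subs).replace(PLACEHOLDER, "&")

colors = {"&y": "\033[0;30m", "&c": "\033[0;31m"}
result = mysubst_escaped("Tom && Jerry &yrun, &Unknown", colors)
```

Using `dict.get` with the original token as the default avoids the `KeyError` that the plain lookup raises for codes that are not in the dictionary.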

    As usual with performance questions, it is a good idea to time the various approaches on your typical (and worst-case) dataset and compare them.
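
A minimal sketch of such a comparison using the standard `timeit` module follows. The function names `with_replace` and `with_split`, the test text, and the repeat counts are placeholders chosen for illustration; substitute your own data.

```python
import timeit

mapping = {"&y": "\033[0;30m", "&c": "\033[0;31m", "&b": "\033[0;32m",
           "&Y": "\033[0;33m", "&u": "\033[0;34m"}
text = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog" * 1000

def with_replace(s, d):
    # One full pass over the string per key.
    for key, value in d.items():
        s = s.replace(key, value)
    return s

def with_split(s, d):
    # One split, one dictionary lookup per ampersand, one join.
    parts = s.split("&")
    parts[1:] = [d["&" + p[:1]] + p[1:] for p in parts[1:]]
    return "".join(parts)

# Check that both approaches agree before comparing their speed.
assert with_replace(text, mapping) == with_split(text, mapping)

for fn in (with_replace, with_split):
    seconds = timeit.timeit(lambda: fn(text, mapping), number=20)
    print(f"{fn.__name__:<15} {seconds:.4f}s")
```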

    EDIT: place it into a separate function so that it works with an arbitrary dictionary:

    def mysubst(somestr, somedict):
      subs = somestr.split("&")
      return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))
    

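    EDIT2: get rid of an unneeded concatenation; over many iterations this seemed to still be a bit faster than the previous version.

    def mysubst(somestr, somedict):
      subs = somestr.split("&")
      # Rewrite the pieces in place and join once. (Calling subs[0].join(...)
      # would insert subs[0] between the pieces instead of putting it in front.)
      subs[1:] = [somedict["&" + arg[0:1]] + arg[1:] for arg in subs[1:]]
      return "".join(subs)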
    
  • 2020-11-28 21:08

    Here is a C extension approach for Python:

    const char *dvals[]={
        // 0-64
        "","","","","","","","","","",
        "","","","","","","","","","",
        "","","","","","","","","","",
        "","","","","","","","","","",
        "","","","","","","","","","",
        "","","","","","","","","","",
        "","","","","",
        //A-Z
        "","","","","",
        "","","","","",
        "","","","","",
        "","","","","",
        "","","","","33",
        "",
        // 91-96: the six characters between 'Z' and 'a'
        "","","","","","",
        //a-z
        "","32","31","","",
        "","","","","",
        "","","","","",
        "","","","","",
        "34","","","","30",
        ""
    };
    
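    int dsub(char*d,char*s){
        char *ofs=d;
        do{
            //cast to unsigned char so that bytes >= 0x80 can never give a negative index
            if(*s=='&' && (unsigned char)s[1]<='z' && *dvals[(unsigned char)s[1]]){
    
                //writes the seven literal characters \033[0; (text, not the ESC byte)
                *d++='\\',*d++='0',*d++='3',*d++='3',*d++='[',*d++='0',*d++=';';
    
                //the code is assumed to be exactly 2 digits
                *d++=dvals[(unsigned char)s[1]][0];
                *d++=dvals[(unsigned char)s[1]][1];
    
                *d++='m';
    
                s++; //skip the character after '&'
    
            //plain characters and invalid or unused ampersand sequences go here
            }else *d++=*s;
    
        }while(*s++);
    
        //length of the output, not counting the terminating NUL
        return d-ofs-1;
    }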
    

    The Python code I tested it with:

    from mylib import *
    import time
    
    start=time.time()
    
    instr="The &yquick &cbrown &bfox &Yjumps over the &ulazy dog, skip &Unknown.\n"*100000
    x=dsub(instr)
    
    end=time.time()
    
    print "time taken",end-start,",input str length",len(x)
    print "first few lines"
    print x[:1100]
    

    Results

    time taken 0.140000104904 ,input str length 11000000
    first few lines
    The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
    The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
    The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
    The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
    The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
    The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
    The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
    The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
    The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
    The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
    

    It is meant to run in O(n) time, and it took only about 160 ms on average to produce the 11 MB result string on my Mobile Celeron 1.6 GHz PC.

    Unknown sequences are passed through unchanged; for example, &Unknown is returned as is.

    Let me know if you have any problems with compiling, bugs, etc.

  • 2020-11-28 21:08

    A general solution for defining replacement rules is to use regex substitution using a function to provide the map (see re.sub()).

    import re
    
    str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
    
    dict = {"&y":"\033[0;30m",
            "&c":"\033[0;31m",
            "&b":"\033[0;32m",
            "&Y":"\033[0;33m",
            "&u":"\033[0;34m"}
    
    def programmaticReplacement( match ):
        return dict[ match.group( 1 ) ]
    
    colorstring = re.sub( r'(&.)', programmaticReplacement, str )
    

    This is particularly nice for non-trivial substitutions (e.g. anything requiring mathematical operations to create the substitute).
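
As an illustration of a computed substitution, the sketch below invents a different code scheme: `&0` to `&7` are turned into ANSI foreground colours by adding the digit to 30. The digit codes and the names `digit_to_ansi` and `colorize_digits` are hypothetical and do not appear in the question.

```python
import re

def digit_to_ansi(match):
    """Turn '&<digit>' into an ANSI foreground colour escape (30 + digit)."""
    n = int(match.group(1))
    return "\033[0;%dm" % (30 + n)

def colorize_digits(text):
    # Only &0..&7 are treated as codes; everything else is left untouched.
    return re.sub(r"&([0-7])", digit_to_ansi, text)

colored = colorize_digits("&1red &4blue &9unchanged")
```

Because the replacement is a function, no dictionary entry is needed for each code; the pattern itself restricts which sequences are touched.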

  • 2020-11-28 21:14
    mydict = {"&y":"\033[0;30m",
              "&c":"\033[0;31m",
              "&b":"\033[0;32m",
              "&Y":"\033[0;33m",
              "&u":"\033[0;34m"}
    mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
    
    for k, v in mydict.iteritems():
        mystr = mystr.replace(k, v)
    
    print mystr
    The ←[0;30mquick ←[0;31mbrown ←[0;32mfox ←[0;33mjumps over the ←[0;34mlazy dog
    

    I took the liberty of comparing a few solutions:

    mydict = dict([('&' + chr(i), str(i)) for i in list(range(65, 91)) + list(range(97, 123))])
    
    # random inserts between keys
    from random import randint
    rawstr = ''.join(mydict.keys())
    mystr = ''
    for i in range(0, len(rawstr), 2):
        mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars
    
    from time import time
    
    # How many times to run each solution
    rep = 10000
    
    print 'Running %d times with string length %d and ' \
          'random inserts of lengths 0-20' % (rep, len(mystr))
    
    # My solution
    t = time()
    for x in range(rep):
        for k, v in mydict.items():
            mystr.replace(k, v)
        #print(mystr)
    print '%-30s' % 'Tor fixed & variable dict', time()-t
    
    from re import sub, compile, escape
    
    # Peter Hansen
    t = time()
    for x in range(rep):
        sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
    print '%-30s' % 'Peter fixed & variable dict', time()-t
    
    # Claudiu
    def multiple_replace(dict, text): 
        # Create a regular expression  from the dictionary keys
        regex = compile("(%s)" % "|".join(map(escape, dict.keys())))
    
        # For each match, look-up corresponding value in dictionary
        return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
    
    t = time()
    for x in range(rep):
        multiple_replace(mydict, mystr)
    print '%-30s' % 'Claudio variable dict', time()-t
    
    # Claudiu - Precompiled
    regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))
    
    t = time()
    for x in range(rep):
        regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
    print '%-30s' % 'Claudio fixed dict', time()-t
    
    # Andrew Y - variable dict
    def mysubst(somestr, somedict):
      subs = somestr.split("&")
      return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))
    
    t = time()
    for x in range(rep):
        mysubst(mystr, mydict)
    print '%-30s' % 'Andrew Y variable dict', time()-t
    
    # Andrew Y - fixed
    def repl(s):
      return mydict["&"+s[0:1]] + s[1:]
    
    t = time()
    for x in range(rep):
        subs = mystr.split("&")
        res = subs[0] + "".join(map(repl, subs[1:]))
    print '%-30s' % 'Andrew Y fixed dict', time()-t
    

    Results in Python 2.6

    Running 10000 times with string length 490 and random inserts of lengths 0-20
    Tor fixed & variable dict      1.04699993134
    Peter fixed & variable dict    0.218999862671
    Claudio variable dict          2.48400020599
    Claudio fixed dict             0.0940001010895
    Andrew Y variable dict         0.0309998989105
    Andrew Y fixed dict            0.0310001373291
    

    Both Claudiu's and Andrew's solutions kept reporting 0 seconds, so I had to increase the count to 10,000 runs.

    I also ran it in Python 3 (because of Unicode) with replacements for the characters from 39 to 1024 (38 is the ampersand, so I didn't want to include it). The string length is up to 10,000, including about 980 replacements, with variable random inserts of length 0-20. The Unicode values from 39 to 1024 produce characters that are 1 or 2 bytes long, which could affect some solutions.

    mydict = dict([('&' + chr(i), str(i)) for i in range(39,1024)])
    
    # random inserts between keys
    from random import randint
    rawstr = ''.join(mydict.keys())
    mystr = ''
    for i in range(0, len(rawstr), 2):
        mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars
    
    from time import time
    
    # How many times to run each solution
    rep = 10000
    
    print('Running %d times with string length %d and ' \
          'random inserts of lengths 0-20' % (rep, len(mystr)))
    
    # Tor Valamo - too long
    #t = time()
    #for x in range(rep):
    #    for k, v in mydict.items():
    #        mystr.replace(k, v)
    #print('%-30s' % 'Tor fixed & variable dict', time()-t)
    
    from re import sub, compile, escape
    
    # Peter Hansen
    t = time()
    for x in range(rep):
        sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
    print('%-30s' % 'Peter fixed & variable dict', time()-t)
    
    # Peter 2
    def dictsub(m):
        return mydict[m.group()]
    
    t = time()
    for x in range(rep):
        sub(r'(&[a-zA-Z])', dictsub, mystr)
    print('%-30s' % 'Peter fixed dict', time()-t)
    
    # Claudiu - too long
    #def multiple_replace(dict, text): 
    #    # Create a regular expression  from the dictionary keys
    #    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))
    #
    #    # For each match, look-up corresponding value in dictionary
    #    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
    #
    #t = time()
    #for x in range(rep):
    #    multiple_replace(mydict, mystr)
    #print('%-30s' % 'Claudio variable dict', time()-t)
    
    # Claudiu - Precompiled
    regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))
    
    t = time()
    for x in range(rep):
        regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
    print('%-30s' % 'Claudio fixed dict', time()-t)
    
    # Separate setup for Andrew and gnibbler optimized dict
    mydict = dict((k[1], v) for k, v in mydict.items())
    
    # Andrew Y - variable dict
    def mysubst(somestr, somedict):
      subs = somestr.split("&")
      return subs[0] + "".join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))
    
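    def mysubst2(somestr, somedict):
      # Note: subs[0].join(...) uses subs[0] as the separator, so the output is
      # not the same as mysubst(); kept here exactly as it was benchmarked.
      subs = somestr.split("&")
      return subs[0].join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))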
    
    t = time()
    for x in range(rep):
        mysubst(mystr, mydict)
    print('%-30s' % 'Andrew Y variable dict', time()-t)
    t = time()
    for x in range(rep):
        mysubst2(mystr, mydict)
    print('%-30s' % 'Andrew Y variable dict 2', time()-t)
    
    # Andrew Y - fixed
    def repl(s):
      return mydict[s[0:1]] + s[1:]
    
    t = time()
    for x in range(rep):
        subs = mystr.split("&")
        res = subs[0] + "".join(map(repl, subs[1:]))
    print('%-30s' % 'Andrew Y fixed dict', time()-t)
    
    # gnibbler
    t = time()
    for x in range(rep):
        myparts = mystr.split("&")
        myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]]
        "".join(myparts)
    print('%-30s' % 'gnibbler fixed & variable dict', time()-t)
    

    Results:

    Running 10000 times with string length 9491 and random inserts of lengths 0-20
    Tor fixed & variable dict      0.0 # disqualified 329 secs
    Peter fixed & variable dict    2.07799983025
    Peter fixed dict               1.53100013733 
    Claudio variable dict          0.0 # disqualified, 37 secs
    Claudio fixed dict             1.5
    Andrew Y variable dict         0.578000068665
    Andrew Y variable dict 2       0.56299996376
    Andrew Y fixed dict            0.56200003624
    gnibbler fixed & variable dict 0.530999898911
    

    (Note that gnibbler's code uses a different dict, in which the keys don't include the '&'. Andrew's code also uses this alternate dict, but it didn't make much of a difference, maybe just a 0.01x speedup.)

  • 2020-11-28 21:15

    The problem with doing this mass replace in Python is the immutability of strings: every time you replace one item in the string, an entire new string is allocated from the heap again.

    So if you want the fastest solution, you need to either use a mutable container (e.g. a list) or write this machinery in plain C (or better, in Pyrex or Cython). In any case, I'd suggest writing a simple parser based on a simple finite-state machine and feeding it the symbols of your string one by one.

    The suggested regexp-based solutions work in a similar way, because regexps use an FSM behind the scenes.
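
A rough pure-Python sketch of that idea is shown below: a two-state scanner that collects its output in a list and joins it once at the end. The name `fsm_replace` and the convention that `codes` is keyed by the single character after the ampersand are assumptions made for this example; a Cython or C version would follow the same structure.

```python
def fsm_replace(text, codes):
    """Single-pass scanner; `codes` maps one character (without '&') to its replacement."""
    out = []          # mutable buffer, joined once at the end
    pending = False   # state: True right after an '&' that is not yet resolved
    for ch in text:
        if pending:
            pending = False
            if ch in codes:
                out.append(codes[ch])
                continue
            out.append("&")   # not a known code: emit the '&' that was held back
            # fall through and handle ch normally (it may itself be '&')
        if ch == "&":
            pending = True
        else:
            out.append(ch)
    if pending:
        out.append("&")       # trailing '&' at the end of the input
    return "".join(out)

codes = {"y": "\033[0;30m", "u": "\033[0;34m"}
scanned = fsm_replace("The &yquick &ulazy dog, skip &Unknown & done&", codes)
```

Appending to a list and joining once avoids building a new string for every replacement, which is the cost described above.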

  • 2020-11-28 21:16

    This seems to do what you want: multiple string replacements at once using regular expressions. Here is the relevant code:

    import re
    
    def multiple_replace(dict, text):
        # Create a regular expression from the dictionary keys
        regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))
    
        # For each match, look up the corresponding value in the dictionary
        return regex.sub(lambda mo: dict[mo.group(0)], text)
    
    # dict and str are the dictionary and the string from the question
    print(multiple_replace(dict, str))
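
One detail to watch with a pattern built from alternatives: if one key is a prefix of another, the alternation tries them in the order given. That cannot happen with the two-character codes in the question, but for general keys a variant like the following may help. It is a sketch under that assumption; the name `make_replacer` and the sample keys are made up for illustration.

```python
import re

def make_replacer(mapping):
    """Compile the pattern once and return a function that applies it.

    Assumes `mapping` is non-empty.
    """
    # Longer keys first, so that "&yy" wins over "&y" in the alternation.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)

replace_codes = make_replacer({"&y": "<yellow>", "&yy": "<bright-yellow>"})
print(replace_codes("&yy and &y"))
```

Compiling once and reusing the returned function also avoids rebuilding the pattern on every call, which is the difference between the "variable dict" and "fixed dict" timings reported in this thread.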
    