Fastest Python method for search and replace on a large string

半阙折子戏 2020-12-30 03:49

I'm looking for the fastest way to replace a large number of sub-strings inside a very large string. Here are two examples I've used.

findall() feels simpler and

3 answers
  • 2020-12-30 04:22

    You can, and I think you should because it is certainly an optimized function, use

    re.sub(pattern, repl, string[, count, flags])
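
As a minimal sketch of that call (the sample text and the replacement "N" are assumptions made up for illustration, not taken from the question), a compiled pattern can replace every match in one pass, and count can cap the number of replacements:

```python
import re

# Sample text is an assumption for illustration only
text = "abc123def45gh6"

# Compile once when the same pattern is reused
digits = re.compile(r"\d+")

# Replace every run of digits in a single pass over the string
result = digits.sub("N", text)
print(result)  # abcNdefNghN

# count limits how many matches are replaced, starting from the left
first_only = digits.sub("N", text, count=1)
print(first_only)  # abcNdef45gh6
```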
    

    The reason your findall_replace() function is slow is that a new string object is created at each match, as you will see by executing the following code:

    ch = '''qskfg qmohb561687ipuygvnjoihi2576871987uuiazpoieiohoihnoipoioh
    opuihbavarfgvipauhbi277auhpuitchpanbiuhbvtaoi541987ujptoihbepoihvpoezi 
    abtvar473727tta aat tvatbvatzeouithvbop772iezubiuvpzhbepuv454524522ueh'''
    
    import re
    
    def findall_replace(text, reg, rep):
        for match in reg.findall(text):
            # str is immutable, so replace() returns a new string object
            text = text.replace(match, rep)
            print(id(text))  # identity of the current string object
        return text
    
    pat = re.compile(r'\d+')
    rep = 'AAAAAAA'
    
    print(id(ch))
    print()
    print(findall_replace(ch, pat, rep))
    

    Note that in this code I replaced output = text.replace(match, rep) with text = text.replace(match, rep); otherwise only the last occurrence is replaced.

    finditer_replace() is slow for the same reason as findall_replace(): a new string object is created at every step. But the former uses the iterator returned by re.finditer(), while the latter builds a list object beforehand, so the latter takes longer. That is the difference between using an iterator and not using one.
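
To make the list-versus-iterator difference concrete, here is a small sketch (the sample text is an assumption for illustration): findall() returns all matched strings at once, while finditer() yields match objects one at a time.

```python
import re

pat = re.compile(r"\d+")
text = "a1 b22 c333"  # assumed sample input

# findall() builds the complete list of matched strings up front
found = pat.findall(text)
print(found)  # ['1', '22', '333']

# finditer() returns an iterator; match objects are produced on demand
lazy = pat.finditer(text)
first = next(lazy)
print(first.group(), first.span())  # 1 (1, 2)

# The remaining matches are only computed as the iterator is consumed
rest = [m.group() for m in lazy]
print(rest)  # ['22', '333']
```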

  • 2020-12-30 04:31

    By the way, your code with findall_replace() isn't safe; it can return unexpected results:

    ch = 'sea sun ABC-ABC-DEF bling ranch micABC-DEF fish'
    
    import re
    
    def findall_replace(text, reg, rep):
        for gr in reg.findall(text):
            # replace() rewrites every occurrence of gr,
            # including ones created by an earlier replacement
            text = text.replace(gr, rep)
            print('group==', gr)
            print('text==', text)
        return '\nresult is : ' + text
    
    pat = re.compile('ABC-DE')
    rep = 'DEFINITION'
    
    print('ch==', ch)
    print()
    print(findall_replace(ch, pat, rep))
    

    This displays:

    ch== sea sun ABC-ABC-DEF bling ranch micABC-DEF fish
    
    group== ABC-DE
    text== sea sun ABC-DEFINITIONF bling ranch micDEFINITIONF fish
    group== ABC-DE
    text== sea sun DEFINITIONFINITIONF bling ranch micDEFINITIONF fish
    
    result is : sea sun DEFINITIONFINITIONF bling ranch micDEFINITIONF fish
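
For comparison, a short sketch of the same input run through the pattern's subn() method, which scans the original string once and replaces each non-overlapping match exactly once, so text produced by a replacement is never matched again:

```python
import re

ch = 'sea sun ABC-ABC-DEF bling ranch micABC-DEF fish'
pat = re.compile('ABC-DE')
rep = 'DEFINITION'

# subn() returns the new string and the number of replacements made
result, count = pat.subn(rep, ch)
print(result)  # sea sun ABC-DEFINITIONF bling ranch micDEFINITIONF fish
print(count)   # 2
```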
    
  • 2020-12-30 04:38

    The standard method is to use the built-in

    re.sub(reg, rep, text)
    

    Incidentally, the reason for the performance difference between your versions is that each replacement in your first version causes the entire string to be recopied. A single copy is fast, but when each one copies 10 MB, enough copies will become slow.
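
A rough way to observe this is a timing sketch like the one below. The synthetic input, the function names loop_replace and sub_replace, and the repetition counts are all assumptions for illustration; actual timings depend on the machine and on the data, so no figures are quoted here.

```python
import re
import timeit

pat = re.compile(r"\d+")
rep = "AAAAAAA"

# Synthetic input (an assumption): 2000 distinct fixed-width numbers
text = " ".join("id%04d" % i for i in range(2000))

def loop_replace(text):
    # Each str.replace() call scans the string and builds a full new copy
    for match in pat.findall(text):
        text = text.replace(match, rep)
    return text

def sub_replace(text):
    # A single pass that builds the output string once
    return pat.sub(rep, text)

t_loop = timeit.timeit(lambda: loop_replace(text), number=5)
t_sub = timeit.timeit(lambda: sub_replace(text), number=5)
print("loop of replace():", t_loop)
print("single re.sub()  :", t_sub)
```

On this particular input both functions return the same string, because every number is distinct and has the same width; that is not true in general.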
