Best way to replace multiple characters in a string?

遇见更好的自我 2020-11-22 11:15

I need to replace some characters as follows: & -> \&, # -> \#, ...

I coded as follows, but I guess there

14 answers
  • 2020-11-22 11:30

    For Python 3.8 and above, one can use assignment expressions

    [text := text.replace(s, f"\\{s}") for s in "&#" if s in text]
    

    I am not sure this counts as an "appropriate use" of assignment expressions as described in PEP 572, but it looks clean and reads quite well (to my eyes). It is more clearly appropriate if you want all the intermediate strings as well. For example, removing all lowercase vowels:

    text = "Lorem ipsum dolor sit amet"
    intermediates = [text := text.replace(i, "") for i in "aeiou" if i in text]
    
    ['Lorem ipsum dolor sit met',
     'Lorm ipsum dolor sit mt',
     'Lorm psum dolor st mt',
     'Lrm psum dlr st mt',
     'Lrm psm dlr st mt']
    

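    On the plus side, in my timings it seemed (unexpectedly?) faster than some of the faster methods in the accepted answer, and it seemed to scale well with both increasing string length and an increasing number of substitutions. Treat this with caution, though: a generator expression is lazy, so a version that builds one without consuming it performs no replacements at all, and the comparison is only meaningful when the comprehension is fully evaluated.

    The code for the above comparison is below. I am using random strings to make my life a bit simpler, and the characters to replace are chosen randomly from the string itself. (Note: I am using IPython's %timeit magic here, so run this in IPython/Jupyter.)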

    import random, string
    
    def make_txt(length):
        "makes a random string of a given length"
        return "".join(random.choices(string.printable, k=length))
    
    def get_substring(s, num):
        "gets a substring"
        return "".join(random.choices(s, k=num))
    
    def a(text, replace): # one of the better performing approaches from the accepted answer
        for i in replace:
            if i in text:
                text = text.replace(i, "")
        return text
    
    def b(text, replace):
        # a list comprehension is evaluated eagerly; a generator expression here would never run
        _ = [text := text.replace(i, "") for i in replace if i in text]
        return text
    
    
    def compare(strlen, replace_length):
        "use ipython / jupyter for the %timeit functionality"
    
        times_a, times_b = [], []
    
        for i in range(*strlen):
            el = make_txt(i)
            et = get_substring(el, replace_length)
    
            res_a = %timeit -n 1000 -o a(el, et) # ipython magic
    
            # reuse the same inputs so both functions are timed on identical data
            res_b = %timeit -n 1000 -o b(el, et) # ipython magic
    
            times_a.append(res_a.average * 1e6)
            times_b.append(res_b.average * 1e6)
            
        return times_a, times_b
    
    #----run
    t2 = compare((2*2, 1000, 50), 2)
    t10 = compare((2*10, 1000, 50), 10)
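
    For readers without IPython, the same comparison can be sketched with the standard-library timeit module. This is only a minimal sketch: the names loop_replace, walrus_replace and time_us are hypothetical helpers introduced here, not part of the answer above, and the string length and number of characters are arbitrary choices.

```python
import random
import string
import timeit


def make_txt(length):
    """Return a random printable string of the given length."""
    return "".join(random.choices(string.printable, k=length))


def loop_replace(text, chars):
    """Plain for-loop with a membership check before each replace."""
    for c in chars:
        if c in text:
            text = text.replace(c, "")
    return text


def walrus_replace(text, chars):
    """List comprehension with an assignment expression (Python 3.8+)."""
    # The list is built eagerly, so every replace call really runs.
    _ = [text := text.replace(c, "") for c in chars if c in text]
    return text


def time_us(func, text, chars, number=1000):
    """Average time per call in microseconds (hypothetical helper)."""
    total = timeit.timeit(lambda: func(text, chars), number=number)
    return total / number * 1e6


if __name__ == "__main__":
    sample = make_txt(500)
    targets = "".join(random.choices(sample, k=10))
    # Both functions see identical inputs and must agree on the result.
    assert loop_replace(sample, targets) == walrus_replace(sample, targets)
    for func in (loop_replace, walrus_replace):
        print(f"{func.__name__}: {time_us(func, sample, targets):.2f} us per call")
```

    Returning the result from both functions makes it possible to check that they agree before trusting any timing difference.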
    
  • 2020-11-22 11:31
    >>> string="abc&def#ghi"
    >>> for ch in ['&','#']:
    ...   if ch in string:
    ...      string=string.replace(ch,"\\"+ch)
    ...
    >>> print(string)
    abc\&def\#ghi
    
  • 2020-11-22 11:31

    Late to the party, but I lost a lot of time on this issue before I found my answer.

    Short and sweet: translate is superior to replace here. If you care more about correct behavior than about run time, do not use chained replace calls.

    Also use translate if you don't know whether the set of characters to be replaced overlaps the set of replacement characters.

    Case in point:

    Using replace, you might naively expect the snippet "1234".replace("1", "2").replace("2", "3").replace("3", "4") to return "2344", but it in fact returns "4444", because each call operates on the output of the previous one.

    Translation seems to do what the OP originally wanted.
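
    As a minimal sketch of the point above, here is the overlapping case done with str.maketrans and str.translate, followed by the escaping task from the question. The variable names are illustrative only.

```python
# str.translate looks up each input character once, so a replacement
# is never re-examined by a later substitution.
shift = str.maketrans({"1": "2", "2": "3", "3": "4"})

chained = "1234".replace("1", "2").replace("2", "3").replace("3", "4")
translated = "1234".translate(shift)
print(chained)     # 4444
print(translated)  # 2344

# The mapping values may be longer than one character, which covers
# the backslash-escaping task from the question.
escape_table = str.maketrans({"&": r"\&", "#": r"\#"})
escaped = "abc&def#ghi".translate(escape_table)
print(escaped)     # abc\&def\#ghi
```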

  • 2020-11-22 11:33

    Replacing two characters

    I timed all the methods in the current answers along with one extra.

    With an input string of abc&def#ghi and replacing & -> \& and # -> \#, the fastest way was to chain the replacements together like this: text.replace('&', r'\&').replace('#', r'\#').

    Timings for each function:

    • a) 1000000 loops, best of 3: 1.47 μs per loop
    • b) 1000000 loops, best of 3: 1.51 μs per loop
    • c) 100000 loops, best of 3: 12.3 μs per loop
    • d) 100000 loops, best of 3: 12 μs per loop
    • e) 100000 loops, best of 3: 3.27 μs per loop
    • f) 1000000 loops, best of 3: 0.817 μs per loop
    • g) 100000 loops, best of 3: 3.64 μs per loop
    • h) 1000000 loops, best of 3: 0.927 μs per loop
    • i) 1000000 loops, best of 3: 0.814 μs per loop

    Here are the functions:

    def a(text):
        chars = "&#"
        for c in chars:
            text = text.replace(c, "\\" + c)
    
    
    def b(text):
        for ch in ['&','#']:
            if ch in text:
                text = text.replace(ch,"\\"+ch)
    
    
    import re
    def c(text):
        rx = re.compile('([&#])')
        text = rx.sub(r'\\\1', text)
    
    
    RX = re.compile('([&#])')
    def d(text):
        text = RX.sub(r'\\\1', text)
    
    
    def mk_esc(esc_chars):
        return lambda s: ''.join(['\\' + c if c in esc_chars else c for c in s])
    esc = mk_esc('&#')
    def e(text):
        esc(text)
    
    
    def f(text):
        text = text.replace('&', '\\&').replace('#', '\\#')
    
    
    def g(text):
        replacements = {"&": "\\&", "#": "\\#"}
        text = "".join([replacements.get(c, c) for c in text])
    
    
    def h(text):
        text = text.replace('&', r'\&')
        text = text.replace('#', r'\#')
    
    
    def i(text):
        text = text.replace('&', r'\&').replace('#', r'\#')
    

    Timed like this:

    python -mtimeit -s"import time_functions" "time_functions.a('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.b('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.c('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.d('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.e('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.f('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.g('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.h('abc&def#ghi')"
    python -mtimeit -s"import time_functions" "time_functions.i('abc&def#ghi')"
    

    Replacing 17 characters

    Here's similar code to do the same but with more characters to escape (\`*_{}[]()>#+-.!$):

    def a(text):
        chars = "\\`*_{}[]()>#+-.!$"
        for c in chars:
            text = text.replace(c, "\\" + c)
    
    
    def b(text):
        for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
            if ch in text:
                text = text.replace(ch,"\\"+ch)
    
    
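    import re
    def c(text):
        # character class covering all 17 characters, compiled on every call
        rx = re.compile(r'([\\`*_{}\[\]()>#+\-.!$])')
        text = rx.sub(r'\\\1', text)
    
    
    # same pattern, compiled once at import time
    RX = re.compile(r'([\\`*_{}\[\]()>#+\-.!$])')
    def d(text):
        text = RX.sub(r'\\\1', text)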
    
    
    def mk_esc(esc_chars):
        return lambda s: ''.join(['\\' + c if c in esc_chars else c for c in s])
    esc = mk_esc('\\`*_{}[]()>#+-.!$')
    def e(text):
        esc(text)
    
    
    def f(text):
        text = text.replace('\\', '\\\\').replace('`', '\\`').replace('*', '\\*').replace('_', '\\_').replace('{', '\\{').replace('}', '\\}').replace('[', '\\[').replace(']', '\\]').replace('(', '\\(').replace(')', '\\)').replace('>', '\\>').replace('#', '\\#').replace('+', '\\+').replace('-', '\\-').replace('.', '\\.').replace('!', '\\!').replace('$', '\\$')
    
    
    def g(text):
        replacements = {
            "\\": "\\\\",
            "`": "\\`",
            "*": "\\*",
            "_": "\\_",
            "{": "\\{",
            "}": "\\}",
            "[": "\\[",
            "]": "\\]",
            "(": "\\(",
            ")": "\\)",
            ">": "\\>",
            "#": "\\#",
            "+": "\\+",
            "-": "\\-",
            ".": "\\.",
            "!": "\\!",
            "$": "\\$",
        }
        text = "".join([replacements.get(c, c) for c in text])
    
    
    def h(text):
        text = text.replace('\\', r'\\')
        text = text.replace('`', r'\`')
        text = text.replace('*', r'\*')
        text = text.replace('_', r'\_')
        text = text.replace('{', r'\{')
        text = text.replace('}', r'\}')
        text = text.replace('[', r'\[')
        text = text.replace(']', r'\]')
        text = text.replace('(', r'\(')
        text = text.replace(')', r'\)')
        text = text.replace('>', r'\>')
        text = text.replace('#', r'\#')
        text = text.replace('+', r'\+')
        text = text.replace('-', r'\-')
        text = text.replace('.', r'\.')
        text = text.replace('!', r'\!')
        text = text.replace('$', r'\$')
    
    
    def i(text):
        text = text.replace('\\', r'\\').replace('`', r'\`').replace('*', r'\*').replace('_', r'\_').replace('{', r'\{').replace('}', r'\}').replace('[', r'\[').replace(']', r'\]').replace('(', r'\(').replace(')', r'\)').replace('>', r'\>').replace('#', r'\#').replace('+', r'\+').replace('-', r'\-').replace('.', r'\.').replace('!', r'\!').replace('$', r'\$')
    

    Here are the results for the same input string abc&def#ghi:

    • a) 100000 loops, best of 3: 6.72 μs per loop
    • b) 100000 loops, best of 3: 2.64 μs per loop
    • c) 100000 loops, best of 3: 11.9 μs per loop
    • d) 100000 loops, best of 3: 4.92 μs per loop
    • e) 100000 loops, best of 3: 2.96 μs per loop
    • f) 100000 loops, best of 3: 4.29 μs per loop
    • g) 100000 loops, best of 3: 4.68 μs per loop
    • h) 100000 loops, best of 3: 4.73 μs per loop
    • i) 100000 loops, best of 3: 4.24 μs per loop

    And with a longer input string (## *Something* and [another] thing in a longer sentence with {more} things to replace$):

    • a) 100000 loops, best of 3: 7.59 μs per loop
    • b) 100000 loops, best of 3: 6.54 μs per loop
    • c) 100000 loops, best of 3: 16.9 μs per loop
    • d) 100000 loops, best of 3: 7.29 μs per loop
    • e) 100000 loops, best of 3: 12.2 μs per loop
    • f) 100000 loops, best of 3: 5.38 μs per loop
    • g) 10000 loops, best of 3: 21.7 μs per loop
    • h) 100000 loops, best of 3: 5.7 μs per loop
    • i) 100000 loops, best of 3: 5.13 μs per loop

    Adding a couple of variants:

    def ab(text):
        for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
            text = text.replace(ch,"\\"+ch)
    
    
    def ba(text):
        chars = "\\`*_{}[]()>#+-.!$"
        for c in chars:
            if c in text:
                text = text.replace(c, "\\" + c)
    

    With the shorter input:

    • ab) 100000 loops, best of 3: 7.05 μs per loop
    • ba) 100000 loops, best of 3: 2.4 μs per loop

    With the longer input:

    • ab) 100000 loops, best of 3: 7.71 μs per loop
    • ba) 100000 loops, best of 3: 6.08 μs per loop

    So I'm going to use ba for readability and speed.

    Addendum

    Prompted by haccks in the comments, one difference between ab and ba is the if c in text: check. Let's test them against two more variants:

    def ab_with_check(text):
        for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
            if ch in text:
                text = text.replace(ch,"\\"+ch)
    
    def ba_without_check(text):
        chars = "\\`*_{}[]()>#+-.!$"
        for c in chars:
            text = text.replace(c, "\\" + c)
    

    Times are in μs per loop on Python 2.7.14 and 3.6.3, measured on a different machine from the earlier set, so they cannot be compared directly with it.

    ╭────────────╥──────┬───────────────┬──────┬──────────────────╮
    │ Py, input  ║  ab  │ ab_with_check │  ba  │ ba_without_check │
    ╞════════════╬══════╪═══════════════╪══════╪══════════════════╡
    │ Py2, short ║ 8.81 │    4.22       │ 3.45 │    8.01          │
    │ Py3, short ║ 5.54 │    1.34       │ 1.46 │    5.34          │
    ├────────────╫──────┼───────────────┼──────┼──────────────────┤
    │ Py2, long  ║ 9.3  │    7.15       │ 6.85 │    8.55          │
    │ Py3, long  ║ 7.43 │    4.38       │ 4.41 │    7.02          │
    └────────────╨──────┴───────────────┴──────┴──────────────────┘
    

    We can conclude that:

    • Those with the check are up to 4x faster than those without the check

    • ab_with_check is slightly in the lead on Python 3, but ba (with check) has a greater lead on Python 2

    • However, the biggest lesson here is Python 3 is up to 3x faster than Python 2! There's not a huge difference between the slowest on Python 3 and fastest on Python 2!
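
    To re-run the with-check versus without-check comparison on your own interpreter, something like the following could be used. It is a rough sketch on Python 3 with the standard timeit module; the function names and the repeat counts are assumptions made here, and the numbers it prints will differ from the tables above.

```python
import timeit

CHARS = "\\`*_{}[]()>#+-.!$"  # the 17 characters escaped in the answer


def escape_with_check(text):
    """Skip the replace call when the character is absent."""
    for c in CHARS:
        if c in text:
            text = text.replace(c, "\\" + c)
    return text


def escape_without_check(text):
    """Always call replace, even when there is nothing to replace."""
    for c in CHARS:
        text = text.replace(c, "\\" + c)
    return text


if __name__ == "__main__":
    short = "abc&def#ghi"
    # Both variants must produce the same escaped string.
    assert escape_with_check(short) == escape_without_check(short)
    for func in (escape_with_check, escape_without_check):
        runs = timeit.repeat(lambda: func(short), number=20_000, repeat=3)
        print(f"{func.__name__}: {min(runs) / 20_000 * 1e6:.2f} us per loop")
```

    The backslash is listed first in CHARS on purpose: escaping it later would double the backslashes that earlier replacements had just inserted.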

  • 2020-11-22 11:34

    Are you always going to prepend a backslash? If so, try

    import re
    rx = re.compile('([&#])')
    #                  ^^ fill in the characters here.
    strs = rx.sub(r'\\\1', strs)
    

    It may not be the most efficient method but I think it is the easiest.
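
    If the set of characters is not fixed, the character class can be built with re.escape so that characters such as ] or - stay literal. The sketch below assumes a hypothetical helper named make_escaper; it is one way to generalize the snippet above, not the only one.

```python
import re


def make_escaper(chars):
    """Return a function that prefixes each character in `chars` with a backslash."""
    # re.escape keeps characters such as ']', '-' and '\\' literal inside the class.
    pattern = re.compile("[" + re.escape(chars) + "]")
    # A function replacement avoids backslash handling in replacement templates.
    return lambda text: pattern.sub(lambda m: "\\" + m.group(0), text)


escape_md = make_escaper("\\`*_{}[]()>#+-.!$")
print(make_escaper("&#")("abc&def#ghi"))  # abc\&def\#ghi
print(escape_md("[x] *y*"))               # \[x\] \*y\*
```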

  • 2020-11-22 11:36
    >>> a = '&#'
    >>> print(a.replace('&', r'\&'))
    \&#
    >>> print(a.replace('#', r'\#'))
    &\#
    >>> 
    

    You want to use a 'raw' string (denoted by the 'r' prefixing the replacement string), since raw strings do not treat the backslash specially.
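
    A short sketch of what the raw-string prefix changes, assuming Python 3: the raw and the doubled-backslash spellings produce the same two-character string, and the two replacements can be chained to escape both characters.

```python
raw = r"\&"       # raw string: the backslash is kept as written
escaped = "\\&"   # regular string: the backslash must be doubled
print(raw == escaped)  # True
print(len(raw))        # 2

# Chaining the two replacements escapes both characters.
text = "&#".replace("&", r"\&").replace("#", r"\#")
print(text)            # \&\#
```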
