Fastest way to replace multiple strings in a huge string

野的像风 2020-12-13 04:51

I'm looking for the fastest way to replace multiple (~500) substrings of a big (~1 MB) string. Whatever I have tried, it seems that String.Replace is the fastest way of doing it.
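
For reference, one reading of the question is a list of old/new pairs applied to a single string. A minimal baseline sketch under that assumption follows; the class and method names are hypothetical and are not taken from the thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical baseline: apply each (oldValue, newValue) pair in turn.
public static class BaselineReplace
{
    public static string ReplaceAll(string text, IEnumerable<KeyValuePair<string, string>> pairs)
    {
        foreach (var pair in pairs)
        {
            // Each call scans the whole text and allocates a new string,
            // so ~500 pairs means ~500 passes over the ~1 MB input.
            text = text.Replace(pair.Key, pair.Value);
        }
        return text;
    }
}
```

Because each pass works on the output of the previous one, later pairs can match text produced by earlier pairs.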

8 answers
  • 2020-12-13 05:05

    Using unsafe and compiled as x64

    result:

    Implementation       | Exec   | GC
    #1 Simple            | 4706ms |  0ms
    #2 Simple parallel   | 2265ms |  0ms
    #3 ParallelSubstring |  800ms | 21ms
    #4 Fredou unsafe     |  432ms | 15ms
    

    Take Erti-Chris Eelmaa's code and replace my previous implementation with this one.

    I don't think I will do another iteration, but I did learn a few things about unsafe code, which is a good thing :-)

        private unsafe static void FredouImplementation(string input, int inputLength, string replace, string[] replaceBy)
        {
            var indexes = new List<int>();
    
            //input = "ABCDABCABCDABCABCDABCABCDABCD";
            //inputLength = input.Length;
            //replaceBy = new string[] { "AA", "BB", "CC", "DD", "EE" };
    
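            // Hand-rolled IndexOf to save a few ms. It compares two chars at a
            // time as one 32-bit int, so it assumes `replace` is exactly two chars long.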
            int len = inputLength;
    
            fixed (char* i = input, r = replace)
            {
                int replaceValAsInt = *((int*)r);
    
                while (--len > -1)
                {
                    if (replaceValAsInt == *((int*)&i[len]))
                    {
                        indexes.Add(len--);
                    }
                }                
            }
    
            var idx = indexes.ToArray();
            len = indexes.Count;
    
            Parallel.For(0, replaceBy.Length, l =>
                Process(input, inputLength, replaceBy[l], idx, len)
            );
        }
    
        private unsafe static void Process(string input, int len, string replaceBy, int[] idx, int idxLen)
        {
            var output = new char[len];
    
            fixed (char* o = output, i = input, r = replaceBy)
            {
                int replaceByValAsInt = *((int*)r);
    
                //direct copy, simulate string.copy
                while (--len > -1)
                {
                    o[len] = i[len];
                }
    
                while (--idxLen > -1)
                {
                    ((int*)&o[idx[idxLen]])[0] = replaceByValAsInt;
                }
            }
    
            //Console.WriteLine(output);
        }
    
  • 2020-12-13 05:05

    It sounds like you are tokenising the string? I would look at producing a buffer and indexing your tokens, or at using a templating engine.

    As a naive example, you could use code generation to produce the following method:

    public string Produce(string tokenValue){
    
        var builder = new StringBuilder();
        builder.Append("A");
        builder.Append(tokenValue);
        builder.Append("D");
    
        return builder.ToString();
    
    }
    

    If you're running enough iterations, the time spent building the template will pay for itself. You can then also call that method in parallel with no side effects. Also look at interning your strings.
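
    The "index your tokens once" idea can also be sketched without code generation. The TokenTemplate class below is hypothetical and assumes a single token that appears in a fixed input; it splits the input once and then rebuilds the output for each value.

```csharp
using System;
using System.Text;

// Hypothetical helper: split the input around one token once, then reuse the pieces.
public sealed class TokenTemplate
{
    private readonly string[] _segments; // literal text between token occurrences
    private readonly int _literalLength; // total length of all literal segments

    public TokenTemplate(string input, string token)
    {
        _segments = input.Split(new[] { token }, StringSplitOptions.None);
        foreach (var segment in _segments)
        {
            _literalLength += segment.Length;
        }
    }

    // Reads only immutable state, so it can be called from several threads at once.
    public string Produce(string tokenValue)
    {
        int holes = _segments.Length - 1;
        // Pre-size the builder so it never has to grow while appending.
        var builder = new StringBuilder(_literalLength + holes * tokenValue.Length);
        for (int i = 0; i < _segments.Length; i++)
        {
            if (i > 0) builder.Append(tokenValue);
            builder.Append(_segments[i]);
        }
        return builder.ToString();
    }
}
```

    The search cost is paid once in the constructor; each Produce call is then a sequence of appends into a buffer of known size.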

  • 2020-12-13 05:07

    As I was mildly interested in this problem, I crafted a few solutions. With hardcore optimizations it's possible to bring the time down even further.

    To get the latest source: https://github.com/ChrisEelmaa/StackOverflow/blob/master/FastReplacer.cs

    And the output

    -------------------------------------------------------
    | Implementation       | Average | Separate runs      |
    |----------------------+---------+--------------------|
    | Simple               |    3485 | 9002, 4497, 443, 0 |
    | SimpleParallel       |    1298 | 3440, 1606, 146, 0 |
    | ParallelSubstring    |     470 | 1259, 558, 64, 0   |
    | Fredou unsafe        |     356 | 953, 431, 41, 0    |
    | Unsafe+unmanaged_mem |      92 | 229, 114, 18, 8    |
    -------------------------------------------------------
    

    You probably won't beat the .NET guys by crafting your own replace method; it's most likely already using unsafe code. I do believe you can get it down by a factor of two if you write it completely in C.

    My implementations might be buggy, but you can get the general idea.

  • 2020-12-13 05:08

    As your input string is at most about 2 MB, I don't foresee any memory allocation problem. You can load everything into memory and replace your data there.

    If you ALWAYS need to replace BC with AA, String.Replace will be fine. But if you need more control, you could use Regex.Replace:

    var input  = "ABCDABCABCDABCABCDABCABCDABCD";
    var output = Regex.Replace(input, "BC", (match) =>
    {
        // here you can add some juice, like counters, etc
        return "AA";
    });
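
    As a sketch of the "counters" idea, the evaluator can count matches while it replaces them. The CountingReplace helper below is a hypothetical name, and escaping the search text with Regex.Escape is an assumption that it should be treated as a literal.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: replace literal occurrences and report how many were found.
public static class CountingReplace
{
    public static string Replace(string input, string oldValue, string newValue, out int count)
    {
        int matches = 0; // local counter: lambdas cannot capture out parameters
        string output = Regex.Replace(input, Regex.Escape(oldValue), match =>
        {
            matches++;       // called once per match, in order
            return newValue; // the evaluator's return value is inserted as-is
        });
        count = matches;
        return output;
    }
}
```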
    
  • 2020-12-13 05:17

    You probably won't get anything faster than String.Replace (unless you go native) because, IIRC, String.Replace is implemented in the CLR itself for maximum performance. If you want 100% performance, you can conveniently interface with native ASM code via C++/CLI and go from there.

  • I had a similar issue on a project and implemented a Regex solution to perform multiple, case-insensitive replacements on a file.

    For efficiency, I made it a requirement to go through the original string only once.

    I've published a simple console app to test some strategies on https://github.com/nmcc/Spikes/tree/master/StringMultipleReplacements

    The code for the Regex solution is similar to this:

    Dictionary<string, string> replacements = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    // Fill the dictionary with the proper replacements:

    StringBuilder patternBuilder = new StringBuilder();
    patternBuilder.Append('(');

    bool firstReplacement = true;

    foreach (var replacement in replacements.Keys)
    {
        if (!firstReplacement)
            patternBuilder.Append('|');
        else
            firstReplacement = false;

        patternBuilder.Append('(');
        patternBuilder.Append(Regex.Escape(replacement));
        patternBuilder.Append(')');
    }
    patternBuilder.Append(')');

    var regex = new Regex(patternBuilder.ToString(), RegexOptions.IgnoreCase);

    return regex.Replace(sourceContent, new MatchEvaluator(match => replacements[match.Groups[1].Value]));
    

    EDIT: The execution times running the test application on my computer are:

    • Looping through the replacements calling string.Substring() (CASE SENSITIVE): 2ms
    • Single pass using Regex with multiple replacements at once (Case insensitive): 8ms
    • Looping through replacements using a ReplaceIgnoreCase Extension (Case insensitive): 55ms
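
    For readers who want something self-contained, the single-pass approach above might be wrapped roughly as follows. This is a sketch, not the code from the linked repository: the MultiReplace name, the longest-key-first ordering, and the TryGetValue fallback are assumptions added here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the single-pass, case-insensitive regex approach.
public static class MultiReplace
{
    // Assumes `replacements` was created with StringComparer.OrdinalIgnoreCase,
    // so that a match in any casing finds its key.
    public static string ReplaceAll(string source, IDictionary<string, string> replacements)
    {
        if (replacements.Count == 0) return source;

        // Try longer keys first so "abcd" wins over "abc" at the same position.
        string pattern = string.Join("|", replacements.Keys
            .OrderByDescending(key => key.Length)
            .Select(key => Regex.Escape(key)));

        var regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        // If the regex and the dictionary ever disagree on casing rules,
        // leave the matched text untouched instead of throwing.
        return regex.Replace(source, match =>
            replacements.TryGetValue(match.Value, out var value) ? value : match.Value);
    }
}
```

    Unlike a loop of String.Replace calls, a single pass never re-scans replaced text, so one replacement's output cannot be matched by another key.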