Fastest way to replace multiple strings in a huge string

野的像风 2020-12-13 04:51

I'm looking for the fastest way to replace multiple (~500) substrings of a big (~1 MB) string. Whatever I have tried, it seems that String.Replace is the fastest way of doing

8 answers
  • 2020-12-13 05:21

    My approach is a little like templating: it takes the input string and pulls out (removes) the substrings that are to be replaced. It then takes the remaining parts of the string (the template) and joins them with each new replacement substring. The joins run in parallel (template + each replacement string), producing one output string per replacement.

    I think what I am explaining above may be clearer with code. This uses your sample inputs from above:

    const char splitter = '\t';   // use a char that will not appear in your string
    
    string input = "ABCDABCABCDABCABCDABCABCDABCD";
    string oldString = "BC";
    string[] newStrings = { "AA", "BB", "CC", "DD", "EE" };
    
    // In input, replace oldString with tabs, so that we can do String.Split later
    var inputTabbed = input.Replace(oldString, splitter.ToString());
    // ABCDABCABCDABCABCDABCABCDABCD --> A\tDA\tA\tDA\tA\tDA\tA\tDA\tD
    
    var inputs = inputTabbed.Split(splitter);
    /* inputs (the template) now contains:
    [0] "A" 
    [1] "DA"
    [2] "A" 
    [3] "DA"
    [4] "A" 
    [5] "DA"
    [6] "A" 
    [7] "DA"
    [8] "D" 
    */
    
    // In parallel, build the output using the template (inputs)
    // and the replacement strings (newStrings).
    // Note: Parallel.ForEach does not guarantee the order in which outputs are added.
    var outputs = new List<string>();
    Parallel.ForEach(newStrings, iteration =>
        {
            var output = string.Join(iteration, inputs);
            // only lock the list operation
            lock (outputs) { outputs.Add(output); }
        });
    
    foreach (var output in outputs)
        Console.WriteLine(output);
    

    Output:

    AAADAAAAAADAAAAAADAAAAAADAAAD
    ABBDABBABBDABBABBDABBABBDABBD
    ACCDACCACCDACCACCDACCACCDACCD
    ADDDADDADDDADDADDDADDADDDADDD
    AEEDAEEAEEDAEEAEEDAEEAEEDAEED
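
    Because the iterations run in parallel, the list above may not be filled in the same order as newStrings. If the order matters, one option is to write each result into a pre-sized array by index. The following is only a sketch under that assumption; the class and method names (OrderedTemplating, BuildOutputs) are made up for illustration and are not part of the original code:

```csharp
using System;
using System.Threading.Tasks;

public static class OrderedTemplating
{
    // Hypothetical helper: returns one output per replacement string,
    // in the same order as newStrings.
    public static string[] BuildOutputs(string[] templateParts, string[] newStrings)
    {
        var outputs = new string[newStrings.Length];

        // Each iteration writes only to its own slot, so no lock is needed
        // and outputs[i] always corresponds to newStrings[i].
        Parallel.For(0, newStrings.Length, i =>
        {
            outputs[i] = string.Join(newStrings[i], templateParts);
        });

        return outputs;
    }
}
```

    Calling OrderedTemplating.BuildOutputs(inputs, newStrings) with the variables from the sample above would take the place of the List<string> and the lock.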
    

    So that you can compare, here is a complete method that can be used in the test code by Erti-Chris Eelmaa:

    private static void TemplatingImp(string input, string replaceWhat, IEnumerable<string> replaceIterations)
    {
        const char splitter = '\t';   // use a char that will not appear in your string
    
        var inputTabbed = input.Replace(replaceWhat, splitter.ToString());
        var inputs = inputTabbed.Split(splitter);
    
        // In parallel, build the output using the split parts (inputs)
        // and the replacement strings (replaceIterations).
        // The results are discarded; this method only exists to time the work.
        Parallel.ForEach(replaceIterations, iteration =>
        {
            var output = string.Join(iteration, inputs);
        });
    }
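
    The tab sentinel only works if the input never contains a tab. As a possible alternative, String.Split has an overload that splits on whole substrings, which removes the need for a sentinel. This is a sketch rather than a drop-in for the benchmark: the names SplitTemplating and ReplaceEach are hypothetical, it returns the results instead of discarding them, and it uses PLINQ's AsOrdered to keep the outputs in input order:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitTemplating
{
    // Hypothetical helper: one output string per replacement, in input order.
    public static string[] ReplaceEach(
        string input, string replaceWhat, IEnumerable<string> replaceIterations)
    {
        // Split directly on the substring to replace; no sentinel char needed.
        string[] parts = input.Split(new[] { replaceWhat }, StringSplitOptions.None);

        // Join the template parts with each replacement in parallel.
        // AsOrdered keeps the results in the order of replaceIterations.
        return replaceIterations
            .AsParallel()
            .AsOrdered()
            .Select(replacement => string.Join(replacement, parts))
            .ToArray();
    }
}
```

    Whether this is faster than the sentinel version has not been measured here; it would need to be run in the same test harness.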
    
  • 2020-12-13 05:31

    I made a variation on Fredou's code that requires fewer comparisons, because it works on int* instead of char*: two chars are compared at once as a single 32-bit value. It still requires n iterations for a string of length n; it just does less comparing per iteration. You could get down to n/2 iterations if the string is neatly aligned by 2 (so the string to replace can only occur at indexes 0, 2, 4, 6, 8, etc.), or even n/4 if it's aligned by 4 (you'd use long*). I'm not very good at bit fiddling like this, so someone might be able to spot an obvious inefficiency in my code. I verified that the result of my variation is the same as that of the simple string.Replace.

    Additionally, I expect that some gains could be made in the 500 string.Copy calls that it makes, but I haven't looked into that yet.

    My results (Fredou II):

    IMPLEMENTATION       |  EXEC MS | GC MS
    #1 Simple            |     6816 |     0
    #2 Simple parallel   |     4202 |     0
    #3 ParallelSubstring |    27839 |     4
    #4 Fredou I          |     2103 |   106
    #5 Fredou II         |     1334 |    91
    

    So Fredou II takes about 2/3 of the time of Fredou I (1334 ms vs. 2103 ms on x86; x64 was about the same).

    For this code:

    private unsafe struct TwoCharStringChunk
    {
      public fixed char chars[2];
    }
    
    private unsafe static void FredouImplementation_Variation1(string input, int inputLength, string replace, TwoCharStringChunk[] replaceBy)
    {
      var output = new string[replaceBy.Length];
    
      for (var i = 0; i < replaceBy.Length; ++i)
        output[i] = string.Copy(input);
    
      var r = new TwoCharStringChunk();
      r.chars[0] = replace[0];
      r.chars[1] = replace[1];
    
      _staticToReplace = r;
    
      Parallel.For(0, replaceBy.Length, l => Process_Variation1(output[l], input, inputLength, replaceBy[l]));
    }
    
    private static TwoCharStringChunk _staticToReplace;
    
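    private static unsafe void Process_Variation1(string output, string input, int len, TwoCharStringChunk replaceBy)
    {
      int n = 0;
      int m = len - 1;   // last index at which a two-char match can start
    
      // Pin both strings and the static chunk so raw pointers stay valid.
      // Note: this writes into 'output' in place, bypassing string immutability.
      fixed (char* i = input, o = output, chars = _staticToReplace.chars)
      {
        // Reinterpret each pair of chars (2 x 16 bits) as one 32-bit int.
        var replaceValAsInt = *((int*)chars);
        var replaceByValAsInt = *((int*)replaceBy.chars);
    
        while (n < m)
        {
          // Read the two chars at n and n + 1 as a single int.
          var compareInput = *((int*)&i[n]);
    
          if (compareInput == replaceValAsInt)
          {
            // Match: overwrite both chars with one int store, then skip past them.
            ((int*)&o[n])[0] = replaceByValAsInt;
            n += 2;
          }
          else
          {
            ++n;
          }
        }
      }
    }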
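
    For readers who want to see the two-chars-as-one-int comparison without unsafe code or in-place writes to a string, here is a sketch of the same idea on a char[] buffer. It is an illustration only, not the benchmarked implementation: the names SafeTwoCharReplace and Pack are hypothetical, and in safe code the packing is unlikely to give the speed-up measured above.

```csharp
using System;

public static class SafeTwoCharReplace
{
    // Packs two UTF-16 code units into one 32-bit value
    // (first char in the low 16 bits, second char in the high 16 bits).
    private static int Pack(char first, char second) => first | (second << 16);

    // Replaces every non-overlapping occurrence of a two-char string.
    public static string Replace(string input, string replace, string replaceBy)
    {
        if (replace.Length != 2 || replaceBy.Length != 2)
            throw new ArgumentException("Both strings must be exactly two characters long.");

        char[] buffer = input.ToCharArray();   // mutable copy of the input
        int target = Pack(replace[0], replace[1]);

        int n = 0;
        int last = buffer.Length - 1;   // last index at which a match can start
        while (n < last)
        {
            // Compare both chars at once via the packed value.
            if (Pack(input[n], input[n + 1]) == target)
            {
                buffer[n] = replaceBy[0];
                buffer[n + 1] = replaceBy[1];
                n += 2;   // skip past the replaced pair
            }
            else
            {
                ++n;
            }
        }

        return new string(buffer);
    }
}
```

    The loop mirrors Process_Variation1: it reads from the original input, writes to the copy, and advances by two after a match.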
    

    The struct with the fixed buffer is not strictly necessary here and could have been replaced with a simple int field. However, by expanding char[2] to char[3], this code can be made to work with three-letter strings as well, which wouldn't be possible with an int field.

    It required some changes to the Program.cs as well, so here's the full gist:

    https://gist.github.com/JulianR/7763857

    EDIT: I'm not sure why my ParallelSubstring is so slow. I'm running .NET 4 in Release mode, no debugger, in either x86 or x64.
