I'm looking for the fastest way to replace multiple (~500) substrings of a big (~1 MB) string. Whatever I have tried, it seems that String.Replace is the fastest way of doing it.
Using unsafe code, compiled as x64. Results:
Implementation | Exec | GC
#1 Simple | 4706ms | 0ms
#2 Simple parallel | 2265ms | 0ms
#3 ParallelSubstring | 800ms | 21ms
#4 Fredou unsafe | 432ms | 15ms
Take the code of Erti-Chris Eelmaa and replace my previous implementation with this one.
I don't think I will do another iteration, but I did learn a few things about unsafe code, which is a good thing :-)
private unsafe static void FredouImplementation(string input, int inputLength, string replace, string[] replaceBy)
{
    var indexes = new List<int>();
    //input = "ABCDABCABCDABCABCDABCABCDABCD";
    //inputLength = input.Length;
    //replaceBy = new string[] { "AA", "BB", "CC", "DD", "EE" };
    //my own string.indexof to save a few ms
    int len = inputLength;
    fixed (char* i = input, r = replace)
    {
        int replaceValAsInt = *((int*)r);
        while (--len > -1)
        {
            if (replaceValAsInt == *((int*)&i[len]))
            {
                indexes.Add(len--);
            }
        }
    }
    var idx = indexes.ToArray();
    len = indexes.Count;
    Parallel.For(0, replaceBy.Length, l =>
        Process(input, inputLength, replaceBy[l], idx, len)
    );
}
private unsafe static void Process(string input, int len, string replaceBy, int[] idx, int idxLen)
{
    var output = new char[len];
    fixed (char* o = output, i = input, r = replaceBy)
    {
        int replaceByValAsInt = *((int*)r);
        //direct copy, simulate string.copy
        while (--len > -1)
        {
            o[len] = i[len];
        }
        while (--idxLen > -1)
        {
            ((int*)&o[idx[idxLen]])[0] = replaceByValAsInt;
        }
    }
    //Console.WriteLine(output);
}
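To make the idea easier to follow, here is a rough sketch of the same two steps in safe code: find the match positions once, then copy the input and overwrite those positions for each replacement. It keeps the assumption the code above relies on, namely that the search string and every replacement are exactly two characters long. The names TwoCharReplaceSketch, FindPairIndexes and ApplyReplacement are made up for illustration, the scan runs forward rather than backward, and I have not benchmarked it, so treat it as an explanation rather than a faster alternative.

```csharp
using System;
using System.Collections.Generic;

static class TwoCharReplaceSketch
{
    // Step 1: collect the start index of every non-overlapping
    // occurrence of a two-character pattern (hypothetical helper).
    public static int[] FindPairIndexes(string input, string pattern)
    {
        if (pattern.Length != 2)
            throw new ArgumentException("Pattern must be exactly 2 characters.", nameof(pattern));

        var indexes = new List<int>();
        for (int i = 0; i + 1 < input.Length; i++)
        {
            if (input[i] == pattern[0] && input[i + 1] == pattern[1])
            {
                indexes.Add(i);
                i++; // skip the second character of the match
            }
        }
        return indexes.ToArray();
    }

    // Step 2: copy the input and overwrite each recorded position
    // with a two-character replacement (hypothetical helper).
    public static string ApplyReplacement(string input, int[] indexes, string replaceBy)
    {
        if (replaceBy.Length != 2)
            throw new ArgumentException("Replacement must be exactly 2 characters.", nameof(replaceBy));

        char[] output = input.ToCharArray();
        foreach (int idx in indexes)
        {
            output[idx] = replaceBy[0];
            output[idx + 1] = replaceBy[1];
        }
        return new string(output);
    }
}
```

Because the index array is computed once and only read afterwards, ApplyReplacement can be called for each replacement string in parallel, which is what the Parallel.For call above does with Process.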
It sounds like you are tokenising the string? I would look at producing a buffer and indexing your tokens, or at using a templating engine.
As a naive example, you could use code generation to produce the following method:
public string Produce(string tokenValue)
{
    var builder = new StringBuilder();
    builder.Append("A");
    builder.Append(tokenValue);
    builder.Append("D");
    return builder.ToString();
}
If you're running the iterations enough times, the time spent building the template will pay for itself. You can then also call that method in parallel with no side effects. Also look at interning your strings.
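If code generation is more than you need, a similar effect can be had by splitting the input on the token once and reusing the pieces. The following is only a sketch of that idea: the TokenTemplate class and its members are names I made up, and it assumes a single, non-empty token that is matched ordinally.

```csharp
using System;
using System.Text;

// Hypothetical helper: splits the input on the token once,
// then rebuilds the string for any number of token values.
sealed class TokenTemplate
{
    private readonly string[] _segments; // literal text between tokens
    private readonly int _literalLength; // total length of all segments

    public TokenTemplate(string input, string token)
    {
        if (string.IsNullOrEmpty(token))
            throw new ArgumentException("Token must not be empty.", nameof(token));

        _segments = input.Split(new[] { token }, StringSplitOptions.None);
        foreach (string segment in _segments)
            _literalLength += segment.Length;
    }

    // Reads only immutable state, so it can be called from several threads.
    public string Produce(string tokenValue)
    {
        int tokenCount = _segments.Length - 1;
        var builder = new StringBuilder(_literalLength + tokenValue.Length * tokenCount);
        for (int i = 0; i < _segments.Length; i++)
        {
            if (i > 0)
                builder.Append(tokenValue);
            builder.Append(_segments[i]);
        }
        return builder.ToString();
    }
}
```

Usage would look like `var template = new TokenTemplate(input, "BC");` followed by `template.Produce("AA")` for each replacement value. The search for the token happens once in the constructor, so each Produce call is just a sequence of appends into a pre-sized buffer.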
As I was mildly interested in this problem, I crafted a few solutions. With hardcore optimizations it's possible to bring the time down even more.
To get the latest source: https://github.com/ChrisEelmaa/StackOverflow/blob/master/FastReplacer.cs
And the output:
-------------------------------------------------------
| Implementation       | Average | Separate runs      |
|----------------------+---------+--------------------|
| Simple               | 3485    | 9002, 4497, 443, 0 |
| SimpleParallel       | 1298    | 3440, 1606, 146, 0 |
| ParallelSubstring    | 470     | 1259, 558, 64, 0   |
| Fredou unsafe        | 356     | 953, 431, 41, 0    |
| Unsafe+unmanaged_mem | 92      | 229, 114, 18, 8    |
-------------------------------------------------------
You probably won't beat the .NET guys by crafting your own replace method; theirs is most likely already using unsafe code. I do believe you can get the time down by a factor of two if you write it completely in C.
My implementations might be buggy, but you can get the general idea.
As your input string can be as long as 2 MB, I don't foresee any memory allocation problem. You can load everything into memory and replace your data.
If you ALWAYS need to replace BC with AA, String.Replace will be fine. But if you need more control, you could use Regex.Replace:
var input = "ABCDABCABCDABCABCDABCABCDABCD";
var output = Regex.Replace(input, "BC", (match) =>
{
    // here you can add some juice, like counters, etc
    return "AA";
});
You probably won't get anything faster than String.Replace (unless you go native) because, iirc, String.Replace is implemented in the CLR itself for maximum performance. If you want 100% performance, you can interface with native ASM code via C++/CLI and go from there.
I had a similar issue on a project and implemented a Regex solution to perform multiple, case-insensitive replacements on a file.
For efficiency, I made it a requirement to go through the original string only once.
I've published a simple console app to test some strategies on https://github.com/nmcc/Spikes/tree/master/StringMultipleReplacements
The code for the Regex solution is similar to this:
Dictionary<string, string> replacements = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
// Fill the dictionary with the proper replacements:
StringBuilder patternBuilder = new StringBuilder();
patternBuilder.Append('(');
bool firstReplacement = true;
foreach (var replacement in replacements.Keys)
{
    if (!firstReplacement)
        patternBuilder.Append('|');
    else
        firstReplacement = false;
    patternBuilder.Append('(');
    patternBuilder.Append(Regex.Escape(replacement));
    patternBuilder.Append(')');
}
patternBuilder.Append(')');
var regex = new Regex(patternBuilder.ToString(), RegexOptions.IgnoreCase);
return regex.Replace(sourceContent, new MatchEvaluator(match => replacements[match.Groups[1].Value]));
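One thing to watch with a pattern built this way: .NET tries the alternatives from left to right, so if one key is a prefix of another (say "ab" and "abc"), whichever is listed first wins. Below is a compact, self-contained variant that sorts the keys longest-first to avoid that. It is a sketch rather than the code from the repository above; MultiReplaceSketch and ReplaceAll are names I made up, and it assumes the dictionary was created with a case-insensitive comparer, as in the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class MultiReplaceSketch
{
    // Hypothetical helper: replaces every key found in source with its
    // mapped value in a single pass. Assumes replacements uses a
    // case-insensitive comparer such as StringComparer.OrdinalIgnoreCase.
    public static string ReplaceAll(string source, IDictionary<string, string> replacements)
    {
        if (replacements.Count == 0)
            return source; // an empty pattern would match at every position

        // Longer keys first, so "abc" is tried before its prefix "ab".
        IEnumerable<string> alternatives = replacements.Keys
            .OrderByDescending(key => key.Length)
            .Select(key => Regex.Escape(key));

        var regex = new Regex(string.Join("|", alternatives), RegexOptions.IgnoreCase);

        // The matched text is looked up directly; the case-insensitive
        // dictionary maps "ABC" back to the "abc" entry.
        return regex.Replace(source, match => replacements[match.Value]);
    }
}
```

If the same set of replacements is applied to many inputs, it is probably worth building the Regex once and reusing it instead of constructing it on every call.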
EDIT: The execution times running the test application on my computer are: