Does anyone know of a faster method than String.Split()?

傲寒 2020-12-03 10:57

I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:

values = line.Split(delimiter);


        
14 answers
  • 2020-12-03 11:03

    I found this implementation, which is reported to be 30% faster, on Dejan Pelzel's blog. I quote from there:

    The Solution

    With this in mind, I set out to create a string splitter that would use an internal buffer, similar to a StringBuilder. It uses the very simple logic of going through the string and saving the value parts into the buffer as it goes along.

    // Member of the StringSplitter class; this.buffer is the reusable
    // string[] that the class allocates once in its constructor.
    public int Split(string value, char separator)
    {
        int resultIndex = 0;
        int startIndex = 0;
    
        // Find the mid-parts
        for (int i = 0; i < value.Length; i++)
        {
            if (value[i] == separator)
            {
                this.buffer[resultIndex] = value.Substring(startIndex, i - startIndex);
                resultIndex++;
                startIndex = i + 1;
            }
        }
    
        // Find the last part
        this.buffer[resultIndex] = value.Substring(startIndex, value.Length - startIndex);
        resultIndex++;
    
        // Number of parts written to the buffer
        return resultIndex;
    }

    How To Use

    The StringSplitter class is incredibly simple to use, as you can see in the example below. Just be careful to reuse the StringSplitter object rather than creating a new instance in a loop or for one-time use. In those cases it would be better to just use the built-in String.Split.

    var splitter = new StringSplitter(2);
    splitter.Split("Hello World", ' ');
    if (splitter.Results[0] == "Hello" && splitter.Results[1] == "World")
    {
        Console.WriteLine("It works!");
    }
    

    The Split method returns the number of items found, so you can easily iterate through the results like this:

    var splitter = new StringSplitter(2);
    var len = splitter.Split("Hello World", ' ');
    for (int i = 0; i < len; i++)
    {
        Console.WriteLine(splitter.Results[i]);
    }
    

    This approach has advantages and disadvantages.
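
    The quoted excerpt shows only the Split method, so here is a minimal sketch of how the surrounding class might look. The constructor argument, the Results property, and Split come from the usage in the quote; the buffer field's type, the Add helper, and the doubling growth policy are assumptions, not the blog's actual code.

```csharp
using System;

// Hypothetical reconstruction: only Split, Results, and the constructor
// argument appear in the quoted post; everything else is assumed.
public sealed class StringSplitter
{
    private string[] buffer;

    public StringSplitter(int bufferSize)
    {
        buffer = new string[bufferSize];
    }

    // Exposes the reused buffer. Only the first N entries are valid,
    // where N is the value returned by the latest Split call.
    public string[] Results => buffer;

    public int Split(string value, char separator)
    {
        int resultIndex = 0;
        int startIndex = 0;

        for (int i = 0; i < value.Length; i++)
        {
            if (value[i] == separator)
            {
                Add(value.Substring(startIndex, i - startIndex), resultIndex++);
                startIndex = i + 1;
            }
        }

        // The last part has no trailing separator.
        Add(value.Substring(startIndex), resultIndex++);
        return resultIndex;
    }

    // Assumed growth policy: double the buffer when a line has more
    // parts than the buffer can currently hold.
    private void Add(string part, int index)
    {
        if (index >= buffer.Length)
        {
            Array.Resize(ref buffer, Math.Max(1, buffer.Length * 2));
        }
        buffer[index] = part;
    }
}
```

    The trade-off is that Results still holds stale entries from earlier, longer lines, so callers have to rely on the returned count rather than on Results.Length.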

  • 2020-12-03 11:03

    As others have said, String.Split() will not always work well with CSV files. Consider a file that looks like this:

    "First Name","Last Name","Address","Town","Postcode"
    David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
    June,Robinson,"14, Abbey Court","Putney",SW6 4FG
    Greg,Hampton,"",,
    Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG
    

    (i.e. inconsistent use of quotation marks, strings containing commas and quotation marks, etc.)

    This CSV reading framework will deal with all of that, and is also very efficient:

    LumenWorks.Framework.IO.Csv by Sébastien Lorion
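
    To show what goes wrong with a plain String.Split(','), here is a minimal sketch of a quote-aware splitter for a single line. CsvLine and ParseFields are hypothetical names, not part of LumenWorks or the BCL, and the sketch assumes a quoted field never spans more than one line, which a real CSV reader has to handle.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only.
public static class CsvLine
{
    public static List<string> ParseFields(string line, char delimiter = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c); // delimiters inside quotes are literal
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote means a literal quote
                    i++;
                }
                else
                {
                    inQuotes = false; // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true; // opening quote
            }
            else if (c == delimiter)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString()); // last field has no trailing delimiter
        return fields;
    }
}
```

    On the third data row above, String.Split(',') yields six pieces because it cuts "14, Abbey Court" in two, whereas this sketch returns the five expected fields.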

  • 2020-12-03 11:04
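        // Requires System.Collections.Generic and compiling with /unsafe.
        public static unsafe List<string> SplitString(char separator, string input)
        {
            List<string> result = new List<string>();
            int start = 0; // index where the current field begins
            fixed (char* buffer = input)
            {
                for (int j = 0; j < input.Length; j++)
                {
                    if (buffer[j] == separator)
                    {
                        // Copy the field out rather than writing through the
                        // pointer: strings are immutable, and writing into the
                        // pinned buffer would corrupt the caller's input.
                        result.Add(new string(buffer, start, j - start));
                        start = j + 1;
                    }
                }
                // Last field (no trailing separator)
                result.Add(new string(buffer, start, input.Length - start));
            }
            return result;
        }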
    
  • 2020-12-03 11:08

    This is my solution:

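    Public Shared Function FastSplit(inputString As String, separator As String) As String()
            ' Start with a single empty field and grow the array as separators are found.
            Dim kwds() As String = {""}
            Dim k = 0
            Dim tmp As String = ""
    
            ' Mid is 1-based, so the loop must run through inputString.Length
            ' to include the last character.
            For l = 1 To inputString.Length
                tmp = Mid(inputString, l, 1)
                If tmp = separator Then
                    k += 1
                    tmp = ""
                    ReDim Preserve kwds(k)
                End If
                kwds(k) &= tmp
            Next
    
            Return kwds
    End Function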
    

    Here is a version with benchmarking:

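    Public Shared Function FastSplit(inputString As String, separator As String) As String()
            ' Requires Imports System.Diagnostics for Stopwatch and Debug.
            Dim sw As New Stopwatch
            sw.Start()
            Dim kwds() As String = {""}
            Dim k = 0
            Dim tmp As String = ""
    
            ' Mid is 1-based, so the loop runs through inputString.Length.
            For l = 1 To inputString.Length
                tmp = Mid(inputString, l, 1)
                If tmp = separator Then
                    k += 1
                    tmp = ""
                    ReDim Preserve kwds(k)
                End If
                kwds(k) &= tmp
            Next
            sw.Stop()
            Dim fsTime As Long = sw.ElapsedTicks
    
            ' Restart resets the elapsed time; calling Start again would keep
            ' accumulating on top of the FastSplit measurement.
            sw.Restart()
            Dim strings() As String = inputString.Split(separator)
            sw.Stop()
    
            Debug.Print("FastSplit took " & fsTime.ToString & " whereas split took " & sw.ElapsedTicks.ToString)
    
            Return kwds
    End Function
    

    Here are some results on relatively small strings of varying sizes, up to 8 KB blocks (times are in ticks). They were recorded with Start in place of Restart for the second measurement, so each "split" figure also includes the FastSplit time.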

    FastSplit took 8 whereas split took 10

    FastSplit took 214 whereas split took 216

    FastSplit took 10 whereas split took 12

    FastSplit took 8 whereas split took 9

    FastSplit took 8 whereas split took 10

    FastSplit took 10 whereas split took 12

    FastSplit took 7 whereas split took 9

    FastSplit took 6 whereas split took 8

    FastSplit took 5 whereas split took 7

    FastSplit took 10 whereas split took 13

    FastSplit took 9 whereas split took 232

    FastSplit took 7 whereas split took 8

    FastSplit took 8 whereas split took 9

    FastSplit took 8 whereas split took 10

    FastSplit took 215 whereas split took 217

    FastSplit took 10 whereas split took 231

    FastSplit took 8 whereas split took 10

    FastSplit took 8 whereas split took 10

    FastSplit took 7 whereas split took 9

    FastSplit took 8 whereas split took 10

    FastSplit took 10 whereas split took 1405

    FastSplit took 9 whereas split took 11

    FastSplit took 8 whereas split took 10

    Also, I know someone will discourage my use of ReDim Preserve instead of using a list... The reason is, the list really didn't provide any speed difference in my benchmarks so I went back to the "simple" way.

  • 2020-12-03 11:10

    The BCL implementation of string.Split is actually quite fast. I've done some testing here trying to outperform it, and it's not easy.

    But there's one thing you can do and that's to implement this as a generator:

    public static IEnumerable<string> GetSplit( this string s, char c )
    {
        int l = s.Length;
        int i = 0, j = s.IndexOf( c, 0, l );
        if ( j == -1 ) // No such substring
        {
            yield return s; // Return original and break
            yield break;
        }
    
        while ( j != -1 )
        {
            if ( j - i > 0 ) // Non empty? 
            {
                yield return s.Substring( i, j - i ); // Return non-empty match
            }
            i = j + 1;
            j = s.IndexOf( c, i, l - i );
        }
    
        if ( i < l ) // Has remainder?
        {
            yield return s.Substring( i, l - i ); // Return remaining trail
        }
    }
    

    The above method is not necessarily faster than string.Split for small strings, but it returns results as it finds them; this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go. Note that, unlike string.Split, it skips empty entries, so empty CSV columns are dropped.
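
    As a rough sketch of what lazy evaluation buys you: when only the first few columns are needed, the enumeration can stop early and the rest of the line is never scanned. SplitLazy and FirstFields are hypothetical names; SplitLazy is a variant of the generator above that keeps empty entries, which is usually what CSV columns need.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helpers for illustration only.
public static class LazySplit
{
    // Yields every field, including empty ones, one at a time.
    public static IEnumerable<string> SplitLazy(this string s, char c)
    {
        int start = 0;
        int next;
        while ((next = s.IndexOf(c, start)) != -1)
        {
            yield return s.Substring(start, next - start);
            start = next + 1;
        }
        yield return s.Substring(start); // trailing field, possibly empty
    }

    // Take(count) stops pulling from the generator after 'count' fields,
    // so the remainder of the line is neither searched nor allocated.
    public static string[] FirstFields(string line, int count)
    {
        return line.SplitLazy(',').Take(count).ToArray();
    }
}
```

    For example, FirstFields("id,name,,a very long trailing column", 3) produces "id", "name", and an empty string without ever touching the last column.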

    The above method is bounded by the performance of IndexOf and Substring, which do a lot of index-out-of-range checking; to be faster you need to optimize these checks away and implement your own helper methods. You can beat the string.Split performance, but it's going to take clever int-hacking. You can read my post about that here.

  • 2020-12-03 11:12

    Here's a very basic example using ReadOnlySpan. On my machine this takes around 150ns as opposed to string.Split() which takes around 250ns. That's a nice 40% improvement right there.

    // Requires .NET Core 2.1 or later for the span-based Parse overloads.
    string serialized = "1577836800;1000;1";
    ReadOnlySpan<char> span = serialized.AsSpan();
    
    Trade result = new Trade();
    
    int index = span.IndexOf(';');
    result.UnixTimestamp = long.Parse(span.Slice(0, index));
    span = span.Slice(index + 1);
    
    index = span.IndexOf(';');
    result.Price = float.Parse(span.Slice(0, index));
    span = span.Slice(index + 1);
    
    // The last field has no trailing separator, so IndexOf would return -1
    // here; parse whatever remains instead.
    result.Quantity = float.Parse(span);
    
    return result;
    

    Note that a ReadOnlySpan.Split() will soon be part of the framework. See https://github.com/dotnet/runtime/pull/295
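
    Until something like that is available, the repeated IndexOf/Slice steps can be folded into a small helper. This is only a sketch: SpanFields, NextField, and ParseTrade are hypothetical names, the tuple stands in for the Trade class (which isn't shown above), and invariant-culture number formatting is assumed.

```csharp
using System;
using System.Globalization;

// Hypothetical helpers for illustration only.
public static class SpanFields
{
    // Returns the next field and advances 'rest' past the separator.
    // When no separator is left, the whole remainder is the last field.
    public static ReadOnlySpan<char> NextField(ref ReadOnlySpan<char> rest, char separator)
    {
        int index = rest.IndexOf(separator);
        if (index < 0)
        {
            ReadOnlySpan<char> last = rest;
            rest = ReadOnlySpan<char>.Empty;
            return last;
        }

        ReadOnlySpan<char> field = rest.Slice(0, index);
        rest = rest.Slice(index + 1);
        return field;
    }

    // Parses "timestamp;price;quantity" without allocating substrings.
    public static (long UnixTimestamp, float Price, float Quantity) ParseTrade(string serialized)
    {
        ReadOnlySpan<char> rest = serialized.AsSpan();
        CultureInfo inv = CultureInfo.InvariantCulture;

        long timestamp = long.Parse(NextField(ref rest, ';'), NumberStyles.Integer, inv);
        float price = float.Parse(NextField(ref rest, ';'), NumberStyles.Float, inv);
        float quantity = float.Parse(NextField(ref rest, ';'), NumberStyles.Float, inv);

        return (timestamp, price, quantity);
    }
}
```

    Because NextField treats a missing separator as "the rest is the last field", the final column needs no special case.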
