Does anyone know of a faster method to do String.Split()?

傲寒 2020-12-03 10:57

I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:

values = line.Split(delimiter);


        
14 Answers
  • 2020-12-03 11:22

    CSV parsing is fiendishly complex to get right. The one and only time I had to do it, I used classes that wrapped the ODBC Text driver.

    The ODBC solution recommended above looks at first glance to be basically the same approach.

    I thoroughly recommend doing some research on CSV parsing before you get too far down a path that nearly, but not quite, works (an all-too-common outcome). In my experience, one of the trickiest things to deal with is Excel's habit of double-quoting only the strings that need it.

  • 2020-12-03 11:26

    It should be pointed out that split() is a questionable approach for parsing CSV files, because you may come across commas inside a field, e.g.:

    1,"Something, with a comma",2,3
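To make the failure concrete, here is a minimal C# sketch that runs a naive split on the line above. The class name `SplitPitfall` and the method `NaiveSplit` are made up for illustration. A plain split knows nothing about quotes, so the quoted field is cut in two and the line yields five pieces instead of four columns.

```csharp
using System;

public static class SplitPitfall
{
    // Naive approach: split on every comma, with no awareness of quotes.
    public static string[] NaiveSplit(string line)
    {
        return line.Split(',');
    }

    public static void Main()
    {
        string line = "1,\"Something, with a comma\",2,3";
        string[] values = NaiveSplit(line);

        // Four logical columns, but the comma inside the quotes adds a fifth piece.
        Console.WriteLine(values.Length); // prints 5
        foreach (string v in values)
        {
            Console.WriteLine("[" + v + "]");
        }
    }
}
```

The second and third pieces come out as `"Something` and ` with a comma"`, with the quote characters still attached.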
    

    The other thing I'll point out, without knowing how you profiled, is to be careful when profiling this kind of low-level detail. The granularity of the Windows/PC timer might come into play, and the loop itself may add significant overhead, so measure a control value (the same loop without the split) to compare against.
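One way to follow that advice, sketched here under assumptions: time many iterations with `System.Diagnostics.Stopwatch` and subtract a control run of the same loop that does almost no work. The class `SplitTiming`, the helper `TimeLoop`, the sample line, and the iteration count are all illustrative choices, not taken from the question.

```csharp
using System;
using System.Diagnostics;

public static class SplitTiming
{
    // Runs `body` `iterations` times and returns the elapsed milliseconds.
    public static double TimeLoop(int iterations, Action body)
    {
        body(); // warm-up call so JIT compilation is not included in the timing
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            body();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        const int iterations = 1000000;
        string line = "1,2,3,4,5,6,7,8,9,10";
        int sink = 0; // results are accumulated so the work is actually used

        // Control: the same loop and delegate call with almost no work inside.
        double control = TimeLoop(iterations, () => { sink += line.Length; });

        // Measured: the same loop doing the split.
        double measured = TimeLoop(iterations, () => { sink += line.Split(',').Length; });

        Console.WriteLine("control:  " + control + " ms");
        Console.WriteLine("split:    " + measured + " ms");
        Console.WriteLine("net cost: " + (measured - control) + " ms (sink=" + sink + ")");
    }
}
```

Timing a million calls at once keeps the result well above the timer's resolution, and the control run gives a rough idea of how much of the total is just loop and delegate overhead.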

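    That being said, a regex-based split (Java's String.split(), or Regex.Split() in .NET) is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). .NET's String.Split() matches literal delimiters rather than patterns, but like any split it still allocates a new array plus a new string for every field on every line.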

    So if you want to speed it up (and I have trouble believing that performance of this part is really an issue) then you want to do it by hand and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do in cleaning them up.

    The algorithm for that is relatively simple:

    • Stop at every comma;
    • When you hit quotes continue until you hit the next set of quotes;
    • Handle escaped quotes (i.e. \") and arguably escaped commas (\,).
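The steps above can be sketched in C# roughly as follows. This is a minimal illustration, not a complete CSV parser: the class name `CsvLineSplitter` and its `Split` signature are hypothetical, it works on one line at a time (so quoted fields that span several lines are not handled), and treating a doubled quote inside a quoted field as one literal quote is an extra assumption added for Excel-style files. The `StringBuilder` and the caller's list are reused between calls, in line with the buffer-reuse advice.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public sealed class CsvLineSplitter
{
    // Reused across calls so each line does not allocate a fresh buffer.
    private readonly StringBuilder _field = new StringBuilder();

    // Splits one line into `output`, which is cleared and refilled on every call.
    public void Split(string line, char delimiter, List<string> output)
    {
        output.Clear();
        _field.Length = 0;
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            bool hasNext = i + 1 < line.Length;

            if (c == '\\' && hasNext && (line[i + 1] == '"' || line[i + 1] == delimiter))
            {
                // Escaped quote or escaped delimiter: keep the next character literally.
                _field.Append(line[++i]);
            }
            else if (c == '"')
            {
                if (inQuotes && hasNext && line[i + 1] == '"')
                {
                    // Assumption: a doubled quote inside quotes means one literal quote.
                    _field.Append('"');
                    i++;
                }
                else
                {
                    // Opening or closing quote: switch state, do not emit the quote.
                    inQuotes = !inQuotes;
                }
            }
            else if (c == delimiter && !inQuotes)
            {
                // An unquoted delimiter ends the current field.
                output.Add(_field.ToString());
                _field.Length = 0;
            }
            else
            {
                _field.Append(c);
            }
        }

        // The last field has no trailing delimiter.
        output.Add(_field.ToString());
    }
}
```

A caller would create one `CsvLineSplitter` and one `List<string>` up front and pass both to `Split` for every line read. For the sample line `1,"Something, with a comma",2,3` this yields four fields, with the second one being `Something, with a comma`.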

    Oh and to give you some idea of the cost of regex, there was a question (Java not C# but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String. Jon Skeet manually coded the loop. Out of curiosity I compared the two versions and his was an order of magnitude better.

    So if you really want performance, it's time to hand parse.

    Or, better yet, use someone else's optimized solution like this fast CSV reader.

    By the way, while this is in relation to Java it concerns the performance of regular expressions in general (which is universal) and replaceAll() vs a hand-coded loop: Putting char into a java string for each N characters.
