Does anyone know of a faster method to do String.Split()?

傲寒 2020-12-03 10:57

I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:

values = line.Split(delimiter);


        
14 answers
  • 2020-12-03 11:14

    String.split is rather slow. If you want some faster methods, here you go. :)

    However, CSV is much better parsed by a rule-based parser.

    This guy has made a rule-based tokenizer for Java (it requires some copying and pasting, unfortunately):

    http://www.csdgn.org/code/rule-tokenizer

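    import java.util.ArrayList;

    // Both methods are meant to be pasted into an existing class.
    // Splits src on a single character; empty fields (including a trailing one) are kept.
    private static final String[] fSplit(String src, char delim) {
        ArrayList<String> output = new ArrayList<String>();
        int index;
        int lindex = 0; // start of the current field
        while ((index = src.indexOf(delim, lindex)) != -1) {
            output.add(src.substring(lindex, index));
            lindex = index + 1;
        }
        output.add(src.substring(lindex)); // last field, after the final delimiter
        return output.toArray(new String[output.size()]);
    }

    // Same as above, for a multi-character delimiter.
    private static final String[] fSplit(String src, String delim) {
        if (delim.isEmpty()) {
            // indexOf("") always matches at lindex, so the loop below would never advance.
            throw new IllegalArgumentException("delim must not be empty");
        }
        ArrayList<String> output = new ArrayList<String>();
        int index;
        int lindex = 0; // start of the current field
        while ((index = src.indexOf(delim, lindex)) != -1) {
            output.add(src.substring(lindex, index));
            lindex = index + delim.length();
        }
        output.add(src.substring(lindex)); // last field, after the final delimiter
        return output.toArray(new String[output.size()]);
    }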
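The rule-based parsing recommended in this answer can be sketched as a small state machine that tracks whether the current character is inside a quoted field. This is only an illustration, not the linked tokenizer: the class and method names are hypothetical, and it assumes comma delimiters, double-quote quoting with "" as the escape, and records that do not span multiple lines.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvLineSketch {
    // Splits one CSV line on commas, honoring double-quoted fields and "" escapes.
    // Assumption: a record never spans more than one line.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        field.append('"'); // "" inside quotes is a literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // delimiters inside quotes are data
                }
            } else if (c == '"') {
                inQuotes = true;           // opening quote
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());      // last field
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("a,\"b,c\",\"d\"\"e\","));
    }
}
```

A plain split on ',' would break the second field of that sample line in two, which is the case a rule-based parser exists to handle.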
    
  • 2020-12-03 11:16

    You might think that there are optimizations to be had, but in reality you'll pay for them elsewhere.

    You could, for example, do the split 'yourself' and walk through all the characters and process each column as you encounter it, but you'd be copying all the parts of the string in the long run anyhow.

    One of the optimizations we could do in C or C++, for example, is to replace all the delimiters with '\0' characters and keep pointers to the start of each column. Then we wouldn't have to copy all of the string data just to get to a part of it. But you can't do this in C#, nor would you want to.

    If there is a big difference between the number of columns in the source and the number of columns you need, walking the string manually may yield some benefit. But that benefit would cost you the time to develop and maintain it.
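As a rough illustration of walking the string manually when only one column is needed, here is a sketch in Java (to match the other code on this page; the same idea carries over to C# with IndexOf and Substring). The class and method names are hypothetical, and it assumes the delimiter never appears inside quoted fields.

```java
final class ColumnPicker {
    // Returns the column at zero-based position `wanted`,
    // or null if the line has fewer columns than that.
    static String column(String line, char delim, int wanted) {
        int start = 0;
        // Skip over the columns before the wanted one without copying them.
        for (int col = 0; col < wanted; col++) {
            int next = line.indexOf(delim, start);
            if (next == -1) {
                return null;
            }
            start = next + 1;
        }
        int end = line.indexOf(delim, start);
        // Only the requested column is copied into a new String.
        return end == -1 ? line.substring(start) : line.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(column("id,name,city,zip", ',', 2));
    }
}
```

Only one substring is allocated per line, and columns after the wanted one are never scanned.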

    I've been told that 90% of the CPU time is spent in 10% of the code. There are variations to this "truth". In my opinion, spending 66% of your time in Split is not that bad if processing CSV is the thing that your app needs to do.

    Dave

  • 2020-12-03 11:17

    Depending on use, you can speed this up by using Pattern.split instead of String.split. If this code is in a loop (which I assume it is, since it sounds like you are parsing lines from a file), String.split(String regex) may call Pattern.compile on your regex string every time that statement executes (the JDK only skips compilation for simple single-character delimiters). To optimize this, call Pattern.compile once outside the loop, then call Pattern.split on the compiled pattern inside the loop, passing the line you want to split.
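A minimal sketch of hoisting the compile out of the loop. The delimiter regex (a semicolon with optional surrounding whitespace) and the class and method names are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class PrecompiledSplit {
    // Compiled once and reused for every line; the regex itself is a hypothetical example.
    private static final Pattern DELIM = Pattern.compile("\\s*;\\s*");

    static List<String[]> splitAll(List<String> lines) {
        List<String[]> rows = new ArrayList<>(lines.size());
        for (String line : lines) {
            // No regex compilation happens inside the loop.
            rows.add(DELIM.split(line));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : splitAll(Arrays.asList("a ; b;c", "1;2"))) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Like String.split, Pattern.split(CharSequence) drops trailing empty strings; passing a negative limit, as in DELIM.split(line, -1), keeps them, which usually matters for CSV columns.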

    Hope this helps

  • 2020-12-03 11:17

    The main problem(?) with String.Split is that it's general, in that it caters for many needs.

    If you know more about your data than Split does, writing your own splitter can be an improvement.

    For instance, if:

    1. You don't care about empty strings, so you don't need to handle those any special way
    2. You don't need to trim strings, so you don't need to do anything with or around those
    3. You don't need to check for quoted commas or quotes
    4. You don't need to handle quotes at all

    If any of these are true, you might see an improvement by writing your own more specific version of String.Split.

    Having said that, the first question you should ask is whether this actually is a problem worth solving. Is the time taken to read and import the file so long that you actually feel this is a good use of your time? If not, then I would leave it alone.

    The second question is why String.Split is using that much time compared to the rest of your code. If the answer is that the code is doing very little with the data, then I would probably not bother.

    However, if, say, you're stuffing the data into a database, then 66% of the time of your code spent in String.Split constitutes a big big problem.

  • 2020-12-03 11:18

    Some very thorough analysis of String.Split() vs. Regex and other methods.

    We are talking about millisecond savings over very large strings, though.

  • 2020-12-03 11:21

    You can assume that String.Split will be close to optimal; i.e. it could be quite hard to improve on it. By far the easier solution is to check whether you need to split the string at all. It's quite likely that you'll be using the individual strings directly. If you define a StringShim class (reference to String, begin & end index) you'll be able to split a String into a set of shims instead. These will have a small, fixed size, and will not cause string data copies.
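One way the shim idea could look, sketched in Java to match the other code on this page. StringShim is the name used in the answer; everything else here (the CharSequence interface, the split helper, the field names) is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class StringShim implements CharSequence {
    private final String source; // shared, never copied
    private final int begin;     // inclusive
    private final int end;       // exclusive

    StringShim(String source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    @Override public int length() { return end - begin; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return source.charAt(begin + index);
    }

    @Override public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        return new StringShim(source, begin + from, begin + to);
    }

    // The only place where character data is copied.
    @Override public String toString() { return source.substring(begin, end); }

    // Splits into shims: one small object per column, no string data copied.
    static List<StringShim> split(String line, char delim) {
        List<StringShim> shims = new ArrayList<>();
        int start = 0;
        int next;
        while ((next = line.indexOf(delim, start)) != -1) {
            shims.add(new StringShim(line, start, next));
            start = next + 1;
        }
        shims.add(new StringShim(line, start, line.length()));
        return shims;
    }

    public static void main(String[] args) {
        for (StringShim shim : split("ab,,cd", ',')) {
            System.out.println(shim.length() + ":" + shim);
        }
    }
}
```

Each shim keeps the whole source line reachable, so this trades copying for holding on to the original string for as long as any shim is alive.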
