Splitting a CSV file with quotes as text delimiter using String.split()

离开以前 2020-11-27 11:07

I have a comma-separated file with many lines similar to the one below.

Sachin,,M,"Maths,Science,English",Need to improve in these subjects.

3 Answers
  • 2020-11-27 11:36
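    import java.util.Arrays;
    // The class name is arbitrary; any name works.
    public class SplitDemo {
        public static void main(String[] args) {
            String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
            // Split only on commas followed by an even number of quotes, i.e. commas outside quoted text.
            String[] splitted = s.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
            System.out.println(Arrays.toString(splitted));
        }
    }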
    

    Output:

    [Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]
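
One caveat with the `split` approach above, sketched under the assumption that rows may end in empty columns: `String.split(regex)` discards trailing empty strings, so a row ending in `,,` would lose its last two fields. Passing a negative limit keeps them. The class name `TrailingFieldsDemo`, the helper `splitKeepingEmpty`, and the sample row are made up for illustration.

```java
import java.util.Arrays;

class TrailingFieldsDemo {
    // Same regex as in the answer above: a comma followed by an even number of quotes.
    static final String CSV_COMMA = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // A negative limit tells split() to keep trailing empty strings.
    static String[] splitKeepingEmpty(String line) {
        return line.split(CSV_COMMA, -1);
    }

    public static void main(String[] args) {
        // Hypothetical row that ends in two empty columns.
        String row = "Sachin,,M,\"Maths,Science\",,";
        System.out.println(row.split(CSV_COMMA).length);   // trailing empty fields dropped
        System.out.println(splitKeepingEmpty(row).length); // trailing empty fields kept
        System.out.println(Arrays.toString(splitKeepingEmpty(row)));
    }
}
```

Whether trailing empty fields matter depends on the file; if every row must have the same number of columns, the negative limit is likely the safer default.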
    
  • 2020-11-27 11:40

    Because your requirements are fairly simple, a custom method can do the job: in the test below it ran over 20 times faster than the regular expression and produced the same result. The exact speedup varies with the data size and the number of rows parsed, and for more complicated problems a regular expression may still be necessary.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class SplitTest {

        public static void main(String[] args) {
            String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";

            // Measure the regular expression (total time for 10 runs, in nanoseconds)
            String[] splitted = null;
            long startTime = System.nanoTime();
            for (int i = 0; i < 10; i++) {
                splitted = s.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
            }
            long endTime = System.nanoTime();

            System.out.println("Took: " + (endTime - startTime));
            System.out.println(Arrays.toString(splitted));
            System.out.println("");

            // Measure the custom method (same 10 runs)
            List<String> sw = null;
            startTime = System.nanoTime();
            for (int i = 0; i < 10; i++) {
                sw = customSplitSpecific(s);
            }
            endTime = System.nanoTime();

            System.out.println("Took: " + (endTime - startTime));
            System.out.println(sw);
        }

        // Splits on commas that are not enclosed in double quotes; the quotes are kept.
        public static List<String> customSplitSpecific(String s) {
            List<String> words = new ArrayList<>();
            boolean insideQuotes = false;
            int start = 0;
            // Scan every character, including the last one, so a trailing comma still ends a field.
            for (int i = 0; i < s.length(); i++) {
                if (s.charAt(i) == ',' && !insideQuotes) {
                    words.add(s.substring(start, i));
                    start = i + 1;
                } else if (s.charAt(i) == '"') {
                    insideQuotes = !insideQuotes;
                }
            }
            words.add(s.substring(start));
            return words;
        }
    }

    On my own computer this produces (times are in nanoseconds):

    Took: 6651100
    [Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]
    
    Took: 224179
    [Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]
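
The timing above recompiles the regular expression on every `split` call and runs only ten iterations, before the JIT has warmed up, so the gap is probably overstated. A rough sketch of a somewhat fairer comparison follows, assuming a precompiled `java.util.regex.Pattern`, a warm-up pass, and an arbitrary iteration count. `FairerSplitTiming` and `splitOutsideQuotes` are illustrative names, and this is not a substitute for a proper benchmark harness.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class FairerSplitTiming {
    // Compile once instead of on every split() call.
    static final Pattern CSV_COMMA = Pattern.compile(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

    // Hand-written splitter: break on commas that are not inside double quotes.
    static List<String> splitOutsideQuotes(String s) {
        List<String> fields = new ArrayList<>();
        boolean insideQuotes = false;
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') {
                insideQuotes = !insideQuotes;
            } else if (c == ',' && !insideQuotes) {
                fields.add(s.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(s.substring(start));
        return fields;
    }

    public static void main(String[] args) {
        String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        int iterations = 100_000; // arbitrary; chosen only for illustration
        int sink = 0;             // accumulate results so the work is not optimized away

        // Warm-up pass so both code paths are JIT-compiled before timing.
        for (int i = 0; i < iterations; i++) {
            sink += CSV_COMMA.split(s).length + splitOutsideQuotes(s).size();
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink += CSV_COMMA.split(s).length;
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink += splitOutsideQuotes(s).size();
        long t2 = System.nanoTime();

        System.out.println("Regex:  " + (t1 - t0) / iterations + " ns per split");
        System.out.println("Custom: " + (t2 - t1) / iterations + " ns per split");
        System.out.println(sink > 0 ? Arrays.toString(CSV_COMMA.split(s)) : "");
    }
}
```

The hand-written loop will likely still win, since it makes a single pass while the lookahead rescans the rest of the line at every comma, but the ratio may differ noticeably from the figures above.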
    
  • 2020-11-27 11:42

    If your strings are all well-formed, this is possible with the following regular expression:

    String[] res = str.split(",(?=([^\"]|\"[^\"]*\")*$)");
    

    The lookahead ensures that a split occurs only at commas that are followed by an even number of quotes (zero included) before the end of the string, which means the comma is not inside a quoted section.
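
To make the even-quote rule concrete, the sketch below checks every comma in the sample line, counts the quotes that follow it, and splits only where that count is even. It then prints the regex result for comparison. `QuoteParityDemo`, `quotesAfter`, and `splitByParity` are names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class QuoteParityDemo {
    // Counts the double quotes that appear after position idx.
    static int quotesAfter(String s, int idx) {
        int count = 0;
        for (int i = idx + 1; i < s.length(); i++) {
            if (s.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    // Splits at every comma that is followed by an even number of quotes, without a regex.
    static List<String> splitByParity(String s) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ',' && quotesAfter(s, i) % 2 == 0) {
                fields.add(s.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(s.substring(start));
        return fields;
    }

    public static void main(String[] args) {
        String str = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        String[] viaRegex = str.split(",(?=([^\"]|\"[^\"]*\")*$)");
        System.out.println(splitByParity(str));        // parity rule applied by hand
        System.out.println(Arrays.toString(viaRegex)); // same rule expressed as a lookahead
    }
}
```

Recounting the quotes at every comma is quadratic in the line length, which is roughly the work the lookahead does as well; it is written this way only to mirror the rule, not as a recommended implementation.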

    Nevertheless, it may be easier to use a simple non-regex parser.
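
A simple non-regex parser could look like the sketch below. It goes slightly beyond the answers here by also removing the surrounding quotes and treating a doubled quote (`""`) inside a quoted field as a literal quote, which is the common CSV convention. Both behaviours are assumptions about the input, and `SimpleCsvLine.parse` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleCsvLine {
    // Parses one CSV line into fields, honouring double-quoted sections.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (insideQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote becomes one literal quote
                        i++;
                    } else {
                        insideQuotes = false; // closing quote
                    }
                } else {
                    current.append(c); // commas inside quotes are ordinary text
                }
            } else if (c == '"') {
                insideQuotes = true; // opening quote
            } else if (c == ',') {
                fields.add(current.toString()); // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        String line = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
        for (String field : parse(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Unlike `split`, this version keeps trailing empty fields and does not depend on the line being well-formed for the regex to match; an unterminated quote simply swallows the rest of the line into the last field.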
