StringTokenizer - reading lines with integers

温柔的废话 2021-01-15 03:32

I have a question about optimizing my code (which works but is too slow...). I am reading input in the form

X1 Y1
X2 Y2
etc

where Xi,

2 Answers
  • 2021-01-15 04:11

    (updated answer)

    Whatever is slowing your program down, the choice of tokenizer is not it. After one initial run of each method to even out initialisation quirks, I can parse 1,000,000 rows of "12 34" in well under half a second with any of them. You could switch to indexOf if you like, but I really think you need to look at other parts of the code for the bottleneck rather than this micro-optimisation. String.split was a surprise to me: it is really slow compared to the other methods. I've also added a Guava Splitter test; it's faster than String.split but slightly slower than StringTokenizer.

    • Split: 371ms
    • IndexOf: 48ms
    • StringTokenizer: 92ms
    • Guava Splitter.split(): 108ms
    • CsvMapper (build a CSV doc and parse it into POJOs): 237ms (or 175ms if you build the lines into one doc)

    The difference here is pretty negligible even over millions of rows.

    There's now a write up of this on my blog: http://demeranville.com/battle-of-the-tokenizers-delimited-text-parser-performance/

    Code I ran was:

    import java.util.StringTokenizer;
    import org.junit.Test;
    
    public class TestSplitter {
    
    private static final String line = "12 34";
    private static final int RUNS = 1_000_000; // rows parsed per test
    
    public final void testSplit() {
        long start = System.currentTimeMillis();
        for (int i=0;i<RUNS;i++){
            String[] st = line.split(" ");
            int x = Integer.parseInt(st[0]);
            int y = Integer.parseInt(st[1]);
        }
        System.out.println("Split: "+(System.currentTimeMillis() - start)+"ms");
    }
    
    public final void testIndexOf() {
        long start = System.currentTimeMillis();
        for (int i=0;i<RUNS;i++){
            int index = line.indexOf(' ');
            int x = Integer.parseInt(line.substring(0,index));
            int y = Integer.parseInt(line.substring(index+1));
        }       
        System.out.println("IndexOf: "+(System.currentTimeMillis() - start)+"ms");      
    }
    
    public final void testTokenizer() {
        long start = System.currentTimeMillis();
        for (int i=0;i<RUNS;i++){
            StringTokenizer st = new StringTokenizer(line, " ");
            int x = Integer.parseInt(st.nextToken());
            int y = Integer.parseInt(st.nextToken());
        }
        System.out.println("StringTokenizer: "+(System.currentTimeMillis() - start)+"ms");
    }
    
    @Test
    public final void testAll() {
        // First pass warms up the JVM; read the timings from the second pass.
        this.testSplit();
        this.testIndexOf();
        this.testTokenizer();
        this.testSplit();
        this.testIndexOf();
        this.testTokenizer();
    }
    
    }
    

    Edited to add: here's the Guava code (it needs imports for com.google.common.base.Splitter and java.util.Iterator):

    public final void testGuavaSplit() {
        long start = System.currentTimeMillis();
        Splitter split = Splitter.on(" ");
        for (int i=0;i<RUNS;i++){
            Iterator<String> it = split.split(line).iterator();
            int x = Integer.parseInt(it.next());
            int y = Integer.parseInt(it.next());
        }
        System.out.println("GuavaSplit: "+(System.currentTimeMillis() - start)+"ms");
    }
    

    Update: I've added a CsvMapper test too (it uses Jackson's jackson-dataformat-csv module):

    public static class CSV{
        public int x;
        public int y;
    }
    
    public final void testJacksonSplit() throws JsonProcessingException, IOException {
        CsvMapper mapper = new CsvMapper();
        CsvSchema schema = CsvSchema.builder().addColumn("x", ColumnType.NUMBER).addColumn("y", ColumnType.NUMBER).setColumnSeparator(' ').build();
    
        long start = System.currentTimeMillis();
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < RUNS; i++) {
            builder.append(line);
            builder.append('\n');
        }       
        String input = builder.toString();
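        // Note: reader(Class) is deprecated in newer Jackson releases; readerFor(CSV.class) is the replacement.
        MappingIterator<CSV> it = mapper.reader(CSV.class).with(schema).readValues(input);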
        while (it.hasNext()){
            CSV csv = it.next();
        }
        System.out.println("CsvMapperSplit: " + (System.currentTimeMillis() - start) + "ms");
    }
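
    As an illustration of the indexOf variant applied to the input format in the question, here is a minimal sketch that reads "X Y" lines from a reader and parses each one. The class name PairParser, the helper names, and the choice to sum the values are assumptions made for the example, and it assumes exactly one space between the two integers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class; not part of the benchmark above.
final class PairParser {

    /** Parses one "X Y" line into {x, y}, splitting on the first space. */
    static int[] parseLine(String line) {
        int index = line.indexOf(' ');
        if (index < 0) {
            throw new IllegalArgumentException("Expected two space-separated integers: " + line);
        }
        int x = Integer.parseInt(line.substring(0, index));
        int y = Integer.parseInt(line.substring(index + 1));
        return new int[] {x, y};
    }

    /** Reads every non-empty line and returns the sum of all X and Y values. */
    static long sumAll(Reader source) {
        long total = 0;
        // BufferedReader keeps the I/O side cheap; readLine strips the line terminator.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                int[] pair = parseLine(line);
                total += (long) pair[0] + pair[1];
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample input assumed for the example.
        System.out.println(sumAll(new StringReader("12 34\n5 6\n")));
    }
}
```

    Running main on the two sample lines prints 57. For real input, the StringReader would be replaced by a reader over the file or System.in.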
    
  • 2021-01-15 04:17

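    You could use a regex to check that the token is numeric and then convert it to an integer. Read the token once into a variable, since each call to nextToken() consumes a token:

    String token = st.nextToken();
    if (token.matches("^[0-9]+$")) {
        int x = Integer.parseInt(token);
    }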
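
    If many tokens are validated this way, it may be worth compiling the pattern once, because String.matches compiles its regex on every call. A minimal sketch, assuming non-negative integers of at most nine digits (so that parseInt cannot overflow); TokenValidator and parseIfNumeric are hypothetical names:

```java
import java.util.OptionalInt;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating a precompiled numeric check.
final class TokenValidator {

    // Compiled once and reused. Assumption: 1 to 9 decimal digits, no sign.
    private static final Pattern DIGITS = Pattern.compile("[0-9]{1,9}");

    /** Returns the parsed value, or an empty OptionalInt if the token is not numeric. */
    static OptionalInt parseIfNumeric(String token) {
        // matches() requires the whole token to match, so no ^ or $ anchors are needed.
        if (DIGITS.matcher(token).matches()) {
            return OptionalInt.of(Integer.parseInt(token));
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(parseIfNumeric("34").isPresent());
        System.out.println(parseIfNumeric("3x4").isPresent());
    }
}
```

    Returning an OptionalInt leaves it to the caller to decide what to do with a malformed token instead of silently skipping it.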
    