Replicating String.split with StringTokenizer

心在旅途 2021-02-06 14:03

Encouraged by this, and by the fact that I have billions of strings to parse, I tried to modify my code to accept a StringTokenizer instead of a String[].

The only t

9 Answers
  • 2021-02-06 14:35

    Note: Having done some quick benchmarks, Scanner turns out to be about four times slower than String.split. Hence, do not use Scanner.

    (I'm leaving the post up to record that Scanner is a bad idea in this case. Read as: please do not downvote me for suggesting Scanner.)

    Assuming you are using Java 1.5 or higher, try Scanner, which implements Iterator<String>, as it happens:

    Scanner sc = new Scanner("dog,,cat");
    sc.useDelimiter(",");
    while (sc.hasNext()) {
        System.out.println(sc.next());
    }
    

    gives:

    dog
    
    cat
    
  • 2021-02-06 14:48

    After tinkering with the StringTokenizer class, I could not find a way to satisfy the requirements to return ["dog", "", "cat"].
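
    To make the mismatch concrete, here is a minimal sketch comparing the two on the same input. The class name EmptyTokenDemo and the helper viaTokenizer are made up for this illustration; only StringTokenizer and String.split come from the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class EmptyTokenDemo {
  // Collects every token that StringTokenizer reports for the given input.
  static List<String> viaTokenizer(String s, String delim) {
    List<String> out = new ArrayList<String>();
    StringTokenizer t = new StringTokenizer(s, delim);
    while (t.hasMoreTokens()) {
      out.add(t.nextToken());
    }
    return out;
  }

  public static void main(String[] args) {
    // StringTokenizer treats adjacent delimiters as one gap: prints [dog, cat]
    System.out.println(viaTokenizer("dog,,cat", ","));
    // String.split keeps the empty token in the middle: prints [dog, , cat]
    System.out.println(Arrays.asList("dog,,cat".split(",")));
  }
}
```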

    Furthermore, the StringTokenizer class is retained only for compatibility reasons, and the use of String.split is encouraged. From the API Specification for StringTokenizer:

    StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

    Since the issue is the supposedly poor performance of the String.split method, we need to find an alternative.

    Note: I am saying "supposedly poor performance" because it's hard to establish that StringTokenizer will beat String.split in every use case. Furthermore, unless proper profiling shows that tokenizing the strings is indeed the bottleneck of the application, I feel it will often end up being a premature optimization. I would be inclined to say: write code that is meaningful and easy to understand before venturing into optimization.

    Given the current requirements, rolling our own tokenizer probably wouldn't be too difficult.

    Roll our own tokenizer!

    The following is a simple tokenizer I wrote. I should note that there are no speed optimizations, nor are there error checks to prevent going past the end of the string -- this is a quick-and-dirty implementation:

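    import java.util.Iterator;

    // Splits s on every occurrence of delim, keeping empty tokens.
    // Quick and dirty: hasNext() must be called exactly once before each next().
    class MyTokenizer implements Iterable<String>, Iterator<String> {
      String delim = ",";
      String s;
      int curIndex = 0;                 // start of the token next() will return
      int nextIndex = 0;                // index of the next delimiter, or -1 if none
      boolean nextIsLastToken = false;  // set once no further delimiter is found
    
      public MyTokenizer(String s, String delim) {
        this.s = s;
        this.delim = delim;
      }
    
      public Iterator<String> iterator() {
        return this;
      }
    
      public boolean hasNext() {
        // Locate the delimiter that ends the upcoming token.
        nextIndex = s.indexOf(delim, curIndex);
    
        if (nextIsLastToken)
          return false;
    
        if (nextIndex == -1)
          nextIsLastToken = true;
    
        return true;
      }
    
      public String next() {
        // No delimiter left: the token runs to the end of the string.
        if (nextIndex == -1)
          nextIndex = s.length();
    
        String token = s.substring(curIndex, nextIndex);
        curIndex = nextIndex + delim.length();  // skip the whole delimiter
    
        return token;
      }
    
      public void remove() {
        throw new UnsupportedOperationException();
      }
    }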
    

    The MyTokenizer will take a String to tokenize and a String as a delimiter, and use the String.indexOf method to perform the search for delimiters. Tokens are produced by the String.substring method.

    I suspect there could be some performance improvement from working on the string at the char[] level rather than at the String level, but I'll leave that as an exercise for the reader.
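
    One possible shape for that exercise is sketched below, assuming a single-character delimiter. CharSplit is a hypothetical name, and whether this is actually faster than the indexOf/substring version has not been measured here.

```java
import java.util.ArrayList;
import java.util.List;

class CharSplit {
  // Splits s on a single delimiter character, keeping empty tokens
  // (including a trailing one).
  static List<String> split(String s, char delim) {
    char[] chars = s.toCharArray();
    List<String> out = new ArrayList<String>();
    int start = 0;  // index where the current token begins
    for (int i = 0; i < chars.length; i++) {
      if (chars[i] == delim) {
        out.add(new String(chars, start, i - start));
        start = i + 1;
      }
    }
    // Whatever follows the last delimiter is the final token.
    out.add(new String(chars, start, chars.length - start));
    return out;
  }

  public static void main(String[] args) {
    System.out.println(split("dog,,cat", ','));  // prints [dog, , cat]
  }
}
```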

    The class also implements Iterable and Iterator in order to take advantage of the for-each loop construct that was introduced in Java 5. StringTokenizer implements Enumeration, which the for-each construct does not support.
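
    Because MyTokenizer does its searching inside hasNext(), it only behaves correctly when hasNext() is called exactly once before each next(), which is what the for-each loop happens to do. As a sketch of how that restriction could be lifted, the variant below keeps all state changes in next(). SafeTokenizer and toList are names invented for this illustration, the empty-delimiter check is an added assumption, and this version was not part of the benchmark.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class SafeTokenizer implements Iterable<String>, Iterator<String> {
  private final String s;
  private final String delim;
  private int curIndex = 0;  // start of the next token, or -1 when exhausted

  SafeTokenizer(String s, String delim) {
    if (delim.isEmpty()) {
      // An empty delimiter would never advance curIndex.
      throw new IllegalArgumentException("delimiter must not be empty");
    }
    this.s = s;
    this.delim = delim;
  }

  public Iterator<String> iterator() {
    return this;
  }

  public boolean hasNext() {
    return curIndex != -1;  // no side effects, safe to call repeatedly
  }

  public String next() {
    if (curIndex == -1) {
      throw new NoSuchElementException();
    }
    int nextIndex = s.indexOf(delim, curIndex);
    String token;
    if (nextIndex == -1) {
      token = s.substring(curIndex);  // last token runs to the end
      curIndex = -1;
    } else {
      token = s.substring(curIndex, nextIndex);
      curIndex = nextIndex + delim.length();
    }
    return token;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }

  // Convenience helper for the example: drains the tokenizer into a list.
  static List<String> toList(String s, String delim) {
    List<String> out = new ArrayList<String>();
    for (String t : new SafeTokenizer(s, delim)) {
      out.add(t);
    }
    return out;
  }
}
```

    For example, SafeTokenizer.toList("dog,,cat", ",") yields the three tokens "dog", "" and "cat".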

    Is it any faster?

    To find out whether this is any faster, I wrote a program to compare the speed of the following four methods:

    1. Use of StringTokenizer.
    2. Use of the new MyTokenizer.
    3. Use of String.split.
    4. Use of precompiled regular expression by Pattern.compile.

    In all four methods, the string "dog,,cat" was separated into tokens. Although StringTokenizer is included in the comparison, it should be noted that it will not return the desired result of ["dog", "", "cat"].

    The tokenizing was repeated 1 million times so that it takes long enough for the differences between the methods to be noticeable.

    The code used for the simple benchmark was the following:

    import java.util.StringTokenizer;
    import java.util.regex.Pattern;

    // Expects the MyTokenizer class above to be in the same package.
    public class TokenizerBenchmark {
      public static void main(String[] args) {
        // 1. StringTokenizer (does not report the empty token)
        long st = System.currentTimeMillis();
        for (int i = 0; i < 1e6; i++) {
          StringTokenizer t = new StringTokenizer("dog,,cat", ",");
          while (t.hasMoreTokens()) {
            t.nextToken();
          }
        }
        System.out.println(System.currentTimeMillis() - st);

        // 2. MyTokenizer
        st = System.currentTimeMillis();
        for (int i = 0; i < 1e6; i++) {
          MyTokenizer mt = new MyTokenizer("dog,,cat", ",");
          for (String t : mt) {
          }
        }
        System.out.println(System.currentTimeMillis() - st);

        // 3. String.split
        st = System.currentTimeMillis();
        for (int i = 0; i < 1e6; i++) {
          String[] tokens = "dog,,cat".split(",");
          for (String t : tokens) {
          }
        }
        System.out.println(System.currentTimeMillis() - st);

        // 4. Precompiled Pattern (compilation is included in the timing)
        st = System.currentTimeMillis();
        Pattern p = Pattern.compile(",");
        for (int i = 0; i < 1e6; i++) {
          String[] tokens = p.split("dog,,cat");
          for (String t : tokens) {
          }
        }
        System.out.println(System.currentTimeMillis() - st);
      }
    }
    

    The Results

    The tests were run using Java SE 6 (build 1.6.0_12-b04). The results, in elapsed milliseconds for 1 million iterations, were the following:

                       Run 1    Run 2    Run 3    Run 4    Run 5
                       -----    -----    -----    -----    -----
    StringTokenizer      172      188      187      172      172
    MyTokenizer          234      234      235      234      235
    String.split        1172     1156     1171     1172     1156
    Pattern.compile      906      891      891      907      906
    

    So, as can be seen from this limited testing of only five runs, StringTokenizer did in fact come out fastest at roughly 170 to 190 ms, with MyTokenizer a close second at about 235 ms. String.split was the slowest at around 1,160 ms, and the precompiled regular expression was somewhat faster than split at about 900 ms.

    As with any little benchmark, it probably isn't very representative of real-life conditions, so the results should be taken with a grain (or a mound) of salt.
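
    Two things that commonly skew a loop like the one above are JIT warm-up and the fact that the tokens are thrown away, which gives the JIT room to skip work. The sketch below is one way to reduce both effects, under the assumption that a warm-up pass and a checksum are enough for a rough comparison. SplitBench and its method names are made up here, and no timings are claimed for it.

```java
import java.util.StringTokenizer;
import java.util.regex.Pattern;

class SplitBench {
  static final Pattern COMMA = Pattern.compile(",");

  // Each method returns the summed token lengths so the work has a visible result.
  static long viaTokenizer(String s, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
      StringTokenizer t = new StringTokenizer(s, ",");
      while (t.hasMoreTokens()) {
        sum += t.nextToken().length();
      }
    }
    return sum;
  }

  static long viaSplit(String s, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
      for (String t : s.split(",")) {
        sum += t.length();
      }
    }
    return sum;
  }

  static long viaPattern(String s, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
      for (String t : COMMA.split(s)) {
        sum += t.length();
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    String s = "dog,,cat";
    int n = 1000000;

    // Warm-up pass: results are ignored, the point is to trigger compilation.
    viaTokenizer(s, n);
    viaSplit(s, n);
    viaPattern(s, n);

    long t0 = System.nanoTime();
    long a = viaTokenizer(s, n);
    long t1 = System.nanoTime();
    long b = viaSplit(s, n);
    long t2 = System.nanoTime();
    long c = viaPattern(s, n);
    long t3 = System.nanoTime();

    System.out.println("StringTokenizer ms: " + (t1 - t0) / 1000000 + " checksum " + a);
    System.out.println("String.split ms:    " + (t2 - t1) / 1000000 + " checksum " + b);
    System.out.println("Pattern.split ms:   " + (t3 - t2) / 1000000 + " checksum " + c);
  }
}
```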

  • 2021-02-06 14:49

    Well, the fastest thing you could do would be to manually traverse the string, e.g.

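    import java.util.ArrayList;
    import java.util.List;

    // Splits s on commas by scanning with indexOf; empty tokens between commas are kept.
    List<String> split(String s) {
        List<String> out = new ArrayList<String>();
        int idx = 0;   // start of the current token
        int next = 0;  // index of the next comma
        while ((next = s.indexOf(',', idx)) > -1) {
            out.add(s.substring(idx, next));
            idx = next + 1;
        }
        // Add whatever follows the last comma; a trailing empty token is dropped.
        if (idx < s.length()) {
            out.add(s.substring(idx));
        }
        return out;
    }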
    

    In an informal test, this looks to be roughly twice as fast as split. However, iterating this way is a bit dangerous: for example, it will break on escaped commas. If you end up needing to deal with those at some point (because your list of a billion strings has 3 escaped commas), you will probably lose some of the speed benefit by the time you have allowed for them.

    Ultimately it's probably not worth the bother.
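
    To illustrate the extra work escaping brings, here is a sketch that assumes a backslash makes the following character literal (so "\," is a comma inside a token). That escape convention is an assumption, since the question does not say how commas would be escaped, and EscapedSplit is a made-up name. Unlike the method above, this version also keeps a trailing empty token.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedSplit {
  // Splits on ',' unless the comma is preceded by a backslash.
  // The backslash itself is dropped and the character after it is kept as-is.
  static List<String> split(String s) {
    List<String> out = new ArrayList<String>();
    StringBuilder cur = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '\\' && i + 1 < s.length()) {
        cur.append(s.charAt(++i));  // escaped character: copy it literally
      } else if (c == ',') {
        out.add(cur.toString());    // unescaped comma ends the token
        cur.setLength(0);
      } else {
        cur.append(c);              // ordinary character (or a lone trailing backslash)
      }
    }
    out.add(cur.toString());        // final token, possibly empty
    return out;
  }

  public static void main(String[] args) {
    // The Java literal "a\\,b,c" is the text a\,b,c
    System.out.println(split("a\\,b,c"));  // prints [a,b, c]
  }
}
```

    The per-character copy into a StringBuilder replaces the single substring call per token, which is where some of the speed advantage is likely to go.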

  • 2021-02-06 14:51

    If your input is structured, you could have a look at the JavaCC parser generator. It generates a Java class that reads your input. The grammar would look roughly like this:

    TOKEN { <CAT: "cat"> , <DOG:"dog"> }
    
    input: (cat() | dog())*
    
    
    cat: <CAT>
       {
       animals.add(new Animal("Cat"));
       }
    
    dog: <DOG>
       {
       animals.add(new Animal("Dog"));
       }
    
  • 2021-02-06 14:55

    Rather than StringTokenizer, you could try the StrTokenizer class from Apache Commons Lang, whose documentation I quote:

    This class can split a String into many smaller strings. It aims to do a similar job to StringTokenizer, however it offers much more control and flexibility including implementing the ListIterator interface.

    Empty tokens may be removed or returned as null.

    This sounds like what you need, I think?

  • 2021-02-06 14:55

    You could do something like this. It's not perfect, but it might work for you.

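    import java.util.ArrayList;
    import java.util.List;

    // Splits test on every occurrence of c, keeping empty tokens.
    public static List<String> find(String test, char c) {
        List<String> list = new ArrayList<String>();
        int i = 0;
        // i may reach test.length() so that a trailing empty token is still added.
        while (i <= test.length()) {
            int start = i;
            // Advance to the next delimiter or to the end of the string.
            while (i < test.length() && test.charAt(i) != c) {
                i++;
            }
            list.add(test.substring(start, i));
            i++;  // skip the delimiter
        }
        return list;
    }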
    

    If possible, you can omit the List and act on each substring directly:

    public static void split(String test, char c) {
        int i = 0;
        while (i <= test.length()) {
            int start = i;
            // Advance to the next delimiter or to the end of the string.
            while (i < test.length() && test.charAt(i) != c) {
                i++;
            }
            String s = test.substring(start, i);
            // do something with the string here
            i++;  // skip the delimiter
        }
    }
    

    On my system the last method is faster than the StringTokenizer solution, but you might want to test how it works for you. (Of course, you could make this method a little shorter by omitting the {} of the second while loop, and you could use a for loop instead of the outer while loop and fold the last i++ into it, but I didn't do that here because I consider it bad style.)
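
    If the "do something with the string" step needs to vary between callers, one option is to pass it in as a callback. The sketch below assumes Java 8 or later for java.util.function.Consumer, which is newer than the Java 5/6 code elsewhere in this thread. CallbackSplit and forEachToken are hypothetical names, and its speed has not been measured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class CallbackSplit {
  // Same scanning loop as above, but each token is handed to the caller's action
  // instead of being collected into a list.
  static void forEachToken(String test, char c, Consumer<String> action) {
    int i = 0;
    while (i <= test.length()) {
      int start = i;
      while (i < test.length() && test.charAt(i) != c) {
        i++;
      }
      action.accept(test.substring(start, i));
      i++;  // skip the delimiter
    }
  }

  public static void main(String[] args) {
    List<String> seen = new ArrayList<String>();
    forEachToken("dog,,cat", ',', seen::add);
    System.out.println(seen);  // prints [dog, , cat]
  }
}
```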
