Encouraged by this, and the fact that I have billions of strings to parse, I tried to modify my code to accept StringTokenizer instead of String[].
The only t
Depending on what kind of strings you need to tokenize, you can write your own splitter based on String.indexOf(), for example. You could also create a multi-core solution to improve performance even further, since the tokenization of each string is independent of the others. Work on batches of, let's say, 100 strings per core, and run String.split() or whatever else within each batch.
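For illustration, here's a minimal sketch of what such an indexOf()-based splitter could look like (the class and method names and the single-character delimiter are my assumptions, not anything from the question):

    import java.util.ArrayList;
    import java.util.List;

    public class FastSplit {
        // Split on a single-character delimiter using indexOf(),
        // avoiding the regex machinery behind String.split().
        static List<String> fastSplit(String s, char delimiter) {
            List<String> parts = new ArrayList<>();
            int start = 0;
            int end;
            while ((end = s.indexOf(delimiter, start)) != -1) {
                parts.add(s.substring(start, end));
                start = end + 1;
            }
            parts.add(s.substring(start)); // trailing segment after the last delimiter
            return parts;
        }
    }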
I would recommend Google's Guava Splitter.
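For reference, basic usage looks something like this (assuming Guava is on your classpath; the demo class is mine):

    import com.google.common.base.Splitter;

    public class GuavaSplitterDemo {
        // Configure the Splitter once and reuse it across calls.
        private static final Splitter COMMA = Splitter.on(',');

        public static void main(String[] args) {
            // split() returns an Iterable<String> over the pieces.
            for (String part : COMMA.split("foo,bar,baz")) {
                System.out.println(part);
            }
        }
    }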
I compared it with coobird's test and got the following results:
StringTokenizer: 104
Google Guava Splitter: 142
String.split: 446
regexp: 299
Are you only actually tokenizing on commas? If so, I'd write my own tokenizer - it may well end up being even more efficient than the more general-purpose StringTokenizer, which can look for multiple tokens, and you can make it behave however you'd like. For such a simple use case, it can be a simple implementation.

If it would be useful, you could even implement Iterable<String> and get enhanced-for-loop support with strong typing, instead of the Enumeration support provided by StringTokenizer. Let me know if you want any help coding such a beast up - it really shouldn't be too hard.
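To make that concrete, here's a rough sketch of such an Iterable<String> comma tokenizer (the class name and structure are mine, not a reference implementation):

    import java.util.Iterator;
    import java.util.NoSuchElementException;

    public class CommaTokenizer implements Iterable<String> {
        private final String text;

        public CommaTokenizer(String text) {
            this.text = text;
        }

        @Override
        public Iterator<String> iterator() {
            return new Iterator<String>() {
                private int start = 0;
                private boolean done = false;

                @Override
                public boolean hasNext() {
                    return !done;
                }

                @Override
                public String next() {
                    if (done) {
                        throw new NoSuchElementException();
                    }
                    int comma = text.indexOf(',', start);
                    if (comma == -1) {
                        done = true; // last token runs to the end of the string
                        return text.substring(start);
                    }
                    String token = text.substring(start, comma);
                    start = comma + 1;
                    return token;
                }
            };
        }
    }

With that in place you get the enhanced for loop for free: for (String token : new CommaTokenizer(line)) { ... }.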
Additionally, I'd try running performance tests on your actual data before leaping too far from an existing solution. Do you have any idea how much of your execution time is actually spent in String.split? I know you have a lot of strings to parse, but if you're doing anything significant with them afterwards, I'd expect that to be much more significant than the splitting.
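If you want a quick way to check, a crude timing loop like the one below gives a first estimate (the sample data is a stand-in for your real input; for serious numbers use a proper harness such as JMH, which handles warmup for you):

    import java.util.Arrays;
    import java.util.List;

    public class SplitTiming {
        public static void main(String[] args) {
            // Stand-in data; substitute a representative sample of your strings.
            List<String> lines = Arrays.asList("a,b,c", "d,e,f");

            long start = System.nanoTime();
            long fieldCount = 0;
            for (String line : lines) {
                String[] fields = line.split(",");
                fieldCount += fields.length; // cheap sink so the work isn't optimized away
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(fieldCount + " fields in " + elapsedMs + " ms");
        }
    }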