Encouraged by this, and the fact that I have billions of strings to parse, I tried to modify my code to accept StringTokenizer instead of String[].
The only t
Depending on what kind of strings you need to tokenize, you can write your own splitter based on String.indexOf(), for example. You could also create a multi-core solution to improve performance even further, since the tokenization of each string is independent of the others. Work on batches of, let's say, 100 strings per core, and run String.split() or whatever else within each batch.
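For illustration, here's a minimal sketch of what such an indexOf()-based splitter could look like (the class and method names and the single-character delimiter are my assumptions, not anything from the question):

    import java.util.ArrayList;
    import java.util.List;

    public class FastSplit {
        // Split on a single-character delimiter using indexOf(),
        // avoiding the regex machinery behind String.split().
        static List<String> fastSplit(String s, char delimiter) {
            List<String> parts = new ArrayList<>();
            int start = 0;
            int end;
            while ((end = s.indexOf(delimiter, start)) != -1) {
                parts.add(s.substring(start, end));
                start = end + 1;
            }
            parts.add(s.substring(start)); // trailing segment after the last delimiter
            return parts;
        }
    }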
I would recommend Google's Guava Splitter.
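For reference, basic usage looks something like this (assuming Guava is on your classpath; the demo class is mine):

    import com.google.common.base.Splitter;

    public class GuavaSplitterDemo {
        // Configure the Splitter once and reuse it across calls.
        private static final Splitter COMMA = Splitter.on(',');

        public static void main(String[] args) {
            // split() returns an Iterable<String> over the pieces.
            for (String part : COMMA.split("foo,bar,baz")) {
                System.out.println(part);
            }
        }
    }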
I compared it with coobird's test and got the following results:
StringTokenizer: 104
Google Guava Splitter: 142
String.split: 446
regexp: 299
Are you only actually tokenizing on commas? If so, I'd write my own tokenizer - it may well end up being even more efficient than the more general-purpose StringTokenizer, which can look for multiple tokens, and you can make it behave however you'd like. For such a simple use case, it can be a simple implementation.

If it would be useful, you could even implement Iterable<String> and get enhanced-for-loop support with strong typing, instead of the Enumeration support provided by StringTokenizer. Let me know if you want any help coding such a beast up - it really shouldn't be too hard.
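To make that concrete, here's a rough sketch of such an Iterable<String> comma tokenizer (the class name and structure are mine, not a reference implementation):

    import java.util.Iterator;
    import java.util.NoSuchElementException;

    public class CommaTokenizer implements Iterable<String> {
        private final String text;

        public CommaTokenizer(String text) {
            this.text = text;
        }

        @Override
        public Iterator<String> iterator() {
            return new Iterator<String>() {
                private int start = 0;
                private boolean done = false;

                @Override
                public boolean hasNext() {
                    return !done;
                }

                @Override
                public String next() {
                    if (done) {
                        throw new NoSuchElementException();
                    }
                    int comma = text.indexOf(',', start);
                    if (comma == -1) {
                        done = true; // last token runs to the end of the string
                        return text.substring(start);
                    }
                    String token = text.substring(start, comma);
                    start = comma + 1;
                    return token;
                }
            };
        }
    }

With that in place you get the enhanced for loop for free: for (String token : new CommaTokenizer(line)) { ... }.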
Additionally, I'd try running performance tests on your actual data before leaping too far from an existing solution. Do you have any idea how much of your execution time is actually spent in String.split? I know you have a lot of strings to parse, but if you're doing anything significant with them afterwards, I'd expect that to be much more significant than the splitting.
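If you want a quick way to check, a crude timing loop like the one below gives a first estimate (the sample data is a stand-in for your real input; for serious numbers use a proper harness such as JMH, which handles warmup for you):

    import java.util.Arrays;
    import java.util.List;

    public class SplitTiming {
        public static void main(String[] args) {
            // Stand-in data; substitute a representative sample of your strings.
            List<String> lines = Arrays.asList("a,b,c", "d,e,f");

            long start = System.nanoTime();
            long fieldCount = 0;
            for (String line : lines) {
                String[] fields = line.split(",");
                fieldCount += fields.length; // cheap sink so the work isn't optimized away
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(fieldCount + " fields in " + elapsedMs + " ms");
        }
    }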