String searching algorithms in Java

不知归路

I am doing string matching on a large amount of data.

EDIT: I am matching the words in a big list against some ontology text files. I take each file from the ontology,

5 Answers

梦谈多话

2021-01-01 08:00

You can use the Boyer-Moore (BM) algorithm to search the text files for a single pattern, and repeat it for every pattern in your list.
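As a rough illustration of the single-pattern approach, here is a minimal sketch of the Horspool simplification of Boyer-Moore (bad-character shifts only, not the full algorithm with the good-suffix rule). The class name HorspoolSearch and the method findAll are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: Boyer-Moore-Horspool, a simplified Boyer-Moore variant.
class HorspoolSearch {
    /** Returns the start index of every occurrence of pattern in text. */
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || m > n) {
            return hits;
        }
        // Bad-character table: distance from the last occurrence of each
        // character (excluding the final pattern position) to the pattern end.
        Map<Character, Integer> shift = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            shift.put(pattern.charAt(i), m - 1 - i);
        }
        int pos = 0;
        while (pos <= n - m) {
            // Compare right to left, as in Boyer-Moore.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) {
                j--;
            }
            if (j < 0) {
                hits.add(pos);
            }
            // Shift by the table entry for the text character aligned with
            // the last pattern position; unknown characters skip m positions.
            pos += shift.getOrDefault(text.charAt(pos + m - 1), m);
        }
        return hits;
    }
}
```

Calling HorspoolSearch.findAll("abracadabra", "abra") returns [0, 7]. Running this once per word is simple, but the total cost grows with the number of words in the list.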

Another, better solution is to use a multi-pattern search algorithm such as the Aho–Corasick string matching algorithm.
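To make the multi-pattern idea concrete, here is a compact sketch of Aho–Corasick: build a trie of all the words, add failure links with a breadth-first pass, then scan the text once. The class name AhoCorasickSketch and the "pattern@index" result format are assumptions made for this example, not part of any library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical helper class illustrating Aho-Corasick multi-pattern search.
class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                       // failure link
        final List<String> outputs = new ArrayList<>();  // patterns ending here
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> patterns) {
        // Step 1: insert every non-empty pattern into a trie.
        for (String p : patterns) {
            if (p.isEmpty()) {
                continue;
            }
            Node node = root;
            for (char c : p.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(p);
        }
        // Step 2: breadth-first pass that sets failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> e : current.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = current.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit patterns that end at the failure target.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Returns "pattern@startIndex" for every occurrence found in text. */
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node nextNode = node.next.get(c);
            node = (nextNode != null) ? nextNode : root;
            for (String p : node.outputs) {
                hits.add(p + "@" + (i - p.length() + 1));
            }
        }
        return hits;
    }
}
```

With the patterns he, she, his and hers, searching "ushers" returns [she@1, he@2, hers@2]. The text is scanned once regardless of how many words are in the list, which is the main advantage over repeating a single-pattern search.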

庸人自扰

2021-01-01 08:01

You might find Suffix Trees useful (they are similar in concept to Tries).

Prepend ^ and append $ to each string, then create a suffix tree of all the strings concatenated together. Space usage will be O(n) and will probably be worse than what you had for the trie.

If you now need to search for a string s, you can easily do so in O(|s|) time, just as with a trie, and the match you get will be a substring match (basically, you will be matching a prefix of some suffix of some string).
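To show why lookups take O(|s|) steps, here is a deliberately naive sketch: an uncompressed suffix trie that inserts every suffix of every string. This is not a real suffix tree and not Ukkonen's algorithm; it needs quadratic space in the worst case and is only meant to illustrate the lookup. Because each string is inserted separately, the ^ and $ delimiters are not needed here. The class name NaiveSuffixTrie is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: uncompressed generalized suffix trie.
// Worst-case quadratic space; a real suffix tree compresses paths to reach O(n).
class NaiveSuffixTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        // Indices of the input strings that contain the path to this node.
        final Set<Integer> sourceIds = new TreeSet<>();
    }

    private final Node root = new Node();

    NaiveSuffixTrie(List<String> strings) {
        for (int id = 0; id < strings.size(); id++) {
            String s = strings.get(id);
            // Insert every suffix s[start..] of the string.
            for (int start = 0; start < s.length(); start++) {
                Node node = root;
                for (int i = start; i < s.length(); i++) {
                    node = node.children.computeIfAbsent(s.charAt(i), k -> new Node());
                    node.sourceIds.add(id);
                }
            }
        }
    }

    /** Returns the indices of all strings that contain query as a substring. */
    Set<Integer> find(String query) {
        Node node = root;
        // One step per query character: O(|query|) map lookups.
        for (int i = 0; i < query.length(); i++) {
            node = node.children.get(query.charAt(i));
            if (node == null) {
                return new TreeSet<>();
            }
        }
        return new TreeSet<>(node.sourceIds);
    }
}
```

Built from the strings banana, bandana and apple, find("ana") returns [0, 1] and find("ppl") returns [2]. For large inputs a proper implementation, such as the one linked in this answer, is the better choice.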

~~Sorry, I don't have a reference to a Java implementation handy.~~

Found a useful stackoverflow answer: Generalized Suffix Tree Java Implementation

Which has: http://illya-keeplearning.blogspot.com/2009/04/suffix-trees-java-ukkonens-algorithm.html

Which in turn has: Source Code: http://illya.yolasite.com/resources/suffix-tree.zip

长发绾君心

2021-01-01 08:04

Why don't you use the indexOf method in Java? Read as much of the content as available memory allows, run indexOf on it to find all the lines you need, then load the next chunk of content.

If you are reading from a file, use the NIO APIs (java.nio).

Maybe the idea is bad, but I believe in Java; it will use the best algorithm.

Even better, use a regular expression.
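A minimal sketch of the indexOf approach, processing the content line by line so that only a small part is in memory at a time. The class name IndexOfScan and the "line:column:word" result format are invented for this example; the lines could come from java.nio.file.Files.readAllLines, or from Files.lines when the file is too large to load at once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: scans lines with String.indexOf, one word at a time.
class IndexOfScan {
    /** Returns "lineNumber:column:word" for every occurrence of every word. */
    static List<String> scan(Iterable<String> lines, List<String> words) {
        List<String> hits = new ArrayList<>();
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;
            for (String word : words) {
                if (word.isEmpty()) {
                    continue; // an empty word would match at every position
                }
                int from = 0;
                int at;
                // indexOf returns -1 when there is no further occurrence.
                while ((at = line.indexOf(word, from)) >= 0) {
                    hits.add(lineNo + ":" + at + ":" + word);
                    from = at + 1; // allow overlapping occurrences
                }
            }
        }
        return hits;
    }
}
```

For the two lines "cat and dog" and "concat cat" with the words cat and dog, this returns [1:0:cat, 1:8:dog, 2:3:cat, 2:7:cat]. Note that it finds substrings, so cat also matches inside concat, and every line is scanned once per word.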

被撕碎了的回忆

2021-01-01 08:05

I'm not entirely sure I understood the question correctly, but it sounds like regular expressions would do the job:

http://java.sun.com/developer/technicalArticles/releases/1.4regex/
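One way this could look with java.util.regex is to combine the word list into a single alternation. The sketch below assumes whole-word, case-insensitive matching is wanted, which the question does not state; the class name RegexWordMatcher and the "match@index" result format are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: matches any word from a list with one compiled regex.
class RegexWordMatcher {
    /** Builds \b(?:word1|word2|...)\b from a non-empty list of words. */
    static Pattern compile(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)              // treat each word literally
                .collect(Collectors.joining("|"));
        // Assumption: whole words only, ignoring case.
        return Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
    }

    /** Returns "matchedText@startIndex" for every match in text. */
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group() + "@" + m.start());
        }
        return hits;
    }
}
```

With the words gene and protein, the text "Gene X encodes a protein; genes vary." yields [Gene@0, protein@17]; genes is skipped because of the word-boundary check. Compile the pattern once and reuse it across files. For a very long word list the alternation may become slow, since java.util.regex is a backtracking engine.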

悲哀的现实

2021-01-01 08:21

Regular expressions are definitely your best bet. They can be a little messy to write, but they're the only way to get looser matching without an incomprehensible series of if/else or switch statements.

Plus, they'll be a lot faster than the alternative.
