Efficient string matching algorithm

I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.

Here are my requirements:

14 Answers

不要未来只要你来

2020-12-16 07:44

I would use Regex; just make sure the expression is compiled once instead of being recompiled on every call.
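
A minimal Python sketch of the compile-once idea. The rule itself (example.com plus its subdomains) and the names `EXAMPLE_RULE` and `matches_rule` are assumptions for illustration, not something the question specifies:

```python
import re

# Hypothetical rule: match example.com and any of its subdomains.
# The pattern is compiled once at import time and reused for every lookup.
EXAMPLE_RULE = re.compile(r"^(?:[a-z0-9-]+\.)*example\.com$", re.IGNORECASE)


def matches_rule(host: str) -> bool:
    """Return True if host is example.com or one of its subdomains."""
    return EXAMPLE_RULE.match(host) is not None
```

Calling `matches_rule("www.example.com")` reuses the precompiled pattern object, so the per-call cost is the match itself rather than pattern parsing.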

醉话见心

2020-12-16 07:46

I'd try a combination of tries with longest-prefix matching (which is used in routing for IP networking). Directed Acyclic Word Graphs may be more appropriate than tries if space is a concern.
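
A rough sketch of a character trie with longest-prefix lookup, in Python. Storing host names reversed with a trailing dot (so "www.example.com" becomes "com.example.www.") is my assumption to keep prefixes aligned on label boundaries; `reverse_host` and the rule names are hypothetical:

```python
class TrieNode:
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.value = None   # payload stored when a key ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def longest_prefix(self, text):
        """Return (key, value) for the longest stored key that is a prefix of text, or None."""
        node = self.root
        best = None
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.value is not None:
                best = (text[: i + 1], node.value)
        return best


def reverse_host(host):
    """Hypothetical helper: 'www.example.com' -> 'com.example.www.'"""
    return ".".join(reversed(host.lower().split("."))) + "."


trie = Trie()
trie.insert(reverse_host("example.com"), "rule-A")
trie.insert(reverse_host("www.example.com"), "rule-B")
```

With this setup, `trie.longest_prefix(reverse_host("mail.example.com"))` falls back to the shorter `example.com` entry, much like a router falling back to a less specific route.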

离开以前

2020-12-16 07:48

Not sure what your ideas were for splitting and iterating, but it seems like it wouldn't be slow:

Split the domains up and reverse them, like you said. Storage could essentially be a tree. Use a hashtable to store the TLDs. The key would be, for example, "com", and the value would be a hashtable of the subdomains under that TLD, with the same structure repeated at every level.
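
Something like this, sketched in Python with plain nested dicts. The `END` sentinel and the function names are my own placeholders, and this version only handles exact matches:

```python
END = object()  # sentinel key marking "a stored domain ends at this level"


def add_domain(root: dict, domain: str) -> None:
    """Insert a domain, walking from the TLD down to the leftmost label."""
    node = root
    for label in reversed(domain.lower().split(".")):
        node = node.setdefault(label, {})
    node[END] = True


def contains(root: dict, domain: str) -> bool:
    """Return True only if this exact domain was inserted."""
    node = root
    for label in reversed(domain.lower().split(".")):
        node = node.get(label)
        if node is None:
            return False
    return END in node


root = {}
add_domain(root, "www.example.com")
add_domain(root, "example.net")
```

Each lookup costs one hash probe per label, so it scales with the depth of the host name rather than with the number of stored domains.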

不要未来只要你来

2020-12-16 07:50

I would use a tree structure to store the rules, where each tree node is/contains a Dictionary.

Construct the tree such that "com", "net", etc. are the top-level entries, "example" is in the next level, and so on. You'll want a special flag to note that a node is a wildcard.

To perform the lookup, split the string by period, and iterate backwards, navigating the tree based on the input.

This seems similar to what you say you considered, but assuming the rules don't change each run, using a cached Dictionary-based tree would be faster than a list of arrays.

Additionally, I would bet that this approach is faster than RegEx.
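
A rough Python sketch of the dictionary tree with a wildcard flag. The wildcard semantics are an assumption on my part: a rule like "*.example.com" is taken to match any host with at least one extra label in front of example.com, but not example.com itself. `RuleNode`, `add_rule`, and `matches` are illustrative names:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RuleNode:
    children: Dict[str, "RuleNode"] = field(default_factory=dict)
    is_rule: bool = False      # an exact rule ends at this node
    is_wildcard: bool = False  # a "*." rule covers every host below this node


def add_rule(root: RuleNode, rule: str) -> None:
    labels = rule.lower().split(".")
    wildcard = labels[0] == "*"
    if wildcard:
        labels = labels[1:]
    node = root
    for label in reversed(labels):  # TLD first, then down the tree
        node = node.children.setdefault(label, RuleNode())
    if wildcard:
        node.is_wildcard = True
    else:
        node.is_rule = True


def matches(root: RuleNode, host: str) -> bool:
    labels = host.lower().split(".")
    node = root
    for i, label in enumerate(reversed(labels)):
        node = node.children.get(label)
        if node is None:
            return False
        remaining = len(labels) - i - 1  # labels still to the left
        if node.is_wildcard and remaining > 0:
            return True
    return node.is_rule


root = RuleNode()
add_rule(root, "*.example.com")
add_rule(root, "example.net")
```

As long as the rules don't change between runs, the tree can be built once and kept in memory, so each lookup is a handful of dictionary probes.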

刺人心

2020-12-16 07:52

I'm going to suggest an alternative to the tree structure approach. Create a compressed index of your domain list using a Burrows-Wheeler transform. See http://www.ddj.com/architect/184405504?pgno=1 for a full explanation of the technique.
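
To give a flavor of the idea, here is a small uncompressed Python sketch: it builds the transform from sorted rotations and counts pattern occurrences with backward search over the last column. It is not taken from the linked article, and a real index would add compression and precomputed rank tables; `SENTINEL`, `bwt`, and `count_occurrences` are names I made up:

```python
from collections import Counter

SENTINEL = "\0"  # assumed absent from the text; sorts before every other character


def bwt(text: str) -> str:
    """Return the Burrows-Wheeler transform (last column of the sorted rotations)."""
    s = text + SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(last_column: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text via backward search."""
    # first_pos[c] = index of the first sorted rotation that starts with c
    counts = Counter(last_column)
    first_pos = {}
    total = 0
    for ch in sorted(counts):
        first_pos[ch] = total
        total += counts[ch]

    def rank(ch: str, i: int) -> int:
        # Number of ch in last_column[:i]; a linear scan, kept simple for clarity.
        return last_column.count(ch, 0, i)

    lo, hi = 0, len(last_column)  # half-open range of matching rotations
    for ch in reversed(pattern):
        if ch not in first_pos:
            return 0
        lo = first_pos[ch] + rank(ch, lo)
        hi = first_pos[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("example.com\nexample.net\nfoo.org")
```

The domain list is joined with newlines here purely as an assumption about how the list would be serialized before indexing.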

闹比i

2020-12-16 07:53

You seem to have a well-defined set of rules regarding what you consider to be valid input - you might consider using a hand-written LL parser for this. Such parsers are relatively easy to write and optimize. Usually you'd have the parser output a tree structure describing the input - I would use this tree as input to a matching routine that performs the work of matching the tree against the list of entries, using the rules you described above.

Here's an article on recursive descent parsers.
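
A minimal hand-written recursive descent sketch in Python. The grammar is purely my assumption, since it depends on your actual rules: `pattern := label ("." label)*` and `label := "*" | [a-z0-9-]+`. The class and method names are illustrative:

```python
import string

LABEL_CHARS = set(string.ascii_lowercase + string.digits + "-")


class ParseError(ValueError):
    pass


class PatternParser:
    """Parses a hypothetical pattern grammar: label ("." label)*"""

    def __init__(self, text: str):
        self.text = text.lower()
        self.pos = 0

    def peek(self) -> str:
        # Current character, or "" at end of input.
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def parse_pattern(self) -> list:
        labels = [self.parse_label()]
        while self.peek() == ".":
            self.pos += 1  # consume the dot
            labels.append(self.parse_label())
        if self.pos != len(self.text):
            raise ParseError(f"unexpected {self.peek()!r} at position {self.pos}")
        return labels

    def parse_label(self) -> str:
        if self.peek() == "*":
            self.pos += 1
            return "*"
        start = self.pos
        while self.peek() in LABEL_CHARS:
            self.pos += 1
        if start == self.pos:
            raise ParseError(f"expected a label at position {self.pos}")
        return self.text[start:self.pos]
```

The list of labels returned by `parse_pattern()` is the simplest possible "tree" here; it could be handed straight to a matching routine that walks the stored entries.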
