Efficient string matching algorithm

Happy的楠姐 2020-12-16 07:13

I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.

Here are my requirements:

14 Answers
  • I would use Regex; just make sure the expression is compiled once and reused, instead of being rebuilt on every call.
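
A minimal sketch of the compile-once idea, in Python rather than the .NET `Regex` class the answer probably has in mind. The rule list and the meaning of a leading `*.` (one or more subdomain labels) are assumptions, since the question's requirements are not shown here; `RULES`, `rule_to_pattern`, and `is_match` are hypothetical names.

```python
import re

# Hypothetical rules; the real rule format is not given in the question.
RULES = ["example.com", "*.example.net"]

def rule_to_pattern(rule: str) -> str:
    """Turn one rule into a regex fragment, escaping the literal parts."""
    if rule.startswith("*."):
        # Assumption: "*." stands for one or more subdomain labels.
        return r"(?:[^.]+\.)+" + re.escape(rule[2:])
    return re.escape(rule)

# Built and compiled exactly once, then reused for every lookup.
MATCHER = re.compile("|".join(rule_to_pattern(r) for r in RULES), re.IGNORECASE)

def is_match(host: str) -> bool:
    # fullmatch anchors the pattern at both ends of the host name.
    return MATCHER.fullmatch(host) is not None
```

With a large rule list the single alternation can get slow, which is one reason the tree-based answers further down may scale better.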

  • 2020-12-16 07:46

    I'd try a combination of tries with longest-prefix matching (which is used in routing for IP networking). Directed Acyclic Word Graphs may be more appropriate than tries if space is a concern.
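
A rough sketch of longest-prefix matching on a trie, assuming the keys are domain names split into labels and reversed so that the TLD comes first (the same trick routers use on address bits). `LabelTrie` and the stored values are hypothetical; a DAWG would additionally share common suffixes to save space.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # label -> TrieNode
        self.value = None    # set when a stored key ends at this node

class LabelTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, domain, value):
        node = self.root
        # Reverse the labels so "www.example.com" is walked as com, example, www.
        for label in reversed(domain.lower().split(".")):
            node = node.children.setdefault(label, TrieNode())
        node.value = value

    def longest_prefix(self, host):
        """Return the value of the most specific stored key covering host, or None."""
        node = self.root
        best = None
        for label in reversed(host.lower().split(".")):
            node = node.children.get(label)
            if node is None:
                break
            if node.value is not None:
                best = node.value   # remember the deepest match so far
        return best
```

The lookup cost depends on the number of labels in the host, not on how many rules are stored.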

  • 2020-12-16 07:48

    Not sure what your ideas were for splitting and iterating, but it seems like it wouldn't be slow:

    Split the domains up and reverse, like you said. Storage could essentially be a tree. Use a hashtable to store the TLDs. The key would be, for example, "com", and the value would be a hashtable of the subdomains under that TLD, nested ad nauseam.

  • I would use a tree structure to store the rules, where each tree node is/contains a Dictionary.

    Construct the tree such that "com", "net", etc are the top level entries, "example" is in the next level, and so on. You'll want a special flag to note that the node is a wildcard.

    To perform the lookup, split the string by period, and iterate backwards, navigating the tree based on the input.

    This seems similar to what you say you considered, but assuming the rules don't change each run, using a cached Dictionary-based tree would be faster than a list of arrays.

    Additionally, I would bet that this approach is faster than RegEx.
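
The dictionary-based tree with a wildcard flag might look roughly like this. It is only a sketch: the rule syntax (`*.example.net` meaning any host with at least one extra label in front) is an assumption, and `Node`, `build_tree`, and `matches` are hypothetical names. Python dicts stand in for the `Dictionary` the answer mentions.

```python
class Node:
    def __init__(self):
        self.children = {}      # label -> Node
        self.terminal = False   # an exact rule ends at this node
        self.wildcard = False   # a "*." rule covers everything below this node

def build_tree(rules):
    """Build the tree once; it can then be cached and shared across lookups."""
    root = Node()
    for rule in rules:
        labels = rule.lower().split(".")
        is_wild = labels[0] == "*"
        if is_wild:
            labels = labels[1:]
        node = root
        for label in reversed(labels):   # "com" first, then "example", ...
            node = node.children.setdefault(label, Node())
        if is_wild:
            node.wildcard = True
        else:
            node.terminal = True
    return root

def matches(root, host):
    labels = host.lower().split(".")
    node = root
    for i, label in enumerate(reversed(labels)):
        node = node.children.get(label)
        if node is None:
            return False
        remaining = len(labels) - i - 1
        # A wildcard node matches only if at least one more label follows.
        if node.wildcard and remaining > 0:
            return True
    return node.terminal
```

For example, with the rules `example.com` and `*.example.net`, `www.example.net` matches through the wildcard flag while `www.example.com` does not match at all.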

  • 2020-12-16 07:52

    I'm going to suggest an alternative to the tree structure approach. Create a compressed index of your domain list using a Burrows-Wheeler transform. See http://www.ddj.com/architect/184405504?pgno=1 for a full explanation of the technique.
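
To give a feel for the technique, here is a deliberately naive sketch of a Burrows-Wheeler transform plus the backward search that a BWT-based index uses to count occurrences of a pattern. It is not taken from the linked article; `bwt` and `count_occurrences` are hypothetical names, and a real index would precompute the occurrence counts and use a suffix array instead of sorting rotations.

```python
def bwt(text, sentinel="\0"):
    """Naive BWT: sort all rotations of text + sentinel and keep the last column."""
    s = text + sentinel   # assumption: the sentinel does not occur in text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def count_occurrences(bwt_text, pattern):
    """Count how often pattern occurs in the original text via backward search."""
    # first_row[c]: index of the first row (in sorted order) starting with c
    first_row = {}
    for i, c in enumerate(sorted(bwt_text)):
        first_row.setdefault(c, i)

    def occ(c, i):
        # Occurrences of c in bwt_text[:i]; O(n) here, precomputed in practice.
        return bwt_text[:i].count(c)

    lo, hi = 0, len(bwt_text)   # half-open range of matching rows
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ(c, lo)
        hi = first_row[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For a domain list, one option would be to join all entries with a separator character before transforming, so that a pattern including the separator only matches whole entries.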

  • 2020-12-16 07:53

    You seem to have a well-defined set of rules regarding what you consider to be valid input - you might consider using a hand-written LL parser for this. Such parsers are relatively easy to write and optimize. Usually you'd have the parser output a tree structure describing the input - I would use this tree as input to a matching routine that performs the work of matching the tree against the list of entries, using the rules you described above.

    Here's an article on recursive descent parsers.
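
A tiny hand-written recursive descent parser, as a sketch only. The grammar below is an assumption (the actual input rules are not shown in the question), and a flat list of labels stands in for the tree the answer describes; `HostParser` and `ParseError` are hypothetical names.

```python
import string

LABEL_CHARS = set(string.ascii_lowercase + string.digits + "-")

class ParseError(ValueError):
    pass

class HostParser:
    # Assumed grammar:
    #   host  := label ("." label)*
    #   label := "*" | (letter | digit | "-")+
    def __init__(self, text):
        self.text = text.lower()
        self.pos = 0

    def peek(self):
        # One character of lookahead; "" signals end of input.
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def parse_host(self):
        labels = [self.parse_label()]
        while self.peek() == ".":
            self.pos += 1
            labels.append(self.parse_label())
        if self.pos != len(self.text):
            raise ParseError(f"unexpected {self.peek()!r} at position {self.pos}")
        return labels

    def parse_label(self):
        if self.peek() == "*":
            self.pos += 1
            return "*"
        start = self.pos
        while self.peek() in LABEL_CHARS:
            self.pos += 1
        if start == self.pos:
            raise ParseError(f"expected a label at position {start}")
        return self.text[start:self.pos]
```

The parsed labels could then be reversed and fed to whatever matching routine walks the rule list.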
