Fastest possible string key lookup for known set of keys


Consider a lookup function with the following signature, which needs to return an integer for a given string key:

int GetValue(string key) { ... }


8 Answers

生来不讨喜

2020-12-30 05:00
Consider using the Knuth–Morris–Pratt (KMP) algorithm.

Pre-process the given map into one large string, like this:
```
String string = "{foo:1}{bar:42}{bazz:314159}";
int length = string.length();
```
With KMP, the failure table is built from the search key in O(w) time, where w is the length of the word/key, and the joined string is then scanned in O(length) time, so a single lookup costs O(length + w) overall.

You will need to make two modifications to the KMP algorithm:
- The keys should appear in order in the joined string.
- Instead of returning true/false, it should parse the number and return it.

I hope this gives you some good hints.
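
To make the idea concrete, here is a minimal Java sketch of a KMP scan over the joined string that parses and returns the number after the matched key. The class name `KmpLookup`, the method names, and the choice to search for the delimited pattern `{key:` are my own assumptions for illustration; the sketch also assumes the joined string is well formed and that every value fits in an `int`.

```java
final class KmpLookup {
    // Builds the standard KMP failure table for the pattern.
    static int[] failureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Scans the joined string for "{key:" and parses the digits up to the next '}'.
    // Throws IllegalArgumentException if the key is not present (an assumption of this sketch).
    static int getValue(String joined, String key) {
        String pattern = "{" + key + ":";
        int[] fail = failureTable(pattern);
        int k = 0; // number of pattern characters currently matched
        for (int i = 0; i < joined.length(); i++) {
            while (k > 0 && joined.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (joined.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                int start = i + 1;                     // first digit of the value
                int end = joined.indexOf('}', start);  // closing brace of this entry
                return Integer.parseInt(joined.substring(start, end));
            }
        }
        throw new IllegalArgumentException("unknown key: " + key);
    }
}
```

Including the `{` and `:` delimiters in the pattern keeps a key such as `ba` from matching inside `bar` or `bazz`.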
爱一瞬间的悲伤

2020-12-30 05:02

You've talked about a memory limitation when it comes to precomputation - is there also a time limitation?

I would consider a trie, but one where you didn't necessarily start with the first character. Instead, find the index which will cut down the search space most, and consider that first. So in your sample case ("foo", "bar", "bazz") you'd take the third character, which would immediately tell you which string it was. (If we know we'll always be given one of the input words, we can return as soon as we've found a unique potential match.)
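
As a rough illustration of picking that first index, here is a small Java sketch that counts, for each character position, how many distinct characters the keys have there and returns the position with the most. The names `BestIndex` and `mostDiscriminating` are made up for this example, and treating a key that is too short as having the character `'\0'` at that position is my own assumption (it requires that keys never contain `'\0'`).

```java
final class BestIndex {
    // Returns the character position that splits the keys into the most distinct groups.
    // Ties go to the smallest position.
    static int mostDiscriminating(String[] keys) {
        int maxLen = 0;
        for (String k : keys) {
            maxLen = Math.max(maxLen, k.length());
        }
        int best = 0;
        int bestDistinct = -1;
        for (int i = 0; i < maxLen; i++) {
            boolean[] seen = new boolean[Character.MAX_VALUE + 1];
            int distinct = 0;
            for (String k : keys) {
                // Keys shorter than i are bucketed under the sentinel '\0'.
                char c = i < k.length() ? k.charAt(i) : '\0';
                if (!seen[c]) {
                    seen[c] = true;
                    distinct++;
                }
            }
            if (distinct > bestDistinct) {
                bestDistinct = distinct;
                best = i;
            }
        }
        return best;
    }
}
```

For "foo", "bar" and "bazz" this picks index 2 (the third character), where the keys have 'o', 'r' and 'z' respectively.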

Now, assuming that there isn't a single index which will get you down to a unique string, you need to determine the character to look at after that. In theory you could precompute the trie to work out, for each branch, what the optimal character to look at next is (e.g. "if the third character was 'a', we need to look at the second character next; if it was 'o', we need to look at the first character next"), but that potentially takes a lot more time and space. On the other hand, it could save a lot of time, because having gone down one character, each of the branches may have an index to pick which will uniquely identify the final string, but be a different index each time. The amount of space required by this approach would depend on how similar the strings were, and might be hard to predict in advance.

It would be nice to be able to do this dynamically for all the trie nodes you can, but then, when you find you're running out of construction space, determine a single order for "everything under this node". (So you don't end up storing a "next character index" on each node underneath that node, just the single sequence.) Let me know if this isn't clear, and I can try to elaborate...
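
Here is one way the "each node picks its own index" variant might look in Java. This is only a sketch under several assumptions of mine: keys are distinct, ASCII, and never contain `'\0'`; lookups are only ever made with keys from the original set (so no final verification is done); and the greedy "most distinct characters" rule stands in for whatever "optimal" would really mean. The class name `IndexTrie` and its members are hypothetical, and the fallback to a single fixed order when space runs out is not shown.

```java
final class IndexTrie {
    private static final int FANOUT = 128; // assumption: ASCII keys only

    private final int index;        // character position inspected here; -1 marks a leaf
    private final int value;        // value returned at a leaf
    private final IndexTrie[] next; // children by character; slot 0 means "key ends before index"

    private IndexTrie(int index, int value, IndexTrie[] next) {
        this.index = index;
        this.value = value;
        this.next = next;
    }

    private static char charAt(String key, int i) {
        return i < key.length() ? key.charAt(i) : '\0';
    }

    // Builds the tree for distinct keys; values[i] belongs to keys[i].
    static IndexTrie build(String[] keys, int[] values) {
        if (keys.length == 1) {
            return new IndexTrie(-1, values[0], null);
        }
        int maxLen = 0;
        for (String k : keys) maxLen = Math.max(maxLen, k.length());
        // Greedily choose the position with the most distinct characters.
        int best = 0, bestDistinct = 0;
        for (int i = 0; i < maxLen; i++) {
            boolean[] seen = new boolean[FANOUT];
            int distinct = 0;
            for (String k : keys) {
                char c = charAt(k, i);
                if (!seen[c]) { seen[c] = true; distinct++; }
            }
            if (distinct > bestDistinct) { bestDistinct = distinct; best = i; }
        }
        // Partition the keys by their character at the chosen position and recurse.
        IndexTrie[] next = new IndexTrie[FANOUT];
        for (int c = 0; c < FANOUT; c++) {
            int n = 0;
            for (String k : keys) if (charAt(k, best) == c) n++;
            if (n == 0) continue;
            String[] subKeys = new String[n];
            int[] subValues = new int[n];
            int j = 0;
            for (int i = 0; i < keys.length; i++) {
                if (charAt(keys[i], best) == c) {
                    subKeys[j] = keys[i];
                    subValues[j] = values[i];
                    j++;
                }
            }
            next[c] = build(subKeys, subValues);
        }
        return new IndexTrie(best, 0, next);
    }

    // Only valid for keys that were passed to build().
    int get(String key) {
        IndexTrie node = this;
        while (node.index >= 0) {
            node = node.next[charAt(key, node.index)];
        }
        return node.value;
    }
}
```

With "foo", "bar" and "bazz" the root looks at index 2 and every child is already a leaf, so a lookup inspects a single character. A full 128-slot array per node is wasteful for small branches, which is where the choice of node representation comes in.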

How you represent the trie will depend on the range of input characters. If they're all in the range 'a'-'z' then a simple array would be incredibly fast to navigate, and reasonably efficient for trie nodes where there are possibilities for most of the available options. Later on, when there are only two or three possible branches, that becomes wasteful in memory. I would suggest a polymorphic Trie node class, such that you can build the most appropriate type of node depending on how many sub-branches there are.
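
A polymorphic node along those lines could be sketched in Java as follows. All names here (`TrieNode`, `SparseNode`, `ArrayNode`, `of`) are hypothetical, the input range is assumed to be 'a'-'z', and the cut-off of four branches is an arbitrary value that would need measuring.

```java
abstract class TrieNode {
    int value = -1; // hypothetical payload: the value if a key ends at this node

    // Returns the child for character c, or null if there is no such branch.
    abstract TrieNode child(char c);

    // Picks a representation from the number of branches.
    static TrieNode of(char[] labels, TrieNode[] children) {
        return labels.length <= 4
                ? new SparseNode(labels, children)
                : new ArrayNode(labels, children);
    }
}

// Few branches: a short linear scan over parallel arrays, compact in memory.
final class SparseNode extends TrieNode {
    private final char[] labels;
    private final TrieNode[] children;

    SparseNode(char[] labels, TrieNode[] children) {
        this.labels = labels;
        this.children = children;
    }

    @Override
    TrieNode child(char c) {
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == c) {
                return children[i];
            }
        }
        return null;
    }
}

// Many branches: one slot per letter, so navigation is a single array access.
final class ArrayNode extends TrieNode {
    private final TrieNode[] slots = new TrieNode[26];

    ArrayNode(char[] labels, TrieNode[] children) {
        for (int i = 0; i < labels.length; i++) {
            slots[labels[i] - 'a'] = children[i];
        }
    }

    @Override
    TrieNode child(char c) {
        int i = c - 'a';
        return (i >= 0 && i < 26) ? slots[i] : null;
    }
}
```

The lookup code only ever calls `child`, so it doesn't need to know which kind of node it is walking through.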

None of this performs any culling - it's not clear how much can be achieved by culling quickly. One situation where I can see it helping: when the number of branches from one trie node drops to 1 (because of the removal of a branch which is exhausted), that branch can be eliminated completely. Over time this could make a big difference, and shouldn't be too hard to compute. Basically, as you build the trie you can predict how many times each branch will be taken, and as you navigate the trie you can subtract one from that branch's count each time you go down it.

That's all I've come up with so far, and it's not exactly a full implementation - but I hope it helps...
