Fastest possible string key lookup for known set of keys


You could use a Boyer–Moore search, but I think a trie would be a much more efficient method. You can modify the trie to collapse words as their hit counts reach zero, reducing the number of searches you have to do the further down the line you get. The biggest benefit you get is that you are doing array lookups for the indexes, which is much faster than a comparison.
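As a minimal sketch of that idea (the layout and names are mine, assuming keys are lowercase ASCII and that per-key hit counts are known when the trie is built):

#include <memory>

// One node per prefix; following a child is a plain array index, not a comparison.
struct TrieNode {
    int value = -1;                       // payload if a key ends here
    int hitsLeft = 0;                     // expected remaining lookups through this node
    std::unique_ptr<TrieNode> kids[26];   // assumes keys use only 'a'..'z'
};

int Lookup(TrieNode *node, const char *key) {
    for (; *key != '\0'; ++key) {
        node = node->kids[*key - 'a'].get();  // one array lookup per character
        --node->hitsLeft;                     // a branch at zero can be collapsed
    }
    return node->value;
}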

You've talked about a memory limitation when it comes to precomputation - is there also a time limitation?

I would consider a trie, but one where you didn't necessarily start with the first character. Instead, find the index which will cut down the search space most, and consider that first. So in your sample case ("foo", "bar", "bazz") you'd take the third character, which would immediately tell you which string it was. (If we know we'll always be given one of the input words, we can return as soon as we've found a unique potential match.)

Now assuming that there isn't a single index which will get you down to a unique string, you need to determine the character to look at after that. In theory you precompute the trie to work out, for each branch, the optimal character to look at next (e.g. "if the third character was 'a', we need to look at the second character next; if it was 'o' we need to look at the first character next"), but that potentially takes a lot more time and space. On the other hand, it could save a lot of time - because having gone down one character, each of the branches may have an index to pick which will uniquely identify the final string, but be a different index each time.

The amount of space required by this approach would depend on how similar the strings were, and might be hard to predict in advance. It would be nice to be able to do this dynamically for all the trie nodes you can, but when you find you're running out of construction space, determine a single order for "everything under this node" (so you don't end up storing a "next character index" on each node underneath that node, just the single sequence). Let me know if this isn't clear, and I can try to elaborate...
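As a rough sketch of choosing that first index (a helper of my own invention, assuming keys are padded to equal length so every index is valid for every key):

#include <set>
#include <string>
#include <vector>

// Return the index whose column has the most distinct characters,
// i.e. the one that splits the key set into the most groups.
int MostDiscriminatingIndex(const std::vector<std::string> &keys) {
    int best = 0;
    size_t bestDistinct = 0;
    for (size_t i = 0; i < keys[0].size(); ++i) {
        std::set<char> distinct;
        for (const std::string &k : keys) distinct.insert(k[i]);
        if (distinct.size() > bestDistinct) {
            bestDistinct = distinct.size();
            best = (int)i;
        }
    }
    return best;   // for {"foo ", "bar ", "bazz"} this picks index 2
}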

How you represent the trie will depend on the range of input characters. If they're all in the range 'a'-'z' then a simple array would be incredibly fast to navigate, and reasonably efficient for trie nodes where there are possibilities for most of the available options. Later on, when there are only two or three possible branches, that becomes wasteful in memory. I would suggest a polymorphic Trie node class, such that you can build the most appropriate type of node depending on how many sub-branches there are.
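A sketch of that polymorphic node idea (the types and names are mine): a dense array node where most branches exist, and a compact linear-scan node where only a few do:

// Both node types answer the same question: which child follows character c?
struct Node {
    virtual Node *Child(char c) const = 0;
    virtual ~Node() {}
};

// Dense node: one slot per letter; fastest, but wasteful when sparse.
struct ArrayNode : Node {
    Node *kids[26] = {};   // assumes 'a'..'z'
    Node *Child(char c) const override { return kids[c - 'a']; }
};

// Sparse node: a handful of branches scanned linearly; far more compact.
struct SmallNode : Node {
    char labels[3] = {};
    Node *kids[3] = {};
    int count = 0;
    Node *Child(char c) const override {
        for (int i = 0; i < count; ++i)
            if (labels[i] == c) return kids[i];
        return nullptr;
    }
};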

None of this performs any culling - it's not clear how much can be achieved by culling quickly. One situation where I can see it helping is when the number of branches from one trie node drops to 1 (because of the removal of a branch which is exhausted); that node can then be eliminated completely. Over time this could make a big difference, and it shouldn't be too hard to compute: as you build the trie you can predict how many times each branch will be taken, and as you navigate the trie you subtract one from that count for each branch as you take it.
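A sketch of that bookkeeping (the structure and names are mine; ownership and the single-branch collapse are elided for brevity):

// Each edge stores how many future lookups are predicted to cross it.
struct CountedTrie {
    int value = -1;
    int edgeUses[26] = {};        // predicted traversals per branch
    CountedTrie *kids[26] = {};   // assumes 'a'..'z'
};

// Walk one edge; an edge whose count reaches zero will never be taken
// again, so its branch can be detached on the spot.
CountedTrie *Descend(CountedTrie *node, char c) {
    int i = c - 'a';
    CountedTrie *child = node->kids[i];
    if (--node->edgeUses[i] == 0)
        node->kids[i] = nullptr;  // prune; caller still holds child for this lookup
    return child;
}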

That's all I've come up with so far, and it's not exactly a full implementation - but I hope it helps...

Is a binary search of the table really so awful? I would take the list of potential strings, "minimize" them, then sort them, and finally do a binary search upon the block of them.

By minimize I mean reducing them to the minimum they need to be, kind of a custom stemming.

For example if you had the strings: "alfred", "bob", "bill", "joe", I'd knock them down to "a", "bi", "bo", "j".
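A sketch of that minimization (the helper names are mine): sort the keys, then keep each key's prefix one character longer than its longest common prefix with either neighbour:

#include <algorithm>
#include <string>
#include <vector>

std::vector<std::string> Minimize(std::vector<std::string> keys) {
    std::sort(keys.begin(), keys.end());
    auto lcp = [](const std::string &a, const std::string &b) {
        size_t i = 0;
        while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
        return i;
    };
    std::vector<std::string> out;
    for (size_t i = 0; i < keys.size(); ++i) {
        size_t need = 1;   // always keep at least one character
        if (i > 0)               need = std::max(need, lcp(keys[i - 1], keys[i]) + 1);
        if (i + 1 < keys.size()) need = std::max(need, lcp(keys[i], keys[i + 1]) + 1);
        out.push_back(keys[i].substr(0, need));
    }
    return out;   // e.g. {"alfred","bob","bill","joe"} -> {"a","bi","bo","j"} (sorted)
}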

Then put those into a contiguous block of memory, for example:

const char *table = "a\0bi\0bo\0j\0"; // trailing \0 is redundant (the literal adds its own terminator)
const char *keys[4];
keys[0] = table;     // "a"
keys[1] = table + 2; // "bi"
keys[2] = table + 5; // "bo"
keys[3] = table + 8; // "j"

Ideally the compiler would do all this for you if you simply go:

keys[0] = "a";
keys[1] = "bi";
keys[2] = "bo";
keys[3] = "j";

But I can't say if that's true or not.

Now you can bsearch that table, and the keys are as short as possible. If you hit the end of a table key with everything matching so far, you have a match. If not, then follow the standard bsearch algorithm.
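A sketch of that search, assuming the query is always one of the known keys (with arbitrary input, a prefix hit could be a false positive):

#include <cstring>

// Binary search where reaching the end of a table key counts as a match.
int Find(const char *const keys[], int n, const char *query) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        size_t len = std::strlen(keys[mid]);
        int cmp = std::strncmp(query, keys[mid], len);
        if (cmp == 0) return mid;   // table key exhausted: match
        if (cmp < 0) hi = mid - 1;
        else         lo = mid + 1;
    }
    return -1;   // not one of the known keys
}

// With the table above, Find(keys, 4, "bob") returns 2.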

The goal is to get all of the data close together and keep the code itty bitty so that it all fits in to the CPU cache. You can process the key from the program directly, no pre-processing or adding anything up.

For a reasonably large number of keys that are reasonably distributed, I think this would be quite fast. It really depends on the number of strings involved. For smaller numbers, the overhead of computing hash values and the like costs more than a search like this; for larger numbers, it's worth it. Just where those numbers fall depends on the algorithms involved.

This, however, is likely the smallest solution in terms of memory, if that's important.

This also has the benefit of simplicity.

Addenda:

You don't have any specification of the inputs beyond 'strings'. There's also no discussion of how many strings you expect to use, their length, their commonality, or their frequency of use. These can perhaps all be derived from the "source", but they can't be planned for by the algorithm designer. You're asking for an algorithm that creates something like this:

inline int GetValue(char *key) {
    return 1234;
}

That covers everything from a small program that happens to use only one key all the time, all the way up to something that creates a perfect hash algorithm for millions of strings. That's a pretty tall order.

Any design going after "squeezing every single bit of performance possible" needs to know more about the inputs than "any and all strings". That problem space is simply too large if you want it the fastest possible for any condition.

An algorithm that handles strings with extremely long identical prefixes might be quite different than one that works on completely random strings. The algorithm could say "if the key starts with "a", skip the next 100 chars, since they're all a's".

But if these strings are sourced from human beings, and they're using long strings of the same letters without going insane trying to maintain that data, then when they complain that the algorithm is performing badly, you reply, "you're doing silly things; don't do that". But we don't know the source of these strings either.

So, you need to pick a problem space to target the algorithm. We have all sorts of algorithms that ostensibly do the same thing because they address different constraints and work better in different situations.

Hashing is expensive, and laying out hashmaps is expensive. If there's not enough data involved, there are better techniques than hashing. If you have a large memory budget, you could make an enormous state machine, based upon N states per node (N being your character set size -- which you don't specify -- BAUDOT? 7-bit ASCII? UTF-32?). That will run very quickly, unless the memory consumed by the states smashes the CPU cache or squeezes out other things.
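A sketch of such a state machine with N = 256 (one byte consumed per step); the table layout and names are my own:

#include <vector>

// One row of 256 next-state slots per state, plus a value per state.
struct Dfa {
    std::vector<int> next;    // numStates * 256 entries; -1 = no transition
    std::vector<int> value;   // value[s] is the answer if the key ends in state s
};

int GetValue(const Dfa &dfa, const unsigned char *key) {
    int s = 0;                          // state 0 is the start state
    for (; *key != '\0'; ++key) {
        s = dfa.next[s * 256 + *key];   // one table load per character
        if (s < 0) return -1;           // not a known key
    }
    return dfa.value[s];
}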

You could possibly generate code for all of this, but you may run into code-size limits (you don't say what language, either -- Java has a 64K method bytecode limit, for example).

But you don't specify any of these constraints. So, it's kind of hard to get the most performant solution for your needs.

What you want is a look-up table of look-up tables. If memory cost is not an issue you can go all out.

const int POSSIBLE_CHARCODES = 256; // 256 for ASCII; 65536 for 16-bit Unicode

struct LutMap {
    int value;                        // value to return if the key ends here
    LutMap *next[POSSIBLE_CHARCODES]; // one slot per possible next character
};

extern LutMap *AlreadyCreatedLutMap;  // root of the table, built ahead of time

int GetValue(const char *key) {
    LutMap *node = AlreadyCreatedLutMap;
    for (int x = 0; key[x] != '\0'; x++) {
        unsigned char c = (unsigned char)key[x];
        if (node->next[c] == nullptr) {
            return node->value;       // no deeper entry: longest match ends here
        }
        node = node->next[c];
    }
    return node->value;               // consumed the whole key
}

I reckon that it's all about finding the right hash function. As long as you know what the key-value relationship is in advance, you can do an analysis to try and find a hash function to meet your requirements. Taking the example you've provided, treat the input strings as binary integers:

foo  = 0x666F6F (hex value)
bar  = 0x626172
bazz = 0x62617A7A

The last hex digit, which is present in all of them, is different in each. Analyse it further:

foo  = 0xF = 1111
bar  = 0x2 = 0010
bazz = 0xA = 1010

Bit-shift to the right twice, discarding overflow, you get a distinct value for each of them:

foo  = 0011
bar  = 0000
bazz = 0010

Bit-shift to the right twice again, adding the overflow to a new buffer:

foo  = 0010
bar  = 0000
bazz = 0001

You can use those to query a static 3-entry lookup table. I reckon this highly personal hash function would take 9 very basic operations: get the nibble (2), bit-shift (2), bit-shift and add (4) and query (1), and a lot of these operations can be compressed further through clever assembly usage. This might well be faster than taking run-time information into account.
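To make the arithmetic concrete, here is a sketch for exactly these three keys, assuming the foo=1, bar=42, bazz=314159 values used elsewhere in this thread; the first shift stage alone already yields distinct values (3, 0, 2), so this version stops there:

#include <cstring>

// Low nibble of the last character, shifted right twice, as derived above:
// 'o' -> 0xF >> 2 = 3, 'r' -> 0x2 >> 2 = 0, 'z' -> 0xA >> 2 = 2.
static const int table[4] = { 42, 0, 314159, 1 };   // bar, (unused), bazz, foo

int GetValue(const char *key) {
    std::size_t len = std::strlen(key);
    unsigned h = ((unsigned char)key[len - 1] & 0xFu) >> 2;
    return table[h];
}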

Have you looked at TCB? Perhaps the algorithm used there can be used to retrieve your values. It sounds a lot like the problem you are trying to solve, and from experience I can say TCB is one of the fastest key-store lookups I have used. It gives constant lookup time, regardless of the number of keys stored.

Consider using the Knuth–Morris–Pratt algorithm.

Pre-process the given map into one large string, like below:

String string = "{foo:1}{bar:42}{bazz:314159}";
int length = string.length();

Per KMP, preprocessing the string takes O(length) time, and searching for any word/key takes O(w), where w is the length of the word/key.

You will need to make two modifications to the KMP algorithm:

  • keys should appear in order in the joined string
  • instead of returning true/false, it should parse the number after the match and return it
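A minimal sketch of the lookup over that joined string, with std::string::find standing in for a hand-rolled KMP matcher (names are mine):

#include <cstdlib>
#include <string>

int GetValue(const std::string &joined, const std::string &key) {
    std::string pattern = "{" + key + ":";   // entries are stored as {key:value}
    std::string::size_type pos = joined.find(pattern);
    if (pos == std::string::npos) return -1; // unknown key
    return std::atoi(joined.c_str() + pos + pattern.size());
}

// GetValue("{foo:1}{bar:42}{bazz:314159}", "bazz") returns 314159.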

I hope this gives some good hints.

Here's a feasible approach to determine the smallest subset of chars to target for your hash routine:

let:

  • k be the number of distinct chars across all your keywords
  • c be the max keyword length
  • n be the number of keywords

In your example (shorter keywords padded with spaces):

"foo "
"bar "
"bazz"

k = 7 (f, o, b, a, r, z, space), c = 4, n = 3

We can use this to compute a lower bound for our search: we need at least log_k(n) chars to uniquely identify a keyword, so if log_k(n) >= c you'll need the whole keyword and there's no reason to proceed. (Here ceil(log_7(3)) = 1, so a single well-chosen character could suffice.)

Next, eliminate one column at a time and check if there are still n distinct values remaining. Use the distinct chars in each column as a heuristic to optimize our search:

2 2 3 2   <- distinct chars per column
f o o .
b a r .
b a z z

Eliminate the columns with the fewest distinct chars first. If you have <= log_k(n) columns remaining, you can stop. Optionally, you could randomize a bit: eliminate the second-lowest-distinct column instead, or backtrack when an elimination leaves fewer than n distinct words. This algorithm is roughly O(n!) depending on how much recovery you attempt. It's not guaranteed to find an optimal solution, but it's a good tradeoff.
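A sketch of that greedy elimination, without the randomized recovery step (the function names are mine; keys are assumed padded to equal length):

#include <algorithm>
#include <set>
#include <string>
#include <vector>

// How many keys stay distinguishable using only the given columns?
static size_t DistinctProjections(const std::vector<std::string> &keys,
                                  const std::vector<int> &cols) {
    std::set<std::string> seen;
    for (const std::string &k : keys) {
        std::string proj;
        for (int c : cols) proj += k[c];
        seen.insert(proj);
    }
    return seen.size();
}

// Try to drop columns in ascending order of distinct-char count, keeping
// a drop only if every key remains distinguishable afterwards.
std::vector<int> PickColumns(const std::vector<std::string> &keys) {
    std::vector<int> cols;
    for (size_t i = 0; i < keys[0].size(); ++i) cols.push_back((int)i);
    std::vector<int> order = cols;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        std::set<char> da, db;
        for (const std::string &k : keys) { da.insert(k[a]); db.insert(k[b]); }
        return da.size() < db.size();
    });
    for (int c : order) {
        std::vector<int> trial;
        for (int x : cols) if (x != c) trial.push_back(x);
        if (DistinctProjections(keys, trial) == keys.size())
            cols = trial;   // column was redundant; drop it
    }
    return cols;   // for {"foo ","bar ","bazz"} this leaves just column 2
}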

Once you have your subset of chars, proceed with the usual routines for generating a perfect hash. The result should be an optimal perfect hash.
