Question
Suppose I have a large, static set of objects, and I have an object that I want to match against all of them according to a complicated set of criteria that entails an expensive test.
Suppose also that it's possible to identify a large set of features that can be used to exclude potential matches, thereby avoiding the expensive test. If a feature is present in the object I am testing, then I can exclude any objects in the set that don't have this feature. In other words, the presence of the feature is necessary but not sufficient for the test to pass.
In that case, I can precompute a bitmask for each object in the set indicating whether each feature is present or absent in the object. I can also compute it for the object that I want to test, and then loop through the array like this (pseudo-code):
objectMask = computeObjectMask(myObject)
for(each testObject in objectSet)
{
    if((testObject.mask & objectMask) != objectMask)
    {
        // early out: some features are in objectMask
        // but not in testObject.mask, so the test can't pass
    }
    else if(doComplicatedTest(testObject, myObject))
    {
        // found a match!
    }
}
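To make the pseudo-code concrete, here is a minimal Python sketch of the same prefilter. The feature predicates, the sample objects, and do_complicated_test are placeholder assumptions chosen purely for illustration.

# Hypothetical sketch of the mask-based prefilter described above.
# The feature predicates and do_complicated_test are illustrative placeholders.

# Each feature is a predicate over an object; one bit per feature.
FEATURES = [
    lambda s: "D" in s,                               # contains the literal character D
    lambda s: "F" in s and "G" in s[s.index("F"):],   # contains F followed later by G
    lambda s: len(s) > 10,                            # longer than 10 characters
]

def compute_object_mask(obj):
    # Set bit i if feature i is present in obj.
    mask = 0
    for i, feature in enumerate(FEATURES):
        if feature(obj):
            mask |= 1 << i
    return mask

def do_complicated_test(test_obj, my_obj):
    # Placeholder for the expensive test; here, plain substring containment,
    # for which every feature above is a genuinely necessary condition.
    return my_obj in test_obj

def find_matches(my_obj, object_set, precomputed_masks):
    object_mask = compute_object_mask(my_obj)
    matches = []
    for test_obj, test_mask in zip(object_set, precomputed_masks):
        # Early out: some feature is present in my_obj but absent in test_obj,
        # so the expensive test cannot pass.
        if (test_mask & object_mask) != object_mask:
            continue
        if do_complicated_test(test_obj, my_obj):
            matches.append(test_obj)
    return matches

object_set = ["DOG FOOD", "FROG", "CAT"]
masks = [compute_object_mask(o) for o in object_set]
print(find_matches("DOG", object_set, masks))   # ['DOG FOOD']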
So my question is: given a limited bitmask size, a large list of possible features, and a table of the frequency of each feature in the object set (plus access to the object set itself, if you want to compute correlations between features and so on), what algorithm can I use to choose the optimal set of features to include in my bitmask, so as to maximize the number of early outs and minimize the number of expensive tests?
If I just choose the top x most common features, then the chance of each feature being present in both masks is higher, so it seems like the number of early outs would be reduced. However, if I choose the x least common features, then objectMask might frequently be zero, meaning no early outs are possible at all. It seems easy enough to experiment and come up with a set of middling-frequency features that gives good performance, but I'm interested in whether there is a theoretically best way of doing it.
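One way to make the "middling frequency" intuition concrete: if a feature has frequency p in both populations and the features are treated as independent, a single feature causes an early out for a random pair with probability p*(1-p) (present in myObject, absent in testObject), which is largest at p = 0.5. A small Python sketch of that heuristic follows, assuming a frequencies dict mapping each candidate feature to its frequency; this is only the independence approximation, not a proven optimum.

# Heuristic sketch under an independence assumption: rank candidate features
# by p * (1 - p), the per-pair chance that the feature alone causes an early out.
def choose_features(frequencies, mask_size):
    ranked = sorted(frequencies,
                    key=lambda f: frequencies[f] * (1 - frequencies[f]),
                    reverse=True)
    return ranked[:mask_size]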
Note: the frequency of each feature is assumed to be the same in the set of possible myObjects as in the objectSet, although I'd be interested to know how to handle the case where it isn't. I'd also be interested to know whether there is an algorithm for finding the best feature set given a large sample of the candidate objects that will be matched against the set.
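For the sample-based variant, one possibility (an illustrative sketch, not a claim of optimality) is a greedy selection in the spirit of set cover: repeatedly pick the feature that excludes the most (candidate, target) pairs not already excluded by the features chosen so far. The has_feature predicate and the two sample lists here are hypothetical.

# Greedy sketch over a sample: a pair (c, t) is excluded by feature f when the
# candidate c has f but the target t does not, so the expensive test never runs.
def greedy_feature_selection(candidate_sample, object_set, all_features,
                             has_feature, mask_size):
    remaining = {(ci, ti) for ci in range(len(candidate_sample))
                          for ti in range(len(object_set))}
    chosen = []
    for _ in range(mask_size):
        best_feature, best_excluded = None, set()
        for f in all_features:
            if f in chosen:
                continue
            excluded = {(ci, ti) for (ci, ti) in remaining
                        if has_feature(candidate_sample[ci], f)
                        and not has_feature(object_set[ti], f)}
            if len(excluded) > len(best_excluded):
                best_feature, best_excluded = f, excluded
        if best_feature is None:
            break  # no remaining feature excludes anything new
        chosen.append(best_feature)
        remaining -= best_excluded
    return chosen

Each round costs roughly O(|features| x |remaining pairs|), so for a very large sample you would probably subsample pairs rather than enumerate them all.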
Possible applications: matching an input string against a large number of regexes, matching a string against a large dictionary of words using a criterion such as "must contain the same letters in the same order, but possibly with extra characters inserted anywhere in the word", and so on. Example features: "contains the literal character D", "contains the character F followed by the character G later in the string", etc. Obviously the set of possible features will be highly dependent on the specific application.
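As a tiny end-to-end illustration of the dictionary case: the feature family "contains the letter X" gives a 26-bit mask per word, and a word can only pass the ordered-subsequence test if it contains every letter of the pattern, so letter presence is a necessary condition. The word list and pattern below are made up for illustration.

# Letter-presence features for the subsequence-matching example.
def letter_mask(word):
    mask = 0
    for ch in word.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask

def is_subsequence(pattern, word):
    # The "same letters in the same order, extra characters allowed" test.
    it = iter(word)
    return all(ch in it for ch in pattern)

words = ["banana", "bandana", "cabana", "canal"]
word_masks = [letter_mask(w) for w in words]
pattern = "bnn"
pattern_mask = letter_mask(pattern)
matches = [w for w, m in zip(words, word_masks)
           if (m & pattern_mask) == pattern_mask and is_subsequence(pattern, w)]
print(matches)   # ['banana', 'bandana']; 'canal' is excluded by the mask alone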
Answer 1:
You can try the Aho-Corasick algorithm. It's the fastest multi-pattern matcher. Basically, it's a finite state machine with failure links computed with a breadth-first search of the trie.
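For reference, here is a minimal sketch of what the answer describes: a trie over the patterns whose failure links are computed with a breadth-first search, followed by a single left-to-right scan of the text. This is an illustrative toy implementation in Python, not production code; a real project would more likely use an existing library.

from collections import deque

def build_automaton(patterns):
    # One trie node per list index: goto transitions, failure link, matched patterns.
    goto = [{}]
    fail = [0]
    out = [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Failure links are computed with a breadth-first search of the trie.
    queue = deque(goto[0].values())   # depth-1 nodes already fail to the root
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] += out[fail[child]]   # inherit matches ending at the fallback state
    return goto, fail, out

def search(text, automaton):
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits

patterns = ["he", "she", "his", "hers"]
automaton = build_automaton(patterns)
print(search("ushers", automaton))   # [(1, 'she'), (2, 'he'), (2, 'hers')]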
Source: https://stackoverflow.com/questions/30158878/what-is-the-optimal-way-to-choose-a-set-of-features-for-excluding-items-based-on