Search for strings matching the pattern “abc:*:xyz” in less than O(n)

前端 未结 4 1010
有刺的猬
有刺的猬 2021-01-07 10:45

Given a bunch of strings I need to find those which match 3 kinds of patterns:

  • Prefix search - abc*
  • Glob-like pattern - abc:*:xyz
  • Suffix sear
相关标签:
4条回答
  • 2021-01-07 11:16

    If you take the possibility of storing the strings in a search tree into account, why not also store the properties "starts with abc" and "ends with xyz", using these properties as keys?

    Edit: You also might want to leave Big-O-Notation behind and rather concentrate on the actual expected search duration in your particular use case. That's probably a more realistic approach; O(f(n)) style grades for your algorithm / implementation probably don't give you much usable information when it comes to your real-life search efficiency.

    0 讨论(0)
  • 2021-01-07 11:20

    Wouldn't an intersection of the results from the other two queries give you exactly that? And since each of the results is O(log N) and an intersect over that result set is O(N) in the result-set size, wouldn't the total also be an O(log N) over the original problem?

    0 讨论(0)
  • 2021-01-07 11:33

    If "abc" and "xyz" are fixed values, you can maintain three counters with your collection indicating the number of strings:

    • starting with "abc" but not ending with "xyz".
    • not starting with "abc" but ending with "xyz".
    • starting with "abc" and ending with "xyz".

    That gives you an O(1) time complexity for searching at the cost of extra calculation when inserting into, or deleting from, the collection.

    If the "abc" and "xyz" are arbitrary strings, it's O(n) for all operations, including the "abc..." one. You only have to consider what happens when your collections consists of items that all start with "abc" to see this. That's not bounded by O(logN) at all since you have to process all items in the tree (both branches of every non-leaf node).

    I think your ideal solution is to maintain the two ordered trees, one for the normal strings and one for the reversed strings. But don't worry about trying to do an intersection between the two. All you need to do is minimize the search space as much as practicable.

    • To find "abc...", use the normal tree to find the strings starting with that value.
    • To find "...xyz", use the reverse tree to find the strings ending with the reverse of that that value (zyx...).
    • To find "abc...xyz", use the normal tree to find the strings starting with that value and then filter out those that don't end in "xyz".

    That way you don't have to worry about intersecting values between the two trees and you still get a performance improvement over the simplistic linear search.

    0 讨论(0)
  • 2021-01-07 11:37

    Generate rotations of each word and put each rotation in a suffix tree with indication of "rotation index'.

    For example, to put the string "hello", put

    hello, 0
    elloh, 1
    llohe, 2
    lohel, 3
    ohell, 4
    

    Also, you put "hero" as

    hero, 0
    eroh, 1
    rohe, 2
    oher, 3
    

    Also, you put "ohe" (don't ask me what's ohe)

    ohe, 0
    heo, 1
    eoh, 2
    

    Then, if you need to search for pattern "he*o", you need to rotate it until you get a prefixed string: "ohe*"

    In the suffix tree you find the candidates: (ohell, 4), (oher, 3), (ohe, 0). Then you restore their original versions (by un-rotating them) and pick the right ones - "hello" and "hero".

    0 讨论(0)
提交回复
热议问题