Find all concatenations of two strings in a huge set


Question


Given a set of 50k strings, I need to find all pairs (s, t), such that s, t and s + t are all contained in this set.

What I've tried

Currently, there's an additional constraint: s.length() >= 4 && t.length() >= 4. This makes it possible to group the strings by their length-4 prefixes and, separately, suffixes. Then, for every string composed of length at least 8, I look up the set of candidates for s using the first four characters of composed and the set of candidates for t using its last four characters. This works, but it needs to look at 30M candidate pairs (s, t) to find the 7k results.
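
A rough sketch of that grouping approach (illustrative only, not the actual code; the tiny word list and all names here are made up):

import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class PrefixSuffixGrouping {
   public static void main(String[] args) {
      Set<String> words = Set.of("down", "stream", "downstream", "shine", "sunshine");

      // Group the words (all of length >= 4 here) by their first and last four characters.
      Map<String, List<String>> byPrefix = words.stream()
         .collect(Collectors.groupingBy(w -> w.substring(0, 4)));
      Map<String, List<String>> bySuffix = words.stream()
         .collect(Collectors.groupingBy(w -> w.substring(w.length() - 4)));

      // For every potential composite, cross the two candidate sets.
      for (String composed : words) {
         if (composed.length() < 8) continue;
         List<String> sCandidates = byPrefix.getOrDefault(composed.substring(0, 4), List.of());
         List<String> tCandidates = bySuffix.getOrDefault(composed.substring(composed.length() - 4), List.of());
         for (String s : sCandidates)
            for (String t : tCandidates)
               if (s.length() + t.length() == composed.length() && (s + t).equals(composed))
                  System.out.println(composed + " = " + s + " + " + t);
      }
   }
}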

This surprisingly high number of candidates comes from the fact that the strings are (mostly German) words from a limited vocabulary, and word beginnings and endings are often the same. It's still much better than trying all 2.5G pairs, but much worse than I hoped.

What I need

As the additional constraint may get dropped and the set will grow, I'm looking for a better algorithm.

The "missing" question

There were complaints about me not asking a question. So the missing question mark is at the end of the next sentence. How can this be done more efficiently, ideally without using the constraint?


Answer 1:


Algorithm 1: Test pairs, not singles

One way could be to work from all possible composite strings and check whether they contain pairs, instead of working from all possible pairs toward the composite strings containing them. This changes the problem from n^2 lookups (where n is the number of strings >= 4 characters) to m * n lookups (where m is the average length of all strings >= 8 characters, minus 7, and n is now the number of strings >= 8 characters). Here's one implementation of that:

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;

import java.util.Set;
import java.util.stream.IntStream;
import java.util.stream.Stream;

int minWordLength = 4;
int minPairLength = 8;

Set<String> strings = Stream
   .of(
      "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
      "bear", "hug", "bearhug", "cur", "curlique", "curl",
      "down", "downstream", "stream"
   )
   .filter(s -> s.length() >= minWordLength)
   .collect(ImmutableSet.toImmutableSet());

strings
   .stream()
   .filter(s -> s.length() >= minPairLength)
   .flatMap(s -> IntStream
      .rangeClosed(minWordLength, s.length() - minWordLength)
      .mapToObj(splitIndex -> ImmutableList.of(
         s.substring(0, splitIndex),
         s.substring(splitIndex)
      ))
      .filter(pair ->
          strings.contains(pair.get(0))
          && strings.contains(pair.get(1))
      )
   )
   .map(pair ->
      pair.get(0) + pair.get(1) + " = " + pair.get(0) + " + " + pair.get(1)
   )
   .forEach(System.out::println);

Gives the result:

downstream = down + stream

This has average algorithmic complexity of m * n as shown above, so effectively O(n) on average, and O(n^2) in the worst case. See any reference on hash tables for more on the lookup complexity.

Explanation

  1. Put all strings four or more characters long into a hash set (which takes average O(1) complexity for search). I used Guava's ImmutableSet for convenience. Use whatever you like.
  2. filter: Restrict to only the items that are eight or more characters in length, representing our candidates for being a composition of two other words in the list.
  3. flatMap: For each candidate, compute all possible pairs of sub-words, ensuring each is at least 4 characters long. Since there can be more than one result, this is in effect a list of lists, so flatten it into a single-deep list.
    1. rangeClosed: Generate all integers representing the number of characters that will be in the first word of the pair we will check.
    2. mapToObj: Use each integer combined with our candidate string to output a list of two items (in production code you'd probably want something clearer, like a two-property value class, or an appropriate existing class; see the sketch after this list).
    3. filter: Restrict to only pairs where both are in the list.
  4. map: Pretty up the results a little.
  5. forEach: Output to the console.
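
For the value class mentioned in step 3.2, a Java record (Java 16+) is one option. This sketch, including the name Split, is illustrative rather than part of the answer's code:

// A two-property value class for a candidate split; the name is arbitrary.
record Split(String first, String second) {
   String composed() {
      return first + second;
   }
}

The flatMap step would then emit new Split(s.substring(0, splitIndex), s.substring(splitIndex)) instead of an ImmutableList.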

Algorithm Choice

This algorithm is tuned to words that are far shorter than the number of items in the list. If the list were very short and the words very long, then switching back to a composition task instead of a decomposition task would work better. Given that the list is 50,000 strings, and German words, while long, are very unlikely to exceed 50 characters, that is a 1:1000 factor in favor of this algorithm.

If on the other hand, you had 50 strings that were on average 50,000 characters long, a different algorithm would be far more efficient.

Algorithm 2: Sort and keep a candidate list

One algorithm I thought about for a little while was to sort the list, with the knowledge that if a string represents the start of a pair, all candidate strings that could be one of its pairs will be immediately after it in order, among the set of items that start with that string. Sorting my tricky data above and adding some confounders (downer, downs, downregulate), we get:

a
abc
abcdef
bear
bearhug
cur
curl
curlique
def
down ---------\
downer        |
downregulate  | not far away now!
downs         |
downstream ---/
hug
shine
stream
sun
sunshine

Thus if a running set of all items to check were kept, we could find candidate composites in essentially constant time per word, then probe directly into a hash table for the remainder word:

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

int minWordLength = 4;

Set<String> strings = Stream
   .of(
      "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
      "bear", "hug", "bearhug", "cur", "curlique", "curl",
      "down", "downs", "downer", "downregulate", "downstream", "stream")
   .filter(s -> s.length() >= minWordLength)
   .collect(ImmutableSet.toImmutableSet());

ImmutableList<String> orderedList = strings
   .stream()
   .sorted()
   .collect(ImmutableList.toImmutableList());
List<String> candidates = new ArrayList<>();
List<Map.Entry<String, String>> pairs = new ArrayList<>();

for (String currentString : orderedList) {
   List<String> nextCandidates = new ArrayList<>();
   nextCandidates.add(currentString);
   for (String candidate : candidates) {
      if (currentString.startsWith(candidate)) {
         nextCandidates.add(candidate);
         String remainder = currentString.substring(candidate.length());
         if (remainder.length() >= minWordLength && strings.contains(remainder)) {
            pairs.add(new AbstractMap.SimpleEntry<>(candidate, remainder));
         }
      }
   }
   candidates = nextCandidates;
}
pairs.forEach(System.out::println);

Result:

down=stream

The algorithmic complexity on this one is a little more complicated. The searching part is, I think, O(n) on average, with an O(n^2) worst case. The most expensive part might be the sorting, which depends on the algorithm used and the characteristics of the unsorted data. So take this one with a grain of salt, but it has possibility. It seems to me that this is going to be way less expensive than building a Trie out of an enormous data set, because you only probe it once comprehensively and don't get any amortization of the build cost.

Also, this time I chose a Map.Entry to hold the pair. How you hold it is completely arbitrary; making a custom Pair class or using some existing Java class would be fine.




Answer 2:


You can improve Erik's answer by avoiding most of the substring creation, using CharBuffer views and altering their position and limit:

import java.nio.CharBuffer;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

Set<CharBuffer> strings = Stream.of(
    "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
    "bear", "hug", "bearhug", "cur", "curlique", "curl",
    "down", "downstream", "stream"
 )
.filter(s -> s.length() >= 4) // < 4 is irrelevant
.map(CharBuffer::wrap)
.collect(Collectors.toSet());

strings
    .stream()
    .filter(s -> s.length() >= 8)
    .map(CharBuffer::wrap) // a fresh view, so mutating position/limit leaves the set's element untouched
    .flatMap(cb -> IntStream.rangeClosed(4, cb.length() - 4)
        .filter(i -> strings.contains(cb.clear().position(i)) // suffix view [i, length)
                  && strings.contains(cb.flip()))             // prefix view [0, i)
        .mapToObj(i -> cb.clear() + " = " + cb.limit(i) + " + " + cb.clear().position(i))
    )
    .forEach(System.out::println);

This is the same algorithm, hence doesn’t change the time complexity, unless you incorporate the hidden character data copying costs, which would be another factor (times the average string length).

Of course, the differences become significant only if you use a different terminal operation than printing the matches, as printing is quite an expensive operation. Likewise, when the source is a stream over a large file, the I/O will dominate the operation. Unless, that is, you go in an entirely different direction, like using memory mapping and refactoring this operation to work over ByteBuffers.
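
A starting-point sketch for that last direction, under loudly stated assumptions (a hypothetical words.txt with one ASCII word per line and '\n' line endings; Java 13+ for the absolute ByteBuffer.slice(index, length)). The matching logic itself would still have to be ported to byte ranges:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class MappedWords {
   public static void main(String[] args) throws IOException {
      try (FileChannel ch = FileChannel.open(Path.of("words.txt"), StandardOpenOption.READ)) {
         MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

         // Collect each line as a zero-copy slice of the mapped region.
         // ByteBuffer, like CharBuffer, has content-based equals/hashCode,
         // so the slices can serve directly as hash-set keys.
         Set<ByteBuffer> words = new HashSet<>();
         int start = 0;
         for (int i = 0; i < map.limit(); i++) {
            if (map.get(i) == '\n') {
               if (i > start) {
                  words.add(map.slice(start, i - start));
               }
               start = i + 1;
            }
         }
         if (start < map.limit()) {
            words.add(map.slice(start, map.limit() - start));
         }

         System.out.println(words.size() + " words mapped without copying");
      }
   }
}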




Answer 3:


A possible solution could be this: take one string as your prefix candidate and another as your suffix candidate, then go through each string. If a string begins with the first and ends with the second, you have a match; keep going until the end. To save some time, before comparing the letters themselves you can check the lengths. It's pretty much what you did, but with this added length check you might be able to trim off a few comparisons. That's my take on it, at least.
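
Read literally, that might look like the following sketch (made-up word list; note that composed equals s + t exactly when the lengths add up and both the startsWith and endsWith checks pass):

import java.util.List;

public class NaiveScan {
   public static void main(String[] args) {
      List<String> words = List.of("down", "stream", "downstream", "bear", "hug", "bearhug");

      // For every (s, t) pair, scan for a third word that is exactly s + t.
      // The cheap length comparison runs before any character comparison.
      for (String s : words) {
         for (String t : words) {
            for (String composed : words) {
               if (composed.length() == s.length() + t.length()
                     && composed.startsWith(s)
                     && composed.endsWith(t)) {
                  System.out.println(composed + " = " + s + " + " + t);
               }
            }
         }
      }
   }
}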




Answer 4:


Not sure if this is better than your solution, but I think it's worth a try.

Build two Tries, one with the candidates in normal order, the other with the words reversed.

Walk the forwards Trie from depth 4 inwards and use the remainder of the leaf to determine the suffix (or something like that) and look it up in the backwards Trie.

I've posted a Trie implementation in the past here: https://stackoverflow.com/a/9320920/823393.
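
One possible reading of that suggestion, as a minimal sketch (a bare-bones HashMap trie rather than the linked implementation; the "remainder of the leaf" detail is interpreted here as: whenever a prefix of a long word is itself a word, probe the reversed remainder in the backwards trie):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class DualTrie {
   // A bare-bones trie node, not the linked implementation.
   static class Node {
      final Map<Character, Node> children = new HashMap<>();
      boolean isWord;
   }

   static void insert(Node root, String word) {
      Node n = root;
      for (char c : word.toCharArray()) {
         n = n.children.computeIfAbsent(c, k -> new Node());
      }
      n.isWord = true;
   }

   static boolean contains(Node root, String word) {
      Node n = root;
      for (char c : word.toCharArray()) {
         n = n.children.get(c);
         if (n == null) {
            return false;
         }
      }
      return n.isWord;
   }

   static String reverse(String s) {
      return new StringBuilder(s).reverse().toString();
   }

   public static void main(String[] args) {
      Set<String> words = Set.of("down", "stream", "downstream", "bear", "hug", "bearhug");

      Node forward = new Node();
      Node backward = new Node();
      for (String w : words) {
         insert(forward, w);
         insert(backward, reverse(w));
      }

      // Walk each sufficiently long word through the forward trie; whenever a
      // prefix of length >= 4 is itself a word, look up the reversed remainder
      // in the backward trie.
      for (String w : words) {
         if (w.length() < 8) continue;
         Node n = forward;
         for (int i = 0; i < w.length(); i++) {
            n = n.children.get(w.charAt(i));
            if (n == null) break;
            int prefixLen = i + 1;
            int suffixLen = w.length() - prefixLen;
            if (prefixLen >= 4 && suffixLen >= 4 && n.isWord
                  && contains(backward, reverse(w.substring(prefixLen)))) {
               System.out.println(w + " = " + w.substring(0, prefixLen)
                     + " + " + w.substring(prefixLen));
            }
         }
      }
   }
}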



Source: https://stackoverflow.com/questions/52251203/find-all-concatenations-of-two-string-in-a-huge-set
