My problem below is NP-complete; however, I'm trying to find at least a marginally faster string search function or module that might help in reducing some of the computation t
I applied Dijkstra's algorithm (a breadth-first search that always expands the shortest candidates first) and have a solution that answers this task in less than a second. I optimized it a bit in terms of memory usage, but I think that, as far as the algorithm is concerned, this is a better approach than the one in the other answer. Unless we run out of memory, it should be the better solution.
from collections import defaultdict

def dijkSuperstring(originalSeqs):
    # paths maps a candidate length to the set of candidates of that length
    paths = defaultdict(set)
    paths[0] = { '' }
    while paths:
        # always expand the shortest candidates first
        minLength = min(paths.keys())
        while paths[minLength]:
            candidate = paths[minLength].pop()
            seqAdded = False
            for seq in originalSeqs:
                if seq in candidate:
                    continue
                seqAdded = True
                # try every overlap between the candidate's end and seq's start
                for i in reversed(range(len(seq)+1)):
                    if candidate.endswith(seq[:i]):
                        newCandidate = candidate + seq[i:]
                        paths[len(newCandidate)].add(newCandidate)
            if not seqAdded:  # nothing added, so all present?
                return candidate
        del paths[minLength]

print(dijkSuperstring(
    [ 'AGG', 'AGT', 'CCG', 'CGT', 'GAG', 'GGA', 'GGT', 'GTA', 'GTG', 'TAG', 'TGG' ]))
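A quick way to sanity-check whatever string the search returns is to confirm that it really contains every input sequence. The helper below is only a minimal sketch; the name `is_superstring` is my own and not part of the code above. It checks containment only, not minimality.

```python
def is_superstring(candidate, seqs):
    """Return True if every sequence in seqs occurs somewhere in candidate."""
    return all(seq in candidate for seq in seqs)


# 'ABCDE' contains all three fragments; 'ABCD' lacks 'CDE'.
print(is_superstring('ABCDE', ['ABC', 'BCD', 'CDE']))
print(is_superstring('ABCD', ['ABC', 'BCD', 'CDE']))
```

Running it on the output of `dijkSuperstring` together with the original input list should report `True`.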
I also tried using random sequences as input:
import random

# 11 random sequences of 3 characters each
seqs = [ ''.join(random.choice('GATC') for i in range(3))
         for j in range(11) ]
print(dijkSuperstring(seqs))
I soon found out that the solving time depends greatly on the size of the result(!), not on the size of the input, so it isn't predictable. This isn't too surprising, but it makes comparing different algorithms a little difficult, as others don't necessarily share this property. In particular, the set of sequences from the OP seems to pose a comparatively lightweight problem. Other sets of 11 sequences of 3 characters are much harder to solve.
So I made some statistical measurements: I solved 1000 sets of 8 sequences, once for sequences of 3 characters and once for sequences of 4 characters. Then I grouped the durations into 100 groups (equally spaced from 0s to the highest duration) and counted how many fell into each group. To smooth the graph, I always used the sum of three neighboring groups.
The diagrams below each show two such experiments, performed with an earlier (non-optimized) version of my algorithm (but the shape of the curves is the same as now); I ran each experiment twice to get at least an idea of whether a strange dip in the graph had a reason or was just pure chance.
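The grouping and smoothing described here could be sketched roughly as follows. This is not my original measurement script; the function name `smoothed_histogram` is made up for illustration, and the handling of the two edge groups (which only have one neighbor) is an assumption.

```python
def smoothed_histogram(durations, bins=100):
    """Count durations in equally spaced bins from 0 to the highest duration,
    then smooth by summing each bin with its direct neighbors."""
    highest = max(durations)
    counts = [0] * bins
    for d in durations:
        if highest > 0:
            # The highest duration itself is clamped into the last bin.
            index = min(int(d / highest * bins), bins - 1)
        else:
            index = 0
        counts[index] += 1
    # Assumption: edge bins sum over the neighbors that exist (two bins, not three).
    smoothed = [sum(counts[max(i - 1, 0):i + 2]) for i in range(bins)]
    return counts, smoothed


counts, smoothed = smoothed_histogram([0.0, 0.5, 1.0], bins=4)
print(counts)
print(smoothed)
```

The durations themselves would come from timing each call to the solver, for example with `time.perf_counter()` before and after.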
I'd be interested to see similar graphs for the same kind of input for other algorithms. This could be interesting because my algorithm clearly has a memory issue. Solving 11 sequences of 3 characters stalled my machine several times due to memory exhaustion, so having another algorithm could make sense even if it is slower.
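For comparison, here is one candidate for such a slower but memory-light alternative. It is a plain brute-force sketch, not the algorithm above: it tries every ordering of the sequences and merges neighbors with the largest possible overlap. The names `merge_with_max_overlap` and `brute_force_superstring` are hypothetical. It only keeps one candidate in memory at a time, but the running time grows with the factorial of the number of sequences, so I would expect it to be practical only for small inputs.

```python
from itertools import permutations


def merge_with_max_overlap(left, right):
    """Append right to left, reusing the longest suffix of left
    that is also a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return left + right


def brute_force_superstring(original_seqs):
    unique = set(original_seqs)
    # Sequences contained in another sequence are covered automatically.
    seqs = [s for s in unique if not any(s != t and s in t for t in unique)]
    best = None
    for order in permutations(seqs):  # lazy: one ordering at a time
        candidate = ''
        for seq in order:
            candidate = merge_with_max_overlap(candidate, seq)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


print(brute_force_superstring(['ABC', 'BCD', 'CDE']))
```

Timing this on the same random inputs would give one more curve to compare against.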