I have a large text file, several hundred thousand lines long. I have to extract 30,000 specific lines that are scattered through the file in random spots. This is the program I have to extr
This method assumes the special values appear in the same position on every line in gbigfile:
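def mydict(iterable):
    # Group values by key: {first character: [targets starting with it]}
    d = {}
    for k, v in iterable:
        if k in d:
            d[k].append(v)
        else:
            d[k] = [v]
    return d

with open("C:\\to_find.txt", "r") as t:
    # Strip the trailing newline so a target can equal the slice taken below; skip blank lines
    targets = [x.strip() for x in t if x.strip()]
    tofind = mydict([(x[0], x) for x in targets])

with open("C:\\gbigfile.txt", "r") as bigfile:
    with open("C:\\outfile.txt", "w") as outfile:
        for line in bigfile:
            seq = line[4:9]
            # .get avoids a KeyError when no target starts with this character
            if seq in tofind.get(seq[:1], []):
                outfile.write(line)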
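To make the first-character index concrete, here is a minimal sketch that runs the same idea on in-memory data instead of files. The sample targets and lines are made up for illustration, and it assumes, as above, that the value sits at positions 4 through 8 of each line.

```python
def mydict(iterable):
    # Same grouping as above, written with setdefault for brevity.
    d = {}
    for k, v in iterable:
        d.setdefault(k, []).append(v)
    return d

# Hypothetical targets and file lines, invented for this example.
targets = ["AB123", "AC999", "ZX000"]
lines = [
    "0001AB123 keep me",
    "0002QQ111 skip me",
    "0003ZX000 keep me too",
]

# Index the targets by their first character.
tofind = mydict((t[0], t) for t in targets)

# Only the targets sharing the line's first character are compared.
matches = [ln for ln in lines if ln[4:9] in tofind.get(ln[4:5], [])]

print(tofind)
print(matches)
```

Each line is only compared against the targets in its own bucket, which is where the savings come from.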
Depending on how the starting letters of those targets are distributed, you can cut your comparisons down by a significant amount. If you don't know where the values will appear, you're talking about a LONG operation, because you'll have to compare every line (let's say 300,000) against each of the 30,000 targets. That's up to 9 billion comparisons, which is going to take a long time.
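As a rough back-of-the-envelope check, the sketch below works out those worst-case counts. The even spread over 26 starting letters is an assumption, not something known about the real data. It also includes `filter_lines`, a hypothetical helper that swaps the first-character dictionary for a plain Python `set`; that is a different technique from the one above, shown only because set membership is a hash lookup and avoids scanning a list at all when the position is fixed.

```python
n_lines = 300_000    # the "let's say" figure for the big file
n_targets = 30_000   # number of lines to extract

# Worst case with no index: every line is checked against every target.
naive = n_lines * n_targets

# Assumption: targets are spread evenly over 26 starting letters,
# so each line is only checked against one bucket.
n_buckets = 26
bucketed = n_lines * (n_targets // n_buckets)

def filter_lines(lines, targets, start=4, end=9):
    """Keep lines whose slice [start:end] is one of the targets."""
    wanted = set(targets)  # hash lookup instead of a list scan
    return [ln for ln in lines if ln[start:end] in wanted]

print(f"no index: {naive:,} comparisons")
print(f"first-letter buckets: about {bucketed:,} comparisons")
```

With a set, the work per line no longer grows with the number of targets, so the whole pass is roughly one lookup per line of the big file.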