I have a large text file, several hundred thousand lines long. I have to extract 30,000 specific lines that are scattered through the file in random spots. This is the program I have to extr
What are the criteria that define the 30,000 lines you want to extract? The more information you give, the more likely you are to get a useful answer.
If you want all the lines containing a certain string, or more generally containing any of a given set of strings, or an occurrence of a regular expression, use grep. It's likely to be significantly faster for large data sets.
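For example, if your 30,000 keys were in a file patterns.txt, one per line, something along the lines of grep -F -f patterns.txt bigfile.txt > matches.txt would do the whole job in one pass (the file names here are just placeholders; -F treats the patterns as fixed strings rather than regular expressions).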
Aha! So your real problem is how to test many conditions per line and, if one of them is satisfied, to output that line. Easiest will be using a regular expression, methinks:
import re

keywords = ['S0414', 'GT213', 'AT3423', 'PR342']  # etc - you probably get these from some source
pattern = re.compile('|'.join(keywords))  # wrap each keyword in re.escape() if they can contain regex metacharacters

with open('big_file.txt') as inf, open('matches.txt', 'w') as outf:  # placeholder file names
    for line in inf:
        if pattern.search(line):
            outf.write(line)
The best bet to speed it up would be if the specific string S0414 always appears at the same character position. Instead of having to make several failed comparisons per line (you said the lines start with different names), it could do just one comparison and be done.
e.g. if your file has lines like
GLY S0414 GCT
ASP S0435 AGG
LEU S0432 CCT
do an if line[4:9] == 'S0414': small.write(line).
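Spelled out as a full loop, a minimal sketch might look like this (file names are placeholders; adjust the slice to wherever the key actually sits in your lines):

# Fixed-position check: the key occupies character positions 4-8 in the sample lines above.
with open('big_file.txt') as inf, open('small_file.txt', 'w') as small:
    for line in inf:
        if line[4:9] == 'S0414':
            small.write(line)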
This reminds me of a problem described by Tim Bray, who attempted to extract data from web server log files using multi-core machines. The results are described in The Wide Finder Project and Wide Finder 2. So, if serial optimizations don't go fast enough for you, this may be a place to start. There are examples of this sort of problem contributed in many languages, including Python. Key quote from that last link:
Summary
In this article, we took a relatively fast Python implementation and optimized it, using a number of tricks:
- Pre-compiled RE patterns
- Fast filtering of candidate lines
- Chunked reading
- Multiple processes
- Memory mapping, combined with support for RE operations on mapped buffers
This reduced the time needed to parse 200 megabytes of log data from 6.7 seconds to 0.8 seconds on the test machine. Or in other words, the final version is over 8 times faster than the original Python version, and (potentially) 600 times faster than Tim’s original Erlang version.
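If you want to experiment with a couple of those tricks yourself, here is a rough sketch of scanning a memory-mapped file with a pre-compiled regex (file names and keys are placeholders, and the multi-process part is omitted):

import mmap
import re

# Byte pattern, since a memory-mapped file exposes bytes, not str.
pattern = re.compile(rb'S0414|GT213|AT3423')

with open('big_file.txt', 'rb') as f, open('matches.txt', 'wb') as out:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for match in pattern.finditer(mm):
        # Expand the match back out to the full line that contains it.
        start = mm.rfind(b'\n', 0, match.start()) + 1
        end = mm.find(b'\n', match.end())
        if end == -1:
            end = len(mm)
        out.write(mm[start:end] + b'\n')
    mm.close()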
Having said this, 30,000 lines isn't that many, so you may want to at least start by investigating your disk read/write performance. Does it help if you write the output to a different disk than the one you are reading the input from, or if you read the whole file in one go before processing?