I have a large text file of several hundred thousand lines. I have to extract 30,000 specific lines, which are scattered throughout the file in random spots. This is the program I have to extr
This reminds me of a problem described by Tim Bray, who attempted to extract data from web server log files using multi-core machines. The results are described in The Wide Finder Project and Wide Finder 2. So, if serial optimizations don't go fast enough for you, this may be a place to start. There are example solutions to this sort of problem contributed in many languages, including Python. Key quote from that last link:
Summary
In this article, we took a relatively fast Python implementation and optimized it, using a number of tricks:
- Pre-compiled RE patterns
- Fast filtering of candidate lines
- Chunked reading
- Multiple processes
- Memory mapping, combined with support for RE operations on mapped buffers
This reduced the time needed to parse 200 megabytes of log data from 6.7 seconds to 0.8 seconds on the test machine. Or in other words, the final version is over 8 times faster than the original Python version, and (potentially) 600 times faster than Tim’s original Erlang version.
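The sketch below illustrates a few of those tricks (chunked reading, multiple processes, and a pre-compiled RE run directly over a memory-mapped buffer) in Python. It is not the article's implementation; the file name, chunk alignment logic, and the Wide Finder-style pattern are assumptions, and the fast candidate-line filter mainly applies to line-by-line variants, so it is omitted here.

    # A minimal sketch, assuming a Wide Finder-style access log; the file name
    # and pattern below are placeholders, not the article's actual code.
    import mmap
    import os
    import re
    from multiprocessing import Pool

    LOG_FILE = "access.log"  # hypothetical input file
    # Pre-compiled RE pattern (bytes, so it can run on a mapped buffer)
    PATTERN = re.compile(rb"GET /ongoing/When/\d{3}x/(\d{4}/\d{2}/\d{2}/[^ .]+) ")
    NUM_PROCS = os.cpu_count() or 4

    def chunk_offsets(path, n_chunks):
        """Split the file into roughly equal byte ranges, aligned to line breaks."""
        size = os.path.getsize(path)
        offsets = [0]
        with open(path, "rb") as f:
            for i in range(1, n_chunks):
                f.seek(size * i // n_chunks)
                f.readline()  # advance to the next newline boundary
                offsets.append(f.tell())
        offsets.append(size)
        return list(zip(offsets[:-1], offsets[1:]))

    def scan_chunk(args):
        """Memory-map the file and run the pre-compiled RE over one byte range."""
        path, start, end = args
        counts = {}
        with open(path, "rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            for m in PATTERN.finditer(mm, start, end):  # RE on the mapped buffer
                key = m.group(1)
                counts[key] = counts.get(key, 0) + 1
            mm.close()
        return counts

    if __name__ == "__main__":
        ranges = [(LOG_FILE, s, e) for s, e in chunk_offsets(LOG_FILE, NUM_PROCS)]
        totals = {}
        with Pool(NUM_PROCS) as pool:  # one worker process per chunk
            for partial in pool.map(scan_chunk, ranges):
                for key, n in partial.items():
                    totals[key] = totals.get(key, 0) + n
        for key, n in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
            print(n, key.decode())

Chunk boundaries are snapped to the next newline so no line is split across two workers, and each worker re-maps the file itself rather than receiving the data over the process boundary.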
Having said this, 30,000 lines isn't that many, so you may want to start by investigating your disk read/write performance. Does it help if you write the output to something other than the disk that you are reading the input from, or if you read the whole file in one go before processing?
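For the simpler, serial version of your task, something like the following sketch may already be fast enough. It assumes the 30,000 lines you want are listed, one per row, in a separate file; all file names here are placeholders.

    # A minimal sketch of "read it all, filter in memory, write elsewhere".
    WANTED_FILE = "wanted_lines.txt"   # hypothetical: one wanted line per row
    INPUT_FILE = "big_input.txt"       # the several-hundred-thousand-line file
    OUTPUT_FILE = "extracted.txt"      # ideally on a different disk than the input

    with open(WANTED_FILE, "r", encoding="utf-8") as f:
        wanted = {line.rstrip("\n") for line in f}   # set gives O(1) membership tests

    # Read the whole input in one go, then filter entirely in memory.
    with open(INPUT_FILE, "r", encoding="utf-8") as f:
        lines = f.readlines()

    with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
        out.writelines(line for line in lines if line.rstrip("\n") in wanted)

The key point is using a set for the 30,000 targets: a list would make each membership test a linear scan, and over a few hundred thousand input lines that difference dominates the runtime.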