I have a large text file, several hundred thousand lines long. I have to extract 30,000 specific lines that are scattered through the file in random spots. This is the program I have to extr
What are the criteria that define the 30,000 lines you want to extract? The more information you give, the more likely you are to get a useful answer.
If you want all the lines containing a certain string, or more generally containing any of a given set of strings, or an occurrence of a regular expression, use grep. It's likely to be significantly faster for large data sets.
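For example, if your 30,000 keys were in a file patterns.txt, one per line, something along the lines of grep -F -f patterns.txt bigfile.txt > matches.txt would do the whole job in one pass (the file names here are just placeholders; -F treats the patterns as fixed strings rather than regular expressions).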
Aha! So your real problem is how to test many conditions per line and, if one of them is satisfied, to output that line. Easiest will be using a regular expression, methinks:
import re

keywords = ['S0414', 'GT213', 'AT3423', 'PR342']  # etc - you probably get these from some source
pattern = re.compile('|'.join(keywords))  # wrap each keyword in re.escape() if they can contain regex metacharacters

with open('big_file.txt') as inf, open('matches.txt', 'w') as outf:  # placeholder file names
    for line in inf:
        if pattern.search(line):
            outf.write(line)
The best bet to speed it up would be if the specific string S0414 always appears at the same character position. Instead of having to make several failed comparisons per line (you said the lines start with different names), it could do just one comparison and be done.
e.g. if your file has lines like
GLY S0414 GCT
ASP S0435 AGG
LEU S0432 CCT
do an if line[4:9] == 'S0414': small.write(line).
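Spelled out as a full loop, a minimal sketch might look like this (file names are placeholders; adjust the slice to wherever the key actually sits in your lines):

# Fixed-position check: the key occupies character positions 4-8 in the sample lines above.
with open('big_file.txt') as inf, open('small_file.txt', 'w') as small:
    for line in inf:
        if line[4:9] == 'S0414':
            small.write(line)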
This reminds me of a problem described by Tim Bray, who attempted to extract data from web server log files using multi-core machines. The results are described in The Wide Finder Project and Wide Finder 2. So, if serial optimizations don't go fast enough for you, this may be a place to start. There are examples of this sort of problem contributed in many languages, including Python. Key quote from that last link:
Summary
In this article, we took a relatively fast Python implementation and optimized it, using a number of tricks:
- Pre-compiled RE patterns
- Fast filtering of candidate lines
- Chunked reading
- Multiple processes
- Memory mapping, combined with support for RE operations on mapped buffers
This reduced the time needed to parse 200 megabytes of log data from 6.7 seconds to 0.8 seconds on the test machine. Or in other words, the final version is over 8 times faster than the original Python version, and (potentially) 600 times faster than Tim’s original Erlang version.
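If you want to experiment with a couple of those tricks yourself, here is a rough sketch of scanning a memory-mapped file with a pre-compiled regex (file names and keys are placeholders, and the multi-process part is omitted):

import mmap
import re

# Byte pattern, since a memory-mapped file exposes bytes, not str.
pattern = re.compile(rb'S0414|GT213|AT3423')

with open('big_file.txt', 'rb') as f, open('matches.txt', 'wb') as out:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for match in pattern.finditer(mm):
        # Expand the match back out to the full line that contains it.
        start = mm.rfind(b'\n', 0, match.start()) + 1
        end = mm.find(b'\n', match.end())
        if end == -1:
            end = len(mm)
        out.write(mm[start:end] + b'\n')
    mm.close()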
Having said this, 30,000 lines isn't that many, so you may want to at least start by investigating your disk read/write performance. Does it help if you write the output to a different disk than the one you are reading the input from, or if you read the whole file in one go before processing?