How do I re.search or re.match on a whole file without reading it all into memory?


I want to run a regular expression on an entire file, but I'd like to avoid reading the whole file into memory at once, as I may be working with rat

9 answers

清歌不尽

2020-12-01 01:21
This depends on the file and the regex. The best thing you could do is read the file line by line, but if that does not work for your situation then you might be stuck with pulling the whole file into memory.

Let's say, for example, that this is your file:
```
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Ut fringilla pede blandit
eros sagittis viverra. Curabitur facilisis
urna ABC elementum lacus molestie aliquet.
Vestibulum lobortis semper risus. Etiam
sollicitudin. Vivamus posuere mauris eu
nulla. Nunc nisi. Curabitur fringilla fringilla
elit. Nullam feugiat, metus et suscipit
fermentum, mauris ipsum blandit purus,
non vehicula purus felis sit amet tortor.
Vestibulum odio. Mauris dapibus ultricies
metus. Cras XYZ eu lectus. Cras elit turpis,
ultrices nec, commodo eu, sodales non, erat.
Quisque accumsan, nunc nec porttitor vulputate,
erat dolor suscipit quam, a tristique justo
turpis at erat.
```
And this was your regex:
```
consectetur(?=\sadipiscing)
```
This regex uses a positive lookahead and will only match the string "consectetur" if it is immediately followed by a whitespace character and then the string "adipiscing".

In this example, "consectetur" ends one line and "adipiscing" starts the next, so reading line by line would never produce a match. The regex depends on the file being parsed as a single string, so you would have to read the whole file into memory. This is one of many cases in which a regex needs the entire string in memory to work.
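To make the difference concrete, here is a minimal sketch that runs the regex above against the first two lines of the sample file, once line by line and once as a single string. The variable names are illustrative only.

```python
import re

# First two lines of the sample file shown above.
text = (
    "Lorem ipsum dolor sit amet, consectetur\n"
    "adipiscing elit. Ut fringilla pede blandit\n"
)
pattern = re.compile(r"consectetur(?=\sadipiscing)")

# Line by line: the lookahead cannot see past the end of the current line.
line_hits = [bool(pattern.search(line)) for line in text.splitlines(keepends=True)]
print(line_hits)  # [False, False]

# Whole text as one string: the newline satisfies \s and the match succeeds.
whole_hit = pattern.search(text)
print(whole_hit.group(0))  # consectetur
```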

I guess the unfortunate answer is that it all depends on your situation.
轮回少年

2020-12-01 01:23
Open the file and iterate over the lines.
```
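import re

pattern = re.compile(r"...")  # placeholder: substitute your own pattern
with open('myfile') as fd:
    for line in fd:
        # re.match only matches at the start of each line
        if pattern.match(line):
            print(line, end='')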
```

小鲜肉

2020-12-01 01:23

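```
import re

# Matches one table row: a numeric cell followed by two alphabetic cells
row = re.compile(r'(<tr align="right"><td>)([0-9]*)(</td><td>)([a-zA-Z]*)(</td><td>)([a-zA-Z]*)(</td>)')
l = []  # collected cell values
with open(filename, 'r') as f:
    for eachline in f:
        string = row.search(eachline)
        if string:
            # groups 2, 4 and 6 hold the three cell values
            for i in range(2, 8, 2):
                l.append(string.group(i))
```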


忘掉有多难

2020-12-01 01:25
If this is a big deal and worth some effort, you can convert the regular expression into a finite state machine (FSM) that reads the file. An FSM matcher runs in O(n) time in the size of the input and only needs to remember its current state between reads, so it scales well as the file gets big.

You will be able to efficiently match patterns that span lines in files too large to fit in memory.

Here are two places that describe the algorithm for converting a regular expression to an FSM:
- http://swtch.com/~rsc/regexp/regexp1.html
- http://www.math.grin.edu/~rebelsky/Courses/CS362/98F/Outlines/outline.07.html
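As a rough illustration of the idea, the sketch below feeds a file-like object through a hand-built state machine in small chunks. The pattern `ab+c`, the transition table, and the function name `dfa_search` are all made up for this example; a real solution would generate the table from the regular expression using the algorithms linked above.

```python
import io

# Hand-built DFA for the hypothetical pattern ab+c, searched anywhere in the input.
# States: 0 = nothing matched, 1 = seen "a", 2 = seen "a" plus one or more "b", 3 = accepted.
TRANSITIONS = {
    0: {"a": 1},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 2, "c": 3},
}
ACCEPT = 3


def dfa_search(stream, chunk_size=4096):
    """Return the offset just past the first match, or None if there is none."""
    state = 0
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return None  # end of input without reaching the accept state
        for ch in chunk:
            # Any character without an explicit transition resets to state 0.
            state = TRANSITIONS[state].get(ch, 0)
            offset += 1
            if state == ACCEPT:
                return offset


# The state survives chunk boundaries, so a tiny chunk size still works.
print(dfa_search(io.StringIO("xxabbbc yy"), chunk_size=3))  # 7
print(dfa_search(io.StringIO("ac abx")))  # None
```

Only the current state and one chunk are held in memory at any time, which is what lets the match span chunk (and line) boundaries.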
眼角桃花

2020-12-01 01:29

For single-line patterns you can iterate over the lines of the file, but for multi-line patterns you will have to read all of the file into memory (or part of it, but that will be hard to keep track of).
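One way to do the "part of the file" bookkeeping is a sliding window that keeps a tail of the previous chunk. The sketch below is only that, a sketch: the function name `search_in_chunks` and the `max_match` parameter are invented here, and it assumes no match is longer than `max_match` characters. Patterns with end anchors, lookarounds near a chunk boundary, or greedy quantifiers that could extend past the current buffer may give different results than a search over the whole file.

```python
import io
import re


def search_in_chunks(stream, pattern, chunk_size=1 << 16, max_match=1024):
    """Return (absolute_offset, matched_text) for a match, or None.

    Assumption: no match is longer than max_match characters.
    """
    regex = re.compile(pattern)
    buffer = ""
    base = 0  # absolute offset of buffer[0] within the stream
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        mo = regex.search(buffer)
        if mo:
            return base + mo.start(), mo.group(0)
        if not chunk:
            return None  # end of input, nothing found
        # Keep only the tail that could still be the start of a match.
        drop = max(0, len(buffer) - (max_match - 1))
        base += drop
        buffer = buffer[drop:]


sample = io.StringIO("aaaa foo\nbar bbbb")
print(search_in_chunks(sample, r"foo\s+bar", chunk_size=4, max_match=8))
```

Memory use is bounded by roughly `chunk_size + max_match` characters regardless of file size.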

逝去的感伤

2020-12-01 01:34
You can use mmap to map the file into memory. The file contents can then be accessed like a normal string (a bytes object in Python 3):
```
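import re, mmap

# Open in binary mode; mmap exposes the file contents as bytes
with open('/var/log/error.log', 'rb') as f:
  data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
  # The pattern must be bytes because the mapped data is bytes
  mo = re.search(b'error: (.*)', data)
  if mo:
    print("found error", mo.group(1).decode(errors='replace'))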
```
This also works for big files, because the file content is loaded from disk internally as needed.