I want to be able to run a regular expression on an entire file, but I'd like to avoid reading the whole file into memory at once, as I may be working with rather large files.
Here's an option using re and mmap that finds all the words in a file without building lists or loading the whole file into memory.
import re
from contextlib import closing
from mmap import mmap, ACCESS_READ

# Open in binary mode: mmap exposes the file as bytes, so the pattern must be bytes too.
with open('filepath.txt', 'rb') as f:
    with closing(mmap(f.fileno(), 0, access=ACCESS_READ)) as d:
        # finditer scans the mapped file lazily, one match at a time
        print(sum(1 for _ in re.finditer(rb'\w+', d)))
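If you want the matched words rather than a count, the same mmap stays lazy; each match comes back as bytes (a small variation on the above, assuming the same filepath.txt):

import re
from contextlib import closing
from mmap import mmap, ACCESS_READ

with open('filepath.txt', 'rb') as f:
    with closing(mmap(f.fileno(), 0, access=ACCESS_READ)) as d:
        for m in re.finditer(rb'\w+', d):
            print(m.group().decode())  # matches are bytes; decode to print as text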
Based on @sth's answer, but with lower memory usage.
Python 3: to load the file as one big string, use the read() and decode() methods:
import re, mmap

def read_search_in_file(file):
    with open(file, 'r+') as f:
        # map the file, then read and decode it into one big string
        with mmap.mmap(f.fileno(), 0) as m:
            data = m.read().decode("utf-8")
    error = re.search(r'error: (.*)', data)
    if error:
        return error.group(1)
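For example, a quick usage sketch with the log path from the snippet:

# returns the text captured after the first 'error: ', or None if there is none
message = read_search_in_file('/var/log/error.log')
if message:
    print(message)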
This is one way:
import re

REGEX = r'\d+'

with open('/tmp/workfile', 'r') as f:
    for line in f:
        # re.match only matches at the beginning of each line
        print(re.match(REGEX, line))
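Note that re.match only reports a match anchored at the start of each line; if you want every occurrence on a line, re.finditer is one option (a small sketch along the same lines; neither variant can see a match that spans a line break):

import re

with open('/tmp/workfile', 'r') as f:
    for line in f:
        for m in re.finditer(r'\d+', line):
            print(m.group())  # every digit run on the line, not just the first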
Another approach that comes to mind is to use read(size) and seek(offset), which read the file one chunk at a time.
import os
import re

REGEX = rb'\d+'  # bytes pattern, since the file is opened in binary mode

with open('/tmp/workfile', 'rb') as f:
    filesize = os.path.getsize('/tmp/workfile')
    # a suitable size that you can determine ahead of time or in the program
    part = max(filesize // 10, 1)
    position = 0
    while position <= filesize:
        content = f.read(part)
        # beware: a match that straddles two chunks will be missed
        print(re.match(REGEX, content))
        position = position + part
        f.seek(position)
You can also combine the two: create a generator that yields the file's contents a certain number of bytes at a time, and iterate through those chunks to check your regex. IMO this would be a good approach.
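A minimal sketch of that idea, assuming matches never span a newline (the name search_in_chunks and the chunk_size default are mine): the generator reads a chunk at a time and buffers the trailing partial line, so a match is never cut in half by a read boundary.

import re

def search_in_chunks(path, pattern, chunk_size=1 << 20):
    # Read the file chunk_size characters at a time, carrying the trailing
    # partial line over to the next chunk so no line is split across reads.
    regex = re.compile(pattern)
    with open(path, 'r') as f:
        tail = ''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (tail + chunk).split('\n')
            tail = lines.pop()  # the last piece may be an incomplete line
            for line in lines:
                yield from regex.finditer(line)
        yield from regex.finditer(tail)  # whatever remained after the final read

for m in search_in_chunks('/tmp/workfile', r'\d+'):
    print(m.group())

Only one chunk plus one line is ever in memory at a time, regardless of file size.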