How do I re.search or re.match on a whole file without reading it all into memory?

不知归路 2020-12-01 01:14

I want to run a regular expression on an entire file, but I'd like to avoid reading the whole file into memory at once, as I may be working with rat

9 answers
  • 2020-12-01 01:36

    Here's an option using re and mmap that counts all the words in a file without building a list or loading the whole file into memory.

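    import re
    from mmap import mmap, ACCESS_READ

    # Open in binary mode; the mapping is searched as bytes.
    with open('filepath.txt', 'rb') as f:
        # The map pages data in on demand instead of reading the whole file.
        with mmap(f.fileno(), 0, access=ACCESS_READ) as d:
            # Count the matches lazily, without building a list.
            print(sum(1 for _ in re.finditer(rb'\w+', d)))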
    

    Based on @sth's answer, but with less memory usage.
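    For the re.search case the question asks about, the same mmap idea can be wrapped in a small helper. This is only a sketch: the name search_file is hypothetical, the pattern must be a bytes pattern because the mapping exposes bytes, and the empty-file check is there because mmap refuses to map a zero-length file.

```python
import mmap
import os
import re


def search_file(path, pattern):
    """Return the bytes of the first match of a bytes pattern, or None."""
    with open(path, 'rb') as f:
        # mmap cannot map an empty file, so handle that case up front.
        if os.fstat(f.fileno()).st_size == 0:
            return None
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            match = re.search(pattern, m)
            # Copy the matched bytes out while the map is still open.
            return match.group(0) if match else None
```

    Only the matched bytes are copied into a Python object; the rest of the file stays in the operating system's page cache and is paged in as the regex engine scans it.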

  • 2020-12-01 01:40

    Python 3: to load the file as one big string, use the read() and decode() methods. Note that read() copies the entire mapped file into memory.

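    import mmap
    import re


    def read_search_in_file(file):
        # Call as, e.g., read_search_in_file('/var/log/error.log').
        with open(file, 'rb') as f:
            # Map the file read-only, then copy it out as one decoded string.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                data = m.read().decode("utf-8")
        error = re.search(r'error: (.*)', data)
        if error:
            return error.group(1)
        return None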
    
  • 2020-12-01 01:43

    This is one way:

    import re

    REGEX = r'\d+'

    with open('/tmp/workfile', 'r') as f:
        for line in f:
            # re.match only tests the start of each line.
            print(re.match(REGEX, line))
    
    1. The with statement (available since Python 2.5) takes care of closing the file automatically, so you need not worry about it.
    2. Iterating over the file object is memory efficient: it holds no more than one line in memory at a given time.
    3. The drawback of this approach is that it can take a long time for huge files, and a pattern can never match across a line boundary.
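    As a rough illustration of the line-by-line approach, the loop can be turned into a generator that reports where each hit occurred. The name grep_lines and the UTF-8 encoding are assumptions made for this sketch, and it uses re.search so that a hit anywhere in the line counts.

```python
import re


def grep_lines(path, pattern):
    """Yield (line_number, match) for each line that contains the pattern."""
    regex = re.compile(pattern)
    with open(path, 'r', encoding='utf-8') as f:
        for line_number, line in enumerate(f, start=1):
            # re.search scans the whole line; re.match only tests its start.
            match = regex.search(line)
            if match:
                yield line_number, match
```

    Because it is a generator, the caller can stop at the first hit and the rest of the file is never read.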

    Another approach that comes to mind is to use the read(size) and file.seek(offset) methods, which read one portion of the file at a time.

    import os
    import re

    REGEX = rb'\d+'
    PATH = '/tmp/workfile'

    filesize = os.path.getsize(PATH)
    # A suitable chunk size that you can determine ahead of time or in the program.
    part = max(1, filesize // 10)

    with open(PATH, 'rb') as f:
        position = 0
        while position < filesize:
            f.seek(position)
            content = f.read(part)
            # A match that straddles two chunks will be missed.
            print(re.match(REGEX, content))
            position += part
    

    You can also combine the two: create a generator that returns the contents a certain number of bytes at a time, and iterate over those chunks to check your regex. IMO this would be a good approach.
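    One possible shape for such a generator is sketched below. The names iter_matches, chunk_size and overlap are hypothetical, and the sketch rests on a stated assumption: no single match is longer than overlap bytes. The tail of each buffer is held back and rescanned with the next chunk so that a match straddling a chunk boundary is still found whole.

```python
import re


def iter_matches(path, pattern, chunk_size=1 << 20, overlap=4096):
    """Yield (file_offset, matched_bytes) for a bytes pattern, chunk by chunk.

    Assumes no match is longer than `overlap` bytes.
    """
    regex = re.compile(pattern)
    with open(path, 'rb') as f:
        buffer = b''
        base = 0  # file offset of buffer[0]
        while True:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            buffer += chunk
            # Before EOF, matches starting in the last `overlap` bytes are
            # deferred, because they might continue into the next chunk.
            limit = len(buffer) if at_eof else max(0, len(buffer) - overlap)
            keep_from = limit
            for match in regex.finditer(buffer):
                if match.start() >= limit and not at_eof:
                    break
                yield base + match.start(), match.group(0)
                keep_from = max(keep_from, match.end())
            if at_eof:
                return
            # Drop everything already scanned; keep only the held-back tail.
            buffer = buffer[keep_from:]
            base += keep_from
```

    Memory use stays near chunk_size + overlap regardless of file size. Patterns that depend on context to the left of a match (lookbehind, \b, ^) may behave differently at a buffer boundary than on the whole file, so treat this as a starting point and not a drop-in replacement for re.finditer.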
