Python: find regexp in a file

醉话见心 · 2020-12-06 01:35

Have:

f = open(...)  
r = re.compile(...)

Need:
Find the position (start and end) of the first regexp match in a big file.

3 Answers
  • 2020-12-06 02:29

    The following code works reasonably well with test files around 2GB in size.

    def search_file(pattern, filename, offset=0):
        # Open in binary mode so byte offsets are exact; the compiled
        # pattern must therefore be a bytes pattern, e.g. re.compile(br"...").
        with open(filename, 'rb') as f:
            f.seek(offset)
            pos = offset
            for line in f:
                m = pattern.search(line)
                if m:
                    # pos is the byte offset of the start of this line
                    return pos + m.start(), pos + m.end()
                pos += len(line)
        return None  # no match anywhere in the file
    

    Note that the regular expression must not span multiple lines.
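
    For example, a minimal usage sketch (the file name and pattern here are just illustrative):

    import re

    # bytes pattern, because search_file opens the file in binary mode
    pattern = re.compile(br"ERROR \d+")
    span = search_file(pattern, "server.log")
    if span:
        print("first match at bytes %d..%d" % span)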

  • 2020-12-06 02:30

    One way to search through big files is to use the mmap module to map the file into a big chunk of memory. Then you can search through it without having to read it in explicitly.

    For example, something like:

    import mmap
    import os
    import re

    size = os.stat(fn).st_size
    f = open(fn, 'rb')  # mmap needs a binary-mode file object
    data = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)

    # the pattern must be a bytes pattern, since mmap exposes bytes
    m = re.search(br"867-?5309", data)
    

    This works well for very big files (I've done it for a file 30+ GB in size), but you'll need a 64-bit OS if your file is more than a gigabyte or two.
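
    Since the question asks for positions: the match object's offsets are byte offsets into the mapped file itself, so the answer falls straight out of it:

    if m:
        print("first match spans bytes %d to %d" % (m.start(), m.end()))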

  • 2020-12-06 02:32

    NOTE: this has been tested on Python 2.7. You may have to tweak things in Python 3 to handle strings vs. bytes, but it shouldn't be too painful.

    Memory-mapped files may not be ideal for your situation (in 32-bit mode there is a greater chance there isn't enough contiguous virtual memory, you can't read from pipes or other non-file streams, and so on).

    Here is a solution that reads 128 KiB blocks at a time; as long as your regex matches a string smaller than that size, it will work. Also note that you are not restricted to single-line regexes. This solution runs plenty fast, although I suspect it will be marginally slower than using mmap. It probably depends more on what you're doing with the matches, as well as the size and complexity of the regex you're searching for.

    The method keeps at most two blocks in memory at a time. In some use cases you might want to enforce at least one match per block as a sanity check, but this method truncates the buffer in order to keep that two-block maximum. It also makes sure that any regex match that runs right up to the end of the current block is NOT yielded; instead, the position is saved until either the input is truly exhausted or another block arrives in which the regex finishes matching before the end. This is what lets it handle patterns like "[^\n]+" or "xxx$" correctly. You may still be able to break things with a lookahead at the end of the regex, like xx(?!xyz) where yz falls in the next block, but in most cases you can work around such patterns.

    import re

    def regex_stream(regex, stream, block_size=128 * 1024):
        # Bind methods to locals for speed inside the tight loops.
        stream_read = stream.read
        finditer = regex.finditer

        block = stream_read(block_size)
        if not block:
            return

        # Yield matches from the first block, but hold back any match that
        # runs right up to the end of the block: it might continue into
        # the next one.
        lastpos = 0
        for mo in finditer(block):
            if mo.end() != len(block):
                yield mo
                lastpos = mo.end()
            else:
                break

        while True:
            new_buffer = stream_read(block_size)
            if not new_buffer:
                break
            # Carry the unmatched tail of the previous block forward,
            # truncated so that at most two blocks stay in memory.
            carry = block[lastpos:]
            if len(carry) > block_size:
                carry = carry[-block_size:]
            block = carry + new_buffer
            lastpos = 0
            for mo in finditer(block):
                if mo.end() != len(block):
                    yield mo
                    lastpos = mo.end()
                else:
                    break

        # Input exhausted: flush what is left, this time including matches
        # that end exactly at the end of the data.
        if lastpos:
            block = block[lastpos:]
        for mo in finditer(block):
            yield mo
    

    To test / explore, you can run this:

    # NOTE: you can substitute a real file stream for t_in; cStringIO is
    # used here purely as a test (Python 2, per the note above).
    import cStringIO
    import re

    t_in = cStringIO.StringIO('testing this is a 1regexxx\nanother 2regexx\nmore 3regexes')
    block_size = len('testing this is a regex')
    re_pattern = re.compile(r'\dregex+', re.DOTALL)
    for match_obj in regex_stream(re_pattern, t_in, block_size=block_size):
        print 'found regex in block of len %s/%s: "%s[[[%s]]]%s"' % (
            len(match_obj.string),
            block_size,
            match_obj.string[:match_obj.start()].encode('string_escape'),
            match_obj.group(),
            match_obj.string[match_obj.end():].encode('string_escape'))
    

    Here is the output:

    found regex in block of len 46/23: "testing this is a [[[1regexxx]]]\nanother 2regexx\nmor"
    found regex in block of len 46/23: "testing this is a 1regexxx\nanother [[[2regexx]]]\nmor"
    found regex in block of len 14/23: "\nmore [[[3regex]]]es"
    

    This can be useful in conjunction with quick-parsing large XML, where the file can be split up into mini-DOMs based on a sub-element as root, instead of having to dive into handling callbacks and states when using a SAX parser. It also lets you filter through XML faster. I've used it for tons of other purposes as well. I'm kind of surprised recipes like this aren't more readily available on the net!
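
    For instance, a sketch of the XML use case (the <record> element and file name here are hypothetical):

    import re
    import xml.dom.minidom

    # Hypothetical: pull each <record>...</record> out of a huge XML file
    # and parse it as its own mini-DOM, never loading the whole document.
    # Assumes each record fits within the default 128 KiB block size.
    record_re = re.compile(r'<record\b.*?</record>', re.DOTALL)
    with open('huge.xml') as xml_stream:
        for mo in regex_stream(record_re, xml_stream):
            dom = xml.dom.minidom.parseString(mo.group())
            # ...work with dom.documentElement here...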

    One more thing: parsing unicode should work as long as the passed-in stream produces unicode strings, and if you're using character classes like \w, you'll need to add the re.U flag when compiling the pattern. In that case block_size actually means character count instead of byte count.
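
    A minimal sketch of the unicode case (io.StringIO stands in for a real text stream; the sample text and pattern are made up):

    import io
    import re

    # io.StringIO yields unicode strings on both Python 2 and 3
    u_in = io.StringIO(u'caf\xe9 1regex and sm\xf8rrebr\xf8d 2regex')
    u_pattern = re.compile(r'\dregex', re.U)
    for mo in regex_stream(u_pattern, u_in, block_size=16):
        print(mo.group())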
