How to read lines from a mmapped file?

前端未结


It seems that the mmap interface only supports readline(). If I try to iterate over the object, I get single characters instead of complete lines.

What would be the \"python

4 Answers

情歌与酒

2020-12-24 03:10
The most concise way to iterate over the lines of an mmap is
```
import mmap

with open(STAT_FILE, "r+b") as f:
    map_file = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    for line in iter(map_file.readline, b""):
        pass  # process each line here
```
Note that in Python 3 the sentinel parameter of iter() must be of type bytes, while in Python 2 it needs to be a str (i.e. "" instead of b"").
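
For a version that runs as-is on Python 3, here is a minimal sketch. The helper name iter_mmap_lines and the temporary sample file are assumptions made for illustration, not part of the original answer. It also swaps prot=mmap.PROT_READ for access=mmap.ACCESS_READ, because prot is only available on Unix while access works on both Unix and Windows.

```python
import mmap
import os
import tempfile


def iter_mmap_lines(path):
    """Yield each line of the file at `path` as bytes, newline included."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ works on Unix and Windows.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # readline() returns b"" at end of file, which stops iter().
            for line in iter(m.readline, b""):
                yield line


# Hypothetical sample data, written to a temporary file for the demo.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first\nsecond\n\nlast")
    sample_path = tmp.name

lines = list(iter_mmap_lines(sample_path))
os.remove(sample_path)
print(lines)
```

Wrapping the loop in a generator keeps the mapping open only while lines are being consumed, and the two with blocks close both the mapping and the file afterwards.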
深忆病人

2020-12-24 03:16
I modified your example like this:
```
import mmap

with open(STAT_FILE, "r+b") as f:
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    while True:
        line = m.readline()
        if line == '':  # end of file (Python 2 str; use b'' on Python 3)
            break
        print line.rstrip()
```
Suggestions:
- Do not name a variable map; that shadows the built-in function of the same name.
- Open the file in r+b mode, as in the Python example on the mmap help page. It states: In either case you must provide a file descriptor for a file opened for update. See http://docs.python.org/library/mmap.html#mmap.mmap.
- It's better to not use UPPER_CASE_WITH_UNDERSCORES global variable names, as mentioned in Global Variable Names at https://www.python.org/dev/peps/pep-0008/#global-variable-names. In other programming languages (like C), constants are often written all uppercase.
Hope this helps.
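
On the file mode suggestion above: as far as I can tell, a mapping that is only ever read does not strictly need a file opened for update. The sketch below assumes Python 3 and a temporary sample file (both are assumptions, not part of the original answer). It opens the file with "rb", maps it with access=mmap.ACCESS_READ, and shows that an attempted write is rejected with TypeError.

```python
import mmap
import os
import tempfile

# Hypothetical sample file for the demo.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"alpha\nbeta\n")
    sample_path = tmp.name

with open(sample_path, "rb") as f:  # read-only, not "r+b"
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        first_line = m.readline()
        try:
            m[0:1] = b"X"  # writing to a read-only map is rejected
            write_failed = False
        except TypeError:
            write_failed = True

os.remove(sample_path)
print(first_line, write_failed)
```

If the mapping needs to be writable, "r+b" is still the mode to use.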

Edit: I did some timing tests on Linux because the comment made me curious. Here is a comparison of timings made on 5 sequential runs on a 137MB text file.

Normal file access:
```
real    2.410 2.414 2.428 2.478 2.490
sys     0.052 0.052 0.064 0.080 0.152
user    2.232 2.276 2.292 2.304 2.320
```
mmap file access:
```
real    1.885 1.899 1.925 1.940 1.954
sys     0.088 0.108 0.108 0.116 0.120
user    1.696 1.732 1.736 1.744 1.752
```
Those timings exclude the print statement. Going by these numbers, I'd say memory-mapped file access is roughly 20% faster in wall-clock time (about 1.9 s versus 2.4 s per run).

Edit 2: Using python -m cProfile test.py I got the following results:
```
5432833    2.273    0.000    2.273    0.000 {method 'readline' of 'file' objects}
5432833    1.451    0.000    1.451    0.000 {method 'readline' of 'mmap.mmap' objects}
```
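If I'm not mistaken, mmap is quite a bit faster: the same 5,432,833 readline calls take 1.451 s in total on the mmap object versus 2.273 s on the file object.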

Additionally, it seems that not len(line) performs worse than line == '', at least that's how I interpret the profiler output.
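
To repeat this kind of comparison on your own machine, a rough sketch along these lines may help. It assumes Python 3; the function names, the generated sample data, and the single-run timing are all illustrative choices rather than the setup used for the numbers above, so results will differ by platform, file size, and Python version.

```python
import mmap
import os
import tempfile
import time


def count_lines_file(path):
    """Count lines with ordinary buffered readline() calls."""
    count = 0
    with open(path, "rb") as f:
        for _ in iter(f.readline, b""):
            count += 1
    return count


def count_lines_mmap(path):
    """Count lines with readline() on a read-only memory map."""
    count = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for _ in iter(m.readline, b""):
                count += 1
    return count


def timed(func, path):
    """Return (result, elapsed seconds) for a single call of func(path)."""
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start


# Hypothetical test data: 200,000 short lines in a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"some sample text\n" * 200_000)
    sample_path = tmp.name

file_count, file_secs = timed(count_lines_file, sample_path)
mmap_count, mmap_secs = timed(count_lines_mmap, sample_path)
os.remove(sample_path)

print("file: %d lines in %.3f s" % (file_count, file_secs))
print("mmap: %d lines in %.3f s" % (mmap_count, mmap_secs))
```

A single run like this is noisy, so repeating it several times (as in the five runs above) gives a more trustworthy picture.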
一向

2020-12-24 03:20
Python 2.7 (32-bit) on Windows is more than twice as fast on an mmapped file.

This is on a 27 MB, 509k-line text file. My 'parse' function is not interesting; it mostly just calls readline() in a tight loop:
```
import mmap

with open(someFile, "r") as f:
    if usemmap:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    else:
        m = f
    e.parse(m)  # runs for both the mmap and the plain-file case
```
With MMAP:
```
read in 0.308000087738
```
Without MMAP:
```
read in 0.680999994278
```
时光取名叫无心

2020-12-24 03:28
The following is reasonably concise:
```
import mmap

with open(STAT_FILE, "r") as f:
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    while True:
        line = m.readline()
        if line == "":  # end of file (Python 2; use b"" on Python 3)
            break
        print line  # line still ends with its newline
    m.close()
```
Note that line retains the newline, so you might like to remove it. It is also the reason why if line == "" does the right thing (an empty line is returned as "\n").

The reason the original iteration works the way it does is that mmap tries to look like both a file and a string. It looks like a string for the purposes of iteration.

I have no idea why it can't (or chooses not to) provide readlines()/xreadlines().
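
Both points can be seen side by side in a small sketch. It assumes Python 3, and the helper name mmap_readlines and the temporary sample file are made up for illustration. Plain iteration produces one item per byte, while a readline-based helper gives something close to the missing readlines().

```python
import mmap
import os
import tempfile


def mmap_readlines(m):
    """Collect the remaining lines of an mmap, like file.readlines() would."""
    return list(iter(m.readline, b""))


# Hypothetical sample file for the demo.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"one\n\ntwo\n")
    sample_path = tmp.name

with open(sample_path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        items = list(m)            # plain iteration: one item per byte
        lines = mmap_readlines(m)  # line by line, newlines kept

os.remove(sample_path)
stripped = [line.rstrip(b"\n") for line in lines]
print(len(items), lines, stripped)
```

The empty line in the middle comes back as b"\n", so it is never mistaken for the b"" that signals end of file.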