Text processing - Python vs Perl performance [closed]

前端未结

关注

 5  609

独厮守ぢ

相关标签:

5条回答

一整个雨季

2020-12-12 15:57
Function calls are a bit expensive in terms of time in Python. And yet you have a loop invariant function call to get the file name inside the loop:
```
fn = fileinput.filename()
```
Move this line above the for loop and you should see some improvement to your Python timing. Probably not enough to beat out Perl though.
0 讨论(0)
发布评论:

提交评论
- 加载中...
醉酒成梦

2020-12-12 16:03
This is exactly the sort of stuff that Perl was designed to do, so it doesn't surprise me that it's faster.

One easy optimization in your Python code would be to precompile those regexes, so they aren't getting recompiled each time.
```
exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
location_re = re.compile(r'^AwbLocation (.*?) insert into')
```
And then in your loop:
```
mprev = exists_re.search(currline)
```
and
```
mcurr = location_re.search(currline)
```
That by itself won't magically bring your Python script in line with your Perl script, but repeatedly calling re in a loop without compiling first is bad practice in Python.
0 讨论(0)
发布评论:

提交评论
- 加载中...
甜味超标

2020-12-12 16:13

In general, all artificial benchmarks are evil. However, everything else being equal (algorithmic approach), you can make improvements on a relative basis. However, it should be noted that I don't use Perl, so I can't argue in its favor. That being said, with Python you can try using Pyrex or Cython to improve performance. Or, if you are adventurous, you can try converting the Python code into C++ via ShedSkin (which works for most of the core language, and some - but not all, of the core modules).

Nevertheless, you can follow some of the tips posted here:

http://wiki.python.org/moin/PythonSpeed/PerformanceTips

0 讨论(0)
发布评论:

提交评论
- 加载中...

礼貌的吻别

2020-12-12 16:18

I expect Perl be faster. Just being curious, can you try the following?

#!/usr/bin/python

import re
import glob
import sys
import os

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for mask in sys.argv[1:]:
    for fname in glob.glob(mask):
        if os.path.isfile(fname):
            f = open(fname)
            for line in f:
                mex = exists_re.search(line)
                if mex:
                    xlogtime = mex.group(1)

                mloc = location_re.search(line)
                if mloc:
                    print fname, xlogtime, mloc.group(1)
            f.close()

Update as reaction to "it is too complex".

Of course it looks more complex than the Perl version. The Perl was built around the regular expressions. This way, you can hardly find interpreted language that is faster in regular expressions. The Perl syntax...

while (<>) {
    ...
}

... also hides a lot of things that have to be done somehow in a more general language. On the other hand, it is quite easy to make the Python code more readable if you move the unreadable part out:

#!/usr/bin/python

import re
import glob
import sys
import os

def input_files():
    '''The generator loops through the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                yield fname


exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname in input_files():
    with open(fname) as f:        # Now the f.close() is done automatically
        for line in f:
            mex = exists_re.search(line)
            if mex:
                xlogtime = mex.group(1)

            mloc = location_re.search(line)
            if mloc:
                print fname, xlogtime, mloc.group(1)

Here the def input_files() could be placed elsewhere (say in another module), or it can be reused. It is possible to mimic even the Perl's while (<>) {...} easily, even though not the same way syntactically:

#!/usr/bin/python

import re
import glob
import sys
import os

def input_lines():
    '''The generator loops through the lines of the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                with open(fname) as f: # now the f.close() is done automatically
                    for line in f:
                        yield fname, line

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname, line in input_lines():
    mex = exists_re.search(line)
    if mex:
        xlogtime = mex.group(1)

    mloc = location_re.search(line)
    if mloc:
        print fname, xlogtime, mloc.group(1)

Then the last for may look as easy (in principle) as the Perl's while (<>) {...}. Such readability enhancements are more difficult in Perl.

Anyway, it will not make the Python program faster. Perl will be faster again here. Perl is a file/text cruncher. But--in my opinion--Python is a better programming language for more general purposes.

0 讨论(0)

执笔经年

2020-12-12 16:22
Hypothesis: Perl spends less time backtracking in lines that don't match due to optimisations it has that Python doesn't.

What do you get by replacing
```
^(.*?) INFO.*Such a record already exists
```
with
```
^((?:(?! INFO).)*?) INFO.*Such a record already 
```
or
```
^(?>(.*?) INFO).*Such a record already exists
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

热议问题