I have a bash script that cuts out a section of a logfile between 2 timestamps, but because of the size of the files, it takes quite a while to run.
If I were to rewrite it in Perl, could I expect it to run significantly faster?
bash actually reads the script a line at a time as it interprets it on the fly (which you'll be made painfully aware of if you ever modify a bash script while it's still running), rather than preloading and parsing it all at once. So yes, Perl will generally be a lot faster for anything you wouldn't normally do in bash anyway.
Based on the shell code you have, with multiple calls to tail/head, I'd say absolutely Perl could be faster. C could be even faster, but the development time probably won't be worth it, so I'd stick to Perl. (I say "could" because you can write shell scripts in Perl, and I've seen enough of those to cringe. That obviously wouldn't have the speed benefit that you want.)
Perl has a higher startup cost, or so it's claimed. Honestly, I've never noticed. If your alternative is to do it in Java, Perl has no startup cost worth mentioning; compared to bash, I simply haven't noticed. What I have noticed is that as I get away from calling all the specialised Unix tools, which are great when you don't have alternatives, and move toward doing it all in a single process, speed goes up. The overhead of creating new processes on Unix isn't as severe as it may have been on Windows, but it's still not entirely negligible: each one has to reinitialise the C runtime library (libc), parse arguments, open files (perhaps), and so on. In Perl, you end up using vast swaths of memory as you pass everything around in a list, but it's all in memory, so it's faster. And many of the tools you're used to are either built in (map, grep, regexes) or available in modules on CPAN. A good combination of these would get the job done easily.
The big thing is to avoid re-reading files. It's costly, and you're doing it many times. Heck, you could use the :gzip layer on open (from the PerlIO::gzip module on CPAN) to read your gzipped files directly, saving yet another pass; that's also faster in that you'd be reading less from disk.
Well, bash is interpreted line by line as it runs and depends on calling a lot of external programs (depending on what you want to do). You often have to use temp files as intermediate storage for result sets. The shell was originally designed to talk to the system and automate command sequences (shell files).
Perl is more like C: it's largely self-contained, has a huge library of free code, and is compiled to bytecode before it runs, so it runs much faster than shell (claims of 80-90% of C's speed are optimistic, but for I/O-bound work like this it gets close) while being easier to program (variable sizes are dynamic, for example).
In your bash script, put this:
perl -ne "print if /$FROM/../$TO/" $LOGFILES
$FROM and $TO are really regexes matching your start and end times. They are inclusive, so you might want to use 2009-06-14 23:59:59 as your end time, since 2009-06-15 00:00:00 would also include transactions stamped exactly at midnight.
I would profile all three solutions and pick which is best in terms of initial startup speed, processing speed, and memory usage.
Something like Perl/Python/Ruby may not be the absolute fastest, but you can rapidly develop in those languages - much faster than in C and even Bash.
You will almost certainly realize a massive speed benefit from writing your script in Perl just by cutting off the file read when you pass your second timestamp.
More generally, yes: a bash script of any complexity, unless it's a truly amazing piece of wizardry, can handily be outperformed by a Perl script for equivalent inputs and outputs.