I want to get the maximum number in a file, where the numbers are integers that can occur anywhere in the file.
I thought about doing the following:
In awk you can say:
awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
In my experience awk is the fastest text-processing language for most tasks, and the only things I have seen of comparable speed (on Linux systems) are programs written in C/C++.
In the code above, using minimal functions and commands allows for faster execution.
for(i=1;i<=NF;i++) - Loops through the fields on the line. Using the default FS/RS and looping this way is usually faster than using custom ones, as awk is optimised to use the defaults.
if(int($i)) - Checks that the field is non-zero; since int() evaluates strings to zero, the next block is skipped when the field is a string. I believe this is the quickest way to perform this check.
{a[$i]=$i} - Sets an array element with the number as both key and value. This means there are only as many array elements as there are distinct numbers in the file, which will hopefully be quicker than a comparison of every number.
END{x=asort(a)} - At the end of the file, sorts the array with asort() and stores the size of the array in x.
print a[x] - Prints the last (i.e. largest) element in the array.
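As a quick sanity check, the command picks the largest integer out of a small hypothetical file (note that asort() is gawk-specific, so gawk is used explicitly here):

```shell
# Hypothetical sample input; asort() requires gawk.
printf 'hello 3 world 42\n7 foo 11\n' > /tmp/sample.txt
gawk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' /tmp/sample.txt
# prints 42
rm /tmp/sample.txt
```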
Mine:
time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
took
real 0m0.434s
user 0m0.357s
sys 0m0.008s
hek2mgl's:
awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]' file
took
real 0m1.256s
user 0m1.134s
sys 0m0.019s
For those wondering why it is faster: it comes down to using the default FS and RS, which awk is optimised for.
Changing
awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]'
to
awk '{for(i=1;i<=NF;i++)m=(m<$i && int($i))?$i:m}END{print m}'
provides the time
real 0m0.574s
user 0m0.497s
sys 0m0.011s
Which is still a little slower than my command.
I believe the slight difference that remains is because asort() only has to work on around 6 numbers, as each one is saved only once in the array.
In comparison, the other command is performing a comparison on every single number in the file which will be more computationally expensive.
I think they would be around the same speed if all the numbers in the file were unique.
Tom Fenech's:
time awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' myfile
real 0m0.716s
user 0m0.612s
sys 0m0.013s
A drawback of this approach, though, is that if all the numbers are below zero then max will be blank.
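That drawback is easy to reproduce on a small hypothetical file containing only negative numbers:

```shell
# All numbers below zero: max is never set, so an empty line is printed.
printf 'foo -5 bar -3\n' > /tmp/neg.txt
awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' /tmp/neg.txt
# prints an empty line
rm /tmp/neg.txt
```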
Glenn Jackman's:
time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' file
real 0m1.492s
user 0m1.258s
sys 0m0.022s
and
time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
real 0m0.790s
user 0m0.686s
sys 0m0.034s
The good thing about perl -MList::Util=max -0777 -nE 'say max /-?\d+/g'
is that it is the only answer here that works when 0 appears in the file as the largest number, and it also works when all the numbers are negative.
All times are representative of the average of 3 tests
I suspect this will be fastest:
$ tr ' ' '\n' < file | sort -rn | head -1
42342234
Third run:
$ time tr ' ' '\n' < file | sort -rn | head -1
42342234
real 0m0.078s
user 0m0.000s
sys 0m0.076s
btw DON'T WRITE SHELL LOOPS to manipulate text, even if it's creating sample input files:
$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfile
real 0m0.109s
user 0m0.031s
sys 0m0.061s
$ wc -l myfile
150000 myfile
compared to the shell loop suggested in the question:
$ time for i in {1..50000}; do cat a >> myfile2 ; done
real 26m38.771s
user 1m44.765s
sys 17m9.837s
$ wc -l myfile2
150000 myfile2
If we want something that more robustly handles input files that contain digits in strings that are not integers, we need something like this:
$ cat b
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
73 starts a line
avoid these: 3.14 or 4-5 or $15 or 2:30 or 05/12/2015
$ grep -o -E '(^| )[-]?[0-9]+( |$)' b | sort -rn
42342234
3624
123
73
-23
$ time awk -v s="$(cat b)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfileB
real 0m0.109s
user 0m0.000s
sys 0m0.076s
$ wc -l myfileB
250000 myfileB
$ time grep -o -E '(^| )-?[0-9]+( |$)' myfileB | sort -rn | head -1 | tr -d ' '
42342234
real 0m2.480s
user 0m2.509s
sys 0m0.108s
Note that the input file has more lines than the original, and with this input the above robust grep solution is actually faster than the original I posted at the start of this question:
$ time tr ' ' '\n' < myfileB | sort -rn | head -1
42342234
real 0m4.836s
user 0m4.445s
sys 0m0.277s
I'm surprised by awk's speed here. perl is usually pretty speedy, but:
$ for ((i=0; i<1000000; i++)); do echo $RANDOM; done > rand
$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' rand
32767
real 0m0.890s
user 0m0.887s
sys 0m0.003s
$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F; END {print $m}' rand
32767
real 0m1.110s
user 0m1.107s
sys 0m0.002s
I think I've found a winner: With perl, slurp the file as a single string, find the (possibly negative) integers, and take the max:
$ time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' rand
32767
real 0m0.565s
user 0m0.539s
sys 0m0.025s
Takes a little more "sys" time, but less real time.
Works with a file with only negative numbers too:
$ cat file
hello -42 world
$ perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
-42
I'm sure a C implementation optimised using assembler would be the fastest. I could also imagine a program which splits the file into multiple chunks, maps every chunk onto a single processor core, and afterwards just gets the maximum of the nproc remaining numbers.
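That chunking idea can be sketched in plain shell (this is only a sketch: the sample input is made up, the GNU `split -n` option is assumed to be available, and a real implementation would size the chunks to match nproc):

```shell
# Sketch only: split the input into chunks, scan each chunk in a
# background job, then take the max of the per-chunk maxima.
tmp=$(mktemp -d)
printf 'hello 123 foo 42342234 bar\n-7 3624 9\n' > "$tmp/input"

split -n l/2 "$tmp/input" "$tmp/chunk."      # GNU split: 2 line-based chunks

for c in "$tmp"/chunk.*; do
  # per-chunk maximum: extract integers with grep, reduce with sort
  grep -oE -e '-?[0-9]+' "$c" | sort -rn | head -n 1 >> "$tmp/partials" &
done
wait

max=$(sort -rn "$tmp/partials" | head -n 1)  # reduce: max of the partial maxima
echo "$max"
rm -r "$tmp"
```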
Just using the existing command line tools, have you tried awk?
time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile
Looks like it can do the job in ~50% of the time compared to the perl command in the accepted answer:
time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' myfile
cp myfile myfile2
time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile2
Gave me:
42342234
real 0m0.360s
user 0m0.340s
sys 0m0.020s
42342234
real 0m0.193s <-- Good job awk! You are the winner.
user 0m0.185s
sys 0m0.008s
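For reference, RS='$' makes gawk treat the whole file as a single record (assuming no literal $ characters appear in the input), and FPAT makes every integer its own field; a quick check on hypothetical input (FPAT is gawk-specific):

```shell
# Whole input becomes one record; FPAT pulls out 5, -12 and 9 as fields.
printf 'a 5 b -12\nc 9\n' |
  gawk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+'
# prints 9
```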