How to get the biggest number in a file?


I want to get the maximum number in a file, where the numbers are integers that can occur anywhere in the file.

I thought about doing the following:

4 Answers
  • 2020-12-04 01:15

    In awk you can say:

    awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
    

    Explanation

    In my experience, awk is the fastest text-processing language for most tasks, and the only things I have seen of comparable speed (on Linux systems) are programs written in C/C++.

    In the code above, keeping the number of functions and commands to a minimum helps execution speed.

    for(i=1;i<=NF;i++) - Loops through the fields on the line. Using the default FS/RS and
                         looping this way is usually faster than using custom ones, as awk
                         is optimised for the defaults
    
    if(int($i))        - Checks that the field is not equal to zero; since strings are
                         converted to zero by int(), the next block is not executed if the
                         field is a string. I believe this is the quickest way to perform
                         this check
    
    {a[$i]=$i}         - Sets an array element with the number as both key and value. This
                         means there are only as many array elements as there are distinct
                         numbers in the file, which should be quicker than comparing every
                         number
    
    END{x=asort(a)     - At the end of the file, sort the array with asort() (a gawk
                         extension) and store the size of the array in x
    
    print a[x]         - Print the last (i.e. largest) element in the array
    

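    Note that asort() is a gawk extension. For awks without it, a minimal sketch of the same idea that tracks the maximum directly instead of sorting (assuming whitespace-separated tokens; essentially the comparison-based approach benchmarked below):

    awk '{for(i=1;i<=NF;i++)if(int($i)&&(m==""||$i+0>m+0))m=$i}END{print m}' file
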
    Benchmark

    Mine:

    time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
    

    took

    real    0m0.434s
    user    0m0.357s
    sys     0m0.008s
    

    hek2mgl's:

    awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]' file
    

    took

    real    0m1.256s
    user    0m1.134s
    sys     0m0.019s
    

    For those wondering why it is faster: it comes down to using the default FS and RS, which awk is optimised for.

    Changing

    awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]'
    

    to

    awk '{for(i=1;i<=NF;i++)m=(m<$i && int($i))?$i:m}END{print m}'
    

    provides the time

    real    0m0.574s
    user    0m0.497s
    sys     0m0.011s
    

    Which is still a little slower than my command.

    I believe the slight difference that remains is due to asort() only having to sort around 6 numbers, as each distinct number is saved only once in the array.

    In comparison, the other command performs a comparison on every single number in the file, which is more computationally expensive.

    I think they would be around the same speed if all the numbers in the file were unique.
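
    A quick way to sanity-check that claim is to count how many distinct integers the array actually ends up holding (length() on an array is a gawk extension; file stands in for the benchmark input):

    awk '{for(i=1;i<=NF;i++)if(int($i))a[$i]=$i}END{print length(a)}' file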


    Tom Fenech's:

     time awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' myfile
    
     real    0m0.716s
     user    0m0.612s
     sys     0m0.013s
    

    A drawback of this approach, though, is that if all the numbers are below zero then max will be blank.
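
    For instance (a quick illustration with inline input):

    $ printf 'all -5 negative -2\n' | awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}'

    prints only a blank line, because each negative record fails the numeric comparison against the uninitialised (zero-valued) max.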


    Glenn Jackman's:

    time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' file
    
    real    0m1.492s
    user    0m1.258s
    sys     0m0.022s
    

    and

    time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
    
    real    0m0.790s
    user    0m0.686s
    sys     0m0.034s
    

    The good thing about perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' is that it is the only answer here that works when 0 is the largest number in the file, and it also works when all the numbers are negative.
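
    For example (quick checks with inline input):

    $ echo 'zero 0 wins here' | perl -MList::Util=max -0777 -nE 'say max /-?\d+/g'
    0
    $ echo 'all -5 negative -2' | perl -MList::Util=max -0777 -nE 'say max /-?\d+/g'
    -2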


    Notes

    All times are the average of 3 runs.

  • 2020-12-04 01:22

    I suspect this will be fastest:

    $ tr ' ' '\n' < file | sort -rn | head -1
    42342234
    

    Third run:

    $ time tr ' ' '\n' < file | sort -rn | head -1
    42342234
    real    0m0.078s
    user    0m0.000s
    sys     0m0.076s
    

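    If the file can also contain tabs or runs of spaces, a slightly more forgiving variant (a sketch; with -s, tr squeezes the repeated newlines that would otherwise reach sort as empty lines):

    $ tr -s ' \t' '\n' < file | sort -rn | head -1
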
    btw DON'T WRITE SHELL LOOPS to manipulate text, even if it's just to create sample input files:

    $ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfile
    
    real    0m0.109s
    user    0m0.031s
    sys     0m0.061s
    
    $ wc -l myfile
    150000 myfile
    

    compared to the shell loop suggested in the question:

    $ time for i in {1..50000}; do cat a >> myfile2 ; done
    
    real    26m38.771s
    user    1m44.765s
    sys     17m9.837s
    
    $ wc -l myfile2
    150000 myfile2
    

    If we want something that more robustly handles input files where digits appear inside strings that are not standalone integers, we need something like this:

    $ cat b
    hello 123 how are you i am fine 42342234 and blab bla bla
    and 3624 is another number
    but this is not enough for -23 234245
    73 starts a line
    avoid these: 3.14 or 4-5 or $15 or 2:30 or 05/12/2015
    
    $ grep -o -E '(^| )[-]?[0-9]+( |$)' b | sort -rn
     42342234
     3624
     123
    73
     -23
    
    $ time awk -v s="$(cat b)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfileB
    real    0m0.109s
    user    0m0.000s
    sys     0m0.076s
    
    $ wc -l myfileB
    250000 myfileB
    
    $ time grep -o -E '(^| )-?[0-9]+( |$)' myfileB | sort -rn | head -1 | tr -d ' '
    42342234
    real    0m2.480s
    user    0m2.509s
    sys     0m0.108s
    

    Note that this input file has more lines than the original, and with this input the robust grep solution above is actually faster than the tr-based one I posted at the start of this answer:

    $ time tr ' ' '\n' < myfileB | sort -rn | head -1
    42342234
    real    0m4.836s
    user    0m4.445s
    sys     0m0.277s
    
  • 2020-12-04 01:25

    I'm surprised by awk's speed here. perl is usually pretty speedy, but:

    $ for ((i=0; i<1000000; i++)); do echo $RANDOM; done > rand
    
    $ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' rand
    32767
    
    real    0m0.890s
    user    0m0.887s
    sys 0m0.003s
    
    $ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F; END {print $m}' rand
    32767
    
    real    0m1.110s
    user    0m1.107s
    sys 0m0.002s
    

    I think I've found a winner: With perl, slurp the file as a single string, find the (possibly negative) integers, and take the max:

    $ time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' rand
    32767
    
    real    0m0.565s
    user    0m0.539s
    sys 0m0.025s
    

    Takes a little more "sys" time, but less real time.

    Works with a file with only negative numbers too:

    $ cat file
    hello -42 world
    $ perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
    -42
    
  • 2020-12-04 01:27

    I'm sure a C implementation optimised with assembler would be the fastest. I could also imagine a program that splits the file into multiple chunks, maps each chunk onto a processor core, and afterwards just takes the maximum of the nproc remaining numbers, as sketched below.
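
    That chunked approach could look roughly like this with GNU parallel (a sketch, not benchmarked; it assumes GNU parallel is installed): --pipepart splits myfile into chunks without breaking lines, one awk per chunk prints that chunk's maximum, and the final sort | head reduces those few results to one number.

    # sketch: one awk per chunk finds a per-chunk max, then reduce
    parallel --pipepart -a myfile --block 10M -q \
        awk '{for(i=1;i<=NF;i++)if(int($i)&&m<$i+0)m=$i+0}END{print m}' \
      | sort -rn | head -1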

    Just using the existing command line tools, have you tried awk?

    time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile
    

    Here FPAT (a gawk extension) defines the fields as runs matching -{0,1}[0-9]+, i.e. optionally signed integers, and RS='$' turns the whole file into a single record (as long as it contains no literal $). It looks like it can do the job in ~50% of the time compared to the perl command in the accepted answer:

    time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' myfile
    cp myfile myfile2
    
    time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile2
    

    Gave me:

    42342234
    
    real    0m0.360s
    user    0m0.340s
    sys 0m0.020s
    42342234
    
    real    0m0.193s   <-- Good job awk! You are the winner.
    user    0m0.185s
    sys 0m0.008s
    