I want to get the maximum number in a file, where the numbers are integers that can occur anywhere in the file.
I thought about doing the following:
In awk you can say:
awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
In my experience awk is the fastest text-processing language for most tasks, and the only things I have seen of comparable speed (on Linux systems) are programs written in C/C++.
In the code above, using minimal functions and commands allows for faster execution.
for(i=1;i<=NF;i++) - Loops through the fields on the line. Using the default FS/RS and looping this way is usually faster than using custom ones, as awk is optimised to use the defaults.
if(int($i)) - Checks that the field is non-zero; since int() evaluates strings to zero, the next block is skipped when the field is a string. I believe this is the quickest way to perform this check.
{a[$i]=$i} - Sets an array element with the number as both key and value. This means there are only as many array elements as there are distinct numbers in the file, which will hopefully be quicker than a comparison of every number.
END{x=asort(a)} - At the end of the file, sorts the array with asort() and stores the size of the array in x.
print a[x] - Prints the last (i.e. largest) element in the array.
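As a quick sanity check, the command picks the largest integer out of a small hypothetical file (note that asort() is gawk-specific, so gawk is used explicitly here):

```shell
# Hypothetical sample input; asort() requires gawk.
printf 'hello 3 world 42\n7 foo 11\n' > /tmp/sample.txt
gawk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' /tmp/sample.txt
# prints 42
rm /tmp/sample.txt
```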
Mine:
time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
took
real 0m0.434s
user 0m0.357s
sys 0m0.008s
hek2mgl's:
awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]' file
took
real 0m1.256s
user 0m1.134s
sys 0m0.019s
For those wondering why it is faster: it comes down to using the default FS and RS, which awk is optimised for.
Changing
awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]'
to
awk '{for(i=1;i<=NF;i++)m=(m<$i && int($i))?$i:m}END{print m}'
provides the time
real 0m0.574s
user 0m0.497s
sys 0m0.011s
Which is still a little slower than my command.
I believe the slight difference that remains is because asort() only has to work on around 6 numbers, as each one is saved only once in the array.
In comparison, the other command is performing a comparison on every single number in the file which will be more computationally expensive.
I think they would be around the same speed if all the numbers in the file were unique.
Tom Fenech's:
time awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' myfile
real 0m0.716s
user 0m0.612s
sys 0m0.013s
A drawback of this approach, though, is that if all the numbers are below zero then max will be blank.
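That drawback is easy to reproduce on a small hypothetical file containing only negative numbers:

```shell
# All numbers below zero: max is never set, so an empty line is printed.
printf 'foo -5 bar -3\n' > /tmp/neg.txt
awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' /tmp/neg.txt
# prints an empty line
rm /tmp/neg.txt
```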
Glenn Jackman's:
time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' file
real 0m1.492s
user 0m1.258s
sys 0m0.022s
and
time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
real 0m0.790s
user 0m0.686s
sys 0m0.034s
The good thing about perl -MList::Util=max -0777 -nE 'say max /-?\d+/g'
is that it is the only answer here that works when 0 appears in the file as the largest number, and it also works when all the numbers are negative.
All times are representative of the average of 3 tests
I suspect this will be fastest:
$ tr ' ' '\n' < file | sort -rn | head -1
42342234
Third run:
$ time tr ' ' '\n' < file | sort -rn | head -1
42342234
real 0m0.078s
user 0m0.000s
sys 0m0.076s
btw DON'T WRITE SHELL LOOPS to manipulate text, even if it's creating sample input files:
$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfile
real 0m0.109s
user 0m0.031s
sys 0m0.061s
$ wc -l myfile
150000 myfile
compared to the shell loop suggested in the question:
$ time for i in {1..50000}; do cat a >> myfile2 ; done
real 26m38.771s
user 1m44.765s
sys 17m9.837s
$ wc -l myfile2
150000 myfile2
If we want something that more robustly handles input files that contain digits in strings that are not integers, we need something like this:
$ cat b
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
73 starts a line
avoid these: 3.14 or 4-5 or $15 or 2:30 or 05/12/2015
$ grep -o -E '(^| )[-]?[0-9]+( |$)' b | sort -rn
42342234
3624
123
73
-23
$ time awk -v s="$(cat b)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfileB
real 0m0.109s
user 0m0.000s
sys 0m0.076s
$ wc -l myfileB
250000 myfileB
$ time grep -o -E '(^| )-?[0-9]+( |$)' myfileB | sort -rn | head -1 | tr -d ' '
42342234
real 0m2.480s
user 0m2.509s
sys 0m0.108s
Note that the input file has more lines than the original, and with this input the above robust grep solution is actually faster than the original I posted at the start of this question:
$ time tr ' ' '\n' < myfileB | sort -rn | head -1
42342234
real 0m4.836s
user 0m4.445s
sys 0m0.277s
I'm surprised by awk's speed here. perl is usually pretty speedy, but:
$ for ((i=0; i<1000000; i++)); do echo $RANDOM; done > rand
$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' rand
32767
real 0m0.890s
user 0m0.887s
sys 0m0.003s
$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F; END {print $m}' rand
32767
real 0m1.110s
user 0m1.107s
sys 0m0.002s
I think I've found a winner: With perl, slurp the file as a single string, find the (possibly negative) integers, and take the max:
$ time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' rand
32767
real 0m0.565s
user 0m0.539s
sys 0m0.025s
Takes a little more "sys" time, but less real time.
Works with a file with only negative numbers too:
$ cat file
hello -42 world
$ perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
-42
I'm sure a C implementation optimised using assembler would be the fastest. I could also imagine a program which splits the file into multiple chunks, maps every chunk onto a single processor core, and afterwards just gets the maximum of the nproc remaining numbers.
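That chunking idea can be sketched in plain shell (this is only a sketch: the sample input is made up, the GNU `split -n` option is assumed to be available, and a real implementation would size the chunks to match nproc):

```shell
# Sketch only: split the input into chunks, scan each chunk in a
# background job, then take the max of the per-chunk maxima.
tmp=$(mktemp -d)
printf 'hello 123 foo 42342234 bar\n-7 3624 9\n' > "$tmp/input"

split -n l/2 "$tmp/input" "$tmp/chunk."      # GNU split: 2 line-based chunks

for c in "$tmp"/chunk.*; do
  # per-chunk maximum: extract integers with grep, reduce with sort
  grep -oE -e '-?[0-9]+' "$c" | sort -rn | head -n 1 >> "$tmp/partials" &
done
wait

max=$(sort -rn "$tmp/partials" | head -n 1)  # reduce: max of the partial maxima
echo "$max"
rm -r "$tmp"
```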
Just using the existing command line tools, have you tried awk?
time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile
Looks like it can do the job in ~50% of the time compared to the perl command in the accepted answer:
time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' myfile
cp myfile myfile2
time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile2
Gave me:
42342234
real 0m0.360s
user 0m0.340s
sys 0m0.020s
42342234
real 0m0.193s <-- Good job awk! You are the winner.
user 0m0.185s
sys 0m0.008s
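For reference, RS='$' makes gawk treat the whole file as a single record (assuming no literal $ characters appear in the input), and FPAT makes every integer its own field; a quick check on hypothetical input (FPAT is gawk-specific):

```shell
# Whole input becomes one record; FPAT pulls out 5, -12 and 9 as fields.
printf 'a 5 b -12\nc 9\n' |
  gawk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+'
# prints 9
```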