grep from tar.gz without extracting [faster one]

前端未结

关注

 8  670

Am trying to grep pattern from dozen files .tar.gz but its very slow

am using

tar -ztf file.tar.gz | while read FILENAME
do
        if tar -zxf file


                      
              相关标签:


      
      
        
          8条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  一向        
                
              
                            
                2020-12-13 04:25
              
            
            
                                                                       

  Am trying to grep pattern from dozen files .tar.gz but its very slow

tar -ztf file.tar.gz | while read FILENAME
do
        if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
        then
                echo "$FILENAME contains string"
        fi
done



That's actually very easy with ugrep option -z:

-z, --decompress
        Decompress files to search, when compressed.  Archives (.cpio,
        .pax, .tar, and .zip) and compressed archives (e.g. .taz, .tgz,
        .tpz, .tbz, .tbz2, .tb2, .tz2, .tlz, and .txz) are searched and
        matching pathnames of files in archives are output in braces.  If
        -g, -O, -M, or -t is specified, searches files within archives
        whose name matches globs, matches file name extensions, matches
        file signature magic bytes, or matches file types, respectively.
        Supported compression formats: gzip (.gz), compress (.Z), zip,
        bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2),
        lzma and xz (requires suffix .lzma, .tlz, .xz, .txz).


Which requires just one command to search file.tar.gz as follows:

ugrep -z "string" file.tar.gz


This greps each of the archived files to display matches. Archived filenames are shown in braces to distinguish them from ordinary filenames. For example:

$ ugrep -z "Hello" archive.tgz
{Hello.bat}:echo "Hello World!"
Binary file archive.tgz{Hello.class} matches
{Hello.java}:public class Hello // prints a Hello World! greeting
{Hello.java}:  { System.out.println("Hello World!");
{Hello.pdf}:(Hello)
{Hello.sh}:echo "Hello World!"
{Hello.txt}:Hello


If you just want the file names, use option -l (--files-with-matches) and customize the filename output with option --format="%z%~" to get rid of the braces:

$ ugrep -z Hello -l --format="%z%~" archive.tgz
Hello.bat
Hello.class
Hello.java
Hello.pdf
Hello.sh
Hello.txt

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  北海茫月        
                
              
                            
                2020-12-13 04:26
              
            
            
                                                                       
For starters, you could start more than one process:

tar -ztf file.tar.gz | while read FILENAME
do
        (if tar -zxf file.tar.gz "$FILENAME" -O | grep -l "string"
        then
                echo "$FILENAME contains string"
        fi) &
done


The ( ... ) & creates a new detached (read: the parent shell does not wait for the child)
process.

After that, you should optimize the extracting of your archive. The read is no problem, 
as the OS should have cached the file access already. However, tar needs to unpack
the archive every time the loop runs, which can be slow. Unpacking the archive once
and iterating over the result may help here:

local tempPath=`tempfile`
mkdir $tempPath && tar -zxf file.tar.gz -C $tempPath &&
find $tempPath -type f | while read FILENAME
do
        (if grep -l "string" "$FILENAME"
        then
                echo "$FILENAME contains string"
        fi) &
done && rm -r $tempPath


find is used here, to get a list of files in the target directory of tar, which we're iterating over, for each file searching for a string.

Edit: Use grep -l to speed up things, as Jim pointed out. From man grep:

   -l, --files-with-matches
          Suppress normal output; instead print the name of each input file from which output would
          normally have been printed.  The scanning will stop on the first match.  (-l is specified
          by POSIX.)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复