How to get all fields in outer join with Unix join?

后端未结

关注

 1  393

Suppose that I have two files, en.csv and sp.csv, each containing exactly two comma-separated records:

en.csv:

1,d


                      
              相关标签:


      
      
        
          1条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  轮回少年        
                
              
                            
                2021-02-08 03:07
              
            
            
                                                                       
Here is a solution that might or might not work for your data. It approaches the problem by aligning the records within a csv file by line number, i.e. record 2 ends up on line 2, record 3123 on line number 3123 and so on. Missing records/lines are padded with MISSING fields, so the input files would be mangled to look like this:

en.csv:

1,dog,red,car
2,MISSING,MISSING,MISSING
3,cat,white,boat


de.csv:

1,Hund,Rot,Auto
2,Kaninchen,Grau,Zug
3,MISSING,MISSING,MISSING


sp.csv:

1,MISSING,MISSING,MISSING
2,conejo,gris,tren
3,gato,blanco,bote


From there it is easy to cut out the columns of interest and just print them side-by-side using paste.

To achieve this, we sort the input files first and then apply some stupid awk magic:


If a record appears on their expected line number, print it
Otherwise, print as many lines containing the number of expected (this is based on the number of fields of the first line in the file, same as what join -o auto does) MISSING fields until the alignment is correct again
Not all input files are going to the same number of records, so the maximum is searched for before all of this. Then, more lines with MISSING fields are printed until the maximum is hit.


Code

reccut.sh:

#!/bin/bash

get_max_recnum()
{
    awk -F, '{ if ($1 > max) { max = $1 } } END { print max }' "$@"
}

align_by_recnum()
{
    sort -t, -k1 "$1" \
        | awk -F, -v MAXREC="$2" '
            NR==1 { for(x = 1; x < NF; x++) missing = missing ",MISSING" }
            {
                i = NR
                if (NR < $1)
                {
                    while (i < $1)
                    {
                        print i++ missing
                    }
                    NR+=i
                }
            }1
            END { for(i++; i <= MAXREC; i++) { print i missing } }
            '
}

_reccut()
{
    local infiles=()
    local args=( $@ )
    for arg; do
        infiles+=( "$2" )
        shift 2
    done
    MAXREC="$(get_max_recnum "${infiles[@]}")" __reccut "${args[@]}"
}

__reccut()
{
    local cols="$1"
    local infile="$2"
    shift 2

    if (( $# > 0 )); then
        paste -d, \
            <(align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}) \
            <(__reccut "$@")
    else
        align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}
    fi
}

_reccut "$@"


Run

$ ./reccut.sh 3 en.csv 2,4 sp.csv 3 de.csv
red,MISSING,MISSING,Rot
MISSING,conejo,tren,Grau
white,gato,bote,MISSING

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复