Compare two text files and print the difference against key in bash shell script

Question

I am working in a bash shell script. I have two large files (around 1.2 GB of data) containing keys and values. I need to compare the two files by key and store the difference in the values in a third file. File 2 will always be a subset of File 1, so I just need to find the values (per key) that are not present in File 2, plus the keys that exist only in File 1.

File 1:

test1 marco;polo;angus
test2 mike;zen;liza
test3 tom;harry;alan
test4 bob;june;janet
1332239_44557576_CONTI Lased & Micro kjd $353.50_30062020_lsdf3_no-rule 343323H;343434311H;454656556H;343343432H
1332240_44557576_CONTI Mazed & Micro kjd $353.50_30062020_lsdf3_some-rule 232324L;2226556H;343223432H

File 2:

test1 polo;angus
test2 mike
test4 bob;janet
1332240_44557576_CONTI Mazed & Micro kjd $353.50_30062020_lsdf3_some-rule 232324L;343223432H

I would like to take each line of File 1 in turn, search the entire contents of File 2 for the same key (the first two columns) and, if a match is found, print the values that differ. Keys that exist only in File 1 should also be printed.

Expected Output:

test1 marco
test2 zen;liza
test3 tom;harry;alan
test4 june
1332239_44557576_CONTI Lased & Micro kjd $353.50_30062020_lsdf3_no-rule 343323H;343434311H;454656556H;343343432H
1332240_44557576_CONTI Mazed & Micro kjd $353.50_30062020_lsdf3_some-rule 2226556H

The files I have are huge, containing about 100,000 lines, so I would like the execution to be fast. This runs inside a bash shell script. file1 and file2 are text files in which, for example, the key is 1332239_44557576_CONTI Lased & Micro kjd $353.50_30062020_lsdf3_no-rule and the values are 343323H;343434311H;454656556H;343343432H

Thanks in advance!!

Answer 1:

Could you please try the following, written and tested with the shown samples (and assuming that the lines in your Input_file(s) do not start with spaces).

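awk '
BEGIN{
  OFS=";"
}
{
  # Keys may contain spaces, so split each line at its LAST space:
  # everything before it is the key, everything after it the values.
  match($0,/.* /)
  key=substr($0,1,RLENGTH-1)
  line=substr($0,RLENGTH+1)
}
FNR==NR{
  # First file read (Input_file2): remember its values for each key.
  arrayfromFile2[key]=((key in arrayfromFile2)?arrayfromFile2[key] OFS:"")line
  next
}
(key in arrayfromFile2){
  # Rebuild the lookup set for this key only, so that values seen
  # under other keys cannot leak into this comparison.
  delete arrayChkFile2
  num=split(arrayfromFile2[key],temparrayChkFile2,";")
  for(i=1;i<=num;i++){
    arrayChkFile2[temparrayChkFile2[i]]
  }
  val=""
  num=split(line,array,";")
  for(i=1;i<=num;i++){
    if(!(array[i] in arrayChkFile2)){
       val=(val==""?"":val OFS)array[i]
    }
  }
  # Skip keys whose values are all present in Input_file2.
  if(val!=""){
    print key" "val
  }
  next
}
# Keys found only in Input_file1 are printed unchanged.
1
'  Input_file2  Input_file1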
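
Since the sample keys contain spaces, the first whitespace-separated field is not a reliable key on its own. The following is a minimal sketch of splitting a line at its last space instead, assuming that the values never contain a space. The function name split_key_values is hypothetical and is not part of the answer above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print "key<TAB>values" for each input line.
# Assumption: values contain no spaces, so the LAST space on the
# line separates the key from the values.
split_key_values() {
    awk '
        match($0, /.* /) {                       # greedy: runs up to the last space
            key    = substr($0, 1, RLENGTH - 1)  # text before the last space
            values = substr($0, RLENGTH + 1)     # text after the last space
            printf "%s\t%s\n", key, values
        }
    ' "$@"
}
```

For example, printf '%s\n' 'test1 polo;angus' | split_key_values prints test1, a tab, and then polo;angus. Lines that contain no space at all are skipped by this sketch.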

Answer 2:

If your input files are too large to fit in memory then you could create a set of tag-value pairs from each tag-values line in each input file, e.g.:

$ awk 'BEGIN{FS=OFS=";"} {tag=$0; sub(/ [^; ]+(;.*|$)/,"",tag); sub(/[^;]+ /,""); for (i=1;i<=NF;i++) print tag, $i}' file2
test1;polo
test1;angus
test2;mike
test4;bob
test4;janet
1332240_44557576_CONTI Mazed & Micro kjd $353.50_30062020_lsdf3_some-rule;232324L
1332240_44557576_CONTI Mazed & Micro kjd $353.50_30062020_lsdf3_some-rule;343223432H

and then use standard UNIX tools like sort and comm to get the differences you want and then recombine with awk into the original tag-values. Here's how the whole thing could work:

$ cat tst.sh
#!/usr/bin/env bash

separate() {
    awk '
        BEGIN { FS=OFS=";" }
        {
            tag = $0
            sub(/ [^; ]+(;.*|$)/,"",tag)
            sub(/[^;]+ /,"")
            for (i=1; i<=NF; i++) {
                print tag, $i
            }
        }
    ' "${@:--}" | sort
}

combine() {
    awk '
        BEGIN { FS=OFS=";" }
        $1 != prev {
            printf "%s%s", ors, $1
            prev = $1
            ors = ORS
            ofs = " "
        }
        {
            printf "%s%s", ofs, $2
            ofs = OFS
        }
        END {
            printf "%s", ors
        }
    ' "${@:--}"
}

comm -23 <(separate "$1") <(separate "$2") | combine

$ ./tst.sh file1 file2
1332239_44557576_CONTI Lased & Micro kjd $353.50_30062020_lsdf3_no-rule 343323H;343343432H;343434311H;454656556H
1332240_44557576_CONTI Mazed & Micro kjd $353.50_30062020_lsdf3_some-rule 2226556H
test1 marco
test2 liza;zen
test3 alan;harry;tom
test4 june

If in future you want to find the tag-value pairs that are in file2 but not in file1, or the pairs that are in both files, just change comm -23 to comm -13 or comm -12 respectively.
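
The three comm variants can be wrapped in one small function. This is only a sketch: the function name pair_diff and its mode names (only1, only2, both) are hypothetical, and both inputs are assumed to be already sorted lists of tag;value pairs, such as those produced by separate in tst.sh.

```bash
#!/usr/bin/env bash

# Hypothetical wrapper around comm for two ALREADY SORTED pair lists.
#   only1: pairs found only in the first list   (comm -23)
#   only2: pairs found only in the second list  (comm -13)
#   both:  pairs found in both lists            (comm -12)
pair_diff() {
    local mode=$1 left=$2 right=$3
    case $mode in
        only1) comm -23 "$left" "$right" ;;
        only2) comm -13 "$left" "$right" ;;
        both)  comm -12 "$left" "$right" ;;
        *)     echo "usage: pair_diff only1|only2|both LEFT RIGHT" >&2
               return 2 ;;
    esac
}
```

Inside tst.sh this could be called as pair_diff both <(separate "$1") <(separate "$2") | combine, which would print the tag-values common to both files.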

Source: https://stackoverflow.com/questions/61298408/compare-two-text-files-and-print-the-difference-against-key-in-bash-shell-script

Tags

bash

shell

awk