Compare two text files and print the differences by key in a bash shell script

终归单人心 2021-01-29 14:27

I have two large files, each around 1.2 GB, containing keys and values. Using a bash shell script, I need to compare the two files by key and store the differences in their values in a third file.

2 answers
  •  佛祖请我去吃肉
    2021-01-29 15:01

    If your input files are too large to fit in memory then you could create a set of tag-value pairs from each tag-values line in each input file, e.g.:

    $ awk 'BEGIN{FS=OFS=";"} {tag=$0; sub(/ [^; ]+(;.*|$)/,"",tag); sub(/[^;]+ /,""); for (i=1;i<=NF;i++) print tag, $i}' file2
    test1;polo
    test1;angus
    test2;mike
    test4;bob
    test4;janet
    1332240_44557576_CONTI Mazed & Micro kjd $353.50_30062020_lsdf3_some-rule;232324L
    1332240_44557576_CONTI Mazed & Micro kjd $353.50_30062020_lsdf3_some-rule;343223432H
    

    You can then use standard UNIX tools such as sort and comm to get the differences you want, and recombine the result with awk into the original tag-values format. Here's how the whole thing could work:

    $ cat tst.sh
    #!/usr/bin/env bash
    
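    # Print one sorted "tag;value" line for every value on each input line.
    # Reads the named files, or stdin when no file is given.
    separate() {
        awk '
            BEGIN { FS=OFS=";" }
            {
                # The tag is everything before the space that precedes the first value.
                tag = $0
                sub(/ [^; ]+(;.*|$)/,"",tag)
                # Strip the tag from $0 so only the ";"-separated values remain.
                sub(/[^;]+ /,"")
                for (i=1; i<=NF; i++) {
                    print tag, $i
                }
            }
        ' "${@:--}" | sort
    }
    
    # Regroup sorted "tag;value" lines into one "tag value1;value2;..." line per tag.
    combine() {
        awk '
            BEGIN { FS=OFS=";" }
            $1 != prev {
                # New tag: end the previous output line (if any) and start a new one.
                printf "%s%s", ors, $1
                prev = $1
                ors = ORS
                ofs = " "
            }
            {
                # A space precedes the first value of a tag, ";" precedes the rest.
                printf "%s%s", ofs, $2
                ofs = OFS
            }
            END {
                printf "%s", ors
            }
        ' "${@:--}"
    }
    
    # Keep only the pairs that appear in the first file but not in the second.
    comm -23 <(separate "$1") <(separate "$2") | combine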
    


    $ ./tst.sh file1 file2
    1332239_44557576_CONTI Lased & Micro kjd $353.50_30062020_lsdf3_no-rule 343323H;343343432H;343434311H;454656556H
    1332240_44557576_CONTI Mazed & Micro kjd $353.50_30062020_lsdf3_some-rule 2226556H
    test1 marco
    test2 liza;zen
    test3 alan;harry;tom
    test4 june
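
The input files themselves are not shown in the thread, so the following sketch makes up two tiny ones (their contents are hypothetical) and runs the same separate, comm, combine pipeline end to end, redirecting the result into a third file as the question asks. It uses temporary files instead of process substitution and sets LC_ALL=C so that sort and comm agree on ordering; both choices are assumptions of this sketch, not part of the script above.

```shell
#!/usr/bin/env bash
# Hypothetical sample inputs; the real file1/file2 are not shown in the thread.
export LC_ALL=C                      # byte-order collation shared by sort and comm
tmp=$(mktemp -d)
printf '%s\n' 'test1 marco;polo' 'test2 liza;zen;mike' > "$tmp/file1"
printf '%s\n' 'test1 polo;angus' 'test2 mike'          > "$tmp/file2"

# Same splitting logic as separate() in tst.sh: one "tag;value" line per value.
separate() {
    awk '
        BEGIN { FS=OFS=";" }
        {
            tag = $0
            sub(/ [^; ]+(;.*|$)/,"",tag)
            sub(/[^;]+ /,"")
            for (i=1; i<=NF; i++) print tag, $i
        }
    ' "$1" | sort
}

# Same regrouping logic as combine() in tst.sh.
combine() {
    awk '
        BEGIN { FS=OFS=";" }
        $1 != prev { printf "%s%s", ors, $1; prev = $1; ors = ORS; ofs = " " }
        { printf "%s%s", ofs, $2; ofs = OFS }
        END { printf "%s", ors }
    '
}

separate "$tmp/file1" > "$tmp/pairs1"
separate "$tmp/file2" > "$tmp/pairs2"

# Values present in file1 but not in file2, stored in a third file.
comm -23 "$tmp/pairs1" "$tmp/pairs2" | combine > "$tmp/file3"

result=$(cat "$tmp/file3")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With real data the equivalent call on the original script would simply be `./tst.sh file1 file2 > file3`.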
    

    If you later want the tag-value pairs that are in file2 but not in file1, or the pairs that are in both files, just change comm -23 to comm -13 or comm -12 respectively.
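
As a minimal illustration of those three comm modes, here is a sketch on two small, already-sorted lists of tag;value pairs. The data and the variable names are hypothetical and not taken from the question.

```shell
#!/usr/bin/env bash
# Hypothetical, already-sorted tag;value pairs.
export LC_ALL=C
tmp=$(mktemp -d)
printf '%s\n' 'test1;marco' 'test1;polo' 'test2;liza' > "$tmp/pairs1"
printf '%s\n' 'test1;polo' 'test2;mike'               > "$tmp/pairs2"

# comm prints three columns: only in file 1, only in file 2, in both.
# Each digit after the dash suppresses the column with that number.
only_in_1=$(comm -23 "$tmp/pairs1" "$tmp/pairs2")   # keep column 1
only_in_2=$(comm -13 "$tmp/pairs1" "$tmp/pairs2")   # keep column 2
in_both=$(comm -12 "$tmp/pairs1" "$tmp/pairs2")     # keep column 3

printf 'only in 1:\n%s\nonly in 2:\n%s\nin both:\n%s\n' \
    "$only_in_1" "$only_in_2" "$in_both"
rm -rf "$tmp"
```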
