Print differences between not sorted strings from files

醉酒当歌 submitted on 2019-12-11 06:50:10

Question


I have two files that each contain n lines, with one string per line. I want to print the difference in characters between corresponding lines of the two files. You can think of the operation as a kind of "subtraction" of letters. This is what it should look like:

List1       List2      Result
AaBbCcDd    AaCcDd     Bb
AaBbCcE     AaBbCc     E
AaBbCcF     AaCcF      Bb

That is, the second list is not sorted alphabetically, but the substrings to remove are sorted within each string (Aa comes before Bb, which comes before Cc). Note that the elements to remove can be either 1 or 2 characters long (Aa or F): always an uppercase letter, sometimes followed by a lowercase letter. The strings are built entirely from a few "elements" such as Aa, Bb, Cc, Dd, E, F, Gg, and so on.

This question has been answered in a very similar form here: Bash script Find difference between two strings, but only for two strings entered manually, whereas I need to perform the operation many hundreds of times. I am struggling to use files as the input to this command while also splitting the characters correctly. Here is my adaptation:

split_chars() { sed $'s/./&\\\n/g' <<< "$1"; }
comm -23 <(split_chars AaBbCcDd) <(split_chars AaCcDd)

which gives as output

B
b

so this is still not quite what I want, even in this single case. I guess that the split_chars function is the key here, but I was not able to apply it to my files in any way. Putting the file names inside the parentheses obviously does not work. For reference, a simple

comm -23 List1 List2

just leads to

AaBbCcDd
AaBbCcE
AaBbCcF
comm: file 2 is not in sorted order

Answer 1:


Since you want to split each string not into single characters but into substrings that each start with an uppercase letter, replace split_chars with the following function.

split() { sed 's/[A-Z]/\n&/g' <<< "$1"; }

Splitting a line can be undone by deleting all newline characters using tr -d \\n.
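As a quick check of the split and its inverse, here is a minimal sketch of the round trip. The name join_lines is a hypothetical helper introduced only for this illustration, and the sketch assumes GNU sed, which interprets \n in the replacement text as a newline.

```shell
#!/usr/bin/env bash
# Split a string into one element per line by starting a new line
# before every uppercase letter (assumes GNU sed for \n in the replacement).
split() { sed 's/[A-Z]/\n&/g' <<< "$1"; }

# Hypothetical helper: delete every newline from stdin, then print one
# final newline so the result is a proper line again.
join_lines() { tr -d '\n'; echo; }

# Prints an empty first line, then Aa, Bb, Cc and E on separate lines.
split AaBbCcE

# Round trip: prints the original string AaBbCcE.
split AaBbCcE | join_lines
```

The empty first line comes from the newline inserted before the leading uppercase letter. It is harmless here because join_lines removes it together with all other newlines.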

To subtract a list of lines from another list of lines you can use grep without having to sort.

grep -vFxf subtrahend minuend

This will print in original order those lines from file minuend which are not in file subtrahend.
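To see the subtraction in isolation, the following sketch runs it on two small temporary files. The file contents are made up for illustration, and the subtrahend is deliberately left unsorted.

```shell
#!/usr/bin/env bash
# Temporary files standing in for the two lists of lines.
minuend=$(mktemp)
subtrahend=$(mktemp)
printf '%s\n' Aa Bb Cc Dd > "$minuend"
printf '%s\n' Cc Aa Dd    > "$subtrahend"   # unsorted on purpose

# -f FILE  read the patterns from FILE, one per line
# -F       treat patterns as fixed strings, not regular expressions
# -x       a pattern must match the whole line
# -v       print the lines that do NOT match any pattern
result=$(grep -vFxf "$subtrahend" "$minuend")
echo "$result"   # prints Bb

rm -f "$minuend" "$subtrahend"
```

Note that this is a set subtraction: if an element occurs several times in the minuend, every occurrence is removed as soon as it appears once in the subtrahend.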

To put everything together, you have to

  • read both files line by line in parallel
  • split each string into a list of lines
  • subtract those lists
  • undo the splitting

Here is a simplified version that assumes your input files contain only lines of the described format and have the same number of lines.

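# Split a string into one element per line, breaking before each uppercase letter.
split() { sed 's/[A-Z]/\n&/g' <<< "$1"; }
# Print the lines of file $1 that do not occur in file $2.
subtract() { grep -vFxf "$2" "$1"; }
# Join the remaining lines back into a single string.
union() { tr -d \\n; echo; }
paste List1 List2 | while read -r minuend subtrahend; do
    subtract <(split "$minuend") <(split "$subtrahend") | union
done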

Bash loops that start several external processes per iteration, as this one does with sed, grep and tr, are slow. If you need a faster solution, rewrite this script in a language such as Perl or Python.
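Before switching languages, one option that may already help is to do the per-line work with bash built-ins only, so that no external process is started inside the loop. The following is a rough sketch rather than a drop-in replacement: subtract_elements is a hypothetical function name, it assumes bash 4 or later (for associative arrays), and it assumes every element is exactly one uppercase letter optionally followed by one lowercase letter.

```shell
#!/usr/bin/env bash
# subtract_elements MINUEND SUBTRAHEND
# Print the elements of MINUEND that do not occur in SUBTRAHEND,
# in their original order, using only bash built-ins.
subtract_elements() {
    local rest=$1 subtrahend=$2 out='' elem
    # One element at the start of the string, followed by the remainder.
    local re='^([[:upper:]][[:lower:]]?)(.*)$'
    local -A seen=()

    # Record every element of the subtrahend.
    while [[ $subtrahend =~ $re ]]; do
        seen[${BASH_REMATCH[1]}]=1
        subtrahend=${BASH_REMATCH[2]}
    done

    # Keep the elements of the minuend that were not recorded.
    while [[ $rest =~ $re ]]; do
        elem=${BASH_REMATCH[1]}
        rest=${BASH_REMATCH[2]}
        [[ -n ${seen[$elem]:-} ]] || out+=$elem
    done

    printf '%s\n' "$out"
}

# Demo on the three example pairs from the question; prints Bb, E, Bb.
while read -r minuend subtrahend; do
    subtract_elements "$minuend" "$subtrahend"
done <<'EOF'
AaBbCcDd AaCcDd
AaBbCcE AaBbCc
AaBbCcF AaCcF
EOF
```

To run it on the real files, the here-document can be replaced by the same paste List1 List2 pipeline used above. As with the grep version, an element is dropped from the result whenever it appears anywhere in the subtrahend.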




Answer 2:


Another approach, in GNU awk:

$ gawk 'NR==FNR {
    a[FNR]=$0
    next
}
{
    patsplit($0 a[FNR],b,/[A-Z][a-z]?/)
    printf "%s%s%s", a[FNR],OFS,$0
    for(i in b)
        if(!(match($0,b[i])&&match(a[FNR],b[i])))
            printf "%s%s", OFS, b[i]
    print ""
}' file1 file2

Output:

List1 List2
AaBbCcDd AaCcDd Bb
AaBbCcE AaBbCc E
AaBbCcF AaCcF Bb


Source: https://stackoverflow.com/questions/55726452/print-differences-between-not-sorted-strings-from-files
