How to get the output from the comm command into 3 separate files?

牧云@^-^@ submitted on 2019-12-06 07:32:33

The basic solution using sed relies on the fact that comm outputs lines found only in the first file with no prefix; it outputs the lines found only in the second file with a single tab; and it outputs the lines found in both files with two tabs.

It also relies on sed's w command to write to files.

Given file 1.sorted.txt containing:

1.line-1
1.line-2
1.line-4
1.line-6
2.line-2
3.line-5

and file 2.sorted.txt containing:

1.line-3
2.line-1
2.line-2
2.line-4
2.line-6
3.line-5

the basic output from comm 1.sorted.txt 2.sorted.txt is:

1.line-1
1.line-2
        1.line-3
1.line-4
1.line-6
        2.line-1
                2.line-2
        2.line-4
        2.line-6
                3.line-5
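
Because tabs and spaces look alike in a listing, it can help to make the prefixes visible before relying on them. The following is a minimal sketch that recreates the two sample files and pipes the comm output through sed's l command, which prints a tab as \t and marks the end of each line with $. The file name comm.visible is made up for this illustration.

```shell
#!/bin/sh
# Recreate the two sample files from the text
printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

# sed -n l prints each line unambiguously: tab becomes \t, end of line becomes $
# (comm.visible is a hypothetical scratch file used only for this check)
comm 1.sorted.txt 2.sorted.txt | sed -n l > comm.visible
cat comm.visible
```

Lines unique to the first file should show no \t, lines unique to the second file one \t, and common lines two.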

Given a file script.sed containing:

/^\t\t/ {
    s///
    w file.3
    d
}
/^\t/ {
    s///
    w file.2
    d
}
/^[^\t]/ {
    w file.1
    d
}

you can run the command shown below and get the desired output like this:

$ comm 1.sorted.txt 2.sorted.txt | sed -f script.sed
$ cat file.1
1.line-1
1.line-2
1.line-4
1.line-6
$ cat file.2
1.line-3
2.line-1
2.line-4
2.line-6
$ cat file.3
2.line-2
3.line-5
$

The script works by:

  1. matching lines that start with 2 tabs, deleting the tabs, writing the line to file.3, and deleting the line (so the rest of the script is ignored),
  2. matching lines that start with 1 tab, deleting the tab, writing the line to file.2, and deleting the line (so the rest of the script is ignored),
  3. matching lines that do not start with a tab, writing the line to file.1, and deleting the line.

The match and delete operations in step 3 are more for symmetry than anything else; they could be omitted (leaving just w file.1) and this script would work the same. However, see script3.sed below for further justification for keeping the symmetry.

As written, that requires GNU sed; BSD sed doesn't recognize the \t escapes. Obviously, the file could be written with actual tabs in place of the \t notation, and then BSD sed is OK with the script.
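
One way to get actual tabs into the script without typing them by hand is to let the shell generate the file. This is a sketch under assumptions: the name script.tabs.sed is invented here, and the tab comes from printf via a shell variable expanded inside an unquoted here-document.

```shell
#!/bin/sh
# Recreate the two sample files from the text
printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

tab=$(printf '\t')    # a literal tab character

# script.tabs.sed is a hypothetical name; ${tab} is expanded by the shell,
# so the file written to disk contains real tabs and no \t escapes
cat > script.tabs.sed <<EOF
/^${tab}${tab}/ {
    s///
    w file.3
    d
}
/^${tab}/ {
    s///
    w file.2
    d
}
/^[^${tab}]/ {
    w file.1
    d
}
EOF

comm 1.sorted.txt 2.sorted.txt | sed -f script.tabs.sed
```

The generated script contains no GNU-specific escapes, so it should behave the same with GNU sed and BSD sed.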

It is possible to make it work all on the command line, but it is fiddly (and that's being polite about it). Using Bash's ANSI C Quoting, you can write:

$ comm 1.sorted.txt 2.sorted.txt |
> sed -e $'/^\t\t/  { s///\n w file.3\n d\n }' \
>     -e $'/^\t/    { s///\n w file.2\n d\n }' \
>     -e $'/^[^\t]/ {        w file.1\n d\n }'
$

which writes each of the three 'paragraphs' of script.sed in a separate -e option. The w command is fussy: it treats everything after it on the same line of the script as the file name, hence the \n after each file name. There are spaces aplenty that could be eliminated, but the symmetry is clearer with the layout shown. Using the -f script.sed file is probably simpler. It is certainly a technique worth knowing, because it avoids the quoting problems that arise when a sed script must itself handle single quotes, double quotes and back-quotes, which are difficult to write on the Bash command line.
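
Since sed joins multiple -e fragments as separate script lines, another way to keep w happy is to give every command its own -e. The sketch below assumes a plain POSIX shell and takes the tab from printf instead of Bash's $'...' quoting. It also drops the match in the third paragraph, which the text above notes is optional.

```shell
#!/bin/sh
# Recreate the two sample files from the text
printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

tab=$(printf '\t')    # literal tab, so neither \t nor $'...' is needed

# One command per -e: each fragment ends its own script line,
# so every w file name is terminated by the end of its fragment
comm 1.sorted.txt 2.sorted.txt |
sed -e "/^${tab}${tab}/{" -e 's///' -e 'w file.3' -e 'd' -e '}' \
    -e "/^${tab}/{"       -e 's///' -e 'w file.2' -e 'd' -e '}' \
    -e 'w file.1'         -e 'd'
```

This is longer to type than the ANSI C quoting version, but it avoids embedding newlines in the arguments.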

Finally, if the two files can contain lines starting with tabs, this technique requires some more brute force to make it work. One variant solution exploits Bash's process substitution to add a prefix before the lines in the files, and then the post-processing sed script removes the prefixes before writing to the output files.

script3.sed (shown here with each tab displayed as up to 8 spaces). Note that this time the third paragraph needs a substitute s/// as well (the d is still optional, but may as well be included):

/^              X/ {
    s///
    w file.3
    d
}
/^      X/ {
    s///
    w file.2
    d
}
/^X/ {
    s///
    w file.1
    d
}

And the command line:

$ comm <(sed 's/^/X/' 1.sorted.txt) <(sed 's/^/X/' 2.sorted.txt) |
> sed -f script3.sed
$

For the same input files, this produces the same output, but by adding and then removing the X at the start of each line, the code doesn't change the sort order of the data and would handle leading tabs if they were present.
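
To see the leading-tab case in action, here is a small sketch with made-up inputs a.txt and b.txt, where a line that exists only in the first file begins with a tab. It assumes a plain POSIX shell, so temporary files (a.x and b.x, also invented names) stand in for Bash's process substitution, and script3.sed is generated with literal tabs.

```shell
#!/bin/sh
export LC_ALL=C                      # byte-wise collation, so the sort order is predictable
tab=$(printf '\t')

# Hypothetical inputs, already sorted in the C locale;
# the first line of a.txt starts with a tab and occurs only there
printf '%s\n' "${tab}alpha" beta > a.txt
printf '%s\n' beta gamma         > b.txt

# script3.sed with real tabs (the shell expands ${tab} in the here-document)
cat > script3.sed <<EOF
/^${tab}${tab}X/ {
    s///
    w file.3
    d
}
/^${tab}X/ {
    s///
    w file.2
    d
}
/^X/ {
    s///
    w file.1
    d
}
EOF

# Add the X prefix; temporary files replace <(...) for shells without process substitution
sed 's/^/X/' a.txt > a.x
sed 's/^/X/' b.txt > b.x
comm a.x b.x | sed -f script3.sed
```

With these inputs, the tab-prefixed line should land in file.1 with its tab intact, whereas the unprefixed script.sed would have mistaken it for a line unique to the second file.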

You can also easily write solutions that use Perl or Awk, and those do not even have to use comm (and can be made to work with unsorted files, provided the files fit into memory).
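
As a rough illustration of the comm-free approach, the sketch below uses awk alone on two deliberately unsorted files. The input names left.txt and right.txt and the output names only1.txt, only2.txt and both.txt are invented for this example. It assumes that both files fit in memory and that neither is empty, and it treats the files as sets of lines, so repeated lines are not counted the way comm would count them.

```shell
#!/bin/sh
# Made-up, deliberately unsorted inputs
printf '%s\n' cherry apple banana > left.txt
printf '%s\n' date banana apple   > right.txt

# Start from empty output files (hypothetical names)
: > only1.txt
: > only2.txt
: > both.txt

# The second file is read twice: once to learn its lines, once to classify them
awk '
    FNR == 1  { pass++ }                # a new input file begins (assumes no file is empty)
    pass == 1 { seen2[$0]; next }       # pass 1: remember every line of the second file
    pass == 2 {                         # pass 2: classify each line of the first file
        seen1[$0]
        if ($0 in seen2) { print > "both.txt" } else { print > "only1.txt" }
        next
    }
    !($0 in seen1) { print > "only2.txt" }   # pass 3: second-file lines absent from the first
' right.txt left.txt right.txt
```

The output keeps the order in which lines appear in the input files, and no sorting step is needed. The pass counter avoids gawk-only variables, so this should run under any POSIX awk.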

comm + awk solution:

Complicated sample files:

1.txt:

1. line-1 with spaces (                 |   | here
1.line-2
1.line-4    with tabs > 
 1.line-6
2.line-2
        3.line-5 (tabs)

2.txt:

1.line-3
  2.line-1 with spaces
2.line-2
2.line-4
    2.line-6 with tabs
        3.line-5 (tabs)

The job:

comm -12 1.txt 2.txt > file-common 
awk 'NR==FNR{ a[$0];next }!($0 in a){ print $0 > "file"ARGIND-1 }' file-common 1.txt 2.txt
  • comm -12 1.txt 2.txt > file-common saves the lines common to both files in file-common

  • awk ... prints the lines unique to 1.txt and 2.txt into file1 and file2 respectively; ARGIND is a GNU awk (gawk) extension, so this command needs gawk


Viewing results:

head file*
==> file1 <==
1. line-1 with spaces (                 |   | here
1.line-2
1.line-4    with tabs > 
 1.line-6

==> file2 <==
1.line-3
  2.line-1 with spaces
2.line-4
    2.line-6 with tabs

==> file-common <==
2.line-2
        3.line-5 (tabs)