Is there a way to delete duplicate lines in a file in Unix?
I can do it with the sort -u and uniq commands, but I want to use sed.
uniq would be fooled by trailing spaces and tabs. To emulate how a human makes the comparison, I am trimming all trailing spaces and tabs before comparing.
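For example (input invented here), uniq treats "foo" and "foo " as different lines unless the trailing whitespace is stripped first:
$ printf 'foo\nfoo \nfoo\t\n' | uniq | wc -l
3
$ printf 'foo\nfoo \nfoo\t\n' | sed -E 's/[ \t]+$//' | uniq | wc -l
1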
I think that the $!N; needs curly braces, or else it continues, and that is the cause of the infinite loop.
I have bash 5.0 and sed 4.7 on Ubuntu 20.10. The second one-liner did not work; it fails at the character set match.
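Just a guess, untested on that setup: if it is the [ \t] bracket expression that fails, the POSIX class [[:blank:]] names the same set of characters (space and tab) without relying on \t being understood inside brackets:
$ printf 'foo \nfoo\n' | sed -E 's/[[:blank:]]+$//' | uniq
foo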
Three variations: the first eliminates adjacent repeated lines, the second eliminates repeated lines wherever they occur, and the third eliminates all but the last instance of each line in the file.
# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.
dedupe() {
    sed -E '
        $!{
            N;
            s/[ \t]+$//;
            /^(.*)\n\1$/!P;
            D;
        }
    ';
}
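A quick sanity check of dedupe() with made-up input (note the trailing space after the second a):
$ printf 'a\na \nb\nb\nc\n' | dedupe
a
b
c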
# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons.
# Squeeze runs of blank lines down to one.
norepeat() {
    sed -n -E '
        s/[ \t]+$//;
        G;
        /^(\n){2,}/d;
        /^([^\n]+).*\n\1(\n|$)/d;
        h;
        P;
    ';
}
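Sample run with invented input (the repeated 1 and 2 are dropped, and the double blank line is squeezed to one):
$ printf '1\n2\n1\n\n\n3\n2\n' | norepeat
1
2

3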
# Keep only the last instance of each line in the file; blank lines are
# squeezed down to one.
lastrepeat() {
    sed -n -E '
        s/[ \t]+$//;
        /^$/{
            H;
            d;
        };
        G;
        # delete previous repeated line if found
        s/^([^\n]+)(.*)(\n\1(\n.*|$))/\1\2\4/;
        # after searching for previous repeat, move tested last line to end
        s/^([^\n]+)(\n)(.*)/\3\2\1/;
        $!{
            h;
            d;
        };
        # squeeze blank lines to one
        s/(\n){3,}/\n\n/g;
        s/^\n//;
        p;
    ';
}
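And the third one, again with made-up input; only the last occurrence of each line survives, in the order of those last occurrences:
$ printf '1\n2\n1\n3\n2\n' | lastrepeat
1
3
2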
sort filename | uniq -c | awk '$1 < 2 { sub(/^[[:space:]]*[0-9]+[[:space:]]/, ""); print }'
Removes every line that occurs more than once (all copies of it, not just the extras), using sort, uniq -c and awk; note that the output comes out sorted.
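For comparison, uniq can do this filtering by itself: with sorted input, uniq -u prints only the lines that are not repeated, which gives the same result as the awk pipeline above.
$ sort filename | uniq -u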
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5
The core idea is: print each run of duplicate consecutive lines ONLY once, at its LAST appearance, and use the D command to implement the loop.

Explanation:

$!N; : if the current line is NOT the last line, use the N command to read the next line into the pattern space.

/^(.*)\n\1$/!P : if the contents of the current pattern space are two duplicate strings separated by \n, the next line is the same as the current line, so we must NOT print it according to the core idea; otherwise the current line is the LAST appearance of its run of duplicate consecutive lines, and we use the P command to print the characters in the current pattern space up to the first \n (the \n is printed as well).

D : the D command deletes the characters in the current pattern space up to the first \n (the \n is deleted as well), so the pattern space now holds the next line. D also forces sed to jump back to its FIRST command, $!N, but it does NOT read the next line from the file or standard input stream.
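Spread out over several lines with the steps as comments, the same script reads like this (GNU sed; identical commands, same example input, so it should behave the same):
echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' | sed -nr '
    # unless we are on the last input line, append the next line to the pattern space
    $!N
    # print the first of the two lines only if they differ
    /^(.*)\n\1$/!P
    # delete up to the first \n and restart at $!N without reading new input
    D
'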
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5
The core idea is: print each run of duplicate consecutive lines ONLY once, at its FIRST appearance, and use the : command and the t command to implement the loop.

Explanation:

p : print the current line; it is the first appearance of its run of duplicates.

:loop : set a label named loop.

$!N : read the next line into the pattern space (unless the current line is the last one).

s/^(.*)\n\1$/\1/ : if the next line is the same as the current line, delete it; the s command performs the deletion by collapsing the two identical lines into one.

tloop : if the s command succeeded, jump back to the label named loop, which repeats the test on the following lines until there are no more duplicate consecutive copies of the line that was just printed.

D : otherwise (the s command made no substitution), use the D command to delete the line that is identical to the latest-printed line, and force sed to jump back to the first command, which is the p command; the pattern space now holds the next new line.
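And the same one-liner spread across lines with comments (GNU sed; identical commands, just reformatted, with the same example input):
echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' | sed -nr '
    # print the current line: the first appearance of its run
    p
    :loop
    # append the next line to the pattern space (unless at the last line)
    $!N
    # if both lines are identical, collapse them into one...
    s/^(.*)\n\1$/\1/
    # ...and go back to :loop to check the line after that
    t loop
    # otherwise drop the already-printed line and restart at p with the new one
    D
'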