How do I remove duplicate characters from each line and keep only the first occurrence? For example, my input is:
EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU
Expected output:

EFUAH
UEH
UJHACDEF
For a file named foo.txt containing the data you list:

python3 -c "print(set(open('foo.txt').read()))"

Note that this prints the set of distinct characters in the whole file (newline included) and does not preserve their order.
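If you want per-line output with the original order kept, here is a sketch (it assumes Python 3.7+, where plain dicts preserve insertion order, so dict.fromkeys keeps the first occurrence of each character):

```shell
# per-line, order-preserving duplicate removal
python3 -c "
for line in open('foo.txt'):
    print(''.join(dict.fromkeys(line.rstrip())))
"
# EFUAH
# UEH
# UJHACDEF
```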
use strict;
use warnings;

my @result;

sub uniq {
    my $seq  = shift;
    my $uniq = '';
    for my $char (split //, $seq) {
        # \Q...\E quotes the char so regex metacharacters match literally
        $uniq .= $char unless $uniq =~ /\Q$char\E/;
    }
    push @result, $uniq;
}

while (<DATA>) {
    uniq($_);
}
print @result;
__DATA__
EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU
The output:
EFUAH
UEH
UJHACDEF
Here is a solution that I think should work faster than the lookahead one; it is not regexp-based and uses a hash table.

perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}'

It splits every line into characters and prints only the first appearance of each character, counting appearances in the %seen hash.
This can be done using a positive lookahead:

perl -pe 's/(.)(?=.*?\1)//g' FILE_NAME
The regex used is: (.)(?=.*?\1)

- . : matches any char
- () : remembers the matched single char
- (?=...) : positive lookahead
- .*? : matches anything in between
- \1 : the remembered match
- (.)(?=.*?\1) : matches and remembers any char only if it appears again later in the string
- s/// : Perl's way of doing substitution
- g : does the substitution globally, i.e. doesn't stop after the first substitution

So s/(.)(?=.*?\1)//g deletes a char from the input string only if that char appears again later in the string. This will not maintain the order of the chars in the input, because for every unique char in the input string we retain its last occurrence, not the first.
To keep the relative order intact we can do what KennyTM tells in one of the comments: reverse the line, remove the duplicates, then reverse the result back. The Perl one-liner for this is:

perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' FILE_NAME

Since we are doing the print manually after the reversal, we use the -n flag instead of the -p flag.
I'm not sure if this is the best one-liner to do this. I welcome others to edit this answer if they have a better alternative.
From the shell, this works:
sed -e 's/$/<EOL>/ ; s/./&\n/g' test.txt | uniq | sed -e :a -e '$!N; s/\n//; ta ; s/<EOL>/\n/g'
In words: mark every linebreak with an <EOL> string, then put every character on a line of its own, then use uniq to remove duplicate lines, then strip out all the linebreaks, then put back linebreaks instead of the <EOL> markers.
I found the -e :a -e '$!N; s/\n//; ta' part in a forum post and I don't understand the separate -e :a part, or the $!N part, so if anyone can explain those, I'd be grateful.
Hmm, that one only removes consecutive duplicates; to eliminate all duplicates you could do this:

cat test.txt | while read line ; do echo "$line" | sed -e 's/./&\n/g' | sort | uniq | sed -e :a -e '$!N; s/\n//; ta' ; done

That puts the characters in each line in alphabetical order, though.
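For instance, on the first sample line (a sketch that assumes GNU sed, where \n in the replacement text is a newline):

```shell
# split into one char per line, sort, drop duplicates, rejoin
echo 'EFUAHUU' | sed -e 's/./&\n/g' | sort | uniq | sed -e :a -e '$!N; s/\n//; ta'
# AEFHU
```

The duplicates are gone, but the result is AEFHU rather than the order-preserving EFUAH.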