How do I remove duplicate characters and keep only the unique ones in Perl?

隐瞒了意图╮ 2020-12-05 16:08

How do I remove duplicate characters from each line and keep only one copy of each? For example, my input is:

EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU

Expected out

11 Answers
  • 2020-12-05 16:34

    For a file named foo.txt containing the data you list:

    python -c "print(set(open('foo.txt').read()))"
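
    Note that a set has no defined order, and the call above pools the characters of every line (newlines included) into one set. A minimal Python 3 sketch that instead keeps first-seen order per line is given here; the helper name `unique_chars` is made up for illustration, and it assumes Python 3.7+, where `dict` preserves insertion order:

```python
def unique_chars(line):
    """Return the characters of line in first-seen order, without repeats."""
    # dict.fromkeys keeps one key per character, in insertion order (Python 3.7+).
    return "".join(dict.fromkeys(line))


if __name__ == "__main__":
    # Sample lines from the question; iterate over open("foo.txt") to read a file
    # (strip the trailing newline from each line first).
    for line in ["EFUAHUU", "UUUEUUUUH", "UJUJHHACDEFUCU"]:
        print(unique_chars(line))
```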
    
  • 2020-12-05 16:36
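    use strict;
    use warnings;
    
    # Return the characters of a string in first-seen order, dropping repeats.
    sub uniq {
        my ($seq) = @_;
        my $uniq = '';
        for my $char (split //, $seq) {
            # index() compares literally, so regex metacharacters are safe
            $uniq .= $char if index($uniq, $char) < 0;
        }
        return $uniq;
    }
    
    my @result;
    while (my $line = <DATA>) {
        chomp $line;
        push @result, uniq($line);
    }
    print "$_\n" for @result;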
    
    __DATA__
    EFUAHUU
    UUUEUUUUH
    UJUJHHACDEFUCU
    

    The output:

    EFUAH
    UEH
    UJHACDEF
    
  • 2020-12-05 16:38

    Here is a solution that I think should run faster than the lookahead one. It is not regex-based; it uses a hash table instead.

    perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}' 
    

    It resets the %seen hash for every line, splits the line into characters, and prints a character only on its first appearance, counting appearances in %seen.
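
    The same bookkeeping can be sketched in Python for readers who want to step through it. This is only an illustration, not a translation of the one-liner: the function name `first_occurrences` is hypothetical, and a set stands in for the %seen hash because only membership matters here, not the count.

```python
def first_occurrences(line):
    """Keep only the first appearance of each character in line."""
    seen = set()  # plays the role of %seen; it starts empty for every line
    kept = []
    for ch in line:
        if ch not in seen:
            seen.add(ch)
            kept.append(ch)
    return "".join(kept)


for sample in ["EFUAHUU", "UUUEUUUUH", "UJUJHHACDEFUCU"]:
    print(first_occurrences(sample))
```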

  • 2020-12-05 16:48

    This can be done using a positive lookahead:

    perl -pe 's/(.)(?=.*?\1)//g' FILE_NAME
    

    The regex used is: (.)(?=.*?\1)

    • . : matches any character (except a newline).
    • the first () : captures the matched character.
    • (?=...) : positive lookahead.
    • .*? : matches anything in between, non-greedily.
    • \1 : the captured character.
    • (.)(?=.*?\1) : matches and captures a character only if it appears again later in the string.
    • s/// : Perl's substitution operator.
    • g : substitute globally, that is, don't stop after the first substitution.
    • s/(.)(?=.*?\1)//g : deletes a character from the input string only if that character appears again later in the string.
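
    To see what the pattern does to a concrete string, here is a small Python sketch. It assumes only that Python's `re` module accepts the same pattern, with a backreference inside the lookahead; the names `DUPLICATE_AHEAD` and `keep_last_occurrences` are made up for this example.

```python
import re

# Same pattern as above: one character that occurs again somewhere later.
DUPLICATE_AHEAD = re.compile(r"(.)(?=.*?\1)")


def keep_last_occurrences(line):
    """Delete every character that has a later copy in the same line."""
    return DUPLICATE_AHEAD.sub("", line)


print(keep_last_occurrences("EFUAHUU"))    # EFAHU
print(keep_last_occurrences("UUUEUUUUH"))  # EUH
```

    In both results the surviving U is the last one in the line, not the first.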

    This substitution does not preserve the order in which characters first appear, because for every unique character in the input string we retain its last occurrence rather than its first.

    To keep the relative order intact, we can do what KennyTM suggests in one of the comments:

    • reverse the input line
    • do the substitution as before
    • reverse the result before printing

    The Perl one-liner for this is:

    perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' FILE_NAME
    

    Since we print manually after the reversal, we use the -n flag instead of the -p flag.

    I'm not sure if this is the best one-liner to do this. I welcome others to edit this answer if they have a better alternative.
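
    The reverse, substitute, reverse steps can also be sketched in Python to check that they keep the first occurrence of each character. This is an illustration under the assumption that Python's `re` handles the pattern the same way; `keep_first_occurrences` is a hypothetical name.

```python
import re


def keep_first_occurrences(line):
    """Reverse the line, drop characters that recur later, then reverse back."""
    reversed_line = line[::-1]
    # In the reversed line, the last occurrence is the original first one.
    deduped = re.sub(r"(.)(?=.*?\1)", "", reversed_line)
    return deduped[::-1]


for sample in ["EFUAHUU", "UUUEUUUUH", "UJUJHHACDEFUCU"]:
    print(keep_first_occurrences(sample))
```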

  • 2020-12-05 16:49

    From the shell, this works:

    sed -e 's/$/<EOL>/ ; s/./&\n/g' test.txt | uniq | sed -e :a -e '$!N; s/\n//; ta ; s/<EOL>/\n/g'
    

    In words: mark every linebreak with a <EOL> string, then put every character on a line of its own, then use uniq to remove duplicate lines, then strip out all the linebreaks, then put back linebreaks instead of the <EOL> markers.
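
    Because uniq compares each line only with the one directly before it, this pipeline collapses runs of the same character and nothing more. A rough Python equivalent of that behaviour, using `itertools.groupby` (the function name `collapse_adjacent` is made up for this sketch):

```python
from itertools import groupby


def collapse_adjacent(line):
    """Drop repeats only when they sit next to each other, like uniq on lines."""
    # groupby yields one (key, group) pair per run of equal characters.
    return "".join(ch for ch, _run in groupby(line))


print(collapse_adjacent("EFUAHUU"))         # EFUAHU
print(collapse_adjacent("UJUJHHACDEFUCU"))  # UJUJHACDEFUCU
```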

    I found the -e :a -e '$!N; s/\n//; ta' part in a forum post and I don't understand the separate -e :a part or the $!N part, so if anyone can explain those, I'd be grateful.

    Hmm, the sed pipeline above removes only consecutive duplicates; to eliminate all duplicates you could do this:

    while read -r line ; do echo "$line" | sed -e 's/./&\n/g' | sort | uniq | sed -e :a -e '$!N; s/\n//; ta' ; done < test.txt
    

    That puts the characters in each line in alphabetical order though.
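
    The effect of sort | uniq on the characters of one line can be sketched in Python as follows. It is only an approximation: `sorted_unique` is a hypothetical name, and `sorted` orders by code point, which may differ from the locale-dependent order that sort uses.

```python
def sorted_unique(line):
    """Return each distinct character of line once, in sorted order."""
    return "".join(sorted(set(line)))


print(sorted_unique("EFUAHUU"))         # AEFHU
print(sorted_unique("UJUJHHACDEFUCU"))  # ACDEFHJU
```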
