How do I remove duplicate characters and keep the unique one only in Perl?

隐瞒了意图╮ 2020-12-05 16:08

How do I remove duplicate characters and keep only one of each? For example, my input is:

EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU

Expected out

11 Answers
  • 2020-12-05 16:23

    Use uniq from List::MoreUtils:

    perl -MList::MoreUtils=uniq -ne 'print uniq split ""'
    
  • 2020-12-05 16:24

    If the set of characters that can be encountered is restricted, e.g. only letters, then the easiest solution is tr:
    perl -p -e 'tr/a-zA-Z/a-zA-Z/s'
    It replaces every letter with itself, leaving other characters unaffected, and the /s modifier squeezes repeated occurrences of the same character (after replacement), thus removing duplicates.

    My bad: it only removes adjacent repeats, so it does not solve this problem. Disregard.

  • 2020-12-05 16:30

    If Perl is not a must, you can also use awk. Here's a fun benchmark of the Perl one-liners posted here against awk. awk is 10+ seconds faster on a file with over 3 million lines.

    $ wc -l <file2
    3210220
    
    $ time awk 'BEGIN{FS=""}{delete _;for(i=1;i<=NF;i++){if(!_[$i]++) printf $i};print""}' file2 >/dev/null
    
    real    1m1.761s
    user    0m58.565s
    sys     0m1.568s
    
    $ time perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}'  file2 > /dev/null
    
    real    1m32.123s
    user    1m23.623s
    sys     0m3.450s
    
    $ time perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' file2 >/dev/null
    
    real    1m17.818s
    user    1m10.611s
    sys     0m2.557s
    
    $ time perl -ne'my%s;print grep!$s{$_}++,split//' file2 >/dev/null
    
    real    1m20.347s
    user    1m13.069s
    sys     0m2.896s
    
  • 2020-12-05 16:31

    Tie::IxHash is a good module for preserving hash insertion order (but it may be slow; benchmark it if speed is important). Example with tests:

    use Test::More 0.88;
    
    use Tie::IxHash;
    sub dedupe {
      my $str=shift;
      my $hash=Tie::IxHash->new(map { $_ => 1} split //,$str);
      return join('',$hash->Keys);
    }
    
    {
    my $str='EFUAHUU';
    is(dedupe($str),'EFUAH');
    }
    
    {
    my $str='EFUAHHUU';
    is(dedupe($str),'EFUAH');
    }
    
    {
    my $str='UJUJHHACDEFUCU';
    is(dedupe($str),'UJHACDEF');
    }
    
    done_testing();
    
  • 2020-12-05 16:34
    perl -ne'my%s;print grep!$s{$_}++,split//'
    
  • 2020-12-05 16:34

    This looks like a classic application of positive lookbehind, but unfortunately Perl only supports fixed-width lookbehind, which is not enough here. Matching the text preceding a character against a full regex of indeterminate length can, as far as I know, only be done with the .NET regex classes.

    However, positive lookahead supports full regexes, so all you need to do is reverse the string, apply positive lookahead (like unicornaddict said):

    perl -pe 's/(.)(?=.*?\1)//g' 
    

    Then reverse it back. Without the reversal, the substitution keeps only the last occurrence of each duplicated character in a line.

    MASSIVE EDIT

    I've spent the last half hour on this, and it looks like this works without the reversing:

    perl -pe 's/\G$1//g while (/(.).*(?=\1)/g)' FILE_NAME
    

    I don't know whether to be proud or horrified. I'm basically doing the positive lookahead, then substituting on the string with \G specified, which makes the regex engine start its matching from the last place matched (the position reported by pos()).

    With test input like this:

    aabbbcbbccbabb

    EFAUUUUH

    ABCBBBBD

    DEEEFEGGH

    AABBCC

    The output is like this:

    abc

    EFAUH

    ABCD

    DEFGH

    ABC

    I think it's working...

    Explanation: in case my explanation last time wasn't clear enough, the lookahead stops at the last occurrence of a duplicated character (you can put a print pos(); inside the loop to check), and the s/\G$1//g removes it (you don't really need the /g). So within the loop, the substitution keeps removing until all such duplicates are zapped. Of course, this might be a little too processor-intensive for your tastes... but so are most of the regex-based solutions you'll see. The reversing/lookahead method will probably be more efficient than this, though.
