How do I remove duplicate characters and keep the unique one only in Perl?

How do I remove duplicate characters and keep only one occurrence of each? For example, my input is:

EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU

Expected out

11 Answers

星月不相逢

2020-12-05 16:23
Use uniq from List::MoreUtils:
```
perl -MList::MoreUtils=uniq -ne 'print uniq split ""'
```
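For use inside a script rather than on the command line, here is a minimal sketch of the same idea. It assumes a Perl recent enough to ship List::Util 1.45 or later, which exports its own `uniq`, so it swaps in the core module instead of List::MoreUtils. The helper name `dedupe_line` is made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(uniq);    # assumption: List::Util 1.45+ (swapped in for List::MoreUtils)

# dedupe_line is a hypothetical helper name, not part of either module.
sub dedupe_line {
    my ($line) = @_;
    # split // yields single characters; uniq keeps the first copy of each.
    return join '', uniq(split //, $line);
}

print dedupe_line($_), "\n" for qw(EFUAHUU UUUEUUUUH UJUJHHACDEFUCU);
```

Because `uniq` preserves the order of first appearance, the first occurrence of each character is the one that survives.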
鱼传尺愫

2020-12-05 16:24

If the set of characters that can be encountered is restricted, e.g. only letters, then the easiest solution is tr:
perl -p -e 'tr/a-zA-Z/a-zA-Z/s'
It replaces every letter with itself and leaves other characters untouched. The /s modifier then squeezes each run of the same replaced character down to a single one, which removes the duplicates.

My bad: this only removes adjacent repeats, so it does not handle duplicates that are separated by other characters. Disregard this answer.
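To make the limitation concrete, here is a small sketch that wraps the same tr call in a function. The name `squeeze_letters` is hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# squeeze_letters is a hypothetical helper wrapping the tr///s call above.
sub squeeze_letters {
    my ($str) = @_;
    # /s squeezes runs of the same letter; it modifies the local copy only.
    $str =~ tr/a-zA-Z/a-zA-Z/s;
    return $str;
}

print squeeze_letters('EFUAHUU'), "\n";    # EFUAHU: the trailing U still duplicates the earlier one
print squeeze_letters('UUUEUUUUH'), "\n";  # UEUH: the two runs of U are squeezed separately
```

The second U survives in both cases because it is not adjacent to the first one.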

北恋

2020-12-05 16:30

If Perl is not a must, you can also use awk. Here's a fun benchmark of the posted Perl one-liners against awk. On a file with more than 3 million lines, awk is over 10 seconds faster:

$ wc -l <file2
3210220

$ time awk 'BEGIN{FS=""}{delete _;for(i=1;i<=NF;i++){if(!_[$i]++) printf $i};print""}' file2 >/dev/null

real    1m1.761s
user    0m58.565s
sys     0m1.568s

$ time perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}'  file2 > /dev/null

real    1m32.123s
user    1m23.623s
sys     0m3.450s

$ time perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' file2 >/dev/null

real    1m17.818s
user    1m10.611s
sys     0m2.557s

$ time perl -ne'my%s;print grep!$s{$_}++,split//' file2 >/dev/null

real    1m20.347s
user    1m13.069s
sys     0m2.896s
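
To compare only the Perl variants without the file I/O, one option is the core Benchmark module. The sketch below is an assumption-laden rewrite, not the commands timed above: the three one-liners are turned into functions with made-up names, the input is the three sample lines from the question, and the iteration count of 20000 is arbitrary. Absolute numbers will differ from the timings shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical function forms of the three Perl one-liners timed above.
sub dedupe_loop {
    my ($line) = @_;
    my %seen;
    my $out = '';
    for my $ch (split //, $line) {
        $out .= $ch unless $seen{$ch}++;
    }
    return $out;
}

sub dedupe_reverse {
    my ($line) = @_;
    my $rev = reverse $line;
    $rev =~ s/(.)(?=.*?\1)//g;    # drop characters that recur later
    return scalar reverse $rev;
}

sub dedupe_grep {
    my ($line) = @_;
    my %seen;
    return join '', grep { !$seen{$_}++ } split //, $line;
}

my @lines = ('EFUAHUU', 'UUUEUUUUH', 'UJUJHHACDEFUCU');

# Run each variant 20000 times over the sample lines and print a comparison table.
cmpthese(20000, {
    'loop'    => sub { dedupe_loop($_)    for @lines },
    'reverse' => sub { dedupe_reverse($_) for @lines },
    'grep'    => sub { dedupe_grep($_)    for @lines },
});
```

With lines this short the ranking may not match the whole-file timings, so treat the table as a rough guide only.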

我在风中等你

2020-12-05 16:31

Tie::IxHash is a good module for preserving hash insertion order (it may be slow, so benchmark it if speed matters). Example with tests:

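use strict;
use warnings;
use Test::More 0.88;

use Tie::IxHash;

# Keys come back in insertion order; a repeated key keeps its first position.
sub dedupe {
  my $str = shift;
  my $hash = Tie::IxHash->new(map { $_ => 1 } split //, $str);
  return join('', $hash->Keys);
}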

{
my $str='EFUAHUU';
is(dedupe($str),'EFUAH');
}

{
my $str='EFUAHHUU';
is(dedupe($str),'EFUAH');
}

{
my $str='UJUJHHACDEFUCU';
is(dedupe($str),'UJHACDEF');
}

done_testing();

夕颜

2020-12-05 16:34
```
perl -ne'my%s;print grep!$s{$_}++,split//'
```
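The one-liner is terse, so here is a spelled-out sketch of what it does. The function name `dedupe_first_seen` is invented for illustration. In the one-liner, `-n` wraps the code in a loop over input lines, so `my %s` starts empty for every line.

```perl
use strict;
use warnings;

# dedupe_first_seen is a hypothetical name for the grep/%seen idiom above.
sub dedupe_first_seen {
    my ($line) = @_;
    my %seen;    # character => how many times it has been met so far
    # $seen{$_}++ returns the count *before* incrementing, so
    # !$seen{$_}++ is true only the first time a character is met.
    return join '', grep { !$seen{$_}++ } split //, $line;
}

print dedupe_first_seen($_), "\n" for qw(EFUAHUU UUUEUUUUH UJUJHHACDEFUCU);
```

Each character costs one hash lookup, so the work grows linearly with the line length.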
温柔的废话

2020-12-05 16:34
This looks like a classic application of positive lookbehind, but unfortunately Perl doesn't support variable-length lookbehind. In fact, matching the text preceding a character against a full regex of indeterminate length can only be done with the .NET regex classes, I think.

However, positive lookahead supports full regexes, so all you need to do is reverse the string and apply positive lookahead (like unicornaddict said):
```
perl -pe 's/(.)(?=.*?\1)//g' 
```
Then reverse it back. Without the reversal, the substitution keeps only the last occurrence of each duplicated character in a line instead of the first.
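
As a rough sketch of the difference, here are both variants as functions. The names `dedupe_keep_last` and `dedupe_keep_first` are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: the bare substitution, which keeps the LAST occurrence.
sub dedupe_keep_last {
    my ($str) = @_;
    $str =~ s/(.)(?=.*?\1)//g;    # delete a character if it recurs later
    return $str;
}

# Hypothetical helper: reverse, substitute, reverse back; keeps the FIRST occurrence.
sub dedupe_keep_first {
    my ($str) = @_;
    my $rev = reverse $str;       # scalar context reverses the string
    $rev =~ s/(.)(?=.*?\1)//g;
    return scalar reverse $rev;
}

print dedupe_keep_last('EFUAHUU'), "\n";   # EFAHU
print dedupe_keep_first('EFUAHUU'), "\n";  # EFUAH
```

The one-liner form of the second function is the `$_=reverse; ...; print scalar reverse` command from the benchmark answer.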

MASSIVE EDIT

I've spent the last half hour on this, and the following seems to work without the reversing:
```
perl -pe 's/\G$1//g while (/(.).*(?=\1)/g)' FILE_NAME
```
I don't know whether to be proud or horrified. I'm basically doing the positive lookahead, then substituting on the string with \G specified, which makes the regex engine start matching at the place where the last match ended (the position reported by pos()).

With test input like this:

aabbbcbbccbabb

EFAUUUUH

ABCBBBBD

DEEEFEGGH

AABBCC

The output is like this:

abc

EFAUH

ABCD

DEFGH

ABC

I think it's working...

Explanation, in case my earlier one wasn't clear enough: the lookahead match stops just before the last occurrence of a duplicated character (you can put a print pos(); inside the loop to check), and the s/\G$1//g removes that occurrence (you don't really need the /g). The loop keeps substituting until all such duplicates are zapped. Of course, this might be a little too processor-intensive for your tastes, but so are most of the regex-based solutions you'll see. The reversing/lookahead method will probably be more efficient than this, though.
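
Here is a sketch of the same loop as a function, assuming strings without embedded newlines. The name `dedupe_with_g` is hypothetical, and the `\Q...\E` quoting is an addition not in the one-liner above, so that characters such as `.` or `*` are treated literally.

```perl
use strict;
use warnings;

# dedupe_with_g is a hypothetical function form of the \G one-liner above.
sub dedupe_with_g {
    my ($str) = @_;
    # In scalar context, m//g leaves pos($str) at the end of the match,
    # which is just before the last repeat of the captured character.
    while ($str =~ /(.).*(?=\1)/g) {
        my $ch = $1;
        # \G anchors the substitution at pos($str), so only that last
        # repeat is deleted. Modifying $str resets pos, so the next
        # pass of the loop scans from the start again.
        $str =~ s/\G\Q$ch\E//;
    }
    return $str;
}

print dedupe_with_g($_), "\n" for qw(aabbbcbbccbabb EFAUUUUH ABCBBBBD DEEEFEGGH AABBCC);
```

Each pass removes one character and rescans the line, which is why this is likely slower than the reverse-and-lookahead approach.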