How to remove duplicated characters from string in Bash?

Question

I have a string

cabbagee

I want to delete duplicate characters. If I use tr -s, it only removes duplicates that are adjacent to each other. But my desired output is

cabge

I'd appreciate it if anyone can help me with that.

The answer provided was right but I was not able to use awk so I used:

#!/usr/bin/bash
key=$1
len=${#key}
mkey=""
for (( c=0; c<len; c++ ))
do
    tmp=${key:$c:1}
    echo $mkey | grep $tmp >/dev/null 2>&1   
    if [ "$?" -eq "0" ]; then
        echo "Found $tmp in $mkey"
    else
        mkey+=$tmp
    fi
done
echo $mkey

Answer 1:

Can you use awk?

awk -v FS="" '{
    for(i=1;i<=NF;i++)str=(++a[$i]==1?str $i:str)
}
END {print str}' <<< "cabbagee"
cabge

A couple of other ways:

GNU awk:

awk -v RS='[a-z]' '{str=(++a[RT]==1?str RT: str)}END{print str}' <<< "cabbagee"
cabge

awk -v RS='[a-z]' -v ORS= '++a[RT]==1{print RT}END{print "\n"}' <<< "cabbagee"
cabge

GNU sed and awk:

sed 's/./&\n/g' <<< "cabbagee" | awk '!a[$1]++' | sed ':a;N;s/\n//;ba'
cabge

Answer 2:

You edited your post and posted an answer that's ugly and broken. A simpler, working and more efficient one, in pure Bash:

#!/bin/bash

key=$1
mkey=$key
for ((i=0;i<${#mkey};++i)); do
    c=${mkey:i:1}
    tailmkey=${mkey:i+1}
    mkey=${mkey::i+1}${tailmkey//"$c"/}
done
echo "$mkey"

Why is your script broken? Here are a few cases where yours fails and mine doesn't. For the sake of the demonstration, I called your script banana and mine gorilla. Oh, and because I'm not mean, I fixed the trivial quoting problems in your script (it trivially breaks on the * character) and commented out the part that floods the output:

#!/usr/bin/bash
key=$1
len=${#key}
mkey=""
for (( c=0; c<len; c++ )); do
    tmp=${key:$c:1}
    echo "$mkey" | grep "$tmp" >/dev/null 2>&1   # Added quotes here!
    if [ "$?" -eq "0" ]; then
        : # echo "Found $tmp in $mkey" # Commented this to remove flooding
    else
        mkey+=$tmp
    fi
done
echo "$mkey"   # Added quotes here!

So let's go:

$ ./banana '^'

$ ./gorilla '^'
^

Yes, that's because ^ is a character used in grep's regex. Similar stuff happens with $, and also with .:

$ ./banana 'a.'
a
$ ./gorilla 'a.'
a.

Now the backslash causes problems too:

$ ./banana '\\'
\\
$ ./gorilla '\\'
\

(remove the >/dev/null 2>&1 part to see the grep: Trailing backslash error). The same thing happens with [.
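All of these failures come from handing each character to grep as a regular expression. A minimal sketch of the same loop with a literal comparison instead, assuming Bash (the name dedupe_literal is made up for illustration):

```shell
#!/usr/bin/env bash
# dedupe_literal is a hypothetical name, not part of either script above.
dedupe_literal() {
    local key=$1 out='' ch i
    for (( i = 0; i < ${#key}; i++ )); do
        ch=${key:i:1}
        # Quoting "$ch" inside the pattern makes the match literal,
        # so ^, $, ., [ and \ are treated as ordinary characters.
        [[ $out == *"$ch"* ]] || out+=$ch
    done
    printf '%s\n' "$out"
}

dedupe_literal 'cabbagee'
dedupe_literal 'a.'
dedupe_literal '^'
```

If grep has to stay, grep -F -e "$tmp" (fixed-string matching) should avoid the same class of problem, at the cost of still starting one process per character.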

Not to mention that your script is highly inefficient: it calls grep once for every character of the input. Mine is a bit better in that respect:

$ time for i in {1..200}; do ./banana cabbage; done &>/dev/null

real    0m3.028s
user    0m0.216s
sys     0m0.464s
$ time for i in {1..200}; do ./gorilla cabbage; done &>/dev/null

real    0m0.878s
user    0m0.172s
sys     0m0.324s

Not bad, eh?

Another benchmark that speaks for itself: with a long string, e.g., a paragraph of Lorem Ipsum:

$ time ./banana 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam viverra nec consectetur ante hendrerit. Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ut gravida lorem. Ut turpis felis, pulvinar a semper sed, adipiscing id dolor. Pellentesque auctor nisi id magna consequat sagittis. Curabitur dapibus enim sit amet elit pharetra tincidunt feugiat nisl imperdiet. Ut convallis libero in urna ultrices accumsan. Donec sed odio eros. Donec viverra mi quis quam pulvinar at malesuada arcu rhoncus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. In rutrum accumsan ultricies. Mauris vitae nisi at sem facilisis semper ac in est.'
Lorem ipsudlta,cngDSMqvhPbNAUfCI

real    0m1.464s
user    0m0.104s
sys     0m0.224s
$ time ./gorilla 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam viverra nec consectetur ante hendrerit. Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ut gravida lorem. Ut turpis felis, pulvinar a semper sed, adipiscing id dolor. Pellentesque auctor nisi id magna consequat sagittis. Curabitur dapibus enim sit amet elit pharetra tincidunt feugiat nisl imperdiet. Ut convallis libero in urna ultrices accumsan. Donec sed odio eros. Donec viverra mi quis quam pulvinar at malesuada arcu rhoncus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. In rutrum accumsan ultricies. Mauris vitae nisi at sem facilisis semper ac in est.'
Lorem ipsudlta,cng.DSMqvhPbNAUfCI

real    0m0.013s
user    0m0.000s
sys     0m0.008s

That's because banana calls grep once for each character of the input string, whereas gorilla performs the removal in place, with no external process. (I'm not going to mention that banana missed the period.)
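To see what gorilla does at each step, here is a sketch of the same loop wrapped in a function that prints the intermediate string to stderr (gorilla_trace and the trace format are illustrative additions, not part of the original script):

```shell
#!/usr/bin/env bash
# Same loop as gorilla, wrapped in a hypothetical function that also
# prints the intermediate string to stderr after each step.
gorilla_trace() {
    local mkey=$1 c tail i
    for (( i = 0; i < ${#mkey}; ++i )); do
        c=${mkey:i:1}
        tail=${mkey:i+1}
        # Keep everything up to and including position i, then drop
        # every later occurrence of the current character.
        mkey=${mkey::i+1}${tail//"$c"/}
        printf 'i=%d c=%s -> %s\n' "$i" "$c" "$mkey" >&2
    done
    printf '%s\n' "$mkey"
}

gorilla_trace 'cabbagee'
```

For cabbagee the string goes cabbagee, cabbgee, cabgee, cabgee, cabge. Because the loop condition re-reads the length of mkey, the loop runs once per distinct character rather than once per input character.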

Answer 3:

How about:

echo "cabbagee" | sed 's/./&\n/g' | perl -ne '$H{$_}++ or print' | tr -d '\n'

Which yields:

cabge

The above splits your string's characters into individual lines (sed 's/./&\n/g'), then uses a bit of Perl magic (credit: the question "Unix tool to remove duplicate lines from a file") to remove any duplicate lines. Finally, tr -d '\n' removes the newlines we added, giving your desired output.

Might need to modify it a bit for your specific purpose, and it feels terribly hacky, but it seems to get the job done.
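One such modification: the sed step relies on GNU sed treating \n in the replacement as a newline. A sketch of the same split, filter, join idea using only fold, awk and tr, assuming single-byte characters (fold -w counts columns, so multibyte input may behave differently); the function name is illustrative:

```shell
#!/usr/bin/env bash
# dedupe_pipeline is a hypothetical helper name.
dedupe_pipeline() {
    printf '%s\n' "$1" |
        fold -w 1 |            # one character per line
        awk '!seen[$0]++' |    # keep only the first occurrence of each line
        tr -d '\n'             # join the lines back together
    printf '\n'                # restore the trailing newline tr removed
}

dedupe_pipeline 'cabbagee'
```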

Good luck.

Answer 4:

You could use grep -o . to put each character on its own line, then collect in Bash only the characters that haven't been seen yet:

grep -o . <<<'cabbagee' |
{ while IFS= read -r c; do [[ $s = *"$c"* ]] || s=$s$c; done; printf '%s\n' "$s"; }
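If the result is needed after the loop, one option is to feed the loop from a process substitution instead of a pipe. A sketch of that variant, assuming Bash (the variable names str and s are just examples):

```shell
#!/usr/bin/env bash
# Sketch: same idea as above, but the loop reads from a process
# substitution, so it runs in the current shell and "s" survives it.
str='cabbagee'   # example input
s=''
while IFS= read -r c; do
    # Quoted "$c" is matched literally; append only unseen characters.
    [[ $s == *"$c"* ]] || s+=$c
done < <(grep -o . <<< "$str")
printf '%s\n' "$s"
```

The pipe version needs the { ...; } group because, by default, Bash runs each side of a pipeline in a subshell; here s is still set once the loop ends.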

Answer 5:

I'm not sure what language you are doing this in, but you could always write a for loop that goes through the string, with an if statement inside it along the lines of if (yourstring.charAt(i).equals(yourstring.charAt(i+1))) { replace(yourstring.charAt(i+1), "") }. So basically: loop over the string, and whenever the character at the current index is equal to the character at the next index, replace the next one with an empty string: "".
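In Bash, the loop described above might look like the following sketch (squeeze_adjacent is a made-up name). Note that comparing each character only with its neighbour collapses adjacent repeats, much like tr -s, so for cabbagee it prints cabage rather than the cabge the question asks for:

```shell
#!/usr/bin/env bash
# squeeze_adjacent is a hypothetical name for the loop described above.
squeeze_adjacent() {
    local s=$1 out='' prev='' ch i
    for (( i = 0; i < ${#s}; i++ )); do
        ch=${s:i:1}
        # Skip a character only when it repeats the one just before it.
        [[ $i -gt 0 && $ch == "$prev" ]] || out+=$ch
        prev=$ch
    done
    printf '%s\n' "$out"
}

squeeze_adjacent 'cabbagee'   # prints cabage, not cabge
```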

Source: https://stackoverflow.com/questions/23402740/how-to-remove-duplicated-characters-from-string-in-bash

Tags

regex

Linux

bash