How to remove duplicate characters in a string using regex?

前端未结

关注

 2  623

I need to replace the duplicate characters in a string. I tried using

outputString = str.replaceAll(\"(.)(?=.*\\\\1)\", \"\");

This replac

相关标签:

2条回答

南旧

2021-01-20 02:41
Overview

It's possible with Oracle's implementation, but I wouldn't recommend this answer for many reasons:
- It relies on a bug in the implementation, which interprets *, + or {n,} as {0, 0x7FFFFFFF}, {1, 0x7FFFFFFF}, {n, 0x7FFFFFFF} respectively, which allows the look-behind to contains such quantifiers. Since it relies on a bug, there is no guarantee that it will work similarly in the future.
- It is unmaintainable mess. Writing normal code and any people who have some basic Java knowledge can read it, but using the regex in this answer limits the number of people who can understand the code at a glance to people who understand the in and out of regex implementation.
Therefore, this answer is for educational purpose, rather than something to be used in production code.

Solution

Here is the one-liner replaceAll regex solution:
```
String output = input.replaceAll("(.)(?=(.*))(?<=(?=\\1.*?\\1\\2$).+)","")
```
Printing out the regex:
```
(.)(?=(.*))(?<=(?=\1.*?\1\2$).+)
```
What we want to do is to look-behind to see whether the same character has appeared before or not. The capturing group (.) at the beginning captures the current character, and the look-behind group is there to check whether the character has appeared before. So far, so good.

However, since backreferences \1 doesn't have obvious length, it can't appear in the look-behind directly.

This is where we make use of the bug to look-behind up to the beginning of the string, then use a look-ahead inside the look-behind to include the backreference, as you can see (?<=(?=...).+).

This is not the end of the problem, though. While the non-assertion pattern inside look-behind .+ can't advance past the position after the character in (.), the look-ahead inside can. As a simple test:
```
"haaaaaaaaa".replaceAll("h(?<=(?=(.*)).*)","$1")
> "aaaaaaaaaaaaaaaaaa"
```
To make sure that the search doesn't spill beyond the current character, I capture the rest of the string in a look-ahead (?=(.*)) and use it to "mark" the current position (?=\\1.*?\\1\\2$).

Can this be done in one replacement without using look-behind?

I think it is impossible. We need to differentiate the first appearance of a character with subsequent appearance of the same character. While we can do this for one fixed character (e.g. a), the problem requires us to do so for all characters in the string.

For your information, this is for removing all subsequent appearance of a fixed character (h is used here):
```
.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)","$1$2")
```
To do this for multiple characters, we must keep track of whether the character has appeared before or not, across matches and for all characters. The regex above shows the across matches part, but the other condition kinda makes this impossible.

We obviously can't do this in a single match, since subsequent occurrences can be non-contiguous and arbitrary in number.
0 讨论(0)
发布评论:

提交评论
- 加载中...

长发绾君心

2021-01-20 02:45

It seems like your code is leaving the last character, so how about this?

outputString = new StringBuilder(str).reverse().toString();
// outputString is now hiah
outputString = outputString.replaceAll("(.)(?=.*\\1)", "");
// outputString is now iah
outputString = new StringBuilder(outputString).reverse().toString();
// outputString is now hai

0 讨论(0)

How to remove duplicate characters in a string using regex?

Overview

Solution

Can this be done in one replacement without using look-behind?