My input file looks like below:
“true true, rohith Rohith;
cold burn, and fact and fact good good?”
Output should look like:
Just match the same backreference in sed:
sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'
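As a quick check, the command can be run on the sample input from the question. This is only a sketch: the `dedupe` function name is made up for illustration, the script is split into `-e` pieces purely for readability, and `\|` inside a basic regular expression assumes GNU sed.

```shell
# Same script as above, split into -e pieces.
# Assumption: GNU sed, because \| in a basic regex is a GNU extension.
dedupe() {
  sed -e ':l' \
      -e 's/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g' \
      -e 'tl'
}

# Sample input from the question, without the typographic quotes.
result=$(printf '%s\n' 'true true, rohith Rohith;' \
                       'cold burn, and fact and fact good good?' | dedupe)
printf '%s\n' "$result"
# prints:
# true, rohith Rohith;
# cold burn, and fact and fact good?
```

Note that `rohith Rohith` is kept, because the backreference comparison is case-sensitive.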
How it works:
:l - create a label l to jump to. See tl below.
s - substitute
/
\(^\|[^[:alpha:]]\) - match the beginning of the line or a non-alphabetic character. This is so that the next part matches the whole word, not only a suffix.
\([[:alpha:]]\{1,\}\) - match a word: one or more alphabetic characters.
[^[:alpha:]]\{1,\} - match a non-word: one or more non-alphabetic characters.
\2 - match the same thing as the second \(...\), i.e. the same word again.
\($\|[^[:alpha:]]\) - match the end of the line or a non-alphabetic character. This is so that we match the whole second word, not only its prefix.
/
\1\2\3 - replace the match with the leading character, the word once, and the trailing character.
/
g - substitute globally. But because the regex never goes back over text it has already consumed, a single pass only collapses two words at a time.
tl - jump to label l if the last s command was successful. This is here so that when the same word appears three times, like true true true, it is properly replaced by a single true.

Without the \(^\|[^[:alpha:]]\) and \($\|[^[:alpha:]]\) groups, true rue would for example be replaced by true, because the suffix rue rue would match.
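The two edge cases described above can be checked with a small sketch. The names `script`, `anchored`, `onepass` and `unanchored` are made up for illustration, the `unanchored` variant is a hypothetical simplification written only to show the failure, and GNU sed is assumed for `\|`.

```shell
# The s command from the answer, kept in a variable so it can be reused.
script='s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g'

# Full version: loop with the label and tl until nothing changes.
anchored() { sed -e ':l' -e "$script" -e 'tl'; }

# Single pass, no loop: only one pair is collapsed.
onepass() { sed -e "$script"; }

# Hypothetical variant without the two boundary groups.
unanchored() {
  sed -e ':l' -e 's/\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\1/\1/g' -e 'tl'
}

triple=$(printf '%s\n' 'true true true' | anchored)
single=$(printf '%s\n' 'true true true' | onepass)
kept=$(printf '%s\n' 'true rue' | anchored)
broken=$(printf '%s\n' 'true rue' | unanchored)

printf '%s | %s | %s | %s\n' "$triple" "$single" "$kept" "$broken"
# prints: true | true true | true rue | true
```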
Below are my other solutions, which also remove repeated words across lines.
My first solution was with uniq. First I transform the input into pairs, one per line, then run them through uniq -f1, which ignores the first field when comparing, and then convert back. This will be very slow:
# recreate input
cat <
But then I noticed that sed does a good job of tokenizing the input: it places zero bytes between word and non-word tokens, so the stream is easy to read. In GNU awk I can read the zero-separated stream and skip repeated words by comparing each word with the last word read:
cat <
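The idea of comparing each word with the last word read can also be sketched on its own. This is not the script above: it is a minimal illustration that slurps the whole input in plain awk instead of reading a zero-separated stream, the `dedupe_words` name is made up, and it assumes that a word is a run of ASCII letters only.

```shell
dedupe_words() {
  awk '
    { text = text $0 "\n" }                  # slurp all lines, keeping newlines
    END {
      last = ""; out = ""
      while (match(text, /[A-Za-z]+/)) {
        sep  = substr(text, 1, RSTART - 1)   # non-word run before the word
        word = substr(text, RSTART, RLENGTH) # the word itself
        text = substr(text, RSTART + RLENGTH)
        if (word != last) {                  # keep it only if it differs from
          out = out sep word                 # the last word that was kept
          last = word
        }                                    # otherwise drop separator and word
      }
      printf "%s", out text                  # append the trailing non-word rest
    }
  '
}

deduped=$(printf '%s\n' 'true true, rohith Rohith;' \
                        'cold burn, and fact and fact good good?' | dedupe_words)
printf '%s\n' "$deduped"
```

Because the separator in front of a repeated word is dropped together with the word, a repetition that spans a line break (for example `fact` at the end of one line and `fact` at the start of the next) is collapsed as well.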
Instead of the zero byte, any unique character could serve as the record separator, for example ^. That way the script also works with non-GNU awk; I tested it with the mawk available on repl. The script below also uses shorter variable names:
cat <
Tested on repl. The snippets output:
true, rohith Rohith;
cold burn, and fact and fact good?