My input file looks like below:
“true true, rohith Rohith;
cold burn, and fact and fact good good?”
Output should look like:
This is not exactly what you have shown in output but is close, using gnu-awk:
awk -v RS='[^-_[:alnum:]]+' '$1 == p{printf "%s", RT; next} {p=$1; ORS=RT} 1' file
“true , rohith Rohith;
cold burn, and fact and fact good ?”
With GNU awk for the 4th arg to split():
$ cat tst.awk
{
    n = split($0,words,/[^[:alpha:]]+/,seps)
    prev = ""
    for (i=1; i<=n; i++) {
        word = words[i]
        if (word != prev) {
            printf "%s%s", seps[i-1], word
        }
        prev = word
    }
    print ""
}
$ awk -f tst.awk file
“true, rohith Rohith;
cold burn, and fact and fact good?”
sed -E 's/(\w+) *\1/\1/g' sample.txt
sample.txt
“true true, rohith Rohith;
cold burn, and fact and fact good good?”
output:
:~$ sed -E 's/(\w+) *\1/\1/g' sample.txt
“true, rohith Rohith;
cold burn, and fact and fact good?”
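Explanation
(\w+) *\1
- captures a word, then matches zero or more spaces followed by the same word again; the whole match is replaced by the single captured word. Because " *" also allows zero spaces and the pattern is not anchored at word boundaries, a doubled letter inside a single word (such as the oo in a lone good) is collapsed too.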
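As a hedged variation on the command above (not the answerer's own command), word boundaries can be added so that only whole repeated words are collapsed, and a repeated group handles runs of three or more. This sketch assumes GNU sed, since \b and \w are GNU extensions; the function name dedupe_adjacent is made up for illustration. Unlike the original, it requires at least one space between the repeats.

```shell
# Sketch, assuming GNU sed (\b and \w are GNU extensions).
# dedupe_adjacent is a hypothetical helper name, not from the answer above.
dedupe_adjacent() {
    # \b(\w+)      capture a whole word
    # ( +\1\b)+    one or more repeats of that word, each after spaces
    sed -E 's/\b(\w+)( +\1\b)+/\1/g'
}

printf 'true true true, good food\n' | dedupe_adjacent
# prints: true, good food
```

With the unanchored (\w+) *\1 pattern, the same input would also lose a letter from good and food, which is the case the boundaries are meant to prevent.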
Depending on your expected input, this might work:
sed -r 's/([a-zA-Z0-9_-]+)( *)\1/\1\2/g ; s/ ([.,;:])/\1/g ; s/  / /g' myfile
([a-zA-Z0-9_-]+) = words that might be repeated.
( *)\1 = checks whether the previous word is repeated after zero or more spaces.
s/ ([.,;:])/\1/g = removes the extra space before punctuation (you might want to add characters to this group).
s/  / /g = replaces double spaces with a single space.
This works with GNU sed.
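As an illustration of adding characters to the punctuation group, the sketch below extends it with ? and ! so that the trailing "good ?" in the sample input is tidied as well. The choice of extra characters and the function name dedupe_words are assumptions for the example, and GNU sed is assumed as stated above.

```shell
# Sketch, assuming GNU sed; same three substitutions as above, but with
# ? and ! added to the punctuation group (an illustrative choice).
# dedupe_words is a hypothetical helper name.
dedupe_words() {
    sed -r 's/([a-zA-Z0-9_-]+)( *)\1/\1\2/g ; s/ ([.,;:?!])/\1/g ; s/  / /g'
}

printf 'true true, rohith Rohith;\ncold burn, and fact and fact good good?\n' | dedupe_words
# prints:
# true, rohith Rohith;
# cold burn, and fact and fact good?
```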
Simple sed:
echo "true true, rohith Rohith;
cold burn, and fact and fact good good?" | sed -r 's/(\w+) (\1)/\1/g'
Just match the same backreference in sed:
sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'
How it works:
:l - create a label l to jump to. See tl below.
s - substitute
/
\(^\|[^[:alpha:]]\) - match the beginning of the line or a non-alphabetic character. This is so that the next part matches the whole word, not only a suffix.
\([[:alpha:]]\{1,\}\) - match a word: one or more alphabetic characters.
[^[:alpha:]]\{1,\} - match a non-word: one or more non-alphabetic characters.
\2 - match the same thing as the second \(...\), i.e. match the word again.
\($\|[^[:alpha:]]\) - match the end of the line or a non-alphabetic character. This is so that we match the whole second word, not only its prefix.
/
\1\2\3 - substitute it with <beginning of the line or non-alphabetic prefix character><the word><end of the line or non-alphabetic suffix character found>
/
g - substitute globally. But because the regex never goes back over text it has already consumed, a single pass collapses only two words at a time.
tl - jump to label l if the last s command was successful. This is here so that when the same word occurs three times, like true true true, the run is properly replaced by a single true.
Without the \(^\|[^[:alpha:]]\) and \($\|[^[:alpha:]]\) parts, true rue, for example, would be replaced by true, because the suffix rue rue would match.
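The two cases discussed in the explanation can be checked directly. This is only a small sketch that wraps the same command in a function; the name dedupe_loop is made up, and GNU sed is assumed (for \| in a basic regex and for the one-line ":l; ...; tl" form).

```shell
# The command from above, wrapped in a function (dedupe_loop is a hypothetical name).
# Assumes GNU sed.
dedupe_loop() {
    sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'
}

# Three repeats need the tl loop: the first pass leaves "true true".
printf 'true true true\n' | dedupe_loop
# prints: true

# The anchors keep the suffix "rue" of "true" from matching the word "rue".
printf 'true rue\n' | dedupe_loop
# prints: true rue
```

Note that any run of non-alphabetic characters counts as the separator here, so an input such as true, true also collapses to true.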
Below are my other solutions, which also remove repeated words across lines.
My first solution was with uniq. First I transform the input into pairs with the format <non-alphabetical sequence separating words encoded in hex> <a word>. Then I run it through uniq -f1, which ignores the first field, and then convert it back. This will be very slow:
# recreate input
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
# insert zero byte after each word and non-word
# the -z option is from GNU sed
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
# for each pair (non-word, word)
xargs -0 -n2 sh -c '
    # output hexadecimal representation of non-word
    printf "%s" "$1" | xxd -p | tr -d "\n"
    # and output space with the word
    printf " %s\n" "$2"
' -- |
# uniq ignores empty fields - so make sure field1 always has something
sed 's/^/-/' |
# uniq while ignoring first field
uniq -f1 |
# for each pair (non-word in hex, word)
xargs -n2 bash -c '
    # change non-word from hex to characters
    # (in a POSIX shell, use: printf "%s" "$1" | sed "s/^-//" | xxd -r -p)
    printf "%s" "${1:1}" | xxd -r -p
    # output word
    printf "%s" "$2"
' --
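To see the key step of the script above in isolation, the sketch below feeds uniq -f1 three hand-written pairs in the same format. The sample pairs are made up for illustration: "-" is the padding prefix, 20 is a space in hex and 2c20 is a comma followed by a space.

```shell
# Three (separator-in-hex, word) pairs in the format the script above produces.
# uniq -f1 skips the first field when comparing, so lines whose words match
# are treated as duplicates and only the first one is kept.
printf '%s\n' '- true' '-20 true' '-2c20 rohith' | uniq -f1
# prints:
# - true
# -2c20 rohith
```

The surviving line for a repeated word carries the separator that preceded its first occurrence, which is why the separator between the two copies disappears after decoding.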
But then I noticed that sed is doing a good job of tokenizing the input: it places zero bytes between each word and non-word token, so I can easily read the stream. I can drop repeated words in awk by reading the zero-separated stream in GNU awk and comparing each word with the last word read:
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
gawk -vRS='\0' '
    NR%2==1{
        nonword=$0
    }
    NR%2==0{
        if (length(lastword) && lastword != $0) {
            printf "%s%s", lastword, nonword
        }
        lastword=$0
    }
    END{
        printf "%s%s", lastword, nonword
    }'
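The tokenizing step used above can be made visible by turning the zero bytes into a printable character. This is only a sketch for inspection, assuming GNU sed (for -z and \x00); the | marker and the short sample input are arbitrary choices.

```shell
# Sketch, assuming GNU sed (-z and \x00 are GNU extensions).
# tr replaces each zero byte with | purely so the token boundaries are visible.
printf 'true true, good?\n' |
    sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
    tr '\0' '|'
```

The output should start with |true| |true|, |good|? and the tokens alternate between non-word and word, beginning with an empty non-word. That alternation is what the NR%2 tests in the awk script rely on.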
In place of the zero byte, something unique could be used as the record separator, for example the ^ character. That way it can be used with a non-GNU awk; tested with the mawk available on repl. I shortened the script by using shorter variable names here:
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r 's/[[:alpha:]]+/^&^/g' |
awk -vRS='^' '
NR%2{ n=$0 }
NR%2-1 && length(l) && l != $0 { printf "%s%s", l, n }
NR%2-1 { l=$0 }
END { printf "%s%s", l, n }
'
Tested on repl. The snippets output:
true, rohith Rohith;
cold burn, and fact and fact good?