Remove consecutive duplicate words from a file using awk or sed

不思量自难忘° 2021-01-16 16:53

My input file looks like below:

“true true, rohith Rohith;
cold burn, and fact and fact good good?”

Output should look like:



        
6 answers
  • 2021-01-16 17:32

    This is not exactly the output you have shown, but it is close, using GNU awk:

    awk -v RS='[^-_[:alnum:]]+' '$1 == p{printf "%s", RT; next} {p=$1; ORS=RT} 1' file
    

    “true , rohith Rohith;
    cold burn, and fact and fact good ?”
    
  • 2021-01-16 17:34

    With GNU awk for the 4th arg to split():

    $ cat tst.awk
    {
        n = split($0,words,/[^[:alpha:]]+/,seps)
        prev = ""
        for (i=1; i<=n; i++) {
            word = words[i]
            if (word != prev) {
                printf "%s%s", seps[i-1], word
            }
            prev = word
        }
        print ""
    }
    
    $ awk -f tst.awk file
    “true, rohith Rohith;
    cold burn, and fact and fact good?”
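
    If GNU awk is not available, the same word-by-word walk can be approximated with match() and substr(), which are in POSIX awk. This is only a sketch under assumptions that are not in the answer: the function name dedupe_words is made up, and a word is taken to be a run of ASCII letters.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): drop consecutive repeated words
# using only POSIX awk features (match, RSTART, RLENGTH, substr).
# Assumption: a "word" is a run of ASCII letters.
dedupe_words() {
    awk '
    {
        rest = $0; out = ""; prev = ""
        # Peel off one word at a time, together with the separator before it.
        while (match(rest, /[A-Za-z]+/)) {
            sep  = substr(rest, 1, RSTART - 1)
            word = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
            # Keep separator and word only if the word differs from the previous one.
            if (word != prev) out = out sep word
            prev = word
        }
        # Whatever follows the last word (closing punctuation, for example) is kept.
        print out rest
    }' "$@"
}

printf 'true true, rohith Rohith;\ncold burn, and fact and fact good good?\n' | dedupe_words
```

    As in tst.awk, the comparison is case-sensitive and restarts on every line, so rohith Rohith is kept and duplicates that span a line break are not detected.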
    
  • 2021-01-16 17:34
    sed -E 's/(\w+) *\1/\1/g' sample.txt
    

    sample.txt

    “true true, rohith Rohith;
    cold burn, and fact and fact good good?”
    

    output:

    :~$ sed -E 's/(\w+) *\1/\1/g' sample.txt
    “true, rohith Rohith;
    cold burn, and fact and fact good?”
    

    Explanation

    (\w+) *\1 - matches a word followed by zero or more spaces and the same word again. The word is captured in group 1, and the replacement \1 keeps a single copy of it.
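
    One thing to watch with a pattern that allows zero spaces and has no word boundaries: it can also match a doubled letter inside a single word (the oo in good, giving god). Below is a minimal sketch of a stricter variant, assuming GNU sed, since \b and \w are GNU extensions. The function name dedupe_sed is made up for illustration, and the loop is there so that runs of three or more words collapse too.

```shell
#!/bin/sh
# Hypothetical helper; assumes GNU sed (\b and \w are GNU extensions).
dedupe_sed() {
    # \b(\w+)  a whole word, captured as group 1
    #  +\1\b   one or more spaces, then the same whole word again
    # :a / ta  repeat the substitution until a pass changes nothing
    sed -E -e ':a' -e 's/\b(\w+) +\1\b/\1/g' -e 'ta' "$@"
}

# A doubled letter inside a word is left alone; repeated whole words collapse.
printf 'good food, true true true\n' | dedupe_sed
```

    Unlike the original command, this variant requires at least one space between the two copies and does not touch a word that merely starts with the previous word, such as the then.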

  • 2021-01-16 17:46

    Depending on your expected input, this might work:

    sed -r 's/([a-zA-Z0-9_-]+)( *)\1/\1\2/g ; s/ ([.,;:])/\1/g ; s/  / /g' myfile
    

    ([a-zA-Z0-9_-]+) = words that might be repeated.

    ( *)\1 = checks whether the same word is repeated after zero or more spaces; the spaces are captured in group 2 and kept in the replacement.

    s/ ([.,;:])/\1/g = removes extra spaces before punctuation (you might want to add characters to this group).

    s/  / /g = replaces each double space with a single space.

    This works with GNU sed.
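
    To see what each of the three substitutions contributes, they can be run one at a time. This is a sketch on a made-up sample line, assuming GNU sed; -E is used here in place of -r, and both select extended regular expressions.

```shell
#!/bin/sh
# Made-up sample line; assumes GNU sed.
line='true true, good good?'

# Step 1: collapse the repeated word but keep the spaces that separated the pair.
step1=$(printf '%s\n' "$line" | sed -E 's/([a-zA-Z0-9_-]+)( *)\1/\1\2/g')
# Step 2: remove the space that is now left in front of listed punctuation.
step2=$(printf '%s\n' "$step1" | sed -E 's/ ([.,;:])/\1/g')
# Step 3: squeeze double spaces (there are none in this sample, so nothing changes).
step3=$(printf '%s\n' "$step2" | sed -E 's/  / /g')

printf 'after step 1: [%s]\nafter step 2: [%s]\nafter step 3: [%s]\n' "$step1" "$step2" "$step3"
```

    The space before ? survives because ? is not in the [.,;:] group, which is the case the answer has in mind when it suggests adding characters to that group.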

  • 2021-01-16 17:48

    Simple sed:

    echo "true true, rohith Rohith;
    cold burn, and fact and fact good good?" | sed -r 's/(\w+) (\1)/\1/g'
    
  • 2021-01-16 17:51

    Just match the same backreference in sed:

    sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'
    

    How it works:

    • :l - create a label l to jump to. See tl below.
    • s - substitute
      • /
      • \(^\|[^[:alpha:]]\) - match the beginning of the line or a non-alphabetic character. This ensures that the next part matches a whole word, not just the suffix of one.
      • \([[:alpha:]]\{1,\}\) - match a word - one or more alphabetic characters.
      • [^[:alpha:]]\{1,\} - match a non-word - one or more non-alphabetic characters.
      • \2 - match the same thing as in the second \(...\) - ie. match the word.
      • \($\|[^[:alpha:]]\) - match the end of the line or a non-alphabetic character. This ensures that we match the whole second word, not just its prefix.
      • /
      • \1\2\3 - substitute it for <beginning of the line or non-alphabetic prefix character><the word><end of the line or non-alphabetic suffix character found>
      • /
      • g - substitute globally. Because the scan never goes back over text it has already replaced, a single pass only collapses words two at a time.
    • tl - jump to label l if the last s command was successful. This is needed so that three identical words in a row, like true true true, are properly replaced by a single true.

    Without \(^\|[^[:alpha:]]\) and \($\|[^[:alpha:]]\), true rue, for example, would be replaced by true, because the suffix rue of the first word followed by rue would match.
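
    Both points can be checked on small inputs. This is a sketch assuming GNU sed, because \| inside a basic regular expression is a GNU extension. The function names once, looped and unanchored are made up for this illustration.

```shell
#!/bin/sh
# Assumes GNU sed: \| (alternation in a basic regex) is a GNU extension.
pattern='s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g'

# One pass only: the g flag never rescans text it has already replaced.
once()   { sed -e "$pattern" "$@"; }
# Same substitution, repeated through the label/branch loop until nothing changes.
looped() { sed -e ':l' -e "$pattern" -e 'tl' "$@"; }
# Hypothetical variant without the two boundary groups, to show why they matter.
unanchored() { sed -e 's/\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\1/\1/g' "$@"; }

printf 'true true true\n' | once        # one duplicate is still left
printf 'true true true\n' | looped      # fully collapsed
printf 'true rue\n' | unanchored        # the suffix "rue" is merged with the next word
printf 'true rue\n' | looped            # left unchanged
```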

    Below are my other solutions, which also remove repeated words across lines.

    My first solution used uniq. First I transform the input into pairs of the form <non-alphabetical sequence separating words, encoded in hex> <a word>. Then I run the pairs through uniq -f1, which ignores the first field, and convert the result back. This will be very slow:

    # recreate input
    cat <<EOF |
    true true, rohith Rohith;
    cold burn, and fact and fact good good?
    EOF
    # insert zero byte after each word and non-word
    # the -z option is from GNU sed
    sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
    # for each pair (non-word, word)
    xargs -0 -n2 sh -c '
        # output hexadecimal representation of non-word
        printf "%s" "$1" | xxd -p | tr -d "\n"
        # and output space with the word
        printf " %s\n" "$2"
    ' -- |
    # uniq ignores empty fields - so make sure field1 always has something
    sed 's/^/-/' |
    # uniq while ignoring first field
    uniq -f1 |
    # for each pair (non-word in hex, word)
    xargs -n2 bash -c '
        # in a POSIX shell, use: printf "%s" "$1" | sed "s/^-//" | xxd -r -p
        # change non-word from hex to characters
        printf "%s" "${1:1}" | xxd -r -p
        # output word
        printf "%s" "$2"
    ' --
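
    The step that does the actual deduplication in this pipeline is uniq -f1, which skips the first blank-separated field when comparing adjacent lines. Here is a small sketch with hand-written pairs in the shape the earlier stages produce; the pairs function is illustrative only.

```shell
#!/bin/sh
# Hand-written (separator-in-hex, word) pairs, in the shape the pipeline produces:
# "-" alone is an empty separator, 20 is a space, 2c20 is a comma plus a space.
pairs() {
    printf '%s\n' '- true' '-20 true' '-2c20 rohith' '-20 Rohith'
}

# -f1 skips the first field, so only the word decides whether two lines are equal.
pairs | uniq -f1
```

    The second true line is dropped even though its separator field differs, while rohith and Rohith both survive because the comparison is case-sensitive.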
    

    But then I noticed that sed does a good job of tokenizing the input: it places zero bytes between the word and non-word tokens, so the stream is easy to read. I can skip repeated words by reading the zero-separated stream in GNU awk and comparing each word with the last word read:

    cat <<EOF |
    true true, rohith Rohith;
    cold burn, and fact and fact good good?
    EOF
    sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
    gawk -vRS='\0' '
    NR%2==1{
        nonword=$0
    }
    NR%2==0{
        if (length(lastword) && lastword != $0) {
            printf "%s%s", lastword, nonword
        }
        lastword=$0
    }
    END{
        printf "%s%s", lastword, nonword
    }'
    

    Instead of the zero byte, any character that does not occur in the input could be used as the record separator, for example ^. That way the script also works with non-GNU awk; it was tested with the mawk available on repl. The script is shortened here by using shorter variable names:

    cat <<EOF |
    true true, rohith Rohith;
    cold burn, and fact and fact good good?
    EOF
    sed -r 's/[[:alpha:]]+/^&^/g' |
    awk -vRS='^' '
        NR%2{ n=$0 }
        NR%2-1 && length(l) && l != $0 { printf "%s%s", l, n }
        NR%2-1 { l=$0 }
        END { printf "%s%s", l, n }
    '
    

    Tested on repl. The snippets output:

    true, rohith Rohith;
    cold burn, and fact and fact good?
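
    To make the tokenizing step easier to picture, the marked-up stream can be printed before awk consumes it. This is a sketch on a made-up sample line; the function name tokenize is hypothetical, and it assumes ^ does not occur in the input.

```shell
#!/bin/sh
# Mark every run of letters with ^ on both sides, as in the pipeline above.
# Assumption: ^ does not occur in the input itself.
tokenize() { sed -E 's/[[:alpha:]]+/^&^/g' "$@"; }

# Show the marked-up line.
printf 'true true, rohith\n' | tokenize

# Splitting on ^ gives alternating records: separator, word, separator, word, ...
# Odd-numbered records are separators and even-numbered records are words.
printf 'true true, rohith\n' | tokenize | awk -v RS='^' '{ printf "%d:[%s]\n", NR, $0 }'
```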
    