Removing duplicate strings with SED

Posted by 假装没事ソ on 2021-02-10 07:18:21

Question


I use Buildroot to port software packages to an embedded Linux system. Some packages also produce plain-text scripts and/or library control files that reference staging directories. These references must be removed when the software is packaged for distribution. I have no problem using sed to remove them. However, this processing leaves behind undesired duplicate strings, excerpted below. Is it possible to use sed to remove such duplicates?

Note 1: The 'dependency_libs=' lines were originally left out and have now been added below. I tried to keep the excerpt succinct and did not include 'dependency_libs=' before because it doesn't contain any duplicates. Apparently, it plays an important part in some of the suggested solutions below, so I have added it here for posterity.

Note 2: I just found a small bug in the sed script from @potong. If the duplicate string is the last item on the line and is not followed by a space, the sed script fails. Here, the first 'dependency_libs=' line makes the sed script partially fail. The second 'dependency_libs=' line has a space at the end of the line (right before the closing single quote) and passes through the sed script without a problem. I have amended the excerpt to show the difference.

cppflags=-I/usr/include -I/include -I/usr/include -I/include -I${includedir}/mine
cxxflags=-I/usr/include -I/include -I/usr/include -I/include -I${includedir}/mine 
Cflags: -I/usr/include -I/include -I/usr/include -I/include -I${includedir}/mine 
Libs: -L/usr/lib -L/lib -L/usr/lib -L/lib -L${libdir} -lmine${suffix}
dependency_libs='-L/usr/lib -L/lib -L/usr/lib -L/lib -L/usr/lib/libiconv-full/lib -L/usr/lib/libintl-full/lib -L/usr/lib -L/lib -L/usr/lib -L/lib'
dependency_libs='-L/usr/lib -L/lib -L/usr/lib -L/lib -L/usr/lib/libiconv-full/lib -L/usr/lib/libintl-full/lib -L/usr/lib -L/lib -L/usr/lib -L/lib '

The desired result is:

cppflags=-I/usr/include -I/include -I${includedir}/mine
cxxflags=-I/usr/include -I/include -I${includedir}/mine                        
Cflags: -I/usr/include -I/include -I${includedir}/mine                         
Libs: -L/usr/lib -L/lib -L${libdir} -lmine${suffix}
dependency_libs='-L/usr/lib/libiconv-full/lib -L/usr/lib/libintl-full/lib'
dependency_libs='-L/usr/lib/libiconv-full/lib -L/usr/lib/libintl-full/lib'

Answer 1:


This might work for you (GNU sed):

sed -r ':a;s|((-[IL]/\S+\s).*)\2|\1|;ta' file

This looks for strings beginning with -I/ or -L/, followed by one or more non-space characters and a space, that are repeated later on the line, and removes the repeated occurrence. If a substitution takes place, the process is repeated until no more substitutions occur.
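The trailing-space limitation described in Note 2 of the question can be worked around by padding the line before the loop and trimming it afterwards. The following is only a sketch under assumptions: it needs GNU sed (for -r, \S and \s), the function name dedup_flags is made up for illustration, and it does not help when the last option is immediately followed by a closing quote, as in the first 'dependency_libs=' line.

```shell
# Hypothetical helper: reads lines on stdin, writes deduplicated lines to stdout.
dedup_flags() {
  # 1. s|$| |      append one space so the last option also ends in whitespace
  # 2. :a ... ta   same loop as above: drop a later repeat of a -I/... or -L/... option
  # 3. s|\s+$||    trim the padding (and any other trailing whitespace)
  sed -r 's|$| |;:a;s|((-[IL]/\S+\s).*)\2|\1|;ta;s|\s+$||'
}

# The final -I/include has no trailing space here.
printf '%s\n' 'Cflags: -I/usr/include -I/include -I/usr/include -I/include' | dedup_flags
```

On the sample line, the unpadded command leaves the final -I/include in place because no whitespace follows it, while the padded variant removes it.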




Answer 2:


This may work for you:

awk -F- '
  {
    for(i = 2; i <= NF; ++i) a[$i] = 1;
    printf("%s", $1)
    for(x in a) printf("-%s ", x)
    print""
    delete a
  }
'

Output:

cppflags=-I${includedir}/mine -I/include  -I/usr/include
cxxflags=-I${includedir}/mine  -I/include  -I/usr/include
Cflags: -I${includedir}/mine  -I/include  -I/usr/include
Libs: -L${libdir}  -lmine${suffix} -L/lib  -L/usr/lib

Note that it doesn't retain the order of the directories, and it adds an extra space here and there.

If you need to retain the order of the directories and you can use gawk, try:

gawk -F- '
  BEGIN {PROCINFO["sorted_in"] = "@val_num_asc"}
  {
    for(i = 2; i <= NF; ++i)
      if (!($i in a))
        a[$i] = i;
    printf("%s", $1)
    for(x in a) printf("-%s ", x)
    print""
    delete a
  }
'

Output:

cppflags=-I/usr/include  -I/include  -I${includedir}/mine
cxxflags=-I/usr/include  -I/include  -I${includedir}/mine
Cflags: -I/usr/include  -I/include  -I${includedir}/mine
Libs: -L/usr/lib  -L/lib  -L${libdir}  -lmine${suffix}

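Or you can get the same output using a non-GNU awk like this:

awk -F- '
  {
    # Record the first field index at which each option appears.
    for(i = 2; i <= NF; ++i)
      if (!($i in a))
        a[$i] = i;
    printf("%s", $1)
    # Invert the map, then walk the indices numerically,
    # because the order of "for (x in b)" is unspecified.
    for(x in a) b[a[x]] = x
    for(i = 2; i <= NF; ++i)
      if (i in b) printf("-%s ", b[i])
    print""
    delete a
    delete b
  }
'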

And, of course, if you need to get rid of the extra spaces, you can pipe the output through tr -s ' '.
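Because -F- splits on every hyphen, a path such as -L/usr/lib/libiconv-full/lib is broken into two fields by the scripts above. A possible alternative is to split on whitespace instead. The following is a sketch, not part of the original answer: the function name dedup_fields is made up for illustration, it assumes that options contain no embedded whitespace, and it treats a quoted token such as '-L/usr/lib as different from -L/usr/lib, so the quoted 'dependency_libs=' lines are only partly deduplicated.

```shell
# Hypothetical helper: reads lines on stdin, writes deduplicated lines to stdout.
dedup_fields() {
  awk '
    {
      # Peel off the "name=" or "Name: " prefix so it is not treated as an option.
      prefix = ""
      rest = $0
      if (match($0, /^[^=:]+[=:] */)) {
        prefix = substr($0, 1, RLENGTH)
        rest = substr($0, RLENGTH + 1)
      }
      # Split the remainder on runs of whitespace.
      n = split(rest, opt, " ")
      out = ""
      # Keep only the first occurrence of each option, in input order.
      for (i = 1; i <= n; ++i)
        if (!(opt[i] in seen)) {
          seen[opt[i]] = 1
          out = out (out == "" ? "" : " ") opt[i]
        }
      print prefix out
      split("", seen)   # empty the array for the next line
    }
  '
}

printf '%s\n' 'cppflags=-I/usr/include -I/include -I/usr/include -I/include -I${includedir}/mine' | dedup_fields
```

This keeps the input order and writes single spaces between options, so no tr -s pass is needed afterwards.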




Answer 3:


I don't think sed will work, because you need a field-oriented utility that can process interrelated parts of a single line.

Use of awk, as in @ooga's answer, is an option, but here's a pure bash solution.

Note:

  • Only suitable for small input files for performance reasons.
  • Assumes that no options in the input have embedded whitespace.
  • Input order of options is preserved (whitespace between options is normalized).

#!/usr/bin/env bash

# IFS= keeps leading and trailing whitespace of each input line intact.
while IFS= read -r line; do
  # Split line into prefix, separator, options array.
  # Lines that do not match are passed through unchanged.
  [[ $line =~ ^([^=:]+)([:=]\ *)(.*)$ ]] || { printf '%s\n' "$line"; continue; }
  prefix=${BASH_REMATCH[1]}
  sep=${BASH_REMATCH[2]}
  read -ra optArray <<<"${BASH_REMATCH[3]}"
  # Loop over options array and build up a list without duplicates.
  dedupOptList=''
  for opt in "${optArray[@]}"; do
    [[ " $dedupOptList " == *" $opt "* ]] || dedupOptList+=" $opt"
  done
  # Finally, rebuild the line with the deduplicated options list and print.
  printf '%s%s%s\n' "$prefix" "$sep" "${dedupOptList:1}"
done < file


Source: https://stackoverflow.com/questions/24612037/removing-duplicate-strings-with-sed
