Question
I am trying to remove non-printable characters (e.g. ^@) from the records in my file. Since the volume of records in the file is too big, using cat in a loop is not an option, as the loop takes too much time.
I tried using
sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILENAME
but the ^@ characters are still not removed.
Also I tried using
awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' FILENAME > NEWFILE
but it also did not help.
Can anybody suggest an alternative way to remove non-printable characters?
I also used tr -cd, but it removes accented characters, which are required in the file.
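For illustration, a minimal reproduction of the accented-character problem (the file name sample.txt and its content are hypothetical): GNU tr works byte by byte and does not handle multi-byte characters, so the two UTF-8 bytes of é fall outside [:print:] and are deleted along with the NUL byte that shows up as ^@.
printf 'caf\303\251\000 latte\n' > sample.txt   # é is the byte pair \303\251, \000 is the ^@
LC_ALL=C tr -cd '[:print:]\n' < sample.txt      # byte-oriented tr in the C locale
# prints: caf latte   (the NUL is gone, but so is the é)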
Answer 1:
Perhaps you could go with the complement of [:print:], which contains all printable characters:
tr -cd '[:print:]' < file > newfile
If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):
sed 's/[^[:print:]]//g' file
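A quick check with a hypothetical sample file (assuming GNU sed and a UTF-8 locale; the locale name en_US.UTF-8 is only an example):
printf 'caf\303\251\000 latte\n' > sample.txt
LC_ALL=en_US.UTF-8 sed 's/[^[:print:]]//g' sample.txt
# prints: café latte   (the NUL shown as ^@ is removed, the accented é is kept)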
Answer 2:
Remove all control characters first:
tr -dc '\007-\011\012-\015\040-\376' < file > newfile
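Decoded, the kept set is BEL through TAB (\007-\011), LF through CR (\012-\015), and space through \376 (\040-\376). Because bytes up to \376 are kept, the bytes of UTF-8 accented characters survive while NUL bytes are dropped; a small check on a hypothetical sample:
printf 'caf\303\251\000 latte\n' | tr -dc '\007-\011\012-\015\040-\376'
# prints: café latte   (the NUL is deleted; the two bytes of é fall inside \040-\376)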
Then try your original sed command:
sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' newfile
I believe that the ^@ you see is in fact a zero value, \0.
The tr filter from above will remove those as well.
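A hypothetical way to verify this is to pipe a sample containing a NUL through the filter and inspect the bytes with od (output shown approximately):
printf 'foo\000bar\n' | tr -dc '\007-\011\012-\015\040-\376' | od -c
# 0000000   f   o   o   b   a   r  \n
# 0000007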
Answer 3:
strings -1 file... > outputfile
seems to work
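One usage note, illustrated with a hypothetical one-liner: strings prints each run of printable characters on its own line, so unlike the tr and sed approaches it splits a record at every embedded NUL rather than just dropping the byte.
printf 'foo\000bar\n' | strings -1
# foo
# bar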
Source: https://stackoverflow.com/questions/34412754/trying-to-remove-non-printable-charatersjunk-values-from-a-unix-file