How to remove non-UTF-8 characters from a text file

青春惊慌失措 2020-11-28 20:16

I have a bunch of Arabic, English, and Russian files that are encoded in UTF-8. When I try to process these files with a Perl script, I get this error:

Malformed U         


        
3 Answers
  • 2020-11-28 20:22

    Your method must read the file byte by byte and understand how UTF-8 characters are built from individual bytes. The simplest approach is to use an editor that will read anything but write out only valid UTF-8. TextPad is one choice.
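
To make the byte-wise idea concrete, here is a minimal Python sketch (Python is my choice here, not something the question asked for). The helper names `expected_length` and `strip_invalid_utf8` are hypothetical. The lead byte tells us how long a sequence should be, and Python's strict decoder is used as the final check on each candidate sequence, so overlong forms and surrogates are also rejected.

```python
def expected_length(lead: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx (0xC0 and 0xC1 would only encode overlong forms)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx, capped so the result stays at or below U+10FFFF
    return 0      # continuation byte (10xxxxxx) or a byte never used in UTF-8


def strip_invalid_utf8(data: bytes) -> bytes:
    """Keep only the byte sequences that form valid UTF-8 characters."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        chunk = data[i:i + n]
        valid = False
        if n and len(chunk) == n:
            try:
                chunk.decode("utf-8")  # strict check of the continuation bytes
                valid = True
            except UnicodeDecodeError:
                valid = False
        if valid:
            out += chunk
            i += n
        else:
            i += 1  # drop one byte and try to resynchronise on the next
    return bytes(out)


if __name__ == "__main__":
    sample = "abc".encode("utf-8") + b"\xff" + "привет".encode("utf-8")
    print(strip_invalid_utf8(sample).decode("utf-8"))
```

Dropping a single byte on failure, rather than the whole candidate sequence, means a valid character that directly follows a truncated one is still kept.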

  • 2020-11-28 20:23
    cat foo.txt | strings -n 8 > bar.txt
    

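    will strip out non-printable bytes. Be aware that by default strings keeps only runs of printable ASCII characters (here, runs at least 8 characters long), so it will also discard the Arabic and Russian text.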

  • 2020-11-28 20:39

    This command:

    iconv -f utf-8 -t utf-8 -c file.txt
    

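    will clean up your UTF-8 file by skipping every invalid byte sequence. The cleaned text is written to standard output, so redirect it to a new file if you want to keep it.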

    -f is the source format
    -t the target format
    -c skips any invalid sequence
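
If iconv is not available, roughly the same effect can be sketched in Python. This is an approximation of what `-c` does, not a reimplementation of iconv. The function name `clean_utf8_file` and the file names in the usage line are hypothetical.

```python
def clean_utf8_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, silently dropping invalid UTF-8 sequences."""
    # Read raw bytes so that nothing is decoded (and nothing fails) on input.
    with open(src_path, "rb") as src:
        raw = src.read()

    # errors="ignore" discards any bytes that do not form valid UTF-8.
    text = raw.decode("utf-8", errors="ignore")

    # Write bytes back out so that line endings are left untouched.
    with open(dst_path, "wb") as dst:
        dst.write(text.encode("utf-8"))
```

A call such as `clean_utf8_file("file.txt", "file.clean.txt")` leaves the original untouched and writes the cleaned copy alongside it. The whole file is read into memory, which is fine for ordinary text files but not for very large ones.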
    