How can I be sure of the file encoding?

南笙 2020-11-30 21:46

I have a PHP file that I created with Vim, but I'm not sure what its encoding is.

When I use the terminal and check the encoding with the command file -bi fo

  • 2020-11-30 22:26

    Well, first of all, note that ASCII is a subset of UTF-8, so if your file contains only ASCII characters, it's correct to say that it's encoded in ASCII and it's correct to say that it's encoded in UTF-8.

    That being said, file typically only examines a short segment at the beginning of the file to determine its type, so it might be declaring it us-ascii if there are non-ASCII characters but they are beyond the initial segment of the file. On the other hand, gedit might say that the file is UTF-8 even if it's ASCII because UTF-8 is gedit's preferred character encoding and it intends to save the file with UTF-8 if you were to add any non-ASCII characters during your edit session. Again, if that's what gedit is saying, it wouldn't be wrong.
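
    To see whether non-ASCII bytes really do sit beyond the start of the file, it can help to find the first line that contains one. The following is only a rough sketch: first_non_ascii_line is a hypothetical helper name, not a standard tool, and it reads the file line by line, so it will be slow on large files.

```shell
# Hypothetical helper: print the number of the first line of file "$1"
# that contains a byte outside the ASCII range; fail if there is none.
first_non_ascii_line() {
  n=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    # Delete every ASCII byte; anything left over is non-ASCII.
    rest=$(printf '%s' "$line" | LC_ALL=C tr -d '\000-\177')
    if [ -n "$rest" ]; then
      echo "$n"
      return 0
    fi
  done < "$1"
  return 1
}
```

    Calling first_non_ascii_line your-file prints a line number if such a byte exists. A large number would be consistent with file having looked only at an all-ASCII beginning.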

    Now to your question:

    1. Run this command:

      tr -d \\000-\\177 < your-file | wc -c

      If the output is "0", then the file contains only ASCII characters. It's in ASCII (and it's also valid UTF-8). End of story.

    2. Run this command:

      iconv -f utf-8 -t ucs-4 < your-file >/dev/null

      If you get an error, the file does not contain valid UTF-8 (or at least, some part of it is corrupted).

      If you get no error, the file is extremely likely to be UTF-8. That's because UTF-8 has properties that make it very hard to mistake typical text in any other commonly used character encoding for valid UTF-8.
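
    As a small illustration of that last point, the same word can be fed to the iconv check in two encodings. This is a sketch, not part of the original answer: is_valid_utf8 is a hypothetical helper that wraps the iconv command from step 2, and the test strings are made up for the example.

```shell
# Hypothetical helper: succeeds only if standard input is valid UTF-8.
is_valid_utf8() {
  iconv -f utf-8 -t ucs-4 >/dev/null 2>&1
}

# "café" with é as the UTF-8 byte pair C3 A9 (octal 303 251).
if printf 'caf\303\251\n' | is_valid_utf8; then
  echo "UTF-8 bytes: accepted"
fi

# "café" with é as the single ISO-8859-1 byte E9 (octal 351).
if ! printf 'caf\351\n' | is_valid_utf8; then
  echo "ISO-8859-1 bytes: rejected"
fi
```

    The ISO-8859-1 byte E9 looks like the start of a three-byte UTF-8 sequence, but the newline that follows is not a valid continuation byte, so the conversion fails.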

  • 2020-11-30 22:29
    $ file --mime my.txt 
    my.txt: text/plain; charset=iso-8859-1
  • 2020-11-30 22:29

    Based on the answers from @Celada and @Arthur Zennig, I have created this simple script:

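    #!/bin/sh
    # Usage: utf8-check filename
    if [ "$#" -lt 1 ]; then
      echo "Usage: utf8-check filename"
      exit 1
    fi
    # Print chardet's guess first (requires the chardet tool).
    chardet "$1"
    # Count the bytes that fall outside the 7-bit ASCII range.
    countchars="$(tr -d \\000-\\177 < "$1" | wc -c)"
    if [ "$countchars" -eq 0 ]; then
      echo "Ascii"
      exit 0
    fi
    # iconv fails on any byte sequence that is not valid UTF-8.
    {
      iconv -f utf-8 -t ucs-4 < "$1" >/dev/null &&
      echo "UTF-8"
    } || {
      echo "not UTF-8 or corrupted"
    }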
  • 2020-11-30 22:47

    (on Linux)

    $ chardet <filename>

    It also reports a confidence level (between 0 and 1) for its guess.
