How to save Chinese characters to a file with Java?

Question

I use the following code to save Chinese characters to a .txt file, but when I open the file in WordPad, I can't read it.

StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;

FileOutputStream fos;
fos = new FileOutputStream(FileName, Append);
for (int i = 0;i < Shanghai_StrBuf.length(); i++) {
    fos.write(Shanghai_StrBuf.charAt(i));
}
fos.close();

What can I do? I know that if I cut and paste Chinese characters into WordPad, I can save them to a .txt file. How do I do that in Java?

Answer 1:

There are several factors at work here:

Text files have no intrinsic metadata for describing their encoding (for all the talk of angle-bracket taxes, there are reasons XML is popular)
The default encoding for Windows is still an 8-bit (or double-byte) "ANSI" character set with a limited range of values - text files written in this format are not portable
To tell a Unicode file from an ANSI file, Windows apps rely on the presence of a byte order mark at the start of the file (not strictly true - Raymond Chen explains). In theory, the BOM is there to tell you the endianness (byte order) of the data. For UTF-8, even though there is only one byte order, Windows apps rely on the marker bytes to work out automatically that the file is Unicode (though you'll note that Notepad has an encoding option on its open/save dialogs).
It is wrong to say that Java is broken because it does not write a UTF-8 BOM automatically. On Unix systems, it would be an error to write a BOM to a script file, for example, and many Unix systems use UTF-8 as their default encoding. There are times when you don't want it on Windows, either, like when you're appending data to an existing file: fos = new FileOutputStream(FileName,Append);
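To make the marker bytes mentioned above concrete, here is a small sketch that encodes U+FEFF in each of the three encodings and prints the resulting bytes. The class name BomBytes and the helper bomHex are made up for this illustration; only standard JDK calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class BomBytes {
  // Encodes U+FEFF in the given charset and formats the bytes as hex.
  static String bomHex(Charset charset) {
    StringBuilder hex = new StringBuilder();
    for (byte b : "\uFEFF".getBytes(charset)) {
      if (hex.length() > 0) {
        hex.append(' ');
      }
      hex.append(String.format("%02X", b & 0xFF));
    }
    return hex.toString();
  }

  public static void main(String[] args) {
    // UTF_16LE and UTF_16BE are used on purpose: they do not add a BOM
    // of their own, so only the encoded U+FEFF shows up.
    System.out.println("UTF-8:    " + bomHex(StandardCharsets.UTF_8));    // EF BB BF
    System.out.println("UTF-16LE: " + bomHex(StandardCharsets.UTF_16LE)); // FF FE
    System.out.println("UTF-16BE: " + bomHex(StandardCharsets.UTF_16BE)); // FE FF
  }
}
```

These are the byte sequences a BOM-sniffing application looks for at the start of a file.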

Here is a method of reliably appending UTF-8 data to a file:

  private static void writeUtf8ToFile(File file, boolean append, String data)
      throws IOException {
    // Only write a BOM at the very start of a file, never in the middle of one.
    boolean skipBOM = append && file.isFile() && (file.length() > 0);
    Closer res = new Closer();
    try {
      OutputStream out = res.using(new FileOutputStream(file, append));
      Writer writer = res.using(new OutputStreamWriter(out, Charset
          .forName("UTF-8")));
      if (!skipBOM) {
        // U+FEFF is encoded as the bytes EF BB BF in UTF-8.
        writer.write('\uFEFF');
      }
      writer.write(data);
    } finally {
      res.close();
    }
  }

Usage:

  public static void main(String[] args) throws IOException {
    String chinese = "\u4E0A\u6D77";
    boolean append = true;
    writeUtf8ToFile(new File("chinese.txt"), append, chinese);
  }

Note: if the file already existed and you chose to append and existing data wasn't UTF-8 encoded, the only thing that code will create is a mess.

Here is the Closer type used in this code:

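import java.io.Closeable;
import java.io.IOException;

// Remembers the most recently registered resource and closes it.
// Each stream or writer registered in the code above wraps the previous one,
// so closing the last one also closes everything beneath it.
public class Closer implements Closeable {
  private Closeable closeable;

  public <T extends Closeable> T using(T t) {
    closeable = t;
    return t;
  }

  @Override public void close() throws IOException {
    if (closeable != null) {
      closeable.close();
    }
  }
}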
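On Java 7 and later, try-with-resources can stand in for the Closer helper. The following is only a sketch of the same append-with-BOM logic under that assumption; the class name Utf8Appender and the method name writeUtf8 are invented for the example.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical class name; the logic mirrors writeUtf8ToFile above.
class Utf8Appender {
  static void writeUtf8(File file, boolean append, String data)
      throws IOException {
    // Skip the BOM when appending to a file that already has content.
    boolean skipBom = append && file.isFile() && file.length() > 0;
    // The writer is closed automatically, which also closes the stream.
    try (Writer writer = new OutputStreamWriter(
        new FileOutputStream(file, append), StandardCharsets.UTF_8)) {
      if (!skipBom) {
        writer.write('\uFEFF');
      }
      writer.write(data);
    }
  }

  public static void main(String[] args) throws IOException {
    File file = File.createTempFile("chinese", ".txt");
    writeUtf8(file, true, "\u4E0A\u6D77"); // file is empty, so a BOM is written
    writeUtf8(file, true, "\u4E0A\u6D77"); // file has content, so no second BOM
    System.out.println(file + ": " + file.length() + " bytes");
    file.delete();
  }
}
```

The same caveat applies as for the original method: appending to a file whose existing content is not UTF-8 still produces a mess.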

This code makes a Windows-style best guess about how to read the file based on byte order marks:

  private static final Charset[] UTF_ENCODINGS = { Charset.forName("UTF-8"),
      Charset.forName("UTF-16LE"), Charset.forName("UTF-16BE") };

  // The stream must support mark/reset (for example, a BufferedInputStream).
  // If a BOM is found it is consumed; otherwise the stream is left at its start.
  private static Charset getEncoding(InputStream in) throws IOException {
    charsetLoop: for (Charset encoding : UTF_ENCODINGS) {
      byte[] bom = "\uFEFF".getBytes(encoding);
      in.mark(bom.length);
      for (byte b : bom) {
        if ((0xFF & b) != in.read()) {
          in.reset();
          continue charsetLoop;
        }
      }
      return encoding;
    }
    // No BOM matched: fall back to the platform default encoding.
    return Charset.defaultCharset();
  }

  private static String readText(File file) throws IOException {
    Closer res = new Closer();
    try {
      InputStream in = res.using(new FileInputStream(file));
      InputStream bin = res.using(new BufferedInputStream(in));
      Reader reader = res.using(new InputStreamReader(bin, getEncoding(bin)));
      StringBuilder out = new StringBuilder();
      for (int ch = reader.read(); ch != -1; ch = reader.read())
        out.append((char) ch);
      return out.toString();
    } finally {
      res.close();
    }
  }

Usage:

  public static void main(String[] args) throws IOException {
    System.out.println(readText(new File("chinese.txt")));
  }

(System.out uses the default encoding, so whether it prints anything sensible depends on your platform and configuration.)

Answer 2:

That reminds me:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Answer 3:

If you can rely on the default character encoding being UTF-8 (or some other Unicode encoding), you may use the following:

    Writer w = new FileWriter("test.txt");
    w.append("上海");
    w.close();

The safest way is to always explicitly specify the encoding:

    Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
    w.append("上海");
    w.close();

P.S. You may use any Unicode characters in Java source code, even as method and variable names, if the -encoding parameter for javac is configured right. That makes the source code more readable than the escaped \uXXXX form.
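For the explicit-encoding approach above, newer JDKs (Java 7 and later) also offer java.nio.file.Files, which takes the charset as a parameter. This is a sketch, not part of the original answer; the class name ExplicitEncoding and its two helpers are hypothetical. It writes the text as UTF-8 and reads it back to confirm the round trip.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class name, used only for this illustration.
class ExplicitEncoding {
  // Writes text as UTF-8 (without a BOM), replacing any existing content.
  static void write(Path path, String text) throws IOException {
    try (Writer w = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
      w.write(text);
    }
  }

  // Reads the whole file back, decoding it as UTF-8.
  static String read(Path path) throws IOException {
    return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    Path path = Files.createTempFile("test", ".txt");
    String text = "\u4E0A\u6D77";
    write(path, text);
    System.out.println("round trip ok: " + text.equals(read(path)));
    Files.delete(path);
  }
}
```

Because the charset is named on both the write and the read side, the result does not depend on the platform default encoding.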

Answer 4:

Be very careful with the approaches proposed. Even specifying the encoding for the file as follows:

Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");

will not be enough if you're running under an operating system like Windows. Even setting the file.encoding system property to UTF-8 does not fix the issue. This is because Java does not write a byte order mark (BOM) to the file. Even if you specify the encoding when writing the file, opening it in an application like WordPad will display the text as garbage, because the application finds no BOM to detect. I tried running the examples here on Windows (with a platform/container encoding of CP1252).

The following bug exists to describe the issue in Java:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058

The solution for the time being is to write the byte order mark yourself to ensure the file opens correctly in other applications. See this for more details on the BOM:

http://mindprod.com/jgloss/bom.html

and for a more correct solution see the following link:

http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html

Answer 5:

Here's one way among many. Basically, we're just specifying that the conversion be done to UTF-8 before outputting bytes to the FileOutputStream:

String FileName = "output.txt";

StringBuffer Shanghai_StrBuf=new StringBuffer("\u4E0A\u6D77");
boolean Append=true;

Writer writer = new OutputStreamWriter(new FileOutputStream(FileName,Append), "UTF-8");
writer.write(Shanghai_StrBuf.toString(), 0, Shanghai_StrBuf.length());
writer.close();

I manually verified this against the images at http://www.fileformat.info/info/unicode/char/ . In the future, please follow Java coding standards, including lower-case variable names. It improves readability.
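To see why the conversion step matters: the loop in the question passes each char to OutputStream.write(int), which writes only the low-order eight bits of its argument, so the high byte of every Chinese character is lost. The sketch below compares the two approaches, using an in-memory ByteArrayOutputStream as a stand-in for the FileOutputStream; the class and method names are invented for the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; ByteArrayOutputStream stands in for a file stream.
class LowByteDemo {
  // Mimics the question's loop: one write(int) call per char.
  static byte[] rawWrite(String s) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < s.length(); i++) {
      out.write(s.charAt(i)); // only the low 8 bits of the char are kept
    }
    return out.toByteArray();
  }

  // Encodes the string as UTF-8 through a Writer before it reaches the stream.
  static byte[] utf8Write(String s) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
      w.write(s);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    String shanghai = "\u4E0A\u6D77";
    System.out.println("raw bytes:   " + rawWrite(shanghai).length);  // 2
    System.out.println("UTF-8 bytes: " + utf8Write(shanghai).length); // 6
  }
}
```

The raw loop leaves just the bytes 0x0A and 0x77 (a line feed and the letter "w"), while the Writer produces the three-byte UTF-8 sequence for each character.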

Answer 6:

Try this:

StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;

Writer out = new BufferedWriter(new OutputStreamWriter(
    new FileOutputStream(FileName, Append), "UTF8"));
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    out.write(Shanghai_StrBuf.charAt(i));
}
out.close();

Source: https://stackoverflow.com/questions/766361/how-to-save-chinese-characters-to-file-with-java

Tags

java

file

character-encoding

cjk