Control/configure/set Apache Spark UTF encoding for writing with saveAsTextFile


Question


So how does one tell Spark which UTF encoding to use when calling saveAsTextFile(path)? Of course, if it's known that all the strings are UTF-8, then writing UTF-8 will save 2x the space on disk (assuming the default encoding is UTF-16, like Java's)!
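A quick check of that 2x arithmetic in plain Scala (a sketch added here, not part of the original question):

val s = "hello"
s.getBytes("UTF-8").length    // 5 bytes: one byte per ASCII character
s.getBytes("UTF-16LE").length // 10 bytes: two bytes per character
s.getBytes("UTF-16").length   // 12 bytes: UTF-16 plus a 2-byte byte-order mark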


Answer 1:


saveAsTextFile actually uses Hadoop's Text class, which is encoded as UTF-8.

def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]) {
    this.map(x => (NullWritable.get(), new Text(x.toString)))
      .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
  }

From Text.java:

public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {

  static final int SHORT_STRING_MAX = 1024 * 1024;

  private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
      new ThreadLocal<CharsetEncoder>() {
        protected CharsetEncoder initialValue() {
          return Charset.forName("UTF-8").newEncoder().
                 onMalformedInput(CodingErrorAction.REPORT).
                 onUnmappableCharacter(CodingErrorAction.REPORT);
        }
      };

  private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
      new ThreadLocal<CharsetDecoder>() {
        protected CharsetDecoder initialValue() {
          return Charset.forName("UTF-8").newDecoder().
                 onMalformedInput(CodingErrorAction.REPORT).
                 onUnmappableCharacter(CodingErrorAction.REPORT);
        }
      };
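A quick way to confirm the UTF-8 behavior end to end (a local sketch; the output path and the single-partition part-00000 file name are assumptions, not from the original answer):

import java.nio.file.{Files, Paths}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("utf8-check"))
// Write a single partition so the output lands in one part-00000 file.
sc.parallelize(Seq("café"), 1).saveAsTextFile("/tmp/utf8-out")

// Decoding the raw bytes as UTF-8 recovers the string; "é" takes two bytes (0xC3 0xA9).
val bytes = Files.readAllBytes(Paths.get("/tmp/utf8-out/part-00000"))
assert(new String(bytes, "UTF-8").trim == "café")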

If you wanted to save as UTF-16, I think you could use saveAsHadoopFile with org.apache.hadoop.io.BytesWritable and take the bytes of a Java String (which, as you said, will be UTF-16). Something like this:

saveAsHadoopFile[SequenceFileOutputFormat[NullWritable, BytesWritable]](path)

You can get the bytes with "...".getBytes("UTF-16").
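Putting that together, a minimal sketch of the idea, assuming an existing SparkContext sc (rdd, the output path, and the setup are placeholders, not from the original answer):

import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapred.SequenceFileOutputFormat

val rdd = sc.parallelize(Seq("héllo", "wörld"))

// Encode each string as UTF-16 explicitly; note that String.getBytes("UTF-16")
// prepends a 2-byte byte-order mark to every value ("UTF-16LE"/"UTF-16BE" avoid it).
val utf16 = rdd.map(s => (NullWritable.get(), new BytesWritable(s.getBytes("UTF-16"))))

utf16.saveAsHadoopFile[SequenceFileOutputFormat[NullWritable, BytesWritable]]("/tmp/utf16-out")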



Source: https://stackoverflow.com/questions/24651969/control-configure-set-apache-spark-utf-encoding-for-writting-as-saveastextfile
