Can I write a plain text HDFS (or local) file from a Spark program, not from an RDD?

别跟我提以往 2020-12-29 13:08

I have a Spark program (in Scala) and a SparkContext. I am writing some files with RDD's saveAsTextFile. On my local machine I can us…

4 Answers
  • 2020-12-29 13:39

    Using the HDFS API (hadoop-hdfs.jar) you can create an InputStream/OutputStream for an HDFS path and read from/write to the file using regular java.io classes. For example:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;

    URI uri = URI.create("hdfs://host:port/file path");
    Configuration conf = new Configuration();
    FileSystem file = FileSystem.get(uri, conf);
    FSDataInputStream in = file.open(new Path(uri));
    

    This code will work with local files as well (change hdfs:// to file://).
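
    The snippet above only shows the read side. Here is a minimal sketch of the write side in Scala (the file:/// path and the text are placeholders, not from the original answer): wrapping the stream returned by fs.create in a regular java.io writer gives plain-text output.

    import java.io.{BufferedWriter, OutputStreamWriter}
    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val uri = URI.create("file:///tmp/out.txt") // or hdfs://host:port/path
    val fs = FileSystem.get(uri, new Configuration())
    // fs.create returns an OutputStream, so the usual java.io wrappers apply
    val writer = new BufferedWriter(new OutputStreamWriter(fs.create(new Path(uri)), "UTF-8"))
    writer.write("plain text line\n")
    writer.close()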

  • 2020-12-29 13:44

    Here's what worked best for me (using Spark 2.0):

    import java.io.BufferedOutputStream
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    val path = new Path("hdfs://namenode:8020/some/folder/myfile.txt")
    val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
    conf.setInt("dfs.blocksize", 16 * 1024 * 1024) // 16 MB HDFS block size
    val fs = path.getFileSystem(conf)
    if (fs.exists(path))
        fs.delete(path, true)
    val out = new BufferedOutputStream(fs.create(path))
    val txt = "Some text to output"
    out.write(txt.getBytes("UTF-8"))
    out.flush()
    out.close()
    fs.close()
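
    Be aware that path.getFileSystem(conf) typically returns a cached FileSystem instance that Spark itself may be sharing, so the final fs.close() can lead to "Filesystem closed" errors in later jobs; if that happens, skip the close() or obtain a private instance with FileSystem.newInstance.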
    
  • 2020-12-29 13:47

    Thanks to marios and kostya, but writing a text file into HDFS from Spark takes only a few steps:

    import java.io.BufferedOutputStream
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hadoop Config is accessible from SparkContext
    val fs = FileSystem.get(sparkContext.hadoopConfiguration)

    // Output file can be created from the file system.
    val output = fs.create(new Path(filename))

    // But a BufferedOutputStream must be used to output an actual text file.
    val os = new BufferedOutputStream(output)

    os.write("Hello World".getBytes("UTF-8"))
    os.close()
    

    Note that FSDataOutputStream, which has been suggested, extends Java's DataOutputStream (a binary data stream), not a text output stream. Its writeUTF method appears to write plain text, but it actually writes a length-prefixed, modified UTF-8 encoding that includes extra bytes.
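
    A quick standalone sketch (not part of the answer's code) makes the difference visible: DataOutputStream prepends a two-byte length to every writeUTF string.

    import java.io.{ByteArrayOutputStream, DataOutputStream}

    val buf = new ByteArrayOutputStream()
    new DataOutputStream(buf).writeUTF("Hi")
    println(buf.toByteArray.mkString(" "))        // 0 2 72 105 -- two length bytes, then "Hi"
    println("Hi".getBytes("UTF-8").mkString(" ")) // 72 105    -- just the text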

  • 2020-12-29 13:54

    One simple way to write files to HDFS is to use SequenceFiles. Here you use the native Hadoop APIs rather than the ones provided by Spark.

    Here is a simple snippet (in Scala):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs._
    import org.apache.hadoop.io._ 
    
    val conf = new Configuration() // Hadoop configuration 
    val sfwriter = SequenceFile.createWriter(conf,
                  SequenceFile.Writer.file(new Path("hdfs://nn1.example.com/file1")),
                  SequenceFile.Writer.keyClass(classOf[LongWritable]),
                  SequenceFile.Writer.valueClass(classOf[Text]))
    val lw = new LongWritable()
    val txt = new Text()
    lw.set(12)
    txt.set("hello")
    sfwriter.append(lw, txt)
    sfwriter.close()
    ...
    

    In case you don't have a key you can use classOf[NullWritable] in its place:

    SequenceFile.Writer.keyClass(classOf[NullWritable])
    sfwriter.append(NullWritable.get(), new Text("12345"))
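
    To complete the picture, here is a sketch of reading the file back with the matching Reader options, reusing the conf, imports, and placeholder path from the snippet above:

    val reader = new SequenceFile.Reader(conf,
                  SequenceFile.Reader.file(new Path("hdfs://nn1.example.com/file1")))
    val key = new LongWritable()
    val value = new Text()
    while (reader.next(key, value))
      println(s"$key -> $value")
    reader.close()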
    