Write to multiple outputs by key Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs, dependent on the key, using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
  •  情话喂你
    2020-11-22 06:13

    I had a similar use case where I split an input file on Hadoop HDFS into multiple files based on a key (one file per key). Here is my Scala code for Spark:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Each group is written on the executor that processes it, so the
    // FileSystem handle is created there rather than captured from the
    // driver (a driver-side handle would not be serializable).
    object processGroup extends Serializable {
        def apply(groupName: String, records: Iterable[String]): Unit = {
            val fs = FileSystem.get(new Configuration())
            val outFileStream = fs.create(new Path("/output_dir/" + groupName))
            for (line <- records) {
                // writeBytes avoids the 2-byte length prefix that writeUTF adds
                outFileStream.writeBytes(line + "\n")
            }
            outFileStream.close()
        }
    }

    val infile = sc.textFile("input_file")
    val dateGrouped = infile.groupBy(_.split(",")(0))
    dateGrouped.foreach(x => processGroup(x._1, x._2))
    

    The records are grouped by key, and the values for each key are written to a separate file (one file per key).
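
    For a quick sanity check of the grouping step, here is a minimal sketch (using hypothetical sample data, not the original input) of what groupBy produces when the first comma-separated field is the key:

    // Hypothetical sample data; the first comma-separated field is the key.
    val sample = sc.parallelize(Seq(
        "2020-01-01,alice,3",
        "2020-01-01,bob,5",
        "2020-01-02,carol,7"
    ))
    sample.groupBy(_.split(",")(0)).collect().foreach { case (key, lines) =>
        println(s"$key -> ${lines.mkString(" | ")}")
    }
    // Expected output (order within a group may vary):
    // 2020-01-01 -> 2020-01-01,alice,3 | 2020-01-01,bob,5
    // 2020-01-02 -> 2020-01-02,carol,7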
