How to efficiently read multiple small Parquet files with Spark? Is there a CombineParquetInputFormat?

礼貌的吻别 2021-01-26 03:18

Spark has generated many small Parquet files. How can a large number of small Parquet files be handled efficiently, on both the producer and the consumer Spark jobs?

2 Answers
  •  有刺的猬
    2021-01-26 03:29

        import org.apache.hadoop.mapreduce.InputSplit;
        import org.apache.hadoop.mapreduce.RecordReader;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;
        import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
        import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
        import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
        import parquet.avro.AvroReadSupport;
        import parquet.hadoop.ParquetInputFormat;

        import java.io.IOException;

        // Packs many small Parquet files into one combined split, so a single task
        // reads several files instead of one task being spawned per file.
        public class CombineParquetInputFormat<T> extends CombineFileInputFormat<Void, T> {

            @Override
            public RecordReader<Void, T> createRecordReader(InputSplit split, TaskAttemptContext context)
                    throws IOException {
                CombineFileSplit combineSplit = (CombineFileSplit) split;
                // Hand each file inside the combined split to the per-file reader below.
                return new CombineFileRecordReader(combineSplit, context, CombineParquetRecordReader.class);
            }

            // The (CombineFileSplit, TaskAttemptContext, Integer) constructor signature is
            // required by CombineFileRecordReader, which instantiates this class reflectively.
            private static class CombineParquetRecordReader<T> extends CombineFileRecordReaderWrapper<Void, T> {

                public CombineParquetRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer idx)
                        throws IOException, InterruptedException {
                    // Each individual file is read by a plain ParquetInputFormat with Avro read support.
                    super(new ParquetInputFormat(AvroReadSupport.class), split, context, idx);
                }
            }
        }
    

    On the consumer side, use the CombineParquetInputFormat above: it forces multiple small files to be read by a single task.

    On the producer side: use coalesce(numFiles) so the job writes an adequate number of (larger) output files.
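
    For example, a minimal producer-side sketch (the DataFrame name df, the output path and the target of 16 files are placeholders):

        // Merge the partitions before writing so that ~16 larger Parquet files are
        // produced instead of one small file per task.
        df.coalesce(16)
          .write()
          .parquet("hdfs:///path/to/output");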

    How to use the custom input format in Spark to form an RDD and a DataFrame:

        // Set these on the Hadoop configuration before creating the RDD.
        // Pick up nested small files by reading input directories recursively.
        sc.hadoopConfiguration().setBoolean(FileInputFormat.INPUT_DIR_RECURSIVE, true);
        // Set the max split size, else only one task will be spawned for all the combined files.
        sc.hadoopConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", (long) (128 * 1024 * 1024));

        // Read the small files through the combined input format and convert each Avro record to a Row.
        JavaRDD<Row> javaRDD = sc.newAPIHadoopFile(hdfsInputPath, CombineParquetInputFormat.class, Void.class, AvroPojo.class, sc.hadoopConfiguration())
                                 .values()
                                 .map(p -> RowFactory.create(avroPojoToObjectArray(p)));

        // Derive the SQL schema from the Avro schema and build the DataFrame.
        StructType outputSchema = (StructType) SchemaConverters.toSqlType(Profile.getClassSchema()).dataType();
        final DataFrame requiredDataFrame = sqlContext.createDataFrame(javaRDD, outputSchema);
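
    Once the DataFrame is built from the combined splits it behaves like any other DataFrame. A minimal usage sketch (the table name profiles is arbitrary, Spark 1.x API):

        // Register the DataFrame and run a quick sanity-check query.
        requiredDataFrame.registerTempTable("profiles");
        sqlContext.sql("SELECT COUNT(*) FROM profiles").show();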
    

    Please refer to http://bytepadding.com/big-data/spark/combineparquetfileinputformat/ for an in-depth walkthrough.
