Read parquet data from AWS s3 bucket

独厮守ぢ 2020-12-18 23:29

I need to read Parquet data from AWS S3. If I use the AWS SDK for this, I can get an InputStream like this:

S3Object object = s3Client.getObject(new GetObjectRequest(bu         


        
1 answer
  • 2020-12-19 00:00
    String SCHEMA_TEMPLATE = "{" +
                            "\"type\": \"record\",\n" +
                            "    \"name\": \"schema\",\n" +
                            "    \"fields\": [\n" +
                            "        {\"name\": \"timeStamp\", \"type\": \"string\"},\n" +
                            "        {\"name\": \"temperature\", \"type\": \"double\"},\n" +
                            "        {\"name\": \"pressure\", \"type\": \"double\"}\n" +
                            "    ]" +
                            "}";
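    // "s3a" is the URI scheme served by Hadoop's S3A connector (hadoop-aws).
    String PATH_SCHEMA = "s3a";
    // Resolves to s3a://<bucketName><folderName>, so folderName must start with "/".
    Path internalPath = new Path(PATH_SCHEMA, bucketName, folderName);
    Schema schema = new Schema.Parser().parse(SCHEMA_TEMPLATE);
    Configuration configuration = new Configuration();
    // Read only the columns named in the schema above.
    AvroReadSupport.setRequestedProjection(configuration, schema);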
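    // Build a reader that yields Avro GenericRecords from the s3a:// path.
    ParquetReader<GenericRecord> parquetReader = AvroParquetReader.<GenericRecord>builder(internalPath).withConf(configuration).build();
    GenericRecord genericRecord = parquetReader.read();
    
    // read() returns null once every record has been consumed.
    while(genericRecord != null) {
            Map<String, String> valuesMap = new HashMap<>();
            // A lambda cannot capture the reassigned loop variable, so copy it into a final local.
            final GenericRecord current = genericRecord;
            // String.valueOf avoids a NullPointerException when a field value is null.
            current.getSchema().getFields().forEach(field -> valuesMap.put(field.name(), String.valueOf(current.get(field.name()))));
    
            genericRecord = parquetReader.read();
    }
    parquetReader.close();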
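
    The three-argument Path constructor used above takes a scheme, an authority (here the bucket) and a path. As far as I know, Hadoop builds a java.net.URI from those pieces, so a rough stand-alone sketch of how they combine, using only the JDK, might look like the following. The class S3aUriSketch, the helpers s3aUri and isValid, the bucket "my-bucket" and the key "/data/readings.parquet" are all hypothetical names chosen for illustration; none of them come from Hadoop or from the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class S3aUriSketch {
    // Hypothetical helper (not part of Hadoop): joins scheme, bucket and key path into one URI.
    static URI s3aUri(String bucketName, String folderName) {
        try {
            return new URI("s3a", bucketName, folderName, null, null);
        } catch (URISyntaxException e) {
            // Surface the problem as an unchecked exception.
            throw new IllegalArgumentException(e);
        }
    }

    // Hypothetical helper: true when the bucket and path form a valid URI.
    static boolean isValid(String bucketName, String folderName) {
        try {
            s3aUri(bucketName, folderName);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Path starts with "/": prints s3a://my-bucket/data/readings.parquet
        System.out.println(s3aUri("my-bucket", "/data/readings.parquet"));
        // Path without the leading "/" is rejected by java.net.URI: prints false
        System.out.println(isValid("my-bucket", "data/readings.parquet"));
    }
}
```

    If the reader fails with a complaint about a relative path, a missing leading "/" in folderName is one thing worth checking first.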
    

    Gradle dependencies

        compile 'com.amazonaws:aws-java-sdk:1.11.213'
        compile 'org.apache.parquet:parquet-avro:1.9.0'
        compile 'org.apache.parquet:parquet-hadoop:1.9.0'
        compile 'org.apache.hadoop:hadoop-common:2.8.1'
        compile 'org.apache.hadoop:hadoop-aws:2.8.1'
        compile 'org.apache.hadoop:hadoop-client:2.8.1'
    