Looking for a concise example of how to read and write a Map from/to a Parquet file in Java or Scala?
Here is the expected structure, using …
Apache Drill is your answer!
Convert to Parquet: you can use the CTAS (CREATE TABLE AS) feature in Drill. By default, Drill creates a folder of Parquet files when it executes the query below. You can substitute any query, and Drill will write the output of your query into Parquet files:
create table file_parquet as select * from dfs.`/data/file.json`;
Convert from Parquet: we use the CTAS feature here as well, but we tell Drill to use a different format for writing the output:
alter session set `store.format`='json';
create table file_json as select * from dfs.`/data/file.parquet`;
Refer to http://drill.apache.org/docs/create-table-as-ctas-command/ for more information.
I'm not an expert on Parquet, but here is an example:
import java.io.File;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import com.google.common.collect.ImmutableMap;
import com.google.common.io.Resources;
import static org.junit.Assert.*;

// Load the Avro schema from the classpath.
Schema schema = new Schema.Parser().parse(Resources.getResource("map.avsc").openStream());

// The output path must not exist before the writer opens it.
File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
tmp.deleteOnExit();
tmp.delete();
Path file = new Path(tmp.getPath());

AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(file, schema);

// Write a record with an empty map.
Map<String, Integer> emptyMap = new ImmutableMap.Builder<String, Integer>().build();
GenericData.Record record = new GenericRecordBuilder(schema).set("mymap", emptyMap).build();
writer.write(record);
writer.close();

// Read the record back and check that the map survived the round trip.
AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();
assertNotNull(nextRecord);
assertEquals(emptyMap, nextRecord.get("mymap"));
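The snippet reads map.avsc from the classpath but never shows it. As a sketch, a minimal schema matching the mymap field above could be parsed inline like this (the record name MapRecord is made up):

import org.apache.avro.Schema;

// Hypothetical equivalent of map.avsc: a record with one map-typed field
// named "mymap" whose values are ints, matching the code above.
String avsc = "{"
    + " \"type\": \"record\", \"name\": \"MapRecord\","
    + " \"fields\": [{\"name\": \"mymap\","
    + "   \"type\": {\"type\": \"map\", \"values\": \"int\"}}]"
    + "}";
Schema schema = new Schema.Parser().parse(avsc);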
In your situation, replace the ImmutableMap (Google Collections) with a plain java.util.Map, as below:
// Same setup and imports as above, plus java.util.HashMap.
Schema schema = new Schema.Parser().parse(Resources.getResource("map.avsc").openStream());

File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
tmp.deleteOnExit();
tmp.delete();
Path file = new Path(tmp.getPath());

AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(file, schema);

// Start from an empty map and populate it; SOMETHING stands in for your value type.
Map<String, Object> map = new HashMap<String, Object>();
map.put("SOMETHING", new SOMETHING());
GenericData.Record record = new GenericRecordBuilder(schema).set("mymap", map).build();
writer.write(record);
writer.close();

AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();
assertNotNull(nextRecord);
// Note: Avro may deserialize map keys as org.apache.avro.util.Utf8 rather than
// String, so a direct equality check on a non-empty map can fail.
assertEquals(map, nextRecord.get("mymap"));
I haven't tested the code, but give it a try.
I doubt there is a readily available solution for this. As for Maps, it is still possible to create an Avro schema from one, provided the map's values are a primitive type, or a complex type whose fields are in turn primitives; see the sketch below.
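For instance, a record schema holding a string-to-int map can be built programmatically with Avro's SchemaBuilder; a minimal sketch, with made-up record and field names:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Record with a single map field whose values are ints.
Schema schema = SchemaBuilder.record("MapRecord").fields()
    .name("mymap").type().map().values().intType().noDefault()
    .endRecord();
System.out.println(schema.toString(true)); // print the generated JSON schema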
In your case, to resolve this, you can try converting your object into a JsonObject and then using the Apache Spark libraries to convert it to Parquet, as in the sketch below.
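A minimal sketch of that route using Spark's DataFrame API (the paths are placeholders, and this assumes a Spark 2.x SparkSession):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("json-to-parquet")
                .master("local[*]") // local mode, for illustration only
                .getOrCreate();

        // Read the JSON records; Spark infers a schema, mapping nested
        // JSON objects to struct columns.
        Dataset<Row> df = spark.read().json("/data/file.json");

        // Write the same rows back out in Parquet format.
        df.write().parquet("/data/file_parquet");

        spark.stop();
    }
}

Spark writes a directory of Parquet part files, much like Drill's CTAS above.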