Question
I'm writing a custom SerDe that will only be used for deserialization.
The underlying data is Thrift binary, and each row is an event log. Each event has a schema that I have access to, but we wrap the event in another schema (let's call it Message) before storing it.
I'm writing a SerDe instead of using the ThriftDeserializer because, as mentioned, the underlying event is wrapped in a Message. So we first need to deserialize the row using the Message schema and then deserialize the data for that event.
The SerDe works only when I do a SELECT *: the data is deserialized as expected. But whenever I select specific columns instead of doing a SELECT *, the rows are all NULL. The object inspector returned is a ThriftStructObjectInspector, and the object returned by deserialize is a TBase.
What could cause Hive to return NULL when I select a column, but return the column data when I do a SELECT *?
Here's the SerDe class (some class names changed):
// Assumed to be Hive's serdeConstants.SERIALIZATION_CLASS ("serialization.class")
import static org.apache.hadoop.hive.serde.serdeConstants.SERIALIZATION_CLASS;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.ColumnarStruct;
import org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase;
import org.apache.hadoop.hive.serde2.lazy.LazyBinary;
import org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyPrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
import org.apache.hadoop.io.Writable;
import org.apache.thrift.TBase;

// Message and MessageDeserializer are our own classes (not shown).
// serialize(), getSerializedClass() and getSerDeStats() are not shown either.
public class MyThriftSerde extends AbstractSerDe {

  private static final Log LOG = LogFactory.getLog(MyThriftSerde.class);

  /* Abstracting away the deserialization of the underlying event which is wrapped in a message */
  private static final MessageDeserializer myMessageDeserializer =
      MessageDeserializer.getInstance();

  /* Underlying event class which is wrapped in a Message */
  private String schemaClassName;
  private Class<?> schemaClass;

  /* Used to read the input row */
  public static List<String> inputFieldNames;
  public static List<ObjectInspector> inputFieldOIs;
  public static List<Integer> notSkipIDs;
  public static ObjectInspector inputRowObjectInspector;

  /* Output Object Inspector */
  public static ObjectInspector thriftStructObjectInspector;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    try {
      logHeading("INITIALIZE MyThriftSerde");
      schemaClassName = tbl.getProperty(SERIALIZATION_CLASS);
      schemaClass = conf.getClassByName(schemaClassName);
      LOG.info(String.format("Building DDL for event: %s", schemaClass.getName()));

      inputFieldNames = new ArrayList<>();
      inputFieldOIs = new ArrayList<>();
      notSkipIDs = new ArrayList<>();

      /* Initialize the input fields */
      // The underlying data is stored in RCFile format and has only 1 column, event_binary.
      // So we create a ColumnarStructBase for each row we deserialize.
      // This ColumnarStruct has only 1 column: event_binary
      inputFieldNames.add("event_binary");
      notSkipIDs.add(0);
      inputFieldOIs.add(LazyPrimitiveObjectInspectorFactory.LAZY_BINARY_OBJECT_INSPECTOR);
      inputRowObjectInspector =
          ObjectInspectorFactory.getColumnarStructObjectInspector(inputFieldNames, inputFieldOIs);

      /* Output Object Inspector */
      // This is what the SerDe will return; it is a ThriftStructObjectInspector
      thriftStructObjectInspector =
          ObjectInspectorFactory.getReflectionObjectInspector(
              schemaClass, ObjectInspectorFactory.ObjectInspectorOptions.THRIFT);

      // Only for debugging
      logHeading("THRIFT OBJECT INSPECTOR");
      LOG.info("Output OI Class Name: " + thriftStructObjectInspector.getClass().getName());
      LOG.info(
          "OI Details: "
              + ObjectInspectorUtils.getObjectInspectorName(thriftStructObjectInspector));
    } catch (Exception e) {
      // Note: the exception is only logged here, not rethrown as a SerDeException
      LOG.info("Exception while initializing SerDe", e);
    }
  }

  @Override
  public Object deserialize(Writable rowWritable) throws SerDeException {
    logHeading("START DESERIALIZATION");
    ColumnarStructBase inputLazyStruct =
        new ColumnarStruct(inputRowObjectInspector, notSkipIDs, null);
    LazyBinary eventBinary;
    Message rowAsMessage;
    TBase deserializedRow = null;
    try {
      inputLazyStruct.init((BytesRefArrayWritable) rowWritable);
      eventBinary = (LazyBinary) inputLazyStruct.getField(0);
      rowAsMessage =
          myMessageDeserializer.fromBytes(eventBinary.getWritableObject().copyBytes(), null);
      deserializedRow = rowAsMessage.getEvent();
      LOG.info("deserializedRow.getClass(): " + deserializedRow.getClass());
      LOG.info("deserializedRow.toString(): " + deserializedRow.toString());
    } catch (Exception e) {
      // Note: any failure above is swallowed here, so the method returns null for that row
      e.printStackTrace();
    }
    logHeading("END DESERIALIZATION");
    return deserializedRow;
  }

  private void logHeading(String s) {
    LOG.info(String.format("------------------- %s -------------------", s));
  }

  @Override
  public ObjectInspector getObjectInspector() {
    return thriftStructObjectInspector;
  }
}
Context on the code:
- In the underlying data, each row contains only 1 column (called event_binary), stored as a binary. The binary is a Message which contains 2 fields, "schema" + "event_data". i.e. each row is a Message which contains the underlying event's schema + data. We use the schema from Message to deserialize the data.
- The SerDe first deserializes the row as a Message, extracts the event data and then deserializes the event.
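The two-step flow described in the bullets above can be sketched in plain Java with no Thrift or Hive dependencies. This is only an illustration under stated assumptions: Envelope, EventDecoder, TwoStageDecoder, and TwoStageDemo are hypothetical stand-ins for the Message wrapper and MessageDeserializer, and the UTF-8 lambda stands in for real Thrift deserialization.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for Message: a schema name plus the raw event bytes. */
final class Envelope {
    final String schema;
    final byte[] eventData;

    Envelope(String schema, byte[] eventData) {
        this.schema = schema;
        this.eventData = eventData;
    }
}

/** Hypothetical decoder for the payload of one event schema. */
interface EventDecoder {
    Object decode(byte[] eventData);
}

/** Step 1: read the envelope. Step 2: decode the payload with the schema it names. */
final class TwoStageDecoder {
    private final Map<String, EventDecoder> decodersBySchema = new HashMap<>();

    void register(String schema, EventDecoder decoder) {
        decodersBySchema.put(schema, decoder);
    }

    Object decode(Envelope envelope) {
        EventDecoder decoder = decodersBySchema.get(envelope.schema);
        if (decoder == null) {
            // Fail loudly instead of returning null, so a missing schema is visible.
            throw new IllegalArgumentException(
                "No decoder registered for schema: " + envelope.schema);
        }
        return decoder.decode(envelope.eventData);
    }
}

final class TwoStageDemo {
    public static void main(String[] args) {
        TwoStageDecoder decoder = new TwoStageDecoder();
        // Placeholder "deserialization": interpret the payload as UTF-8 text.
        decoder.register("TestEvent", bytes -> new String(bytes, StandardCharsets.UTF_8));

        Envelope row = new Envelope("TestEvent", "click".getBytes(StandardCharsets.UTF_8));
        System.out.println(decoder.decode(row));
    }
}
```

In the real SerDe, the envelope itself arrives as bytes in event_binary, so an extra step parses those bytes into the Message before the payload is decoded.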
I create an EXTERNAL table that points to the Thrift data using:
ADD JAR hdfs://my-jar.jar;
CREATE EXTERNAL TABLE dev_db.thrift_event_data_deserialized
ROW FORMAT SERDE 'com.test.only.MyThriftSerde'
WITH SERDEPROPERTIES (
"serialization.class"="com.test.only.TestEvent"
) STORED AS RCFILE
LOCATION 'location/of/thrift/data';
MSCK REPAIR TABLE thrift_event_data_deserialized;
Then
SELECT * FROM dev_db.thrift_event_data_deserialized LIMIT 10;
works as expected, but
SELECT column1_name, column2_name FROM dev_db.thrift_event_data_deserialized LIMIT 10;
does not work: the rows are all NULL.
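One possible explanation, stated here as an unverified assumption: columnar formats such as RCFile can skip columns that a query does not reference, and the positions Hive asks for would come from the logical table schema (the fields of TestEvent), not from the single physical column event_binary. The toy model below is plain Java, not Hive code, and ProjectedRowReader and ProjectionDemo are hypothetical names. It only shows how a reader that materializes just the requested positions would leave physical column 0 empty whenever the query asks for other positions.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model (not Hive code): a columnar reader that only materializes projected positions. */
final class ProjectedRowReader {
    private final byte[][] physicalColumns;

    ProjectedRowReader(byte[][] physicalColumns) {
        this.physicalColumns = physicalColumns;
    }

    /**
     * Returns one slot per physical column. A slot stays null unless its position was
     * requested. A null projection means "read everything", as a SELECT * would.
     */
    byte[][] read(List<Integer> projectedPositions) {
        byte[][] row = new byte[physicalColumns.length][];
        for (int i = 0; i < physicalColumns.length; i++) {
            if (projectedPositions == null || projectedPositions.contains(i)) {
                row[i] = physicalColumns[i];
            }
        }
        return row;
    }
}

final class ProjectionDemo {
    public static void main(String[] args) {
        // The file has a single physical column (event_binary) at position 0.
        ProjectedRowReader reader = new ProjectedRowReader(new byte[][] {{1, 2, 3}});

        // "SELECT *": nothing is skipped, so position 0 is populated.
        System.out.println(reader.read(null)[0] != null);

        // "SELECT column1_name, column2_name": suppose these map to positions 1 and 2
        // of the logical schema. Position 0 is never requested and stays null.
        System.out.println(reader.read(Arrays.asList(1, 2))[0] != null);
    }
}
```

If something like this is happening, getField(0) in deserialize would have nothing to return for projected queries, and the catch block there would turn the resulting failure into a null row. Logging the value of the hive.io.file.readcolumn.ids property from the Configuration passed to initialize might help confirm or rule this out.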
Any idea what I'm missing here? Would love any help on this!
Source: https://stackoverflow.com/questions/52305545/custom-hive-serde-unable-to-select-column-but-works-when-i-do-select