Question
I would like to create an array_agg UDF for Apache Drill that aggregates all values of a group into a list of values.
This should work with any major type (required, optional) and any minor type (varchar, dict, map, int, etc.).
However, I get the impression that Apache Drill's UDF API does not really make use of inheritance and generics. Each type has its own writer and handler, and they cannot be abstracted to handle any type. E.g., the ValueHolder interface seems to be purely cosmetic and cannot be used to hook a UDF to arbitrary types in a type-agnostic way.
My current implementation
I tried to solve this by using Java's reflection so I could use the ListHolder's write function independent of the holder of the original value.
However, I then ran into the limitations of the @FunctionTemplate annotation.
I cannot create a general UDF annotation for any value (I tried it with the interface ValueHolder: @Param ValueHolder input).
So to me it seems like the only way to support different types is to have a separate class for each type. But I can't even abstract much and work on any @Param input, because input is only visible in the class where it's defined (i.e. it is type specific).
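For illustration, the reflection idea described above can be sketched in plain Java. This is only a sketch under stated assumptions: FakeVarCharHolder, FakeIntHolder and FakeListWriter are hypothetical stand-ins, not Drill classes, and whether such a lookup can be used inside a real Drill UDF is not something this example shows.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class ReflectiveListWrite {
    // Hypothetical stand-ins for type-specific holders (NOT Drill's real classes).
    public static class FakeVarCharHolder {
        public final String value;
        public FakeVarCharHolder(String value) { this.value = value; }
    }

    public static class FakeIntHolder {
        public final int value;
        public FakeIntHolder(int value) { this.value = value; }
    }

    // Hypothetical stand-in for a list writer with one write overload per holder type.
    public static class FakeListWriter {
        public final List<Object> values = new ArrayList<>();
        public void write(FakeVarCharHolder holder) { values.add(holder.value); }
        public void write(FakeIntHolder holder) { values.add(holder.value); }
    }

    // Picks the write(...) overload that matches the runtime class of the holder.
    public static void writeAny(FakeListWriter writer, Object holder) {
        try {
            Method write = FakeListWriter.class.getMethod("write", holder.getClass());
            write.invoke(writer, holder);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No write overload for " + holder.getClass(), e);
        }
    }

    public static void main(String[] args) {
        FakeListWriter writer = new FakeListWriter();
        writeAny(writer, new FakeVarCharHolder("a"));
        writeAny(writer, new FakeIntHolder(7));
        System.out.println(writer.values); // prints [a, 7]
    }
}
```

The lookup itself is type-agnostic, but it only helps if the caller can receive the input as a general Object, which is exactly what the @Param declaration does not seem to allow.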
I based my implementation on https://issues.apache.org/jira/browse/DRILL-6963 and created the following two classes for required and optional varchars (how can this be unified in the first place?)
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class VarChar_Agg implements DrillAggFunc {
    @Param org.apache.drill.exec.expr.holders.VarCharHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        listWriter.varChar().write(input);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class NullableVarChar_Agg implements DrillAggFunc {
    @Param NullableVarCharHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        // Skip null inputs
        if (input.isSet != 1) {
            return;
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        // Copy the nullable holder into a required holder before writing
        org.apache.drill.exec.expr.holders.VarCharHolder outHolder =
            new org.apache.drill.exec.expr.holders.VarCharHolder();
        outHolder.start = input.start;
        outHolder.end = input.end;
        outHolder.buffer = input.buffer;
        listWriter.varChar().write(outHolder);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
Interestingly, I can't import org.apache.drill.exec.vector.complex.writer.BaseWriter to make the whole thing easier, because then Apache Drill would not find it.
So I have to write out the entire package path for everything in org.apache.drill.exec.vector.complex.writer in the code.
Furthermore, I'm using the deprecated ObjectHolder. Is there a better solution?
Anyway: These work so far, e.g. with this query:
SELECT
MIN(tbl.`timestamp`) AS start_view,
MAX(tbl.`timestamp`) AS end_view,
array_agg(tbl.eventLabel) AS label_agg
FROM `dfs.root`.`/path/to/avro/folder` AS tbl
WHERE tbl.data.slug IS NOT NULL
GROUP BY tbl.data.slug
However, when I use ORDER BY, I get this:
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: UnsupportedOperationException: NULL
Fragment 0:0
Additionally, I tried more complex types, namely maps/dicts.
Interestingly, when I call SELECT sqlTypeOf(tbl.data) FROM tbl, I get MAP.
But when I write UDFs, the query planner complains about having no UDF array_agg for type dict.
Anyway, I wrote a version for dicts:
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class Map_Agg implements DrillAggFunc {
    @Param MapHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        //listWriter.copyReader(input.reader);
        input.reader.copyAsValue(listWriter);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class Dict_agg implements DrillAggFunc {
    @Param DictHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        //listWriter.copyReader(input.reader);
        input.reader.copyAsValue(listWriter);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
But here, I get an empty list in the field data_agg for my query:
SELECT
MIN(tbl.`timestamp`) AS start_view,
MAX(tbl.`timestamp`) AS end_view,
array_agg(tbl.data) AS data_agg
FROM `dfs.root`.`/path/to/avro/folder` AS tbl
GROUP BY tbl.data.viewSlag
Summary of questions
- Most importantly: How do I create an array_agg UDF for Apache Drill?
- How can I make UDFs type-agnostic/general purpose? Do I really have to implement an entire class for each Nullable, Required and Repeated version of every type? That's a lot to do and quite tedious. Isn't there a way to handle values in a UDF without depending on the underlying types? I wish Apache Drill would just use what Java offers here: generic types, specialised function overloading, and inheritance within its own type system. Am I missing something on how to do that?
- How can I fix the NULL problem when I use ORDER BY on my varchar version of the aggregate?
- How can I fix the problem where my aggregate of maps/dicts is an empty list?
- Is there an alternative to using the deprecated ObjectHolder?
Source: https://stackoverflow.com/questions/62919727/apache-drill-write-general-purpose-array-agg-udf