Structure of a UDF
In this section we will discuss the structure of a UDF. As we go along, keep in mind that a UDF is not just a Java function wrapped in a funny class structure. Instead, it is a DSL that happens to use Java syntax. Because of the way that Drill generates code using the UDF, you are severely limited by what you can do in a UDF. However, we'll later propose a convenient structure that gives you back the full power of Java.
As we go, we will refer to the log2 example discussed in the Debugging UDFs section.
Drill has the job of starting with a SQL statement of the form:
SELECT log2(4) FROM (VALUES (1))
From this statement, Drill must produce the generated code discussed in the previous section. A key aspect of that task is to find an implementation for the log2 function. To do that, Drill creates an internal function registry of all functions known to Drill. These include "built-in" functions, "static" UDFs (given to Drill as jar files) and "dynamic" UDFs (loaded at run time).
In each case, Drill is given a collection of Java class files. Drill has to sort through them all to find those that implement Drill functions (including UDFs). To do that, Drill requires that functions include an annotation:
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
@FunctionTemplate(
  name = "log2",
  scope = FunctionScope.SIMPLE,
  nulls = NullHandling.NULL_IF_NULL)
public static class Log2Function implements DrillSimpleFunc {
Note that the name of the annotation, @FunctionTemplate, reminds us that this is a template for generating the function code; it is not the function itself. That is, the Java class we create uses a simplified DSL that Drill translates into generated Java code.
The implementation class name must follow Java naming conventions and cannot be the same as an existing class. So, we can't name our class "log2" (by convention, class names start with an upper-case letter). Instead, the name of the class is independent of the function name. We set the function name using the name argument of the FunctionTemplate annotation.
Function names need to follow only a few rules:
- Cannot be blank or null.
- Should be a valid SQL identifier.
- Should not duplicate a SQL keyword.
- Must not duplicate a Drill built-in function or a UDF you have imported (except that you can declare multiple functions of the same name, but with different argument types).
You can use an identifier which is invalid for SQL, but if so, you must escape the name. For example, we can name our function something like ++ as long as we quote it:
SELECT `++`(1) FROM (VALUES (1));
Quoting is, however, a nuisance, so instead follow the naming rules to choose a valid name which does not duplicate an existing function name or SQL keyword.
If you are using the ((link needed)) source-based debugging techniques, you can ask your IDE to display the source code for the FunctionTemplate annotation. Doing so is well worth the effort: this is one of the few classes in Drill that provides good documentation.
In that documentation we find that the annotation supports an alternative form for names: the names field, which can hold a list of aliases. For example, for the functions that add dates:
@FunctionTemplate(names = {"date_add", "add"}, ...
Consider this approach if you have already published a function under one name and later want to rename it without breaking existing queries. Just give your function two names: the old one and the new one.
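To make the registry idea concrete, here is a minimal, self-contained sketch of how a set of classes might be scanned for an annotation and indexed by every declared name, aliases included. It is only an illustration: SketchFunctionTemplate is a hypothetical stand-in for Drill's @FunctionTemplate, all class names are invented, and Drill's real registry is considerably more involved.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Drill's @FunctionTemplate; NOT the real annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SketchFunctionTemplate {
  String name() default "";      // single function name
  String[] names() default {};   // alternative form: a list of aliases
}

// Class name and function name are independent of each other.
@SketchFunctionTemplate(name = "log2")
class SketchLog2Function { }

// One implementation published under two names.
@SketchFunctionTemplate(names = {"date_add", "add"})
class SketchDateAddFunction { }

// No annotation, so the scan must skip this class.
class SketchNotAFunction { }

class RegistrySketch {

  // Builds a name -> implementation map from the annotated classes only.
  static Map<String, Class<?>> scan(List<Class<?>> candidates) {
    Map<String, Class<?>> registry = new LinkedHashMap<>();
    for (Class<?> candidate : candidates) {
      SketchFunctionTemplate template =
          candidate.getAnnotation(SketchFunctionTemplate.class);
      if (template == null) {
        continue; // not a function implementation
      }
      if (!template.name().isEmpty()) {
        registry.put(template.name(), candidate);
      }
      for (String alias : template.names()) {
        registry.put(alias, candidate);
      }
    }
    return registry;
  }

  public static void main(String[] args) {
    Map<String, Class<?>> registry = scan(List.of(
        SketchLog2Function.class,
        SketchDateAddFunction.class,
        SketchNotAFunction.class));
    System.out.println(registry.keySet());
  }
}
```

Both aliases map to the same class, which is why a renamed function can keep serving queries that use the old name.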
By the way, this points out another handy technique. If you have built Drill from sources, you can review Drill's own function implementations for additional hints and ideas. Look in java-exec/target/generated-sources/ in the org.apache.drill.exec.expr.fn.impl package and its sub-packages. The above example comes from CastDecimal9VarChar. Note, however, that these classes are generated, so you won't find them in GitHub; you have to build Drill to see them.
Drill supports two kinds of functions: simple row-by-row functions (FunctionScope.SIMPLE), and group-based aggregates (FunctionScope.POINT_AGGREGATE). Most of our examples focus on the simple functions. We'll cover aggregates in a later section ((need link)) once we've mastered simple functions.
Drill defines two additional aggregate types: HOLISTIC_AGGREGATE and RANGE_AGGREGATE. But, a search of the code suggests that they are never used, and so we won't discuss them further.
Every Drill function must implement an interface that matches the scope setting. Here we use the DrillSimpleFunc interface defined as:
public interface DrillSimpleFunc extends DrillFunc {
  public void setup();
  public void eval();
}
We will discuss the methods in the next section.
The relationship between scope and interface is:
Scope | Interface |
---|---|
SIMPLE | DrillSimpleFunc |
POINT_AGGREGATE | DrillAggFunc |
In the theory section, we discussed the most common rule for handling nulls: "if any input value is null, then the output is also null." The rule is so common that Drill will do the work for you, as shown in the simplified examples on that page. The result is that, if your function follows that rule, your function is called only when all the inputs are non-null. To select this mode, we specify nulls = NullHandling.NULL_IF_NULL.
On the other hand, there may be cases where that rule does not apply. For example, suppose we are implementing the Boolean OR operation. If the first argument is TRUE, then the result is TRUE regardless of the value of the second argument, even if that argument is null. In this case, we specify nulls = NullHandling.INTERNAL and handle nulls ourselves.
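The difference between the two modes can be sketched in plain Java. The following is an illustration only, not Drill code: NullableBit is a hypothetical stand-in for a nullable holder (assumed here to carry an isSet flag and a value), and the two methods imitate what happens conceptually under NULL_IF_NULL and INTERNAL for a Boolean OR.

```java
class NullHandlingSketch {

  // Hypothetical stand-in for a nullable Boolean holder; not a Drill class.
  static final class NullableBit {
    int isSet; // 1 = has a value, 0 = NULL
    int value; // 1 = TRUE, 0 = FALSE (meaningful only when isSet == 1)

    static NullableBit of(Boolean b) {
      NullableBit holder = new NullableBit();
      if (b != null) {
        holder.isSet = 1;
        holder.value = b ? 1 : 0;
      }
      return holder;
    }
  }

  // NULL_IF_NULL style: the null check happens before the function body,
  // so the body runs only when every input is non-null.
  static NullableBit orNullIfNull(NullableBit a, NullableBit b) {
    NullableBit out = new NullableBit();
    if (a.isSet == 0 || b.isSet == 0) {
      out.isSet = 0; // any NULL input makes the output NULL
      return out;
    }
    out.isSet = 1;
    out.value = (a.value == 1 || b.value == 1) ? 1 : 0;
    return out;
  }

  // INTERNAL style: the function inspects the null flags itself, so
  // TRUE OR NULL can correctly produce TRUE.
  static NullableBit orInternal(NullableBit a, NullableBit b) {
    NullableBit out = new NullableBit();
    boolean aTrue = a.isSet == 1 && a.value == 1;
    boolean bTrue = b.isSet == 1 && b.value == 1;
    if (aTrue || bTrue) {
      out.isSet = 1;
      out.value = 1;
    } else if (a.isSet == 1 && b.isSet == 1) {
      out.isSet = 1; // both inputs are known to be FALSE
      out.value = 0;
    } else {
      out.isSet = 0; // result depends on an unknown input
    }
    return out;
  }

  static String show(NullableBit h) {
    return h.isSet == 0 ? "NULL" : (h.value == 1 ? "TRUE" : "FALSE");
  }

  public static void main(String[] args) {
    NullableBit t = NullableBit.of(true);
    NullableBit n = NullableBit.of(null);
    System.out.println("NULL_IF_NULL: TRUE OR NULL = " + show(orNullIfNull(t, n)));
    System.out.println("INTERNAL:     TRUE OR NULL = " + show(orInternal(t, n)));
  }
}
```

For OR, the blanket rule gives the wrong answer whenever one input is TRUE and the other is NULL, which is exactly the case INTERNAL exists to handle.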
As discussed in the code generation section, Drill will automatically inline our code. Drill needs to know how to find our input parameters and output value. We do that using annotations:
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.Float8Holder;
public class Log2Function implements DrillSimpleFunc {
  @Param public Float8Holder x;
  @Output public Float8Holder out;
The @Param annotation declares the parameters in the order they are passed to the function. (We'll see a more advanced example later.) The @Output annotation declares the return value (the output). (See also the Drill documentation.)
The fields in the example are declared as public, but those in the Drill example use Java's default (package-private) visibility. Which is right? As it turns out, either is fine: Drill never actually uses your compiled code. We have marked them public so we can more easily create unit tests.
As shown in the documentation, you can add a constant field to the @Param annotation. At present, the constant attribute is used only to enforce that the argument is, in fact, a constant rather than a column reference. If not, you'll see this error in the log:
The argument 'yourArg' of Function 'yourFunc' has to be constant!
The example from the documentation:
public class SimpleMaskFunc implements DrillSimpleFunc {
  @Param NullableVarCharHolder input;
  @Param(constant = true) VarCharHolder mask;
  @Param(constant = true) IntHolder toReplace;
The fields must be one of Drill's "holder" types. Drill will work backward from the type of holder to the data type (and cardinality) of the input and output. For example, the use of Float8Holder above tells Drill that our function accepts a non-nullable FLOAT8 argument, and returns the same type. Holders are explained in depth in Data Types and Holders for UDFs.
Drill (and SQL) support overloaded functions: two or more functions that have the same name but distinct parameter types. Thus, the name does not uniquely identify a function. Instead, think of the function identity as:
name(arg1_type, arg2_type, ...)
Thus, the identity of our example log2 function is: log2(FLOAT8).
We could also define other versions, using the same name, if we wanted:
log2(INT-REQUIRED)
log2(INT-OPTIONAL)
log2(BIGINT-REQUIRED)
log2(DECIMAL9-REQUIRED)
Drill does a search ((need link)) to find the best match. If no exact match is found, Drill will attempt to convert parameters from their actual types to the types needed by the function. For example, if we call our log2() function on an INT column, Drill will convert the INT value to a FLOAT8. Providing multiple versions (in this case, one that takes an INT argument) can potentially save the (negligible) cost of the conversion. Of course, in some cases (such as the DECIMAL9 version) we may not want Drill to convert a decimal to a float. Good testing will help you determine when you need to provide multiple versions of your function.
Drill also considers cardinality (AKA "mode" or nullability) as part of the argument type. For example, in the list above, versions of log2 appear for both nullable and non-nullable INT. Generally, if you let Drill handle nulls, you only need to create the required (non-nullable) versions.
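The idea of a function identity, and of falling back to a conversion when no exact match exists, can be sketched as follows. This is a deliberately simplified illustration, not Drill's actual resolution algorithm: the Type enum, the widening order, and the single-argument resolve() method are all assumptions made for the example.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.StringJoiner;

class OverloadSketch {

  // Hypothetical, tiny subset of data types for the example.
  enum Type { INT, BIGINT, FLOAT8 }

  // Builds an identity string of the form name(arg1_type, arg2_type, ...).
  static String identity(String name, Type... argTypes) {
    StringJoiner joiner = new StringJoiner(", ", name + "(", ")");
    for (Type type : argTypes) {
      joiner.add(type.name());
    }
    return joiner.toString();
  }

  // Assumed conversion order, closest conversion first.
  static List<Type> widerTypes(Type type) {
    switch (type) {
      case INT:
        return List.of(Type.BIGINT, Type.FLOAT8);
      case BIGINT:
        return List.of(Type.FLOAT8);
      default:
        return List.of();
    }
  }

  // Prefers an exact match; otherwise tries each wider type in order.
  static Optional<String> resolve(Set<String> registered, String name, Type arg) {
    String exact = identity(name, arg);
    if (registered.contains(exact)) {
      return Optional.of(exact);
    }
    for (Type wider : widerTypes(arg)) {
      String candidate = identity(name, wider);
      if (registered.contains(candidate)) {
        return Optional.of(candidate); // the argument would be converted
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    Set<String> registered = Set.of(identity("log2", Type.FLOAT8));
    // Only log2(FLOAT8) exists, so an INT argument resolves via conversion.
    System.out.println(resolve(registered, "log2", Type.INT));
  }
}
```

Registering an additional log2(INT) version would make the exact match win for INT columns, which is the trade-off the text describes.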
Functions sometimes need temporary internal state. Aggregates, for example, need a place to store running totals. Drill provides the @Workspace annotation to indicate such fields. For example, in the log2 function, we might want to pre-compute part of the expression:
public class Log2Function implements DrillSimpleFunc {
  @Workspace private double LOG_2;
A very common, and mysterious, mistake is to omit the @Workspace annotation. If you do that, you will get a spectacular set of error messages in your log file, along with multiple copies of the generated source code. (See the troubleshooting section.) If you see that, verify that every field has one of the three annotations: @Param, @Output or @Workspace.
Note also that, in ordinary Java, it is common to declare static constants for cases such as the one shown above:
public class FunctionImpl {
  // DO NOT do this!
  private static final double LOG_2 = Math.log(2.0D);
  public static final double log2(double x) {
    return Math.log(x) / LOG_2;
  }
}
But, constants do not work with UDFs. You won't get an error, but your constant will never be initialized, resulting in mysterious errors.
While the @Param and @Output fields must be declared as holders, the @Workspace fields can be of any (supported) type. (But, as shown above, they must be instance fields, not static.)
When working with VARCHAR or VARBINARY data you must negotiate with Drill a place to store the data using the @Inject annotation. We'll discuss this annotation in the Working with VARCHAR Data in UDFs section.
A function must provide a setup() method, but it can be empty. For simple functions, setup() is handy for one-time initialization. For example, to define the "constant" discussed above:
public class Log2Function implements DrillSimpleFunc {
  @Workspace private double LOG_2;
  public void setup() {
    LOG_2 = Math.log(2.0D);
  }
For simple functions, setup() is called once per batch. Actually, the method is not "called." Instead, just like the eval() method, it is inlined into the generated code. Recall we showed that code generation creates two methods, also called setup() and eval(). Your function's setup() method is inlined into the generated setup() method. (For this to work, the workspace fields are also inlined.)
The setup() method plays a large role in aggregate functions.
The core of a UDF is the eval() method, called once per row. For example:
@Override
public void eval() {
  out.value = Math.log(x.value) / LOG_2;
}
As noted, the eval() method is not actually "called"; it is instead copied inline into the generated code. This is why Drill must have visibility to both your object code (in order to get the annotations) and your source code (to inline the methods).
Although your eval() method is inlined, it cannot access any information other than that passed in via the @Param arguments. Since the @Param arguments are holders, your function has no visibility to Drill value vectors or other internals.
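Because the holder fields are public and the function sees nothing but its holders, the class can be exercised directly from a unit test. The sketch below imitates the "setup() once per batch, eval() once per row" pattern in plain Java. To stay free of Drill dependencies it uses a local stand-in for Float8Holder (assumed to expose a public double value field) and omits the annotations and the DrillSimpleFunc interface; runBatch() is a hypothetical test driver, not part of Drill.

```java
import java.util.Arrays;

class Log2UnitTestSketch {

  // Local stand-in for Drill's Float8Holder, assumed to expose 'value'.
  static final class Float8Holder {
    public double value;
  }

  // Mirrors the Log2Function example, minus the Drill annotations
  // (@Param, @Output, @Workspace) and the DrillSimpleFunc interface.
  static final class Log2Function {
    public Float8Holder x;
    public Float8Holder out;
    private double LOG_2;

    public void setup() {
      LOG_2 = Math.log(2.0D);
    }

    public void eval() {
      out.value = Math.log(x.value) / LOG_2;
    }
  }

  // Hypothetical test driver: one setup() per batch, one eval() per row.
  static double[] runBatch(double[] inputs) {
    Log2Function fn = new Log2Function();
    fn.x = new Float8Holder();
    fn.out = new Float8Holder();
    fn.setup();

    double[] results = new double[inputs.length];
    for (int row = 0; row < inputs.length; row++) {
      fn.x.value = inputs[row]; // load the input holder for this row
      fn.eval();
      results[row] = fn.out.value; // read the output holder
    }
    return results;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(runBatch(new double[] {1.0, 2.0, 8.0})));
  }
}
```

A test written this way checks the logic of the function but not Drill's code generation, so it complements rather than replaces a test that runs a real query.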
We write UDFs using Java, but are restricted to a small subset. One restriction is that Drill will not honor Java import statements. Suppose we had an existing MathUtils.log2() function we wanted to use. We cannot do this:
// DO NOT do this!
import org.yoyodyne.utils.MathUtils;
...
@Override
public void eval() {
  out.value = MathUtils.log2(x.value);
}
Doing the above will also result in a spectacular set of error messages in your Drill log file. Instead, you must put a fully-qualified reference inline so that Drill properly copies it to the generated code:
@Override
public void eval() {
  // Use fully-qualified references
  out.value = org.yoyodyne.utils.MathUtils.log2(x.value);
}
Suppose that, like the above case, we want to define our function in another class. Suppose this time the class is in the same package as our UDF. Normally in Java we can just say:
@Override
public void eval() {
  // DON'T DO THIS!
  // Implicit references to same-package classes not supported.
  out.value = MathUtils.log2(x.value);
}
The same error will result as if we used import. Why? Drill simply copies the code, but puts it into Drill's own org.apache.drill.exec.test.generated package. Since the code is now in a different package, Java can no longer locate our MathUtils class and a Java compile error results when Drill compiles the generated code.
The solution is to always use fully-qualified references to all classes except for those defined in the JDK, or the subset of Java classes that Drill will import for you. (The set of imported classes is the holders and a few others.)
So, the proper form is:
package org.apache.drill.contrib.mathUtils;
...
@Override
public void eval() {
  // Use fully-qualified references
  out.value = org.apache.drill.contrib.mathUtils.MathUtils.log2(x.value);
}
Drill uses a fantastically complex process to reverse engineer your source code, forward engineer (i.e., generate) the Java code, compile the Java, then merge byte codes. This process is (supposed to be) fast, but it is very fragile. In general, the process cannot handle complex Java and is designed for very simple declarative statements with shallow nesting. Said another way, don't try to implement K-means directly in a UDF.
It turns out, however, that once we are aware of the limitations, an easy workaround exists. Simply put all the complex code into another function, and leave the UDF as a wrapper that calls our function. That way, Drill will muck about only with the wrapper, not with the actual function implementation.
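A minimal sketch of this wrapper pattern follows. It is written in plain Java so it can run without Drill: MathUtilsSketch is a hypothetical helper class, Float8Holder is a local stand-in for the Drill holder, and the annotations are mentioned only in comments. In a real UDF the helper would live in its own package, and eval() would have to reference it by its fully-qualified name.

```java
// Hypothetical helper class. It is ordinary Java that Drill never copies,
// so it is free to use loops, helper methods, and other classes.
class MathUtilsSketch {
  static double log2(double x) {
    double ln2 = Math.log(2.0D);
    return Math.log(x) / ln2;
  }
}

// Stand-in for the UDF wrapper. A real one would be annotated with
// @FunctionTemplate and implement DrillSimpleFunc.
class Log2WrapperSketch {

  // Local stand-in for Drill's Float8Holder, assumed to expose 'value'.
  static final class Float8Holder {
    public double value;
  }

  public Float8Holder x = new Float8Holder();   // would carry @Param
  public Float8Holder out = new Float8Holder(); // would carry @Output

  public void setup() {
    // Nothing to initialize: the helper owns all the logic.
  }

  public void eval() {
    // A single, simple statement is all that gets inlined. In a real UDF,
    // write the helper's fully-qualified class name here.
    out.value = MathUtilsSketch.log2(x.value);
  }

  public static void main(String[] args) {
    Log2WrapperSketch fn = new Log2WrapperSketch();
    fn.setup();
    fn.x.value = 8.0;
    fn.eval();
    System.out.println(fn.out.value);
  }
}
```

Note that the wrapper passes x.value, a plain double, rather than the holder itself, which keeps the helper independent of Drill and easy to test on its own.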
Because of the way that UDFs are copied into generated code, they have access to nothing inside Drill except the holders. In particular, they cannot access:
- The operator that runs the code.
- The value vectors that hold other rows within the current batch.
- Values other than those passed as parameters.
On the one hand, the above restrictions make UDFs quite simple and robust (once we get them running.) On the other, the restrictions limit what you can accomplish with a UDF.
A quick summary of (some of the) restrictions on a UDF includes:
- All fields must have annotations.
- No static or final fields.
- No imports or implicit same-package class references.
- No complex code.
- No access to any Drill internals other than the arguments to the function.
- UDFs can access nullable or non-nullable columns, but not repeated (array) columns.
- Do not call methods on the holder objects.
- Do not pass holder objects to other methods.
- Function classes cannot extend other classes or interfaces; only the Drill function interfaces described here.
Presumably, all this extra work for the developer pays off in slightly faster runtime because, again presumably, Drill can generate better code than Java can (a highly dubious proposition, but there we have it.)