Structure of a UDF
In this section we will discuss the structure of a UDF. As we go along, keep in mind that a UDF is not just a Java function wrapped in a funny class structure. Instead, it is a DSL that happens to use Java syntax. Because of the way that Drill generates code from the UDF, you are severely limited in what you can do in a UDF. However, we'll later propose a convenient structure that gives you back the full power of Java.
As we go, we will refer to the log2 example discussed ((need link)).
Drill has the job of starting with a SQL statement of the form:
SELECT log2(4) FROM (VALUES (1))
And producing the generated code discussed in the previous section. A key aspect of that task is to find an implementation for the log2 function. To do that, Drill creates an internal function registry of all functions known to Drill. These include "built-in" functions, "static" UDFs (given to Drill as jar files) and "dynamic" UDFs (loaded at run time).
In each case, Drill is given a collection of Java class files. Drill has to sort through them all to find those that implement Drill functions (including UDFs). To do that, Drill requires that functions include an annotation:
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
@FunctionTemplate(
name = "log2",
scope = FunctionScope.SIMPLE,
nulls = NullHandling.NULL_IF_NULL)
public static class Log2Function implements DrillSimpleFunc {
The implementation class name must follow Java naming conventions and cannot be the same as an existing class. So, we can't name our class "log2" (by convention, class names start with an upper-case letter). Instead, the name of the class is independent of the function name. We set the function name using the name argument of the FunctionTemplate annotation.
Function names need to follow very few rules:
- Cannot be blank or null.
- Should be a valid SQL identifier.
- Should not duplicate a SQL keyword.
- Must not duplicate a Drill built-in function or a UDF you have imported (except that you can declare multiple functions of the same name, but with different argument types). ((Need link.))
You can use a name that is not a valid identifier, or that duplicates a keyword, but you must then escape the name each time you use it. (Quoting does not help, however, if your function name duplicates another function.) For example, we can name our function something like ++ as long as we quote it:
SELECT `++`(1) FROM (VALUES (1));
Quoting is, however, a nuisance, so instead follow the naming rules to choose a valid, unique name.
If you are using the ((link needed)) source-based debugging techniques, you can ask your IDE to display the source code for FunctionTemplate. Doing so is well worth the effort: this is one of the few classes in Drill that provides good documentation.
In that documentation we find that the annotation supports an alternative form for names: the names field that can hold a list of aliases. For example, for the functions that add dates:
@FunctionTemplate(names = {"date_add", "add"}, ...
Consider this approach if you have already released a function under one name and later want to rename it without breaking existing queries. Just give your function two names: the old one and the new one.
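The lookup described above can be pictured with a toy registry. This is a sketch in plain Java, not Drill's actual registry: the Template annotation, the FunctionRegistrySketch class and its register and lookup methods are all hypothetical stand-ins, and a real registry would also have to match argument types.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy registry, not Drill's implementation.
final class FunctionRegistrySketch {

  // Hypothetical stand-in for Drill's FunctionTemplate annotation.
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  @interface Template {
    String name() default "";
    String[] names() default {};
  }

  @Template(name = "log2")
  static class Log2Function { }

  // One class, two SQL names: the old name and the new one.
  @Template(names = {"date_add", "add"})
  static class DateAddFunction { }

  // No annotation, so the registry skips this class.
  static class NotAFunction { }

  // SQL name (or alias) -> implementing classes. A name may map to
  // several classes because functions can be overloaded.
  private final Map<String, List<Class<?>>> byName = new LinkedHashMap<>();

  void register(Class<?>... candidates) {
    for (Class<?> candidate : candidates) {
      Template template = candidate.getAnnotation(Template.class);
      if (template == null) {
        continue; // not a function
      }
      // Prefer the alias list when present, else the single name.
      String[] keys = template.names().length > 0
          ? template.names()
          : new String[] {template.name()};
      for (String key : keys) {
        if (key.isEmpty()) {
          throw new IllegalArgumentException(
              "Blank function name on " + candidate.getName());
        }
        byName.computeIfAbsent(key, k -> new ArrayList<>()).add(candidate);
      }
    }
  }

  List<Class<?>> lookup(String sqlName) {
    return byName.getOrDefault(sqlName, Collections.emptyList());
  }
}
```

In this sketch the Java class name never appears in the map: a query that calls add and one that calls date_add both resolve to DateAddFunction.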
By the way, this points out another handy technique. If you have built Drill from sources, you can review Drill's own function implementations for additional hints and ideas. Look in java-exec/target/generated-sources/ in the org.apache.drill.exec.expr.fn.impl package and its sub-packages. The above example comes from CastDecimal9VarChar. Note, however, that these classes are generated, so you won't find them in GitHub; you have to build Drill to see them.
Drill supports two kinds of functions: simple row-by-row functions (FunctionScope.SIMPLE), and group-based aggregates (FunctionScope.POINT_AGGREGATE). Most of our examples focus on the simple functions. We'll cover aggregates in a later section ((need link)) once we've mastered simple functions.
Drill defines two additional aggregate types: HOLISTIC_AGGREGATE and RANGE_AGGREGATE. However, a search of the code suggests that they are never used, and so we won't discuss them further.
Every Drill function must implement an interface that matches the scope setting. Here we use the DrillSimpleFunc interface defined as:
public interface DrillSimpleFunc extends DrillFunc {
public void setup();
public void eval();
}
We will discuss the methods in the next section.
The relationship between scope and interface is:
Scope | Interface |
---|---|
SIMPLE | DrillSimpleFunc |
POINT_AGGREGATE | DrillAggFunc |
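The pairing in the table can be expressed as a small consistency check, for example in a unit test for your own functions. The sketch below is plain Java with hypothetical stand-ins (SimpleFunc, AggFunc, Scope and matches are invented here); it does not claim to show how Drill itself validates functions.

```java
// Illustrative only: stand-in types that mirror the scope/interface table.
final class ScopeCheckSketch {

  interface SimpleFunc { } // stand-in for DrillSimpleFunc
  interface AggFunc { }    // stand-in for DrillAggFunc

  // Each scope remembers the interface its functions are expected to implement.
  enum Scope {
    SIMPLE(SimpleFunc.class),
    POINT_AGGREGATE(AggFunc.class);

    final Class<?> requiredInterface;

    Scope(Class<?> requiredInterface) {
      this.requiredInterface = requiredInterface;
    }
  }

  // True when the implementation class implements the interface
  // that goes with the declared scope.
  static boolean matches(Scope declared, Class<?> implementation) {
    return declared.requiredInterface.isAssignableFrom(implementation);
  }

  static class Log2Function implements SimpleFunc { }
}
```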
In the theory section, we discussed the most common rule for handling nulls: "if any input value is null, then the output is also null." The rule is so common that Drill will do the work for you, as shown in the simplified examples on that page. The result is that, if your function follows that rule, your function is called only when all the inputs are non-null. To select this mode, we specify nulls = NullHandling.NULL_IF_NULL.
On the other hand, there may be cases where that rule does not apply. For example, suppose we are implementing the Boolean OR operation. If the first argument is TRUE, then the result is also TRUE regardless of the value of the second argument, even if that argument is null. In this case, we specify nulls = NullHandling.INTERNAL and handle null inputs ourselves.
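The difference between the two modes can be modeled in plain Java. In this sketch a Java null stands in for a SQL NULL, and the class and method names are invented for illustration; in a real UDF the null check for NULL_IF_NULL is generated by Drill rather than written by hand. The OR follows standard SQL three-valued logic.

```java
// Illustrative only: a plain-Java model of the two null-handling modes.
// A Java null stands in for SQL NULL.
final class NullHandlingSketch {

  // The function body proper: written as if nulls did not exist.
  static double log2(double x) {
    return Math.log(x) / Math.log(2.0D);
  }

  // NULL_IF_NULL: the surrounding (generated) code checks for nulls,
  // so the body above runs only for non-null input.
  static Double callWithNullIfNull(Double x) {
    if (x == null) {
      return null; // any null input gives a null output
    }
    return log2(x);
  }

  // INTERNAL: the function sees the nulls and decides for itself.
  static Boolean or(Boolean a, Boolean b) {
    if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) {
      return Boolean.TRUE; // TRUE wins even if the other side is NULL
    }
    if (a == null || b == null) {
      return null; // FALSE OR NULL, NULL OR NULL: unknown
    }
    return Boolean.FALSE;
  }
}
```

Note that or(true, null) cannot be expressed under the "any null in, null out" rule, which is why such a function needs to see its null inputs.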
As discussed in the code generation section, Drill will automatically inline our code. Drill needs to know how to find our input parameters and output value. We do that using annotations:
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.Float8Holder;
public class Log2Function implements DrillSimpleFunc {
@Param public Float8Holder x;
@Output public Float8Holder out;
The @Param annotation declares the parameters in the order they are passed to the function. (We'll see a more advanced example later.) The @Output annotation declares the return value (the output). (See also the Drill documentation.)
The fields must be one of Drill's "holder" types. Drill will work backward from the type of holder to the data type (and cardinality) of the input and output. For example, the use of Float8Holder above tells Drill that our function accepts a non-nullable FLOAT8 argument, and returns the same type. A list of the data type/holder pairs is ((need link)).
The arguments in the example are declared as public, but those in the Drill example have no access modifier and so default to package-private. Which is right? As it turns out, either is fine: Drill never actually uses your compiled code. We have marked them public so we can more easily create unit tests.
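To see why public fields help with testing, here is a sketch of a test driver that fills in the holders by hand and calls the function directly. It uses stand-ins so that it compiles without Drill on the classpath: this Float8Holder only mimics the public value field of Drill's holder, the annotations appear as comments, and the eval() body is an assumed implementation of log2, not one taken from this page.

```java
// Illustrative only: stand-ins so the sketch compiles without Drill.
final class Log2UnitTestSketch {

  // Stand-in for Drill's Float8Holder (a holder with a public value field).
  static final class Float8Holder {
    public double value;
  }

  // Stand-in for the UDF class. In the real class the fields carry
  // the @Param and @Output annotations.
  static final class Log2Function {
    public Float8Holder x;   // @Param
    public Float8Holder out; // @Output

    public void setup() { }

    // Assumed implementation, for illustration.
    public void eval() {
      out.value = Math.log(x.value) / Math.log(2.0D);
    }
  }

  // What a unit test does: wire up the holders, then call the methods
  // in the order assumed here (setup first, then eval).
  static double callLog2(double input) {
    Log2Function fn = new Log2Function();
    fn.x = new Float8Holder();
    fn.out = new Float8Holder();
    fn.setup();
    fn.x.value = input;
    fn.eval();
    return fn.out.value;
  }
}
```

With package-private or private fields the same test would have to live in the function's package or resort to reflection.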
Functions sometimes need temporary internal state. Aggregates need a place to store the running totals. Drill provides the @Workspace annotation to indicate such fields. For example, in the log2 function, we might want to pre-compute part of the expression:
public class Log2Function implements DrillSimpleFunc {
@Workspace private double LOG_2;
A very common, and mysterious, mistake is to omit the @Workspace annotation. If you do that, you will get a spectacular set of error messages in your log file, along with multiple copies of the generated source code. (See the troubleshooting section.) If you see that, verify that every field has one of the three annotations: @Param, @Output or @Workspace.
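The "every field has one of the three annotations" check can be automated with reflection, for instance in a unit test. The sketch below defines its own stand-in Param, Output and Workspace annotations so that it is self-contained; using it with Drill's real annotations assumes they are retained at run time, which you should confirm in their source. The helper name unannotatedFields is invented here.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: stand-in annotations, not Drill's.
final class FieldAnnotationCheckSketch {

  @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
  @interface Param { }

  @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
  @interface Output { }

  @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
  @interface Workspace { }

  static class GoodFunction {
    @Param public double x;
    @Output public double out;
    @Workspace private double log2;
  }

  static class BadFunction {
    @Param public double x;
    @Output public double out;
    private double log2; // oops: no @Workspace
  }

  // Returns the names of declared fields that carry none of the
  // three annotations. Static fields are reported as well.
  static List<String> unannotatedFields(Class<?> functionClass) {
    List<String> missing = new ArrayList<>();
    for (Field field : functionClass.getDeclaredFields()) {
      if (field.isSynthetic()) {
        continue; // compiler- or tool-generated, not ours
      }
      boolean annotated = field.isAnnotationPresent(Param.class)
          || field.isAnnotationPresent(Output.class)
          || field.isAnnotationPresent(Workspace.class);
      if (!annotated) {
        missing.add(field.getName());
      }
    }
    return missing;
  }
}
```

A test can then assert that the returned list is empty for each of your function classes, which catches the mistake before it reaches the log file.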
Note also that, in ordinary Java, it is common to declare a static constant for cases such as that shown above:
public class FunctionImpl {
// DO NOT do this!
private static final double LOG_2 = Math.log(2.0D);
public static final double log2(double x) {
return Math.log(x) / LOG_2;
}
}
But, constants do not work with UDFs. You won't get an error, but your constant will never be initialized, resulting in mysterious errors.
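A sketch of the alternative, using the workspace field introduced earlier: compute the value in setup() and read it in eval(). As before, the holder is a stand-in, the annotations appear as comments so the code compiles without Drill, and the sketch assumes that setup() runs before the first call to eval().

```java
// Illustrative only: stand-ins so the sketch compiles without Drill.
final class WorkspaceSetupSketch {

  static final class Float8Holder {
    public double value;
  }

  static final class Log2Function {
    public Float8Holder x;   // @Param
    public Float8Holder out; // @Output
    private double log2;     // @Workspace (instead of a static constant)

    // Initialize the workspace here rather than in a field initializer.
    public void setup() {
      log2 = Math.log(2.0D);
    }

    public void eval() {
      out.value = Math.log(x.value) / log2;
    }
  }

  // Calls setup() once, then eval() once per input value.
  static double[] evalAll(double... inputs) {
    Log2Function fn = new Log2Function();
    fn.x = new Float8Holder();
    fn.out = new Float8Holder();
    fn.setup();
    double[] results = new double[inputs.length];
    for (int i = 0; i < inputs.length; i++) {
      fn.x.value = inputs[i];
      fn.eval();
      results[i] = fn.out.value;
    }
    return results;
  }
}
```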
Drill implements each UDF as a class. How does Drill know which classes are UDFs? By using a Drill-defined annotation and implementing a Drill-defined interface. Following the documentation, let's explain this in the context of an example function that fills a gap in Drill's math and trig functions: the sin function:
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
@FunctionTemplate(
name = "sin",
scope = FunctionScope.SIMPLE,
nulls = NullHandling.NULL_IF_NULL)
public static class SinFunction implements DrillSimpleFunc {
Here we immediately see a benefit of working with the Drill project: we can traverse to the source for these various items, including the annotation. Fortunately, the documentation for FunctionTemplate has some very helpful explanations.
For now, let's use the scope and nulls values above; we'll discuss them at length later.