Paul Rogers edited this page Jan 5, 2018 · 20 revisions

In this section we will discuss the structure of a UDF. As we go along, keep in mind that a UDF is not just a Java function wrapped in a funny class structure. Instead, it is a DSL that happens to use Java syntax. Because of the way that Drill generates code from the UDF, you are severely limited in what you can do in a UDF. However, we'll later propose a convenient structure that gives you back the full power of Java.

As we go, we will refer to the log2 example discussed ((need link)).

Declaring a UDF

Drill has the job of starting with a SQL statement of the form:

SELECT log2(4) FROM (VALUES (1))

And producing the generated code discussed in the previous section. A key aspect of that task is to find an implementation for the log2 function. To do that, Drill creates an internal function registry of all functions known to Drill. These include "built-in" functions, "static" UDFs (given to Drill as jar files) and "dynamic" UDFs (loaded at run time.)

In each case, Drill is given a collection of Java class files. Drill has to sort through them all to find those that implement Drill functions (including UDFs). To do that, Drill requires that functions include an annotation:

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;

@FunctionTemplate(
    name = "log2",
    scope = FunctionScope.SIMPLE,
    nulls = NullHandling.NULL_IF_NULL)

public static class Log2Function implements DrillSimpleFunc {
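The discovery idea described above can be sketched in plain Java. This is only an illustration: TemplateStandIn and SimpleFuncStandIn are hypothetical stand-ins for Drill's FunctionTemplate and DrillSimpleFunc, and Drill's real scanner may find annotated classes by a different mechanism than run-time reflection.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Drill's FunctionTemplate annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface TemplateStandIn {
  String name();
}

// Hypothetical stand-in for the DrillSimpleFunc interface.
interface SimpleFuncStandIn { }

// A class that is both annotated and implements the interface.
@TemplateStandIn(name = "log2")
class Log2Candidate implements SimpleFuncStandIn { }

// An ordinary class: no annotation, so it is not treated as a function.
class NotAFunction { }

class AnnotationScan {
  // Return the declared function names of the candidate classes
  // that carry the annotation and implement the interface.
  static List<String> findFunctionNames(Class<?>... candidates) {
    List<String> names = new ArrayList<>();
    for (Class<?> c : candidates) {
      TemplateStandIn t = c.getAnnotation(TemplateStandIn.class);
      if (t != null && SimpleFuncStandIn.class.isAssignableFrom(c)) {
        names.add(t.name());
      }
    }
    return names;
  }
}
```

Passing Log2Candidate.class and NotAFunction.class to findFunctionNames yields only the name declared in the annotation, which is the sense in which the class name and the function name are separate.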

Function Name

The implementation class name must follow Java naming conventions and cannot be the same as an existing class. So, we shouldn't name our class "log2" (by convention, class names start with an upper-case letter). The name of the class is therefore independent of the function name. We set the function name using the name argument of the FunctionTemplate annotation.

Function names need to follow very few rules:

  • Cannot be blank or null.
  • Should be a valid SQL identifier.
  • Should not duplicate a SQL keyword.
  • Must not duplicate a Drill built-in function or a UDF you have imported (except that you can declare multiple functions of the same name, but with different argument types). ((Need link.))

You can use a name that is not a valid identifier, or that duplicates a keyword, but you must then quote the name in queries. (Quoting does not help, however, if your function name duplicates another function.) For example, we can name our function something like ++ as long as we quote it:

SELECT `++`(1) FROM (VALUES (1));

Quoting is, however, a nuisance, so instead follow the naming rules to choose a valid, unique name.
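As a rough aid, the first three rules can be turned into a small check. This is a sketch under stated assumptions, not Drill's own validation: the identifier pattern is an assumed definition of an unquoted SQL identifier, and SAMPLE_KEYWORDS is a tiny illustrative sample rather than Drill's real keyword list. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class FunctionNameCheck {
  // Assumption: an unquoted identifier is a letter or underscore
  // followed by letters, digits, or underscores.
  private static final Pattern IDENTIFIER =
      Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  // Illustrative sample only; not Drill's actual keyword list.
  private static final Set<String> SAMPLE_KEYWORDS =
      new HashSet<>(Arrays.asList("select", "from", "where", "values"));

  // True if the name could be used in a query without quoting,
  // according to the assumptions above.
  static boolean usableWithoutQuoting(String name) {
    if (name == null || name.trim().isEmpty()) {
      return false;                       // blank or null
    }
    if (!IDENTIFIER.matcher(name).matches()) {
      return false;                       // not an identifier, e.g. "++"
    }
    return !SAMPLE_KEYWORDS.contains(name.toLowerCase(Locale.ROOT));
  }
}
```

Under these assumptions log2 passes, while ++ and select do not. The fourth rule (no clash with an existing function) depends on what is registered in Drill and is not covered here.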

Multiple Names

If you are using the ((link needed)) source-based debugging techniques, you can ask your IDE to display the source code for the FunctionTemplate. Doing so is well worth the effort: this is one of the few classes in Drill that provides good documentation.

In that documentation we find that the annotation supports an alternative form for names: the names field that can hold a list of aliases. For example, for the functions that add dates:

    @FunctionTemplate(names = {"date_add", "add"}, ...

Consider this approach if you gave your function one name, later want to change it, but don't want to break existing queries. Just give your function two names: the old one and the new one.
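The effect of aliases can be pictured as a lookup table in which several names point at the same implementation. The sketch below is a toy model, not Drill's function registry: AliasRegistry and the implementation name DateAddImpl are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a name-to-implementation table.
class AliasRegistry {
  private final Map<String, String> implByName = new HashMap<>();

  // Register one implementation under every name it declares,
  // much as the names = {...} list does in the annotation.
  void register(String implClass, String... names) {
    for (String name : names) {
      implByName.put(name, implClass);
    }
  }

  // Return the implementation for a name, or null if none is registered.
  String lookup(String name) {
    return implByName.get(name);
  }
}
```

After register("DateAddImpl", "date_add", "add"), both lookup("date_add") and lookup("add") return the same implementation, so queries written against either name keep working.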

By the way, this points out another handy technique. If you have built Drill from sources, you can review Drill's own function implementations for additional hints and ideas. Look in java-exec/target/generated-sources/ in the org.apache.drill.exec.expr.fn.impl package and its sub-packages. The above example comes from CastDecimal9VarChar. Note, however, that these classes are generated, so you won't find them in GitHub; you have to build Drill to see them.

Function Scope

Drill supports two kinds of functions: simple row-by-row functions (FunctionScope.SIMPLE), and group-based aggregates (FunctionScope.POINT_AGGREGATE). Most of our examples focus on the simple functions. We'll cover aggregates in a later section ((need link)) once we've mastered simple functions.

Drill defines two additional aggregate types: HOLISTIC_AGGREGATE and RANGE_AGGREGATE. However, a search of the code suggests that they are never used, so we won't discuss them further.

Null Handling

In the theory section, we discussed the most common rule for handling nulls: "if any input value is null, then the output is also null." The rule is so common that Drill will do the work for you, as shown in the simplified examples on that page. The result is that, if your function follows that rule, your function is called only when all the inputs are non-null. To select this mode, we specify nulls = NullHandling.NULL_IF_NULL.

On the other hand, there may be cases where that rule does not apply. For example, suppose we are implementing the Boolean OR operation. If the first argument is TRUE, then the result is also TRUE regardless of the value of the second argument, even if that value is null. In this case, we specify nulls = NullHandling.INTERNAL.
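The two null-handling modes can be contrasted in plain Java, using null references in place of SQL nulls. This is a conceptual sketch only: the class and method names are hypothetical, and real Drill functions work with holder objects rather than boxed Double and Boolean values.

```java
class NullHandlingSketch {
  // NULL_IF_NULL style: the null check happens outside the function body,
  // so the computation itself only ever sees a real value.
  static Double log2NullIfNull(Double x) {
    if (x == null) {
      return null;                 // any null input yields a null output
    }
    return Math.log(x) / Math.log(2.0);
  }

  // INTERNAL style: the function itself decides what a null input means.
  // For OR, a TRUE argument wins even when the other argument is null.
  static Boolean orInternal(Boolean a, Boolean b) {
    if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) {
      return Boolean.TRUE;
    }
    if (a == null || b == null) {
      return null;                 // no TRUE seen and one input is unknown
    }
    return Boolean.FALSE;
  }
}
```

Here orInternal(true, null) is TRUE, which the blanket "null in, null out" rule could not express.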

Declaring Parameters and Return Value

As discussed in the code generation section, Drill will automatically inline our code. Drill needs to know how to find our input parameters and output value. We do that using annotations:

import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.Float8Holder;

public class Log2Function implements DrillSimpleFunc {

  @Param public Float8Holder x;
  @Output public Float8Holder out;

The @Param annotation declares the parameters in the order they are passed to a function. (We'll see a more advanced example later.) The @Output annotation declares the return value (the output). See also the Drill documentation.

The fields must be one of Drill's "holder" types. Drill will work backward from the type of holder to the data type (and cardinality) of the input and output. For example, the use of Float8Holder above tells Drill that our function accepts a non-nullable FLOAT8 argument, and returns the same type. A list of the data type/holder pairs is ((need link)).

The arguments in the example are declared as public, but those in the Drill example default to protected. Which is right? As it turns out, either is fine: Drill never actually uses your compiled code. We have marked them public so we can more easily create unit tests.
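To show why public fields make unit testing easier, here is a minimal sketch that drives a function by hand. It makes several assumptions: Float8HolderStandIn is a hypothetical stand-in for Drill's Float8Holder (modeled as a plain object with a public value field), the eval body is an assumed implementation of log2, and the annotations are omitted so the sketch compiles without Drill on the class path.

```java
// Hypothetical stand-in for Drill's Float8Holder.
class Float8HolderStandIn {
  public double value;
}

class Log2Sketch {
  // Public fields let a test assign the input and read the output directly.
  public Float8HolderStandIn x;
  public Float8HolderStandIn out;

  // Assumed function body: log base 2 via the change-of-base formula.
  public void eval() {
    out.value = Math.log(x.value) / Math.log(2.0);
  }

  // How a unit test might call the function without Drill:
  // create the holders, set the input, run eval, read the output.
  static double call(double input) {
    Log2Sketch fn = new Log2Sketch();
    fn.x = new Float8HolderStandIn();
    fn.out = new Float8HolderStandIn();
    fn.x.value = input;
    fn.eval();
    return fn.out.value;
  }
}
```

A test can then compare call(4.0) with 2.0 within a small tolerance. With protected fields, the same test would need to live in the same package or use a subclass.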

Declaring "Workspace" Fields

The Setup Method

The Eval Method

External Dependencies

Allowable Code

Restrictions

Declaring a Class as a UDF

Drill implements each UDF as a class. How does Drill know which classes are UDFs? By using a Drill-defined annotation and implementing a Drill-defined interface. Following the documentation, let's explain this in the context of an example function that fills a gap in Drill's math and trig functions: the sin function:

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;

@FunctionTemplate(
    name = "sin",
    scope = FunctionScope.SIMPLE,
    nulls = NullHandling.NULL_IF_NULL)

public static class SinFunction implements DrillSimpleFunc {

Here we immediately see a benefit of working with the Drill project: we can traverse to the source for these various items, including the annotation. Fortunately, the documentation for the FunctionTemplate has some very helpful explanations.

For now, let's use the scope and nulls values above; we'll discuss them at length later.
