Structure of a UDF
In this section we will discuss the structure of a UDF. As we go along, keep in mind that a UDF is not just a Java function wrapped in a funny class structure. Instead, it is a DSL that happens to use Java syntax. Because of the way that Drill generates code using the UDF, you are severely limited in what you can do in a UDF. However, we'll later propose a convenient structure that gives you back the full power of Java.
As we go, we will refer to the `log2` example discussed in the Debugging UDFs section.
Drill has the job of starting with a SQL statement of the form:

```sql
SELECT log2(4) FROM (VALUES (1))
```
From this it must produce the generated code discussed in the previous section. A key aspect of that task is to find an implementation for the `log2` function. To do that, Drill creates an internal function registry of all functions known to Drill. These include "built-in" functions, "static" UDFs (given to Drill as jar files) and "dynamic" UDFs (loaded at run time).
In each case, Drill is given a collection of Java class files. Drill has to sort through them all to find those that implement Drill functions (including UDFs). To do that, Drill requires that functions include an annotation:
```java
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;

@FunctionTemplate(
    name = "log2",
    scope = FunctionScope.SIMPLE,
    nulls = NullHandling.NULL_IF_NULL)
public static class Log2Function implements DrillSimpleFunc {
```
Note that the name of the annotation, `@FunctionTemplate`, reminds us that this is a template for generating the function code; it is not the function itself. That is, the Java class we create uses a simplified DSL that Drill translates into generated Java code.
The implementation class name must follow Java naming conventions and cannot be the same as that of an existing class. So, we can't name our class "log2" (as class names must start with an upper-case letter). Instead, the name of the class is independent of the function name. We set the function name using the `name` argument of the `@FunctionTemplate` annotation.
Function names need to follow very few rules:
- Cannot be blank or null.
- Should be a valid SQL identifier.
- Should not duplicate a SQL keyword.
- Must not duplicate a Drill built-in function or a UDF you have imported (except that you can declare multiple functions with the same name but different argument types).
You can use an identifier which is invalid for SQL, but if so, you must escape the name. For example, we can name our function something like `++` as long as we quote it:

```sql
SELECT `++`(1) FROM (VALUES (1));
```
Quoting is, however, a nuisance, so instead follow the naming rules to choose a valid name which does not duplicate an existing function name or SQL keyword.
If you are using the ((link needed)) source-based debugging techniques, you can ask your IDE to display the source code for `FunctionTemplate`. Doing so is well worth the effort: this is one of the few classes in Drill that provides good documentation.
In that documentation we find that the annotation supports an alternative form for names: the `names` field, which can hold a list of aliases. For example, for the functions that add dates:
```java
@FunctionTemplate(names = {"date_add", "add"}, ...
```
You can consider this approach if you want to rename a function without breaking existing queries: just give your function two names, the old one and the new one.
By the way, this points out another handy technique. If you have built Drill from sources, you can review Drill's own function implementations for additional hints and ideas. Look in `java-exec/target/generated-sources/` in the `org.apache.drill.exec.expr.fn.impl` package and its sub-packages. The above example comes from `CastDecimal9VarChar`. Note, however, that these classes are generated, so you won't find them in GitHub; you have to build Drill to see them.
Drill supports two kinds of functions: simple row-by-row functions (`FunctionScope.SIMPLE`) and group-based aggregates (`FunctionScope.POINT_AGGREGATE`). Most of our examples focus on the simple functions. We'll cover aggregates in a later section ((need link)) once we've mastered simple functions.
Drill defines two additional aggregate types: `HOLISTIC_AGGREGATE` and `RANGE_AGGREGATE`. But a search of the code suggests that they are never used, so we won't discuss them further.
Every Drill function must implement an interface that matches the `scope` setting. Here we use the `DrillSimpleFunc` interface, defined as:

```java
public interface DrillSimpleFunc extends DrillFunc {
  public void setup();
  public void eval();
}
```
We will discuss the methods in the next section.
The relationship between scope and interface is:

| Scope | Interface |
|---|---|
| `SIMPLE` | `DrillSimpleFunc` |
| `POINT_AGGREGATE` | `DrillAggFunc` |
In the theory section, we discussed the most common rule for handling nulls: "if any input value is null, then the output is also null." The rule is so common that Drill will do the work for you, as shown in the simplified examples on that page. The result is that, if your function follows that rule, your function is called only when all the inputs are non-null. To select this mode, we specify `nulls = NullHandling.NULL_IF_NULL`.
On the other hand, there may be cases where that rule does not apply. For example, suppose we are implementing the Boolean `OR` operation. If the first argument is `TRUE`, then the result is `TRUE` regardless of the value of the second argument. In this case, we specify `nulls = NullHandling.INTERNAL`.
As discussed in the code generation section, Drill will automatically inline our code. Drill needs to know how to find our input parameters and output value. We do that using annotations:
```java
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.Float8Holder;

public class Log2Function implements DrillSimpleFunc {
  @Param public Float8Holder x;
  @Output public Float8Holder out;
```
The `@Param` annotation declares the parameters in the order they are passed to the function. (We'll see a more advanced example later.) The `@Output` annotation declares the return value (the output). (See also the Drill documentation.)
The fields in the example are declared as `public`, but those in the Drill example use the default (package-private) access. Which is right? As it turns out, either is fine: Drill never actually uses your compiled code. We have marked them `public` so we can more easily create unit tests.
As shown in the documentation, you can add a `constant` field to the `@Param` annotation. At present, the `constant` attribute is used only to enforce that the argument is, in fact, a constant rather than a column reference. If it is not, you'll see this error in the log:

```
The argument 'yourArg' of Function 'yourFunc' has to be constant!
```
The example from the documentation:

```java
public class SimpleMaskFunc implements DrillSimpleFunc {
  @Param NullableVarCharHolder input;
  @Param(constant = true) VarCharHolder mask;
  @Param(constant = true) IntHolder toReplace;
```
The fields must be one of Drill's "holder" types. Drill works backward from the type of holder to the data type (and cardinality) of the input and output. For example, the use of `Float8Holder` above tells Drill that our function accepts a non-nullable `FLOAT8` argument and returns the same type. Holders are explained in depth in Data Types and Holders for UDFs.
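The holder idea can be sketched in plain Java. The class below is a hypothetical stand-in for Drill's `Float8Holder` (the real one lives in `org.apache.drill.exec.expr.holders`); it shows how a UDF reads `x.value` and writes `out.value`:

```java
// Hypothetical stand-in for Drill's Float8Holder: a mutable box
// whose public 'value' field carries one FLOAT8 value.
class Float8Box {
  public double value;
}

class HolderSketch {
  // Mirrors the log2 UDF body: read the input holder's value,
  // compute, and write the result into the output holder.
  static double log2(double input) {
    Float8Box x = new Float8Box();
    Float8Box out = new Float8Box();
    x.value = input;
    out.value = Math.log(x.value) / Math.log(2.0);
    return out.value;
  }

  public static void main(String[] args) {
    System.out.println(log2(4.0));
  }
}
```

Nullable holders carry an `isSet` flag alongside `value`; see Data Types and Holders for UDFs for the full list.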
Drill (and SQL) support overloaded functions: overloading occurs when two or more functions have the same name but distinct parameter types. Thus, the name does not uniquely identify a function; instead, think of the function identity as:

```
name(arg1_type, arg2_type, ...)
```

Thus, the identity of our example `log2` function is `log2(FLOAT8)`.
We could also define other versions, using the same name, if we wanted:

```
log2(INT-REQUIRED)
log2(INT-OPTIONAL)
log2(BIGINT-REQUIRED)
log2(DECIMAL9-REQUIRED)
```
Drill does a search ((need link)) to find the best match. If no exact match is found, Drill will attempt to convert parameters from their actual types to the type needed by the method. For example, if we call our `log2()` function on an `INT` column, Drill will convert the `INT` value to a `FLOAT8`. Providing multiple versions (in this case, one that takes an `INT` argument) can potentially save the (negligible) cost of the conversion. Of course, in some cases (such as the `DECIMAL9` version) we may not want Drill to convert a decimal to a float. Good testing will help you determine when you need to provide multiple versions of your function.
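The conversion Drill performs is analogous to Java's own widening during method overloading. A plain-Java sketch of the idea (this is not Drill code; the method names are illustrative):

```java
class OverloadSketch {
  // Analogue of log2(FLOAT8): accepts a double.
  static double log2(double x) {
    return Math.log(x) / Math.log(2.0);
  }

  // Analogue of log2(INT-REQUIRED): a dedicated int version that
  // makes the int-to-double widening explicit.
  static double log2(int x) {
    return Math.log((double) x) / Math.log(2.0);
  }

  public static void main(String[] args) {
    int n = 8;
    // Java picks log2(int) here. If only log2(double) existed,
    // n would be silently widened to double -- just as Drill
    // converts an INT column to FLOAT8 when only the FLOAT8
    // version of the UDF is registered.
    System.out.println(log2(n));
  }
}
```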
Drill also considers cardinality (AKA "mode" or nullability) as part of the argument type. For example, in the list above, versions of `log2` appear for both nullable and non-nullable `INT`. Generally, if you let Drill handle nulls, you need only create the required versions.
Functions sometimes need temporary internal state; aggregates, for example, need a place to store running totals. Drill provides the `@Workspace` annotation to indicate such fields. For example, in the `log2` function, we might want to pre-compute part of the expression:

```java
public class Log2Function implements DrillSimpleFunc {
  @Workspace private double LOG_2;
```
A very common, and mysterious, mistake is to omit the `@Workspace` annotation. If you do that, you will get a spectacular set of error messages in your log file, along with multiple copies of the generated source code. (See the troubleshooting section.) If you see that, verify that every field has one of the three annotations: `@Param`, `@Output` or `@Workspace`.
Note also that, in normal Java, it is common to declare constants for cases such as that shown above:
```java
public class FunctionImpl {
  // DO NOT do this!
  private static final double LOG_2 = Math.log(2.0D);

  public static final double log2(double x) {
    return Math.log(x) / LOG_2;
  }
}
```
But, constants do not work with UDFs. You won't get an error, but your constant will never be initialized, resulting in mysterious errors.
While the `@Param` and `@Output` fields must be declared as holders, the `@Workspace` fields can be of any (supported) type. (But, as shown above, they must be instance fields, not `static`.)
When working with `VARCHAR` or `VARBINARY` data you must negotiate with Drill a place to store the data using the `@Inject` annotation. We'll discuss this annotation in the Working with VARCHAR Data in UDFs section.
A function must provide a `setup()` method, but it can be empty. For simple functions, `setup()` is handy for one-time initialization. For example, to define the "constant" discussed above:
```java
public class Log2Function implements DrillSimpleFunc {
  @Workspace private double LOG_2;

  public void setup() {
    LOG_2 = Math.log(2.0D);
  }
```
For simple functions, `setup()` is called once per batch. Actually, the method is not "called." Instead, just like the `eval()` method, it is inlined into the generated code. Recall that code generation creates two methods, also called `setup()` and `eval()`. Your function's `setup()` method is inlined into the generated `setup()` method. (For this to work, the workspace fields are also inlined.)
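Conceptually, the generated code behaves like the following harness (a sketch only; Drill's actual generated class looks quite different):

```java
class GeneratedSketch {
  // The inlined @Workspace field.
  static double LOG_2;

  // Body of the UDF's setup(), inlined: runs once per batch.
  static void doSetup() {
    LOG_2 = Math.log(2.0);
  }

  // Body of the UDF's eval(), inlined: runs once per row.
  static double doEval(double x) {
    return Math.log(x) / LOG_2;
  }

  public static void main(String[] args) {
    double[] batch = {1.0, 2.0, 4.0, 8.0};
    doSetup();                    // once per batch
    for (double row : batch) {    // once per row
      System.out.println(doEval(row));
    }
  }
}
```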
The `setup()` method plays a large role in aggregate functions.
The core of a UDF is the `eval()` method, called once per row. For example:
```java
@Override
public void eval() {
  out.value = Math.log(x.value) / LOG_2;
}
```
As noted, the `eval()` method is not actually "called"; it is instead copied inline into the generated code. This is why Drill must have visibility to both your class (jar) files (in order to get the annotations) and your source code (to inline the methods).
Although your `eval()` method is inlined, it cannot access any information other than that passed in via the `@Param` arguments. Since the `@Param` arguments are holders, your function has no visibility to Drill value vectors or other internals.
We write UDFs using Java, but are restricted to a small subset. One restriction is that Drill will not honor Java `import` statements. Suppose we had an existing `MathUtils.log2()` function we wanted to use. We cannot do this:

```java
// DO NOT do this!
import org.yoyodyne.utils.MathUtils;
...

@Override
public void eval() {
  out.value = MathUtils.log2(x.value);
}
```
Doing the above will also result in a spectacular set of error messages in your Drill log file. Instead, you must put a fully-qualified reference inline so that Drill properly copies it to the generated code:
```java
@Override
public void eval() {
  // Use fully-qualified references
  out.value = org.yoyodyne.utils.MathUtils.log2(x.value);
}
```
Suppose that, like the above case, we want to define our function in another class. Suppose this time the class is in the same package as our UDF. Normally in Java we can just say:
```java
@Override
public void eval() {
  // DON'T DO THIS!
  // Implicit references to same-package classes not supported.
  out.value = MathUtils.log2(x.value);
}
```
The same error will result as if we used `import`. Why? Drill simply copies the code, but puts it into Drill's own `org.apache.drill.exec.test.generated` package. Since the code is now in a different package, Java can no longer locate our `MathUtils` class, and a Java compile error results when Drill compiles the generated code.
The solution is to always use fully-qualified references to all classes except for those defined in the JDK, or the subset of Java classes that Drill will import for you. (The set of imported classes are the holders and a few others.)
So, the proper form is:

```java
package org.apache.drill.contrib.mathUtils;
...

@Override
public void eval() {
  // Use fully-qualified references
  out.value = org.apache.drill.contrib.mathUtils.MathUtils.log2(x.value);
}
```
Drill uses a fantastically complex process to reverse engineer your source code, forward engineer (that is, generate) the Java code, compile the Java, then merge byte codes. This process is (supposed to be) fast, but it is very fragile. In general, the process cannot handle complex Java and is designed for very simple declarative statements with shallow nesting. Said another way, don't try to implement K-means directly in a UDF.
It turns out, however, that once we are aware of the limitations, an easy workaround exists. Simply put all the complex code into another function, and leave the UDF as a wrapper that calls our function. That way, Drill will muck about only with the wrapper, not with the actual function implementation.
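For example, the heavy lifting can live in an ordinary class. The helper below is hypothetical; in a real UDF the call would use the helper's fully-qualified name, as shown earlier:

```java
// Hypothetical helper class holding the real logic. Drill never
// rewrites this class, so it may freely use imports, static
// fields, and arbitrarily complex Java.
class MathHelperSketch {
  private static final double LOG_2 = Math.log(2.0);

  public static double log2(double x) {
    return Math.log(x) / LOG_2;
  }
}

class WrapperSketch {
  // Stands in for the UDF's eval(): a one-line wrapper.
  // In a real UDF this would be a fully-qualified call, e.g.
  // out.value = org.yoyodyne.utils.MathUtils.log2(x.value);
  static double eval(double x) {
    return MathHelperSketch.log2(x);
  }

  public static void main(String[] args) {
    System.out.println(eval(4.0));
  }
}
```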
Because of the way that UDFs are copied into generated code, they have access to nothing inside Drill except the holders. In particular, they cannot access:
- The operator that runs the code.
- The value vectors that hold other rows within the current batch.
- Values other than those passed as parameters.
On the one hand, the above restrictions make UDFs quite simple and robust (once we get them running.) On the other, the restrictions limit what you can accomplish with a UDF.
A quick summary of (some of the) restrictions on a UDF includes:
- All fields must have annotations.
- No static or final fields.
- No imports or implicit same-package class references.
- No complex code.
- No access to any Drill internals other than the arguments to the function.
- UDFs can access nullable or non-nullable columns, but not repeated (array) columns.
- Do not call methods on the holder objects.
- Do not pass holder objects to other methods.
- Function classes cannot extend other classes or implement other interfaces; they may implement only the Drill function interfaces described here.
- No `switch` statements that refer to enum types.
Presumably, all this extra work for the developer pays off in slightly faster runtime because, again presumably, Drill can generate better code than Java can (a highly dubious proposition, but there we have it.)
If you must use a forbidden feature, you can use the workaround which Drill itself uses: move your logic to a `public static` method of another class, and call that method from within your UDF.