Debugging UDFs
So far we've explained how UDFs work internally and the structure of a UDF. It is time to move to the more practical concern of how we develop a UDF. Drill provides documentation about how to create a User Defined Function (UDF). The information is procedural and walks you through the steps. Here we expand on those basics to relate the required code to our background discussion. We will also show how to quickly debug a UDF using Drill's own test tools.
Drill provides functions to compute natural (log) and decimal (log10) logarithms. Sometimes it is useful to compute the base-2 log. We could just write out the formula in each query:
SELECT log(x) / log(2) as myLog FROM ...
But, we are lazy and want to create a log2 function to do the job:
SELECT log2(x) as myLog FROM ...
The code here is available on GitHub ((need link.))
While Drill's UDF mechanism is optimized for execution speed, it is, shall we say, sub-optimal for development speed. Frankly, a drawback of Drill's interface is that UDFs are very hard to unit test. (We all unit test our code before bolting it onto Drill, don't we? Good, I thought so.)
The documentation explains how to create a project external to Drill to hold your UDF. This is certainly the form you want to use once your code works. But, to debug your UDF, and to look at the source code referenced here, we have to use an alternative structure temporarily.
Drill provides no API in the normal sense. Instead, Drill provides all sources (it is open source.) Drill assumes that each developer (of a UDF, or storage plugin, etc.) will use the sources needed for that project.
Drill also provides testing tools that we will want to use. But, because of the way Drill has been set up to work with Maven, those tools are available only if your code lives within Drill's java-exec package. (Yes, Drill could use some work to improve its API. Volunteers?)
So, the easiest path when learning UDFs, or working on complex cases, is to debug the UDF in the context of a full Drill build.
One key benefit of working with the Drill project: we can traverse to the source for the classes, interfaces and enums mentioned in these notes.
So, for our function, we will create the following new Java package within java-exec: org.apache.drill.exec.expr.contrib.udfExample.
When using this debugging technique, it is vital that you use a package within org.apache.drill.exec.expr or Drill will not find your function and you'll get lots of very cryptic errors. ((Need link to troubleshooting.)) Use of the org.apache.drill.exec.expr.contrib package ensures that your code will not collide with Drill's own code. You can move the code to a different package later for deployment. ((Need link.))
The technique shown here works with the open source Apache Drill for which source code is available. If you work with a commercial distribution, source is unavailable to you. You have two choices:
- Skip this step, and simply code your function and deploy it to a working cluster.
- Do your development with Apache Drill, then deploy to your commercial cluster.
Steps to set up the environment:
- Download and build Drill as explained in the documentation.
- Using your favorite Git tool, create a new branch from master called udf-example.
- Use mvn clean install to build Drill from sources.
- Load Drill into your favorite IDE (IntelliJ or Eclipse).
- Within drill-java-exec, under src/main/java, create the org.apache.drill.exec.expr.contrib.udfExample package.
- Within drill-java-exec, under src/test/java, also create the org.apache.drill.exec.expr.contrib.udfExample package.
Why have we done this? So we can now follow good test-driven development (TDD) practice and start with a test. Let's deviate from TDD a bit and first create a test that passes using the test framework.
In Eclipse:
- Select the test package we just created.
- Choose New → JUnit Test Case.
- Name: ExampleUdfTest.
- Superclass: org.apache.drill.test.ClusterTest.
- Click Finish.
You now have a blank test case. We need to do two things to get started.
First, we must put the Apache copyright at the top of the file. Just pick any other Java file in Drill and copy the copyright notice. (If you forget to do that, Drill's build will fail when next you build from Maven.)
Then, we need a magic bit of code that will start an embedded Drillbit for us.
Drill has a very clever mechanism to register functions that builds up a list of functions at build time. Unfortunately, that step is done only by Maven, not your IDE. So, for your function to be seen by Drill, you must disable class path caching in your tests using the config setting shown in the test:
public class ExampleUdfTest extends ClusterTest {

  @ClassRule
  public static final BaseDirTestWatcher dirTestWatcher = new BaseDirTestWatcher();

  @BeforeClass
  public static void setup() throws Exception {
    // Disable the class path scan cache so functions compiled by the IDE are found.
    ClusterFixtureBuilder builder = ClusterFixture.builder(dirTestWatcher)
        .configProperty("drill.classpath.scanning.cache.enabled", false);
    startCluster(builder);
  }
}
(The need for the dirTestWatcher may be removed in an upcoming commit. Once it is, you can use the one in a super class.)
Next, let's create a demo test:
@Test
public void demoTest() {
  String sql = "SELECT * FROM `cp`.`employee.json` LIMIT 3";
  client.queryBuilder().sql(sql).printCsv();
}
Run this test as a JUnit test and verify that it does, in fact, print three lines of output. If so, you have verified that you have a working Drill environment. Also, we now have a handy fixture to use to exercise our UDF as we build it.
Let's follow the UDF Semantics material to define our UDF. We'll do it two ways: first in the "traditional" format, then again using the suggested framework ((Need link.))
Place the code into the src/main/java version of the package:
package org.apache.drill.exec.expr.contrib.udfExample;

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.Float8Holder;

@FunctionTemplate(
    name = "log2",
    scope = FunctionScope.SIMPLE,
    nulls = NullHandling.NULL_IF_NULL)
public class Log2Function implements DrillSimpleFunc {

  @Workspace private double LOG_2;
  @Param public Float8Holder x;
  @Output public Float8Holder out;

  @Override
  public void setup() {
    // Computed once, then reused by every eval() call.
    LOG_2 = Math.log(2.0D);
  }

  @Override
  public void eval() {
    out.value = Math.log(x.value) / LOG_2;
  }
}
Debugging UDFs can often be a black box. Sometimes things work and sometimes they don't. It can be hard to know where to look for the problem. One way to reduce the frustration is to test early and often. It both verifies our code and builds our confidence that we are, in fact, on the right path.
Here we will test the annotation just created. This lets us look at the function the way Drill does.
@Test
public void testAnnotation() {
  Class<? extends DrillSimpleFunc> fnClass = Log2Function.class;
  FunctionTemplate fnDefn = fnClass.getAnnotation(FunctionTemplate.class);
  assertNotNull(fnDefn);
  assertEquals("log2", fnDefn.name());
  assertEquals(FunctionScope.SIMPLE, fnDefn.scope());
  assertEquals(NullHandling.NULL_IF_NULL, fnDefn.nulls());
}
The code grabs the class we just created, fetches the annotation, and verifies the three values we set. You can use a variation on this theme to use your debugger (or print statements) to look at the annotation fields.
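The getAnnotation() lookup returns a value only for annotations that Java retains at run time. The stand-alone sketch below shows the same kind of lookup with made-up types: FnTemplate, DemoFunction and AnnotationProbe are illustrative names, not Drill classes, and the sketch assumes nothing about how Drill declares FunctionTemplate.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical stand-in for a function annotation; not a Drill type.
@Retention(RetentionPolicy.RUNTIME) // without RUNTIME, getAnnotation() returns null
@Target(ElementType.TYPE)
@interface FnTemplate {
  String name();
  boolean nullIfNull() default true;
}

// Hypothetical annotated class, playing the role of the UDF.
@FnTemplate(name = "demo")
final class DemoFunction { }

final class AnnotationProbe {

  // Renders the annotation fields as text, or reports that none was found.
  static String describe(Class<?> cls) {
    FnTemplate defn = cls.getAnnotation(FnTemplate.class);
    if (defn == null) {
      return cls.getSimpleName() + ": no annotation";
    }
    return cls.getSimpleName() + ": name=" + defn.name()
        + ", nullIfNull=" + defn.nullIfNull();
  }

  public static void main(String[] args) {
    System.out.println(describe(DemoFunction.class));
    System.out.println(describe(String.class));
  }
}
```

A helper like describe() is a convenient place to set a breakpoint when a function is not being recognized and you want to see exactly which annotation values are visible.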
Next we add a test for the function itself, calling it as Drill does (again, this is not really what Drill does, since Drill inlines the function, but it is close):
@Test
public void testFn() {
  Log2Function logFn = new Log2Function();
  logFn.setup();
  logFn.x = new Float8Holder();
  logFn.out = new Float8Holder();
  logFn.x.value = 2;
  logFn.eval();
  assertEquals(1.0D, logFn.out.value, 0.001D);
}
The above is perfectly fine, but tedious. What if we want to test ten different values? To make life easier, we can add some helper methods to the test class. (One might think that these methods should go onto the UDF class. But, to avoid problems with Drill's code generation engine, let's avoid adding anything extra to the UDF.)
// Creates a function instance with its holders allocated and setup() called.
private static Log2Function instance() {
  Log2Function fn = new Log2Function();
  fn.x = new Float8Holder();
  fn.out = new Float8Holder();
  fn.setup();
  return fn;
}

// Evaluates the function for a single input value.
public static double call(Log2Function fn, double x) {
  fn.x.value = x;
  fn.eval();
  return fn.out.value;
}

@Test
public void testFn() {
  Log2Function logFn = instance();
  assertEquals(1D, call(logFn, 2), 0.001D);
  assertEquals(2D, call(logFn, 4), 0.001D);
  assertEquals(-1D, call(logFn, 1.0/2.0), 0.001D);
}
Much better: we can now easily test all interesting cases.
(If you are following along, you should now experience the beauty of this form of testing: we are always just seconds away from running our next test, and we can easily step through the code.)
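Edge cases deserve a few of those seconds. The sketch below uses plain Java rather than the Drill classes, so it runs anywhere; Log2EdgeCases and its methods are illustrative names that simply mirror the arithmetic in eval(). It shows why the tests above compare with a tolerance, and what the formula yields for zero and negative inputs, which may or may not be the behavior you want from a SQL function.

```java
// Hypothetical stand-alone class that mirrors the arithmetic in eval().
final class Log2EdgeCases {

  private static final double LOG_2 = Math.log(2.0D);

  static double log2(double x) {
    return Math.log(x) / LOG_2;
  }

  // True when actual is within tolerance of expected.
  static boolean close(double expected, double actual, double tolerance) {
    return Math.abs(expected - actual) <= tolerance;
  }

  public static void main(String[] args) {
    // Powers of two: compare with a tolerance, since the result is a
    // quotient of two rounded floating-point values.
    for (int exp = -3; exp <= 10; exp++) {
      double x = Math.pow(2, exp);
      System.out.println(x + " ok=" + close(exp, log2(x), 0.001D));
    }
    // Math.log(0) is negative infinity; Math.log of a negative number is NaN.
    System.out.println(log2(0));
    System.out.println(Double.isNaN(log2(-1)));
  }
}
```

The same inputs can be fed through the call() helper above to confirm that the UDF class behaves identically.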
The next step is to test the function with Drill itself. Because our code is within Drill, and we are using a test framework that starts the server, we need only add a test:
@Test
public void testIntegration() {
  // Call the function by the name given in the @FunctionTemplate annotation.
  String sql = "SELECT log2(4) FROM (VALUES (1))";
  client.queryBuilder().sql(sql).printCsv();
}
Before we run the test, we have to take one additional (if rather cryptic) step. Recall that we said that Drill does not call our code, but rather copies bits of our code inline into the generated code. In order to do that, Drill needs visibility to the class source code. (Drill still uses the compiled code, but only to find the class and extract information from the annotations.) So, in order to prevent mysterious failures, we have to add our source code to the runtime class path.
In Eclipse:
- Run → Debug configurations...
- Pick your test configuration
- Classpath tab
- Select the User Entries node
- Click Advanced...
- Select Add Folders and click OK
- Find the drill-java-exec project in the list
- Expand this node and select src/main/java
- Repeat and select target/generated-sources
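If you are unsure whether the class path change took effect, a small diagnostic can help. The sketch below assumes that Drill locates the function source as a class path resource named after the class (for example /a/b/C.java); that is an assumption about Drill's internals, and SourceOnClasspath is a made-up helper, not part of Drill.

```java
import java.net.URL;

// Hypothetical diagnostic helper; not part of Drill.
final class SourceOnClasspath {

  // Maps a class name to the resource path of its source file,
  // for example "a.b.C" becomes "/a/b/C.java". Nested-class
  // suffixes ("$Inner") are dropped, since they share the outer file.
  static String sourcePath(String className) {
    int nested = className.indexOf('$');
    String outer = nested < 0 ? className : className.substring(0, nested);
    return "/" + outer.replace('.', '/') + ".java";
  }

  // True if the .java file for the class is visible as a resource.
  static boolean isSourceVisible(Class<?> cls) {
    URL url = cls.getResource(sourcePath(cls.getName()));
    return url != null;
  }

  public static void main(String[] args) {
    Class<?> cls = SourceOnClasspath.class;
    System.out.println(sourcePath(cls.getName()) + " visible=" + isSourceVisible(cls));
  }
}
```

Called from a test as isSourceVisible(Log2Function.class), a false result suggests that the source folders were not added to the launch configuration.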
Once this is done, we can now run our test. We should see the following output:
Running org.apache.drill.exec.expr.contrib.udfExample.AdHocTest#test
1 row(s):
EXPR$0<FLOAT8(REQUIRED)>
2.0
Total rows returned : 1. Returned in 1803ms.
If so, then Hawt Dog!, it worked!
Basically the same process is used to test a UDF that follows the simplified framework. See that section for how the framework simplifies testing of the function itself.
The integration test shown above was not much of a test; it just dumped the query results to the console. In general, we want to verify the results. There are many ways to do so. If we are debugging within Drill itself, we can use a handy mechanism called the RowSet as shown in the ExampleTest in the Drill source code:
@Test
public void test() throws Exception {
  String sql = "SELECT log2(4) FROM (VALUES (1))";
  RowSet actual = client.queryBuilder().sql(sql).rowSet();

  // The printed output above reports the column as FLOAT8(REQUIRED),
  // so declare it as required rather than nullable.
  BatchSchema expectedSchema = new SchemaBuilder()
      .add("EXPR$0", MinorType.FLOAT8)
      .build();

  RowSet expected = client.rowSetBuilder(expectedSchema)
      .addRow(2.0D)
      .build();

  new RowSetComparison(expected).verifyAndClearAll(actual);
}
The above runs the query as before, but this time captures the output as a RowSet. (A RowSet is just a convenient wrapper around Drill's internal value vectors.) Next we define the schema we expect. Since we didn't give the column an alias, Drill named it for us: EXPR$0. Then comes the fancy bit: we create a second row set that identifies the results we expect. Finally, we compare the two and release their memory.
If you run Drill using the test framework described above, then a quick-and-dirty method to debug your method is to insert print statements:
@Param IntHolder input;
...
System.out.print("Input: ");
System.out.println(input.value);
You can also use Drill's logging mechanism. However, you can't just use the technique you'll see in Drill's own code:
public class Log2Function implements DrillSimpleFunc {
  // DO NOT do this!
  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(DrillSimpleFunc.class);
  ...
  @Override
  public void eval() {
    logger.debug("log2( {} )", x.value);
    out.value = Math.log(x.value) / LOG_2;
  }
We can't include static members, remember? Instead, we can use the technique described in Simplified UDF Framework to split our class into two:
package org.apache.drill.exec.expr.contrib.udfExample;
...
public class Log2Impl {
  public static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(Log2Impl.class);
  ...
public class Log2Function implements DrillSimpleFunc {
  ...
  @Override
  public void eval() {
    org.apache.drill.exec.expr.contrib.udfExample.Log2Impl.logger.debug("log2( {} )", x.value);
    out.value = Math.log(x.value) / LOG_2;
  }
Note that we must make the logger public (so it can be called from the Drill-generated code), and we reference it with a fully-qualified name.
Drill uses the Logback Classic logger. Refer to its web site for more information. (Or just scan the Drill source for examples.)
The above is clearly much more tedious than the pure-JUnit tests we can write using the simplified framework. Reserve the above just to verify that your UDF works with Drill.