EVF Tutorial External Schema
Drill 1.16 introduced the CREATE SCHEMA statement, which lets you specify a schema for a file.
The changes made thus far to the log reader make it easy to use such a schema: we change just one line of code in LogFormatPlugin:
@Override
protected FileScanBuilder frameworkBuilder(
    OptionManager options, EasySubScan scan) throws ExecutionSetupException {
  ...
  // This plugin was created before the concept of "provided schema" was
  // available. Use the schema obtained from config as the provided schema.
  // However if a schema is provided, use that instead. No attempt is made
  // to merge the two schemas: a provided schema simply replaces that defined
  // in the plugin config. The normal use case would be to define columns in
  // the plugin config, types in the provided schema.
  TupleMetadata finalSchema = scan.getSchema() == null ? outputSchema : scan.getSchema();
  builder.typeConverterBuilder().providedSchema(finalSchema);
For most plugins you don't even need to do this much: if you use the Easy framework, it passes the provided schema along automatically. Here we had to add code because we generate an output schema from the plugin config; most other plugins do not work that way.
Let's make sure this works as intended. We can add a unit test to TestLogReader:
private static void defineRegexPlugin() throws ExecutionSetupException {
  ...
  // Config with no type info. Types
  // will be provided via the provided schema mechanism. Column names
  // are required so that the format and provided schemas match up.
  LogFormatConfig untypedConfig = new LogFormatConfig();
  untypedConfig.setExtension("logu");
  untypedConfig.setRegex(DATE_ONLY_PATTERN);
  untypedConfig.setSchema();
  untypedConfig.getSchema().add(new LogFormatField("year"));
  untypedConfig.getSchema().add(new LogFormatField("month"));
  untypedConfig.getSchema().add(new LogFormatField("day"));
  ...
}
@Test
public void testProvidedSchema() throws Exception {
  ...
  try {
    // The provided schema feature is experimental: enable it for this session.
    client.alterSession(ExecConstants.STORE_TABLE_USE_SCHEMA_FILE, true);
    String schemaSql = "create schema (`year` int not null, `month` int not null, " +
        "`day` int not null) " +
        "for table " + tablePath;
    run(schemaSql);
    String sql = "SELECT * FROM %s";
    RowSet results = client.queryBuilder().sql(sql, tablePath).rowSet();
    // Expect non-nullable INT columns, as declared in the provided schema.
    BatchSchema expectedSchema = new SchemaBuilder()
        .add("year", MinorType.INT)
        .add("month", MinorType.INT)
        .add("day", MinorType.INT)
        .build();
    RowSet expected = client.rowSetBuilder(expectedSchema)
        .addRow(2017, 12, 17)
        .addRow(2017, 12, 18)
        .addRow(2017, 12, 19)
        .build();
    RowSetUtilities.verify(expected, results);
  } finally {
    // Always turn the option back off, even if the test fails.
    client.resetSession(ExecConstants.STORE_TABLE_USE_SCHEMA_FILE);
  }
}
A bunch of tedious setup code is omitted here. We simply:
- Define a format plugin config with field names, but no type information.
- Set a session option to enable the provided schema, which is still experimental. A try/finally block ensures the option is reset at the end of the test.
- Define the schema using CREATE SCHEMA.
- Run a query.
- Verify that the results are of the types specified in the schema.
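The cleanup step in the list above can be shown in isolation. The following is a minimal sketch in plain Java, not Drill code: the class `SessionOptionDemo`, its methods, and the option key are hypothetical stand-ins for the client fixture and `ExecConstants.STORE_TABLE_USE_SCHEMA_FILE` used in the test. It illustrates why the reset belongs in a `finally` clause: the option is cleared even when the body throws.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a client session; this is not Drill's test fixture.
final class SessionOptionDemo {
  // Hypothetical option key, used only for this illustration.
  static final String USE_SCHEMA_FILE = "hypothetical.use_schema_file";

  private final Map<String, Object> overrides = new HashMap<>();

  void alterSession(String key, Object value) { overrides.put(key, value); }

  void resetSession(String key) { overrides.remove(key); }

  boolean isSet(String key) { return overrides.containsKey(key); }

  // Enables the option, runs the body, and resets the option no matter
  // whether the body returns normally or throws.
  static void withOption(SessionOptionDemo session, String key, Runnable body) {
    session.alterSession(key, true);
    try {
      body.run();
    } finally {
      session.resetSession(key);
    }
  }

  public static void main(String[] args) {
    SessionOptionDemo session = new SessionOptionDemo();
    try {
      withOption(session, USE_SCHEMA_FILE, () -> {
        throw new IllegalStateException("simulated test failure");
      });
    } catch (IllegalStateException e) {
      // The failure still propagates; only the cleanup is guaranteed.
    }
    System.out.println("option still set: " + session.isSet(USE_SCHEMA_FILE));
  }
}
```

Without the `finally` clause, a failed assertion would leave the experimental option enabled for any later test that shares the session.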
The log reader config allows you to specify a type, but not a mode: all columns are nullable. The provided schema lets you specify a mode; here we chose not null (called REQUIRED in Drill code).
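To make the distinction between the two modes concrete, here is a small sketch in plain Java. It borrows the mode names OPTIONAL (Drill's name for nullable) and REQUIRED, but the class `ColumnModeSketch` is hypothetical and is not Drill's vector or column writer code. It also assumes that a not null column rejects a null outright; how the log reader itself treats a missing value for a REQUIRED column is not covered here.

```java
// Hypothetical model of a single INT column with a mode; not Drill code.
final class ColumnModeSketch {
  enum Mode { OPTIONAL, REQUIRED }

  private final String name;
  private final Mode mode;
  private Integer value;

  ColumnModeSketch(String name, Mode mode) {
    this.name = name;
    this.mode = mode;
  }

  // OPTIONAL (nullable) columns accept null.
  // REQUIRED (not null) columns reject it (an assumption of this sketch).
  void set(Integer newValue) {
    if (newValue == null && mode == Mode.REQUIRED) {
      throw new IllegalArgumentException("Column " + name + " is declared not null");
    }
    value = newValue;
  }

  Integer get() { return value; }

  public static void main(String[] args) {
    ColumnModeSketch nullable = new ColumnModeSketch("year", Mode.OPTIONAL);
    nullable.set(null); // accepted: the column is nullable

    ColumnModeSketch required = new ColumnModeSketch("year", Mode.REQUIRED);
    required.set(2017); // accepted: a real value
    try {
      required.set(null); // rejected: the column is not null
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```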
If you define a column name in the plugin config but do not provide a schema for that column, its type defaults to VARCHAR.
As noted in the code, because of the unique way that the log format plugin works, if you specify a provided schema, then any type information in the format plugin is ignored. (The code does not attempt to merge the types, though it could.)
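For illustration only, here is one way such a merge could look. This is a sketch in plain Java, not what the plugin does: `SchemaMergeSketch` and its `merge` method are hypothetical, and types are modeled as plain strings rather than Drill's `TupleMetadata`. The precedence it assumes is: a type from the provided schema wins, then a type from the plugin config, then the VARCHAR default.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-column merge of config and provided schemas; not Drill code.
final class SchemaMergeSketch {
  static final String DEFAULT_TYPE = "VARCHAR";

  // Column names and their order come from the plugin config. For each column:
  // provided type first, then config type, then the default.
  static Map<String, String> merge(
      List<String> configColumns,
      Map<String, String> configTypes,
      Map<String, String> providedTypes) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (String name : configColumns) {
      String type = providedTypes.get(name);
      if (type == null) {
        type = configTypes.get(name);
      }
      if (type == null) {
        type = DEFAULT_TYPE;
      }
      merged.put(name, type);
    }
    return merged;
  }

  public static void main(String[] args) {
    List<String> columns = Arrays.asList("year", "month", "day");

    Map<String, String> configTypes = new HashMap<>();
    configTypes.put("month", "INT");  // typed in the plugin config only

    Map<String, String> providedTypes = new HashMap<>();
    providedTypes.put("year", "INT"); // typed in the provided schema only

    // "day" has no type in either place, so it falls back to VARCHAR.
    System.out.println(merge(columns, configTypes, providedTypes));
  }
}
```

The actual plugin takes the simpler route described above: when a provided schema exists, it replaces the config schema wholesale.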