The Substrait project aims to create a well-defined, cross-language specification for data compute operations. The specification declares a set of common operations, defines their semantics, and describes their behavior unambiguously. The project also defines extension points and serialized representations of the specification.
In many ways, the goal of this project is similar to that of the Apache Arrow project. Arrow is focused on a standardized memory representation of columnar data. Substrait is focused on what should be done to data.
SQL is a well-known language for describing queries against relational data. It is designed to be simple and easy for humans to read and write. Substrait is not intended as a replacement for SQL; it works alongside SQL to provide capabilities that SQL lacks. SQL is not a great fit for the systems that actually satisfy a query, because it does not provide sufficient detail and is not represented in a format that is easy to process. Because of this, most modern systems first translate the SQL query into a query plan, sometimes called the execution plan. There can be multiple levels of a query plan (e.g. physical and logical), a query plan may be split up and distributed across multiple systems, and a query plan often undergoes simplifying or optimizing transformations. The SQL standard does not define the format of the query or execution plan, and there is no open format that is supported by a broad set of systems. Substrait was created to provide a standard and open format for these query plans.
Why not just do this within an existing OSS project?
A key goal of the Substrait project is to not be coupled to any single existing technology. Trying to get people involved in something can be difficult when it seems to be primarily driven by the opinions and habits of a single community. In many ways, this situation is similar to the early situation with Arrow. The precursor to Arrow was the Apache Drill ValueVectors concepts. As part of creating Arrow, Wes and Jacques recognized the need to create a new community to build a fresh consensus (beyond just what the Apache Drill community wanted). This separation and new independent community was a key ingredient to Arrow’s current success. The needs here are much the same: many separate communities could benefit from Substrait, but each has its own pain points, type systems, development processes and timelines. To help resolve these tensions, one of the approaches proposed in Substrait is to set a bar that at least two of the top four OSS data technologies (Arrow, Spark, Iceberg, Trino) support something before incorporating it directly into the Substrait specification. (Another goal is to support strong extension points at key locations to avoid this bar being a limiter to broad adoption.)
Apache Calcite: Many ideas in Substrait are inspired by the Calcite project. Calcite is a great JVM-based SQL query parsing and optimization framework. A key goal of the Substrait project is to expose Calcite capabilities more easily to non-JVM technologies as well as expose query planning operations as microservices.
Apache Arrow: The Arrow format for data is what the Substrait specification attempts to be for compute expressions. A key goal of Substrait is to enable Substrait producers to execute work within the Arrow Rust and C++ compute kernels.
A strait is a narrow passage of water that connects two larger bodies of water. In analytics, data is often thought of as water. Substrait is focused on the instructions related to the data: in other words, what defines or supports the movement of water between one or more larger systems. Thus, the underlayment for the strait connecting different pools of water => sub-strait.
Substrait is developed as a consensus-driven open source product under the Apache 2.0 license. Development is done in the open leveraging GitHub issues and PRs.
Substrait is developed via GitHub issues and pull requests. If you see a problem or want to enhance the product, we suggest you file a GitHub issue for developers to review.
Our website is maintained in our source repository. If there is something you think can be improved, feel free to fork our repository and post a pull request.
Meetings
Our community meets every other week on Wednesday.
In addition to the work maintained in repositories within the substrait-io GitHub organization, a growing list of other open source projects have adopted Substrait.
ADBC (Arrow Database Connectivity) is an API specification for Apache Arrow-based database access. ADBC allows applications to pass queries either as SQL strings or Substrait plans.
Arrow Flight SQL is a client-server protocol for interacting with databases and query engines using the Apache Arrow in-memory columnar format and the Arrow Flight RPC framework. Arrow Flight SQL allows clients to send queries as SQL strings or Substrait plans.
DataFusion is an extensible query planning, optimization, and execution framework, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion provides a Substrait producer and consumer that can convert DataFusion logical plans to and from Substrait plans. It can be used through the DataFusion Python bindings.
DuckDB is an in-process SQL OLAP database management system. DuckDB provides a Substrait extension that allows users to produce and consume Substrait plans through DuckDB’s SQL, Python, and R APIs.
Gluten is a plugin for Apache Spark that allows computation to be offloaded to engines that have better performance or efficiency than Spark’s built-in JVM-based engine. Gluten converts Spark physical plans to Substrait plans.
Ibis is a Python library that provides a lightweight, universal interface for data wrangling. It includes a dataframe API for Python with support for more than 10 query execution engines, plus a Substrait producer to enable support for Substrait-consuming execution engines.
The Substrait R interface package allows users to construct Substrait plans from R for evaluation by Substrait-consuming execution engines. The package provides a dplyr backend as well as lower-level interfaces for creating Substrait plans and integrations with Acero and DuckDB.
Velox is a unified execution engine aimed at accelerating data management systems and streamlining their development. Velox provides a Substrait consumer interface.
To add your project to this list, please open a pull request.
Aggregate functions are functions that define an operation which consumes values from multiple records to produce a single output. In SQL, aggregate functions are typically used with GROUP BY clauses. Aggregate function signatures are similar to scalar function signatures, with a small set of additional properties.
Aggregate function signatures contain all the properties defined for scalar functions. Additionally, they contain the properties below:
| Property | Description | Required |
|----------|-------------|----------|
| Inherits | All properties defined for scalar functions. | N/A |
| Ordered | Whether the result of this function is sensitive to sort order. | Optional, defaults to false |
| Maximum set size | Maximum allowed set size as an unsigned integer. | Optional, defaults to unlimited |
| Decomposable | Whether the function can be executed in one or more intermediate steps. Valid options are: NONE, ONE, MANY, describing how intermediate steps can be taken. | Optional, defaults to NONE |
| Intermediate Output Type | If the function is decomposable, represents the intermediate output type that is used when the function is defined as either ONE or MANY decomposable. Will be a struct in many cases. | Required for ONE and MANY. |
| Invocation | Whether the function uses all or only distinct values in the aggregation calculation. Valid options are: ALL, DISTINCT. | Optional, defaults to ALL |
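To make these properties concrete, here is a hedged sketch of what an aggregate function definition might look like in a simple-extensions YAML file. The avg shape below is modeled on the standard arithmetic extension file; treat the exact field spellings as illustrative rather than normative:

```yaml
aggregate_functions:
  - name: "avg"
    description: "Calculates the average for a set of values."
    impls:
      - args:
          - name: x
            value: fp64
        nullability: DECLARED_OUTPUT
        # MANY: partial aggregates may be computed independently and merged
        decomposable: MANY
        # intermediate state: running sum plus non-null count
        intermediate: "STRUCT<fp64,i64>"
        return: fp64?
```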
When binding an aggregate function, the binding must include the following additional properties beyond the standard scalar binding properties:
| Property | Description |
|----------|-------------|
| Phase | Describes the input type of the data: [INITIAL_TO_INTERMEDIATE, INTERMEDIATE_TO_INTERMEDIATE, INITIAL_TO_RESULT, INTERMEDIATE_TO_RESULT], describing what portion of the operation is required. For functions that are NOT decomposable, the only valid option is INITIAL_TO_RESULT. |
| Ordering | Zero or more ordering keys along with key order (e.g. ASC, DESC, NULLS FIRST), declared similarly to the sort keys in an ORDER BY relational operation. If no sorts are specified, the records are not sorted prior to being passed to the aggregate function. |
Embedded functions are a special kind of function where the implementation is embedded within the actual plan. They are commonly used in tools where a user intersperses business logic within a data pipeline. This is more common in data science workflows than traditional SQL workflows.
Embedded functions are not pre-registered. Embedded functions require that data be consumed and produced with a standard API, may require memory allocation and have determinate error reporting behavior. They may also have specific runtime dependencies. For example, a Python pickle function may depend on pyarrow 5.0 and pynessie 1.0.
Properties for an embedded function include:
| Property | Description | Required |
|----------|-------------|----------|
| Function Type | The type of embedded function presented. | Required |
| Function Properties | Function properties, one of those items defined below. | Required |
| Output Type | The fully resolved output type for this embedded function. | Required |
The binary representation of an embedded function is defined in the Substrait protobuf definitions.
Extended Expression messages are provided for expression-level protocols as an alternative to using a Plan. They mainly target expression-only evaluations, such as those computed in Filter/Project/Aggregation rels. Unlike the original Expression defined in the Substrait protocol, Extended Expression messages require more information to completely describe the computation context, including: input data schema, referenced function signatures, and output schema.
Since an Extended Expression will be used separately from the Plan rel representation, it needs to include basic fields like Version.
```proto
message ExtendedExpression {
  // Substrait version of the expression. Optional up to 0.17.0, required for later
  // versions.
  Version version = 7;
  // (remaining fields elided)
}
```
In Substrait, all fields are dealt with on a positional basis. Field names are only used at the edge of a plan, for the purposes of naming fields for the outside world. Each operation returns a simple or compound data type. Additional operations can refer to data within that initial operation using field references. To reference a field, you use a reference based on the type of field position you want to reference.
| Reference Type | Properties | Type Applicability | Type Return |
|----------------|------------|--------------------|-------------|
| Struct Field | Ordinal position. Zero-based. Only legal within the range of possible fields within a struct. Selecting an ordinal outside the applicable field range results in an invalid plan. | struct | Type of field referenced |
| Array Value | Array offset. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Negative and positive overflows return null values (no wrapping). | list | Type of list |
| Array Slice | Array offset and element count. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Position does not wrap, nor does length. | list | Same type as original list |
| Map Key | A map value that is matched exactly against available map keys and returned. | map | Value type of map |
| Map KeyExpression | A wildcard string that is matched against a simplified form of regular expressions. Requires the key type of the map to be a character type. [Format detail needed, intention to include basic regex concepts such as greedy/non-greedy.] | map | List of map value type |
Masked Complex Expression
An expression that provides a mask over a schema declaring which portions of the schema should be presented. This allows a user to select a portion of a complex object but mask certain subsections of that same object.
A function is a scalar function if that function takes in values from a single record and produces an output value. To clearly specify the definition of functions, Substrait declares an extensible specification plus binding approach to function resolution. A scalar function signature includes the following properties:
| Property | Description | Required |
|----------|-------------|----------|
| Name | One or more user-friendly UTF-8 strings that are used to reference this function. | At least one value is required. |
| List of arguments | Argument properties are defined below. Arguments can be fully defined or calculated with a type expression. See further details below. | Optional, defaults to niladic. |
| Deterministic | Whether this function is expected to reproduce the same output when it is invoked multiple times with the same input. This informs a plan consumer on whether it can constant-reduce the defined function. An example would be a random() function, which is typically expected to be evaluated repeatedly despite having the same set of inputs. | Optional, defaults to true. |
| Session Dependent | Whether this function is influenced by the session context it is invoked within. For example, a function may be influenced by the user who is invoking the function, the time zone of a session, or some other non-obvious parameter. This can inform caching systems on whether a particular function is cacheable. | Optional, defaults to false. |
| Variadic Behavior | Whether the last argument of the function is variadic or a single argument. If variadic, the argument can optionally have a lower bound (minimum number of instances) and an upper bound (maximum number of instances). | Optional, defaults to single value. |
| Nullability Handling | Describes how nullability of input arguments maps to nullability of output arguments. Three options are: MIRROR, DECLARED_OUTPUT and DISCRETE. More details about nullability handling are listed below. | Optional, defaults to MIRROR |
| Description | Additional description of function for implementers or users. Should be written human-readable to allow exposure to end users. Presented as a map with language => description mappings. E.g. { "en": "This adds two numbers together.", "fr": "cela ajoute deux nombres" }. | Optional |
| Return Value | The output type of the expression. Return types can be expressed as a fully-defined type or a type expression. See below for more on type expressions. | Required |
| Implementation Map | A map of implementation locations for one or more implementations of the given function. Each key is a function implementation type. Implementation types include examples such as: AthenaArrowLambda, TrinoV361Jar, ArrowCppKernelEnum, GandivaEnum, LinkedIn Transport Jar, etc. [Definition TBD]. Implementation type has one or more properties associated with retrieval of that implementation. | Optional |
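As a concrete illustration, here is a hedged sketch of a scalar function signature in the simple-extensions YAML format. The add name and i8 implementation are modeled on the standard arithmetic extension file and should be treated as illustrative rather than normative:

```yaml
scalar_functions:
  - name: "add"
    description: "Add two values."
    impls:
      - args:
          - name: x
            value: i8
          - name: y
            value: i8
        # MIRROR: output is nullable iff any input is nullable
        nullability: MIRROR
        return: i8
```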
There are three main types of arguments: value arguments, type arguments, and enumerations. Every defined argument must be specified in every invocation of the function. When specified, the position of these arguments in the function invocation must match the position of the arguments as defined in the YAML function definition.
Value arguments: arguments that refer to a data value. These could be constants (literal expressions defined in the plan) or variables (a reference expression that references data being processed by the plan). This is the most common type of argument. The value of a value argument is not available in output derivation, but its type is. Value arguments can be declared in one of two ways: concrete or parameterized. Concrete types are either simple types or compound types with all parameters fully defined (without referencing any type arguments). Examples include i32, fp32, VARCHAR<20>, List<fp32>, etc. Parameterized types are discussed further below.
Type arguments: arguments that are used only to inform the evaluation and/or type derivation of the function. For example, you might have a function which is truncate(<type> DECIMAL<P0,S0>, <value> DECIMAL<P1, S1>, <value> i32). This function declares two value arguments and a type argument. The difference between them is that the type argument has no value at runtime, while the value arguments do.
Enumeration: arguments that support a fixed set of declared values as constant arguments. These arguments must be specified as part of an expression. While these could also have been implemented as constant string value arguments, they are formally included to improve validation/contextual help/etc. for frontend processors and IDEs. An example might be extract([DAY|YEAR|MONTH], <date value>). In this example, a producer must specify a type of date part to extract. Note, the value of a required enumeration cannot be used in type derivation.
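A hedged sketch of how these three argument kinds might appear together in YAML definitions follows. The truncate and extract shapes below are illustrative assumptions built from the signatures described above, not definitions copied from the standard extension files:

```yaml
scalar_functions:
  - name: "truncate"
    impls:
      - args:
          # type argument: informs type derivation only, has no runtime value
          - name: "target_type"
            type: "DECIMAL<P0,S0>"
          # value arguments: refer to runtime data values
          - name: "x"
            value: "DECIMAL<P1,S1>"
          - name: "digits"
            value: i32
        return: "DECIMAL<P0,S0>"
  - name: "extract"
    impls:
      - args:
          # enumeration argument: a fixed set of constant choices
          - name: "component"
            options: [ YEAR, MONTH, DAY ]
          - name: "x"
            value: date
        return: i64
```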
Value arguments have the properties below:

| Property | Description | Required |
|----------|-------------|----------|
| Name | A human-readable name for this argument to help clarify use. | Optional, defaults to a name based on position (e.g. arg0) |
| Description | Additional description of this argument. | Optional |
| Value | A fully defined type or a type expression. | Required |
| Constant | Whether this argument is required to be a constant for invocation. For example, in some systems a regular expression pattern would only be accepted as a literal and not a column value reference. | Optional, defaults to false |
In addition to arguments, each call may specify zero or more options. These are similar to a required enumeration but more focused on supporting alternative behaviors. Options can be left unspecified, in which case the consumer is free to choose which implementation to use. An example use case might be OVERFLOW_BEHAVIOR: [OVERFLOW, SATURATE, ERROR]. If unspecified, an engine is free to use any of the three choices or even some alternative behavior (e.g. setting the value to null on overflow). If specified, the engine would be expected to behave as specified or fail. Note, the value of an optional enumeration cannot be used in type derivation.
A producer may specify multiple values for an option. If the producer does so then the consumer must deliver the first behavior in the list of values that the consumer is capable of delivering. For example, considering overflow as defined above, if a producer specified [ERROR, SATURATE] then the consumer must deliver ERROR if it is capable of doing so. If it is not then it may deliver SATURATE. If the consumer cannot deliver either behavior then it is an error and the consumer must reject the plan.
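A hedged sketch of how an option might be declared in a YAML signature; the overflow option name and its values mirror the standard arithmetic extension file but are illustrative here:

```yaml
scalar_functions:
  - name: "add"
    impls:
      - args:
          - value: i64
          - value: i64
        options:
          overflow:
            # a consumer may pick any listed behavior unless the plan
            # narrows the choice, e.g. to [ ERROR, SATURATE ]
            values: [ SILENT, SATURATE, ERROR ]
        return: i64
```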
MIRROR
This means that the function has the behavior that if at least one of the input arguments is nullable, the return type is also nullable. If all arguments are non-nullable, the return type will be non-nullable. An example might be the + function.
DECLARED_OUTPUT
Input arguments are accepted with any mix of nullability. The nullability of the output is whatever the return type expression states. An example use might be the function is_null(), where the output is always boolean independent of the nullability of the input.
DISCRETE
The inputs and arguments all declare concrete nullability and can only be bound to types that have that exact nullability. For example, if a type input is declared i64? and one has an i64 literal, the i64 literal must be specifically cast to i64? to allow the operation to bind.
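A hedged sketch of a DISCRETE signature, where nullability is spelled out on every type; the function name and types are illustrative:

```yaml
scalar_functions:
  - name: "is_positive"
    impls:
      - args:
          # the argument type is concretely nullable; a non-nullable i64
          # must be cast to i64? before this signature can bind
          - name: x
            value: "i64?"
        nullability: DISCRETE
        return: boolean
```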
Types are parameterized by two kinds of values: inner types (e.g. List<K>) and numeric values (e.g. DECIMAL<P,S>). Parameter names are simple strings (frequently a single character). There are two types of parameters: integer parameters and type parameters.
When the same parameter name is used multiple times in a function definition, the function can only bind if the exact same value is used for all parameters of that name. For example, if one had a function with a signature of fn(VARCHAR<N>, VARCHAR<N>), the function would only be usable if both VARCHAR types had the same length value N. This necessitates that all instances of the same parameter name be of the same parameter type (all instances are a type parameter or all instances are an integer parameter).
When the last argument of a function is variadic and declares a type parameter, e.g. fn(A, B, C...), the C parameter can be marked as either consistent or inconsistent. If marked as consistent, the function can only be bound to arguments where all the C types are the same concrete type. If marked as inconsistent, each unique C can be bound to a different type within the constraints of what C allows.
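To illustrate, here is a hedged sketch of a variadic signature with a consistent type parameter; the concat shape and the variadic/parameterConsistency spelling follow the standard extension files but should be checked against the current schema:

```yaml
scalar_functions:
  - name: "concat"
    impls:
      - args:
          # every bound argument must share the same length parameter L1
          - value: "varchar<L1>"
        variadic:
          min: 1
          parameterConsistency: CONSISTENT
        return: "varchar<L1>"
```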
A concrete return type is one that is fully known at function definition time. Examples of simple concrete return types are i32 and fp32. For compound types, a concrete return type must be fully declared. Examples of fully defined compound types: VARCHAR<20>, DECIMAL<25,5>.
Any function can declare a return type expression. A return type expression uses a simplified set of expressions to describe how the return type should be derived. For example, a return expression could be as simple as returning a parameter declared in the arguments, e.g. f(List<K>) => K, or it can be a simple mathematical or conditional expression such as add(decimal<a,b>, decimal<c,d>) => decimal<a+c, b+d>. For the simple expression language, there is a very narrow set of types:
Integer: 64-bit signed integer (can be a literal or a parameter value)
Boolean: True and False
Type: A Substrait type (with possibly additional embedded expressions)
These types are evaluated using a small set of operations to support common scenarios. List of valid operations:
Math: +, -, *, /, min, max
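As an illustration, here is a hedged sketch of a return type derivation in YAML using a multi-line derivation expression. The add_decimal name and the exact formula are illustrative simplifications (the standard decimal add uses a more elaborate derivation):

```yaml
scalar_functions:
  - name: "add_decimal"
    impls:
      - args:
          - value: "decimal<P1,S1>"
          - value: "decimal<P2,S2>"
        # the derivation picks the larger input scale and adds one digit of
        # precision for a possible carry, capped at decimal's maximum of 38
        return: |-
          S3 = max(S1, S2)
          P3 = min(38, max(P1 - S1, P2 - S2) + S3 + 1)
          decimal<P3, S3>
```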
While all types of operations could be reduced to functions, in some cases this would be overly simplistic. Instead, it is helpful to construct some other expression constructs.
These constructs focus on genuinely different expression types, as opposed to mere syntactic sugar. For example, CAST and EXTRACT are SQL operations that are presented using specialized syntax, but they can easily be modeled using a function paradigm with minimal complexity.
For each data type, it is possible to create a literal value for that data type. The representation depends on the serialization format. Literal expressions include both a type literal and a possibly null value.
These expressions allow structs, lists, and maps to be constructed from a set of expressions. For example, they allow a struct expression like (field 0 - field 1, field 0 + field 1) to be represented.
To convert a value from one type to another, Substrait defines a cast expression. Cast expressions declare an expected type, an input argument and an enumeration specifying failure behavior, indicating whether cast should return null on failure or throw an exception.
Note that Substrait always requires a cast expression whenever the current type is not exactly equal to (one of) the expected types. For example, it is illegal to directly pass a value of type i8[0] to a function that only supports an i8?[0] argument.
An if value expression is an expression composed of one if clause, zero or more else if clauses and an else clause. In pseudocode, they are envisioned as:
if <boolean expression> then <result expression 1>
else if <boolean expression> then <result expression 2> (zero or more times)
else <result expression 3>
When an if expression is declared, all return expressions must have the same type.
An if expression is expected to logically short-circuit on a positive outcome. This means that a skipped else/elseif expression cannot cause an error. For example, the following should not actually throw an error despite the fact that the cast operation should fail:
if 'value' = 'value' then 0
else cast('hello' as integer)
Table functions produce zero or more records for each input record. Table functions use a signature similar to scalar functions. However, they are not allowed in the same contexts.
Substrait supports the creation of custom functions using simple extensions, using the facilities described in scalar functions. The functions defined by Substrait use the same mechanism. The extension files for standard functions can be found here.
Here’s an example function that doubles its input:
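The sketch below is in the simple-extensions YAML format and is reconstructed to be consistent with the implementation note that follows; the on_overflow option name and its values are illustrative assumptions:

```yaml
%YAML 1.2
---
scalar_functions:
  - name: "double"
    description: "Doubles the input value."
    impls:
      - args:
          - name: x
            value: fp32
        options:
          # governs what happens when the doubled value falls outside the
          # valid range of a 32-bit float: return NaN or raise an error
          on_overflow:
            values: [ NAN, ERROR ]
        return: fp32
      - args:
          - name: x
            value: i32
        options:
          on_overflow:
            values: [ NAN, ERROR ]
        return: fp32
```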
Implementation Note
This implementation is only defined on 32-bit floats and integers but could be defined on all numbers (and even lists and strings). The user of the implementation can specify what happens when the resulting value falls outside of the valid range for a 32-bit float (either return NAN or raise an error).
Window functions are functions which consume values from multiple records to produce a single output. They are similar to aggregate functions, but they also have a focused window of analysis within their partition. To an end user, window functions behave like scalar functions, producing a single value for each input record; however, producing each single output value may require visibility of many input records.
Window function signatures contain all the properties defined for aggregate functions. Additionally, they contain the properties below:
| Property | Description | Required |
|----------|-------------|----------|
| Inherits | All properties defined for aggregate functions. | N/A |
| Window Type | STREAMING or PARTITION. Describes whether the function needs to see all data for the specific partition operation simultaneously. Operations like SUM can produce values in a streaming manner with no complete visibility of the partition. NTILE requires visibility of the entire partition before it can start producing values. | Optional, defaults to PARTITION |
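A hedged sketch of a window function definition in the simple-extensions YAML format; the ntile shape mirrors the standard extension files, but the exact fields should be verified against the current schema:

```yaml
window_functions:
  - name: "ntile"
    description: "Assigns each row a bucket number from 1 to n within its partition."
    impls:
      - args:
          - name: "n"
            value: i32
        # the whole partition must be visible before output can begin
        window_type: PARTITION
        return: i32
```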
When binding a window function, the binding must include the following additional properties beyond the standard scalar binding properties:
| Property | Description | Required |
|----------|-------------|----------|
| Partition | A list of partitioning expressions. | False, defaults to a single partition for the entire dataset |
| Lower Bound | Bound Following(int64), Bound Trailing(int64) or CurrentRow. | False, defaults to start of partition |
| Upper Bound | Bound Following(int64), Bound Trailing(int64) or CurrentRow. | False, defaults to end of partition |
Calculates the approximate number of rows that contain distinct values of the expression argument using HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact result.
Calculates the approximate number of rows that contain distinct values of the expression argument using HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact result. Result is returned as a decimal instead of i64.
Divide x by y. In the case of integer division, partial values are truncated (i.e. rounded towards 0). The on_division_by_zero option governs behavior in cases where y is 0. If the option is IEEE then the IEEE 754 standard is followed: all values except ±infinity return NaN and ±infinity are unchanged. If the option is LIMIT then the result is ±infinity in all cases. If either x or y is NaN then behavior will be governed by on_domain_error. If x and y are both ±infinity, behavior will be governed by on_domain_error.
Calculate the remainder (r) when dividing dividend (x) by divisor (y). In mathematics, many conventions for the modulus (mod) operation exist. The result of a mod operation depends on the software implementation and underlying hardware. Substrait is a format for describing compute operations on structured data and is designed for interoperability. Therefore the user is responsible for determining a definition of division as defined by the quotient (q). The following basic conditions of division are satisfied: (1) q ∈ ℤ (the quotient is an integer), (2) x = y * q + r (division rule), (3) abs(r) < abs(y), where q is the quotient. The division_type option determines the mathematical definition of quotient to use in the above definition of division. When division_type=TRUNCATE, q = trunc(x/y). When division_type=FLOOR, q = floor(x/y). In the cases of TRUNCATE and FLOOR division, the remainder is r = x - round_func(x/y) * y. The on_domain_error option governs behavior in cases where y is 0, y is ±inf, or x is ±inf. In these cases the mod is undefined. The overflow option governs behavior when integer overflow occurs. If x and y are both 0 or both ±infinity, behavior will be governed by on_domain_error.
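A short worked example of how the two division_type choices diverge for mixed-sign inputs (the values are chosen for illustration):

$$
x = -7,\ y = 3:\quad
\begin{aligned}
\text{TRUNCATE: } & q = \operatorname{trunc}(-7/3) = -2, & r = -7 - 3\cdot(-2) = -1\\
\text{FLOOR: } & q = \lfloor -7/3 \rfloor = -3, & r = -7 - 3\cdot(-3) = 2
\end{aligned}
$$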
Calculate the absolute value of the argument. Integer values allow the specification of overflow behavior to handle the unevenness of two's complement, e.g. the int8 range [-128 : 127].
Return the signedness of the argument. Integer values return signedness with the same type as the input. Possible return values are [-1, 0, 1]. Floating point values return signedness with the same type as the input. Possible return values are [-1.0, -0.0, 0.0, 1.0, NaN].
Calculate the median for a set of values. Returns null if applied to zero records. For the integer implementations, the rounding option determines how the median should be rounded if it ends up midway between two values. For the floating point implementations, the rounding option specifies the usual floating point rounding mode.
Calculates quantiles for a set of values. This function will divide the aggregated values (passed via the distribution argument) over N equally-sized bins, where N is passed via a constant argument. It will then return the values at the boundaries of these bins in list form. If the input is appropriately sorted, this computes the quantiles of the distribution. The function can optionally return the first and/or last element of the input, as specified by the boundaries argument. If the input is appropriately sorted, this will thus be the minimum and/or maximum values of the distribution. When the boundaries do not lie exactly on elements of the incoming distribution, the function will interpolate between the two nearby elements. If the interpolated value cannot be represented exactly, the rounding option controls how the value should be selected or computed. The function fails and returns null in the following cases:

- n is null or less than one;
- any value in distribution is null.

The function returns an empty list if n equals 1 and boundaries is set to NEITHER.
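The bin-boundary computation can be pictured with the following Python model, which interpolates linearly between neighboring elements. This is purely illustrative; in particular, the boundaries option values used here (NEITHER/FIRST/LAST/BOTH) are placeholders that may not match the canonical option names.

```python
def approx_quantile(distribution, n, boundaries="NEITHER"):
    # Return the n-bin boundary values of a (sorted) distribution.
    if n is None or n < 1 or any(v is None for v in distribution):
        return None                      # failure cases from the spec
    out = []
    if boundaries in ("FIRST", "BOTH"):
        out.append(distribution[0])      # minimum, if input is sorted
    for i in range(1, n):                # n == 1 yields no interior boundaries
        pos = i * (len(distribution) - 1) / n
        lo = int(pos)
        hi = min(lo + 1, len(distribution) - 1)
        out.append(distribution[lo] + (distribution[hi] - distribution[lo]) * (pos - lo))
    if boundaries in ("LAST", "BOTH"):
        out.append(distribution[-1])     # maximum, if input is sorted
    return out

assert approx_quantile([1, 2, 3, 4], 2) == [2.5]   # the median
assert approx_quantile([1, 2, 3, 4], 1) == []      # n = 1, boundaries = NEITHER
```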
Returns a value from the nth row based on the window_offset. window_offset should be a positive integer. If the value of the window_offset is outside the range of the window, null is returned. The on_domain_error option governs behavior in cases where window_offset is not a positive integer or null.
Return a value from a following row based on a specified physical offset. This allows you to compare a value in the current row against a following row. The expression is evaluated against a row that comes after the current row based on the row_offset. The row_offset should be a positive integer and is set to 1 if not specified explicitly. If the row_offset is negative, the expression will be evaluated against a row coming before the current row, similar to the lag function. A row_offset of null will return null. The function returns the default input value if row_offset goes beyond the scope of the window. If a default value is not specified, it is set to null. Example comparing the sales of the current year to the following year, with a row_offset of 1:

| year | sales | next_year_sales |
| ---- | ----- | --------------- |
| 2019 | 20.50 | 30.00 |
| 2020 | 30.00 | 45.99 |
| 2021 | 45.99 | null |

Return a column value from a previous row based on a specified physical offset. This allows you to compare a value in the current row against a previous row. The expression is evaluated against a row that comes before the current row based on the row_offset. The expression can be a column, expression or subquery that evaluates to a single value. The row_offset should be a positive integer and is set to 1 if not specified explicitly. If the row_offset is negative, the expression will be evaluated against a row coming after the current row, similar to the lead function. A row_offset of null will return null. The function returns the default input value if row_offset goes beyond the scope of the partition. If a default value is not specified, it is set to null. Example comparing the sales of the current year to the previous year, with a row_offset of 1:

| year | sales | previous_year_sales |
| ---- | ----- | ------------------- |
| 2019 | 20.50 | null |
| 2020 | 30.00 | 20.50 |
| 2021 | 45.99 | 30.00 |
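A rough Python model of the lead and lag semantics above, with the window reduced to a plain list (illustrative only, not a normative implementation):

```python
def lead(rows, i, row_offset=1, default=None):
    # Value from the row `row_offset` positions after row i,
    # or `default` when the offset leaves the window.
    if row_offset is None:
        return None
    j = i + row_offset          # a negative offset behaves like lag
    return rows[j] if 0 <= j < len(rows) else default

def lag(rows, i, row_offset=1, default=None):
    if row_offset is None:
        return None
    return lead(rows, i, -row_offset, default)

sales = [20.50, 30.00, 45.99]        # 2019, 2020, 2021
assert lead(sales, 0) == 30.00       # next_year_sales for 2019
assert lead(sales, 2) is None        # 2021 has no following row
assert lag(sales, 2) == 30.00        # previous_year_sales for 2021
assert lag(sales, 0) is None         # 2019 has no previous row
```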
Whether two values are equal. This function treats null values as comparable, so is_not_distinct_from(null, null) == True. This is in contrast to equal, in which null values do not compare.
Whether two values are not equal. This function treats null values as comparable, so is_distinct_from(null, null) == False. This is in contrast to equal, in which null values do not compare.
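A minimal sketch of this null-safe comparison, modeling SQL NULL as Python None (illustrative only):

```python
def is_not_distinct_from(a, b):
    # Null-safe equality: two nulls compare as equal,
    # a null and a non-null compare as not equal.
    if a is None or b is None:
        return a is None and b is None
    return a == b

def is_distinct_from(a, b):
    return not is_not_distinct_from(a, b)

assert is_not_distinct_from(None, None) is True
assert is_distinct_from(None, None) is False
assert is_distinct_from(1, None) is True
```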
expression: The expression to test for in the range defined by `low` and `high`.
low: The value to check if greater than or equal to.
high: The value to check if less than or equal to.
0. between(any1, any1, any1): -> boolean
Whether the expression is greater than or equal to low and less than or equal to high, i.e. `expression BETWEEN low AND high`. If low, high, or expression is null, null is returned.
Evaluate arguments from left to right and return the first argument that is not null. Once a non-null argument is found, the remaining arguments are not evaluated. If all arguments are null, return null.
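The short-circuit evaluation of coalesce can be modeled by treating each argument as a deferred expression, so that later arguments are never evaluated once a non-null value is found (a sketch, not a normative implementation):

```python
def coalesce(*exprs):
    # Evaluate argument thunks left to right; stop at the first non-null.
    for thunk in exprs:
        value = thunk()
        if value is not None:
            return value
    return None

# The third argument would raise if evaluated; coalesce never reaches it.
assert coalesce(lambda: None, lambda: 42, lambda: 1 / 0) == 42
assert coalesce(lambda: None, lambda: None) is None
```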
Extract portion of a date/time value.

* YEAR Return the year.
* ISO_YEAR Return the ISO 8601 week-numbering year. First week of an ISO year has the majority (4 or more) of its days in January.
* US_YEAR Return the US epidemiological year. First week of US epidemiological year has the majority (4 or more) of its days in January. Last week of US epidemiological year has the year’s last Wednesday in it. US epidemiological week starts on Sunday.
* QUARTER Return the number of the quarter within the year. January 1 through March 31 map to the first quarter, April 1 through June 30 map to the second quarter, etc.
* MONTH Return the number of the month within the year.
* DAY Return the number of the day within the month.
* DAY_OF_YEAR Return the number of the day within the year. January 1 maps to the first day, February 1 maps to the thirty-second day, etc.
* MONDAY_DAY_OF_WEEK Return the number of the day within the week, from Monday (first day) to Sunday (seventh day).
* SUNDAY_DAY_OF_WEEK Return the number of the day within the week, from Sunday (first day) to Saturday (seventh day).
* MONDAY_WEEK Return the number of the week within the year. First week starts on first Monday of January.
* SUNDAY_WEEK Return the number of the week within the year. First week starts on first Sunday of January.
* ISO_WEEK Return the number of the ISO week within the ISO year. First ISO week has the majority (4 or more) of its days in January. ISO week starts on Monday.
* US_WEEK Return the number of the US week within the US year. First US week has the majority (4 or more) of its days in January. US week starts on Sunday.
* HOUR Return the hour (0-23).
* MINUTE Return the minute (0-59).
* SECOND Return the second (0-59).
* MILLISECOND Return number of milliseconds since the last full second.
* MICROSECOND Return number of microseconds since the last full millisecond.
* NANOSECOND Return number of nanoseconds since the last full microsecond.
* SUBSECOND Return number of microseconds since the last full second of the given timestamp.
* UNIX_TIME Return number of seconds that have elapsed since 1970-01-01 00:00:00 UTC, ignoring leap seconds.
* TIMEZONE_OFFSET Return number of seconds of timezone offset to UTC.

The range of values returned for QUARTER, MONTH, DAY, DAY_OF_YEAR, MONDAY_DAY_OF_WEEK, SUNDAY_DAY_OF_WEEK, MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK depends on whether counting starts at 1 or 0. This is governed by the indexing option. When indexing is ONE:

* QUARTER returns values in range 1-4
* MONTH returns values in range 1-12
* DAY returns values in range 1-31
* DAY_OF_YEAR returns values in range 1-366
* MONDAY_DAY_OF_WEEK and SUNDAY_DAY_OF_WEEK return values in range 1-7
* MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK return values in range 1-53

When indexing is ZERO:

* QUARTER returns values in range 0-3
* MONTH returns values in range 0-11
* DAY returns values in range 0-30
* DAY_OF_YEAR returns values in range 0-365
* MONDAY_DAY_OF_WEEK and SUNDAY_DAY_OF_WEEK return values in range 0-6
* MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK return values in range 0-52

The indexing option must be specified when the component is QUARTER, MONTH, DAY, DAY_OF_YEAR, MONDAY_DAY_OF_WEEK, SUNDAY_DAY_OF_WEEK, MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, or US_WEEK. The indexing option cannot be specified when the component is YEAR, ISO_YEAR, US_YEAR, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, SUBSECOND, UNIX_TIME, or TIMEZONE_OFFSET.
Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
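Several of these components have direct analogues in common datetime libraries. For example, Python's standard datetime module can reproduce a few of them (shown only to illustrate the component definitions; the specification itself is language-neutral):

```python
from datetime import datetime

ts = datetime(2019, 2, 1, 13, 30, 5)

iso_year, iso_week, monday_dow = ts.isocalendar()  # ISO_YEAR, ISO_WEEK, MONDAY_DAY_OF_WEEK
day_of_year = ts.timetuple().tm_yday               # DAY_OF_YEAR: February 1 -> 32
quarter = (ts.month - 1) // 3 + 1                  # QUARTER, with indexing=ONE

assert (day_of_year, quarter) == (32, 1)
assert monday_dow == 5                             # 2019-02-01 was a Friday
```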
Extract boolean values of a date/time value.

* IS_LEAP_YEAR Return true if year of the given value is a leap year and false otherwise.
* IS_DST Return true if DST (Daylight Savings Time) is observed at the given value in the given timezone.

Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
Add an interval to a date/time type. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
Subtract an interval from a date/time type. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
Convert local timestamp to UTC-relative timestamp_tz using given local time’s timezone. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
Convert UTC-relative timestamp_tz to local timestamp using given local time’s timezone. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
Parse string into timestamp using provided format, see https://man7.org/linux/man-pages/man3/strptime.3.html for reference. If timezone is present in timestamp and provided as parameter an error is thrown. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is supplied as parameter and present in the parsed string the parsed timezone is used. If parameter supplied timezone is invalid an error is thrown.
Implementations: round_temporal(x, rounding, unit, multiple, origin): -> return_type

0. round_temporal(timestamp, rounding, unit, i64, timestamp): -> timestamp
1. round_temporal(precision_timestamp&lt;P&gt;, rounding, unit, i64, precision_timestamp&lt;P&gt;): -> precision_timestamp&lt;P&gt;
2. round_temporal(timestamp_tz, rounding, unit, i64, string, timestamp_tz): -> timestamp_tz
3. round_temporal(precision_timestamp_tz&lt;P&gt;, rounding, unit, i64, string, precision_timestamp_tz&lt;P&gt;): -> precision_timestamp_tz&lt;P&gt;
4. round_temporal(date, rounding, unit, i64, date): -> date
5. round_temporal(time, rounding, unit, i64, time): -> time
Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple from the origin in the given timezone, the resulting point is chosen as one of the two nearest multiples. Which of these is chosen is governed by rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
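The rounding rule can be sketched in Python with the unit collapsed to a number of seconds and timezone handling omitted (names and simplifications are ours, not part of the specification):

```python
from datetime import datetime, timedelta

def round_temporal(ts, rounding, unit_seconds, multiple, origin):
    # Round ts to a multiple of (multiple * unit_seconds) from origin.
    step = timedelta(seconds=unit_seconds * multiple)
    quotient, remainder = divmod(ts - origin, step)
    if remainder == timedelta(0):
        return ts                       # already an exact multiple
    earlier = origin + quotient * step
    later = earlier + step
    if rounding == "FLOOR":
        return earlier
    if rounding == "CEIL":
        return later
    if remainder * 2 < step:            # strictly nearer the earlier point
        return earlier
    if remainder * 2 > step:            # strictly nearer the later point
        return later
    return earlier if rounding == "ROUND_TIE_DOWN" else later

origin = datetime(1970, 1, 1)
ts = datetime(2021, 6, 15, 10, 37)
assert round_temporal(ts, "FLOOR", 3600, 1, origin) == datetime(2021, 6, 15, 10, 0)
assert round_temporal(ts, "CEIL", 3600, 1, origin) == datetime(2021, 6, 15, 11, 0)
```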
Implementations: round_calendar(x, rounding, unit, origin, multiple): -> return_type

0. round_calendar(timestamp, rounding, unit, origin, i64): -> timestamp
1. round_calendar(precision_timestamp&lt;P&gt;, rounding, unit, origin, i64): -> precision_timestamp&lt;P&gt;
2. round_calendar(timestamp_tz, rounding, unit, origin, i64, string): -> timestamp_tz
3. round_calendar(precision_timestamp_tz&lt;P&gt;, rounding, unit, origin, i64, string): -> precision_timestamp_tz&lt;P&gt;
4. round_calendar(date, rounding, unit, origin, i64, date): -> date
5. round_calendar(time, rounding, unit, origin, i64, time): -> time
Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple from the last origin unit in the given timezone, the resulting point is chosen as one of the two nearest multiples. Which of these is chosen is governed by rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
Returns a linestring connecting the endpoint of geometry geom1 to the begin point of geometry geom2. Repeated points at the beginning of input geometries are collapsed to a single point. A linestring can be closed or simple. A closed linestring starts and ends on the same point. A simple linestring does not cross or touch itself.
Return the minimum bounding box for the input geometry as a geometry. The returned geometry is defined by the corner points of the bounding box. If the input geometry is a point or a line, the returned geometry can also be a point or line.
Return the dimension of the input geometry. If the input is a collection of geometries, return the largest dimension from the collection. Dimensionality is determined by the complexity of the input and not the coordinate system being used. Type dimensions: POINT - 0, LINE - 1, POLYGON - 2.
Return true if the input geometry is a valid 2D geometry. For 3-dimensional and 4-dimensional geometries, the validity is still only tested in 2 dimensions.
Given the input geometry collection, return a homogeneous multi-geometry. All geometries in the multi-geometry will have the same dimension. If type is not specified, the multi-geometry will only contain geometries of the highest dimension. If type is specified, the multi-geometry will only contain geometries of that type. If there are no geometries of the specified type, an empty geometry is returned. Only points, linestrings, and polygons are supported. Type numbers: POINT - 0, LINE - 1, POLYGON - 2.
Return a version of the input geometry with the X and Y axis flipped. This operation can be performed on geometries with more than 2 dimensions; however, only the X and Y axis will be flipped.
Return a version of the input geometry with duplicate consecutive points removed. If the tolerance argument is provided, consecutive points within the tolerance distance of one another are considered to be duplicates.
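As a toy illustration of the deduplication rule just described, operating on plain coordinate tuples rather than a real geometry type:

```python
import math

def remove_repeated_points(points, tolerance=0.0):
    # Keep a point only if it is farther than `tolerance` from
    # the previously kept point.
    kept = []
    for p in points:
        if kept and math.dist(kept[-1], p) <= tolerance:
            continue
        kept.append(p)
    return kept

line = [(0, 0), (0, 0), (1, 0), (1.05, 0), (2, 0)]
assert remove_repeated_points(line) == [(0, 0), (1, 0), (1.05, 0), (2, 0)]
assert remove_repeated_points(line, 0.1) == [(0, 0), (1, 0), (2, 0)]
```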
Compute and return an expanded version of the input geometry. All the points of the returned geometry are at a distance of buffer_radius away from the points of the input geometry. If a negative buffer_radius is provided, the geometry will shrink instead of expand. A negative buffer_radius may shrink the geometry completely, in which case an empty geometry is returned. For point or line input geometries, a negative buffer_radius will always return an empty geometry.
s: Number of decimal places to be rounded to. When `s` is a positive number, nothing will happen since `x` is an integer value. When `s` is a negative number, the rounding is performed to the nearest multiple of `10^(-s)`.
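For instance, rounding an integer to the nearest multiple of `10^(-s)` for a negative `s` could look like the following sketch (ties are broken away from zero here purely for illustration; the actual tie-breaking is governed by the function's rounding option):

```python
def round_integer(x: int, s: int) -> int:
    # s >= 0: an integer already has the requested precision.
    if s >= 0:
        return x
    m = 10 ** (-s)                      # nearest multiple of 10^(-s)
    q, r = divmod(abs(x), m)
    if r * 2 >= m:
        q += 1                          # round half away from zero
    return q * m if x >= 0 else -q * m

assert round_integer(1234, -2) == 1200
assert round_integer(1250, -2) == 1300
assert round_integer(-1250, -2) == -1300
assert round_integer(1234, 2) == 1234   # positive s: no change
```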
Checks the membership of a value in a list of values. Returns the first 0-based index value of some input needle if needle is equal to any element in haystack. Returns NULL if not found. If needle is NULL, returns NULL. If needle is NaN:

- Returns 0-based index of NaN in input (default)
- Returns NULL (if NAN_IS_NOT_NAN is specified)
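A Python sketch of this membership check, with None standing in for NULL and the NaN option modeled as a boolean flag (names are illustrative):

```python
import math

def index_in(needle, haystack, nan_is_nan=True):
    # First 0-based index whose element equals needle, else None (NULL).
    if needle is None:
        return None
    needle_is_nan = isinstance(needle, float) and math.isnan(needle)
    if needle_is_nan and not nan_is_nan:
        return None                     # NAN_IS_NOT_NAN: NaN matches nothing
    for i, v in enumerate(haystack):
        if needle_is_nan:
            if isinstance(v, float) and math.isnan(v):
                return i
        elif v == needle:
            return i
    return None

assert index_in(2, [1, 2, 3]) == 1
assert index_in(None, [1, 2, 3]) is None
assert index_in(float("nan"), [1.0, float("nan")]) == 1
assert index_in(float("nan"), [1.0, float("nan")], nan_is_nan=False) is None
```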
Concatenate strings. The null_handling option determines whether or not null values will be recognized by the function. If null_handling is set to IGNORE_NULLS, null value arguments will be ignored when strings are concatenated. If set to ACCEPT_NULLS, the result will be null if any argument passed to the concat function is null.
Extract a substring of a specified length starting from position start. A start value of 1 refers to the first character of the string. When length is not specified the function will extract a substring starting from position start and ending at the end of the string. The negative_start option applies to the start parameter. WRAP_FROM_END means the index will start from the end of the input and move backwards. The last character has an index of -1, the second to last character has an index of -2, and so on. LEFT_OF_BEGINNING means the returned substring will start from the left of the first character. A start of -1 will begin 2 characters left of the input, while a start of 0 begins 1 character left of the input.
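The 1-based indexing and the two negative_start behaviors can be pictured with a small Python model (illustrative; clamping details at the string edges may differ between implementations):

```python
def substring(s, start, length=None, negative_start="WRAP_FROM_END"):
    if start > 0:
        begin = start - 1                 # 1-based -> 0-based
    elif negative_start == "WRAP_FROM_END":
        begin = len(s) + start            # -1 is the last character
    else:                                 # LEFT_OF_BEGINNING
        begin = start - 1                 # 0 is one position left of the input
    end = len(s) if length is None else begin + length
    begin = max(begin, 0)
    return s[begin:end] if end > begin else ""

assert substring("substrait", 4, 3) == "str"
assert substring("substrait", -5, 3) == "tra"                # WRAP_FROM_END
assert substring("abc", -1, 4, "LEFT_OF_BEGINNING") == "ab"  # starts 2 left of input
```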
Extract a substring that matches the given regular expression pattern. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The occurrence of the pattern to be extracted is specified using the occurrence argument. Specifying 1 means the first occurrence will be extracted, 2 means the second occurrence, and so on. The occurrence argument should be a positive non-zero integer. The position argument specifies the number of characters from the beginning of the string at which to start searching for pattern matches. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The regular expression capture group can be specified using the group argument. Specifying 0 will return the substring matching the full regular expression. Specifying 1 will return the substring matching only the first capture group, and so on. The group argument should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the occurrence value is out of range, the position value is out of range, or the group value is out of range.
Extract all substrings that match the given regular expression pattern. This will return a list of extracted strings with one value for each occurrence of a match. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The position argument specifies the number of characters from the beginning of the string at which to start searching for pattern matches. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The regular expression capture group can be specified using the group argument. Specifying 0 will return substrings matching the full regular expression. Specifying 1 will return substrings matching only the first capture group, and so on. The group argument should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the position value is out of range, or the group value is out of range.
Return the position of the first occurrence of a string in another string. The first character of the string is at position 1. If no occurrence is found, 0 is returned. The case_sensitivity option applies to the substring argument.
Return the position of an occurrence of the given regular expression pattern in a string. The first character of the string is at position 1. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The position argument specifies the number of characters from the beginning of the string at which to start searching for pattern matches. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. Which occurrence to return the position of is specified using the occurrence argument. Specifying 1 means the position of the first occurrence will be returned, 2 means the position of the second occurrence, and so on. The occurrence argument should be a positive non-zero integer. If no occurrence is found, 0 is returned. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the occurrence value is out of range, or the position value is out of range.
Return the number of non-overlapping occurrences of a regular expression pattern in an input string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The position argument specifies the number of characters from the beginning of the string at which to start searching for pattern matches. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile or the position value is out of range.
Replace a slice of the input string. A specified ‘length’ of characters will be deleted from the input string beginning at the ‘start’ position and will be replaced by a new string. A start value of 1 indicates the first character of the input string. If start is negative or zero, or greater than the length of the input string, a null string is returned. If ‘length’ is negative, a null string is returned. If ‘length’ is zero, the new string is inserted at the specified ‘start’ position and no characters are deleted. If ‘length’ is greater than the length of the input string, deletion will occur up to the last character of the input string.
Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Transform the string to upper case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Transform the string’s lowercase characters to uppercase and uppercase characters to lowercase. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Capitalize the first character of the input string. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Converts the input string into titlecase. Capitalize the first character of each word in the input string except for articles (a, an, the). Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Capitalizes the first character of each word in the input string, including articles, and lowercases the rest. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Search a string for a substring that matches a given regular expression pattern and replace it with a replacement string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The occurrence of the pattern to be replaced is specified using the occurrence argument. Specifying 1 means only the first occurrence will be replaced, 2 means the second occurrence, and so on. Specifying 0 means all occurrences will be replaced. The position argument specifies the number of characters from the beginning of the string at which to start searching for pattern matches. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The replacement string can capture groups using numbered backreferences. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the replacement contains an illegal back-reference, the occurrence value is out of range, or the position value is out of range.
Left-pad the input string with the string of ‘characters’ until the specified length of the string has been reached. If the input string is longer than ‘length’, remove characters from the right side to shorten it to ‘length’ characters. If the string of ‘characters’ is longer than the remaining ‘length’ needed to be filled, only pad until ‘length’ has been reached. If ‘characters’ is not specified, the default value is a single space.
Right-pad the input string with the string of ‘characters’ until the specified length of the string has been reached. If the input string is longer than ‘length’, remove characters from the left side to shorten it to ‘length’ characters. If the string of ‘characters’ is longer than the remaining ‘length’ needed to be filled, only pad until ‘length’ has been reached. If ‘characters’ is not specified, the default value is a single space.
Center the input string by padding the sides with a single character until the specified length of the string has been reached. By default, if the length is reached with an uneven amount of padding, the extra padding is applied to the right side. The side with extra padding can be controlled with the padding option. Behavior is undefined if the number of characters passed to the character argument is not 1.
Split a string into a list of strings, based on a regular expression pattern. The substrings matched by the pattern will be used as the separators to split the input string and will not be included in the resulting list. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string.
Concatenate strings. The null_handling option determines whether or not null values will be recognized by the function. If null_handling is set to IGNORE_NULLS, null value arguments will be ignored when strings are concatenated. If set to ACCEPT_NULLS, the result will be null if any argument passed to the concat function is null.
Extract a substring of a specified length starting from position start. A start value of 1 refers to the first characters of the string. When length is not specified the function will extract a substring starting from position start and ending at the end of the string. The negative_start option applies to the start parameter. WRAP_FROM_END means the index will start from the end of the input and move backwards. The last character has an index of -1, the second to last character has an index of -2, and so on. LEFT_OF_BEGINNING means the returned substring will start from the left of the first character. A start of -1 will begin 2 characters left of the the input, while a start of 0 begins 1 character left of the input.
Extract a substring that matches the given regular expression pattern. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The occurrence of the pattern to be extracted is specified using the occurrence argument. Specifying 1 means the first occurrence will be extracted, 2 means the second occurrence, and so on. The occurrence argument should be a positive non-zero integer. The number of characters from the beginning of the string to begin starting to search for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The regular expression capture group can be specified using the group argument. Specifying 0 will return the substring matching the full regular expression. Specifying 1 will return the substring matching only the first capture group, and so on. The group argument should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the occurrence value is out of range, the position value is out of range, or the group value is out of range.
Extract all substrings that match the given regular expression pattern. This will return a list of extracted strings with one value for each occurrence of a match. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The number of characters from the beginning of the string to begin starting to search for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The regular expression capture group can be specified using the group argument. Specifying 0 will return substrings matching the full regular expression. Specifying 1 will return substrings matching only the first capture group, and so on. The group argument should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the position value is out of range, or the group value is out of range.
Return the position of the first occurrence of a string in another string. The first character of the string is at position 1. If no occurrence is found, 0 is returned. The case_sensitivity option applies to the substring argument.
Return the position of an occurrence of the given regular expression pattern in a string. The first character of the string is at position 1. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The number of characters from the beginning of the string to begin starting to search for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. Which occurrence to return the position of is specified using the occurrence argument. Specifying 1 means the position first occurrence will be returned, 2 means the position of the second occurrence, and so on. The occurrence argument should be a positive non-zero integer. If no occurrence is found, 0 is returned. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the occurrence value is out of range, or the position value is out of range.
Return the number of non-overlapping occurrences of a regular expression pattern in an input string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The number of characters from the beginning of the string to begin starting to search for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile or the position value is out of range.
Replace a slice of the input string. A specified ‘length’ of characters will be deleted from the input string beginning at the ‘start’ position and will be replaced by a new string. A start value of 1 indicates the first character of the input string. If start is negative or zero, or greater than the length of the input string, a null string is returned. If ‘length’ is negative, a null string is returned. If ‘length’ is zero, inserting of the new string occurs at the specified ‘start’ position and no characters are deleted. If ‘length’ is greater than the input string, deletion will occur up to the last character of the input string.
Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Transform the string to upper case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Transform the string’s lowercase characters to uppercase and uppercase characters to lowercase. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Capitalize the first character of the input string. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Converts the input string into titlecase. Capitalize the first character of each word in the input string except for articles (a, an, the). Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Capitalizes the first character of each word in the input string, including articles, and lowercases the rest. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Search a string for a substring that matches a given regular expression pattern and replace it with a replacement string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github .io/icu/userguide/strings/regexp.html). The occurrence of the pattern to be replaced is specified using the occurrence argument. Specifying 1 means only the first occurrence will be replaced, 2 means the second occurrence, and so on. Specifying 0 means all occurrences will be replaced. The number of characters from the beginning of the string to begin starting to search for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The replacement string can capture groups using numbered backreferences. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the replacement contains an illegal back-reference, the occurrence value is out of range, or the position value is out of range.
Left-pad the input string with the string of ‘characters’ until the specified length of the string has been reached. If the input string is longer than ‘length’, remove characters from the right-side to shorten it to ‘length’ characters. If the string of ‘characters’ is longer than the remaining ‘length’ needed to be filled, only pad until ‘length’ has been reached. If ‘characters’ is not specified, the default value is a single space.
Right-pad the input string with the string of ‘characters’ until the specified length of the string has been reached. If the input string is longer than ‘length’, remove characters from the left-side to shorten it to ‘length’ characters. If the string of ‘characters’ is longer than the remaining ‘length’ needed to be filled, only pad until ‘length’ has been reached. If ‘characters’ is not specified, the default value is a single space.
Center the input string by padding the sides with a single character until the specified length of the string has been reached. By default, if the length will be reached with an uneven number of padding, the extra padding will be applied to the right side. The side with extra padding can be controlled with the padding option. Behavior is undefined if the number of characters passed to the character argument is not 1.
Split a string into a list of strings, based on a regular expression pattern. The substrings matched by the pattern will be used as the separators to split the input string and will not be included in the resulting list. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string.
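A rough Python equivalent of this behavior (Python’s re dialect rather than ICU, so details may differ); the matched separators are consumed and only the pieces are returned:

import re

re.split(r"[,;]\s*", "a, b;c")                     # ['a', 'b', 'c']
# The multiline option corresponds to a flag that changes ^ and $:
re.split(r"^--$", "a\n--\nb", flags=re.MULTILINE)  # ['a\n', '\nb']
# Case-insensitive matching corresponds to re.IGNORECASE; dotall to re.DOTALL.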
In many cases, the existing objects in Substrait will be sufficient to accomplish a particular use case. However, it is sometimes helpful to create a new data type, scalar function signature or some other custom representation within a system. For that, Substrait provides a number of extension points.
Some kinds of primitives are so frequently extended that Substrait defines a standard YAML format that describes how the extended functionality can be interpreted. This allows different projects/systems to use the YAML definition as a specification so that interoperability isn’t constrained to the base Substrait specification. The main types of extensions that are defined in this manner include the following:
Data types
Type variations
Scalar Functions
Aggregate Functions
Window Functions
Table Functions
To extend these items, developers can create one or more YAML files at a defined URI that describe the properties of each of these extensions. The YAML file is constructed according to the YAML Schema. Each definition in the file corresponds to the YAML-based serialization of the relevant data structure. A developer who only wants to extend one of these types of objects (e.g. types) does not have to provide definitions for the other extension points.
A Substrait plan can reference one or more YAML files via URI for extension. In the places where these entities are referenced, they will be referenced using a URI + name reference. The name scheme per type works as follows:
Category
Naming scheme
Type
The name as defined on the type object.
Type Variation
The name as defined on the type variation object.
Function Signature
A function signature compound name as described below.
A YAML file can also reference types and type variations defined in another YAML file. To do this, it must declare the YAML file it depends on using a key-value pair in the dependencies key, where the value is the URI to the YAML file, and the key is a valid identifier that can then be used as an identifier-safe alias for the URI. This alias can then be used as a .-separated namespace prefix wherever a type class or type variation name is expected.
For example, if the YAML file at file:///extension_types.yaml defines a type called point, a different YAML file can use the type in a function declaration as follows:
dependencies:
  ext: file:///extension_types.yaml
scalar_functions:
  - name: "distance"
    description: "The distance between two points."
    impls:
      - args:
          - name: a
            value: ext.point
          - name: b
            value: ext.point
        return: f64
The Substrait project is run by volunteers in a collaborative and open way. Its governance is inspired by the Apache Software Foundation. In most cases, people familiar with the ASF model can work with Substrait in the same way. The biggest differences between the models are:
Substrait does not have a separate infrastructure governing body that gatekeeps the adoption of new developer tools and technologies.
Substrait Management Committee (SMC) members are responsible for recognizing the corporate relationship of its members and ensuring diverse representation and corporate independence.
Substrait does not condone private mailing lists. All project business should be discussed in public. The only exceptions to this are security escalations (security@substrait.io) and harassment (harassment@substrait.io).
Substrait has an automated continuous release process with no formal voting process per release.
More details about concrete things Substrait looks to avoid can be found below.
A user is someone who uses Substrait. They may contribute to Substrait by providing feedback to developers in the form of bug reports and feature suggestions. Users participate in the Substrait community by helping other users on mailing lists and user support forums.
A contributor is a user who contributes to the project in the form of code or documentation. They take extra steps to participate in the project (loosely defined as the set of repositories under the GitHub substrait-io organization), are active on the developer mailing list, participate in discussions, and provide patches, documentation, suggestions, and criticism.
A committer is a developer who has write access to the code repositories and has a signed Contributor License Agreement (CLA) on file. Because they do not need to depend on other people to apply patches to the code or documentation, they effectively make short-term decisions for the project. The SMC can (even tacitly) agree and approve the changes into permanency, or they can reject them. Remember that the SMC makes the decisions, not the individual committers.
An SMC member is a committer who was elected due to merit for the evolution of the project. They have write access to the code repository, the right to cast binding votes on all proposals on community-related decisions, the right to propose other active contributors for committership, and the right to invite active committers to the SMC. The SMC as a whole is the entity that controls the project, nobody else. They are responsible for the continued shaping of this governance model.
The Substrait project is managed using a collaborative, consensus-based process. We do not have a hierarchical structure; rather, different groups of contributors have different rights and responsibilities in the organization.
Communication must be done via mailing lists, Slack, and/or GitHub. Communication is always done publicly. There are no private lists and all decisions related to the project are made in public. Communication is frequently done asynchronously since members of the community are distributed across many time zones.
The Substrait Management Committee is responsible for the active management of Substrait. The main role of the SMC is to further the long-term development and health of the community as a whole, and to ensure that balanced and wide-scale peer review and collaboration takes place. As part of this, the SMC is the primary approver of specification changes, ensuring that proposed changes represent a balanced and thorough examination of possibilities. This doesn’t mean that the SMC has to be involved in the minutiae of a particular specification change, but it should always shepherd a healthy process around specification changes.
Because one of the fundamental aspects of accomplishing things is doing so by consensus, we need a way to tell whether we have reached consensus. We do this by voting. There are several different types of voting. In all cases, it is recommended that all community members vote. The number of binding votes required to move forward and the community members who have “binding” votes differs depending on the type of proposal made. In all cases, a veto of a binding voter results in an inability to move forward.
The rules require that a community member registering a negative vote must include an alternative proposal or a detailed explanation of the reasons for the negative vote. The community then tries to gather consensus on an alternative proposal that can resolve the issue. In the great majority of cases, the concerns leading to the negative vote can be addressed. This process is called “consensus gathering” and we consider it a very important indication of a healthy community.
+1 votes required
Binding voters
Voting Location
Process/Governance modifications & actions. This includes promoting new contributors to committer or SMC.
Substrait follows a review-then-commit policy. This requires that all changes receive consensus approval before being committed to the code base. The specific vote requirements follow the table above.
The voting process may seem more than a little weird if you’ve never encountered it before. Votes are represented as numbers between -1 and +1, with ‘-1’ meaning ‘no’ and ‘+1’ meaning ‘yes.’
The in-between values indicate how strongly the voting individual feels. Here are some examples of fractional votes and what the voter might be communicating with them:
+0: ‘I don’t feel strongly about it, but I’m okay with this.’
-0: ‘I won’t get in the way, but I’d rather we didn’t do this.’
-0.5: ‘I don’t like this idea, but I can’t find any rational justification for my feelings.’
+1: ‘Wow! I like this! Let’s do it!’
-0.9: ‘I really don’t like this, but I’m not going to stand in the way if everyone else wants to go ahead with it.’
+0.9: ‘This is a cool idea and I like it, but I don’t have time/the skills necessary to help out.’
For code-modification votes, +1 votes (review approvals in GitHub are considered equivalent to a +1) are in favor of the proposal, but -1 votes are vetoes and kill the proposal dead until all vetoers withdraw their -1 votes.
A -1 (or an unaddressed PR request for changes) vote by a qualified voter stops a code-modification proposal in its tracks. This constitutes a veto, and it cannot be overruled nor overridden by anyone. Vetoes stand until and unless the individual withdraws their veto.
To prevent vetoes from being used capriciously, the voter must provide with the veto a technical or community justification showing why the change is bad.
Votes help us to openly resolve conflicts. Without a process, people tend to avoid conflict and thrash around. Votes help to make sure we do the hard work of resolving the conflict.
Substrait is non-commercial but commercially-aware
Substrait’s mission is to produce software for the public good. All Substrait software is always available for free, and solely under the Apache License.
We’re happy to have third parties, including for-profit corporations, take our software and use it for their own purposes. However it is important in these cases to ensure that the third party does not misuse the brand and reputation of the Substrait project for its own purposes. It is important for the longevity and community health of Substrait that the community gets the appropriate credit for producing freely available software.
The SMC actively tracks the corporate allegiances of community members and strives to ensure that influence over any particular aspect of the project isn’t overly skewed towards a single corporate entity.
Corporate Awareness: The ASF takes a blind-eye approach that has proven too slow to correct corporate influence, which has substantially undermined many OSS projects. In contrast, Substrait SMC members are responsible for identifying corporate risks and over-representation and adjusting inclusion in the project based on that (limiting committership, SMC membership, etc.). Each member of the SMC shares responsibility to expand the community and seek out corporate diversity.
Infrastructure: The ASF shows its age with respect to infrastructure, having been originally built on SVN. Examples of requirements that exist in the ASF but that Substrait is eschewing include custom git infrastructure, a manual release process, and project-external gatekeeping around the use of new tools/technologies.
Substrait is a format for describing compute operations on structured data. It is designed for interoperability across different languages and systems.
Substrait provides a well-defined, cross-language specification for data compute operations. This includes a consistent declaration of common operations, custom operations, and one or more serialized representations of this specification. The spec focuses on the semantics of each operation. In addition to the specification, the Substrait ecosystem also includes a number of libraries and useful tools.
We highly recommend the tutorial to learn how a Substrait plan is constructed.
Avoids every system needing to create a communication method between every other system – each system merely supports ingesting and producing Substrait and it instantly becomes a part of the greater ecosystem.
Makes every part of the system upgradable. There’s a new query engine that’s ten times faster? Just plug it in!
Enables heterogeneous environments – run on a cluster of an unknown set of execution engines!
The text version of the Substrait plan allows you to quickly see how a plan functions without needing a visualizer (although there are Substrait visualizers as well!).
Communicate a compute plan between a SQL parser and an execution engine (e.g. Calcite SQL parsing to Arrow C++ compute kernel)
Serialize a plan that represents a SQL view for consistent use in multiple systems (e.g. Iceberg views in Spark and Trino)
Submit a plan to different execution engines (e.g. Datafusion and Postgres) and get a consistent interpretation of the semantics.
Create an alternative plan generation implementation that can connect an existing end-user compute expression system to an existing end-user processing engine (e.g. Pandas operations executed inside SingleStore)
Build a pluggable plan visualization tool (e.g. D3 based plan visualizer)
Substrait is designed to allow a user to describe arbitrarily complex data transformations. These transformations are composed of one or more relational operations. Relational operations are well-defined transformation operations that work by taking zero or more input datasets and transforming them into zero or more output datasets. Substrait defines a core set of transformations, but users are also able to extend the operations with their own specialized operations.
A plan is a tree of relations. The root of the tree is the final output of the plan. Each node in the tree is a relational operation. The children of a node are the inputs to the operation. The leaves of the tree are the input datasets to the plan.
Plans can be composed together using reference relations. This allows for the construction of common plans that can be reused in multiple places. If a plan has no cycles (there is only one plan or each reference relation only references later plans) then the plan will form a DAG (Directed Acyclic Graph).
Each relational operation is composed of several properties. Common properties for relational operations include the following:
Property
Description
Type
Emit
The set of columns output from this operation and the order of those columns.
Logical & Physical
Hints
A set of optionally provided, optionally consumed information about an operation that better informs execution. These might include estimated number of input and output records, estimated record size, likely filter reduction, estimated dictionary size, etc. These can also include implementation specific pieces of execution information.
Physical
Constraint
A set of runtime constraints around the operation, limiting its consumption based on real-world resources (CPU, memory) as well as virtual resources like number of records produced, the largest record size, etc.
For functions, signatures are declared separately from the uses of those signatures (function bindings). For relational operations, signatures are declared directly in the specification. This is due to the speed of change and the number of total operations: relational operations in the specification are expected to stay below 100 for several years, with additions being infrequent. On the other hand, there is an expectation of both a much larger number of functions (1,000s) and a much higher velocity of additions.
Each relational operation must declare the following:
Transformation logic around properties of the data. For example, does a relational operation maintain sortedness of a field? Does an operation change the distribution of data?
How many input relations does an operation require?
Does the operator produce an output? (By specification, relational operations are limited to a single output at this time.)
What is the schema and field ordering of an output (see emit below)?
A relational operation uses field references to access specific fields of the input stream. Field references are always ordinal based on the order of the incoming streams. Each relational operation must declare the order of its output data. To simplify things, each relational operation can be in one of two modes:
Direct output: The order of outputs is based on the definition declared by the relational operation.
Remap: A listed ordering of the direct outputs. This remapping can also be used to drop columns no longer used (such as a filter field or join keys after a join). Note that remapping/exclusion can only be done at the output’s root struct. Filtering of compound values or extracting subsets must be done through other operation types (e.g. projection). A small sketch of the remap mode follows.
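As a loose illustration (not a Substrait API), applying a remap to a record is just an ordinal projection over the direct output:

# Direct output of some operation, as a tuple of field values.
direct_output = ("order_1", 10.5, "2024-01-01", True)

# Remap: keep only fields 1 and 0, in that order, dropping the rest.
emit = [1, 0]
remapped = tuple(direct_output[i] for i in emit)  # (10.5, 'order_1')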
When data is partitioned across multiple sibling sets, distribution describes the set of properties that apply to any one partition. This is based on a set of distribution expression properties. A distribution is declared as a set of one or more fields and a distribution type across all fields.
Property
Description
Required
Distribution Fields
List of field references that describe the distribution (e.g. [0,2:4,5:0:0]). The order of these references does not impact results.
Required for partitioned distribution type. Disallowed for singleton distribution type.
Distribution Type
PARTITIONED: For a discrete tuple of values for the declared distribution fields, all records with that tuple are located in the same partition. SINGLETON: there will only be a single partition for this operation.
A guarantee that data output from this operation is provided with a sort order. The sort order is declared using a set of sort field definitions over the emitted output of this operation.
Property
Description
Required
Sort Fields
A list of fields that the data are ordered by. The list is in order of the sort. If we sort by [0,1] then this means we only consider the data for field 1 to be ordered within each discrete value of field 0.
At least one required.
Per - Sort Field
A field reference that the data is sorted by.
Required
Per - Sort Direction
The direction of the sort. See direction options below.
Direction
Description
Nulls Position
Ascending
Returns data in ascending order based on the ordering function associated with the type. Nulls are included before any values.
First
Descending
Returns data in descending order based on the ordering function associated with the type. Nulls are included before any values.
First
Ascending
Returns data in ascending order based on the ordering function associated with the type. Nulls are included after any values.
Last
Descending
Returns data in descending order based on the ordering function associated with the type. Nulls are included after any values.
Last
Custom function identifier
Returns data using a custom function that returns -1, 0, or 1 depending on the order of the data.
Per Function
Clustered
Ensures that all equal values are coalesced (but no ordering between values is defined). E.g. for values 1,2,3,1,2,3, output could be any of the following: 1,1,2,2,3,3 or 1,1,3,3,2,2 or 2,2,1,1,3,3 or 2,2,3,3,1,1 or 3,3,1,1,2,2 or 3,3,2,2,1,1.
N/A, may appear anywhere but will be coalesced.
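For intuition, the four ascending/descending and nulls-first/last combinations above can be emulated in Python roughly as follows (a sketch over simple comparable values, ignoring type-specific ordering functions):

def sort_rows(rows, ascending=True, nulls_first=True):
    # Split out nulls, then order the remaining values using the value
    # type's natural ordering.
    nulls = [r for r in rows if r is None]
    values = sorted((r for r in rows if r is not None), reverse=not ascending)
    return nulls + values if nulls_first else values + nulls

sort_rows([3, None, 1], ascending=True, nulls_first=True)    # [None, 1, 3]
sort_rows([3, None, 1], ascending=False, nulls_first=False)  # [3, 1, None]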
Discussion Points
Should read definition types be more extensible in the same way that function signatures are? Are extensible read definition types necessary if we have custom relational operators?
How are decomposed reads expressed? For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object.
Embedded relations allow a Substrait producer to define a set operation that will be embedded in the plan.
TODO: define details about interfaces, languages, formats, etc. This should reasonably be an extension of embedded user-defined table functions.
The read operator is an operator that produces one output. A simple example would be the reading of a Parquet file. It is expected that many types of reads will be added over time.
Signature
Value
Inputs
0
Outputs
1
Property Maintenance
N/A (no inputs)
Direct Output Order
Defaults to the schema of the data read after the optional projection (masked complex expression) is applied.
The read relation has two different filter properties: a filter, which must be satisfied by the operator, and a best-effort filter, which does not have to be satisfied. This reflects the way that consumers are often implemented. A consumer is often only able to fully apply a limited set of operations in the scan. There can then be an extended set of operations which a consumer can apply in a best-effort fashion. A producer, when setting these two fields, should take care to only use expressions that the consumer is capable of handling.
As an example, a consumer may only be able to fully apply (in the read relation) <, =, and > on integral types. The consumer may be able to apply <, =, and > in a best-effort fashion on decimal and string types. Consider the filter expression my_int < 10 && my_string < "x" && upper(my_string) > "B". In this case the filter should be set to my_int < 10, the best_effort_filter should be set to my_string < "x", and the remaining portion (upper(my_string) > "B") should be put into a filter relation.
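A producer implementing the split described in this example might bucket each conjunct by consumer capability, along these lines (a hypothetical helper, not part of any Substrait library):

def split_filters(conjuncts, fully_supported, best_effort_supported):
    # `conjuncts` are the AND-ed pieces of the overall filter expression;
    # the two predicates describe what the target consumer can handle.
    filter_, best_effort, leftover = [], [], []
    for c in conjuncts:
        if fully_supported(c):
            filter_.append(c)          # goes into the read relation's filter
        elif best_effort_supported(c):
            best_effort.append(c)      # goes into best_effort_filter
        else:
            leftover.append(c)         # wrapped in a separate filter relation
    return filter_, best_effort, leftover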
A filter expression must be interpreted against the direct schema before the projection expression has been applied. As a result, fields may be referenced by the filter expression which are not included in the relation’s output.
A named table is a reference to data defined elsewhere. For example, there may be a catalog of tables with unique names that both the producer and consumer agree on. This catalog would provide the consumer with more information on how to retrieve the data.
Property
Description
Required
Names
A list of namespaced strings that, together, form the table name
A read operation is allowed to only read part of a file. This is convenient, for example, when distributing a read operation across several nodes. The slicing parameters are specified as byte offsets into the file.
Many file formats consist of indivisible “chunks” of data (e.g. Parquet row groups). If this happens the consumer can determine which slice a particular chunk belongs to. For example, one possible approach is that a chunk should only be read if the midpoint of the chunk (dividing by 2 and rounding down) is contained within the asked-for byte range.
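The midpoint rule mentioned above can be stated precisely in a couple of lines (a sketch of one possible approach, as the text notes):

def chunk_in_slice(chunk_start, chunk_end, slice_start, slice_end):
    # Read the chunk only if its midpoint (rounded down) falls in the slice.
    midpoint = (chunk_start + chunk_end) // 2
    return slice_start <= midpoint < slice_end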
There is no true distinction between logical and physical operations in Substrait. By convention, certain operations are classified as physical, but all operations can potentially be used in any kind of plan. A particular set of transformations or target operators may (by convention) be considered the “physical plan”, but this is a characteristic of the system consuming Substrait as opposed to a definition within Substrait.
The hash equijoin operator will build a hash table out of the right input based on a set of join keys. It will then probe that hash table for incoming inputs, finding matches.
Signature
Value
Inputs
2
Outputs
1
Property Maintenance
Distribution is maintained. Orderedness of the left set is maintained in INNER join cases, otherwise it is eliminated.
Left Keys
References to the fields to join on in the left input.
Required
Right Keys
References to the fields to join on in the right input.
Required
Post Join Predicate
An additional expression that can be used to reduce the output of the join operation after the equality condition is applied. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys.
Optional; defaults to true.
Join Type
One of the join types defined in the Join operator.
The nested loop join operator does a join by holding the entire right input and then iterating over it using the left input, evaluating the join expression on the Cartesian product of all rows, only outputting rows where the expression is true. Will also include non-matching rows in the OUTER, LEFT and RIGHT operations per the join type requirements.
Signature
Value
Inputs
2
Outputs
1
Property Maintenance
Distribution is maintained. Orderedness is eliminated.
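The inner-join core of this operator fits in one line of Python; the sketch below is illustrative only (the OUTER, LEFT, and RIGHT variants additionally emit non-matching rows as described above):

def nested_loop_join(left, right, predicate):
    # Evaluate the join expression over the Cartesian product of all rows,
    # keeping only the rows where it is true.
    return [l + r for l in left for r in right if predicate(l, r)]

# nested_loop_join([(1,), (2,)], [("a",), ("b",)], lambda l, r: l[0] == 2)
# -> [(2, 'a'), (2, 'b')]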
The merge equijoin does a join by taking advantage of two sets that are sorted on the join keys. This allows the join operation to be done in a streaming fashion.
Signature
Value
Inputs
2
Outputs
1
Property Maintenance
Distribution is maintained. Orderedness is eliminated.
Left Keys
References to the fields to join on in the left input.
Required
Right Keys
References to the fields to join on in the right input.
Required
Post Join Predicate
An additional expression that can be used to reduce the output of the join operation after the equality condition is applied. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys.
Optional; defaults to true.
Join Type
One of the join types defined in the Join operator.
The exchange operator will redistribute data based on an exchange type definition. Applying this operation will lead to an output that presents the desired distribution.
Signature
Value
Inputs
1
Outputs
1
Property Maintenance
Orderedness is maintained. Distribution is overwritten based on configuration.
Scatter
Distribute data using a system defined hashing function that considers one or more fields. For the same type of fields and same ordering of values, the same partition target should be identified for different ExchangeRels.
Single Bucket
Define an expression that provides a single i32 bucket number. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition.
Multi Bucket
Define an expression that provides a List<i32> of bucket numbers. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition. The records should be sent to all bucket numbers provided by the expression.
Broadcast
Send all records to all partitions.
Round Robin
Send records to each target in sequence. Can follow either exact or approximate behavior. Approximate will attempt to balance the number of records sent to each destination but may not exactly distribute evenly and may send batches of records to each target before moving to the next.
Partition Count
Optional. If not defined, the implementing system should decide the number of partitions. Note that when not defined, single or multi bucket expressions should not be constrained to count.
Expression Mapping
Describes a relationship between each partition ID and the destination that partition should be sent to.
Optional. A partition may be sent to 0..N locations. Value can either be a URI or arbitrary value.
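For the single- and multi-bucket exchange kinds above, the modulo fallback might look like this sketch (illustrative only, not a Substrait API):

def target_partition(bucket, partition_count, bounded=False):
    # If the bucket expression is not guaranteed to stay within the partition
    # count, the system should modulo the value to pick a target partition.
    return bucket if bounded else bucket % partition_count

# Multi bucket: the record is sent to every resulting partition.
buckets = [3, 7, 12]
targets = {target_partition(b, partition_count=8) for b in buckets}  # {3, 7, 4}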
The top-N operator reorders a dataset based on one or more identified sort fields as well as a sorting function. Rather than sort the entire dataset, the top-N will only maintain the total number of records required to ensure a limited output. A top-N is a combination of the logical sort and logical fetch operations.
Signature
Value
Inputs
1
Outputs
1
Property Maintenance
Will update orderedness property to the output of the sort operation. Distribution property only remapped based on emit.
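The “only maintain the records required” property is what distinguishes top-N from a full sort; a common realization is a bounded heap, sketched here in Python (not prescribed by the specification):

import heapq

def top_n(records, n, key):
    # heapq.nsmallest keeps at most n records in memory at a time,
    # rather than sorting the entire dataset.
    return heapq.nsmallest(n, records, key=key)

top_n([5, 1, 9, 3, 7], n=2, key=lambda x: x)  # [1, 3]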
The streaming aggregate operation leverages data ordered by the grouping expressions to calculate each grouping set tuple-by-tuple in a streaming fashion. All grouping sets and orderings requested on each aggregate must be compatible to allow multiple grouping sets or aggregate orderings.
Signature
Value
Inputs
1
Outputs
1
Property Maintenance
Maintains distribution if all distribution fields are contained in every grouping set. Maintains input ordering.
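The streaming property follows from the input ordering: each group is contiguous, so it can be finalized as soon as it ends. A minimal Python sketch (our helper names, not a Substrait API):

from itertools import groupby

def streaming_aggregate(rows, group_key, aggregate):
    # Input must already be ordered by the grouping expressions, so each
    # group can be emitted without buffering the whole dataset.
    for key, group in groupby(rows, key=group_key):
        yield key, aggregate(list(group))

# list(streaming_aggregate([("a", 1), ("a", 2), ("b", 5)],
#                          group_key=lambda r: r[0],
#                          aggregate=lambda g: sum(v for _, v in g)))
# -> [('a', 3), ('b', 5)]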
A consistent partition window operation is a special type of project operation where every function is a window function and all of the window functions share the same sorting and partitioning. This allows for the sort and partition to be calculated once and shared between the various function evaluations.
Signature
Value
Inputs
1
Outputs
1
Property Maintenance
Maintains distribution and ordering.
Direct Output Order
Same as Project operator (input followed by each window expression).
The expand operation creates duplicates of input records based on the Expand Fields. Each Expand Field can be a Switching Field or an expression. Switching Fields are described below. If an Expand Field is an expression then its value is consistent across all duplicate rows.
Signature
Value
Inputs
1
Outputs
1
Property Maintenance
Distribution is maintained if all the distribution fields are consistent fields with direct references. Ordering can only be maintained down to the level of consistent fields that are kept.
Direct Output Order
The expand fields followed by an i32 column describing the index of the duplicate that the row is derived from.
A switching field is a field whose value is different in each duplicated row. All switching fields in an Expand Operation must have the same number of duplicates.
Property
Description
Required
Duplicates
List of one or more expressions. The output will contain a row for each expression.
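As a sketch of the expand semantics (consistent expressions repeat on every duplicate, switching fields vary per duplicate, and a trailing i32 records the duplicate index), here is hypothetical code that separates consistent expressions from switching fields for clarity; in a real plan the expand fields appear in their declared order:

def expand(rows, consistent_exprs, switching_fields):
    # `switching_fields` is a list of per-field duplicate lists, all of which
    # must have the same length (the number of duplicates).
    n_duplicates = len(switching_fields[0])
    out = []
    for row in rows:
        for i in range(n_duplicates):
            out.append(
                tuple(expr(row) for expr in consistent_exprs)
                + tuple(field[i](row) for field in switching_fields)
                + (i,)  # i32 column: index of the duplicate this row derives from
            )
    return out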
There is no true distinction between logical and physical operations in Substrait. By convention, certain operations are classified as physical, but all operations can be potentially used in any kind of plan. A particular set of transformations or target operators may (by convention) be considered the “physical plan” but this is a characteristic of the system consuming substrait as opposed to a definition within Substrait.
The hash equijoin join operator will build a hash table out of the right input based on a set of join keys. It will then probe that hash table for incoming inputs, finding matches.
Signature
Value
Inputs
2
Outputs
1
Property Maintenance
Distribution is maintained. Orderedness of the left set is maintained in INNER join cases, otherwise it is eliminated.
References to the fields to join on in the left input.
Required
Right Keys
References to the fields to join on in the right input.
Required
Post Join Predicate
An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys.
Optional, defaults true.
Join Type
One of the join types defined in the Join operator.
The nested loop join operator does a join by holding the entire right input and then iterating over it using the left input, evaluating the join expression on the Cartesian product of all rows, only outputting rows where the expression is true. Will also include non-matching rows in the OUTER, LEFT and RIGHT operations per the join type requirements.
Signature
Value
Inputs
2
Outputs
1
Property Maintenance
Distribution is maintained. Orderedness is eliminated.
The merge equijoin does a join by taking advantage of two sets that are sorted on the join keys. This allows the join operation to be done in a streaming fashion.
Signature
Value
Inputs
2
Outputs
1
Property Maintenance
Distribution is maintained. Orderedness is eliminated.
References to the fields to join on in the left input.
Required
Right Keys
References to the fields to join on in the right input.
Reauired
Post Join Predicate
An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys.
Optional, defaults true.
Join Type
One of the join types defined in the Join operator.
The exchange operator will redistribute data based on an exchange type definition. Applying this operation will lead to an output that presents the desired distribution.
Signature
Value
Inputs
1
Outputs
1
Property Maintenance
Orderedness is maintained. Distribution is overwritten based on configuration.
Distribute data using a system defined hashing function that considers one or more fields. For the same type of fields and same ordering of values, the same partition target should be identified for different ExchangeRels
Single Bucket
Define an expression that provides a single i32 bucket number. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition.
Multi Bucket
Define an expression that provides a List<i32> of bucket numbers. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition. The records should be sent to all bucket numbers provided by the expression.
Broadcast
Send all records to all partitions.
Round Robin
Send records to each target in sequence. Can follow either exact or approximate behavior. Approximate will attempt to balance the number of records sent to each destination but may not exactly distribute evenly and may send batches of records to each target before moving to the next.
Optional. If not defined, implementation system should decide the number of partitions. Note that when not defined, single or multi bucket expressions should not be constrained to count.
Expression Mapping
Describes a relationship between each partition ID and the destination that partition should be sent to.
Optional. A partition may be sent to 0..N locations. Value can either be a URI or arbitrary value.
The top-N operator reorders a dataset based on one or more identified sort fields as well as a sorting function. Rather than sort the entire dataset, the top-N will only maintain the total number of records required to ensure a limited output. A top-n is a combination of a logical sort and logical fetch operations.
Signature
Value
Inputs
1
Outputs
1
Property Maintenance
Will update orderedness property to the output of the sort operation. Distribution property only remapped based on emit.
The streaming aggregate operation leverages data ordered by the grouping expressions to calculate data each grouping set tuple-by-tuple in streaming fashion. All grouping sets and orderings requested on each aggregate must be compatible to allow multiple grouping sets or aggregate orderings.
Signature
Value
Inputs
1
Outputs
1
Property Maintenance
Maintains distribution if all distribution fields are contained in every grouping set. Maintains input ordering.
A consistent partition window operation is a special type of project operation where every function is a window function and all of the window functions share the same sorting and partitioning. This allows for the sort and partition to be calculated once and shared between the various function evaluations.
| Signature | Value |
| --- | --- |
| Inputs | 1 |
| Outputs | 1 |
| Property Maintenance | Maintains distribution and ordering. |
| Direct Output Order | Same as the Project operator (input followed by each window expression). |
The expand operation creates duplicates of input records based on the Expand Fields. Each Expand Field can be a Switching Field or an expression. Switching Fields are described below. If an Expand Field is an expression then its value is consistent across all duplicate rows.
| Signature | Value |
| --- | --- |
| Inputs | 1 |
| Outputs | 1 |
| Property Maintenance | Distribution is maintained if all the distribution fields are consistent fields with direct references. Ordering can only be maintained down to the level of consistent fields that are kept. |
| Direct Output Order | The expand fields, followed by an i32 column describing the index of the duplicate that the row is derived from. |
A switching field is a field whose value is different in each duplicated row. All switching fields in an Expand Operation must have the same number of duplicates.
| Property | Description | Required |
| --- | --- | --- |
| Duplicates | List of one or more expressions. The output will contain a row for each expression. | |
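As a small worked example (illustrative), consider an expand with a single switching field whose duplicates list is [a, b]:

input rows:       (a=1, b=10), (a=2, b=20)
switching field:  duplicates = [a, b]
output rows:      (1, 0), (10, 1), (2, 0), (20, 1)    # switching value, duplicate index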
Substrait is designed to be serialized into various different formats. Currently we support a binary serialization for transmission of plans between programs (e.g. IPC or network communication) and a text serialization for debugging and human readability. Other formats may be added in the future.
These formats serialize a collection of plans. Substrait does not define how a collection of plans is to be interpreted. For example, the following scenarios are all valid uses of a collection of plans:
A query engine receives a plan and executes it. It receives a collection of plans with a single root plan. The top-level node of the root plan defines the output of the query. Non-root plans may be included as common subplans which are referenced from the root plan.
A transpiler may convert plans from one dialect to another. It could take, as input, a single root plan. Then it could output a serialized binary containing multiple root plans. Each root plan is a representation of the input plan in a different dialect.
A distributed scheduler might expect 1+ root plans. Each root plan describes a different stage of computation.
Libraries should make sure to thoroughly describe the way plan collections will be produced or consumed.
We often refer to query plans as a graph of nodes (typically a DAG unless the query is recursive). However, we encode this graph as a collection of trees with a single root tree that references other trees (which may also transitively reference other trees). Plan serializations all have some way to indicate which plan(s) are “root” plans. Any plan that is not a root plan and is not referenced (directly or transitively) by some root plan can safely be ignored.
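As an illustrative sketch (not a normative example), a serialized collection with a single root plan that reads a hypothetical table named example_table looks roughly like this in Protobuf JSON; the "root" wrapper is what flags the plan as a root:

{
  "relations": [
    {
      "root": {
        "input": {
          "read": {
            "baseSchema": {
              "names": ["a"],
              "struct": { "types": [{ "i32": {} }] }
            },
            "namedTable": { "names": ["example_table"] }
          }
        },
        "names": ["a"]
      }
    }
  ]
}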
Substrait can be serialized into a protobuf-based binary representation. The proto schema/IDL files can be found on GitHub. Proto files are placed in the io.substrait namespace for C++/Java and the Substrait.Protobuf namespace for C#.
The main top-level object used to communicate a Substrait plan using protobuf is a Plan message (see ExtendedExpression for an alternative top-level object). The plan message is composed of a set of data structures that minimize repetition in the serialization, along with one (or more) Relation trees.
message Plan {
  // Substrait version of the plan. Optional up to 0.17.0, required for later
  // versions.
  Version version = 6;
  ...
}
To maximize the new user experience, it is important for Substrait to have a text representation of plans. This allows people to experiment with basic tooling. Building simple CLI tools that do things like SQL > Plan and Plan > SQL, or REPL plan construction, can all be done relatively straightforwardly with a text representation.
The recommended text serialization format is JSON. Since the text format is not designed for performance, it can be produced to maximize readability. This also allows nice symmetry between the construction of plans and the configuration of various extensions such as function signatures and user-defined types.
To ensure the JSON is valid, the object will be defined using the OpenAPI 3.1 specification. This not only allows strong validation; the OpenAPI specification also enables code generators to be easily used to produce plans in many languages.
While JSON will be used for much of the plan serialization, Substrait uses a simple custom grammar for record-level expressions. While one can construct an equation such as (10 + 5)/2 using a tree of function and literal objects, a plan is much more human-readable when the information is written similarly to the way one typically consumes scalar expressions. This grammar will be maintained as an ANTLR grammar (targetable to multiple programming languages) and is also planned to be supported via a JSON schema definition format tag so that the grammar can be validated as part of the schema validation.
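As a rough illustration of why the grammar exists (a sketch only; the normative syntax lives in the ANTLR grammar), compare the two spellings of the same expression:

divide(add(10, 5), 2)    # as a nested tree of function and literal objects
(10 + 5) / 2             # as a record-level expression in the text grammar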
Substrait is a community project and requires consensus about new additions to the specification in order to maintain consistency. The best way to get consensus is to discuss ideas. The main ways to communicate are:
For complex features it is useful to discuss the change first. It will be useful to gather some background information to help get everyone on the same page.
Every engine has its own terminology. Every Spark user probably knows what an “attribute” is. Velox users will know what a “RowVector” means. Etc. However, Substrait is used by people that come from a variety of backgrounds and you should generally assume that its users do not know anything about your own implementation. As a result, all PRs and discussion should endeavor to use Substrait terminology wherever possible.
What problems does this relation solve? If it is a more logical relation, how does it allow users to express new capabilities? If it is more of an internal relation, how does it map to existing logical relations? How is it different from other existing relations? Why do we need this?
Provide example input and output for the relation. Show example plans. Try to motivate your examples, as much as possible, with something that looks like a real-world problem. These will go a long way towards helping others understand the purpose of a relation.
Substrait is designed around interoperability, so a feature only used by a single system may not be accepted. But don’t despair! Substrait has a highly developed extension system for this express purpose.
If you are hoping to add a feature and these criteria seem intimidating then feel free to start a mailing list discussion before you have all the information and ask for help. Investigating other implementations, in particular, is something that can be quite difficult to do on your own.
The specification has passed the initial design phase and is now in the final stages of being fleshed out. The community is encouraged to identify (and address) any perceived gaps in functionality using GitHub issues and PRs. Once all of the planned implementations have been completed all deprecated fields will be eliminated and version 1.0 will be released.
A way to describe the set of basic types that will be operated on within a plan. Only includes simple types such as integers and doubles (nothing configurable or compound).
Expression of types that go beyond simple scalar values. Key concepts here include: configurable types such as fixed length and numeric types as well as compound types such as structs, maps, lists, etc.
Specialized expression types that are more naturally expressed outside the function paradigm. Examples include items such as if/then/else and switch statements.
Functions that are expressed in aggregation operations. Examples include things such as SUM, COUNT, etc. Operations take many records and collapse them into a single (possibly compound) value.
Reusable named functions that are built beyond the core specification. Implementations are typically registered through external means (drop a file in a directory, send a special command with the implementation, etc.).
Function implementations embedded directly within the plan. Frequently used in data science workflows where business logic is interspersed with standard operations.
Specific execution sub-variations of common relational operations; a single logical operation may have multiple unique physical variants. Examples include hash join, merge join, nested loop join, etc.
Provide a good suite of well-specified common functionality in databases and data science applications.
Make it easy for users to privately or publicly extend the representation to support specialized/custom operations.
Produce something that is language agnostic and requires minimal work to start developing against in a new language.
Drive towards a common format that avoids specialization for a single favorite producer or consumer.
Establish clear delineation between specifications that MUST be respected and those that can be optionally ignored.
Establish a forgiving compatibility approach and versioning scheme that supports cross-version compatibility in the maximum number of cases.
Minimize the need for consumer intelligence by excluding concepts like overloading, type coercion, implicit casting, field name handling, etc. (Note: this is weak and should be better stated.)
Decomposability/severability: A particular producer or consumer should be able to produce or consume only a subset of the specification and interact well with any other Substrait system, as long as the specific operations requested fit within the subset of the specification supported by the counterpart system.
As an interface specification, the goal of Substrait is to reach a point where (breaking) changes will never need to happen again, or at least be few and far between. By analogy, Apache Arrow’s in-memory format specification has stayed functionally constant, despite many major library versions being released. However, we’re not there yet. When we believe that we’ve reached this point, we will signal this by releasing version 1.0.0. Until then, we will remain in the 0.x.x version regime.
Despite this, we strive to maintain backward compatibility for both the binary representation and the text representation by means of deprecation. When a breaking change cannot be reasonably avoided, we may remove previously deprecated fields. All deprecated fields will be removed for the 1.0.0 release.
Substrait uses semantic versioning for its version numbers, with the addition that, during 0.x.y, we increment the x digit for breaking changes and new features, and the y digit for fixes and other nonfunctional changes. The release process is currently automated and makes a new release every week, provided something has changed on the main branch since the previous release. This release cadence will likely be slowed down as stability increases over time. Conventional commits are used to distinguish between breaking changes, new features, and fixes, and GitHub Actions are used to verify that there are indeed no breaking protobuf changes in a commit, unless the commit message declares them.
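For example (the commit subjects below are illustrative), conventional commit prefixes map to version bumps roughly as follows during 0.x.y:

fix: correct a typo in a message comment      # y bump (fix / nonfunctional change)
feat: add a new optional field                # x bump (new feature)
feat!: remove a previously deprecated field   # x bump, explicitly flagged as breaking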
The substrait-tools Python package provides a command line interface for producing/consuming Substrait plans by leveraging the APIs from different producers and consumers.
This is an introductory tutorial to learn the basics of Substrait for readers already familiar with SQL. We will look at how to construct a Substrait plan from an example query.
We’ll present Substrait plans in JSON form to make them relatively readable to newcomers. Typically Substrait is exchanged as a protobuf message, but for debugging purposes it is often helpful to look at a serialized form. Plus, it’s not uncommon for unit tests to represent plans as JSON strings. So if you are developing with Substrait, it’s useful to have experience reading them.
Note
Substrait is currently only defined with Protobuf. The JSON provided here is the Protobuf JSON output, but it is not the official Substrait text format. Eventually, Substrait will define its own human-readable text format, but for now this tutorial will make do with what Protobuf provides.
Substrait is designed to communicate plans (mostly logical plans). Those plans contain types, schemas, expressions, extensions, and relations. We’ll look at them in that order, going from simplest to most complex until we can construct full plans.
This tutorial won’t cover all the details of each piece, but it will give you an idea of how they connect together. For a detailed reference of each individual field, the best place to look is reading the protobuf definitions. They represent the source-of-truth of the spec and are well-commented to address ambiguities.
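For a first taste of the Protobuf JSON output, here is how a nullable 32-bit integer type renders (a small sketch; enum values are spelled out in Protobuf JSON):

{ "i32": { "nullability": "NULLABILITY_NULLABLE" } }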
In Substrait, the “class” of a type, not to be confused with the concept from object-oriented programming, defines the set of non-null values that instances of a type may assume.
Implementations of a Substrait type must support at least this set of values, but may include more; for example, an i8 could be represented using the same in-memory format as an i32, as long as functions operating on i8 values within [-128..127] behave as specified (in this case, this means 8-bit overflow must work as expected). Operating on values outside the specified range is unspecified behavior.
Simple type classes are those that don’t support any form of configuration. For simplicity, any generic type that has only a small number of discrete implementations is declared directly, as opposed to via configuration.
| Type Name | Description | Protobuf representation for literals |
| --- | --- | --- |
| boolean | A value that is either True or False. | bool |
| i8 | A signed integer within [-128..127], typically represented as an 8-bit two’s complement number. | int32 |
| i16 | A signed integer within [-32,768..32,767], typically represented as a 16-bit two’s complement number. | int32 |
| i32 | A signed integer within [-2,147,483,648..2,147,483,647], typically represented as a 32-bit two’s complement number. | int32 |
| i64 | A signed integer within [−9,223,372,036,854,775,808..9,223,372,036,854,775,807], typically represented as a 64-bit two’s complement number. | int64 |
| string | A unicode string of text, [0..2,147,483,647] UTF-8 bytes in length. | string |
| binary | A binary value, [0..2,147,483,647] bytes in length. | binary |
| timestamp | A naive timestamp with microsecond precision. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | int64 microseconds since 1970-01-01 00:00:00.000000 (in an unspecified timezone) |
| timestamp_tz | A timezone-aware timestamp with microsecond precision. Similar to aware datetime in Python. | int64 microseconds since 1970-01-01 00:00:00.000000 UTC |
| date | A date within [1000-01-01..9999-12-31]. | int32 days since 1970-01-01 |
| time | A time since the beginning of any day. Range of [0..86,399,999,999] microseconds; leap seconds need not be supported. | int64 microseconds past midnight |
| interval_year | Interval year to month. Supports a range of [-10,000..10,000] years with month precision (= [-120,000..120,000] months). Usually stored as separate integers for years and months, but only the total number of months is significant, i.e. 1y 0m is considered equal to 0y 12m or 1001y -12000m. | int32 years and int32 months, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. -10000y 200000m is not allowed) |
| interval_day | Interval day to second. Supports a range of [-3,650,000..3,650,000] days with microsecond precision (= [-315,360,000,000,000,000..315,360,000,000,000,000] microseconds). Usually stored as separate integers for various components, but only the total number of microseconds is significant, i.e. 1d 0s is considered equal to 0d 86400s. | int32 days, int32 seconds, and int32 microseconds, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. 3650001d -86400s 0us is not allowed) |
| uuid | A universally-unique identifier composed of 128 bits. Typically presented to users in the following hexadecimal format: c48ffa9e-64f4-44cb-ae47-152b4e60e77b. Any 128-bit value is allowed, without specific adherence to RFC4122. | 16-byte bytes |
Compound type classes are type classes that need to be configured by means of a parameter pack.
| Type Name | Description | Protobuf representation for literals |
| --- | --- | --- |
| FIXEDCHAR<L> | A fixed-length unicode string of L characters. L must be within [1..2,147,483,647]. | L-character string |
| VARCHAR<L> | A unicode string of at most L characters. L must be within [1..2,147,483,647]. | string with at most L characters |
| FIXEDBINARY<L> | A binary string of L bytes. When casting, values shorter than L are padded with zeros, and values longer than L are right-trimmed. | L-byte bytes |
| DECIMAL<P, S> | A fixed-precision decimal value having precision (P, number of digits) <= 38 and scale (S, number of fractional digits) 0 <= S <= P. | 16-byte bytes representing a little-endian 128-bit integer, to be divided by 10^S to get the decimal value |
| STRUCT<T1,…,Tn> | A list of types in a defined order. | repeated Literal, types matching T1..Tn |
| NSTRUCT<N:T1,…,N:Tn> | Pseudo-type: a struct that maps unique names to value types. Each name is a UTF-8-encoded string. Each value can have a distinct type. Note that NSTRUCT is actually a pseudo-type, because Substrait’s core type system is based entirely on ordinal positions, not named fields. Nonetheless, when working with systems outside Substrait, names are important. | n/a |
| LIST<T> | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | repeated Literal, all types matching T |
| MAP<K, V> | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | repeated KeyValue (in turn two Literals), all key types matching K and all value types matching V |
| PRECISIONTIMESTAMP<P> | A timestamp with fractional second precision (P, number of digits) 0 <= P <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | int64 seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone) |
| PRECISIONTIMESTAMPTZ<P> | A timezone-aware timestamp, with fractional second precision (P, number of digits) 0 <= P <= 9. Similar to aware datetime in Python. | int64 seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 UTC |
User-defined type classes are defined as part of simple extensions. An extension can declare an arbitrary number of user-defined extension types. Once a type has been declared, it can be used in function declarations.
For example, the following declares a type named point (namespaced to the associated YAML file) and two scalar functions that operate on it.
types:
  - name: "point"
scalar_functions:
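The snippet above is truncated; as a sketch, a fuller declaration could look like the following (the structure fields and the two function definitions here are illustrative assumptions, not the canonical example):

types:
  - name: "point"
    structure:          # hypothetical structure for illustration
      latitude: i32
      longitude: i32

scalar_functions:
  - name: "distance"    # hypothetical function operating on two points
    description: "The distance between two points."
    impls:
      - args:
          - name: a
            value: u!point    # 'u!' marks a user-defined type
          - name: b
            value: u!point
        return: f64
  - name: "get_latitude"      # hypothetical accessor function
    description: "Extracts the latitude of a point."
    impls:
      - args:
          - name: p
            value: u!point
        return: i32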
In many places, it is useful to have a human-readable string representation of data types. Substrait has a custom syntax for type declaration. The basic structure of a type declaration is:
name?[variation]<param0,...,paramN>
The components of this expression are:
| Component | Description | Required |
| --- | --- | --- |
| Name | Each type has a name; a type is expressed by providing that name. The name can be expressed in arbitrary case (e.g. varchar and vArChAr are equivalent), although lowercase is preferred. | Required |
| Nullability indicator | A type is either non-nullable or nullable. To express nullability, a question mark is added after the type name (before any parameters). | Optional, defaults to non-nullable |
| Variation | When expressing a type, a user can define the type based on a type variation. Some systems use type variations to describe different underlying representations of the same data type. This is expressed as a bracketed integer such as [2]. | Optional, defaults to [0] |
| Parameters | Compound types may have one or more configurable properties. The two main kinds of properties are integer and type properties. The parameters for each type correspond to a list of known properties associated with a type, declared in the order defined in the type specification. For compound types (types that contain types), the type syntax includes nested type declarations. The one exception is structs, which are further outlined below. | Compound types only |
It is relatively easy in most languages to produce simple parsers & emitters for the type syntax. To make that easier, Substrait also includes an ANTLR grammar to ease consumption and production of types. (The grammar also supports an entire language for representing plans as text.)
Structs are unique from other types because they have an arbitrary number of parameters. The parameters are recursive and may include their own subproperties. Struct parsing is declared in the following two ways:
# Struct
struct?[variation]<type0, type1,..., typeN>

# Named struct
nstruct?[variation]<name0:type0, name1:type1,..., nameN:typeN>
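For example, the following are all well-formed type strings under this syntax:

i32                     # non-nullable 32-bit integer
i32?                    # nullable 32-bit integer
varchar<10>             # unicode string of at most 10 characters
decimal?<38, 2>         # nullable decimal with precision 38 and scale 2
struct<i32, string?>    # struct of a non-nullable i32 and a nullable string
list<decimal<10, 2>>    # list of non-nullable decimals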
| Property | Applies To | Examples | Description |
| --- | --- | --- | --- |
| Type Class | Always | | Together with the parameter pack, describes the set of non-null values supported by the type. Subdivided into simple and compound type classes. |
| Nullability | Always | Either NULLABLE (? suffix) or REQUIRED (no suffix) | Describes whether values of this type can be null. Note that null is considered to be a special value of a nullable type, rather than the only value of a special null type. |
| Variation | Always | No suffix or explicitly [0] (system-preferred), or an extension | Allows different variations of the same type class to exist in a system at a time, usually distinguished by in-memory format. |
| Parameters | Compound types only | <10, 2> (for DECIMAL), <i32, string> (for STRUCT) | Some combination of zero or more data types or integers. The expected set of parameters and the significance of each parameter depends on the type class. |
Refer to Type Parsing for a description of the syntax used to describe types.
Note
Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via cast expressions.
Type variations may be used to represent differences in representation between different consumers. For example, an engine might support dictionary encoding for a string, or could be using either a row-wise or columnar representation of a struct. All variations of a type are expected to have the same semantics when operated on by functions or other expressions.
All variations except the “system-preferred” variation (a.k.a. [0], see Type Parsing) must be defined using simple extensions. The key properties of these variations are:
| Property | Description |
| --- | --- |
| Base Type Class | The type class that this variation belongs to. |
| Name | The name used to reference this type variation. Should be unique among the type variations for this parent type within a simple extension. |
| Description | A human-readable description of the purpose of this type variation. |
| Function Behavior | INHERITS or SEPARATE: whether functions that support the system-preferred variation implicitly also support this variation, or whether functions should be resolved independently. For example, if one has the function add(i8,i8) defined and then defines an i8 variation, this determines whether the i8 variation can be bound to the base add operation (INHERITS) or whether a specialized version of add needs to be defined specifically for this variation (SEPARATE). Defaults to INHERITS. |
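As a sketch, a variation declaration in a simple extension YAML file might look like the following (the key names here mirror the properties above but should be checked against the simple-extensions schema; the dictionary_encoded variation is a hypothetical example):

type_variations:
  - parent: string                # base type class
    name: dictionary_encoded      # hypothetical variation name
    description: A string held in a dictionary-encoded in-memory representation.
    functions: INHERITS           # functions on string also accept this variation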