diff --git a/404.html b/404.html index aaa161c..89488ea 100644 --- a/404.html +++ b/404.html @@ -1,4 +1,4 @@ - Substrait: Cross-Language Serialization for Relational Algebra
GitHub

Substrait: Cross-Language Serialization for Relational Algebra

Project Vision

The Substrait project aims to create a well-defined, cross-language specification for data compute operations. The specification declares a set of common operations, defines their semantics, and describes their behavior unambiguously. The project also defines extension points and serialized representations of the specification.

In many ways, the goal of this project is similar to that of the Apache Arrow project. Arrow is focused on a standardized memory representation of columnar data. Substrait is focused on what should be done to data.

Why not use SQL?

SQL is a well-known language for describing queries against relational data. It is designed to be simple and to be read and written by humans. Substrait is not intended as a replacement for SQL; it works alongside SQL to provide capabilities that SQL lacks. SQL is not a great fit for the systems that actually satisfy the query because it does not provide sufficient detail and is not represented in a format that is easy to process. Because of this, most modern systems first translate the SQL query into a query plan, sometimes called the execution plan. There can be multiple levels of a query plan (e.g. physical and logical), a query plan may be split up and distributed across multiple systems, and a query plan often undergoes simplifying or optimizing transformations. The SQL standard does not define the format of the query or execution plan, and there is no open format that is supported by a broad set of systems. Substrait was created to provide a standard and open format for these query plans.

Why not just do this within an existing OSS project?

A key goal of the Substrait project is to not be coupled to any single existing technology. Trying to get people involved in something can be difficult when it seems to be primarily driven by the opinions and habits of a single community. In many ways, this situation is similar to the early situation with Arrow. The precursor to Arrow was the Apache Drill ValueVectors concept. As part of creating Arrow, Wes and Jacques recognized the need to create a new community to build a fresh consensus (beyond just what the Apache Drill community wanted). This separation and new independent community was a key ingredient in Arrow’s current success. The needs here are much the same: many separate communities could benefit from Substrait, but each has its own pain points, type systems, development processes and timelines. To help resolve these tensions, one of the approaches proposed in Substrait is to set a bar that at least two of the top four OSS data technologies (Arrow, Spark, Iceberg, Trino) support something before it is incorporated directly into the Substrait specification. (Another goal is to support strong extension points at key locations so that this bar does not limit broad adoption.)

  • Apache Calcite: Many ideas in Substrait are inspired by the Calcite project. Calcite is a great JVM-based SQL query parsing and optimization framework. A key goal of the Substrait project is to expose Calcite capabilities more easily to non-JVM technologies as well as expose query planning operations as microservices.
  • Apache Arrow: The Arrow format for data is what the Substrait specification attempts to be for compute expressions. A key goal of Substrait is to enable Substrait producers to execute work within the Arrow Rust and C++ compute kernels.

Why the name Substrait?

A strait is a narrow connector of water between two other pieces of water. In analytics, data is often thought of as water. Substrait is focused on instructions related to the data. In other words, what defines or supports the movement of water between one or more larger systems. Thus, the underlayment for the strait connecting different pools of water => sub-strait.

Substrait: Cross-Language Serialization for Relational Algebra

Project Vision

The Substrait project aims to create a well-defined, cross-language specification for data compute operations. The specification declares a set of common operations, defines their semantics, and describes their behavior unambiguously. The project also defines extension points and serialized representations of the specification.

In many ways, the goal of this project is similar to that of the Apache Arrow project. Arrow is focused on a standardized memory representation of columnar data. Substrait is focused on what should be done to data.

Why not use SQL?

SQL is a well-known language for describing queries against relational data. It is designed to be simple and to be read and written by humans. Substrait is not intended as a replacement for SQL; it works alongside SQL to provide capabilities that SQL lacks. SQL is not a great fit for the systems that actually satisfy the query because it does not provide sufficient detail and is not represented in a format that is easy to process. Because of this, most modern systems first translate the SQL query into a query plan, sometimes called the execution plan. There can be multiple levels of a query plan (e.g. physical and logical), a query plan may be split up and distributed across multiple systems, and a query plan often undergoes simplifying or optimizing transformations. The SQL standard does not define the format of the query or execution plan, and there is no open format that is supported by a broad set of systems. Substrait was created to provide a standard and open format for these query plans.

Why not just do this within an existing OSS project?

A key goal of the Substrait project is to not be coupled to any single existing technology. Trying to get people involved in something can be difficult when it seems to be primarily driven by the opinions and habits of a single community. In many ways, this situation is similar to the early situation with Arrow. The precursor to Arrow was the Apache Drill ValueVectors concept. As part of creating Arrow, Wes and Jacques recognized the need to create a new community to build a fresh consensus (beyond just what the Apache Drill community wanted). This separation and new independent community was a key ingredient in Arrow’s current success. The needs here are much the same: many separate communities could benefit from Substrait, but each has its own pain points, type systems, development processes and timelines. To help resolve these tensions, one of the approaches proposed in Substrait is to set a bar that at least two of the top four OSS data technologies (Arrow, Spark, Iceberg, Trino) support something before it is incorporated directly into the Substrait specification. (Another goal is to support strong extension points at key locations so that this bar does not limit broad adoption.)

  • Apache Calcite: Many ideas in Substrait are inspired by the Calcite project. Calcite is a great JVM-based SQL query parsing and optimization framework. A key goal of the Substrait project is to expose Calcite capabilities more easily to non-JVM technologies as well as expose query planning operations as microservices.
  • Apache Arrow: The Arrow format for data is what the Substrait specification attempts to be for compute expressions. A key goal of Substrait is to enable Substrait producers to execute work within the Arrow Rust and C++ compute kernels.

Why the name Substrait?

A strait is a narrow connector of water between two other pieces of water. In analytics, data is often thought of as water. Substrait is focused on instructions related to the data. In other words, what defines or supports the movement of water between one or more larger systems. Thus, the underlayment for the strait connecting different pools of water => sub-strait.

Community

Substrait is developed as a consensus-driven open source product under the Apache 2.0 license. Development is done in the open leveraging GitHub issues and PRs.

Get In Touch

Mailing List/Google Group
We use the mailing list to discuss questions, formulate plans and collaborate asynchronously.
Slack Channel
The developers of Substrait frequent the Slack channel. You can get an invite to the channel by following this link.
GitHub Issues
Substrait is developed via GitHub issues and pull requests. If you see a problem or want to enhance the product, we suggest you file a GitHub issue for developers to review.
Twitter
The @substrait_io account on Twitter is our official account. Follow it to keep up to date on what is happening with Substrait!
Docs
Our website is all maintained in our source repository. If there is something you think can be improved, feel free to fork our repository and post a pull request.
Meetings
Our community meets every other week on Wednesday.

Talks

Want to learn more about Substrait? Try the following presentations and slide decks.

  • Substrait: A Common Representation for Data Compute Plans (Jacques Nadeau, April 2022) [slides]

Citation

If you use Substrait in your research, please cite it using the following BibTeX entry:

@misc{substrait,
+ Community - Substrait: Cross-Language Serialization for Relational Algebra      

Community

Substrait is developed as a consensus-driven open source product under the Apache 2.0 license. Development is done in the open leveraging GitHub issues and PRs.

Get In Touch

Mailing List/Google Group
We use the mailing list to discuss questions, formulate plans and collaborate asynchronously.
Slack Channel
The developers of Substrait frequent the Slack channel. You can get an invite to the channel by following this link.
GitHub Issues
Substrait is developed via GitHub issues and pull requests. If you see a problem or want to enhance the product, we suggest you file a GitHub issue for developers to review.
Twitter
The @substrait_io account on Twitter is our official account. Follow it to keep up to date on what is happening with Substrait!
Docs
Our website is all maintained in our source repository. If there is something you think can be improved, feel free to fork our repository and post a pull request.
Meetings
Our community meets every other week on Wednesday.

Talks

Want to learn more about Substrait? Try the following presentations and slide decks.

  • Substrait: A Common Representation for Data Compute Plans (Jacques Nadeau, April 2022) [slides]

Citation

If you use Substrait in your research, please cite it using the following BibTeX entry:

@misc{substrait,
   author = {substrait-io},
   title = {Substrait: Cross-Language Serialization for Relational Algebra},
   year = {2021},
diff --git a/community/powered_by/index.html b/community/powered_by/index.html
index 6c837b2..30daab8 100644
--- a/community/powered_by/index.html
+++ b/community/powered_by/index.html
@@ -1,4 +1,4 @@
- Powered by Substrait - Substrait: Cross-Language Serialization for Relational Algebra      

Powered by Substrait

In addition to the work maintained in repositories within the substrait-io GitHub organization, a growing list of other open source projects have adopted Substrait.

Acero
Acero is a query execution engine implemented as a part of the Apache Arrow C++ library. Acero provides a Substrait consumer interface.
ADBC
ADBC (Arrow Database Connectivity) is an API specification for Apache Arrow-based database access. ADBC allows applications to pass queries either as SQL strings or Substrait plans.
Arrow Flight SQL
Arrow Flight SQL is a client-server protocol for interacting with databases and query engines using the Apache Arrow in-memory columnar format and the Arrow Flight RPC framework. Arrow Flight SQL allows clients to send queries as SQL strings or Substrait plans.
DataFusion
DataFusion is an extensible query planning, optimization, and execution framework, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion provides a Substrait producer and consumer that can convert DataFusion logical plans to and from Substrait plans. It can be used through the DataFusion Python bindings.
DuckDB
DuckDB is an in-process SQL OLAP database management system. DuckDB provides a Substrait extension that allows users to produce and consume Substrait plans through DuckDB’s SQL, Python, and R APIs.
Gluten
Gluten is a plugin for Apache Spark that allows computation to be offloaded to engines that have better performance or efficiency than Spark’s built-in JVM-based engine. Gluten converts Spark physical plans to Substrait plans.
Ibis
Ibis is a Python library that provides a lightweight, universal interface for data wrangling. It includes a dataframe API for Python with support for more than 10 query execution engines, plus a Substrait producer to enable support for Substrait-consuming execution engines.
Substrait R Interface
The Substrait R interface package allows users to construct Substrait plans from R for evaluation by Substrait-consuming execution engines. The package provides a dplyr backend as well as lower-level interfaces for creating Substrait plans and integrations with Acero and DuckDB.
Velox
Velox is a unified execution engine aimed at accelerating data management systems and streamlining their development. Velox provides a Substrait consumer interface.

To add your project to this list, please open a pull request.

Powered by Substrait

In addition to the work maintained in repositories within the substrait-io GitHub organization, a growing list of other open source projects have adopted Substrait.

Acero
Acero is a query execution engine implemented as a part of the Apache Arrow C++ library. Acero provides a Substrait consumer interface.
ADBC
ADBC (Arrow Database Connectivity) is an API specification for Apache Arrow-based database access. ADBC allows applications to pass queries either as SQL strings or Substrait plans.
Arrow Flight SQL
Arrow Flight SQL is a client-server protocol for interacting with databases and query engines using the Apache Arrow in-memory columnar format and the Arrow Flight RPC framework. Arrow Flight SQL allows clients to send queries as SQL strings or Substrait plans.
DataFusion
DataFusion is an extensible query planning, optimization, and execution framework, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion provides a Substrait producer and consumer that can convert DataFusion logical plans to and from Substrait plans. It can be used through the DataFusion Python bindings.
DuckDB
DuckDB is an in-process SQL OLAP database management system. DuckDB provides a Substrait extension that allows users to produce and consume Substrait plans through DuckDB’s SQL, Python, and R APIs.
Gluten
Gluten is a plugin for Apache Spark that allows computation to be offloaded to engines that have better performance or efficiency than Spark’s built-in JVM-based engine. Gluten converts Spark physical plans to Substrait plans.
Ibis
Ibis is a Python library that provides a lightweight, universal interface for data wrangling. It includes a dataframe API for Python with support for more than 10 query execution engines, plus a Substrait producer to enable support for Substrait-consuming execution engines.
Substrait R Interface
The Substrait R interface package allows users to construct Substrait plans from R for evaluation by Substrait-consuming execution engines. The package provides a dplyr backend as well as lower-level interfaces for creating Substrait plans and integrations with Acero and DuckDB.
Velox
Velox is a unified execution engine aimed at accelerating data management systems and streamlining their development. Velox provides a Substrait consumer interface.

To add your project to this list, please open a pull request.

Aggregate Functions

Aggregate functions are functions that define an operation which consumes values from multiple records to produce a single output. Aggregate functions in SQL are typically used in GROUP BY queries. Aggregate functions are similar to scalar functions, and their function signatures differ by a small set of properties.

Aggregate function signatures contain all the properties defined for scalar functions. Additionally, they contain the properties below:

Property | Description | Required
Inherits | All properties defined for scalar function. | N/A
Ordered | Whether the result of this function is sensitive to sort order. | Optional, defaults to false
Maximum set size | Maximum allowed set size as an unsigned integer. | Optional, defaults to unlimited
Decomposable | Whether the function can be executed in one or more intermediate steps. Valid options are: NONE, ONE, MANY, describing how intermediate steps can be taken. | Optional, defaults to NONE
Intermediate Output Type | The intermediate output type that is used when the function is decomposable (defined as either ONE or MANY). Will be a struct in many cases. | Required for ONE and MANY.
Invocation | Whether the function uses all or only distinct values in the aggregation calculation. Valid options are: ALL, DISTINCT. | Optional, defaults to ALL

Aggregate Binding

When binding an aggregate function, the binding must include the following additional properties beyond the standard scalar binding properties:

Property Description
Phase Describes which portion of the operation is required, in terms of the input and output state of the data. Valid options are: INITIAL_TO_INTERMEDIATE, INTERMEDIATE_TO_INTERMEDIATE, INITIAL_TO_RESULT, INTERMEDIATE_TO_RESULT. For functions that are NOT decomposable, the only valid option is INITIAL_TO_RESULT.
Ordering Zero or more ordering keys along with key order (ASC|DESC|NULL FIRST, etc.), declared similar to the sort keys in an ORDER BY relational operation. If no sorts are specified, the records are not sorted prior to being passed to the aggregate function.

Aggregate Functions

Aggregate functions are functions that define an operation which consumes values from multiple records to produce a single output. Aggregate functions in SQL are typically used in GROUP BY queries. Aggregate functions are similar to scalar functions, and their function signatures differ by a small set of properties.

Aggregate function signatures contain all the properties defined for scalar functions. Additionally, they contain the properties below:

Property | Description | Required
Inherits | All properties defined for scalar function. | N/A
Ordered | Whether the result of this function is sensitive to sort order. | Optional, defaults to false
Maximum set size | Maximum allowed set size as an unsigned integer. | Optional, defaults to unlimited
Decomposable | Whether the function can be executed in one or more intermediate steps. Valid options are: NONE, ONE, MANY, describing how intermediate steps can be taken. | Optional, defaults to NONE
Intermediate Output Type | The intermediate output type that is used when the function is decomposable (defined as either ONE or MANY). Will be a struct in many cases. | Required for ONE and MANY.
Invocation | Whether the function uses all or only distinct values in the aggregation calculation. Valid options are: ALL, DISTINCT. | Optional, defaults to ALL

Aggregate Binding

When binding an aggregate function, the binding must include the following additional properties beyond the standard scalar binding properties:

Property Description
Phase Describes which portion of the operation is required, in terms of the input and output state of the data. Valid options are: INITIAL_TO_INTERMEDIATE, INTERMEDIATE_TO_INTERMEDIATE, INITIAL_TO_RESULT, INTERMEDIATE_TO_RESULT. For functions that are NOT decomposable, the only valid option is INITIAL_TO_RESULT.
Ordering Zero or more ordering keys along with key order (ASC|DESC|NULL FIRST, etc.), declared similar to the sort keys in an ORDER BY relational operation. If no sorts are specified, the records are not sorted prior to being passed to the aggregate function.
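
The four phases can be illustrated with a small Python sketch of a decomposable average. This is only an illustration under stated assumptions, not part of the specification: the function names mirror the phase names, the intermediate type is assumed to be a (sum, count) struct, and it is assumed that partial results may be merged any number of times.

```python
from typing import Iterable, Optional, Tuple

# Assumed intermediate output type for a decomposable average: a (sum, count) struct.
Intermediate = Tuple[float, int]


def initial_to_intermediate(values: Iterable[float]) -> Intermediate:
    """Consume raw input values and emit a partial (sum, count) state."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return (total, count)


def intermediate_to_intermediate(parts: Iterable[Intermediate]) -> Intermediate:
    """Merge several partial states into one partial state."""
    total, count = 0.0, 0
    for s, c in parts:
        total += s
        count += c
    return (total, count)


def intermediate_to_result(state: Intermediate) -> Optional[float]:
    """Turn a partial state into the final average (None for empty input)."""
    total, count = state
    return total / count if count else None


def initial_to_result(values: Iterable[float]) -> Optional[float]:
    """Single-step evaluation, the only phase available to non-decomposable functions."""
    return intermediate_to_result(initial_to_intermediate(values))


# Three hypothetical partitions of the same input column.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
partials = [initial_to_intermediate(p) for p in partitions]
merged = intermediate_to_intermediate(partials)
decomposed = intermediate_to_result(merged)
direct = initial_to_result([v for p in partitions for v in p])
```

Running the phases separately over the partitions and running INITIAL_TO_RESULT over the whole input give the same average, which is the property a decomposable function is expected to preserve.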

Embedded Functions

Embedded functions are a special kind of function where the implementation is embedded within the actual plan. They are commonly used in tools where a user intersperses business logic within a data pipeline. This is more common in data science workflows than traditional SQL workflows.

Embedded functions are not pre-registered. Embedded functions require that data be consumed and produced with a standard API, may require memory allocation and have determinate error reporting behavior. They may also have specific runtime dependencies. For example, a Python pickle function may depend on pyarrow 5.0 and pynessie 1.0.

Properties for an embedded function include:

Property | Description | Required
Function Type | The type of embedded function presented. | Required
Function Properties | Function properties, one of those items defined below. | Required
Output Type | The fully resolved output type for this embedded function. | Required

The binary representation of an embedded function is:

message EmbeddedFunction {
+ Embedded Functions - Substrait: Cross-Language Serialization for Relational Algebra      

Embedded Functions

Embedded functions are a special kind of function where the implementation is embedded within the actual plan. They are commonly used in tools where a user intersperses business logic within a data pipeline. This is more common in data science workflows than traditional SQL workflows.

Embedded functions are not pre-registered. Embedded functions require that data be consumed and produced with a standard API, may require memory allocation and have determinate error reporting behavior. They may also have specific runtime dependencies. For example, a Python pickle function may depend on pyarrow 5.0 and pynessie 1.0.

Properties for an embedded function include:

Property | Description | Required
Function Type | The type of embedded function presented. | Required
Function Properties | Function properties, one of those items defined below. | Required
Output Type | The fully resolved output type for this embedded function. | Required

The binary representation of an embedded function is:

message EmbeddedFunction {
   repeated Expression arguments = 1;
   Type output_type = 2;
   oneof kind {
diff --git a/expressions/extended_expression/index.html b/expressions/extended_expression/index.html
index 4d3bd3f..1e4d46b 100644
--- a/expressions/extended_expression/index.html
+++ b/expressions/extended_expression/index.html
@@ -1,4 +1,4 @@
- Extended Expression - Substrait: Cross-Language Serialization for Relational Algebra      

Extended Expression

Extended Expression messages are provided for expression-level protocols as an alternative to using a Plan. They mainly target expression-only evaluations, such as those computed in Filter/Project/Aggregation rels. Unlike the original Expression defined in the substrait protocol, Extended Expression messages require more information to completely describe the computation context including: input data schema, referred function signatures, and output schema.

Since Extended Expression will be used separately from the Plan rel representation, it will need to include basic fields like Version.

message ExtendedExpression {
+ Extended Expression - Substrait: Cross-Language Serialization for Relational Algebra      

Extended Expression

Extended Expression messages are provided for expression-level protocols as an alternative to using a Plan. They mainly target expression-only evaluations, such as those computed in Filter/Project/Aggregation rels. Unlike the original Expression defined in the substrait protocol, Extended Expression messages require more information to completely describe the computation context including: input data schema, referred function signatures, and output schema.

Since Extended Expression will be used separately from the Plan rel representation, it will need to include basic fields like Version.

message ExtendedExpression {
   // Substrait version of the expression. Optional up to 0.17.0, required for later
   // versions.
   Version version = 7;
diff --git a/expressions/field_references/index.html b/expressions/field_references/index.html
index 73f26ba..af906a7 100644
--- a/expressions/field_references/index.html
+++ b/expressions/field_references/index.html
@@ -1,4 +1,4 @@
- Field References - Substrait: Cross-Language Serialization for Relational Algebra      

Field References

In Substrait, all fields are dealt with on a positional basis. Field names are only used at the edge of a plan, for the purposes of naming fields for the outside world. Each operation returns a simple or compound data type. Additional operations can refer to data within that initial operation using field references. To reference a field, you use a reference based on the type of field position you want to reference.

Reference Type | Properties | Type Applicability | Type return
Struct Field | Ordinal position. Zero-based. Only legal within the range of possible fields within a struct. Selecting an ordinal outside the applicable field range results in an invalid plan. | struct | Type of field referenced
Array Value | Array offset. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Negative and positive overflows return null values (no wrapping). | list | type of list
Array Slice | Array offset and element count. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Position does not wrap, nor does length. | list | Same type as original list
Map Key | A map value that is matched exactly against available map keys and returned. | map | Value type of map
Map KeyExpression | A wildcard string that is matched against a simplified form of regular expressions. Requires the key type of the map to be a character type. [Format detail needed, intention to include basic regex concepts such as greedy/non-greedy.] | map | List of map value type
Masked Complex Expression | An expression that provides a mask over a schema declaring which portions of the schema should be presented. This allows a user to select a portion of a complex object but mask certain subsections of that same object. | any | any

Compound References

References are typically constructed as a sequence. For example: [struct position 0, struct position 1, array offset 2, array slice 1..3].

Field references are in the same order they are defined in their schema. For example, let’s consider the following schema:

column a:
+ Field References - Substrait: Cross-Language Serialization for Relational Algebra      

Field References

In Substrait, all fields are dealt with on a positional basis. Field names are only used at the edge of a plan, for the purposes of naming fields for the outside world. Each operation returns a simple or compound data type. Additional operations can refer to data within that initial operation using field references. To reference a field, you use a reference based on the type of field position you want to reference.

Reference Type | Properties | Type Applicability | Type return
Struct Field | Ordinal position. Zero-based. Only legal within the range of possible fields within a struct. Selecting an ordinal outside the applicable field range results in an invalid plan. | struct | Type of field referenced
Array Value | Array offset. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Negative and positive overflows return null values (no wrapping). | list | type of list
Array Slice | Array offset and element count. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Position does not wrap, nor does length. | list | Same type as original list
Map Key | A map value that is matched exactly against available map keys and returned. | map | Value type of map
Map KeyExpression | A wildcard string that is matched against a simplified form of regular expressions. Requires the key type of the map to be a character type. [Format detail needed, intention to include basic regex concepts such as greedy/non-greedy.] | map | List of map value type
Masked Complex Expression | An expression that provides a mask over a schema declaring which portions of the schema should be presented. This allows a user to select a portion of a complex object but mask certain subsections of that same object. | any | any
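
The Struct Field and Array Value rules above can be sketched in a few lines of Python. This is a minimal illustration, not Substrait's implementation: the helper names struct_field and array_value are hypothetical, a Python tuple stands in for a struct and a list for an array, and an "invalid plan" is modeled as a raised ValueError.

```python
from typing import Any, List, Sequence


def struct_field(struct: Sequence[Any], ordinal: int) -> Any:
    """Zero-based struct field reference; out-of-range ordinals are invalid."""
    if not 0 <= ordinal < len(struct):
        # Modeled as an exception here; in a real plan this would be a validation error.
        raise ValueError(f"invalid plan: ordinal {ordinal} is outside the struct")
    return struct[ordinal]


def array_value(items: List[Any], offset: int) -> Any:
    """Zero-based array reference; negative offsets count from the end.

    Offsets that overflow in either direction yield None (null) and never wrap.
    """
    index = offset if offset >= 0 else len(items) + offset
    if index < 0 or index >= len(items):
        return None
    return items[index]


row = ("alice", [10, 20, 30])          # struct<string, list<i32>> stand-in
scores = struct_field(row, 1)          # the list field
first = array_value(scores, 0)         # first element
last = array_value(scores, -1)         # last element
too_far = array_value(scores, 3)       # positive overflow
too_negative = array_value(scores, -4) # negative overflow
```

Note that Python's own list indexing would raise on overflow, which is why the bounds check is written out explicitly.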

Compound References

References are typically constructed as a sequence. For example: [struct position 0, struct position 1, array offset 2, array slice 1..3].

Field references are in the same order they are defined in their schema. For example, let’s consider the following schema:

column a:
   struct<
     b: list<
       struct<
diff --git a/expressions/scalar_functions/index.html b/expressions/scalar_functions/index.html
index 113f054..c7d8574 100644
--- a/expressions/scalar_functions/index.html
+++ b/expressions/scalar_functions/index.html
@@ -1,4 +1,4 @@
- Scalar Functions - Substrait: Cross-Language Serialization for Relational Algebra      

Scalar Functions

A function is a scalar function if that function takes in values from a single record and produces an output value. To clearly specify the definition of functions, Substrait declares an extensible specification plus binding approach to function resolution. A scalar function signature includes the following properties:

Property | Description | Required
Name | One or more user-friendly UTF-8 strings that are used to reference this function. | At least one value is required.
List of arguments | Argument properties are defined below. Arguments can be fully defined or calculated with a type expression. See further details below. | Optional, defaults to niladic.
Deterministic | Whether this function is expected to reproduce the same output when it is invoked multiple times with the same input. This informs a plan consumer on whether it can constant-reduce the defined function. An example would be a random() function, which is typically expected to be evaluated repeatedly despite having the same set of inputs. | Optional, defaults to true.
Session Dependent | Whether this function is influenced by the session context it is invoked within. For example, a function may be influenced by a user who is invoking the function, the time zone of a session, or some other non-obvious parameter. This can inform caching systems on whether a particular function is cacheable. | Optional, defaults to false.
Variadic Behavior | Whether the last argument of the function is variadic or a single argument. If variadic, the argument can optionally have a lower bound (minimum number of instances) and an upper bound (maximum number of instances). | Optional, defaults to single value.
Nullability Handling | Describes how nullability of input arguments maps to nullability of output arguments. Three options are: MIRROR, DECLARED_OUTPUT and DISCRETE. More details about nullability handling are listed below. | Optional, defaults to MIRROR
Description | Additional description of function for implementers or users. Should be written human-readable to allow exposure to end users. Presented as a map with language => description mappings. E.g. { "en": "This adds two numbers together.", "fr": "cela ajoute deux nombres"}. | Optional
Return Value | The output type of the expression. Return types can be expressed as a fully-defined type or a type expression. See below for more on type expressions. | Required
Implementation Map | A map of implementation locations for one or more implementations of the given function. Each key is a function implementation type. Implementation types include examples such as: AthenaArrowLambda, TrinoV361Jar, ArrowCppKernelEnum, GandivaEnum, LinkedIn Transport Jar, etc. [Definition TBD]. Implementation type has one or more properties associated with retrieval of that implementation. | Optional

Argument Types

There are three main types of arguments: value arguments, type arguments, and enumerations. Every defined argument must be specified in every invocation of the function. When specified, the position of these arguments in the function invocation must match the position of the arguments as defined in the YAML function definition.

  • Value arguments: arguments that refer to a data value. These could be constants (literal expressions defined in the plan) or variables (a reference expression that references data being processed by the plan). This is the most common type of argument. The value of a value argument is not available in output derivation, but its type is. Value arguments can be declared in one of two ways: concrete or parameterized. Concrete types are either simple types or compound types with all parameters fully defined (without referencing any type arguments). Examples include i32, fp32, VARCHAR<20>, List<fp32>, etc. Parameterized types are discussed further below.
  • Type arguments: arguments that are used only to inform the evaluation and/or type derivation of the function. For example, you might have a function which is truncate(<type> DECIMAL<P0,S0>, <value> DECIMAL<P1, S1>, <value> i32). This function declares two value arguments and a type argument. The difference between them is that the type argument has no value at runtime, while the value arguments do.
  • Enumeration: arguments that support a fixed set of declared values as constant arguments. These arguments must be specified as part of an expression. While these could also have been implemented as constant string value arguments, they are formally included to improve validation/contextual help/etc. for frontend processors and IDEs. An example might be extract([DAY|YEAR|MONTH], <date value>). In this example, a producer must specify a type of date part to extract. Note, the value of a required enumeration cannot be used in type derivation.

Value Argument Properties

Property Description Required
Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0)
Description Additional description of this argument. Optional
Value A fully defined type or a type expression. Required
Constant Whether this argument is required to be a constant for invocation. For example, in some systems a regular expression pattern would only be accepted as a literal and not a column value reference. Optional, defaults to false

Type Argument Properties

Property Description Required
Type A partially or completely parameterized type. E.g. List<K> or K Required
Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0)
Description Additional description of this argument. Optional

Required Enumeration Properties

Property Description Required
Options List of valid string options for this argument Required
Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0)
Description Additional description of this argument. Optional

Options

In addition to arguments, each call may specify zero or more options. These are similar to a required enumeration but more focused on supporting alternative behaviors. Options can be left unspecified, in which case the consumer is free to choose which implementation to use. An example use case might be OVERFLOW_BEHAVIOR:[OVERFLOW, SATURATE, ERROR]. If unspecified, an engine is free to use any of the three choices or even some alternative behavior (e.g. setting the value to null on overflow). If specified, the engine would be expected to behave as specified or fail. Note, the value of an optional enumeration cannot be used in type derivation.

Option Preference

A producer may specify multiple values for an option. If the producer does so then the consumer must deliver the first behavior in the list of values that the consumer is capable of delivering. For example, considering overflow as defined above, if a producer specified [ERROR, SATURATE] then the consumer must deliver ERROR if it is capable of doing so. If it is not then it may deliver SATURATE. If the consumer cannot deliver either behavior then it is an error and the consumer must reject the plan.

Optional Properties

Property Description Required
Values A list of valid strings for this option. Required
Name A human-readable name for this option. Required

Nullability Handling

Mode Description
MIRROR If at least one of the input arguments is nullable, the return type is also nullable. If all arguments are non-nullable, the return type will be non-nullable. An example might be the + function.
DECLARED_OUTPUT Input arguments of any mix of nullability are accepted. The nullability of the output is whatever the return type expression states. An example use might be the function is_null(), where the output is always boolean independent of the nullability of the input.
DISCRETE The input and arguments all define concrete nullability and can only be bound to types that have that nullability. For example, if a type input is declared i64? and one has an i64 literal, the i64 literal must be specifically cast to i64? to allow the operation to bind.

Parameterized Types

Types are parameterized by two types of values: by inner types (e.g. List<K>) and numeric values (e.g. DECIMAL<P,S>). Parameter names are simple strings (frequently a single character). There are two types of parameters: integer parameters and type parameters.

When the same parameter name is used multiple times in a function definition, the function can only bind if the exact same value is used for all parameters of that name. For example, if one had a function with a signature of fn(VARCHAR<N>, VARCHAR<N>), the function would only be usable if both VARCHAR types had the same length value N. This necessitates that all instances of the same parameter name must be of the same parameter type (all instances are a type parameter or all instances are an integer parameter).

Type Parameter Resolution in Variadic Functions

When the last argument of a function is variadic and declares a type parameter, e.g. fn(A, B, C...), the C parameter can be marked as either consistent or inconsistent. If marked as consistent, the function can only be bound to arguments where all the C types are the same concrete type. If marked as inconsistent, each unique C can be bound to a different type within the constraints of what C allows.

Output Type Derivation

Concrete Return Types

A concrete return type is one that is fully known at function definition time. Examples of simple concrete return types would be things such as i32, fp32. For compound types, a concrete return type must be fully declared. Example of fully defined compound types: VARCHAR<20>, DECIMAL<25,5>

Return Type Expressions

Any function can declare a return type expression. A return type expression uses a simplified set of expressions to describe how the return type should be derived. For example, a return expression could be as simple as returning a parameter declared in the arguments, such as f(List<K>) => K, or it could be a simple mathematical or conditional expression such as add(decimal<a,b>, decimal<c,d>) => decimal<a+c, b+d>. For the simple expression language, there is a very narrow set of types:

  • Integer: 64-bit signed integer (can be a literal or a parameter value)
  • Boolean: True and False
  • Type: A Substrait type (with possibly additional embedded expressions)

These types are evaluated using a small set of operations to support common scenarios. List of valid operations:

Math: +, -, *, /, min, max
+ Scalar Functions - Substrait: Cross-Language Serialization for Relational Algebra      

Scalar Functions

A function is a scalar function if that function takes in values from a single record and produces an output value. To clearly specify the definition of functions, Substrait declares an extensible specification plus binding approach to function resolution. A scalar function signature includes the following properties:

Property Description Required
Name One or more user-friendly UTF-8 strings that are used to reference this function. At least one value is required.
List of arguments Argument properties are defined below. Arguments can be fully defined or calculated with a type expression. See further details below. Optional, defaults to niladic.
Deterministic Whether this function is expected to reproduce the same output when it is invoked multiple times with the same input. This informs a plan consumer on whether it can constant-reduce the defined function. An example would be a random() function, which is typically expected to be evaluated repeatedly despite having the same set of inputs. Optional, defaults to true.
Session Dependent Whether this function is influenced by the session context it is invoked within. For example, a function may be influenced by a user who is invoking the function, the time zone of a session, or some other non-obvious parameter. This can inform caching systems on whether a particular function is cacheable. Optional, defaults to false.
Variadic Behavior Whether the last argument of the function is variadic or a single argument. If variadic, the argument can optionally have a lower bound (minimum number of instances) and an upper bound (maximum number of instances). Optional, defaults to single value.
Nullability Handling Describes how nullability of input arguments maps to nullability of output arguments. Three options are: MIRROR, DECLARED_OUTPUT and DISCRETE. More details about nullability handling are listed below. Optional, defaults to MIRROR
Description Additional description of function for implementers or users. Should be written human-readable to allow exposure to end users. Presented as a map with language => description mappings. E.g. { "en": "This adds two numbers together.", "fr": "cela ajoute deux nombres"}. Optional
Return Value The output type of the expression. Return types can be expressed as a fully-defined type or a type expression. See below for more on type expressions. Required
Implementation Map A map of implementation locations for one or more implementations of the given function. Each key is a function implementation type. Implementation types include examples such as: AthenaArrowLambda, TrinoV361Jar, ArrowCppKernelEnum, GandivaEnum, LinkedIn Transport Jar, etc. [Definition TBD]. Implementation type has one or more properties associated with retrieval of that implementation. Optional

Argument Types

There are three main types of arguments: value arguments, type arguments, and enumerations. Every defined argument must be specified in every invocation of the function. When specified, the position of these arguments in the function invocation must match the position of the arguments as defined in the YAML function definition.

  • Value arguments: arguments that refer to a data value. These could be constants (literal expressions defined in the plan) or variables (a reference expression that references data being processed by the plan). This is the most common type of argument. The value of a value argument is not available in output derivation, but its type is. Value arguments can be declared in one of two ways: concrete or parameterized. Concrete types are either simple types or compound types with all parameters fully defined (without referencing any type arguments). Examples include i32, fp32, VARCHAR<20>, List<fp32>, etc. Parameterized types are discussed further below.
  • Type arguments: arguments that are used only to inform the evaluation and/or type derivation of the function. For example, you might have a function which is truncate(<type> DECIMAL<P0,S0>, <value> DECIMAL<P1, S1>, <value> i32). This function declares two value arguments and a type argument. The difference between them is that the type argument has no value at runtime, while the value arguments do.
  • Enumeration: arguments that support a fixed set of declared values as constant arguments. These arguments must be specified as part of an expression. While these could also have been implemented as constant string value arguments, they are formally included to improve validation/contextual help/etc. for frontend processors and IDEs. An example might be extract([DAY|YEAR|MONTH], <date value>). In this example, a producer must specify a type of date part to extract. Note, the value of a required enumeration cannot be used in type derivation.

Value Argument Properties

Property Description Required
Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0)
Description Additional description of this argument. Optional
Value A fully defined type or a type expression. Required
Constant Whether this argument is required to be a constant for invocation. For example, in some systems a regular expression pattern would only be accepted as a literal and not a column value reference. Optional, defaults to false

Type Argument Properties

Property Description Required
Type A partially or completely parameterized type. E.g. List<K> or K Required
Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0)
Description Additional description of this argument. Optional

Required Enumeration Properties

Property Description Required
Options List of valid string options for this argument Required
Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0)
Description Additional description of this argument. Optional

Options

In addition to arguments, each call may specify zero or more options. These are similar to a required enumeration but more focused on supporting alternative behaviors. Options can be left unspecified, in which case the consumer is free to choose which implementation to use. An example use case might be OVERFLOW_BEHAVIOR:[OVERFLOW, SATURATE, ERROR]. If unspecified, an engine is free to use any of the three choices or even some alternative behavior (e.g. setting the value to null on overflow). If specified, the engine would be expected to behave as specified or fail. Note, the value of an optional enumeration cannot be used in type derivation.

Option Preference

A producer may specify multiple values for an option. If the producer does so then the consumer must deliver the first behavior in the list of values that the consumer is capable of delivering. For example, considering overflow as defined above, if a producer specified [ERROR, SATURATE] then the consumer must deliver ERROR if it is capable of doing so. If it is not then it may deliver SATURATE. If the consumer cannot deliver either behavior then it is an error and the consumer must reject the plan.
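The preference rule above can be sketched as a small lookup. This is only an illustration: the function name `resolve_option` and the idea of modelling a consumer's capabilities as a set of strings are assumptions made for the example, not part of the specification.

```python
from typing import Sequence, Set


def resolve_option(preferences: Sequence[str], supported: Set[str]) -> str:
    """Return the first behavior in the producer's list that the consumer supports.

    `preferences` is the ordered list given by the producer and `supported`
    is the (hypothetical) set of behaviors this consumer can deliver.
    """
    for behavior in preferences:
        if behavior in supported:
            return behavior
    # No listed behavior can be delivered, so the plan must be rejected.
    raise ValueError(f"plan rejected: none of {list(preferences)} is supported")
```

With a producer preference of `["ERROR", "SATURATE"]`, a consumer that supports both delivers `ERROR`, a consumer that supports only `SATURATE` delivers `SATURATE`, and a consumer that supports neither raises.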

Optional Properties

Property Description Required
Values A list of valid strings for this option. Required
Name A human-readable name for this option. Required

Nullability Handling

Mode Description
MIRROR If at least one of the input arguments is nullable, the return type is also nullable. If all arguments are non-nullable, the return type will be non-nullable. An example might be the + function.
DECLARED_OUTPUT Input arguments of any mix of nullability are accepted. The nullability of the output is whatever the return type expression states. An example use might be the function is_null(), where the output is always boolean independent of the nullability of the input.
DISCRETE The input and arguments all define concrete nullability and can only be bound to types that have that nullability. For example, if a type input is declared i64? and one has an i64 literal, the i64 literal must be specifically cast to i64? to allow the operation to bind.
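As a rough illustration of the MIRROR and DISCRETE rows above, the two rules can be written as predicates over per-argument nullability flags. The function names and the use of plain booleans (True meaning nullable) are assumptions made for this sketch.

```python
from typing import Sequence


def mirror_output_nullable(args_nullable: Sequence[bool]) -> bool:
    """MIRROR: the output is nullable if at least one input argument is nullable."""
    return any(args_nullable)


def discrete_binds(declared_nullable: Sequence[bool], actual_nullable: Sequence[bool]) -> bool:
    """DISCRETE: every argument must have exactly the declared nullability."""
    return list(declared_nullable) == list(actual_nullable)
```

For instance, `discrete_binds([True], [False])` is false, which corresponds to the i64 literal that has to be cast to i64? before the function can bind.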

Parameterized Types

Types are parameterized by two types of values: by inner types (e.g. List<K>) and numeric values (e.g. DECIMAL<P,S>). Parameter names are simple strings (frequently a single character). There are two types of parameters: integer parameters and type parameters.

When the same parameter name is used multiple times in a function definition, the function can only bind if the exact same value is used for all parameters of that name. For example, if one had a function with a signature of fn(VARCHAR<N>, VARCHAR<N>), the function would only be usable if both VARCHAR types had the same length value N. This necessitates that all instances of the same parameter name must be of the same parameter type (all instances are a type parameter or all instances are an integer parameter).
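A minimal sketch of this binding rule follows. It assumes, purely for illustration, that each argument is modelled as a `(base_type, parameter)` pair with a single parameter; `bind_parameters` is a hypothetical helper, not an API defined by Substrait.

```python
from typing import Dict, Optional, Sequence, Tuple, Union

ParamValue = Union[int, str]  # an integer parameter value or a type name


def bind_parameters(
    declared: Sequence[Tuple[str, str]],
    actual: Sequence[Tuple[str, ParamValue]],
) -> Optional[Dict[str, ParamValue]]:
    """Bind parameter names to values, or return None if the function cannot bind."""
    if len(declared) != len(actual):
        return None
    bindings: Dict[str, ParamValue] = {}
    for (declared_base, name), (actual_base, value) in zip(declared, actual):
        if declared_base != actual_base:
            return None
        # A name that was already bound must see exactly the same value again.
        if name in bindings and bindings[name] != value:
            return None
        bindings[name] = value
    return bindings
```

Under these assumptions, fn(VARCHAR<N>, VARCHAR<N>) binds to two VARCHAR<20> arguments with N = 20, but does not bind to VARCHAR<20> and VARCHAR<30>.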

Type Parameter Resolution in Variadic Functions

When the last argument of a function is variadic and declares a type parameter, e.g. fn(A, B, C...), the C parameter can be marked as either consistent or inconsistent. If marked as consistent, the function can only be bound to arguments where all the C types are the same concrete type. If marked as inconsistent, each unique C can be bound to a different type within the constraints of what C allows.
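The consistent/inconsistent distinction can be sketched as a check over the concrete types supplied for the variadic argument. The helper name `variadic_binds` is hypothetical, types are modelled as plain strings, and any further constraints on the parameter are deliberately ignored here.

```python
from typing import Sequence


def variadic_binds(variadic_types: Sequence[str], consistent: bool) -> bool:
    """Check the concrete types bound to the trailing variadic type parameter."""
    if not consistent:
        # Inconsistent: each instance may bind to a different type.
        return True
    # Consistent: all instances must resolve to the same concrete type.
    return len(set(variadic_types)) <= 1
```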

Output Type Derivation

Concrete Return Types

A concrete return type is one that is fully known at function definition time. Examples of simple concrete return types would be things such as i32, fp32. For compound types, a concrete return type must be fully declared. Example of fully defined compound types: VARCHAR<20>, DECIMAL<25,5>

Return Type Expressions

Any function can declare a return type expression. A return type expression uses a simplified set of expressions to describe how the return type should be derived. For example, a return expression could be as simple as returning a parameter declared in the arguments, such as f(List<K>) => K, or it could be a simple mathematical or conditional expression such as add(decimal<a,b>, decimal<c,d>) => decimal<a+c, b+d>. For the simple expression language, there is a very narrow set of types:

  • Integer: 64-bit signed integer (can be a literal or a parameter value)
  • Boolean: True and False
  • Type: A Substrait type (with possibly additional embedded expressions)

These types are evaluated using a small set of operations to support common scenarios. List of valid operations:

Math: +, -, *, /, min, max
 Boolean: &&, ||, !, <, >, ==
 Parameters: type, integer
 Literals: type, integer
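To make the idea of a return type expression concrete, the illustrative signature add(decimal<a,b>, decimal<c,d>) => decimal<a+c, b+d> from the text can be evaluated as below. The `DecimalType` class and `add_return_type` function are assumptions for this sketch, and the expression is the document's own simplified example rather than a statement of how any engine derives decimal precision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecimalType:
    precision: int
    scale: int


def add_return_type(left: DecimalType, right: DecimalType) -> DecimalType:
    """Evaluate decimal<a,b>, decimal<c,d> => decimal<a+c, b+d> using integer math."""
    return DecimalType(
        precision=left.precision + right.precision,
        scale=left.scale + right.scale,
    )
```

Only the integer `+` operation from the list above is needed for this example; the integer parameters a, b, c and d are read from the bound argument types.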
diff --git a/expressions/specialized_record_expressions/index.html b/expressions/specialized_record_expressions/index.html
index 80ed758..d801922 100644
--- a/expressions/specialized_record_expressions/index.html
+++ b/expressions/specialized_record_expressions/index.html
@@ -1,4 +1,4 @@
- Specialized Record Expressions - Substrait: Cross-Language Serialization for Relational Algebra      

Specialized Record Expressions

While all types of operations could be reduced to functions, in some cases this would be overly simplistic. Instead, it is helpful to construct some other expression constructs.

These constructs should be focused on different expression types as opposed to something that is directly related to syntactic sugar. For example, CAST and EXTRACT are SQL operations that are presented using specialized syntax. However, they can easily be modeled using a function paradigm with minimal complexity.

Literal Expressions

For each data type, it is possible to create a literal value for that data type. The representation depends on the serialization format. Literal expressions include both a type literal and a possibly null value.

Nested Type Constructor Expressions

These expressions allow structs, lists, and maps to be constructed from a set of expressions. For example, they allow a struct expression like (field 0 - field 1, field 0 + field 1) to be represented.

Cast Expression

To convert a value from one type to another, Substrait defines a cast expression. Cast expressions declare an expected type, an input argument and an enumeration specifying failure behavior, indicating whether cast should return null on failure or throw an exception.

Note that Substrait always requires a cast expression whenever the current type is not exactly equal to (one of) the expected types. For example, it is illegal to directly pass a value of type i8[0] to a function that only supports an i8?[0] argument.

If Expression

An if value expression is an expression composed of one if clause, zero or more else if clauses and an else clause. In pseudocode, they are envisioned as:

if <boolean expression> then <result expression 1>
+ Specialized Record Expressions - Substrait: Cross-Language Serialization for Relational Algebra      

Specialized Record Expressions

While all types of operations could be reduced to functions, in some cases this would be overly simplistic. Instead, it is helpful to construct some other expression constructs.

These constructs should be focused on different expression types as opposed to something that is directly related to syntactic sugar. For example, CAST and EXTRACT are SQL operations that are presented using specialized syntax. However, they can easily be modeled using a function paradigm with minimal complexity.

Literal Expressions

For each data type, it is possible to create a literal value for that data type. The representation depends on the serialization format. Literal expressions include both a type literal and a possibly null value.

Nested Type Constructor Expressions

These expressions allow structs, lists, and maps to be constructed from a set of expressions. For example, they allow a struct expression like (field 0 - field 1, field 0 + field 1) to be represented.

Cast Expression

To convert a value from one type to another, Substrait defines a cast expression. Cast expressions declare an expected type, an input argument and an enumeration specifying failure behavior, indicating whether cast should return null on failure or throw an exception.

Note that Substrait always requires a cast expression whenever the current type is not exactly equal to (one of) the expected types. For example, it is illegal to directly pass a value of type i8[0] to a function that only supports an i8?[0] argument.

If Expression

An if value expression is an expression composed of one if clause, zero or more else if clauses and an else clause. In pseudocode, they are envisioned as:

if <boolean expression> then <result expression 1>
 else if <boolean expression> then <result expression 2> (zero or more times)
 else <result expression 3>
 

When an if expression is declared, all return expressions must be the same identical type.
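The if / else if / else form above can be sketched as an ordered list of clauses. In this illustration each condition and result is wrapped in a zero-argument callable so that only the selected branch is ever evaluated; the name `if_then` and this representation are assumptions, and Python does not enforce the same-type requirement on the results.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")

Clause = Tuple[Callable[[], bool], Callable[[], T]]


def if_then(clauses: Sequence[Clause], otherwise: Callable[[], T]) -> T:
    """Evaluate clauses in order and return the result of the first true one."""
    for condition, result in clauses:
        if condition():
            return result()
    # No clause matched, so the else branch supplies the value.
    return otherwise()
```

For example, `if_then([(lambda: 'value' == 'value', lambda: 0)], lambda: int('hello'))` returns 0 without ever running the failing conversion in the else branch.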

Shortcut Behavior

An if expression is expected to logically short-circuit on a positive outcome. This means that a skipped else/elseif expression cannot cause an error. For example, this should not actually throw an error despite the fact that the cast operation should fail.

if 'value' = 'value' then 0
diff --git a/expressions/subqueries/index.html b/expressions/subqueries/index.html
index 741bb65..5d03501 100644
--- a/expressions/subqueries/index.html
+++ b/expressions/subqueries/index.html
@@ -1,4 +1,4 @@
- Subqueries - Substrait: Cross-Language Serialization for Relational Algebra      

Subqueries

Subqueries are scalar expressions comprised of another query.

Forms

Scalar

Scalar subqueries are subqueries that return one row and one column.

Property Description Required
Input Input relation Yes

IN predicate

An IN subquery predicate checks that the left expression is contained in the right subquery.

Examples

SELECT *
+ Subqueries - Substrait: Cross-Language Serialization for Relational Algebra      

Subqueries

Subqueries are scalar expressions comprised of another query.

Forms

Scalar

Scalar subqueries are subqueries that return one row and one column.

Property Description Required
Input Input relation Yes

IN predicate

An IN subquery predicate checks that the left expression is contained in the right subquery.

Examples

SELECT *
 FROM t1
 WHERE x IN (SELECT * FROM t2)
 
SELECT *
diff --git a/expressions/table_functions/index.html b/expressions/table_functions/index.html
index 50729ab..e7b946c 100644
--- a/expressions/table_functions/index.html
+++ b/expressions/table_functions/index.html
@@ -1,4 +1,4 @@
- Table Functions - Substrait: Cross-Language Serialization for Relational Algebra      

Table Functions

Table functions produce zero or more records for each input record. Table functions use a signature similar to scalar functions. However, they are not allowed in the same contexts.

to be completed…

GitHub

Table Functions

Table functions produce zero or more records for each input record. Table functions use a signature similar to scalar functions. However, they are not allowed in the same contexts.

to be completed…

GitHub

User-Defined Functions

Substrait supports the creation of custom functions using simple extensions, using the facilities described in scalar functions. The functions defined by Substrait use the same mechanism. The extension files for standard functions can be found here.

Here’s an example function that doubles its input:

Implementation Note

This implementation is only defined on 32-bit floats and integers but could be defined on all numbers (and even lists and strings). The user of the implementation can specify what happens when the resulting value falls outside of the valid range for a 32-bit float (either return NAN or raise an error).

%YAML 1.2
+ User-Defined Functions - Substrait: Cross-Language Serialization for Relational Algebra      

User-Defined Functions

Substrait supports the creation of custom functions using simple extensions, using the facilities described in scalar functions. The functions defined by Substrait use the same mechanism. The extension files for standard functions can be found here.

Here’s an example function that doubles its input:

Implementation Note

This implementation is only defined on 32-bit floats and integers but could be defined on all numbers (and even lists and strings). The user of the implementation can specify what happens when the resulting value falls outside of the valid range for a 32-bit float (either return NAN or raise an error).

%YAML 1.2
 ---
 scalar_functions:
   -
diff --git a/expressions/window_functions/index.html b/expressions/window_functions/index.html
index 58681ae..03a38df 100644
--- a/expressions/window_functions/index.html
+++ b/expressions/window_functions/index.html
@@ -1,4 +1,4 @@
- Window Functions - Substrait: Cross-Language Serialization for Relational Algebra      

Window Functions

Window functions are functions which consume values from multiple records to produce a single output. They are similar to aggregate functions, but also have a focused window of analysis to compare to their partition window. Window functions are similar to scalar values to an end user, producing a single value for each input record. However, the consumption visibility for the production of each single record can be many records.

Window function signatures contain all the properties defined for aggregate functions. Additionally, they contain the properties below:

Property Description Required
Inherits All properties defined for aggregate functions. N/A
Window Type STREAMING or PARTITION. Describes whether the function needs to see all data for the specific partition operation simultaneously. Operations like SUM can produce values in a streaming manner with no complete visibility of the partition. NTILE requires visibility of the entire partition before it can start producing values. Optional, defaults to PARTITION

When binding an aggregate function, the binding must include the following additional properties beyond the standard scalar binding properties:

Property Description Required
Partition A list of partitioning expressions. False, defaults to a single partition for the entire dataset
Lower Bound Bound Following(int64), Bound Trailing(int64) or CurrentRow. False, defaults to start of partition
Upper Bound Bound Following(int64), Bound Trailing(int64) or CurrentRow. False, defaults to end of partition

Aggregate Functions as Window Functions

Aggregate functions can be treated as window functions with Window Type set to STREAMING.

AVG, COUNT, MAX, MIN and SUM are examples of aggregate functions that are commonly allowed in window contexts.

GitHub

Window Functions

Window functions are functions which consume values from multiple records to produce a single output. They are similar to aggregate functions, but also have a focused window of analysis to compare to their partition window. Window functions are similar to scalar values to an end user, producing a single value for each input record. However, the consumption visibility for the production of each single record can be many records.

Window function signatures contain all the properties defined for aggregate functions. Additionally, they contain the properties below:

Property Description Required
Inherits All properties defined for aggregate functions. N/A
Window Type STREAMING or PARTITION. Describes whether the function needs to see all data for the specific partition operation simultaneously. Operations like SUM can produce values in a streaming manner with no complete visibility of the partition. NTILE requires visibility of the entire partition before it can start producing values. Optional, defaults to PARTITION
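The difference between the two window types can be sketched with the SUM and NTILE examples from the table. This is an illustration only: the running-sum frame (start of partition to current row), the function names, and the SQL-style rule that earlier buckets receive the extra rows are assumptions, not definitions taken from this page.

```python
from itertools import accumulate
from typing import Iterable, Iterator, List, Sequence


def streaming_sum(values: Iterable[int]) -> Iterator[int]:
    """STREAMING: emit a running sum as each record arrives."""
    return accumulate(values)


def ntile(partition: Sequence[object], buckets: int) -> List[int]:
    """PARTITION: the bucket of a row depends on the total partition size."""
    base, extra = divmod(len(partition), buckets)
    result: List[int] = []
    for bucket in range(buckets):
        # The first `extra` buckets hold one additional row each.
        size = base + (1 if bucket < extra else 0)
        result.extend([bucket + 1] * size)
    return result
```

`streaming_sum` accepts any iterable and never needs its length, whereas `ntile` calls `len` on the whole partition before it can label the first row.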

When binding an aggregate function, the binding must include the following additional properties beyond the standard scalar binding properties:

Property Description Required
Partition A list of partitioning expressions. False, defaults to a single partition for the entire dataset
Lower Bound Bound Following(int64), Bound Trailing(int64) or CurrentRow. False, defaults to start of partition
Upper Bound Bound Following(int64), Bound Trailing(int64) or CurrentRow. False, defaults to end of partition

Aggregate Functions as Window Functions

Aggregate functions can be treated as window functions with Window Type set to STREAMING.

AVG, COUNT, MAX, MIN and SUM are examples of aggregate functions that are commonly allowed in window contexts.

GitHub

functions_aggregate_approx.yaml

This document file is generated for functions_aggregate_approx.yaml

Aggregate Functions

approx_count_distinct

Implementations:
approx_count_distinct(x): -> return_type
0. approx_count_distinct(any): -> i64

Calculates the approximate number of rows that contain distinct values of the expression argument using HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact result.

GitHub

functions_aggregate_approx.yaml

This document file is generated for functions_aggregate_approx.yaml

Aggregate Functions

approx_count_distinct

Implementations:
approx_count_distinct(x): -> return_type
0. approx_count_distinct(any): -> i64

Calculates the approximate number of rows that contain distinct values of the expression argument using HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact result.

GitHub

functions_aggregate_decimal_output.yaml

This document file is generated for functions_aggregate_decimal_output.yaml

Aggregate Functions

count

Implementations:
count(x, option:overflow): -> return_type
0. count(any, option:overflow): -> decimal<38,0>

Count a set of values. Result is returned as a decimal instead of i64.

Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

count

    Implementations:

    Count a set of records (not field referenced). Result is returned as a decimal instead of i64.

    approx_count_distinct

    Implementations:
    approx_count_distinct(x): -> return_type
    0. approx_count_distinct(any): -> decimal<38,0>

    Calculates the approximate number of rows that contain distinct values of the expression argument using HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact result. Result is returned as a decimal instead of i64.

    functions_aggregate_generic.yaml

    This document file is generated for functions_aggregate_generic.yaml

    Aggregate Functions

    count

    Implementations:
    count(x, option:overflow): -> return_type
    0. count(any, option:overflow): -> i64

    Count a set of values

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    count

    Implementations:

    Count a set of records (not field referenced)

    any_value

    Implementations:
    any_value(x, option:ignore_nulls): -> return_type
    0. any_value(any1, option:ignore_nulls): -> any1?

    Selects an arbitrary value from a group of values. If the input is empty, the function returns null.

    Options:
  • ignore_nulls ['TRUE', 'FALSE']

    functions_arithmetic.yaml

    This document file is generated for functions_arithmetic.yaml

    Scalar Functions

    add

    Implementations:
    add(x, y, option:overflow): -> return_type
    0. add(i8, i8, option:overflow): -> i8
    1. add(i16, i16, option:overflow): -> i16
    2. add(i32, i32, option:overflow): -> i32
    3. add(i64, i64, option:overflow): -> i64
    4. add(fp32, fp32, option:rounding): -> fp32
    5. add(fp64, fp64, option:rounding): -> fp64

    Add two values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
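To make the three overflow options more concrete, here is a small Python sketch for the i8 variant of add. The function name `add_i8` is hypothetical, and the wrap-around behavior shown for SILENT is an assumption: the listing above only names the option and does not define the value returned on overflow.

```python
I8_MIN, I8_MAX = -128, 127


def add_i8(x, y, overflow="SILENT"):
    """Add two i8 values under one of the three overflow options."""
    result = x + y
    if I8_MIN <= result <= I8_MAX:
        return result
    if overflow == "ERROR":
        raise OverflowError(f"{x} + {y} does not fit in i8")
    if overflow == "SATURATE":
        # Clamp to the nearest representable bound.
        return I8_MAX if result > I8_MAX else I8_MIN
    if overflow == "SILENT":
        # Assumption: wrap around as two's-complement hardware would.
        return (result - I8_MIN) % 256 + I8_MIN
    raise ValueError(f"unknown overflow option: {overflow}")
```

For example, `add_i8(100, 100, "SATURATE")` clamps to 127, while the same call with ERROR raises instead of returning a value.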

    subtract

    Implementations:
    subtract(x, y, option:overflow): -> return_type
    0. subtract(i8, i8, option:overflow): -> i8
    1. subtract(i16, i16, option:overflow): -> i16
    2. subtract(i32, i32, option:overflow): -> i32
    3. subtract(i64, i64, option:overflow): -> i64
    4. subtract(fp32, fp32, option:rounding): -> fp32
    5. subtract(fp64, fp64, option:rounding): -> fp64

    Subtract one value from another.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    multiply

    Implementations:
    multiply(x, y, option:overflow): -> return_type
    0. multiply(i8, i8, option:overflow): -> i8
    1. multiply(i16, i16, option:overflow): -> i16
    2. multiply(i32, i32, option:overflow): -> i32
    3. multiply(i64, i64, option:overflow): -> i64
    4. multiply(fp32, fp32, option:rounding): -> fp32
    5. multiply(fp64, fp64, option:rounding): -> fp64

    Multiply two values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    divide

    Implementations:
    divide(x, y, option:overflow, option:on_domain_error, option:on_division_by_zero): -> return_type
    0. divide(i8, i8, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i8
    1. divide(i16, i16, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i16
    2. divide(i32, i32, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i32
    3. divide(i64, i64, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i64
    4. divide(fp32, fp32, option:rounding, option:on_domain_error, option:on_division_by_zero): -> fp32
    5. divide(fp64, fp64, option:rounding, option:on_domain_error, option:on_division_by_zero): -> fp64

    Divide x by y. In the case of integer division, partial values are truncated (i.e. rounded towards 0). The on_division_by_zero option governs behavior in cases where y is 0. If the option is IEEE then the IEEE754 standard is followed: all values except ±infinity return NaN and ±infinity are unchanged. If the option is LIMIT then the result is ±infinity in all cases. If either x or y is NaN then behavior will be governed by on_domain_error. If x and y are both ±infinity, behavior will be governed by on_domain_error.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NULL', 'ERROR'] for the integer implementations, ['NAN', 'NULL', 'ERROR'] for the floating-point implementations
  • on_division_by_zero ['IEEE', 'LIMIT', 'NULL', 'ERROR']

    negate

    Implementations:
    negate(x, option:overflow): -> return_type
    0. negate(i8, option:overflow): -> i8
    1. negate(i16, option:overflow): -> i16
    2. negate(i32, option:overflow): -> i32
    3. negate(i64, option:overflow): -> i64
    4. negate(fp32): -> fp32
    5. negate(fp64): -> fp64

    Negation of the value

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    modulus

    Implementations:
    modulus(x, y, option:division_type, option:overflow, option:on_domain_error): -> return_type
    0. modulus(i8, i8, option:division_type, option:overflow, option:on_domain_error): -> i8
    1. modulus(i16, i16, option:division_type, option:overflow, option:on_domain_error): -> i16
    2. modulus(i32, i32, option:division_type, option:overflow, option:on_domain_error): -> i32
    3. modulus(i64, i64, option:division_type, option:overflow, option:on_domain_error): -> i64

    Calculate the remainder (r) when dividing dividend (x) by divisor (y). In mathematics, many conventions for the modulus (mod) operation exist, and the result of a mod operation depends on the software implementation and underlying hardware. Substrait is a format for describing compute operations on structured data and is designed for interoperability. Therefore the user is responsible for determining a definition of division as defined by the quotient (q).

    The following basic conditions of division are satisfied, where q is the quotient:
    (1) q ∈ ℤ (the quotient is an integer)
    (2) x = y * q + r (division rule)
    (3) abs(r) < abs(y)

    The division_type option determines the mathematical definition of quotient to use in the above definition of division. When division_type=TRUNCATE, q = trunc(x/y). When division_type=FLOOR, q = floor(x/y). In both cases the remainder is r = x - y * round_func(x/y), where round_func is trunc or floor respectively.

    The on_domain_error option governs behavior in cases where y is 0, y is ±inf, or x is ±inf. In these cases the mod is undefined. The overflow option governs behavior when integer overflow occurs. If x and y are both 0 or both ±infinity, behavior will be governed by on_domain_error.

    Options:
  • division_type ['TRUNCATE', 'FLOOR']
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • on_domain_error ['NULL', 'ERROR']
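The difference between the two division_type values only shows up when the operands have opposite signs. The sketch below works on Python integers and follows the division rule x = y * q + r from the description; the function name and the mapping of NULL to Python's `None` are choices made for this example, and overflow handling is left out.

```python
def modulus(x, y, division_type="TRUNCATE", on_domain_error="NULL"):
    """Integer remainder r such that x == y * q + r for the chosen quotient q."""
    if y == 0:
        # The mod is undefined; on_domain_error decides what happens.
        if on_domain_error == "ERROR":
            raise ZeroDivisionError("modulus by zero is undefined")
        return None
    if division_type == "FLOOR":
        q = x // y                      # Python's // already floors
    elif division_type == "TRUNCATE":
        q = abs(x) // abs(y)            # magnitude of the quotient
        if (x < 0) != (y < 0):
            q = -q                      # round towards zero
    else:
        raise ValueError(f"unknown division_type: {division_type}")
    return x - y * q
```

With x = -7 and y = 2, TRUNCATE gives q = -3 and r = -1, while FLOOR gives q = -4 and r = 1. Both satisfy abs(r) < abs(y).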

    power

    Implementations:
    power(x, y, option:overflow): -> return_type
    0. power(i64, i64, option:overflow): -> i64
    1. power(fp32, fp32): -> fp32
    2. power(fp64, fp64): -> fp64

    Take the power with x as the base and y as exponent.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    sqrt

    Implementations:
    sqrt(x, option:rounding, option:on_domain_error): -> return_type
    0. sqrt(i64, option:rounding, option:on_domain_error): -> fp64
    1. sqrt(fp32, option:rounding, option:on_domain_error): -> fp32
    2. sqrt(fp64, option:rounding, option:on_domain_error): -> fp64

    Square root of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    exp

    Implementations:
    exp(x, option:rounding): -> return_type
    0. exp(i64, option:rounding): -> fp64
    1. exp(fp32, option:rounding): -> fp32
    2. exp(fp64, option:rounding): -> fp64

    The mathematical constant e, raised to the power of the value.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    cos

    Implementations:
    cos(x, option:rounding): -> return_type
    0. cos(fp32, option:rounding): -> fp32
    1. cos(fp64, option:rounding): -> fp64

    Get the cosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    sin

    Implementations:
    sin(x, option:rounding): -> return_type
    0. sin(fp32, option:rounding): -> fp32
    1. sin(fp64, option:rounding): -> fp64

    Get the sine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    tan

    Implementations:
    tan(x, option:rounding): -> return_type
    0. tan(fp32, option:rounding): -> fp32
    1. tan(fp64, option:rounding): -> fp64

    Get the tangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    cosh

    Implementations:
    cosh(x, option:rounding): -> return_type
    0. cosh(fp32, option:rounding): -> fp32
    1. cosh(fp64, option:rounding): -> fp64

    Get the hyperbolic cosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    sinh

    Implementations:
    sinh(x, option:rounding): -> return_type
    0. sinh(fp32, option:rounding): -> fp32
    1. sinh(fp64, option:rounding): -> fp64

    Get the hyperbolic sine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    tanh

    Implementations:
    tanh(x, option:rounding): -> return_type
    0. tanh(fp32, option:rounding): -> fp32
    1. tanh(fp64, option:rounding): -> fp64

    Get the hyperbolic tangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    acos

    Implementations:
    acos(x, option:rounding, option:on_domain_error): -> return_type
    0. acos(fp32, option:rounding, option:on_domain_error): -> fp32
    1. acos(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arccosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    asin

    Implementations:
    asin(x, option:rounding, option:on_domain_error): -> return_type
    0. asin(fp32, option:rounding, option:on_domain_error): -> fp32
    1. asin(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arcsine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    atan

    Implementations:
    atan(x, option:rounding): -> return_type
    0. atan(fp32, option:rounding): -> fp32
    1. atan(fp64, option:rounding): -> fp64

    Get the arctangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    acosh

    Implementations:
    acosh(x, option:rounding, option:on_domain_error): -> return_type
    0. acosh(fp32, option:rounding, option:on_domain_error): -> fp32
    1. acosh(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the hyperbolic arccosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    asinh

    Implementations:
    asinh(x, option:rounding): -> return_type
    0. asinh(fp32, option:rounding): -> fp32
    1. asinh(fp64, option:rounding): -> fp64

    Get the hyperbolic arcsine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    atanh

    Implementations:
    atanh(x, option:rounding, option:on_domain_error): -> return_type
    0. atanh(fp32, option:rounding, option:on_domain_error): -> fp32
    1. atanh(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the hyperbolic arctangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    atan2

    Implementations:
    atan2(x, y, option:rounding, option:on_domain_error): -> return_type
    0. atan2(fp32, fp32, option:rounding, option:on_domain_error): -> fp32
    1. atan2(fp64, fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arctangent of values given as x/y pairs.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    radians

    Implementations:
    radians(x, option:rounding): -> return_type
    0. radians(fp32, option:rounding): -> fp32
    1. radians(fp64, option:rounding): -> fp64

    Converts angle x in degrees to radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    degrees

    Implementations:
    degrees(x, option:rounding): -> return_type
    0. degrees(fp32, option:rounding): -> fp32
    1. degrees(fp64, option:rounding): -> fp64

    Converts angle x in radians to degrees.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    abs

    Implementations:
    abs(x, option:overflow): -> return_type
    0. abs(i8, option:overflow): -> i8
    1. abs(i16, option:overflow): -> i16
    2. abs(i32, option:overflow): -> i32
    3. abs(i64, option:overflow): -> i64
    4. abs(fp32): -> fp32
    5. abs(fp64): -> fp64

    Calculate the absolute value of the argument. Integer values allow the specification of overflow behavior to handle the asymmetry of two's complement, e.g. the i8 range [-128, 127], where the absolute value of -128 is not representable.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    sign

    Implementations:
    sign(x): -> return_type
    0. sign(i8): -> i8
    1. sign(i16): -> i16
    2. sign(i32): -> i32
    3. sign(i64): -> i64
    4. sign(fp32): -> fp32
    5. sign(fp64): -> fp64

    Return the signedness of the argument. Integer values return signedness with the same type as the input. Possible return values are [-1, 0, 1]. Floating point values return signedness with the same type as the input. Possible return values are [-1.0, -0.0, 0.0, 1.0, NaN].
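A short Python sketch of the two families of return values listed above. The function names are hypothetical, and mapping a -0.0 input to a -0.0 output (and NaN to NaN) is an assumption inferred from the list of possible floating point results rather than something the description states.

```python
import math


def sign_int(x):
    """Signedness of an integer: -1, 0 or 1."""
    return (x > 0) - (x < 0)


def sign_fp(x):
    """Signedness of a float: -1.0, -0.0, 0.0, 1.0 or NaN."""
    if math.isnan(x):
        return math.nan
    if x == 0.0:
        return x                    # keeps the sign of a signed zero
    return math.copysign(1.0, x)    # also handles +/- infinity
```

Because -0.0 compares equal to 0.0, a caller that needs to tell the two zeros apart can inspect the result with `math.copysign(1.0, result)`.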

    factorial

    Implementations:
    factorial(n, option:overflow): -> return_type
    0. factorial(i32, option:overflow): -> i32
    1. factorial(i64, option:overflow): -> i64

    Return the factorial of a given integer input. The factorial of 0 (written 0!) is 1 by convention. Negative inputs will raise an error.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    bitwise_not

    Implementations:
    bitwise_not(x): -> return_type
    0. bitwise_not(i8): -> i8
    1. bitwise_not(i16): -> i16
    2. bitwise_not(i32): -> i32
    3. bitwise_not(i64): -> i64

    Return the bitwise NOT result for one integer input.

    bitwise_and

    Implementations:
    bitwise_and(x, y): -> return_type
    0. bitwise_and(i8, i8): -> i8
    1. bitwise_and(i16, i16): -> i16
    2. bitwise_and(i32, i32): -> i32
    3. bitwise_and(i64, i64): -> i64

    Return the bitwise AND result for two integer inputs.

    bitwise_or

    Implementations:
    bitwise_or(x, y): -> return_type
    0. bitwise_or(i8, i8): -> i8
    1. bitwise_or(i16, i16): -> i16
    2. bitwise_or(i32, i32): -> i32
    3. bitwise_or(i64, i64): -> i64

    Return the bitwise OR result for two integer inputs.

    bitwise_xor

    Implementations:
    bitwise_xor(x, y): -> return_type
    0. bitwise_xor(i8, i8): -> i8
    1. bitwise_xor(i16, i16): -> i16
    2. bitwise_xor(i32, i32): -> i32
    3. bitwise_xor(i64, i64): -> i64

    Return the bitwise XOR result for two integer inputs.

    Aggregate Functions

    sum

    Implementations:
    sum(x, option:overflow): -> return_type
    0. sum(i8, option:overflow): -> i64?
    1. sum(i16, option:overflow): -> i64?
    2. sum(i32, option:overflow): -> i64?
    3. sum(i64, option:overflow): -> i64?
    4. sum(fp32, option:overflow): -> fp64?
    5. sum(fp64, option:overflow): -> fp64?

    Sum a set of values. The sum of zero elements yields null.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    sum0

    Implementations:
    sum0(x, option:overflow): -> return_type
    0. sum0(i8, option:overflow): -> i64
    1. sum0(i16, option:overflow): -> i64
    2. sum0(i32, option:overflow): -> i64
    3. sum0(i64, option:overflow): -> i64
    4. sum0(fp32, option:overflow): -> fp64
    5. sum0(fp64, option:overflow): -> fp64

    Sum a set of values. The sum of zero elements yields zero. Null values are ignored.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
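The only difference between sum and sum0 is the result for an empty input, which the following Python sketch makes explicit. Nulls are modeled as `None`; the name `sum_` (chosen to avoid shadowing the builtin) is hypothetical, and treating an all-null input to sum like an empty one is an assumption borrowed from common SQL behavior.

```python
def sum_(values):
    """sum: null (None) when there are no non-null elements."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


def sum0(values):
    """sum0: zero when there are no non-null elements; nulls are ignored."""
    return sum(v for v in values if v is not None)
```

This is also why the sum implementations return nullable types (i64?, fp64?) while the sum0 implementations do not.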

    avg

    Implementations:
    avg(x, option:overflow): -> return_type
    0. avg(i8, option:overflow): -> i8?
    1. avg(i16, option:overflow): -> i16?
    2. avg(i32, option:overflow): -> i32?
    3. avg(i64, option:overflow): -> i64?
    4. avg(fp32, option:overflow): -> fp32?
    5. avg(fp64, option:overflow): -> fp64?

    Average a set of values. For integral types, this truncates partial values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    min

    Implementations:
    min(x): -> return_type
    0. min(i8): -> i8?
    1. min(i16): -> i16?
    2. min(i32): -> i32?
    3. min(i64): -> i64?
    4. min(fp32): -> fp32?
    5. min(fp64): -> fp64?

    Returns the minimum of a set of values.

    max

    Implementations:
    max(x): -> return_type
    0. max(i8): -> i8?
    1. max(i16): -> i16?
    2. max(i32): -> i32?
    3. max(i64): -> i64?
    4. max(fp32): -> fp32?
    5. max(fp64): -> fp64?

    Returns the maximum of a set of values.

    product

    Implementations:
    product(x, option:overflow): -> return_type
    0. product(i8, option:overflow): -> i8
    1. product(i16, option:overflow): -> i16
    2. product(i32, option:overflow): -> i32
    3. product(i64, option:overflow): -> i64
    4. product(fp32, option:rounding): -> fp32
    5. product(fp64, option:rounding): -> fp64

    Product of a set of values. Returns 1 for empty input.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    std_dev

    Implementations:
    std_dev(x, option:rounding, option:distribution): -> return_type
    0. std_dev(fp32, option:rounding, option:distribution): -> fp32?
    1. std_dev(fp64, option:rounding, option:distribution): -> fp64?

    Calculates the standard deviation of a set of values.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • distribution ['SAMPLE', 'POPULATION']

    variance

    Implementations:
    variance(x, option:rounding, option:distribution): -> return_type
    0. variance(fp32, option:rounding, option:distribution): -> fp32?
    1. variance(fp64, option:rounding, option:distribution): -> fp64?

    Calculates variance for a set of values.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • distribution ['SAMPLE', 'POPULATION']
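The distribution option shared by std_dev and variance selects the divisor. A minimal Python sketch, assuming the usual statistical definitions (n - 1 for SAMPLE, n for POPULATION) and returning `None` where the result is undefined; neither detail is spelled out in the listing above.

```python
import math


def variance(values, distribution="SAMPLE"):
    """Variance with divisor n - 1 (SAMPLE) or n (POPULATION)."""
    if distribution not in ("SAMPLE", "POPULATION"):
        raise ValueError(f"unknown distribution: {distribution}")
    n = len(values)
    divisor = n - 1 if distribution == "SAMPLE" else n
    if divisor <= 0:
        return None                 # assumption: undefined result maps to null
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / divisor


def std_dev(values, distribution="SAMPLE"):
    """Standard deviation as the square root of the variance."""
    var = variance(values, distribution)
    return None if var is None else math.sqrt(var)
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the population variance is 4 (standard deviation 2), while the sample variance is 32/7.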

    corr

    Implementations:
    corr(x, y, option:rounding): -> return_type
    0. corr(fp32, fp32, option:rounding): -> fp32?
    1. corr(fp64, fp64, option:rounding): -> fp64?

    Calculates the value of Pearson’s correlation coefficient between x and y. If there is no input, null is returned.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    mode

    Implementations:
    mode(x): -> return_type
    0. mode(i8): -> i8?
    1. mode(i16): -> i16?
    2. mode(i32): -> i32?
    3. mode(i64): -> i64?
    4. mode(fp32): -> fp32?
    5. mode(fp64): -> fp64?

    Calculates mode for a set of values. If there is no input, null is returned.

    median

    Implementations:
    median(precision, x, option:rounding): -> return_type
    0. median(precision, i8, option:rounding): -> i8?
    1. median(precision, i16, option:rounding): -> i16?
    2. median(precision, i32, option:rounding): -> i32?
    3. median(precision, i64, option:rounding): -> i64?
    4. median(precision, fp32, option:rounding): -> fp32?
    5. median(precision, fp64, option:rounding): -> fp64?

    Calculate the median for a set of values. Returns null if applied to zero records. For the integer implementations, the rounding option determines how the median should be rounded if it ends up midway between two values. For the floating point implementations, it specifies the usual floating point rounding mode.

    Options:
  • precision ['EXACT', 'APPROXIMATE']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
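For the integer implementations, the rounding option only matters when the input has an even number of values and the two middle values sum to an odd number. The Python sketch below covers the EXACT case; the function name `median_int` is hypothetical, and the interpretation of each rounding mode is the conventional one, assumed here rather than taken from the listing.

```python
def median_int(values, rounding="TIE_TO_EVEN"):
    """Exact median of integers, rounding a midway result as requested."""
    if not values:
        return None                 # null when applied to zero records
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    total = ordered[mid - 1] + ordered[mid]
    lower = total // 2              # floor of the exact midpoint
    if total % 2 == 0:
        return lower                # midpoint is already an integer
    upper = lower + 1
    positive = total > 0
    if rounding == "FLOOR":
        return lower
    if rounding == "CEILING":
        return upper
    if rounding == "TRUNCATE":
        return lower if positive else upper
    if rounding == "TIE_AWAY_FROM_ZERO":
        return upper if positive else lower
    if rounding == "TIE_TO_EVEN":
        return lower if lower % 2 == 0 else upper
    raise ValueError(f"unknown rounding option: {rounding}")
```

For the input [1, 2] the exact median is 1.5, so TRUNCATE and FLOOR return 1 while CEILING, TIE_AWAY_FROM_ZERO and TIE_TO_EVEN return 2.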

    quantile

    Implementations:
    quantile(boundaries, precision, n, distribution, option:rounding): -> return_type

  • n: A positive integer which defines the number of quantile partitions.
  • distribution: The data for which the quantiles should be computed.

    0. quantile(boundaries, precision, i64, any, option:rounding): -> LIST?<any>

    Calculates quantiles for a set of values. This function will divide the aggregated values (passed via the distribution argument) over N equally-sized bins, where N is passed via the constant argument n. It will then return the values at the boundaries of these bins in list form. If the input is appropriately sorted, this computes the quantiles of the distribution.

    The function can optionally return the first and/or last element of the input, as specified by the boundaries argument. If the input is appropriately sorted, this will thus be the minimum and/or maximum values of the distribution.

    When the boundaries do not lie exactly on elements of the incoming distribution, the function will interpolate between the two nearby elements. If the interpolated value cannot be represented exactly, the rounding option controls how the value should be selected or computed.

    The function fails and returns null in the following cases:
  • n is null or less than one;
  • any value in distribution is null.

    The function returns an empty list if n equals 1 and boundaries is set to NEITHER.

    Options:
  • boundaries ['NEITHER', 'MINIMUM', 'MAXIMUM', 'BOTH']
  • precision ['EXACT', 'APPROXIMATE']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
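One possible reading of the quantile description, sketched in Python for numeric input. Several details are assumptions made for this example: the input is sorted inside the function rather than relying on the caller, bin boundaries are placed at position k * (len - 1) / n with linear interpolation, the precision and rounding options are ignored, and an empty input yields an empty list.

```python
def quantile(distribution, n, boundaries="NEITHER"):
    """Return the values at the n - 1 interior bin boundaries, plus the
    minimum and/or maximum when requested via `boundaries`."""
    if n is None or n < 1 or any(v is None for v in distribution):
        return None                 # failure cases listed in the description
    ordered = sorted(distribution)
    if not ordered:
        return []                   # assumption: nothing to report
    last = len(ordered) - 1
    result = []
    if boundaries in ("MINIMUM", "BOTH"):
        result.append(ordered[0])
    for k in range(1, n):
        position = k * last / n     # fractional index of the k-th boundary
        lo = int(position)
        hi = min(lo + 1, last)
        fraction = position - lo
        # Linear interpolation between the two nearby elements.
        result.append(ordered[lo] + (ordered[hi] - ordered[lo]) * fraction)
    if boundaries in ("MAXIMUM", "BOTH"):
        result.append(ordered[-1])
    return result
```

With n = 4 and boundaries set to BOTH, the result has five entries: the minimum, the three quartile boundaries, and the maximum.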

    Window Functions

    row_number

    Implementations:
    0. row_number(): -> i64?

    The number of the current row within its partition.

    rank

    Implementations:
    0. rank(): -> i64?

    The rank of the current row, with gaps.

    dense_rank

    Implementations:
    0. dense_rank(): -> i64?

    The rank of the current row, without gaps.

    percent_rank

    Implementations:
    0. percent_rank(): -> fp64?

    The relative rank of the current row.

    cume_dist

    Implementations:
    0. cume_dist(): -> fp64?

    The cumulative distribution.
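The five ranking functions above can be compared side by side on a single partition. The Python sketch below assumes the partition is already sorted in ascending order by the ordering key, and it uses the conventional SQL formulas for percent_rank, (rank - 1) / (rows - 1), and cume_dist, rows at or before the current value divided by total rows. Those formulas are not given in the listing and are assumptions here, as is the function name.

```python
def ranking_functions(ordered_values):
    """Compute row_number, rank, dense_rank, percent_rank and cume_dist
    for each row of one partition sorted ascending by its ordering key."""
    n = len(ordered_values)
    rows = []
    rank = dense_rank = 0
    for i, value in enumerate(ordered_values):
        if i == 0 or value != ordered_values[i - 1]:
            rank = i + 1            # jumps past ties, leaving gaps
            dense_rank += 1         # advances by one per distinct value
        at_or_before = sum(1 for v in ordered_values if v <= value)
        rows.append({
            "row_number": i + 1,
            "rank": rank,
            "dense_rank": dense_rank,
            "percent_rank": (rank - 1) / (n - 1) if n > 1 else 0.0,
            "cume_dist": at_or_before / n,
        })
    return rows
```

For the ordering keys 10, 20, 20, 30 this gives ranks 1, 2, 2, 4 (with a gap) and dense ranks 1, 2, 2, 3 (without one).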

    ntile

    Implementations:
    ntile(x): -> return_type
    0. ntile(i32): -> i32?
    1. ntile(i64): -> i64?

    Return an integer ranging from 1 to the argument value, dividing the partition as equally as possible.

    first_value

    Implementations:
    first_value(expression): -> return_type
    0. first_value(any1): -> any1

    Returns the first value in the window.

    last_value

    Implementations:
    last_value(expression): -> return_type
    0. last_value(any1): -> any1

    Returns the last value in the window.

    nth_value

    Implementations:
    nth_value(expression, window_offset, option:on_domain_error): -> return_type
    0. nth_value(any1, i32, option:on_domain_error): -> any1?

    Returns a value from the nth row based on the window_offset. window_offset should be a positive integer. If the value of the window_offset is outside the range of the window, null is returned. The on_domain_error option governs behavior in cases where window_offset is null or is not a positive integer.

    Options:
  • on_domain_error ['NAN', 'ERROR']

    lead

    Implementations:
    lead(expression): -> return_type
    0. lead(any1): -> any1?
    1. lead(any1, i32): -> any1?
    2. lead(any1, i32, any1): -> any1?

    Return a value from a following row based on a specified physical offset. This allows you to compare a value in the current row against a following row.

    The expression is evaluated against a row that comes after the current row based on the row_offset. The row_offset should be a positive integer and is set to 1 if not specified explicitly. If the row_offset is negative, the expression will be evaluated against a row coming before the current row, similar to the lag function. A row_offset of null will return null. The function returns the default input value if row_offset goes beyond the scope of the window. If a default value is not specified, it is set to null.

    Example comparing the sales of the current year to the following year, with a row_offset of 1:

    | year | sales | next_year_sales |
    | 2019 | 20.50 | 30.00 |
    | 2020 | 30.00 | 45.99 |
    | 2021 | 45.99 | null |

    lag

    Implementations:
    lag(expression): -> return_type
    0. lag(any1): -> any1?
    1. lag(any1, i32): -> any1?
    2. lag(any1, i32, any1): -> any1?

    Return a column value from a previous row based on a specified physical offset. This allows you to compare a value in the current row against a previous row.

    The expression is evaluated against a row that comes before the current row based on the row_offset. The expression can be a column, expression or subquery that evaluates to a single value. The row_offset should be a positive integer and is set to 1 if not specified explicitly. If the row_offset is negative, the expression will be evaluated against a row coming after the current row, similar to the lead function. A row_offset of null will return null. The function returns the default input value if row_offset goes beyond the scope of the partition. If a default value is not specified, it is set to null.

    Example comparing the sales of the current year to the previous year, with a row_offset of 1:

    | year | sales | previous_year_sales |
    | 2019 | 20.50 | null |
    | 2020 | 30.00 | 20.50 |
    | 2021 | 45.99 | 30.00 |
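The lead and lag descriptions mirror each other, which the following Python sketch makes explicit by defining lag in terms of lead with a negated offset. It operates on a plain list that stands in for one ordered partition; the function signatures are chosen for this example, and null is modeled as `None`.

```python
def lead(values, row_offset=1, default=None):
    """For each row, return the value row_offset rows after it,
    or `default` when that row falls outside the partition."""
    if row_offset is None:
        return [None] * len(values)     # a null offset yields null
    result = []
    for i in range(len(values)):
        j = i + row_offset
        result.append(values[j] if 0 <= j < len(values) else default)
    return result


def lag(values, row_offset=1, default=None):
    """For each row, return the value row_offset rows before it."""
    negated = None if row_offset is None else -row_offset
    return lead(values, negated, default)
```

Applied to the sales column from the examples above, `lead([20.50, 30.00, 45.99])` yields the next_year_sales column and `lag([20.50, 30.00, 45.99])` yields the previous_year_sales column.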

    GitHub

    functions_arithmetic.yaml

    This document file is generated for functions_arithmetic.yaml

    Scalar Functions

    add

    Implementations:
    add(x, y, option:overflow): -> return_type
    0. add(i8, i8, option:overflow): -> i8
    1. add(i16, i16, option:overflow): -> i16
    2. add(i32, i32, option:overflow): -> i32
    3. add(i64, i64, option:overflow): -> i64
    4. add(fp32, fp32, option:rounding): -> fp32
    5. add(fp64, fp64, option:rounding): -> fp64

    Add two values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • subtract

    Implementations:
    subtract(x, y, option:overflow): -> return_type
    0. subtract(i8, i8, option:overflow): -> i8
    1. subtract(i16, i16, option:overflow): -> i16
    2. subtract(i32, i32, option:overflow): -> i32
    3. subtract(i64, i64, option:overflow): -> i64
    4. subtract(fp32, fp32, option:rounding): -> fp32
    5. subtract(fp64, fp64, option:rounding): -> fp64

    Subtract one value from another.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • multiply

    Implementations:
    multiply(x, y, option:overflow): -> return_type
    0. multiply(i8, i8, option:overflow): -> i8
    1. multiply(i16, i16, option:overflow): -> i16
    2. multiply(i32, i32, option:overflow): -> i32
    3. multiply(i64, i64, option:overflow): -> i64
    4. multiply(fp32, fp32, option:rounding): -> fp32
    5. multiply(fp64, fp64, option:rounding): -> fp64

    Multiply two values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • divide

    Implementations:
    divide(x, y, option:overflow, option:on_domain_error, option:on_division_by_zero): -> return_type
    0. divide(i8, i8, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i8
    1. divide(i16, i16, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i16
    2. divide(i32, i32, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i32
    3. divide(i64, i64, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i64
    4. divide(fp32, fp32, option:rounding, option:on_domain_error, option:on_division_by_zero): -> fp32
    5. divide(fp64, fp64, option:rounding, option:on_domain_error, option:on_division_by_zero): -> fp64

    *Divide x by y. In the case of integer division, partial values are truncated (i.e. rounded towards 0). The on_division_by_zero option governs behavior in cases where y is 0. If the option is IEEE then the IEEE754 standard is followed: all values except ±infinity return NaN and ±infinity are unchanged. If the option is LIMIT then the result is ±infinity in all cases. If either x or y are NaN then behavior will be governed by on_domain_error. If x and y are both ±infinity, behavior will be governed by on_domain_error. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • on_domain_error ['NULL', 'ERROR']
  • on_division_by_zero ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • rounding ['NAN', 'NULL', 'ERROR']
  • overflow ['IEEE', 'LIMIT', 'NULL', 'ERROR']
  • negate

    Implementations:
    negate(x, option:overflow): -> return_type
    0. negate(i8, option:overflow): -> i8
    1. negate(i16, option:overflow): -> i16
    2. negate(i32, option:overflow): -> i32
    3. negate(i64, option:overflow): -> i64
    4. negate(fp32): -> fp32
    5. negate(fp64): -> fp64

    Negation of the value

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    modulus

    Implementations:
    modulus(x, y, option:division_type, option:overflow, option:on_domain_error): -> return_type
    0. modulus(i8, i8, option:division_type, option:overflow, option:on_domain_error): -> i8
    1. modulus(i16, i16, option:division_type, option:overflow, option:on_domain_error): -> i16
    2. modulus(i32, i32, option:division_type, option:overflow, option:on_domain_error): -> i32
    3. modulus(i64, i64, option:division_type, option:overflow, option:on_domain_error): -> i64

    *Calculate the remainder (r) when dividing dividend (x) by divisor (y). In mathematics, many conventions for the modulus (mod) operation exist. The result of a mod operation depends on the software implementation and underlying hardware. Substrait is a format for describing compute operations on structured data and is designed for interoperability. Therefore the user is responsible for determining a definition of division as defined by the quotient (q). The following basic conditions of division are satisfied: (1) q ∈ ℤ (the quotient is an integer); (2) x = y * q + r (division rule); (3) abs(r) < abs(y), where q is the quotient. The division_type option determines the mathematical definition of quotient to use in the above definition of division. When division_type=TRUNCATE, q = trunc(x/y). When division_type=FLOOR, q = floor(x/y). In the cases of TRUNCATE and FLOOR division: remainder r = x - y * round_func(x/y), where round_func is trunc or floor respectively. The on_domain_error option governs behavior in cases where y is 0, y is ±inf, or x is ±inf. In these cases the mod is undefined. The overflow option governs behavior when integer overflow occurs. If x and y are both 0 or both ±infinity, behavior will be governed by on_domain_error. *

    Options:
  • division_type ['TRUNCATE', 'FLOOR']
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • on_domain_error ['NULL', 'ERROR']
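
    The two division_type conventions differ only when x and y have opposite signs. The following is a minimal Python sketch of the division rule x = y * q + r, not a reference implementation. The function name and the choice to raise on y == 0 (mimicking on_domain_error=ERROR) are assumptions made for illustration, and the overflow option is ignored because Python integers are unbounded.

```python
def modulus(x: int, y: int, division_type: str = "TRUNCATE") -> int:
    """Remainder r with x = y * q + r, for q = trunc(x/y) or q = floor(x/y)."""
    if y == 0:
        # Assumption: behave like on_domain_error=ERROR.
        raise ValueError("modulus is undefined for y == 0")
    if division_type == "FLOOR":
        q = x // y  # Python's // already rounds toward negative infinity
    elif division_type == "TRUNCATE":
        q = abs(x) // abs(y)  # magnitude of the quotient
        if (x < 0) != (y < 0):
            q = -q  # restore the sign, i.e. round toward zero
    else:
        raise ValueError(f"unknown division_type: {division_type}")
    return x - y * q


print(modulus(-7, 3, "TRUNCATE"))  # remainder takes the sign of x
print(modulus(-7, 3, "FLOOR"))     # remainder takes the sign of y
```

    With TRUNCATE the remainder has the sign of the dividend, while with FLOOR it has the sign of the divisor.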

    power

    Implementations:
    power(x, y, option:overflow): -> return_type
    0. power(i64, i64, option:overflow): -> i64
    1. power(fp32, fp32): -> fp32
    2. power(fp64, fp64): -> fp64

    Take the power with x as the base and y as exponent.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    sqrt

    Implementations:
    sqrt(x, option:rounding, option:on_domain_error): -> return_type
    0. sqrt(i64, option:rounding, option:on_domain_error): -> fp64
    1. sqrt(fp32, option:rounding, option:on_domain_error): -> fp32
    2. sqrt(fp64, option:rounding, option:on_domain_error): -> fp64

    Square root of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    exp

    Implementations:
    exp(x, option:rounding): -> return_type
    0. exp(i64, option:rounding): -> fp64
    1. exp(fp32, option:rounding): -> fp32
    2. exp(fp64, option:rounding): -> fp64

    The mathematical constant e, raised to the power of the value.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    cos

    Implementations:
    cos(x, option:rounding): -> return_type
    0. cos(fp32, option:rounding): -> fp32
    1. cos(fp64, option:rounding): -> fp64

    Get the cosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    sin

    Implementations:
    sin(x, option:rounding): -> return_type
    0. sin(fp32, option:rounding): -> fp32
    1. sin(fp64, option:rounding): -> fp64

    Get the sine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    tan

    Implementations:
    tan(x, option:rounding): -> return_type
    0. tan(fp32, option:rounding): -> fp32
    1. tan(fp64, option:rounding): -> fp64

    Get the tangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    cosh

    Implementations:
    cosh(x, option:rounding): -> return_type
    0. cosh(fp32, option:rounding): -> fp32
    1. cosh(fp64, option:rounding): -> fp64

    Get the hyperbolic cosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    sinh

    Implementations:
    sinh(x, option:rounding): -> return_type
    0. sinh(fp32, option:rounding): -> fp32
    1. sinh(fp64, option:rounding): -> fp64

    Get the hyperbolic sine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    tanh

    Implementations:
    tanh(x, option:rounding): -> return_type
    0. tanh(fp32, option:rounding): -> fp32
    1. tanh(fp64, option:rounding): -> fp64

    Get the hyperbolic tangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    acos

    Implementations:
    acos(x, option:rounding, option:on_domain_error): -> return_type
    0. acos(fp32, option:rounding, option:on_domain_error): -> fp32
    1. acos(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arccosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    asin

    Implementations:
    asin(x, option:rounding, option:on_domain_error): -> return_type
    0. asin(fp32, option:rounding, option:on_domain_error): -> fp32
    1. asin(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arcsine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    atan

    Implementations:
    atan(x, option:rounding): -> return_type
    0. atan(fp32, option:rounding): -> fp32
    1. atan(fp64, option:rounding): -> fp64

    Get the arctangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    acosh

    Implementations:
    acosh(x, option:rounding, option:on_domain_error): -> return_type
    0. acosh(fp32, option:rounding, option:on_domain_error): -> fp32
    1. acosh(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the hyperbolic arccosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    asinh

    Implementations:
    asinh(x, option:rounding): -> return_type
    0. asinh(fp32, option:rounding): -> fp32
    1. asinh(fp64, option:rounding): -> fp64

    Get the hyperbolic arcsine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    atanh

    Implementations:
    atanh(x, option:rounding, option:on_domain_error): -> return_type
    0. atanh(fp32, option:rounding, option:on_domain_error): -> fp32
    1. atanh(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the hyperbolic arctangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    atan2

    Implementations:
    atan2(x, y, option:rounding, option:on_domain_error): -> return_type
    0. atan2(fp32, fp32, option:rounding, option:on_domain_error): -> fp32
    1. atan2(fp64, fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arctangent of values given as x/y pairs.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']

    radians

    Implementations:
    radians(x, option:rounding): -> return_type
    0. radians(fp32, option:rounding): -> fp32
    1. radians(fp64, option:rounding): -> fp64

    *Converts angle x in degrees to radians. *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    degrees

    Implementations:
    degrees(x, option:rounding): -> return_type
    0. degrees(fp32, option:rounding): -> fp32
    1. degrees(fp64, option:rounding): -> fp64

    *Converts angle x in radians to degrees. *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    abs

    Implementations:
    abs(x, option:overflow): -> return_type
    0. abs(i8, option:overflow): -> i8
    1. abs(i16, option:overflow): -> i16
    2. abs(i32, option:overflow): -> i32
    3. abs(i64, option:overflow): -> i64
    4. abs(fp32): -> fp32
    5. abs(fp64): -> fp64

    *Calculate the absolute value of the argument. Integer values allow the specification of overflow behavior to handle the asymmetry of two's complement ranges, e.g. the Int8 range [-128 : 127]. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
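
    For i8, the only input whose absolute value does not fit is -128. The sketch below illustrates how the three overflow values could play out for that case. The function name is hypothetical, and treating SILENT as two's complement wraparound is an assumption: the text above does not state which value a SILENT overflow produces.

```python
I8_MIN, I8_MAX = -128, 127


def abs_i8(x: int, overflow: str = "ERROR") -> int:
    """Absolute value of an i8, with an explicit overflow policy."""
    if not I8_MIN <= x <= I8_MAX:
        raise ValueError("x is not a valid i8 value")
    result = -x if x < 0 else x
    if result <= I8_MAX:
        return result
    # Only reachable for x == -128, whose magnitude (128) exceeds I8_MAX.
    if overflow == "SATURATE":
        return I8_MAX  # clamp to the largest representable value
    if overflow == "SILENT":
        # Assumption: wrap around as two's complement hardware would.
        return (result + 128) % 256 - 128
    if overflow == "ERROR":
        raise OverflowError("abs(-128) does not fit in i8")
    raise ValueError(f"unknown overflow option: {overflow}")
```

    The same pattern would apply to negate, which overflows on the same single input.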

    sign

    Implementations:
    sign(x): -> return_type
    0. sign(i8): -> i8
    1. sign(i16): -> i16
    2. sign(i32): -> i32
    3. sign(i64): -> i64
    4. sign(fp32): -> fp32
    5. sign(fp64): -> fp64

    *Return the signedness of the argument. Integer values return signedness with the same type as the input. Possible return values are [-1, 0, 1]. Floating point values return signedness with the same type as the input. Possible return values are [-1.0, -0.0, 0.0, 1.0, NaN]. *
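
    The floating point case has two extra results compared with the integer case: a signed zero and NaN. A minimal Python sketch, with hypothetical function names and assuming that a zero input is returned with its sign bit intact and that NaN maps to NaN:

```python
import math


def sign_int(x: int) -> int:
    """Return -1, 0 or 1 for an integer input."""
    return (x > 0) - (x < 0)


def sign_fp(x: float) -> float:
    """Return -1.0, -0.0, 0.0, 1.0 or NaN for a floating point input."""
    if math.isnan(x):
        return math.nan
    if x == 0.0:
        return x  # keeps the sign bit, so -0.0 stays -0.0
    return math.copysign(1.0, x)  # also covers +/- infinity
```

    Note that -0.0 == 0.0 in Python, so the sign bit has to be inspected with math.copysign to tell the two zero results apart.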

    factorial

    Implementations:
    factorial(n, option:overflow): -> return_type
    0. factorial(i32, option:overflow): -> i32
    1. factorial(i64, option:overflow): -> i64

    *Return the factorial of a given integer input. By convention, the factorial of 0 (0!) is 1. Negative inputs will raise an error. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    bitwise_not

    Implementations:
    bitwise_not(x): -> return_type
    0. bitwise_not(i8): -> i8
    1. bitwise_not(i16): -> i16
    2. bitwise_not(i32): -> i32
    3. bitwise_not(i64): -> i64

    *Return the bitwise NOT result for one integer input. *

    bitwise_and

    Implementations:
    bitwise_and(x, y): -> return_type
    0. bitwise_and(i8, i8): -> i8
    1. bitwise_and(i16, i16): -> i16
    2. bitwise_and(i32, i32): -> i32
    3. bitwise_and(i64, i64): -> i64

    *Return the bitwise AND result for two integer inputs. *

    bitwise_or

    Implementations:
    bitwise_or(x, y): -> return_type
    0. bitwise_or(i8, i8): -> i8
    1. bitwise_or(i16, i16): -> i16
    2. bitwise_or(i32, i32): -> i32
    3. bitwise_or(i64, i64): -> i64

    *Return the bitwise OR result for two given integer inputs. *

    bitwise_xor

    Implementations:
    bitwise_xor(x, y): -> return_type
    0. bitwise_xor(i8, i8): -> i8
    1. bitwise_xor(i16, i16): -> i16
    2. bitwise_xor(i32, i32): -> i32
    3. bitwise_xor(i64, i64): -> i64

    *Return the bitwise XOR result for two integer inputs. *

    Aggregate Functions

    sum

    Implementations:
    sum(x, option:overflow): -> return_type
    0. sum(i8, option:overflow): -> i64?
    1. sum(i16, option:overflow): -> i64?
    2. sum(i32, option:overflow): -> i64?
    3. sum(i64, option:overflow): -> i64?
    4. sum(fp32, option:overflow): -> fp64?
    5. sum(fp64, option:overflow): -> fp64?

    Sum a set of values. The sum of zero elements yields null.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    sum0

    Implementations:
    sum0(x, option:overflow): -> return_type
    0. sum0(i8, option:overflow): -> i64
    1. sum0(i16, option:overflow): -> i64
    2. sum0(i32, option:overflow): -> i64
    3. sum0(i64, option:overflow): -> i64
    4. sum0(fp32, option:overflow): -> fp64
    5. sum0(fp64, option:overflow): -> fp64

    *Sum a set of values. The sum of zero elements yields zero. Null values are ignored. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
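
    The difference between sum and sum0 is the result for empty input, which is also why sum has a nullable return type and sum0 does not. A small Python sketch using None for null. The function names are hypothetical, and skipping nulls inside sum is an assumption borrowed from the sum0 description, since the sum description does not say how nulls are treated.

```python
from typing import Iterable, Optional


def sum_nullable(values: Iterable[Optional[int]]) -> Optional[int]:
    """sum: yields null (None) when there is nothing to add up."""
    present = [v for v in values if v is not None]  # assumption: nulls are skipped
    return sum(present) if present else None


def sum0(values: Iterable[Optional[int]]) -> int:
    """sum0: yields zero when there is nothing to add up; nulls are ignored."""
    return sum(v for v in values if v is not None)
```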

    avg

    Implementations:
    avg(x, option:overflow): -> return_type
    0. avg(i8, option:overflow): -> i8?
    1. avg(i16, option:overflow): -> i16?
    2. avg(i32, option:overflow): -> i32?
    3. avg(i64, option:overflow): -> i64?
    4. avg(fp32, option:overflow): -> fp32?
    5. avg(fp64, option:overflow): -> fp64?

    Average a set of values. For integral types, this truncates partial values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']

    min

    Implementations:
    min(x): -> return_type
    0. min(i8): -> i8?
    1. min(i16): -> i16?
    2. min(i32): -> i32?
    3. min(i64): -> i64?
    4. min(fp32): -> fp32?
    5. min(fp64): -> fp64?

    Minimum of a set of values.

    max

    Implementations:
    max(x): -> return_type
    0. max(i8): -> i8?
    1. max(i16): -> i16?
    2. max(i32): -> i32?
    3. max(i64): -> i64?
    4. max(fp32): -> fp32?
    5. max(fp64): -> fp64?

    Maximum of a set of values.

    product

    Implementations:
    product(x, option:overflow): -> return_type
    0. product(i8, option:overflow): -> i8
    1. product(i16, option:overflow): -> i16
    2. product(i32, option:overflow): -> i32
    3. product(i64, option:overflow): -> i64
    4. product(fp32, option:rounding): -> fp32
    5. product(fp64, option:rounding): -> fp64

    Product of a set of values. Returns 1 for empty input.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    std_dev

    Implementations:
    std_dev(x, option:rounding, option:distribution): -> return_type
    0. std_dev(fp32, option:rounding, option:distribution): -> fp32?
    1. std_dev(fp64, option:rounding, option:distribution): -> fp64?

    Calculates the standard deviation for a set of values.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • distribution ['SAMPLE', 'POPULATION']

    variance

    Implementations:
    variance(x, option:rounding, option:distribution): -> return_type
    0. variance(fp32, option:rounding, option:distribution): -> fp32?
    1. variance(fp64, option:rounding, option:distribution): -> fp64?

    Calculates variance for a set of values.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • distribution ['SAMPLE', 'POPULATION']
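
    The distribution option selects the divisor used for variance and std_dev. The sketch below assumes the usual statistical definitions (n for POPULATION, n - 1 for SAMPLE) and returns None, standing in for null, when the divisor would not be positive. Both are assumptions, as the text above does not spell out the formulas. The rounding option is not modelled.

```python
import math
from typing import Optional, Sequence


def variance(values: Sequence[float], distribution: str = "SAMPLE") -> Optional[float]:
    """Variance of values; None when it is undefined for the given input size."""
    n = len(values)
    if distribution == "POPULATION":
        divisor = n
    elif distribution == "SAMPLE":
        divisor = n - 1
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    if divisor <= 0:
        return None  # assumption: too few rows yields null
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / divisor


def std_dev(values: Sequence[float], distribution: str = "SAMPLE") -> Optional[float]:
    """Standard deviation, defined as the square root of the variance."""
    var = variance(values, distribution)
    return None if var is None else math.sqrt(var)
```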

    corr

    Implementations:
    corr(x, y, option:rounding): -> return_type
    0. corr(fp32, fp32, option:rounding): -> fp32?
    1. corr(fp64, fp64, option:rounding): -> fp64?

    *Calculates the value of Pearson’s correlation coefficient between x and y. If there is no input, null is returned. *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    mode

    Implementations:
    mode(x): -> return_type
    0. mode(i8): -> i8?
    1. mode(i16): -> i16?
    2. mode(i32): -> i32?
    3. mode(i64): -> i64?
    4. mode(fp32): -> fp32?
    5. mode(fp64): -> fp64?

    *Calculates mode for a set of values. If there is no input, null is returned. *

    median

    Implementations:
    median(precision, x, option:rounding): -> return_type
    0. median(precision, i8, option:rounding): -> i8?
    1. median(precision, i16, option:rounding): -> i16?
    2. median(precision, i32, option:rounding): -> i32?
    3. median(precision, i64, option:rounding): -> i64?
    4. median(precision, fp32, option:rounding): -> fp32?
    5. median(precision, fp64, option:rounding): -> fp64?

    *Calculate the median for a set of values. Returns null if applied to zero records. For the integer implementations, the rounding option determines how the median should be rounded if it ends up midway between two values. For the floating point implementations, it specifies the usual floating point rounding mode. *

    Options:
  • precision ['EXACT', 'APPROXIMATE']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
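
    For the integer implementations, rounding only matters when the input has an even number of rows and the two middle values sum to an odd number, so the midpoint ends in .5. A hedged Python sketch of that case follows. The function name is hypothetical, only EXACT precision is modelled, and None stands in for null.

```python
from typing import Optional, Sequence


def median_int(values: Sequence[int], rounding: str = "TIE_TO_EVEN") -> Optional[int]:
    """Exact median of integers, rounding a .5 midpoint per the rounding option."""
    if not values:
        return None  # null for zero records
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    total = ordered[n // 2 - 1] + ordered[n // 2]
    if total % 2 == 0:
        return total // 2  # midpoint is already an integer
    lower = total // 2  # floor of the midpoint
    upper = lower + 1   # ceiling of the midpoint
    if rounding == "FLOOR":
        return lower
    if rounding == "CEILING":
        return upper
    if rounding == "TRUNCATE":
        return upper if total < 0 else lower  # toward zero
    if rounding == "TIE_AWAY_FROM_ZERO":
        return lower if total < 0 else upper
    if rounding == "TIE_TO_EVEN":
        return lower if lower % 2 == 0 else upper
    raise ValueError(f"unknown rounding option: {rounding}")
```

    For example, the midpoint of [-2, -1] is -1.5, which TRUNCATE and CEILING resolve to -1 and the other three options resolve to -2.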

    quantile

    Implementations:
    quantile(boundaries, precision, n, distribution, option:rounding): -> return_type

  • n: A positive integer which defines the number of quantile partitions.
  • distribution: The data for which the quantiles should be computed.

    0. quantile(boundaries, precision, i64, any, option:rounding): -> LIST?<any>

    *Calculates quantiles for a set of values. This function will divide the aggregated values (passed via the distribution argument) over N equally-sized bins, where N is passed via a constant argument. It will then return the values at the boundaries of these bins in list form. If the input is appropriately sorted, this computes the quantiles of the distribution.

    The function can optionally return the first and/or last element of the input, as specified by the boundaries argument. If the input is appropriately sorted, this will thus be the minimum and/or maximum values of the distribution.

    When the boundaries do not lie exactly on elements of the incoming distribution, the function will interpolate between the two nearby elements. If the interpolated value cannot be represented exactly, the rounding option controls how the value should be selected or computed.

    The function fails and returns null in the following cases:

  • n is null or less than one;
  • any value in distribution is null.

    The function returns an empty list if n equals 1 and boundaries is set to NEITHER. *

    Options:
  • boundaries ['NEITHER', 'MINIMUM', 'MAXIMUM', 'BOTH']
  • precision ['EXACT', 'APPROXIMATE']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']

    Window Functions

    row_number

    Implementations:
    0. row_number(): -> i64?

    The number of the current row within its partition.

    rank

    Implementations:
    0. rank(): -> i64?

    The rank of the current row, with gaps.

    dense_rank

    Implementations:
    0. dense_rank(): -> i64?

    The rank of the current row, without gaps.

    percent_rank

    Implementations:
    0. percent_rank(): -> fp64?

    The relative rank of the current row.

    cume_dist

    Implementations:
    0. cume_dist(): -> fp64?

    The cumulative distribution.
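
    The five ranking functions above can be compared side by side on one partition. The sketch below takes the ordering keys of a single partition, already sorted ascending, and computes each value per row. The function name is hypothetical. The formulas used for percent_rank, (rank - 1) / (rows - 1), and for cume_dist, the share of rows ordered at or before the current row, are the conventional SQL definitions and are an assumption here, since the descriptions above do not state them.

```python
from typing import Dict, List, Sequence


def window_ranks(sorted_keys: Sequence[int]) -> List[Dict[str, float]]:
    """Ranking values for each row of one partition, keys sorted ascending."""
    n = len(sorted_keys)
    rows: List[Dict[str, float]] = []
    rank = dense = 0
    for i, key in enumerate(sorted_keys):
        if i == 0 or key != sorted_keys[i - 1]:
            rank = i + 1  # jumps past ties, leaving gaps
            dense += 1    # advances by one per distinct key, no gaps
        at_or_before = sum(1 for k in sorted_keys if k <= key)
        rows.append({
            "row_number": i + 1,
            "rank": rank,
            "dense_rank": dense,
            "percent_rank": (rank - 1) / (n - 1) if n > 1 else 0.0,
            "cume_dist": at_or_before / n,
        })
    return rows


for row in window_ranks([10, 20, 20, 30]):
    print(row)
```

    For the keys 10, 20, 20, 30, rank yields 1, 2, 2, 4 while dense_rank yields 1, 2, 2, 3.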

    ntile

    Implementations:
    ntile(x): -> return_type
    0. ntile(i32): -> i32?
    1. ntile(i64): -> i64?

    Return an integer ranging from 1 to the argument value, dividing the partition as equally as possible.
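
    "As equally as possible" leaves open which buckets receive the extra rows when the partition size is not a multiple of the argument. The sketch below assumes the common SQL convention that the earliest buckets are one row larger. The function name and its row-count parameter are hypothetical.

```python
from typing import List


def ntile_assignments(num_rows: int, buckets: int) -> List[int]:
    """Bucket number (1-based) for each row of an ordered partition."""
    if buckets < 1:
        raise ValueError("buckets must be a positive integer")
    base, extra = divmod(num_rows, buckets)
    result: List[int] = []
    for bucket in range(1, buckets + 1):
        # Assumption: the first `extra` buckets take one additional row.
        size = base + (1 if bucket <= extra else 0)
        result.extend([bucket] * size)
    return result


print(ntile_assignments(7, 3))  # bucket sizes 3, 2, 2
```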

    first_value

    Implementations:
    first_value(expression): -> return_type
    0. first_value(any1): -> any1

    *Returns the first value in the window. *

    last_value

    Implementations:
    last_value(expression): -> return_type
    0. last_value(any1): -> any1

    *Returns the last value in the window. *

    nth_value

    Implementations:
    nth_value(expression, window_offset, option:on_domain_error): -> return_type
    0. nth_value(any1, i32, option:on_domain_error): -> any1?

    *Returns a value from the nth row based on the window_offset. window_offset should be a positive integer. If the value of the window_offset is outside the range of the window, null is returned. The on_domain_error option governs behavior in cases where window_offset is not a positive integer or null. *

    Options:
  • on_domain_error ['NAN', 'ERROR']

    lead

    Implementations:
    lead(expression): -> return_type
    0. lead(any1): -> any1?
    1. lead(any1, i32): -> any1?
    2. lead(any1, i32, any1): -> any1?

    *Return a value from a following row based on a specified physical offset. This allows you to compare a value in the current row against a following row. The expression is evaluated against a row that comes after the current row based on the row_offset. The row_offset should be a positive integer and is set to 1 if not specified explicitly. If the row_offset is negative, the expression will be evaluated against a row coming before the current row, similar to the lag function. A row_offset of null will return null. The function returns the default input value if row_offset goes beyond the scope of the window. If a default value is not specified, it is set to null.

    Example comparing the sales of the current year to the following year, with a row_offset of 1:

    | year | sales | next_year_sales |
    | 2019 | 20.50 | 30.00           |
    | 2020 | 30.00 | 45.99           |
    | 2021 | 45.99 | null            | *

    lag

    Implementations:
    lag(expression): -> return_type
    0. lag(any1): -> any1?
    1. lag(any1, i32): -> any1?
    2. lag(any1, i32, any1): -> any1?

    *Return a column value from a previous row based on a specified physical offset. This allows you to compare a value in the current row against a previous row. The expression is evaluated against a row that comes before the current row based on the row_offset. The expression can be a column, expression or subquery that evaluates to a single value. The row_offset should be a positive integer and is set to 1 if not specified explicitly. If the row_offset is negative, the expression will be evaluated against a row coming after the current row, similar to the lead function. A row_offset of null will return null. The function returns the default input value if row_offset goes beyond the scope of the partition. If a default value is not specified, it is set to null.

    Example comparing the sales of the current year to the previous year, with a row_offset of 1:

    | year | sales | previous_year_sales |
    | 2019 | 20.50 | null                |
    | 2020 | 30.00 | 20.50               |
    | 2021 | 45.99 | 30.00               | *
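
    The lead and lag descriptions mirror each other, which a short Python sketch can make concrete. It treats one list as a single ordered partition and uses None for null. The parameter name default is hypothetical (the text only calls it "the default input value"), and the sales figures are the ones from the examples.

```python
from typing import Any, List, Optional, Sequence


def lead(values: Sequence[Any], row_offset: Optional[int] = 1,
         default: Any = None) -> List[Any]:
    """Value row_offset rows after each row, or default past the edge."""
    if row_offset is None:
        return [None] * len(values)  # a null offset returns null
    result = []
    for i in range(len(values)):
        j = i + row_offset
        result.append(values[j] if 0 <= j < len(values) else default)
    return result


def lag(values: Sequence[Any], row_offset: Optional[int] = 1,
        default: Any = None) -> List[Any]:
    """Value row_offset rows before each row; a lead with a negated offset."""
    if row_offset is None:
        return [None] * len(values)
    return lead(values, -row_offset, default)


sales = [20.50, 30.00, 45.99]  # 2019, 2020, 2021
print(lead(sales))  # next_year_sales column
print(lag(sales))   # previous_year_sales column
```

    Expressing lag through lead reflects the statement that a negative row_offset makes either function behave like the other.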

    functions_arithmetic_decimal.yaml

    This document file is generated for functions_arithmetic_decimal.yaml

    Scalar Functions

    add

    Implementations:
    add(x, y, option:overflow): -> return_type
    0. add(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = max(S1,S2)
    + functions_arithmetic_decimal.yaml - Substrait: Cross-Language Serialization for Relational Algebra      

    functions_arithmetic_decimal.yaml

    This document file is generated for functions_arithmetic_decimal.yaml

    Scalar Functions

    add

    Implementations:
    add(x, y, option:overflow): -> return_type
    0. add(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = max(S1,S2)
     init_prec = init_scale + max(P1 - S1, P2 - S2) + 1
     min_scale = min(init_scale, 6)
     delta = init_prec - 38
    diff --git a/extensions/functions_boolean/index.html b/extensions/functions_boolean/index.html
    index 2697efa..c2d09cc 100644
    --- a/extensions/functions_boolean/index.html
    +++ b/extensions/functions_boolean/index.html
    @@ -1,4 +1,4 @@
    - functions_boolean.yaml - Substrait: Cross-Language Serialization for Relational Algebra      

    functions_boolean.yaml

    This document file is generated for functions_boolean.yaml

    Scalar Functions

    or

    Implementations:
    or(a): -> return_type
    0. or(boolean?): -> boolean?

    *The boolean or using Kleene logic. This function behaves as follows with nulls:

    true or null = true
    + functions_boolean.yaml - Substrait: Cross-Language Serialization for Relational Algebra      

    functions_boolean.yaml

    This document file is generated for functions_boolean.yaml

    Scalar Functions

    or

    Implementations:
    or(a): -> return_type
    0. or(boolean?): -> boolean?

    *The boolean or using Kleene logic. This function behaves as follows with nulls:

    true or null = true
     
     null or true = true
     
    diff --git a/extensions/functions_comparison/index.html b/extensions/functions_comparison/index.html
    index 711a272..0520fd7 100644
    --- a/extensions/functions_comparison/index.html
    +++ b/extensions/functions_comparison/index.html
    @@ -1,4 +1,4 @@
    - functions_comparison.yaml - Substrait: Cross-Language Serialization for Relational Algebra      

    functions_comparison.yaml

    This document file is generated for functions_comparison.yaml

    Scalar Functions

    not_equal

    Implementations:
    not_equal(x, y): -> return_type
    0. not_equal(any1, any1): -> boolean

    *Whether two values are not equal. not_equal(x, y) := (x != y). If either or both of x and y are null, null is returned. *

    equal

    Implementations:
    equal(x, y): -> return_type
    0. equal(any1, any1): -> boolean

    *Whether two values are equal. equal(x, y) := (x == y). If either or both of x and y are null, null is returned. *

    is_not_distinct_from

    Implementations:
    is_not_distinct_from(x, y): -> return_type
    0. is_not_distinct_from(any1, any1): -> boolean

    *Whether two values are equal. This function treats null values as comparable, so is_not_distinct_from(null, null) == True. This is in contrast to equal, in which null values do not compare. *

    is_distinct_from

    Implementations:
    is_distinct_from(x, y): -> return_type
    0. is_distinct_from(any1, any1): -> boolean

    *Whether two values are not equal. This function treats null values as comparable, so is_distinct_from(null, null) == False. This is in contrast to equal, in which null values do not compare. *
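
    The contrast between equal and the two distinct-from functions comes down to how null is handled. A minimal Python sketch using None for null:

```python
from typing import Any, Optional


def equal(x: Any, y: Any) -> Optional[bool]:
    """Null-propagating equality: any null input yields null."""
    if x is None or y is None:
        return None
    return x == y


def is_not_distinct_from(x: Any, y: Any) -> bool:
    """Null-safe equality: two nulls compare as equal."""
    if x is None or y is None:
        return x is None and y is None
    return x == y


def is_distinct_from(x: Any, y: Any) -> bool:
    """Null-safe inequality: the negation of is_not_distinct_from."""
    return not is_not_distinct_from(x, y)
```

    In this sketch equal has three possible results (True, False, None), whereas the distinct-from functions always return True or False.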

    lt

    Implementations:
    lt(x, y): -> return_type
    0. lt(any1, any1): -> boolean

    *Less than. lt(x, y) := (x < y). If either or both of x and y are null, null is returned. *

    gt

    Implementations:
    gt(x, y): -> return_type
    0. gt(any1, any1): -> boolean

    *Greater than. gt(x, y) := (x > y). If either or both of x and y are null, null is returned. *

    lte

    Implementations:
    lte(x, y): -> return_type
    0. lte(any1, any1): -> boolean

    *Less than or equal to. lte(x, y) := (x <= y). If either or both of x and y are null, null is returned. *

    gte

    Implementations:
    gte(x, y): -> return_type
    0. gte(any1, any1): -> boolean

    *Greater than or equal to. gte(x, y) := (x >= y). If either or both of x and y are null, null is returned. *

    between

    Implementations:
    between(expression, low, high): -> return_type

  • expression: The expression to test for in the range defined by `low` and `high`.
  • low: The value to check if greater than or equal to.
  • high: The value to check if less than or equal to.

    0. between(any1, any1, any1): -> boolean

    Whether the expression is greater than or equal to low and less than or equal to high, i.e. expression BETWEEN low AND high. If low, high, or expression is null, null is returned.

    is_null

    Implementations:
    is_null(x): -> return_type
    0. is_null(any1): -> boolean

    Whether a value is null. NaN is not null.

    is_not_null

    Implementations:
    is_not_null(x): -> return_type
    0. is_not_null(any1): -> boolean

    Whether a value is not null. NaN is not null.

    is_nan

    Implementations:
    is_nan(x): -> return_type
    0. is_nan(fp32): -> boolean
    1. is_nan(fp64): -> boolean

    *Whether a value is not a number. If x is null, null is returned. *

    is_finite

    Implementations:
    is_finite(x): -> return_type
    0. is_finite(fp32): -> boolean
    1. is_finite(fp64): -> boolean

    *Whether a value is finite (neither infinite nor NaN). If x is null, null is returned. *

    is_infinite

    Implementations:
    is_infinite(x): -> return_type
    0. is_infinite(fp32): -> boolean
    1. is_infinite(fp64): -> boolean

    *Whether a value is infinite. If x is null, null is returned. *

    nullif

    Implementations:
    nullif(x, y): -> return_type
    0. nullif(any1, any1): -> any1

    If two values are equal, return null. Otherwise, return the first value.

    coalesce

    Implementations:
    0. coalesce(any1, any1): -> any1

    Evaluate arguments from left to right and return the first argument that is not null. Once a non-null argument is found, the remaining arguments are not evaluated. If all arguments are null, return null.

    least

    Implementations:
    0. least(any1, any1): -> any1

    Evaluates each argument and returns the smallest one. The function will return null if any argument evaluates to null.

    least_skip_null

    Implementations:
    0. least_skip_null(any1, any1): -> any1

    Evaluates each argument and returns the smallest one. The function will return null only if all arguments evaluate to null.

    greatest

    Implementations:
    0. greatest(any1, any1): -> any1

    Evaluates each argument and returns the largest one. The function will return null if any argument evaluates to null.

    greatest_skip_null

    Implementations:
    0. greatest_skip_null(any1, any1): -> any1

    Evaluates each argument and returns the largest one. The function will return null only if all arguments evaluate to null.
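
    The plain and _skip_null variants differ only in how a null argument affects the result. A short Python sketch with None for null. The listed implementations take two arguments; accepting any number here is a convenience of the sketch.

```python
from typing import Any


def least(*args: Any) -> Any:
    """Smallest argument; null if any argument is null."""
    if any(a is None for a in args):
        return None
    return min(args)


def least_skip_null(*args: Any) -> Any:
    """Smallest non-null argument; null only if every argument is null."""
    present = [a for a in args if a is not None]
    return min(present) if present else None


def greatest(*args: Any) -> Any:
    """Largest argument; null if any argument is null."""
    if any(a is None for a in args):
        return None
    return max(args)


def greatest_skip_null(*args: Any) -> Any:
    """Largest non-null argument; null only if every argument is null."""
    present = [a for a in args if a is not None]
    return max(present) if present else None
```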

    GitHub

    functions_comparison.yaml

    This document file is generated for functions_comparison.yaml

    Scalar Functions

    not_equal

    Implementations:
    not_equal(x, y): -> return_type
    0. not_equal(any1, any1): -> boolean

    *Whether two values are not_equal. not_equal(x, y) := (x != y) If either/both of x and y are null, null is returned. *

    equal

    Implementations:
    equal(x, y): -> return_type
    0. equal(any1, any1): -> boolean

    *Whether two values are equal. equal(x, y) := (x == y) If either/both of x and y are null, null is returned. *

    is_not_distinct_from

    Implementations:
    is_not_distinct_from(x, y): -> return_type
    0. is_not_distinct_from(any1, any1): -> boolean

    *Whether two values are equal. This function treats null values as comparable, so is_not_distinct_from(null, null) == True This is in contrast to equal, in which null values do not compare. *

    is_distinct_from

    Implementations:
    is_distinct_from(x, y): -> return_type
    0. is_distinct_from(any1, any1): -> boolean

    *Whether two values are not equal. This function treats null values as comparable, so is_distinct_from(null, null) == False This is in contrast to equal, in which null values do not compare. *

    lt

    Implementations:
    lt(x, y): -> return_type
    0. lt(any1, any1): -> boolean

    *Less than. lt(x, y) := (x < y) If either/both of x and y are null, null is returned. *

    gt

    Implementations:
    gt(x, y): -> return_type
    0. gt(any1, any1): -> boolean

    *Greater than. gt(x, y) := (x > y) If either/both of x and y are null, null is returned. *

    lte

    Implementations:
    lte(x, y): -> return_type
    0. lte(any1, any1): -> boolean

    *Less than or equal to. lte(x, y) := (x <= y) If either/both of x and y are null, null is returned. *

    gte

    Implementations:
    gte(x, y): -> return_type
    0. gte(any1, any1): -> boolean

    *Greater than or equal to. gte(x, y) := (x >= y) If either/both of x and y are null, null is returned. *

    between

    Implementations:
    between(expression, low, high): -> return_type

  • expression: The expression to test for in the range defined by `low` and `high`.
  • low: The value to check if greater than or equal to.
  • high: The value to check if less than or equal to.
  • 0. between(any1, any1, any1): -> boolean

    Whether the expression is greater than or equal to low and less than or equal to high. expression BETWEEN low AND high If low, high, or expression are null, null is returned.

    is_null

    Implementations:
    is_null(x): -> return_type
    0. is_null(any1): -> boolean

    Whether a value is null. NaN is not null.

    is_not_null

    Implementations:
    is_not_null(x): -> return_type
    0. is_not_null(any1): -> boolean

    Whether a value is not null. NaN is not null.

    is_nan

    Implementations:
    is_nan(x): -> return_type
    0. is_nan(fp32): -> boolean
    1. is_nan(fp64): -> boolean

    *Whether a value is not a number. If x is null, null is returned. *

    is_finite

    Implementations:
    is_finite(x): -> return_type
    0. is_finite(fp32): -> boolean
    1. is_finite(fp64): -> boolean

    *Whether a value is finite (neither infinite nor NaN). If x is null, null is returned. *

    is_infinite

    Implementations:
    is_infinite(x): -> return_type
    0. is_infinite(fp32): -> boolean
    1. is_infinite(fp64): -> boolean

    *Whether a value is infinite. If x is null, null is returned. *

    nullif

    Implementations:
    nullif(x, y): -> return_type
    0. nullif(any1, any1): -> any1

    If two values are equal, return null. Otherwise, return the first value.

    coalesce

    Implementations:
    0. coalesce(any1, any1): -> any1

    Evaluate arguments from left to right and return the first argument that is not null. Once a non-null argument is found, the remaining arguments are not evaluated. If all arguments are null, return null.
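
    The left-to-right, short-circuit evaluation can be illustrated with a small Python sketch. Arguments are modeled as zero-argument callables so that evaluation can be observed; this modeling, and the helper names, are assumptions made for the example only.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def coalesce(*args: Callable[[], Optional[T]]) -> Optional[T]:
    """Evaluate arguments in order and return the first non-null result."""
    for thunk in args:
        value = thunk()
        if value is not None:
            return value  # remaining arguments are never evaluated
    return None


evaluated: List[str] = []


def arg(name: str, value: Optional[int]) -> Callable[[], Optional[int]]:
    """Build an argument that records its name when it is evaluated."""
    def thunk() -> Optional[int]:
        evaluated.append(name)
        return value
    return thunk


result = coalesce(arg("a", None), arg("b", 2), arg("c", 3))
print(result, evaluated)
```

    Here the third argument is never evaluated, because the second one already produced a non-null value.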

    least

    Implementations:
    0. least(any1, any1): -> any1

    Evaluates each argument and returns the smallest one. The function will return null if any argument evaluates to null.

    least_skip_null

    Implementations:
    0. least_skip_null(any1, any1): -> any1

    Evaluates each argument and returns the smallest one. The function will return null only if all arguments evaluate to null.

    greatest

    Implementations:
    0. greatest(any1, any1): -> any1

    Evaluates each argument and returns the largest one. The function will return null if any argument evaluates to null.

    greatest_skip_null

    Implementations:
    0. greatest_skip_null(any1, any1): -> any1

    Evaluates each argument and returns the largest one. The function will return null only if all arguments evaluate to null.
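
    The difference between least/greatest and their _skip_null variants is only in how nulls are treated. A minimal Python sketch, with None standing in for null and comparable Python values standing in for any1:

```python
from typing import Optional


def least(*args: Optional[int]) -> Optional[int]:
    """Null if any argument is null, otherwise the smallest argument."""
    return None if any(a is None for a in args) else min(args)


def least_skip_null(*args: Optional[int]) -> Optional[int]:
    """Smallest non-null argument; null only if every argument is null."""
    present = [a for a in args if a is not None]
    return min(present) if present else None


def greatest(*args: Optional[int]) -> Optional[int]:
    """Null if any argument is null, otherwise the largest argument."""
    return None if any(a is None for a in args) else max(args)


def greatest_skip_null(*args: Optional[int]) -> Optional[int]:
    """Largest non-null argument; null only if every argument is null."""
    present = [a for a in args if a is not None]
    return max(present) if present else None


print(least(3, None), least_skip_null(3, None))
print(greatest(3, 7), greatest_skip_null(None, None))
```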

    functions_datetime.yaml

    This document file is generated for functions_datetime.yaml

    Scalar Functions

    extract

    Implementations:
    extract(component, x, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.

    0. extract(component, timestamp_tz, string): -> i64
    1. extract(component, precision_timestamp_tz<P>, string): -> i64
    2. extract(component, timestamp): -> i64
    3. extract(component, precision_timestamp<P>): -> i64
    4. extract(component, date): -> i64
    5. extract(component, time): -> i64
    6. extract(component, indexing, timestamp_tz, string): -> i64
    7. extract(component, indexing, precision_timestamp_tz<P>, string): -> i64
    8. extract(component, indexing, timestamp): -> i64
    9. extract(component, indexing, precision_timestamp<P>): -> i64
    10. extract(component, indexing, date): -> i64

    Extract portion of a date/time value.

  • YEAR: Return the year.
  • ISO_YEAR: Return the ISO 8601 week-numbering year. First week of an ISO year has the majority (4 or more) of its days in January.
  • US_YEAR: Return the US epidemiological year. First week of US epidemiological year has the majority (4 or more) of its days in January. Last week of US epidemiological year has the year’s last Wednesday in it. US epidemiological week starts on Sunday.
  • QUARTER: Return the number of the quarter within the year. January 1 through March 31 map to the first quarter, April 1 through June 30 map to the second quarter, etc.
  • MONTH: Return the number of the month within the year.
  • DAY: Return the number of the day within the month.
  • DAY_OF_YEAR: Return the number of the day within the year. January 1 maps to the first day, February 1 maps to the thirty-second day, etc.
  • MONDAY_DAY_OF_WEEK: Return the number of the day within the week, from Monday (first day) to Sunday (seventh day).
  • SUNDAY_DAY_OF_WEEK: Return the number of the day within the week, from Sunday (first day) to Saturday (seventh day).
  • MONDAY_WEEK: Return the number of the week within the year. First week starts on first Monday of January.
  • SUNDAY_WEEK: Return the number of the week within the year. First week starts on first Sunday of January.
  • ISO_WEEK: Return the number of the ISO week within the ISO year. First ISO week has the majority (4 or more) of its days in January. ISO week starts on Monday.
  • US_WEEK: Return the number of the US week within the US year. First US week has the majority (4 or more) of its days in January. US week starts on Sunday.
  • HOUR: Return the hour (0-23).
  • MINUTE: Return the minute (0-59).
  • SECOND: Return the second (0-59).
  • MILLISECOND: Return number of milliseconds since the last full second.
  • MICROSECOND: Return number of microseconds since the last full millisecond.
  • NANOSECOND: Return number of nanoseconds since the last full microsecond.
  • SUBSECOND: Return number of microseconds since the last full second of the given timestamp.
  • UNIX_TIME: Return number of seconds that have elapsed since 1970-01-01 00:00:00 UTC, ignoring leap seconds.
  • TIMEZONE_OFFSET: Return number of seconds of timezone offset to UTC.

    The range of values returned for QUARTER, MONTH, DAY, DAY_OF_YEAR, MONDAY_DAY_OF_WEEK, SUNDAY_DAY_OF_WEEK, MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK depends on whether counting starts at 1 or 0. This is governed by the indexing option.

    When indexing is ONE:

  • QUARTER returns values in range 1-4
  • MONTH returns values in range 1-12
  • DAY returns values in range 1-31
  • DAY_OF_YEAR returns values in range 1-366
  • MONDAY_DAY_OF_WEEK and SUNDAY_DAY_OF_WEEK return values in range 1-7
  • MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK return values in range 1-53

    When indexing is ZERO:

  • QUARTER returns values in range 0-3
  • MONTH returns values in range 0-11
  • DAY returns values in range 0-30
  • DAY_OF_YEAR returns values in range 0-365
  • MONDAY_DAY_OF_WEEK and SUNDAY_DAY_OF_WEEK return values in range 0-6
  • MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK return values in range 0-52

    The indexing option must be specified when the component is QUARTER, MONTH, DAY, DAY_OF_YEAR, MONDAY_DAY_OF_WEEK, SUNDAY_DAY_OF_WEEK, MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, or US_WEEK. The indexing option cannot be specified when the component is YEAR, ISO_YEAR, US_YEAR, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, SUBSECOND, UNIX_TIME, or TIMEZONE_OFFSET.

    Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    Options:
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND', 'UNIX_TIME', 'TIMEZONE_OFFSET']
  • indexing ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'NANOSECOND', 'SUBSECOND', 'UNIX_TIME', 'TIMEZONE_OFFSET']
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND', 'UNIX_TIME']
  • indexing ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'NANOSECOND', 'SUBSECOND', 'UNIX_TIME']
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'UNIX_TIME']
  • indexing ['HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND']
  • component ['QUARTER', 'MONTH', 'DAY', 'DAY_OF_YEAR', 'MONDAY_DAY_OF_WEEK', 'SUNDAY_DAY_OF_WEEK', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK']
  • indexing ['ONE', 'ZERO']
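
    The interaction between component and indexing for the extract function can be sketched in Python for a handful of components on a plain date. This is a hedged illustration built on the standard datetime module; it covers only the components listed in the dictionary, and the function is a hypothetical stand-in whose argument order mirrors extract(component, indexing, date).

```python
from datetime import date


def extract(component: str, indexing: str, x: date) -> int:
    """Return a one-based or zero-based calendar component of x."""
    one_based = {
        "QUARTER": (x.month - 1) // 3 + 1,
        "MONTH": x.month,
        "DAY": x.day,
        "DAY_OF_YEAR": x.timetuple().tm_yday,
        "MONDAY_DAY_OF_WEEK": x.isoweekday(),          # Monday=1 .. Sunday=7
        "SUNDAY_DAY_OF_WEEK": x.isoweekday() % 7 + 1,  # Sunday=1 .. Saturday=7
        "ISO_WEEK": x.isocalendar()[1],
    }
    if component not in one_based:
        raise ValueError(f"component not covered by this sketch: {component}")
    if indexing == "ONE":
        return one_based[component]
    if indexing == "ZERO":
        return one_based[component] - 1
    raise ValueError(f"indexing must be ONE or ZERO, got {indexing}")


d = date(2024, 2, 1)  # a Thursday
print(extract("DAY_OF_YEAR", "ONE", d))
print(extract("DAY_OF_YEAR", "ZERO", d))
print(extract("SUNDAY_DAY_OF_WEEK", "ONE", d))
```

    With ONE indexing, February 1 is day 32 of the year, matching the description of DAY_OF_YEAR; ZERO indexing shifts every such result down by one.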

    extract_boolean

    Implementations:
    extract_boolean(component, x): -> return_type
    0. extract_boolean(component, timestamp): -> boolean
    1. extract_boolean(component, precision_timestamp<P>): -> boolean
    2. extract_boolean(component, timestamp_tz, string): -> boolean
    3. extract_boolean(component, precision_timestamp_tz<P>, string): -> boolean
    4. extract_boolean(component, date): -> boolean

    Extract boolean values of a date/time value.

  • IS_LEAP_YEAR: Return true if year of the given value is a leap year and false otherwise.
  • IS_DST: Return true if DST (Daylight Saving Time) is observed at the given value in the given timezone.

    Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    Options:
  • component ['IS_LEAP_YEAR']
  • component ['IS_LEAP_YEAR', 'IS_DST']
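
    A rough Python equivalent of the two extract_boolean components, using only the standard library. The helper names are hypothetical, and the IS_DST helper assumes its input is a timezone-aware datetime and that the system has IANA timezone data available to zoneinfo.

```python
import calendar
from datetime import date, datetime
from zoneinfo import ZoneInfo


def is_leap_year(x: date) -> bool:
    """IS_LEAP_YEAR: true if the year of x is a leap year."""
    return calendar.isleap(x.year)


def is_dst(x: datetime, timezone_name: str) -> bool:
    """IS_DST: true if DST is in effect at instant x in the given zone.

    ZoneInfo raises ZoneInfoNotFoundError for an unknown timezone name.
    """
    local = x.astimezone(ZoneInfo(timezone_name))
    return bool(local.dst())


print(is_leap_year(date(2024, 1, 1)))
print(is_leap_year(date(1900, 1, 1)))
```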

    add

    Implementations:
    add(x, y): -> return_type
    0. add(timestamp, interval_year): -> timestamp
    1. add(precision_timestamp<P>, interval_year): -> precision_timestamp<P>
    2. add(timestamp_tz, interval_year, string): -> timestamp_tz
    3. add(precision_timestamp_tz<P>, interval_year, string): -> precision_timestamp_tz<P>
    4. add(date, interval_year): -> timestamp
    5. add(timestamp, interval_day): -> timestamp
    6. add(precision_timestamp<P>, interval_day): -> precision_timestamp<P>
    7. add(timestamp_tz, interval_day): -> timestamp_tz
    8. add(precision_timestamp_tz<P>, interval_day): -> precision_timestamp_tz<P>
    9. add(date, interval_day): -> timestamp

    Add an interval to a date/time type. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    multiply

    Implementations:
    multiply(x, y): -> return_type
    0. multiply(i8, interval_day): -> interval_day
    1. multiply(i16, interval_day): -> interval_day
    2. multiply(i32, interval_day): -> interval_day
    3. multiply(i64, interval_day): -> interval_day
    4. multiply(i8, interval_year): -> interval_year
    5. multiply(i16, interval_year): -> interval_year
    6. multiply(i32, interval_year): -> interval_year
    7. multiply(i64, interval_year): -> interval_year

    Multiply an interval by an integral number.

    add_intervals

    Implementations:
    add_intervals(x, y): -> return_type
    0. add_intervals(interval_day, interval_day): -> interval_day
    1. add_intervals(interval_year, interval_year): -> interval_year

    Add two intervals together.

    subtract

    Implementations:
    subtract(x, y): -> return_type
    0. subtract(timestamp, interval_year): -> timestamp
    1. subtract(precision_timestamp<P>, interval_year): -> precision_timestamp<P>
    2. subtract(timestamp_tz, interval_year): -> timestamp_tz
    3. subtract(precision_timestamp_tz<P>, interval_year): -> precision_timestamp_tz<P>
    4. subtract(timestamp_tz, interval_year, string): -> timestamp_tz
    5. subtract(precision_timestamp_tz<P>, interval_year, string): -> precision_timestamp_tz<P>
    6. subtract(date, interval_year): -> date
    7. subtract(timestamp, interval_day): -> timestamp
    8. subtract(precision_timestamp<P>, interval_day): -> precision_timestamp<P>
    9. subtract(timestamp_tz, interval_day): -> timestamp_tz
    10. subtract(precision_timestamp_tz<P>, interval_day): -> precision_timestamp_tz<P>
    11. subtract(date, interval_day): -> date

    Subtract an interval from a date/time type. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    lte

    Implementations:
    lte(x, y): -> return_type
    0. lte(timestamp, timestamp): -> boolean
    1. lte(precision_timestamp<P>, precision_timestamp<P>): -> boolean
    2. lte(timestamp_tz, timestamp_tz): -> boolean
    3. lte(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean
    4. lte(date, date): -> boolean
    5. lte(interval_day, interval_day): -> boolean
    6. lte(interval_year, interval_year): -> boolean

    Less than or equal to.

    lt

    Implementations:
    lt(x, y): -> return_type
    0. lt(timestamp, timestamp): -> boolean
    1. lt(precision_timestamp<P>, precision_timestamp<P>): -> boolean
    2. lt(timestamp_tz, timestamp_tz): -> boolean
    3. lt(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean
    4. lt(date, date): -> boolean
    5. lt(interval_day, interval_day): -> boolean
    6. lt(interval_year, interval_year): -> boolean

    Less than.

    gte

    Implementations:
    gte(x, y): -> return_type
    0. gte(timestamp, timestamp): -> boolean
    1. gte(precision_timestamp<P>, precision_timestamp<P>): -> boolean
    2. gte(timestamp_tz, timestamp_tz): -> boolean
    3. gte(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean
    4. gte(date, date): -> boolean
    5. gte(interval_day, interval_day): -> boolean
    6. gte(interval_year, interval_year): -> boolean

    Greater than or equal to.

    gt

    Implementations:
    gt(x, y): -> return_type
    0. gt(timestamp, timestamp): -> boolean
    1. gt(precision_timestamp<P>, precision_timestamp<P>): -> boolean
    2. gt(timestamp_tz, timestamp_tz): -> boolean
    3. gt(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean
    4. gt(date, date): -> boolean
    5. gt(interval_day, interval_day): -> boolean
    6. gt(interval_year, interval_year): -> boolean

    Greater than.

    assume_timezone

    Implementations:
    assume_timezone(x, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.

    0. assume_timezone(timestamp, string): -> timestamp_tz
    1. assume_timezone(precision_timestamp<P>, string): -> precision_timestamp_tz<P>
    2. assume_timezone(date, string): -> timestamp_tz

    Convert local timestamp to UTC-relative timestamp_tz using given local time’s timezone. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    local_timestamp

    Implementations:
    local_timestamp(x, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.

    0. local_timestamp(timestamp_tz, string): -> timestamp
    1. local_timestamp(precision_timestamp_tz<P>, string): -> precision_timestamp<P>

    Convert UTC-relative timestamp_tz to local timestamp using given local time’s timezone. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
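
    The relationship between assume_timezone and local_timestamp can be sketched with Python's zoneinfo module. In this illustration a naive datetime plays the role of timestamp and a UTC-aware datetime plays the role of timestamp_tz; both mappings, and the helper names, are assumptions of the sketch, which also needs IANA timezone data to be installed.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def assume_timezone(x: datetime, tz_name: str) -> datetime:
    """Treat naive x as wall-clock time in tz_name and return the UTC instant."""
    return x.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)


def local_timestamp(x: datetime, tz_name: str) -> datetime:
    """Render UTC-relative x as naive wall-clock time in tz_name."""
    return x.astimezone(ZoneInfo(tz_name)).replace(tzinfo=None)


local = datetime(2024, 1, 15, 12, 0)
instant = assume_timezone(local, "Etc/GMT+1")  # Etc/GMT+1 is one hour behind UTC
print(instant)
print(local_timestamp(instant, "Etc/GMT+1"))
```

    Applying local_timestamp to the result of assume_timezone with the same timezone returns the original wall-clock value in this example. An unknown timezone name makes ZoneInfo raise an error, which corresponds to the invalid-timezone rule above.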

    strptime_time

    Implementations:
    strptime_time(time_string, format): -> return_type
    0. strptime_time(string, string): -> time

    Parse string into time using provided format, see https://man7.org/linux/man-pages/man3/strptime.3.html for reference.

    strptime_date

    Implementations:
    strptime_date(date_string, format): -> return_type
    0. strptime_date(string, string): -> date

    Parse string into date using provided format, see https://man7.org/linux/man-pages/man3/strptime.3.html for reference.
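
    For a feel of how format-driven parsing behaves, the following sketch uses Python's datetime.strptime, whose directives largely follow the C strptime conventions referenced above but are not guaranteed to match them in every detail. The wrapper functions are hypothetical stand-ins for strptime_date and strptime_time.

```python
from datetime import date, datetime, time


def strptime_date(date_string: str, fmt: str) -> date:
    """Parse date_string with fmt and keep only the date part."""
    return datetime.strptime(date_string, fmt).date()


def strptime_time(time_string: str, fmt: str) -> time:
    """Parse time_string with fmt and keep only the time-of-day part."""
    return datetime.strptime(time_string, fmt).time()


print(strptime_date("2024-02-01", "%Y-%m-%d"))
print(strptime_time("13:45:10", "%H:%M:%S"))
```

    In Python, a string that does not match the format raises ValueError.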

    strptime_timestamp

    Implementations:
    strptime_timestamp(timestamp_string, format, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.

    0. strptime_timestamp(string, string, string): -> timestamp_tz
    1. strptime_timestamp(string, string): -> timestamp_tz

    Parse string into timestamp using provided format, see https://man7.org/linux/man-pages/man3/strptime.3.html for reference. If timezone is present in timestamp and provided as parameter an error is thrown. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is supplied as parameter and present in the parsed string the parsed timezone is used. If parameter supplied timezone is invalid an error is thrown.

    strftime

    Implementations:
    strftime(x, format): -> return_type
    0. strftime(timestamp, string): -> string
    1. strftime(precision_timestamp<P>, string): -> string
    2. strftime(timestamp_tz, string, string): -> string
    3. strftime(precision_timestamp_tz<P>, string, string): -> string
    4. strftime(date, string): -> string
    5. strftime(time, string): -> string

    Convert timestamp/date/time to string using provided format, see https://man7.org/linux/man-pages/man3/strftime.3.html for reference. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    round_temporal

    Implementations:
    round_temporal(x, rounding, unit, multiple, origin): -> return_type
    0. round_temporal(timestamp, rounding, unit, i64, timestamp): -> timestamp
    1. round_temporal(precision_timestamp<P>, rounding, unit, i64, precision_timestamp<P>): -> precision_timestamp<P>
    2. round_temporal(timestamp_tz, rounding, unit, i64, string, timestamp_tz): -> timestamp_tz
    3. round_temporal(precision_timestamp_tz<P>, rounding, unit, i64, string, precision_timestamp_tz<P>): -> precision_timestamp_tz<P>
    4. round_temporal(date, rounding, unit, i64, date): -> date
    5. round_temporal(time, rounding, unit, i64, time): -> time

    Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple from the origin in the given timezone, the resulting point is chosen as one of the two nearest multiples. Which of these is chosen is governed by rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    Options:
  • rounding ['FLOOR', 'CEIL', 'ROUND_TIE_DOWN', 'ROUND_TIE_UP']
  • unit ['YEAR', 'MONTH', 'WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • rounding ['YEAR', 'MONTH', 'WEEK', 'DAY']
  • unit ['HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
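
    The rounding rules of round_temporal can be sketched in Python for naive timestamps and fixed-length units. This is an illustration under stated assumptions: calendar units of variable length (YEAR, MONTH) and timezone handling are left out, and the function is a hypothetical stand-in rather than a reference implementation.

```python
from datetime import datetime, timedelta

# Fixed-length units only; YEAR and MONTH are not covered by this sketch.
UNIT = {
    "WEEK": timedelta(weeks=1),
    "DAY": timedelta(days=1),
    "HOUR": timedelta(hours=1),
    "MINUTE": timedelta(minutes=1),
    "SECOND": timedelta(seconds=1),
    "MILLISECOND": timedelta(milliseconds=1),
    "MICROSECOND": timedelta(microseconds=1),
}


def round_temporal(x: datetime, rounding: str, unit: str,
                   multiple: int, origin: datetime) -> datetime:
    step = UNIT[unit] * multiple
    # timedelta // timedelta floors, so this also works when x < origin.
    lower = origin + ((x - origin) // step) * step
    if lower == x:
        return x  # already an exact multiple from the origin
    upper = lower + step
    if rounding == "FLOOR":
        return lower
    if rounding == "CEIL":
        return upper
    if rounding in ("ROUND_TIE_DOWN", "ROUND_TIE_UP"):
        below, above = x - lower, upper - x
        if below != above:
            return lower if below < above else upper
        return lower if rounding == "ROUND_TIE_DOWN" else upper
    raise ValueError(f"unknown rounding: {rounding}")


x = datetime(2024, 1, 1, 10, 37, 30)
origin = datetime(2024, 1, 1)
for mode in ("FLOOR", "CEIL", "ROUND_TIE_DOWN", "ROUND_TIE_UP"):
    print(mode, round_temporal(x, mode, "MINUTE", 15, origin))
```

    The example value sits exactly halfway between 10:30 and 10:45, so the two ROUND_TIE modes pick different neighbours while FLOOR and CEIL ignore the distance entirely.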

    round_calendar

    Implementations:
    round_calendar(x, rounding, unit, origin, multiple): -> return_type
    0. round_calendar(timestamp, rounding, unit, origin, i64): -> timestamp
    1. round_calendar(precision_timestamp<P>, rounding, unit, origin, i64): -> precision_timestamp<P>
    2. round_calendar(timestamp_tz, rounding, unit, origin, i64, string): -> timestamp_tz
    3. round_calendar(precision_timestamp_tz<P>, rounding, unit, origin, i64, string): -> precision_timestamp_tz<P>
    4. round_calendar(date, rounding, unit, origin, i64, date): -> date
    5. round_calendar(time, rounding, unit, origin, i64, time): -> time

    Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple from the last origin unit in the given timezone, the resulting point is chosen as one of the two nearest multiples. Which of these is chosen is governed by rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    Options:
  • rounding ['FLOOR', 'CEIL', 'ROUND_TIE_DOWN', 'ROUND_TIE_UP']
  • unit ['YEAR', 'MONTH', 'WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • origin ['YEAR', 'MONTH', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND']
  • rounding ['YEAR', 'MONTH', 'WEEK', 'DAY']
  • unit ['YEAR', 'MONTH', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK', 'DAY']
  • origin ['DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • rounding ['DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND']

    Aggregate Functions

    min

    Implementations:
    min(x): -> return_type
    0. min(date): -> date?
    1. min(time): -> time?
    2. min(timestamp): -> timestamp?
    3. min(precision_timestamp<P>): -> precision_timestamp?<P>
    4. min(timestamp_tz): -> timestamp_tz?
    5. min(precision_timestamp_tz<P>): -> precision_timestamp_tz?<P>
    6. min(interval_day): -> interval_day?
    7. min(interval_year): -> interval_year?

    Return the minimum of a set of values.

    max

    Implementations:
    max(x): -> return_type
    0. max(date): -> date?
    1. max(time): -> time?
    2. max(timestamp): -> timestamp?
    3. max(timestamp_tz): -> timestamp_tz?
    4. max(precision_timestamp_tz<P>): -> precision_timestamp_tz?<P>
    5. max(interval_day): -> interval_day?
    6. max(interval_year): -> interval_year?

    Return the maximum of a set of values.

    GitHub

    functions_datetime.yaml

    This document file is generated for functions_datetime.yaml

    Scalar Functions

    extract

    Implementations:
    extract(component, x, timezone): -> return_type

  • x: Timezone string from IANA tzdb.
  • 0. extract(component, timestamp_tz, string): -> i64
    1. extract(component, precision_timestamp_tz<P>, string): -> i64
    2. extract(component, timestamp): -> i64
    3. extract(component, precision_timestamp<P>): -> i64
    4. extract(component, date): -> i64
    5. extract(component, time): -> i64
    6. extract(component, indexing, timestamp_tz, string): -> i64
    7. extract(component, indexing, precision_timestamp_tz<P>, string): -> i64
    8. extract(component, indexing, timestamp): -> i64
    9. extract(component, indexing, precision_timestamp<P>): -> i64
    10. extract(component, indexing, date): -> i64

    Extract portion of a date/time value. * YEAR Return the year. * ISO_YEAR Return the ISO 8601 week-numbering year. First week of an ISO year has the majority (4 or more) of its days in January. * US_YEAR Return the US epidemiological year. First week of US epidemiological year has the majority (4 or more) of its days in January. Last week of US epidemiological year has the year’s last Wednesday in it. US epidemiological week starts on Sunday. * QUARTER Return the number of the quarter within the year. January 1 through March 31 map to the first quarter, April 1 through June 30 map to the second quarter, etc. * MONTH Return the number of the month within the year. * DAY Return the number of the day within the month. * DAY_OF_YEAR Return the number of the day within the year. January 1 maps to the first day, February 1 maps to the thirty-second day, etc. * MONDAY_DAY_OF_WEEK Return the number of the day within the week, from Monday (first day) to Sunday (seventh day). * SUNDAY_DAY_OF_WEEK Return the number of the day within the week, from Sunday (first day) to Saturday (seventh day). * MONDAY_WEEK Return the number of the week within the year. First week starts on first Monday of January. * SUNDAY_WEEK Return the number of the week within the year. First week starts on first Sunday of January. * ISO_WEEK Return the number of the ISO week within the ISO year. First ISO week has the majority (4 or more) of its days in January. ISO week starts on Monday. * US_WEEK Return the number of the US week within the US year. First US week has the majority (4 or more) of its days in January. US week starts on Sunday. * HOUR Return the hour (0-23). * MINUTE Return the minute (0-59). * SECOND Return the second (0-59). * MILLISECOND Return number of milliseconds since the last full second. * MICROSECOND Return number of microseconds since the last full millisecond. * NANOSECOND Return number of nanoseconds since the last full microsecond. 
* SUBSECOND Return number of microseconds since the last full second of the given timestamp. * UNIX_TIME Return number of seconds that have elapsed since 1970-01-01 00:00:00 UTC, ignoring leap seconds. * TIMEZONE_OFFSET Return number of seconds of timezone offset to UTC. The range of values returned for QUARTER, MONTH, DAY, DAY_OF_YEAR, MONDAY_DAY_OF_WEEK, SUNDAY_DAY_OF_WEEK, MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK depends on whether counting starts at 1 or 0. This is governed by the indexing option. When indexing is ONE: * QUARTER returns values in range 1-4 * MONTH returns values in range 1-12 * DAY returns values in range 1-31 * DAY_OF_YEAR returns values in range 1-366 * MONDAY_DAY_OF_WEEK and SUNDAY_DAY_OF_WEEK return values in range 1-7 * MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK return values in range 1-53 When indexing is ZERO: * QUARTER returns values in range 0-3 * MONTH returns values in range 0-11 * DAY returns values in range 0-30 * DAY_OF_YEAR returns values in range 0-365 * MONDAY_DAY_OF_WEEK and SUNDAY_DAY_OF_WEEK return values in range 0-6 * MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK return values in range 0-52 The indexing option must be specified when the component is QUARTER, MONTH, DAY, DAY_OF_YEAR, MONDAY_DAY_OF_WEEK, SUNDAY_DAY_OF_WEEK, MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, or US_WEEK. The indexing option cannot be specified when the component is YEAR, ISO_YEAR, US_YEAR, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, SUBSECOND, UNIX_TIME, or TIMEZONE_OFFSET. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    Options:
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND', 'UNIX_TIME', 'TIMEZONE_OFFSET']
  • indexing ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'NANOSECOND', 'SUBSECOND', 'UNIX_TIME', 'TIMEZONE_OFFSET']
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND', 'UNIX_TIME']
  • indexing ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'NANOSECOND', 'SUBSECOND', 'UNIX_TIME']
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'UNIX_TIME']
  • indexing ['HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND']
  • component ['QUARTER', 'MONTH', 'DAY', 'DAY_OF_YEAR', 'MONDAY_DAY_OF_WEEK', 'SUNDAY_DAY_OF_WEEK', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK']
  • indexing ['ONE', 'ZERO']
  • extract_boolean

    Implementations:
    extract_boolean(component, x): -> return_type
    0. extract_boolean(component, timestamp): -> boolean
    1. extract_boolean(component, precision_timestamp<P>): -> boolean
    2. extract_boolean(component, timestamp_tz, string): -> boolean
    3. extract_boolean(component, precision_timestamp_tz<P>, string): -> boolean
    4. extract_boolean(component, date): -> boolean

    *Extract boolean values of a date/time value. * IS_LEAP_YEAR Return true if year of the given value is a leap year and false otherwise. * IS_DST Return true if DST (Daylight Savings Time) is observed at the given value in the given timezone.

    Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.*

    Options:
  • component ['IS_LEAP_YEAR']
  • component ['IS_LEAP_YEAR', 'IS_DST']
  • add

    Implementations:
    add(x, y): -> return_type
    0. add(timestamp, interval_year): -> timestamp
    1. add(precision_timestamp<P>, interval_year): -> precision_timestamp<P>
    2. add(timestamp_tz, interval_year, string): -> timestamp_tz
    3. add(precision_timestamp_tz<P>, interval_year, string): -> precision_timestamp_tz<P>
    4. add(date, interval_year): -> timestamp
    5. add(timestamp, interval_day): -> timestamp
    6. add(precision_timestamp<P>, interval_day): -> precision_timestamp<P>
    7. add(timestamp_tz, interval_day): -> timestamp_tz
    8. add(precision_timestamp_tz<P>, interval_day): -> precision_timestamp_tz<P>
    9. add(date, interval_day): -> timestamp

    Add an interval to a date/time type. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    multiply

    Implementations:
    multiply(x, y): -> return_type
    0. multiply(i8, interval_day): -> interval_day
    1. multiply(i16, interval_day): -> interval_day
    2. multiply(i32, interval_day): -> interval_day
    3. multiply(i64, interval_day): -> interval_day
    4. multiply(i8, interval_year): -> interval_year
    5. multiply(i16, interval_year): -> interval_year
    6. multiply(i32, interval_year): -> interval_year
    7. multiply(i64, interval_year): -> interval_year

    Multiply an interval by an integral number.

    add_intervals

    Implementations:
    add_intervals(x, y): -> return_type
    0. add_intervals(interval_day, interval_day): -> interval_day
    1. add_intervals(interval_year, interval_year): -> interval_year

    Add two intervals together.

    subtract

    Implementations:
    subtract(x, y): -> return_type
    0. subtract(timestamp, interval_year): -> timestamp
    1. subtract(precision_timestamp<P>, interval_year): -> precision_timestamp<P>
    2. subtract(timestamp_tz, interval_year): -> timestamp_tz
    3. subtract(precision_timestamp_tz<P>, interval_year): -> precision_timestamp_tz<P>
    4. subtract(timestamp_tz, interval_year, string): -> timestamp_tz
    5. subtract(precision_timestamp_tz<P>, interval_year, string): -> precision_timestamp_tz<P>
    6. subtract(date, interval_year): -> date
    7. subtract(timestamp, interval_day): -> timestamp
    8. subtract(precision_timestamp<P>, interval_day): -> precision_timestamp<P>
    9. subtract(timestamp_tz, interval_day): -> timestamp_tz
    10. subtract(precision_timestamp_tz<P>, interval_day): -> precision_timestamp_tz<P>
    11. subtract(date, interval_day): -> date

    Subtract an interval from a date/time type. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    lte

    Implementations:
    lte(x, y): -> return_type
    0. lte(timestamp, timestamp): -> boolean
    1. lte(precision_timestamp<P>, precision_timestamp<P>): -> boolean
    2. lte(timestamp_tz, timestamp_tz): -> boolean
    3. lte(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean
    4. lte(date, date): -> boolean
    5. lte(interval_day, interval_day): -> boolean
    6. lte(interval_year, interval_year): -> boolean

    less than or equal to

    lt

    Implementations:
    lt(x, y): -> return_type
    0. lt(timestamp, timestamp): -> boolean
    1. lt(precision_timestamp<P>, precision_timestamp<P>): -> boolean
    2. lt(timestamp_tz, timestamp_tz): -> boolean
    3. lt(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean
    4. lt(date, date): -> boolean
    5. lt(interval_day, interval_day): -> boolean
    6. lt(interval_year, interval_year): -> boolean

    less than

    gte

    Implementations:
    gte(x, y): -> return_type
    0. gte(timestamp, timestamp): -> boolean
    1. gte(precision_timestamp<P>, precision_timestamp<P>): -> boolean
    2. gte(timestamp_tz, timestamp_tz): -> boolean
    3. gte(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean
    4. gte(date, date): -> boolean
    5. gte(interval_day, interval_day): -> boolean
    6. gte(interval_year, interval_year): -> boolean

    greater than or equal to

    gt

    Implementations:
    gt(x, y): -> return_type
    0. gt(timestamp, timestamp): -> boolean
    1. gt(precision_timestamp<P>, precision_timestamp<P>): -> boolean
    2. gt(timestamp_tz, timestamp_tz): -> boolean
    3. gt(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean
    4. gt(date, date): -> boolean
    5. gt(interval_day, interval_day): -> boolean
    6. gt(interval_year, interval_year): -> boolean

    greater than

    assume_timezone

    Implementations:
    assume_timezone(x, timezone): -> return_type

  • x: Timezone string from IANA tzdb.
  • 0. assume_timezone(timestamp, string): -> timestamp_tz
    1. assume_timezone(precision_timestamp<P>, string): -> precision_timestamp_tz<P>
    2. assume_timezone(date, string): -> timestamp_tz

    Convert local timestamp to UTC-relative timestamp_tz using given local time’s timezone. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    local_timestamp

    Implementations:
    local_timestamp(x, timezone): -> return_type

  • x: Timezone string from IANA tzdb.
  • 0. local_timestamp(timestamp_tz, string): -> timestamp
    1. local_timestamp(precision_timestamp_tz<P>, string): -> precision_timestamp<P>

    Convert UTC-relative timestamp_tz to local timestamp using given local time’s timezone. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    strptime_time

    Implementations:
    strptime_time(time_string, format): -> return_type
    0. strptime_time(string, string): -> time

    Parse string into time using provided format, see https://man7.org/linux/man-pages/man3/strptime.3.html for reference.

    strptime_date

    Implementations:
    strptime_date(date_string, format): -> return_type
    0. strptime_date(string, string): -> date

    Parse string into date using provided format, see https://man7.org/linux/man-pages/man3/strptime.3.html for reference.
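As a rough illustration of the format-driven parsing described for strptime_time and strptime_date, the sketch below uses Python's `datetime.strptime`, whose directives largely follow the C strptime conventions referenced above. The Python helpers borrow the Substrait function names only for readability; they are not part of Substrait, and directive support may differ between implementations.

```python
from datetime import date, datetime, time


def strptime_date(date_string: str, fmt: str) -> date:
    """Parse a string into a date using a strptime-style format."""
    # datetime.strptime raises ValueError when the string does not match.
    return datetime.strptime(date_string, fmt).date()


def strptime_time(time_string: str, fmt: str) -> time:
    """Parse a string into a time of day using a strptime-style format."""
    return datetime.strptime(time_string, fmt).time()


parsed_date = strptime_date("2024-03-05", "%Y-%m-%d")
parsed_time = strptime_time("13:45:10", "%H:%M:%S")
```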

    strptime_timestamp

    Implementations:
    strptime_timestamp(timestamp_string, format, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.

    0. strptime_timestamp(string, string, string): -> timestamp_tz
    1. strptime_timestamp(string, string): -> timestamp_tz

    Parse string into timestamp using provided format, see https://man7.org/linux/man-pages/man3/strptime.3.html for reference. If timezone is present in timestamp and provided as parameter an error is thrown. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is supplied as parameter and present in the parsed string the parsed timezone is used. If parameter supplied timezone is invalid an error is thrown.

    strftime

    Implementations:
    strftime(x, format): -> return_type
    0. strftime(timestamp, string): -> string
    1. strftime(precision_timestamp<P>, string): -> string
    2. strftime(timestamp_tz, string, string): -> string
    3. strftime(precision_timestamp_tz<P>, string, string): -> string
    4. strftime(date, string): -> string
    5. strftime(time, string): -> string

    Convert timestamp/date/time to string using provided format, see https://man7.org/linux/man-pages/man3/strftime.3.html for reference. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    round_temporal

    Implementations:
    round_temporal(x, rounding, unit, multiple, origin): -> return_type
    0. round_temporal(timestamp, rounding, unit, i64, timestamp): -> timestamp
    1. round_temporal(precision_timestamp<P>, rounding, unit, i64, precision_timestamp<P>): -> precision_timestamp<P>
    2. round_temporal(timestamp_tz, rounding, unit, i64, string, timestamp_tz): -> timestamp_tz
    3. round_temporal(precision_timestamp_tz<P>, rounding, unit, i64, string, precision_timestamp_tz<P>): -> precision_timestamp_tz<P>
    4. round_temporal(date, rounding, unit, i64, date): -> date
    5. round_temporal(time, rounding, unit, i64, time): -> time

    Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple from the origin in the given timezone, the resulting point is chosen as one of the two nearest multiples. Which of these is chosen is governed by rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    Options:
  • rounding ['FLOOR', 'CEIL', 'ROUND_TIE_DOWN', 'ROUND_TIE_UP']
  • unit ['YEAR', 'MONTH', 'WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • unit ['YEAR', 'MONTH', 'WEEK', 'DAY'] (date overload)
  • unit ['HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND'] (time overload)
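The rounding rule described for round_temporal can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference behaviour: it handles only fixed-length units (YEAR and MONTH are omitted because their length varies), it ignores timezones, and it rejects a non-positive `multiple`, which the text above does not address. The `UNIT_LENGTH` table is a helper invented for this sketch.

```python
from datetime import datetime, timedelta

# Fixed-length units only (assumption of this sketch); YEAR and MONTH are omitted.
UNIT_LENGTH = {
    "WEEK": timedelta(weeks=1),
    "DAY": timedelta(days=1),
    "HOUR": timedelta(hours=1),
    "MINUTE": timedelta(minutes=1),
    "SECOND": timedelta(seconds=1),
    "MILLISECOND": timedelta(milliseconds=1),
    "MICROSECOND": timedelta(microseconds=1),
}

ROUNDINGS = ("FLOOR", "CEIL", "ROUND_TIE_DOWN", "ROUND_TIE_UP")


def round_temporal(x: datetime, rounding: str, unit: str,
                   multiple: int, origin: datetime) -> datetime:
    if rounding not in ROUNDINGS:
        raise ValueError(f"unknown rounding: {rounding}")
    if multiple <= 0:
        raise ValueError("multiple must be positive in this sketch")
    step = UNIT_LENGTH[unit] * multiple
    # divmod on timedeltas floors, so the remainder is never negative.
    count, remainder = divmod(x - origin, step)
    earlier = origin + count * step
    if remainder == timedelta(0):
        return earlier  # already an exact multiple from the origin
    later = earlier + step
    if rounding == "FLOOR":
        return earlier
    if rounding == "CEIL":
        return later
    if 2 * remainder < step:
        return earlier
    if 2 * remainder > step:
        return later
    # Equidistant: the tie-breaking mode decides.
    return earlier if rounding == "ROUND_TIE_DOWN" else later


origin = datetime(2024, 1, 1)
tie = datetime(2024, 1, 1, 10, 7, 30)  # halfway between 10:00 and 10:15
```

With a 15-minute step, `tie` sits exactly between two multiples, so only the tie-breaking mode distinguishes ROUND_TIE_DOWN from ROUND_TIE_UP.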

    round_calendar

    Implementations:
    round_calendar(x, rounding, unit, origin, multiple): -> return_type
    0. round_calendar(timestamp, rounding, unit, origin, i64): -> timestamp
    1. round_calendar(precision_timestamp<P>, rounding, unit, origin, i64): -> precision_timestamp<P>
    2. round_calendar(timestamp_tz, rounding, unit, origin, i64, string): -> timestamp_tz
    3. round_calendar(precision_timestamp_tz<P>, rounding, unit, origin, i64, string): -> precision_timestamp_tz<P>
    4. round_calendar(date, rounding, unit, origin, i64, date): -> date
    5. round_calendar(time, rounding, unit, origin, i64, time): -> time

    Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple from the last origin unit in the given timezone, the resulting point is chosen as one of the two nearest multiples. Which of these is chosen is governed by rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    Options:
  • rounding ['FLOOR', 'CEIL', 'ROUND_TIE_DOWN', 'ROUND_TIE_UP']
  • unit ['YEAR', 'MONTH', 'WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • origin ['YEAR', 'MONTH', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND']
  • unit ['YEAR', 'MONTH', 'WEEK', 'DAY'] (date overload)
  • origin ['YEAR', 'MONTH', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK', 'DAY'] (date overload)
  • unit ['DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND'] (time overload)
  • origin ['DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND'] (time overload)

    Aggregate Functions

    min

    Implementations:
    min(x): -> return_type
    0. min(date): -> date?
    1. min(time): -> time?
    2. min(timestamp): -> timestamp?
    3. min(precision_timestamp<P>): -> precision_timestamp?<P>
    4. min(timestamp_tz): -> timestamp_tz?
    5. min(precision_timestamp_tz<P>): -> precision_timestamp_tz?<P>
    6. min(interval_day): -> interval_day?
    7. min(interval_year): -> interval_year?

    Minimum of a set of values.

    max

    Implementations:
    max(x): -> return_type
    0. max(date): -> date?
    1. max(time): -> time?
    2. max(timestamp): -> timestamp?
    3. max(timestamp_tz): -> timestamp_tz?
    4. max(precision_timestamp_tz<P>): -> precision_timestamp_tz?<P>
    5. max(interval_day): -> interval_day?
    6. max(interval_year): -> interval_year?

    Maximum of a set of values.


    functions_geometry.yaml

    This document file is generated for functions_geometry.yaml

    Data Types

    name: geometry
    structure: BINARY

    Scalar Functions

    point

    Implementations:
    point(x, y): -> return_type
    0. point(fp64, fp64): -> u!geometry

    Returns a 2D point with the given x and y coordinate values.

    make_line

    Implementations:
    make_line(geom1, geom2): -> return_type
    0. make_line(u!geometry, u!geometry): -> u!geometry

    Returns a linestring connecting the end point of geometry geom1 to the start point of geometry geom2. Repeated points at the beginning of input geometries are collapsed to a single point. A linestring can be closed or simple. A closed linestring starts and ends on the same point. A simple linestring does not cross or touch itself.

    x_coordinate

    Implementations:
    x_coordinate(point): -> return_type
    0. x_coordinate(u!geometry): -> fp64

    Return the x coordinate of the point. Return null if not available.

    y_coordinate

    Implementations:
    y_coordinate(point): -> return_type
    0. y_coordinate(u!geometry): -> fp64

    Return the y coordinate of the point. Return null if not available.

    num_points

    Implementations:
    num_points(geom): -> return_type
    0. num_points(u!geometry): -> i64

    Return the number of points in the geometry. The geometry should be a linestring or circularstring.

    is_empty

    Implementations:
    is_empty(geom): -> return_type
    0. is_empty(u!geometry): -> boolean

    Return true if the geometry is an empty geometry.

    is_closed

    Implementations:
    is_closed(geom): -> return_type
    0. is_closed(u!geometry): -> boolean

    Return true if the geometry’s start and end points are the same.

    is_simple

    Implementations:
    is_simple(geom): -> return_type
    0. is_simple(u!geometry): -> boolean

    Return true if the geometry does not self-intersect.

    is_ring

    Implementations:
    is_ring(geom): -> return_type
    0. is_ring(u!geometry): -> boolean

    Return true if the geometry’s start and end points are the same and it does not self-intersect.

    geometry_type

    Implementations:
    geometry_type(geom): -> return_type
    0. geometry_type(u!geometry): -> string

    Return the type of geometry as a string.

    envelope

    Implementations:
    envelope(geom): -> return_type
    0. envelope(u!geometry): -> u!geometry

    Return the minimum bounding box for the input geometry as a geometry. The returned geometry is defined by the corner points of the bounding box. If the input geometry is a point or a line, the returned geometry can also be a point or line.

    dimension

    Implementations:
    dimension(geom): -> return_type
    0. dimension(u!geometry): -> i8

    Return the dimension of the input geometry. If the input is a collection of geometries, return the largest dimension from the collection. Dimensionality is determined by the complexity of the input and not the coordinate system being used. Type dimensions: POINT - 0, LINE - 1, POLYGON - 2.

    is_valid

    Implementations:
    is_valid(geom): -> return_type
    0. is_valid(u!geometry): -> boolean

    Return true if the input geometry is a valid 2D geometry. For 3-dimensional and 4-dimensional geometries, the validity is still only tested in 2 dimensions.

    collection_extract

    Implementations:
    collection_extract(geom_collection): -> return_type
    0. collection_extract(u!geometry): -> u!geometry
    1. collection_extract(u!geometry, i8): -> u!geometry

    Given the input geometry collection, return a homogeneous multi-geometry. All geometries in the multi-geometry will have the same dimension. If type is not specified, the multi-geometry will only contain geometries of the highest dimension. If type is specified, the multi-geometry will only contain geometries of that type. If there are no geometries of the specified type, an empty geometry is returned. Only points, linestrings, and polygons are supported. Type numbers: POINT - 0, LINE - 1, POLYGON - 2.

    flip_coordinates

    Implementations:
    flip_coordinates(geom_collection): -> return_type
    0. flip_coordinates(u!geometry): -> u!geometry

    Return a version of the input geometry with the X and Y axes flipped. This operation can be performed on geometries with more than 2 dimensions. However, only the X and Y axes will be flipped.

    remove_repeated_points

    Implementations:
    remove_repeated_points(geom): -> return_type
    0. remove_repeated_points(u!geometry): -> u!geometry
    1. remove_repeated_points(u!geometry, fp64): -> u!geometry

    Return a version of the input geometry with duplicate consecutive points removed. If the tolerance argument is provided, consecutive points within the tolerance distance of one another are considered to be duplicates.
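The de-duplication rule for remove_repeated_points can be sketched on a plain list of coordinate tuples. This is only an illustration: the geometry type here is an opaque binary value, so representing it as a Python list is an assumption, as are comparing each point against the last point that was kept and treating a distance equal to the tolerance as a duplicate.

```python
import math


def remove_repeated_points(points, tolerance=0.0):
    """Drop consecutive points lying within `tolerance` of the last kept point."""
    kept = []
    for point in points:
        # With tolerance 0.0 only exactly repeated points are dropped.
        if kept and math.dist(kept[-1], point) <= tolerance:
            continue
        kept.append(point)
    return kept


exact = remove_repeated_points([(0, 0), (0, 0), (1, 1), (1, 1), (0, 0)])
fuzzy = remove_repeated_points([(0, 0), (0.1, 0), (1, 0), (1.2, 0)], tolerance=0.5)
```

Note that the final `(0, 0)` in the first call is kept, since only consecutive repeats are removed.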

    buffer

    Implementations:
    buffer(geom, buffer_radius): -> return_type
    0. buffer(u!geometry, fp64): -> u!geometry

    Compute and return an expanded version of the input geometry. All the points of the returned geometry are at a distance of buffer_radius away from the points of the input geometry. If a negative buffer_radius is provided, the geometry will shrink instead of expand. A negative buffer_radius may shrink the geometry completely, in which case an empty geometry is returned. For input geometries that are points or lines, a negative buffer_radius will always return an empty geometry.

    centroid

    Implementations:
    centroid(geom): -> return_type
    0. centroid(u!geometry): -> u!geometry

    Return a point which is the geometric center of mass of the input geometry.

    minimum_bounding_circle

    Implementations:
    minimum_bounding_circle(geom): -> return_type
    0. minimum_bounding_circle(u!geometry): -> u!geometry

    Return the smallest circle polygon that contains the input geometry.


    functions_logarithmic.yaml

    This document file is generated for functions_logarithmic.yaml

    Scalar Functions

    ln

    Implementations:
    ln(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type
    0. ln(i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64
    1. ln(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32
    2. ln(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64
    3. ln(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Natural logarithm of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']

    log10

    Implementations:
    log10(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type
    0. log10(i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64
    1. log10(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32
    2. log10(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64
    3. log10(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Logarithm to base 10 of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']

    log2

    Implementations:
    log2(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type
    0. log2(i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64
    1. log2(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32
    2. log2(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64
    3. log2(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Logarithm to base 2 of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']

    logb

    Implementations:
    logb(x, base, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type

  • x: The number `x` to compute the logarithm of
  • base: The logarithm base `b` to use

    0. logb(i64, i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64
    1. logb(fp32, fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32
    2. logb(fp64, fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64
    3. logb(decimal<P1,S1>, decimal<P1,S1>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Logarithm of the value with the given base: logb(x, b) => log_{b}(x)

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']

    log1p

    Implementations:
    log1p(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type
    0. log1p(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32
    1. log1p(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64
    2. log1p(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Natural logarithm (base e) of 1 + x: log1p(x) => log(1 + x)

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']
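The on_domain_error and on_log_zero options shared by the logarithm functions above can be illustrated with a small Python sketch of ln. The option meanings are inferred from their names, which is an assumption: the text lists the allowed values without describing them. Mapping NULL to `None` and ERROR to an `ArithmeticError` are also choices made only for this sketch, and the rounding option is not modelled.

```python
import math


def ln(x: float, on_domain_error: str, on_log_zero: str):
    """Natural logarithm with explicit handling of zero and negative inputs."""
    if x == 0:
        if on_log_zero == "MINUS_INFINITY":
            return -math.inf
        if on_log_zero == "NAN":
            return math.nan
        if on_log_zero == "ERROR":
            raise ArithmeticError("logarithm of zero")
        raise ValueError(f"unknown on_log_zero: {on_log_zero}")
    if x < 0:
        if on_domain_error == "NAN":
            return math.nan
        if on_domain_error == "NULL":
            return None  # stands in for a SQL-style null
        if on_domain_error == "ERROR":
            raise ArithmeticError("logarithm of a negative value")
        raise ValueError(f"unknown on_domain_error: {on_domain_error}")
    return math.log(x)
```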

    functions_rounding.yaml

    This document file is generated for functions_rounding.yaml

    Scalar Functions

    ceil

    Implementations:
    ceil(x): -> return_type
    0. ceil(fp32): -> fp32
    1. ceil(fp64): -> fp64

    Rounding to the ceiling of the value x.

    floor

    Implementations:
    floor(x): -> return_type
    0. floor(fp32): -> fp32
    1. floor(fp64): -> fp64

    Rounding to the floor of the value x.

    round

    Implementations:
    round(x, s, option:rounding): -> return_type

  • x: Numerical expression to be rounded.
  • s: Number of decimal places to be rounded to. When `x` is an integer value and `s` is a positive number, rounding has no effect. When `s` is a negative number, the rounding is performed to the nearest multiple of `10^(-s)`.

    0. round(i8, i32, option:rounding): -> i8?
    1. round(i16, i32, option:rounding): -> i16?
    2. round(i32, i32, option:rounding): -> i32?
    3. round(i64, i32, option:rounding): -> i64?
    4. round(fp32, i32, option:rounding): -> fp32?
    5. round(fp64, i32, option:rounding): -> fp64?

    Rounding the value x to s decimal places.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR', 'AWAY_FROM_ZERO', 'TIE_DOWN', 'TIE_UP', 'TIE_TOWARDS_ZERO', 'TIE_TO_ODD']
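For the integer overloads of round, the effect of a negative `s` and of the rounding option can be sketched with Python's `decimal` module. The mapping from option names to `decimal` rounding modes is this sketch's interpretation of the names, not something stated above, and TIE_DOWN, TIE_UP and TIE_TO_ODD are left out because `decimal` has no direct equivalent. The name `round_integer` is hypothetical.

```python
from decimal import (
    Decimal,
    ROUND_CEILING,
    ROUND_DOWN,
    ROUND_FLOOR,
    ROUND_HALF_DOWN,
    ROUND_HALF_EVEN,
    ROUND_HALF_UP,
    ROUND_UP,
)

# Assumed correspondence between option names and decimal rounding modes.
ROUNDING_MODES = {
    "TIE_TO_EVEN": ROUND_HALF_EVEN,
    "TIE_AWAY_FROM_ZERO": ROUND_HALF_UP,
    "TIE_TOWARDS_ZERO": ROUND_HALF_DOWN,
    "TRUNCATE": ROUND_DOWN,
    "CEILING": ROUND_CEILING,
    "FLOOR": ROUND_FLOOR,
    "AWAY_FROM_ZERO": ROUND_UP,
}


def round_integer(x: int, s: int, rounding: str) -> int:
    """Round an integer to the nearest multiple of 10^(-s) when s is negative."""
    mode = ROUNDING_MODES[rounding]
    if s >= 0:
        return x  # an integer has no decimal places to round away
    quantum = Decimal(1).scaleb(-s)  # 10^(-s), e.g. 1E+2 for s = -2
    return int(Decimal(x).quantize(quantum, rounding=mode))
```

For example, rounding 1250 with `s = -2` lands on either 1200 or 1300 depending on how the mode resolves the tie.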

    functions_set.yaml

    This document file is generated for functions_set.yaml

    Scalar Functions

    index_in

    Implementations:
    index_in(needle, haystack, option:nan_equality): -> return_type
    0. index_in(any1, list<any1>, option:nan_equality): -> i64?

    Checks the membership of a value in a list of values. Returns the first 0-based index value of some input needle if needle is equal to any element in haystack. Returns NULL if not found. If needle is NULL, returns NULL. If needle is NaN:
    - Returns the 0-based index of NaN in the input (default)
    - Returns NULL (if NAN_IS_NOT_NAN is specified)

    Options:
  • nan_equality ['NAN_IS_NAN', 'NAN_IS_NOT_NAN']
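The lookup rules for index_in, including the NaN cases, can be sketched in Python as below. `None` stands in for NULL, which is a modelling choice of this sketch. Skipping NULL elements of the haystack is also an assumption, since the text only describes a NULL needle.

```python
import math


def _is_nan(value) -> bool:
    return isinstance(value, float) and math.isnan(value)


def index_in(needle, haystack, nan_equality="NAN_IS_NAN"):
    """Return the first 0-based index of needle in haystack, or None."""
    if needle is None:
        return None
    for index, item in enumerate(haystack):
        if item is None:
            continue  # assumption: NULL elements never match
        if _is_nan(needle):
            # NaN only matches NaN, and only when NAN_IS_NAN is in effect.
            if nan_equality == "NAN_IS_NAN" and _is_nan(item):
                return index
        elif item == needle:
            return index
    return None
```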

    functions_string.yaml

    This document file is generated for functions_string.yaml

    Scalar Functions

    concat

    Implementations:
    concat(input, option:null_handling): -> return_type
    0. concat(varchar<L1>, option:null_handling): -> varchar<L1>
    1. concat(string, option:null_handling): -> string

    Concatenate strings. The null_handling option determines whether or not null values will be recognized by the function. If null_handling is set to IGNORE_NULLS, null value arguments will be ignored when strings are concatenated. If set to ACCEPT_NULLS, the result will be null if any argument passed to the concat function is null.

    Options:
  • null_handling ['IGNORE_NULLS', 'ACCEPT_NULLS']
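The two null_handling behaviours described for concat can be shown with a short Python sketch, using `None` for null. The function takes the arguments as a list and requires the option explicitly, since the text does not state a default; both are choices made for this illustration.

```python
def concat(values, null_handling):
    """Concatenate strings, treating None as a null argument."""
    if null_handling == "IGNORE_NULLS":
        # Null arguments are skipped.
        return "".join(v for v in values if v is not None)
    if null_handling == "ACCEPT_NULLS":
        # Any null argument makes the whole result null.
        if any(v is None for v in values):
            return None
        return "".join(values)
    raise ValueError(f"unknown null_handling: {null_handling}")
```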

    like

    Implementations:
    like(input, match, option:case_sensitivity): -> return_type

  • input: The input string.
  • match: The string to match against the input string.

    0. like(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean
    1. like(string, string, option:case_sensitivity): -> boolean

    Test whether the input string is like the match string. The case_sensitivity option applies to the match argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']

    substring

    Implementations:
    substring(input, start, length, option:negative_start): -> return_type
    0. substring(varchar<L1>, i32, i32, option:negative_start): -> varchar<L1>
    1. substring(string, i32, i32, option:negative_start): -> string
    2. substring(fixedchar<l1>, i32, i32, option:negative_start): -> string
    3. substring(varchar<L1>, i32, option:negative_start): -> varchar<L1>
    4. substring(string, i32, option:negative_start): -> string
    5. substring(fixedchar<l1>, i32, option:negative_start): -> string

    Extract a substring of a specified length starting from position start. A start value of 1 refers to the first character of the string. When length is not specified the function will extract a substring starting from position start and ending at the end of the string. The negative_start option applies to the start parameter. WRAP_FROM_END means the index will start from the end of the input and move backwards. The last character has an index of -1, the second to last character has an index of -2, and so on. LEFT_OF_BEGINNING means the returned substring will start from the left of the first character. A start of -1 will begin 2 characters left of the input, while a start of 0 begins 1 character left of the input.

    Options:
  • negative_start ['WRAP_FROM_END', 'LEFT_OF_BEGINNING', 'ERROR']
  • negative_start ['WRAP_FROM_END', 'LEFT_OF_BEGINNING']
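
    The start handling described above can be modeled with a short Python sketch. It is a reading of the text, not a reference implementation. Several choices are assumptions: ERROR is taken to reject any start below 1, a start of 0 under WRAP_FROM_END is rejected because the text does not describe it, and a negative start that wraps past the beginning is rejected as well.

```python
from typing import Optional


def substring(s: str, start: int, length: Optional[int] = None, *,
              negative_start: str) -> str:
    """Model of substring with 1-based positions."""
    if length is not None and length < 0:
        raise ValueError("negative length is not covered by this sketch")
    if start >= 1:
        begin = start - 1                      # 1-based to 0-based
    elif negative_start == "WRAP_FROM_END":
        if start == 0 or len(s) + start < 0:
            raise ValueError("start not described for WRAP_FROM_END")
        begin = len(s) + start                 # -1 is the last character
    elif negative_start == "LEFT_OF_BEGINNING":
        begin = start - 1                      # window begins left of the input
    elif negative_start == "ERROR":
        raise ValueError("start must be at least 1")  # assumed meaning of ERROR
    else:
        raise ValueError(f"unknown negative_start: {negative_start}")
    end = len(s) if length is None else begin + length
    # The part of the window that lies left of the input yields no characters.
    return s[max(begin, 0):max(end, 0)]
```

    With LEFT_OF_BEGINNING, substring("hello", 0, 3, ...) covers positions 0 to 2 and therefore yields only the two characters that fall inside the input.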

    regexp_match_substring

    Implementations:
    regexp_match_substring(input, pattern, position, occurrence, group, option:case_sensitivity, option:multiline, option:dotall): -> return_type
    0. regexp_match_substring(varchar<L1>, varchar<L2>, i64, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> varchar<L1>
    1. regexp_match_substring(string, string, i64, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> string

    Extract a substring that matches the given regular expression pattern. The pattern should follow the International Components for Unicode (ICU) regular expression syntax (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The occurrence argument selects which match to extract: 1 means the first occurrence, 2 the second, and so on. It should be a positive integer. The position argument gives the character position at which to start searching for matches: 1 means the first character of the input string, 2 the second, and so on. It should be a positive integer. The group argument selects the capture group to return: 0 returns the substring matching the full regular expression, 1 returns the substring matching only the first capture group, and so on. It should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option treats the input string as multiple lines, so that ^ and $ match at the beginning and end of any line instead of only at the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters. Behavior is undefined if the regex fails to compile, or if the occurrence, position, or group value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
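
    The interplay of position, occurrence, and group can be illustrated with Python's re module. This is only a sketch: re is a stand-in for the ICU engine and differs from it in places, the default option values are assumptions, both case-insensitive settings are mapped to re.IGNORECASE, and returning None for an out-of-range occurrence is a choice of the sketch (the text leaves that behavior undefined).

```python
import re
from typing import Optional


def regexp_match_substring(s: str, pattern: str, position: int = 1,
                           occurrence: int = 1, group: int = 0, *,
                           case_sensitivity: str = "CASE_SENSITIVE",
                           multiline: str = "MULTILINE_DISABLED",
                           dotall: str = "DOTALL_DISABLED") -> Optional[str]:
    flags = 0
    if case_sensitivity != "CASE_SENSITIVE":
        flags |= re.IGNORECASE
    if multiline == "MULTILINE_ENABLED":
        flags |= re.MULTILINE
    if dotall == "DOTALL_ENABLED":
        flags |= re.DOTALL
    regex = re.compile(pattern, flags)
    # position is 1-based; finditer takes a 0-based starting index.
    for i, m in enumerate(regex.finditer(s, position - 1), start=1):
        if i == occurrence:
            return m.group(group)
    return None
```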

    regexp_match_substring_all

    Implementations:
    regexp_match_substring_all(input, pattern, position, group, option:case_sensitivity, option:multiline, option:dotall): -> return_type
    0. regexp_match_substring_all(varchar<L1>, varchar<L2>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> List<varchar<L1>>
    1. regexp_match_substring_all(string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> List<string>

    Extract all substrings that match the given regular expression pattern. This returns a list of extracted strings with one value for each occurrence of a match. The pattern should follow the International Components for Unicode (ICU) regular expression syntax (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The position argument gives the character position at which to start searching for matches: 1 means the first character of the input string, 2 the second, and so on. It should be a positive integer. The group argument selects the capture group to return: 0 returns substrings matching the full regular expression, 1 returns substrings matching only the first capture group, and so on. It should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option treats the input string as multiple lines, so that ^ and $ match at the beginning and end of any line instead of only at the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters. Behavior is undefined if the regex fails to compile, or if the position or group value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']

    starts_with

    Implementations:
    starts_with(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.

    0. starts_with(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean
    1. starts_with(varchar<L1>, string, option:case_sensitivity): -> boolean
    2. starts_with(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    3. starts_with(string, string, option:case_sensitivity): -> boolean
    4. starts_with(string, varchar<L1>, option:case_sensitivity): -> boolean
    5. starts_with(string, fixedchar<L1>, option:case_sensitivity): -> boolean
    6. starts_with(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    7. starts_with(fixedchar<L1>, string, option:case_sensitivity): -> boolean
    8. starts_with(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether the input string starts with the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']

    ends_with

    Implementations:
    ends_with(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.

    0. ends_with(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean
    1. ends_with(varchar<L1>, string, option:case_sensitivity): -> boolean
    2. ends_with(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    3. ends_with(string, string, option:case_sensitivity): -> boolean
    4. ends_with(string, varchar<L1>, option:case_sensitivity): -> boolean
    5. ends_with(string, fixedchar<L1>, option:case_sensitivity): -> boolean
    6. ends_with(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    7. ends_with(fixedchar<L1>, string, option:case_sensitivity): -> boolean
    8. ends_with(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether the input string ends with the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']

    contains

    Implementations:
    contains(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.

    0. contains(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean
    1. contains(varchar<L1>, string, option:case_sensitivity): -> boolean
    2. contains(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    3. contains(string, string, option:case_sensitivity): -> boolean
    4. contains(string, varchar<L1>, option:case_sensitivity): -> boolean
    5. contains(string, fixedchar<L1>, option:case_sensitivity): -> boolean
    6. contains(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    7. contains(fixedchar<L1>, string, option:case_sensitivity): -> boolean
    8. contains(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether the input string contains the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
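
    A rough Python model of starts_with, ends_with, and contains under the three case_sensitivity values is given below. It rests on assumptions: the text does not define CASE_INSENSITIVE_ASCII precisely, so the sketch reads it as folding only ASCII letters, and it uses str.casefold for CASE_INSENSITIVE. For simplicity the sketch folds both arguments before comparing.

```python
def _fold(s: str, case_sensitivity: str) -> str:
    """Normalize a string for comparison under the given option."""
    if case_sensitivity == "CASE_SENSITIVE":
        return s
    if case_sensitivity == "CASE_INSENSITIVE":
        return s.casefold()
    if case_sensitivity == "CASE_INSENSITIVE_ASCII":
        # Assumed meaning: only ASCII letters are treated case-insensitively.
        return "".join(c.lower() if c.isascii() else c for c in s)
    raise ValueError(f"unknown case_sensitivity: {case_sensitivity}")


def starts_with(s: str, sub: str, case_sensitivity: str) -> bool:
    return _fold(s, case_sensitivity).startswith(_fold(sub, case_sensitivity))


def ends_with(s: str, sub: str, case_sensitivity: str) -> bool:
    return _fold(s, case_sensitivity).endswith(_fold(sub, case_sensitivity))


def contains(s: str, sub: str, case_sensitivity: str) -> bool:
    return _fold(sub, case_sensitivity) in _fold(s, case_sensitivity)
```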

    strpos

    Implementations:
    strpos(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.

    0. strpos(string, string, option:case_sensitivity): -> i64
    1. strpos(varchar<L1>, varchar<L1>, option:case_sensitivity): -> i64
    2. strpos(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> i64

    Return the position of the first occurrence of a string in another string. The first character of the string is at position 1. If no occurrence is found, 0 is returned. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']

    regexp_strpos

    Implementations:
    regexp_strpos(input, pattern, position, occurrence, option:case_sensitivity, option:multiline, option:dotall): -> return_type
    0. regexp_strpos(varchar<L1>, varchar<L2>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64
    1. regexp_strpos(string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64

    Return the position of an occurrence of the given regular expression pattern in a string. The first character of the string is at position 1. The pattern should follow the International Components for Unicode (ICU) regular expression syntax (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The position argument gives the character position at which to start searching for matches: 1 means the first character of the input string, 2 the second, and so on. It should be a positive integer. The occurrence argument selects which occurrence to report: 1 means the position of the first occurrence will be returned, 2 the position of the second occurrence, and so on. It should be a positive integer. If no occurrence is found, 0 is returned. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option treats the input string as multiple lines, so that ^ and $ match at the beginning and end of any line instead of only at the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters. Behavior is undefined if the regex fails to compile, or if the occurrence or position value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']

    count_substring

    Implementations:
    count_substring(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to count.

    0. count_substring(string, string, option:case_sensitivity): -> i64
    1. count_substring(varchar<L1>, varchar<L2>, option:case_sensitivity): -> i64
    2. count_substring(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> i64

    Return the number of non-overlapping occurrences of a substring in an input string. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
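
    The 1-based position convention of strpos and the non-overlapping counting of count_substring map closely onto Python's built-in string methods. The sketch below covers only the CASE_SENSITIVE setting and is meant as an illustration rather than a reference implementation.

```python
def strpos(s: str, substring: str) -> int:
    """1-based position of the first occurrence, or 0 if there is none."""
    # str.find returns a 0-based index, or -1 when nothing is found,
    # so adding 1 yields exactly the convention described above.
    return s.find(substring) + 1


def count_substring(s: str, substring: str) -> int:
    """Number of non-overlapping occurrences of substring in s."""
    # str.count scans left to right and never counts overlapping matches.
    return s.count(substring)
```

    For example, "aaaa" contains "aa" at three overlapping offsets, but only two non-overlapping occurrences are counted.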

    regexp_count_substring

    Implementations:
    regexp_count_substring(input, pattern, position, option:case_sensitivity, option:multiline, option:dotall): -> return_type
    0. regexp_count_substring(string, string, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64
    1. regexp_count_substring(varchar<L1>, varchar<L2>, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64
    2. regexp_count_substring(fixedchar<L1>, fixedchar<L2>, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64

    Return the number of non-overlapping occurrences of a regular expression pattern in an input string. The pattern should follow the International Components for Unicode (ICU) regular expression syntax (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The position argument gives the character position at which to start searching for matches: 1 means the first character of the input string, 2 the second, and so on. It should be a positive integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option treats the input string as multiple lines, so that ^ and $ match at the beginning and end of any line instead of only at the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters. Behavior is undefined if the regex fails to compile or the position value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']

    replace

    Implementations:
    replace(input, substring, replacement, option:case_sensitivity): -> return_type

  • input: Input string.
  • substring: The substring to replace.
  • replacement: The replacement string.

    0. replace(string, string, string, option:case_sensitivity): -> string
    1. replace(varchar<L1>, varchar<L2>, varchar<L3>, option:case_sensitivity): -> varchar<L1>

    Replace all occurrences of the substring with the replacement string. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']

    concat_ws

    Implementations:
    concat_ws(separator, string_arguments): -> return_type

  • separator: Character to separate strings by.
  • string_arguments: Strings to be concatenated.

    0. concat_ws(string, string): -> string
    1. concat_ws(varchar<L2>, varchar<L1>): -> varchar<L1>

    Concatenate strings together separated by a separator.

    repeat

    Implementations:
    repeat(input, count): -> return_type
    0. repeat(string, i64): -> string
    1. repeat(varchar<L1>, i64): -> varchar<L1>

    Repeat a string count number of times.

    reverse

    Implementations:
    reverse(input): -> return_type
    0. reverse(string): -> string
    1. reverse(varchar<L1>): -> varchar<L1>
    2. reverse(fixedchar<L1>): -> fixedchar<L1>

    Returns the string in reverse order.

    replace_slice

    Implementations:
    replace_slice(input, start, length, replacement): -> return_type

  • input: Input string.
  • start: The position in the string to start deleting/inserting characters.
  • length: The number of characters to delete from the input string.
  • replacement: The new string to insert at the start position.

    0. replace_slice(string, i64, i64, string): -> string
    1. replace_slice(varchar<L1>, i64, i64, varchar<L2>): -> varchar<L1>

    Replace a slice of the input string. A specified length of characters is deleted from the input string beginning at the start position and replaced by the new string. A start value of 1 indicates the first character of the input string. If start is negative or zero, or greater than the length of the input string, a null string is returned. If length is negative, a null string is returned. If length is zero, the new string is inserted at the specified start position and no characters are deleted. If length extends past the end of the input string, deletion occurs up to the last character of the input string.
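
    The rules for replace_slice translate almost directly into Python slicing. In this sketch, which is illustrative only, None stands in for the null string mentioned in the text.

```python
from typing import Optional


def replace_slice(s: str, start: int, length: int,
                  replacement: str) -> Optional[str]:
    """Model of replace_slice with a 1-based start position."""
    # Out-of-range start or negative length yields null (None here).
    if start <= 0 or start > len(s) or length < 0:
        return None
    begin = start - 1
    # Slicing past the end of the string simply stops at the last character,
    # which matches the rule for a length that runs past the input.
    return s[:begin] + replacement + s[begin + length:]
```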

    lower

    Implementations:
    lower(input, option:char_set): -> return_type
    0. lower(string, option:char_set): -> string
    1. lower(varchar<L1>, option:char_set): -> varchar<L1>
    2. lower(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    upper

    Implementations:
    upper(input, option:char_set): -> return_type
    0. upper(string, option:char_set): -> string
    1. upper(varchar<L1>, option:char_set): -> varchar<L1>
    2. upper(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string to upper case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    swapcase

    Implementations:
    swapcase(input, option:char_set): -> return_type
    0. swapcase(string, option:char_set): -> string
    1. swapcase(varchar<L1>, option:char_set): -> varchar<L1>
    2. swapcase(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string’s lowercase characters to uppercase and uppercase characters to lowercase. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    capitalize

    Implementations:
    capitalize(input, option:char_set): -> return_type
    0. capitalize(string, option:char_set): -> string
    1. capitalize(varchar<L1>, option:char_set): -> varchar<L1>
    2. capitalize(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Capitalize the first character of the input string. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    title

    Implementations:
    title(input, option:char_set): -> return_type
    0. title(string, option:char_set): -> string
    1. title(varchar<L1>, option:char_set): -> varchar<L1>
    2. title(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Convert the input string to title case: capitalize the first character of each word in the input string except for articles (a, an, the). Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    initcap

    Implementations:
    initcap(input, option:char_set): -> return_type
    0. initcap(string, option:char_set): -> string
    1. initcap(varchar<L1>, option:char_set): -> varchar<L1>
    2. initcap(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Capitalizes the first character of each word in the input string, including articles, and lowercases the rest. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    char_length

    Implementations:
    char_length(input): -> return_type
    0. char_length(string): -> i64
    1. char_length(varchar<L1>): -> i64
    2. char_length(fixedchar<L1>): -> i64

    Return the number of characters in the input string. The length includes trailing spaces.

    bit_length

    Implementations:
    bit_length(input): -> return_type
    0. bit_length(string): -> i64
    1. bit_length(varchar<L1>): -> i64
    2. bit_length(fixedchar<L1>): -> i64

    Return the number of bits in the input string.

    octet_length

    Implementations:
    octet_length(input): -> return_type
    0. octet_length(string): -> i64
    1. octet_length(varchar<L1>): -> i64
    2. octet_length(fixedchar<L1>): -> i64

    Return the number of bytes in the input string.
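
    The difference between char_length, octet_length, and bit_length shows up as soon as a string contains non-ASCII characters. The sketch below assumes UTF-8 encoding and counts characters as Unicode code points. Both are assumptions of the example, since the text above does not fix an encoding.

```python
def char_length(s: str) -> int:
    """Number of characters (code points), trailing spaces included."""
    return len(s)


def octet_length(s: str) -> int:
    """Number of bytes, assuming the string is stored as UTF-8."""
    return len(s.encode("utf-8"))


def bit_length(s: str) -> int:
    """Number of bits, i.e. eight per byte."""
    return 8 * octet_length(s)
```

    For the six-character string "h\u00e9llo " (with a trailing space), the accented letter occupies two bytes in UTF-8, so the byte count is one more than the character count.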

    regexp_replace

    Implementations:
    regexp_replace(input, pattern, replacement, position, occurrence, option:case_sensitivity, option:multiline, option:dotall): -> return_type

  • input: The input string.
  • pattern: The regular expression to search for within the input string.
  • replacement: The replacement string.
  • position: The position to start the search.
  • occurrence: Which occurrence of the match to replace.

    0. regexp_replace(string, string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> string
    1. regexp_replace(varchar<L1>, varchar<L2>, varchar<L3>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> varchar<L1>

    Search a string for a substring that matches a given regular expression pattern and replace it with a replacement string. The pattern should follow the International Components for Unicode (ICU) regular expression syntax (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The occurrence argument selects which match to replace: 1 means only the first occurrence will be replaced, 2 only the second occurrence, and so on. Specifying 0 means all occurrences will be replaced. The position argument gives the character position at which to start searching for matches: 1 means the first character of the input string, 2 the second, and so on. It should be a positive integer. The replacement string can refer to capture groups using numbered backreferences. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option treats the input string as multiple lines, so that ^ and $ match at the beginning and end of any line instead of only at the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters. Behavior is undefined if the regex fails to compile, the replacement contains an illegal back-reference, the occurrence value is out of range, or the position value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
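
    How occurrence and position select the matches to replace can be sketched with Python's re module. This is an approximation under stated assumptions: re is not the ICU engine, backreferences in the replacement use Python's \1 syntax, the flags parameter takes Python re flags instead of the three options, and an out-of-range occurrence leaves the input unchanged (the text leaves that case undefined).

```python
import re


def regexp_replace(s: str, pattern: str, replacement: str,
                   position: int = 1, occurrence: int = 0,
                   flags: int = 0) -> str:
    regex = re.compile(pattern, flags)
    out, last = [], 0
    # position is 1-based; finditer takes a 0-based starting index.
    for i, m in enumerate(regex.finditer(s, position - 1), start=1):
        # occurrence 0 replaces every match; otherwise only the i-th one.
        if occurrence in (0, i):
            out.append(s[last:m.start()])
            out.append(m.expand(replacement))  # resolves \1, \2, ...
            last = m.end()
    out.append(s[last:])
    return "".join(out)
```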

    ltrim

    Implementations:
    ltrim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.

    0. ltrim(varchar<L1>, varchar<L2>): -> varchar<L1>
    1. ltrim(string, string): -> string

    Remove any occurrence of the characters from the left side of the string. If no characters are specified, spaces are removed.

    rtrim

    Implementations:
    rtrim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.

    0. rtrim(varchar<L1>, varchar<L2>): -> varchar<L1>
    1. rtrim(string, string): -> string

    Remove any occurrence of the characters from the right side of the string. If no characters are specified, spaces are removed.

    trim

    Implementations:
    trim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.

    0. trim(varchar<L1>, varchar<L2>): -> varchar<L1>
    1. trim(string, string): -> string

    Remove any occurrence of the characters from the left and right sides of the string. If no characters are specified, spaces are removed.
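
    Because the characters argument is treated as a set of characters rather than as a prefix or suffix, ltrim, rtrim, and trim behave much like Python's str.lstrip, str.rstrip, and str.strip. The sketch below is illustrative. The default of a single space follows the text, and it is passed explicitly because Python's own default would strip all whitespace.

```python
def ltrim(s: str, characters: str = " ") -> str:
    """Remove any of the given characters from the left side."""
    return s.lstrip(characters)


def rtrim(s: str, characters: str = " ") -> str:
    """Remove any of the given characters from the right side."""
    return s.rstrip(characters)


def trim(s: str, characters: str = " ") -> str:
    """Remove any of the given characters from both sides."""
    return s.strip(characters)
```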

    lpad

    Implementations:
    lpad(input, length, characters): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • characters: The string of characters to use for padding.

    0. lpad(varchar<L1>, i32, varchar<L2>): -> varchar<L1>
    1. lpad(string, i32, string): -> string

    Left-pad the input string with the string of ‘characters’ until the specified length of the string has been reached. If the input string is longer than ‘length’, remove characters from the right side to shorten it to ‘length’ characters. If the string of ‘characters’ is longer than the remaining ‘length’ needed to be filled, only pad until ‘length’ has been reached. If ‘characters’ is not specified, the default value is a single space.

    rpad

    Implementations:
    rpad(input, length, characters): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • characters: The string of characters to use for padding.

    0. rpad(varchar<L1>, i32, varchar<L2>): -> varchar<L1>
    1. rpad(string, i32, string): -> string

    Right-pad the input string with the string of ‘characters’ until the specified length of the string has been reached. If the input string is longer than ‘length’, remove characters from the left side to shorten it to ‘length’ characters. If the string of ‘characters’ is longer than the remaining ‘length’ needed to be filled, only pad until ‘length’ has been reached. If ‘characters’ is not specified, the default value is a single space.
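
    The padding and truncation rules for lpad and rpad can be modeled as below. The sketch follows the wording of the descriptions literally, including the side from which each function removes characters when the input is too long. The helper _padding is hypothetical, and rejecting an empty padding string or a negative length is a choice of the sketch.

```python
def _padding(n: int, characters: str) -> str:
    """Hypothetical helper: repeat characters and cut to exactly n."""
    if not characters:
        raise ValueError("padding string must not be empty")
    reps = -(-n // len(characters))  # ceiling division
    return (characters * reps)[:n]


def lpad(s: str, length: int, characters: str = " ") -> str:
    if length < 0:
        raise ValueError("negative length is not covered by this sketch")
    if len(s) >= length:
        return s[:length]            # drop characters from the right side
    return _padding(length - len(s), characters) + s


def rpad(s: str, length: int, characters: str = " ") -> str:
    if length < 0:
        raise ValueError("negative length is not covered by this sketch")
    if len(s) >= length:
        return s[len(s) - length:]   # drop characters from the left side
    return s + _padding(length - len(s), characters)
```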

    center

    Implementations:
    center(input, length, character, option:padding): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • character: The character to use for padding.

    0. center(varchar<L1>, i32, varchar<L1>, option:padding): -> varchar<L1>
    1. center(string, i32, string, option:padding): -> string

    Center the input string by padding both sides with a single character until the specified length has been reached. By default, if an odd number of padding characters is required, the extra padding character is applied to the right side. The side that receives the extra padding can be controlled with the padding option. Behavior is undefined if the number of characters passed to the character argument is not 1.

    Options:
  • padding ['RIGHT', 'LEFT']
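
    The effect of the padding option on an odd amount of padding is easiest to see in code. The following Python sketch is illustrative. Raising an error for a multi-character padding argument and returning the input unchanged when it is already longer than length are choices of the sketch, since the text leaves the first undefined and does not describe the second.

```python
def center(s: str, length: int, character: str, *,
           padding: str = "RIGHT") -> str:
    if len(character) != 1:
        raise ValueError("character must be exactly one character")
    if padding not in ("RIGHT", "LEFT"):
        raise ValueError(f"unknown padding: {padding}")
    total = max(length - len(s), 0)
    half, extra = divmod(total, 2)
    # The odd padding character goes to the side named by the option.
    left = half + (extra if padding == "LEFT" else 0)
    right = half + (extra if padding == "RIGHT" else 0)
    return character * left + s + character * right
```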

    left

    Implementations:
    left(input, count): -> return_type
    0. left(varchar<L1>, i32): -> varchar<L1>
    1. left(string, i32): -> string

    Extract count characters starting from the left of the string.

    right

    Implementations:
    right(input, count): -> return_type
    0. right(varchar<L1>, i32): -> varchar<L1>
    1. right(string, i32): -> string

    Extract count characters starting from the right of the string.

    string_split

    Implementations:
    string_split(input, separator): -> return_type

  • input: The input string.
  • separator: A character used for splitting the string.

    0. string_split(varchar<L1>, varchar<L2>): -> List<varchar<L1>>
    1. string_split(string, string): -> List<string>

    Split a string into a list of strings, based on a specified separator character.

    regexp_string_split

    Implementations:
    regexp_string_split(input, pattern, option:case_sensitivity, option:multiline, option:dotall): -> return_type

  • input: The input string.
  • pattern: The regular expression to search for within the input string.

    0. regexp_string_split(varchar<L1>, varchar<L2>, option:case_sensitivity, option:multiline, option:dotall): -> List<varchar<L1>>
    1. regexp_string_split(string, string, option:case_sensitivity, option:multiline, option:dotall): -> List<string>

    Split a string into a list of strings, based on a regular expression pattern. The substrings matched by the pattern will be used as the separators to split the input string and will not be included in the resulting list. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
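
    The two split functions can be sketched in Python as follows. The regular expression version is built on re.finditer rather than re.split, because re.split would add the text of capture groups to the result, whereas the description says the separators are not included. Python's re is a stand-in for ICU here, the flags parameter takes Python re flags instead of the three options, and patterns that can match the empty string are not handled.

```python
import re
from typing import List


def string_split(s: str, separator: str) -> List[str]:
    """Split on a literal separator."""
    return s.split(separator)


def regexp_string_split(s: str, pattern: str, flags: int = 0) -> List[str]:
    """Split on every match of pattern; matched text is dropped."""
    parts, last = [], 0
    for m in re.finditer(pattern, s, flags):
        parts.append(s[last:m.start()])
        last = m.end()
    parts.append(s[last:])
    return parts
```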

    Aggregate Functions

    string_agg

    Implementations:
    string_agg(input, separator): -> return_type

  • input: Column of string values.
  • separator: Separator for concatenated strings.

    0. string_agg(string, string): -> string

    Concatenates a column of string values with a separator.


    functions_string.yaml

    This document file is generated for functions_string.yaml

    Scalar Functions

    concat

    Implementations:
    concat(input, option:null_handling): -> return_type
    0. concat(varchar<L1>, option:null_handling): -> varchar<L1>
    1. concat(string, option:null_handling): -> string

    Concatenate strings. The null_handling option determines whether or not null values will be recognized by the function. If null_handling is set to IGNORE_NULLS, null value arguments will be ignored when strings are concatenated. If set to ACCEPT_NULLS, the result will be null if any argument passed to the concat function is null.

    Options:
  • null_handling ['IGNORE_NULLS', 'ACCEPT_NULLS']

    like

    Implementations:
    like(input, match, option:case_sensitivity): -> return_type

  • input: The input string.
  • match: The string to match against the input string.

    0. like(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean
    1. like(string, string, option:case_sensitivity): -> boolean

    Test whether the input string matches the match string. The case_sensitivity option applies to the match argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']

    substring

    Implementations:
    substring(input, start, length, option:negative_start): -> return_type
    0. substring(varchar<L1>, i32, i32, option:negative_start): -> varchar<L1>
    1. substring(string, i32, i32, option:negative_start): -> string
    2. substring(fixedchar<L1>, i32, i32, option:negative_start): -> string
    3. substring(varchar<L1>, i32, option:negative_start): -> varchar<L1>
    4. substring(string, i32, option:negative_start): -> string
    5. substring(fixedchar<L1>, i32, option:negative_start): -> string

    Extract a substring of a specified length starting from position start. A start value of 1 refers to the first character of the string. When length is not specified, the function extracts a substring starting from position start and ending at the end of the string. The negative_start option applies to the start parameter. WRAP_FROM_END means the index will start from the end of the input and move backwards: the last character has an index of -1, the second to last character has an index of -2, and so on. LEFT_OF_BEGINNING means the returned substring will start from the left of the first character: a start of -1 begins 2 characters left of the input, while a start of 0 begins 1 character left of the input.

    Options:
  • negative_start ['WRAP_FROM_END', 'LEFT_OF_BEGINNING', 'ERROR']
  • negative_start ['WRAP_FROM_END', 'LEFT_OF_BEGINNING']

    regexp_match_substring

    Implementations:
    regexp_match_substring(input, pattern, position, occurrence, group, option:case_sensitivity, option:multiline, option:dotall): -> return_type
    0. regexp_match_substring(varchar<L1>, varchar<L2>, i64, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> varchar<L1>
    1. regexp_match_substring(string, string, i64, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> string

    Extract a substring that matches the given regular expression pattern. The pattern should follow the International Components for Unicode (ICU) regular expression syntax (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The occurrence argument selects which match to extract: 1 means the first occurrence, 2 the second, and so on. It should be a positive integer. The position argument gives the character position at which to start searching for matches: 1 means the first character of the input string, 2 the second, and so on. It should be a positive integer. The group argument selects the capture group to return: 0 returns the substring matching the full regular expression, 1 returns the substring matching only the first capture group, and so on. It should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option treats the input string as multiple lines, so that ^ and $ match at the beginning and end of any line instead of only at the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters. Behavior is undefined if the regex fails to compile, or if the occurrence, position, or group value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
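
The interaction of the position, occurrence, and group arguments can be illustrated with a minimal Python sketch. It uses Python's built-in re module rather than ICU, so pattern syntax may differ in places, and it returns None for out-of-range values even though the text above leaves that behavior undefined. The flags parameter is an assumption of this sketch: the multiline, dotall, and case-insensitive options would roughly correspond to re.MULTILINE, re.DOTALL, and re.IGNORECASE.

```python
import re
from itertools import islice


def regexp_match_substring(text, pattern, position=1, occurrence=1, group=0, flags=0):
    """Return the requested capture group of the n-th match, or None."""
    # position is 1-based in the description, so convert it to a 0-based offset.
    matches = re.compile(pattern, flags).finditer(text, position - 1)
    # Skip (occurrence - 1) matches and take the next one, if there is one.
    match = next(islice(matches, occurrence - 1, None), None)
    return match.group(group) if match else None


# Second occurrence, second capture group (the digits).
print(regexp_match_substring("a1 b22 c333", r"([a-z])(\d+)", occurrence=2, group=2))
```

Here group=0 would return the whole second match, while group=2 narrows the result to the digits captured by the second parenthesized group.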

    regexp_match_substring_all

    Implementations:
    regexp_match_substring_all(input, pattern, position, group, option:case_sensitivity, option:multiline, option:dotall): -> return_type
    0. regexp_match_substring_all(varchar<L1>, varchar<L2>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> List<varchar<L1>>
    1. regexp_match_substring_all(string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> List<string>

    Extract all substrings that match the given regular expression pattern. This will return a list of extracted strings with one value for each occurrence of a match. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The character position at which to start searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive integer. The regular expression capture group can be specified using the group argument. Specifying 0 will return substrings matching the full regular expression. Specifying 1 will return substrings matching only the first capture group, and so on. The group argument should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the position value is out of range, or the group value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']

    starts_with

    Implementations:
    starts_with(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.

    0. starts_with(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean
    1. starts_with(varchar<L1>, string, option:case_sensitivity): -> boolean
    2. starts_with(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    3. starts_with(string, string, option:case_sensitivity): -> boolean
    4. starts_with(string, varchar<L1>, option:case_sensitivity): -> boolean
    5. starts_with(string, fixedchar<L1>, option:case_sensitivity): -> boolean
    6. starts_with(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    7. starts_with(fixedchar<L1>, string, option:case_sensitivity): -> boolean
    8. starts_with(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether the input string starts with the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']

    ends_with

    Implementations:
    ends_with(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.

    0. ends_with(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean
    1. ends_with(varchar<L1>, string, option:case_sensitivity): -> boolean
    2. ends_with(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    3. ends_with(string, string, option:case_sensitivity): -> boolean
    4. ends_with(string, varchar<L1>, option:case_sensitivity): -> boolean
    5. ends_with(string, fixedchar<L1>, option:case_sensitivity): -> boolean
    6. ends_with(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    7. ends_with(fixedchar<L1>, string, option:case_sensitivity): -> boolean
    8. ends_with(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether the input string ends with the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']

    contains

    Implementations:
    contains(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.

    0. contains(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean
    1. contains(varchar<L1>, string, option:case_sensitivity): -> boolean
    2. contains(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    3. contains(string, string, option:case_sensitivity): -> boolean
    4. contains(string, varchar<L1>, option:case_sensitivity): -> boolean
    5. contains(string, fixedchar<L1>, option:case_sensitivity): -> boolean
    6. contains(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
    7. contains(fixedchar<L1>, string, option:case_sensitivity): -> boolean
    8. contains(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether the input string contains the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
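
The three case_sensitivity values shared by starts_with, ends_with, and contains can be sketched in Python as follows. This is only one plausible reading: the text above does not spell out the folding rules, so mapping CASE_INSENSITIVE to Unicode case folding (str.casefold) and CASE_INSENSITIVE_ASCII to folding only A-Z is an assumption, as is folding both arguments before comparing. The helper name fold is hypothetical.

```python
import string

# Translation table that lowercases only the ASCII letters A-Z.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def fold(text, case_sensitivity):
    """Normalize text for comparison under the given option (assumed semantics)."""
    if case_sensitivity == "CASE_SENSITIVE":
        return text
    if case_sensitivity == "CASE_INSENSITIVE_ASCII":
        return text.translate(_ASCII_FOLD)
    if case_sensitivity == "CASE_INSENSITIVE":
        return text.casefold()
    raise ValueError(f"unknown case_sensitivity option: {case_sensitivity}")


def starts_with(text, substring, case_sensitivity="CASE_SENSITIVE"):
    return fold(text, case_sensitivity).startswith(fold(substring, case_sensitivity))


def ends_with(text, substring, case_sensitivity="CASE_SENSITIVE"):
    return fold(text, case_sensitivity).endswith(fold(substring, case_sensitivity))


def contains(text, substring, case_sensitivity="CASE_SENSITIVE"):
    return fold(substring, case_sensitivity) in fold(text, case_sensitivity)
```

Under these assumptions, contains("STRASSE", "straße", "CASE_INSENSITIVE") is true because full case folding expands ß to ss, while the same call with CASE_INSENSITIVE_ASCII is false because ß is left untouched.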

    strpos

    Implementations:
    strpos(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.

    0. strpos(string, string, option:case_sensitivity): -> i64
    1. strpos(varchar<L1>, varchar<L1>, option:case_sensitivity): -> i64
    2. strpos(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> i64

    Return the position of the first occurrence of a string in another string. The first character of the string is at position 1. If no occurrence is found, 0 is returned. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
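
As a small illustration of the 1-based result and the 0 return for a missing substring, here is a case-sensitive Python sketch. It ignores the case_sensitivity option; the function name mirrors the one above, but the implementation is only an assumption about how an engine might behave.

```python
def strpos(text, substring):
    """1-based position of the first occurrence, or 0 when there is none."""
    # str.find returns a 0-based index, or -1 when the substring is absent,
    # so adding 1 yields exactly the convention described above.
    return text.find(substring) + 1


print(strpos("substrait", "strait"))  # 4
print(strpos("substrait", "sql"))     # 0
```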

    regexp_strpos

    Implementations:
    regexp_strpos(input, pattern, position, occurrence, option:case_sensitivity, option:multiline, option:dotall): -> return_type
    0. regexp_strpos(varchar<L1>, varchar<L2>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64
    1. regexp_strpos(string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64

    Return the position of an occurrence of the given regular expression pattern in a string. The first character of the string is at position 1. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The character position at which to start searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive integer. Which occurrence to return the position of is specified using the occurrence argument. Specifying 1 means the position of the first occurrence will be returned, 2 means the position of the second occurrence, and so on. The occurrence argument should be a positive integer. If no occurrence is found, 0 is returned. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the occurrence value is out of range, or the position value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
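
A rough Python sketch of the description above, using the built-in re module in place of ICU. It assumes the returned position is counted from the start of the whole input string (not from the position argument), which the text suggests but does not state outright; the flags parameter is likewise an assumption of this sketch.

```python
import re
from itertools import islice


def regexp_strpos(text, pattern, position=1, occurrence=1, flags=0):
    """1-based start of the n-th match at or after position, or 0 if none."""
    matches = re.compile(pattern, flags).finditer(text, position - 1)
    match = next(islice(matches, occurrence - 1, None), None)
    # match.start() is 0-based relative to the full string.
    return match.start() + 1 if match else 0


print(regexp_strpos("a1 b22 c333", r"\d+", occurrence=2))  # 5
```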

    count_substring

    Implementations:
    count_substring(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to count.

    0. count_substring(string, string, option:case_sensitivity): -> i64
    1. count_substring(varchar<L1>, varchar<L2>, option:case_sensitivity): -> i64
    2. count_substring(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> i64

    Return the number of non-overlapping occurrences of a substring in an input string. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
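
The phrase "non-overlapping" matters when the substring can overlap with itself. Python's str.count happens to use the same convention, so a case-sensitive sketch (ignoring the case_sensitivity option) is a one-liner:

```python
def count_substring(text, substring):
    """Count non-overlapping occurrences, scanning left to right."""
    return text.count(substring)


# "aa" fits at offsets 0, 1, and 2 of "aaaa", but only offsets 0 and 2
# are counted because the match at offset 1 overlaps the first one.
print(count_substring("aaaa", "aa"))  # 2
```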

    regexp_count_substring

    Implementations:
    regexp_count_substring(input, pattern, position, option:case_sensitivity, option:multiline, option:dotall): -> return_type
    0. regexp_count_substring(string, string, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64
    1. regexp_count_substring(varchar<L1>, varchar<L2>, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64
    2. regexp_count_substring(fixedchar<L1>, fixedchar<L2>, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64

    Return the number of non-overlapping occurrences of a regular expression pattern in an input string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The character position at which to start searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile or the position value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']

    replace

    Implementations:
    replace(input, substring, replacement, option:case_sensitivity): -> return_type

  • input: Input string.
  • substring: The substring to replace.
  • replacement: The replacement string.

    0. replace(string, string, string, option:case_sensitivity): -> string
    1. replace(varchar<L1>, varchar<L2>, varchar<L3>, option:case_sensitivity): -> varchar<L1>

    Replace all occurrences of the substring with the replacement string. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']

    concat_ws

    Implementations:
    concat_ws(separator, string_arguments): -> return_type

  • separator: Character to separate strings by.
  • string_arguments: Strings to be concatenated.

    0. concat_ws(string, string): -> string
    1. concat_ws(varchar<L2>, varchar<L1>): -> varchar<L1>

    Concatenate strings together separated by a separator.

    repeat

    Implementations:
    repeat(input, count): -> return_type
    0. repeat(string, i64): -> string
    1. repeat(varchar<L1>, i64, i64): -> varchar<L1>

    Repeat a string count number of times.

    reverse

    Implementations:
    reverse(input): -> return_type
    0. reverse(string): -> string
    1. reverse(varchar<L1>): -> varchar<L1>
    2. reverse(fixedchar<L1>): -> fixedchar<L1>

    Returns the string in reverse order.

    replace_slice

    Implementations:
    replace_slice(input, start, length, replacement): -> return_type

  • input: Input string.
  • start: The position in the string to start deleting/inserting characters.
  • length: The number of characters to delete from the input string.
  • replacement: The new string to insert at the start position.

    0. replace_slice(string, i64, i64, string): -> string
    1. replace_slice(varchar<L1>, i64, i64, varchar<L2>): -> varchar<L1>

    Replace a slice of the input string. A specified ‘length’ of characters will be deleted from the input string beginning at the ‘start’ position and will be replaced by a new string. A start value of 1 indicates the first character of the input string. If start is negative or zero, or greater than the length of the input string, a null string is returned. If ‘length’ is negative, a null string is returned. If ‘length’ is zero, inserting of the new string occurs at the specified ‘start’ position and no characters are deleted. If ‘length’ is greater than the input string, deletion will occur up to the last character of the input string.
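
The edge cases listed above can be condensed into a short Python sketch. None stands in for the null string, and positions are counted in Python characters (code points), which is an assumption about what "character" means here.

```python
def replace_slice(text, start, length, replacement):
    """Delete `length` characters at 1-based `start` and insert `replacement`."""
    # Out-of-range start or a negative length yields null.
    if start <= 0 or start > len(text) or length < 0:
        return None
    begin = start - 1
    # Slicing past the end is safe in Python, so an oversized length simply
    # deletes up to the last character. A zero length is a pure insertion.
    return text[:begin] + replacement + text[begin + length:]


print(replace_slice("abcdef", 2, 3, "XY"))  # aXYef
print(replace_slice("abc", 2, 0, "-"))      # a-bc
print(replace_slice("abc", 2, 100, "Z"))    # aZ
```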

    lower

    Implementations:
    lower(input, option:char_set): -> return_type
    0. lower(string, option:char_set): -> string
    1. lower(varchar<L1>, option:char_set): -> varchar<L1>
    2. lower(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    upper

    Implementations:
    upper(input, option:char_set): -> return_type
    0. upper(string, option:char_set): -> string
    1. upper(varchar<L1>, option:char_set): -> varchar<L1>
    2. upper(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string to upper case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    swapcase

    Implementations:
    swapcase(input, option:char_set): -> return_type
    0. swapcase(string, option:char_set): -> string
    1. swapcase(varchar<L1>, option:char_set): -> varchar<L1>
    2. swapcase(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string’s lowercase characters to uppercase and uppercase characters to lowercase. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    capitalize

    Implementations:
    capitalize(input, option:char_set): -> return_type
    0. capitalize(string, option:char_set): -> string
    1. capitalize(varchar<L1>, option:char_set): -> varchar<L1>
    2. capitalize(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Capitalize the first character of the input string. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    title

    Implementations:
    title(input, option:char_set): -> return_type
    0. title(string, option:char_set): -> string
    1. title(varchar<L1>, option:char_set): -> varchar<L1>
    2. title(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Converts the input string into titlecase. Capitalize the first character of each word in the input string except for articles (a, an, the). Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    initcap

    Implementations:
    initcap(input, option:char_set): -> return_type
    0. initcap(string, option:char_set): -> string
    1. initcap(varchar<L1>, option:char_set): -> varchar<L1>
    2. initcap(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Capitalizes the first character of each word in the input string, including articles, and lowercases the rest. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']

    char_length

    Implementations:
    char_length(input): -> return_type
    0. char_length(string): -> i64
    1. char_length(varchar<L1>): -> i64
    2. char_length(fixedchar<L1>): -> i64

    Return the number of characters in the input string. The length includes trailing spaces.

    bit_length

    Implementations:
    bit_length(input): -> return_type
    0. bit_length(string): -> i64
    1. bit_length(varchar<L1>): -> i64
    2. bit_length(fixedchar<L1>): -> i64

    Return the number of bits in the input string.

    octet_length

    Implementations:
    octet_length(input): -> return_type
    0. octet_length(string): -> i64
    1. octet_length(varchar<L1>): -> i64
    2. octet_length(fixedchar<L1>): -> i64

    Return the number of bytes in the input string.
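
The three length functions differ only for non-ASCII input. The following Python sketch assumes the string is encoded as UTF-8, which the descriptions above do not state explicitly, and counts characters as Unicode code points.

```python
def char_length(text):
    """Number of characters (code points), trailing spaces included."""
    return len(text)


def octet_length(text):
    """Number of bytes, assuming a UTF-8 encoding."""
    return len(text.encode("utf-8"))


def bit_length(text):
    """Number of bits, i.e. eight per byte."""
    return 8 * octet_length(text)


word = "h\u00e9llo"  # "héllo": the é takes two bytes in UTF-8
print(char_length(word), octet_length(word), bit_length(word))  # 5 6 48
```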

    regexp_replace

    Implementations:
    regexp_replace(input, pattern, replacement, position, occurrence, option:case_sensitivity, option:multiline, option:dotall): -> return_type

  • input: The input string.
  • pattern: The regular expression to search for within the input string.
  • replacement: The replacement string.
  • position: The position to start the search.
  • occurrence: Which occurrence of the match to replace.

    0. regexp_replace(string, string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> string
    1. regexp_replace(varchar<L1>, varchar<L2>, varchar<L3>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> varchar<L1>

    Search a string for a substring that matches a given regular expression pattern and replace it with a replacement string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The occurrence of the pattern to be replaced is specified using the occurrence argument. Specifying 1 means only the first occurrence will be replaced, 2 means the second occurrence, and so on. Specifying 0 means all occurrences will be replaced. The character position at which to start searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive integer. The replacement string can refer to capture groups using numbered backreferences. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the replacement contains an illegal back-reference, the occurrence value is out of range, or the position value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
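
The combination of position, occurrence, and backreferences is easier to see in code. Below is a minimal Python sketch built on the re module instead of ICU; as a result, backreferences in the replacement use Python's \1 syntax, which may differ from what an ICU-based engine accepts. The flags parameter and the default argument values are assumptions of this sketch.

```python
import re


def regexp_replace(text, pattern, replacement, position=1, occurrence=0, flags=0):
    """Replace the n-th match at or after position, or every match when n is 0."""
    compiled = re.compile(pattern, flags)
    pieces = []
    cursor = 0  # end of the last region already copied to the output
    for index, match in enumerate(compiled.finditer(text, position - 1), start=1):
        if occurrence in (0, index):
            pieces.append(text[cursor:match.start()])
            # expand() substitutes numbered backreferences such as \1.
            pieces.append(match.expand(replacement))
            cursor = match.end()
    pieces.append(text[cursor:])
    return "".join(pieces)


print(regexp_replace("a1 b2 c3", r"([a-z])(\d)", r"\2\1"))                # 1a 2b 3c
print(regexp_replace("a1 b2 c3", r"([a-z])(\d)", r"\2\1", occurrence=2))  # a1 2b c3
```

Text before the position offset is copied through unchanged, since matches are only searched for from that offset onward.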

    ltrim

    Implementations:
    ltrim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.

    0. ltrim(varchar<L1>, varchar<L2>): -> varchar<L1>
    1. ltrim(string, string): -> string

    Remove any occurrence of the characters from the left side of the string. If no characters are specified, spaces are removed.

    rtrim

    Implementations:
    rtrim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.

    0. rtrim(varchar<L1>, varchar<L2>): -> varchar<L1>
    1. rtrim(string, string): -> string

    Remove any occurrence of the characters from the right side of the string. If no characters are specified, spaces are removed.

    trim

    Implementations:
    trim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.

    0. trim(varchar<L1>, varchar<L2>): -> varchar<L1>
    1. trim(string, string): -> string

    Remove any occurrence of the characters from the left and right sides of the string. If no characters are specified, spaces are removed.

    lpad

    Implementations:
    lpad(input, length, characters): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • characters: The string of characters to use for padding.

    0. lpad(varchar<L1>, i32, varchar<L2>): -> varchar<L1>
    1. lpad(string, i32, string): -> string

    Left-pad the input string with the string of ‘characters’ until the specified length of the string has been reached. If the input string is longer than ‘length’, remove characters from the right-side to shorten it to ‘length’ characters. If the string of ‘characters’ is longer than the remaining ‘length’ needed to be filled, only pad until ‘length’ has been reached. If ‘characters’ is not specified, the default value is a single space.
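
The padding and truncation rules for lpad can be sketched in a few lines of Python. The sketch assumes a non-empty padding string and counts characters as Python code points; neither detail is specified above.

```python
def lpad(text, length, characters=" "):
    """Left-pad text to exactly `length` characters, truncating if too long."""
    if len(text) >= length:
        # Input too long: drop characters from the right-hand side.
        return text[:length]
    needed = length - len(text)
    # Repeat the padding string, then cut it so the result is exactly `length` long.
    fill = (characters * needed)[:needed]
    return fill + text


print(lpad("abc", 6, "xy"))  # xyxabc
print(lpad("abcdef", 3))     # abc
```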

    rpad

    Implementations:
    rpad(input, length, characters): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • characters: The string of characters to use for padding.

    0. rpad(varchar<L1>, i32, varchar<L2>): -> varchar<L1>
    1. rpad(string, i32, string): -> string

    Right-pad the input string with the string of ‘characters’ until the specified length of the string has been reached. If the input string is longer than ‘length’, remove characters from the left-side to shorten it to ‘length’ characters. If the string of ‘characters’ is longer than the remaining ‘length’ needed to be filled, only pad until ‘length’ has been reached. If ‘characters’ is not specified, the default value is a single space.

    center

    Implementations:
    center(input, length, character, option:padding): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • character: The character to use for padding.

    0. center(varchar<L1>, i32, varchar<L1>, option:padding): -> varchar<L1>
    1. center(string, i32, string, option:padding): -> string

    Center the input string by padding the sides with a single character until the specified length of the string has been reached. By default, if reaching the length requires an odd number of padding characters, the extra padding character will be applied to the right side. The side with extra padding can be controlled with the padding option. Behavior is undefined if the number of characters passed to the character argument is not 1.

    Options:
  • padding ['RIGHT', 'LEFT']
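
How the padding option distributes an odd number of padding characters can be shown with a short Python sketch. Returning the input unchanged when it is already at least as long as the requested length is an assumption; the description above does not cover that case.

```python
def center(text, length, character=" ", padding="RIGHT"):
    """Pad both sides with `character`; `padding` names the side that gets the extra one."""
    total = length - len(text)
    if total <= 0:
        return text  # assumption: nothing to pad
    small, large = total // 2, total - total // 2
    left, right = (small, large) if padding == "RIGHT" else (large, small)
    return character * left + text + character * right


print(center("abc", 6, "*"))          # *abc**
print(center("abc", 6, "*", "LEFT"))  # **abc*
```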

    left

    Implementations:
    left(input, count): -> return_type
    0. left(varchar<L1>, i32): -> varchar<L1>
    1. left(string, i32): -> string

    Extract count characters starting from the left of the string.

    right

    Implementations:
    right(input, count): -> return_type
    0. right(varchar<L1>, i32): -> varchar<L1>
    1. right(string, i32): -> string

    Extract count characters starting from the right of the string.

    string_split

    Implementations:
    string_split(input, separator): -> return_type

  • input: The input string.
  • separator: A character used for splitting the string.

    0. string_split(varchar<L1>, varchar<L2>): -> List<varchar<L1>>
    1. string_split(string, string): -> List<string>

    Split a string into a list of strings, based on a specified separator character.

    regexp_string_split

    Implementations:
    regexp_string_split(input, pattern, option:case_sensitivity, option:multiline, option:dotall): -> return_type

  • input: The input string.
  • pattern: The regular expression to search for within the input string.

    0. regexp_string_split(varchar<L1>, varchar<L2>, option:case_sensitivity, option:multiline, option:dotall): -> List<varchar<L1>>
    1. regexp_string_split(string, string, option:case_sensitivity, option:multiline, option:dotall): -> List<string>

    Split a string into a list of strings, based on a regular expression pattern. The substrings matched by the pattern will be used as the separators to split the input string and will not be included in the resulting list. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']

    Aggregate Functions

    string_agg

    Implementations:
    string_agg(input, separator): -> return_type

  • input: Column of string values.
  • separator: Separator for concatenated strings.

    0. string_agg(string, string): -> string

    Concatenates a column of string values with a separator.

    + Extensions - Substrait: Cross-Language Serialization for Relational Algebra      

    Extensions

    In many cases, the existing objects in Substrait will be sufficient to accomplish a particular use case. However, it is sometimes helpful to create a new data type, scalar function signature or some other custom representation within a system. For that, Substrait provides a number of extension points.

    Simple Extensions

    Some kinds of primitives are so frequently extended that Substrait defines a standard YAML format that describes how the extended functionality can be interpreted. This allows different projects/systems to use the YAML definition as a specification so that interoperability isn’t constrained to the base Substrait specification. The main types of extensions that are defined in this manner include the following:

    • Data types
    • Type variations
    • Scalar Functions
    • Aggregate Functions
    • Window Functions
    • Table Functions

    To extend these items, developers can create one or more YAML files at a defined URI that describe the properties of each of these extensions. The YAML file is constructed according to the YAML Schema. Each definition in the file corresponds to the YAML-based serialization of the relevant data structure. A developer who only wants to extend one of these types of objects (e.g. types) does not have to provide definitions for the other extension points.

    A Substrait plan can reference one or more YAML files via URI for extension. In the places where these entities are referenced, they will be referenced using a URI + name reference. The name scheme per type works as follows:

    Category             Naming scheme
    Type                 The name as defined on the type object.
    Type Variation       The name as defined on the type variation object.
    Function Signature   A function signature compound name as described below.

    A YAML file can also reference types and type variations defined in another YAML file. To do this, it must declare the YAML file it depends on using a key-value pair in the dependencies key, where the value is the URI to the YAML file, and the key is a valid identifier that can then be used as an identifier-safe alias for the URI. This alias can then be used as a .-separated namespace prefix wherever a type class or type variation name is expected.

    For example, if the YAML file at file:///extension_types.yaml defines a type called point, a different YAML file can use the type in a function declaration as follows:

    dependencies:
      ext: file:///extension_types.yaml
    scalar_functions:
    - name: distance
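
To make the alias mechanism concrete, here is a hedged Python sketch of how a consumer might turn a possibly prefixed type name into a URI + name pair. The function name resolve_type_reference, the current_uri parameter, and the URI file:///functions.yaml are hypothetical and exist only for illustration; only the dependencies mapping and the ext.point style of reference come from the text above.

```python
def resolve_type_reference(reference, dependencies, current_uri):
    """Map 'alias.name' to (URI of the aliased file, name); bare names stay local."""
    alias, dot, name = reference.rpartition(".")
    if not dot:
        # No namespace prefix: the type is defined in the current YAML file.
        return current_uri, reference
    # A KeyError here means the alias was never declared under `dependencies`.
    return dependencies[alias], name


dependencies = {"ext": "file:///extension_types.yaml"}
print(resolve_type_reference("ext.point", dependencies, "file:///functions.yaml"))
```

The sketch handles only plain type names; parameterized types and type variations would need additional parsing.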
    diff --git a/faq/index.html b/faq/index.html
    index 56b36fe..68fbe8c 100644
    --- a/faq/index.html
    +++ b/faq/index.html
    @@ -1,4 +1,4 @@
    - FAQ - Substrait: Cross-Language Serialization for Relational Algebra      

    Frequently Asked Questions

    What is the purpose of the post-join filter field on Join relations?

    The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join.

    See the example here that highlights how the post-join filter behaves differently than a Filter relation in the case of a left join.

    Substrait Project Governance

    The Substrait project is run by volunteers in a collaborative and open way. Its governance is inspired by the Apache Software Foundation. In most cases, people familiar with the ASF model can work with Substrait in the same way. The biggest differences between the models are:

    • Substrait does not have a separate infrastructure governing body that gatekeeps the adoption of new developer tools and technologies.
    • Substrait Management Committee (SMC) members are responsible for recognizing the corporate relationship of its members and ensuring diverse representation and corporate independence.
    • Substrait does not condone private mailing lists. All project business should be discussed in public. The only exceptions to this are security escalations (security@substrait.io) and harassment (harassment@substrait.io).
    • Substrait has an automated continuous release process with no formal voting process per release.

    More details about concrete things Substrait looks to avoid can be found below.

    The Substrait Project

    The Substrait project consists of the code and repositories that reside in the substrait-io GitHub organization, the Substrait.io website, the Substrait mailing list, community calls hosted on Microsoft Teams, and the Substrait Slack workspace. (All are open to everyone, and recordings/transcripts are made where technology supports it.)

    Substrait Volunteers

    We recognize four groups of individuals related to the project.

    User

    A user is someone who uses Substrait. They may contribute to Substrait by providing feedback to developers in the form of bug reports and feature suggestions. Users participate in the Substrait community by helping other users on mailing lists and user support forums.

    Contributor

    A contributor is a user who contributes to the project in the form of code or documentation. They take extra steps to participate in the project (loosely defined as the set of repositories under the GitHub substrait-io organization), are active on the developer mailing list, participate in discussions, and provide patches, documentation, suggestions, and criticism.

    Committer

    A committer is a developer who has write access to the code repositories and has a signed Contributor License Agreement (CLA) on file. Because they do not depend on other people to apply patches to the code or documentation, they are effectively making short-term decisions for the project. The SMC can (even tacitly) agree and approve the changes into permanency, or they can reject them. Remember that the SMC makes the decisions, not the individual committers.

    SMC Member

    An SMC member is a committer who was elected due to merit for the evolution of the project. They have write access to the code repository, the right to cast binding votes on all proposals on community-related decisions, the right to propose other active contributors for committership, and the right to invite active committers to the SMC. The SMC as a whole is the entity that controls the project, nobody else. They are responsible for the continued shaping of this governance model.

    Substrait Management and Collaboration

    The Substrait project is managed using a collaborative, consensus-based process. We do not have a hierarchical structure; rather, different groups of contributors have different rights and responsibilities in the organization.

    Communication

    Communication must be done via mailing lists, Slack, and/or GitHub. Communication is always done publicly. There are no private lists and all decisions related to the project are made in public. Communication is frequently done asynchronously since members of the community are distributed across many time zones.

    Substrait Management Committee

    The Substrait Management Committee is responsible for the active management of Substrait. The main role of the SMC is to further the long-term development and health of the community as a whole, and to ensure that balanced and wide-scale peer review and collaboration take place. As part of this, the SMC is the primary approver of specification changes, ensuring that proposed changes represent a balanced and thorough examination of possibilities. This doesn’t mean that the SMC has to be involved in the minutiae of a particular specification change, but it should always shepherd a healthy process around specification changes.

    Substrait Voting Process

    Because one of the fundamental aspects of accomplishing things is doing so by consensus, we need a way to tell whether we have reached consensus. We do this by voting. There are several different types of voting. In all cases, it is recommended that all community members vote. The number of binding votes required to move forward and the community members who have “binding” votes differ depending on the type of proposal made. In all cases, a veto by a binding voter results in an inability to move forward.

    The rules require that a community member registering a negative vote must include an alternative proposal or a detailed explanation of the reasons for the negative vote. The community then tries to gather consensus on an alternative proposal that can resolve the issue. In the great majority of cases, the concerns leading to the negative vote can be addressed. This process is called “consensus gathering” and we consider it a very important indication of a healthy community.

    Proposal type | +1 votes required | Binding voters | Voting Location
    Process/Governance modifications & actions. This includes promoting new contributors to committer or SMC. | 3 | SMC | Mailing List
    Format/Specification Modifications (including breaking extension changes) | 2 | SMC | GitHub PR
    Documentation Updates (formatting, moves) | 1 | SMC | GitHub PR
    Typos | 1 | Committers | GitHub PR
    Non-breaking function introductions | 1 (not including proposer) | Committers | GitHub PR
    Non-breaking extension additions & non-format code modifications | 1 (not including proposer) | Committers | GitHub PR
    Changes (non-breaking or breaking) to a Substrait library (e.g. substrait-java, substrait-validator) | 1 (not including proposer) | Committers | GitHub PR

    Review-Then-Commit

    Substrait follows a review-then-commit policy. This requires that all changes receive consensus approval before being committed to the code base. The specific vote requirements follow the table above.

    Expressing Votes

    The voting process may seem more than a little weird if you’ve never encountered it before. Votes are represented as numbers between -1 and +1, with ‘-1’ meaning ‘no’ and ‘+1’ meaning ‘yes.’

    The in-between values indicate how strongly the voting individual feels. Here are some examples of fractional votes and what the voter might be communicating with them:

    • +0: ‘I don’t feel strongly about it, but I’m okay with this.’
    • -0: ‘I won’t get in the way, but I’d rather we didn’t do this.’
    • -0.5: ‘I don’t like this idea, but I can’t find any rational justification for my feelings.’
    • ++1: ‘Wow! I like this! Let’s do it!’
    • -0.9: ‘I really don’t like this, but I’m not going to stand in the way if everyone else wants to go ahead with it.’
    • +0.9: ‘This is a cool idea and I like it, but I don’t have time/the skills necessary to help out.’

    Votes on Code Modification

    For code-modification votes, +1 votes (review approvals in GitHub are considered equivalent to a +1) are in favor of the proposal, but -1 votes are vetoes and kill the proposal dead until all vetoers withdraw their -1 votes.

    Vetoes

    A -1 vote (or an unaddressed PR request for changes) by a qualified voter stops a code-modification proposal in its tracks. This constitutes a veto, and it cannot be overruled or overridden by anyone. Vetoes stand until and unless the individual withdraws their veto.

    To prevent vetoes from being used capriciously, the voter must accompany the veto with a technical or community justification showing why the change is bad.

    Why do we vote?

    Votes help us to openly resolve conflicts. Without a process, people tend to avoid conflict and thrash around. Votes help to make sure we do the hard work of resolving the conflict.

    Substrait is non-commercial but commercially-aware

    Substrait’s mission is to produce software for the public good. All Substrait software is always available for free, and solely under the Apache License.

    We’re happy to have third parties, including for-profit corporations, take our software and use it for their own purposes. However, it is important in these cases to ensure that the third party does not misuse the brand and reputation of the Substrait project for its own purposes. It is important for the longevity and community health of Substrait that the community gets the appropriate credit for producing freely available software.

    The SMC actively tracks the corporate allegiances of community members and strives to ensure influence around any particular aspect of the project isn’t overly skewed towards a single corporate entity.

    Substrait Trademark

    The SMC is responsible for protecting the Substrait name and brand. The actions taken to support this are still to be determined.

    Project Roster

    Substrait Management Committee (SMC)

    Name | Association
    Phillip Cloud | Voltron Data
    Weston Pace | LanceDB
    Jacques Nadeau | Sundeck
    Victor Barua | Datadog
    David Sisson | Voltron Data

    Substrait Committers

    Name | Association
    Jeroen van Straten | Qblox
    Carlo Curino | Microsoft
    James Taylor | Sundeck
    Sutou Kouhei | Clearcode
    Micah Kornfeld | Google
    Jinfeng Ni | Sundeck
    Andy Grove | Nvidia
    Jesus Camacho Rodriguez | Microsoft
    Rich Tia | Voltron Data
    Vibhatha Abeykoon | Voltron Data
    Nic Crane | Recast
    Gil Forsyth | Voltron Data
    ChaoJun Zhang | Intel
    Matthijs Brobbel | Voltron Data
    Matt Topol | Voltron Data

    Additional detail about differences from ASF

    Corporate Awareness: The ASF takes a blind-eye approach that has proven to be too slow to correct corporate influence, which has substantially undermined many OSS projects. In contrast, Substrait SMC members are responsible for identifying corporate risks and over-representation and adjusting inclusion in the project based on that (limiting committership, SMC membership, etc.). Each member of the SMC shares responsibility to expand the community and seek out corporate diversity.

    Infrastructure: The ASF shows its age with respect to infrastructure, having been originally built on SVN. Some examples of ASF requirements that Substrait is eschewing include: custom git infrastructure, a manual release process, and project-external gatekeeping around the use of new tools/technologies.

    Substrait: Cross-Language Serialization for Relational Algebra

    What is Substrait?

    Substrait is a format for describing compute operations on structured data. It is designed for interoperability across different languages and systems.

    How does it work?

    Substrait provides a well-defined, cross-language specification for data compute operations. This includes a consistent declaration of common operations, custom operations, and one or more serialized representations of this specification. The spec focuses on the semantics of each operation. In addition to the specification, the Substrait ecosystem also includes a number of libraries and useful tools.

    We highly recommend the tutorial to learn how a Substrait plan is constructed.

    Benefits

    • Avoids every system needing to create a communication method with every other system – each system merely supports ingesting and producing Substrait and it instantly becomes a part of the greater ecosystem.
    • Makes every part of the system upgradable. There’s a new query engine that’s ten times faster? Just plug it in!
    • Enables heterogeneous environments – run on a cluster of an unknown set of execution engines!
    • The text version of the Substrait plan allows you to quickly see how a plan functions without needing a visualizer (although there are Substrait visualizers as well!).

    Example Use Cases

    • Communicate a compute plan between a SQL parser and an execution engine (e.g. Calcite SQL parsing to Arrow C++ compute kernel)
    • Serialize a plan that represents a SQL view for consistent use in multiple systems (e.g. Iceberg views in Spark and Trino)
    • Submit a plan to different execution engines (e.g. DataFusion and Postgres) and get a consistent interpretation of the semantics.
    • Create an alternative plan generation implementation that can connect an existing end-user compute expression system to an existing end-user processing engine (e.g. Pandas operations executed inside SingleStore)
    • Build a pluggable plan visualization tool (e.g. D3-based plan visualizer)

    Basics

    Substrait is designed to allow a user to construct an arbitrarily complex data transformation plan. The plan is composed of one or more relational operations. Relational operations are well-defined transformation operations that work by taking zero or more input datasets and transforming them into zero or more output datasets. Substrait defines a core set of transformations, but users are also able to extend the operations with their own specialized operations.

    Each relational operation is composed of several properties. Common properties for relational operations include the following:

    Property | Description | Type
    Emit | The set of columns output from this operation and the order of those columns. | Logical & Physical
    Hints | A set of optionally provided, optionally consumed information about an operation that better informs execution. These might include estimated number of input and output records, estimated record size, likely filter reduction, estimated dictionary size, etc. These can also include implementation specific pieces of execution information. | Physical
    Constraint | A set of runtime constraints around the operation, limiting its consumption based on real-world resources (CPU, memory) as well as virtual resources like number of records produced, the largest record size, etc. | Physical

    Relational Signatures

    In functions, function signatures are declared externally to the use of those signatures (function bindings). In the case of relational operations, signatures are declared directly in the specification. This is due to the difference in rate of change and in total number of operations. Relational operations in the specification are expected to number fewer than 100 for several years, with additions being infrequent. On the other hand, there is an expectation of both a much larger number of functions (1,000s) and a much higher velocity of additions.

    Each relational operation must declare the following:

    • Transformation logic around properties of the data. For example, does a relational operation maintain sortedness of a field? Does an operation change the distribution of data?
    • How many input relations does an operation require?
    • Does the operator produce an output? (By specification, we limit relational operations to a single output at this time.)
    • What is the schema and field ordering of an output (see emit below)?

    Emit: Output Ordering

    A relational operation uses field references to access specific fields of the input stream. Field references are always ordinal based on the order of the incoming streams. Each relational operation must declare the order of its output data. To simplify things, each relational operation can be in one of two modes:

    1. Direct output: The order of outputs is based on the definition declared by the relational operation.
    2. Remap: A listed ordering of the direct outputs. This remapping can also be used to drop columns no longer used (such as a filter field or join keys after a join). Note that remapping/exclusion can only be done at the output’s root struct. Filtering of compound values or extracting subsets must be done through other operation types (e.g. projection).
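
The two emit modes can be illustrated with a small Python sketch. The function name `apply_emit`, the `output_mapping` parameter, and the column names are assumptions made for this example; they are not taken from the specification's serialized form.

```python
def apply_emit(direct_output, output_mapping=None):
    """Return the emitted columns of a relational operation.

    direct_output: columns in the order defined by the operation itself.
    output_mapping: optional list of ordinals into direct_output (remap mode).
    Both names are hypothetical and used only for illustration.
    """
    if output_mapping is None:
        # Direct mode: emit every column in the operation's own order.
        return list(direct_output)
    # Remap mode: reorder and/or drop columns by ordinal.
    return [direct_output[i] for i in output_mapping]


# Hypothetical direct output of a join: left columns followed by right columns.
direct = ["l_key", "l_value", "r_key", "r_value"]

print(apply_emit(direct))          # ['l_key', 'l_value', 'r_key', 'r_value']
print(apply_emit(direct, [1, 3]))  # ['l_value', 'r_value'] (join keys dropped)
print(apply_emit(direct, [3, 0]))  # ['r_value', 'l_key'] (reordered)
```

Because the mapping only indexes the root struct, it cannot reach inside a compound value; that still requires a projection.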

    Relation Properties

    There are a number of predefined properties that exist in Substrait relations. These include the following.

    Distribution

    When data is partitioned across multiple sibling sets, distribution describes the set of properties that apply to any one partition. This is based on a set of distribution expression properties. A distribution is declared as a set of one or more fields and a distribution type across all fields.

    Property | Description | Required
    Distribution Fields | List of field references that describe distribution (e.g. [0,2:4,5:0:0]). The order of these references does not impact results. | Required for partitioned distribution type. Disallowed for singleton distribution type.
    Distribution Type | PARTITIONED: For a discrete tuple of values for the declared distribution fields, all records with that tuple are located in the same partition. SINGLETON: there will only be a single partition for this operation. | Required
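
As a rough illustration of the PARTITIONED guarantee, the following Python sketch hash-partitions records on a set of distribution fields and then checks that every distinct tuple of field values lives in exactly one partition. It is a simplification under stated assumptions: the function names are hypothetical, records are plain tuples, and field references are top-level ordinals only (nested references such as 2:4 are not handled).

```python
from collections import defaultdict


def partition_records(records, distribution_fields, num_partitions):
    """Assign each record to a partition based on its distribution-field values."""
    partitions = defaultdict(list)
    for record in records:
        key = tuple(record[i] for i in distribution_fields)
        # Equal keys hash equally, so they always land in the same partition.
        partitions[hash(key) % num_partitions].append(record)
    return dict(partitions)


def is_partitioned_on(partitions, distribution_fields):
    """Check that each distinct key tuple appears in only one partition."""
    owner = {}
    for partition_id, records in partitions.items():
        for record in records:
            key = tuple(record[i] for i in distribution_fields)
            if owner.setdefault(key, partition_id) != partition_id:
                return False
    return True


records = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")]
parts = partition_records(records, distribution_fields=[0], num_partitions=3)
print(is_partitioned_on(parts, [0]))  # True

# A layout that splits key 1 across two partitions violates the guarantee.
print(is_partitioned_on({0: [(1, "a")], 1: [(1, "c")]}, [0]))  # False
```

A SINGLETON distribution corresponds to the degenerate case of a single partition, which is why it declares no distribution fields.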

    Orderedness

    A guarantee that data output from this operation is provided with a sort order. The sort order will be declared based on a set of sort field definitions based on the emitted output of this operation.

    Property | Description | Required
    Sort Fields | A list of fields that the data are ordered by. The list is in order of the sort. If we sort by [0,1] then this means we only consider the data for field 1 to be ordered within each discrete value of field 0. | At least one required.
    Per - Sort Field | A field reference that the data is sorted by. | Required
    Per - Sort Direction | The direction of the data. See direction options below. | Required

    Ordering Directions

    Direction | Descriptions | Nulls Position
    Ascending | Returns data in ascending order based on the quality function associated with the type. Nulls are included before any values. | First
    Descending | Returns data in descending order based on the quality function associated with the type. Nulls are included before any values. | First
    Ascending | Returns data in ascending order based on the quality function associated with the type. Nulls are included after any values. | Last
    Descending | Returns data in descending order based on the quality function associated with the type. Nulls are included after any values. | Last
    Custom function identifier | Returns data using a custom function that returns -1, 0, or 1 depending on the order of the data. | Per Function
    Clustered | Ensures that all equal values are coalesced (but no ordering between values is defined). E.g. for values 1,2,3,1,2,3, output could be any of the following: 1,1,2,2,3,3 or 1,1,3,3,2,2 or 2,2,1,1,3,3 or 2,2,3,3,1,1 or 3,3,1,1,2,2 or 3,3,2,2,1,1. | N/A, may appear anywhere but will be coalesced.
    Discussion Points
    • Should read definition types be more extensible in the same way that function signatures are? Are extensible read definition types necessary if we have custom relational operators?
    • How are decomposed reads expressed? For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object.

    Basics

    Substrait is designed to allow a user to describe arbitrarily complex data transformations. These transformations are composed of one or more relational operations. Relational operations are well-defined transformation operations that work by taking zero or more input datasets and transforming them into zero or more output datasets. Substrait defines a core set of transformations, but users are also able to extend the operations with their own specialized operations.

    Plans

    A plan is a tree of relations. The root of the tree is the final output of the plan. Each node in the tree is a relational operation. The children of a node are the inputs to the operation. The leaves of the tree are the input datasets to the plan.

    Plans can be composed together using reference relations. This allows for the construction of common plans that can be reused in multiple places. If a plan has no cycles (there is only one plan or each reference relation only references later plans) then the plan will form a DAG (Directed Acyclic Graph).

    Relational Operators

    Each relational operation is composed of several properties. Common properties for relational operations include the following:

    Property Description Type
    Emit The set of columns output from this operation and the order of those columns. Logical & Physical
    Hints A set of optionally provided, optionally consumed information about an operation that better informs execution. These might include estimated number of input and output records, estimated record size, likely filter reduction, estimated dictionary size, etc. These can also include implementation specific pieces of execution information. Physical
    Constraint A set of runtime constraints around the operation, limiting its consumption based on real-world resources (CPU, memory) as well as virtual resources like number of records produced, the largest record size, etc. Physical

    Relational Signatures

    In functions, function signatures are declared externally to the use of those signatures (function bindings). In the case of relational operations, signatures are declared directly in the specification. This is due to the speed of change and number of total operations. Relational operations in the specification are expected to be <100 for several years with additions being infrequent. On the other hand, there is an expectation of both a much larger number of functions (1,000s) and a much higher velocity of additions.

    Each relational operation must declare the following:

    • Transformation logic around properties of the data. For example, does a relational operation maintain sortedness of a field? Does an operation change the distribution of data?
    • How many input relations does an operation require?
    • Does the operator produce an output? (By specification, we limit relational operations to a single output at this time.)
    • What is the schema and field ordering of an output (see emit below)?

    Emit: Output Ordering

    A relational operation uses field references to access specific fields of the input stream. Field references are always ordinal based on the order of the incoming streams. Each relational operation must declare the order of its output data. To simplify things, each relational operation can be in one of two modes:

    1. Direct output: The order of outputs is based on the definition declared by the relational operation.
    2. Remap: A listed ordering of the direct outputs. This remapping can also be used to drop columns that are no longer used (such as a filter field or join keys after a join). Note that remapping/exclusion can only be done at the output's root struct. Filtering of compound values or extracting subsets must be done through other operation types (e.g. projection).
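
The two emit modes can be illustrated with a minimal Python sketch. It assumes a record is a flat sequence of root-level fields and that a remap is a list of ordinals into the direct output; the function name `apply_emit` and the sample row are hypothetical and not part of the specification.

```python
from typing import Optional, Sequence


def apply_emit(direct_output: Sequence, remap: Optional[Sequence[int]] = None) -> list:
    """Return the emitted fields for one record.

    direct_output: fields in the order declared by the relational operation.
    remap: ordinals into direct_output; None selects direct output mode.
    """
    if remap is None:
        # Direct output: keep every field in the declared order.
        return list(direct_output)
    # Remap: reorder and/or drop root-level fields by ordinal.
    return [direct_output[i] for i in remap]


# Illustrative direct output of a join: a key and a payload from each side.
row = ["k1", "a", "k1", "b"]
print(apply_emit(row))          # direct mode keeps all four fields
print(apply_emit(row, [1, 3]))  # remap drops both join keys
```

Because the remap only indexes the root struct, dropping a nested member of a compound field would still need a projection, as noted above.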

    Relation Properties

    There are a number of predefined properties that exist in Substrait relations. These include the following.

    Distribution

    When data is partitioned across multiple sibling sets, distribution describes that set of properties that apply to any one partition. This is based on a set of distribution expression properties. A distribution is declared as a set of one or more fields and a distribution type across all fields.

    Property Description Required
    Distribution Fields List of field references that describe distribution (e.g. [0,2:4,5:0:0]). The order of these references does not impact results. Required for partitioned distribution type. Disallowed for singleton distribution type.
    Distribution Type PARTITIONED: For a discrete tuple of values for the declared distribution fields, all records with that tuple are located in the same partition. SINGLETON: there will only be a single partition for this operation. Required
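
As a rough illustration of the PARTITIONED guarantee, the sketch below checks that no tuple of distribution-field values appears in more than one partition. It assumes records are plain tuples and that distribution fields are top-level ordinals only (nested references such as 2:4 are not handled); `is_partitioned` is a hypothetical helper name.

```python
from typing import Sequence


def is_partitioned(partitions: Sequence[Sequence[tuple]], fields: Sequence[int]) -> bool:
    """True if every distinct key tuple lives in exactly one partition."""
    owner = {}  # key tuple -> index of the partition that first held it
    for pid, records in enumerate(partitions):
        for record in records:
            key = tuple(record[f] for f in fields)
            if owner.setdefault(key, pid) != pid:
                return False
    return True


good = [[(1, "a"), (1, "b")], [(2, "c")]]
bad = [[(1, "a")], [(1, "b")]]
print(is_partitioned(good, [0]))  # field 0 values never span partitions
print(is_partitioned(bad, [0]))   # value 1 appears in two partitions
```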

    Orderedness

    A guarantee that data output from this operation is provided with a sort order. The sort order will be declared based on a set of sort field definitions based on the emitted output of this operation.

    Property Description Required
    Sort Fields A list of fields that the data are ordered by. The list is in order of the sort. If we sort by [0,1] then this means we only consider the data for field 1 to be ordered within each discrete value of field 0. At least one required.
    Per - Sort Field A field reference that the data is sorted by. Required
    Per - Sort Direction The direction of the data. See direction options below. Required

    Ordering Directions

    Direction Descriptions Nulls Position
    Ascending Returns data in ascending order based on the comparison function associated with the type. Nulls are included before any values. First
    Descending Returns data in descending order based on the comparison function associated with the type. Nulls are included before any values. First
    Ascending Returns data in ascending order based on the comparison function associated with the type. Nulls are included after any values. Last
    Descending Returns data in descending order based on the comparison function associated with the type. Nulls are included after any values. Last
    Custom function identifier Returns data using a custom function that returns -1, 0, or 1 depending on the order of the data. Per Function
    Clustered Ensures that all equal values are coalesced (but no ordering between values is defined). E.g. for values 1,2,3,1,2,3, output could be any of the following: 1,1,2,2,3,3 or 1,1,3,3,2,2 or 2,2,1,1,3,3 or 2,2,3,3,1,1 or 3,3,1,1,2,2 or 3,3,2,2,1,1. N/A, may appear anywhere but will be coalesced.
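
The direction options in the table can be sketched in Python as follows. This is only an illustration under simple assumptions: nulls are modeled as `None`, values are compared with Python's built-in ordering, and the helper names are hypothetical.

```python
from itertools import groupby


def sort_with_nulls(values, descending=False, nulls_first=True):
    """Sort values, placing None before or after all non-null values."""
    nulls = [v for v in values if v is None]
    rest = sorted((v for v in values if v is not None), reverse=descending)
    return nulls + rest if nulls_first else rest + nulls


def is_clustered(values):
    """True if equal values are adjacent; the order between groups is free."""
    # groupby yields one key per run of consecutive equal values.
    run_keys = [key for key, _ in groupby(values)]
    return len(run_keys) == len(set(run_keys))


print(sort_with_nulls([3, None, 1]))
print(sort_with_nulls([3, None, 1], descending=True, nulls_first=False))
print(is_clustered([2, 2, 1, 1, 3, 3]))
print(is_clustered([1, 2, 3, 1, 2, 3]))
```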
    Discussion Points
    • Should read definition types be more extensible in the same way that function signatures are? Are extensible read definition types necessary if we have custom relational operators?
    • How are decomposed reads expressed? For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object.

    Embedded Relations

    Pending.

    Embedded relations allow a Substrait producer to define a set operation that will be embedded in the plan.

    TODO: define lots of details about what interfaces, languages, formats, etc. Should reasonably be an extension of embedded user defined table functions.


    Embedded Relations

    Pending.

    Embedded relations allow a Substrait producer to define a set operation that will be embedded in the plan.

    TODO: define lots of details about what interfaces, languages, formats, etc. Should reasonably be an extension of embedded user defined table functions.

    \ No newline at end of file +-->
    \ No newline at end of file diff --git a/relations/logical_relations/index.html b/relations/logical_relations/index.html index 6bc2078..a463ce5 100644 --- a/relations/logical_relations/index.html +++ b/relations/logical_relations/index.html @@ -1,4 +1,4 @@ - Logical Relations - Substrait: Cross-Language Serialization for Relational Algebra

    Logical Relations

    Read Operator

    The read operator is an operator that produces one output. A simple example would be the reading of a Parquet file. It is expected that many types of reads will be added over time.

    Signature Value
    Inputs 0
    Outputs 1
    Property Maintenance N/A (no inputs)
    Direct Output Order Defaults to the schema of the data read after the optional projection (masked complex expression) is applied.

    Read Properties

    Property Description Required
    Definition The contents of the read property definition. Required
    Direct Schema Defines the schema of the output of the read (before any projection or emit remapping/hiding). Required
    Filter A boolean Substrait expression that describes a filter that must be applied to the data. The filter should be interpreted against the direct schema. Optional, defaults to none.
    Best Effort Filter A boolean Substrait expression that describes a filter that may be applied to the data. The filter should be interpreted against the direct schema. Optional, defaults to none.
    Projection A masked complex expression describing the portions of the content that should be read Optional, defaults to all of schema
    Output Properties Declaration of orderedness and/or distribution properties this read produces. Optional, defaults to no properties.
    Properties A list of name/value pairs associated with the read. Optional, defaults to empty

    Read Filtering

    The read relation has two different filter properties: a filter, which must be satisfied by the operator, and a best effort filter, which does not have to be satisfied. This reflects the way that consumers are often implemented. A consumer is often only able to fully apply a limited set of operations in the scan. There can then be an extended set of operations which a consumer can apply in a best effort fashion. A producer, when setting these two fields, should take care to only use expressions that the consumer is capable of handling.

    As an example, a consumer may only be able to fully apply (in the read relation) <, =, and > on integral types. The consumer may be able to apply <, =, and > in a best effort fashion on decimal and string types. Consider the filter expression my_int < 10 && my_string < "x" && upper(my_string) > "B". In this case the filter should be set to my_int < 10 and the best_effort_filter should be set to my_string < "x" and the remaining portion (upper(my_string) > "B") should be put into a filter relation.

    A filter expression must be interpreted against the direct schema before the projection expression has been applied. As a result, fields may be referenced by the filter expression which are not included in the relation’s output.

    Read Definition Types

    Adding new Read Definition Types

    If you have a read definition that’s not covered here, see the process for adding new read definition types.

    Read definition types (like the rest of the features in Substrait) are built by the community and added to the specification.

    Virtual Table

    A virtual table is a table whose contents are embedded in the plan itself. The table data is encoded as records consisting of literal values.

    Property Description Required
    Data Required Required

    Named Table

    A named table is a reference to data defined elsewhere. For example, there may be a catalog of tables with unique names that both the producer and consumer agree on. This catalog would provide the consumer with more information on how to retrieve the data.

    Property Description Required
    Names A list of namespaced strings that, together, form the table name Required (at least one)

    Files Type

    Property Description Required
    Items An array of Items (path or path glob) associated with the read. Required
    Format per item Enumeration of available formats. Only current option is PARQUET. Required
    Slicing parameters per item Information to use when reading a slice of a file. Optional
    Slicing Files

    A read operation is allowed to only read part of a file. This is convenient, for example, when distributing a read operation across several nodes. The slicing parameters are specified as byte offsets into the file.

    Many file formats consist of indivisible “chunks” of data (e.g. Parquet row groups). If this happens the consumer can determine which slice a particular chunk belongs to. For example, one possible approach is that a chunk should only be read if the midpoint of the chunk (dividing by 2 and rounding down) is contained within the asked-for byte range.

    message ReadRel {
    + Logical Relations - Substrait: Cross-Language Serialization for Relational Algebra      

    Logical Relations

    Read Operator

    The read operator is an operator that produces one output. A simple example would be the reading of a Parquet file. It is expected that many types of reads will be added over time.

    Signature Value
    Inputs 0
    Outputs 1
    Property Maintenance N/A (no inputs)
    Direct Output Order Defaults to the schema of the data read after the optional projection (masked complex expression) is applied.

    Read Properties

    Property Description Required
    Definition The contents of the read property definition. Required
    Direct Schema Defines the schema of the output of the read (before any projection or emit remapping/hiding). Required
    Filter A boolean Substrait expression that describes a filter that must be applied to the data. The filter should be interpreted against the direct schema. Optional, defaults to none.
    Best Effort Filter A boolean Substrait expression that describes a filter that may be applied to the data. The filter should be interpreted against the direct schema. Optional, defaults to none.
    Projection A masked complex expression describing the portions of the content that should be read Optional, defaults to all of schema
    Output Properties Declaration of orderedness and/or distribution properties this read produces. Optional, defaults to no properties.
    Properties A list of name/value pairs associated with the read. Optional, defaults to empty

    Read Filtering

    The read relation has two different filter properties: a filter, which must be satisfied by the operator, and a best effort filter, which does not have to be satisfied. This reflects the way that consumers are often implemented. A consumer is often only able to fully apply a limited set of operations in the scan. There can then be an extended set of operations which a consumer can apply in a best effort fashion. A producer, when setting these two fields, should take care to only use expressions that the consumer is capable of handling.

    As an example, a consumer may only be able to fully apply (in the read relation) <, =, and > on integral types. The consumer may be able to apply <, =, and > in a best effort fashion on decimal and string types. Consider the filter expression my_int < 10 && my_string < "x" && upper(my_string) > "B". In this case the filter should be set to my_int < 10 and the best_effort_filter should be set to my_string < "x" and the remaining portion (upper(my_string) > "B") should be put into a filter relation.
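
The split described in this example could be sketched on the producer side roughly as follows. Everything here is an assumption for illustration: the `Conjunct` class, the capability sets, and the string representation of expressions are hypothetical and do not correspond to any Substrait API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conjunct:
    text: str          # readable form of the predicate
    op: str            # root comparison operator
    operand_type: str  # type class of the compared value
    on_column: bool    # True if the operand is a direct field reference


# Hypothetical capabilities of the consumer in the example above.
COMPARISONS = {"<", "=", ">"}
EXACT_TYPES = {"integral"}
BEST_EFFORT_TYPES = {"decimal", "string"}


def split_filter(conjuncts):
    """Return (filter, best_effort_filter, residual for a filter relation)."""
    exact, best_effort, residual = [], [], []
    for c in conjuncts:
        pushable = c.on_column and c.op in COMPARISONS
        if pushable and c.operand_type in EXACT_TYPES:
            exact.append(c.text)
        elif pushable and c.operand_type in BEST_EFFORT_TYPES:
            best_effort.append(c.text)
        else:
            residual.append(c.text)
    return exact, best_effort, residual


conjuncts = [
    Conjunct('my_int < 10', "<", "integral", True),
    Conjunct('my_string < "x"', "<", "string", True),
    Conjunct('upper(my_string) > "B"', ">", "string", False),
]
print(split_filter(conjuncts))
```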

    A filter expression must be interpreted against the direct schema before the projection expression has been applied. As a result, fields may be referenced by the filter expression which are not included in the relation’s output.

    Read Definition Types

    Adding new Read Definition Types

    If you have a read definition that’s not covered here, see the process for adding new read definition types.

    Read definition types (like the rest of the features in Substrait) are built by the community and added to the specification.

    Virtual Table

    A virtual table is a table whose contents are embedded in the plan itself. The table data is encoded as records consisting of literal values.

    Property Description Required
    Data Required Required

    Named Table

    A named table is a reference to data defined elsewhere. For example, there may be a catalog of tables with unique names that both the producer and consumer agree on. This catalog would provide the consumer with more information on how to retrieve the data.

    Property Description Required
    Names A list of namespaced strings that, together, form the table name Required (at least one)

    Files Type

    Property Description Required
    Items An array of Items (path or path glob) associated with the read. Required
    Format per item Enumeration of available formats. Only current option is PARQUET. Required
    Slicing parameters per item Information to use when reading a slice of a file. Optional
    Slicing Files

    A read operation is allowed to only read part of a file. This is convenient, for example, when distributing a read operation across several nodes. The slicing parameters are specified as byte offsets into the file.

    Many file formats consist of indivisible “chunks” of data (e.g. Parquet row groups). If this happens the consumer can determine which slice a particular chunk belongs to. For example, one possible approach is that a chunk should only be read if the midpoint of the chunk (dividing by 2 and rounding down) is contained within the asked-for byte range.
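
The midpoint approach mentioned above might look like the following sketch. It assumes a slice is the half-open byte range [start, start + length) and that chunk boundaries are known from file metadata; the function name and the sample offsets are hypothetical.

```python
def chunk_in_slice(chunk_start: int, chunk_length: int,
                   slice_start: int, slice_length: int) -> bool:
    """True if the chunk's midpoint (rounded down) falls inside the slice."""
    midpoint = chunk_start + chunk_length // 2
    return slice_start <= midpoint < slice_start + slice_length


# Three chunks given as (start, length) and two slices that tile the file.
chunks = [(0, 100), (100, 100), (200, 50)]
slices = [(0, 128), (128, 122)]
for s_start, s_len in slices:
    owned = [c for c in chunks if chunk_in_slice(c[0], c[1], s_start, s_len)]
    print((s_start, s_len), owned)
```

When the slices tile the file without overlap, each chunk's midpoint lands in exactly one slice, so every chunk is read once.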

    message ReadRel {
       RelCommon common = 1;
       NamedStruct base_schema = 2;
       Expression filter = 3;
    diff --git a/relations/physical_relations/index.html b/relations/physical_relations/index.html
    index 6f15833..22569d8 100644
    --- a/relations/physical_relations/index.html
    +++ b/relations/physical_relations/index.html
    @@ -1,4 +1,4 @@
    - Physical Relations - Substrait: Cross-Language Serialization for Relational Algebra      

    Physical Relations

    There is no true distinction between logical and physical operations in Substrait. By convention, certain operations are classified as physical, but all operations can be potentially used in any kind of plan. A particular set of transformations or target operators may (by convention) be considered the “physical plan” but this is a characteristic of the system consuming Substrait as opposed to a definition within Substrait.

    Hash Equijoin Operator

    The hash equijoin operator will build a hash table out of the right input based on a set of join keys. It will then probe that hash table for incoming inputs, finding matches.

    Signature Value
    Inputs 2
    Outputs 1
    Property Maintenance Distribution is maintained. Orderedness of the left set is maintained in INNER join cases, otherwise it is eliminated.
    Direct Output Order Same as the Join operator.

    Hash Equijoin Properties

    Property Description Required
    Left Input A relational input.(Probe-side) Required
    Right Input A relational input.(Build-side) Required
    Left Keys References to the fields to join on in the left input. Required
    Right Keys References to the fields to join on in the right input. Required
    Post Join Predicate An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. Optional, defaults to true.
    Join Type One of the join types defined in the Join operator. Required
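
A minimal Python sketch of the build and probe phases follows, limited to the INNER join type. It assumes records are tuples, keys are lists of field ordinals, and the output is the left fields followed by the right fields; note that Python treats `None == None` as a match, which a real engine would typically not do for nulls. The function name is hypothetical.

```python
from collections import defaultdict


def hash_equijoin_inner(left, right, left_keys, right_keys, post_join_predicate=None):
    # Build phase: hash the right (build-side) input on its join keys.
    table = defaultdict(list)
    for r in right:
        table[tuple(r[k] for k in right_keys)].append(r)

    # Probe phase: stream the left (probe-side) input through the table.
    out = []
    for l in left:
        for r in table.get(tuple(l[k] for k in left_keys), []):
            row = l + r
            if post_join_predicate is None or post_join_predicate(row):
                out.append(row)
    return out


left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (3, "z")]
print(hash_equijoin_inner(left, right, [0], [0]))
```

Because the probe side is streamed in input order, the output keeps the left input's order, which matches the property maintenance noted for INNER joins.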

    NLJ (Nested Loop Join) Operator

    The nested loop join operator does a join by holding the entire right input and then iterating over it using the left input, evaluating the join expression on the Cartesian product of all rows, only outputting rows where the expression is true. It also includes non-matching rows in the OUTER, LEFT and RIGHT operations, per the join type requirements.

    Signature Value
    Inputs 2
    Outputs 1
    Property Maintenance Distribution is maintained. Orderedness is eliminated.
    Direct Output Order Same as the Join operator.

    NLJ Properties

    Property Description Required
    Left Input A relational input. Required
    Right Input A relational input. Required
    Join Expression A boolean condition that describes whether each record from the left set “matches” the record from the right set. Optional. Defaults to true (a Cartesian join).
    Join Type One of the join types defined in the Join operator. Required
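
The nested loop strategy can be sketched as below, covering only the INNER and LEFT join types. Records are assumed to be tuples, nulls are modeled as `None`, and the `right_width` parameter (the number of right-side fields, needed to pad unmatched left rows) is a hypothetical addition for this illustration.

```python
def nested_loop_join(left, right, right_width,
                     join_expression=lambda l, r: True, join_type="INNER"):
    if join_type not in ("INNER", "LEFT"):
        raise ValueError("this sketch only covers INNER and LEFT joins")
    right_rows = list(right)  # hold the entire right input
    out = []
    for l in left:
        matched = False
        for r in right_rows:
            if join_expression(l, r):
                matched = True
                out.append(l + r)
        if join_type == "LEFT" and not matched:
            # Emit the non-matching left row padded with nulls.
            out.append(l + (None,) * right_width)
    return out


left = [(1,), (2,)]
right = [(1, "x"), (3, "y")]
same_key = lambda l, r: l[0] == r[0]
print(nested_loop_join(left, right, 2, same_key))
print(nested_loop_join(left, right, 2, same_key, join_type="LEFT"))
print(len(nested_loop_join(left, right, 2)))  # default expression: Cartesian join
```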

    Merge Equijoin Operator

    The merge equijoin does a join by taking advantage of two sets that are sorted on the join keys. This allows the join operation to be done in a streaming fashion.

    Signature Value
    Inputs 2
    Outputs 1
    Property Maintenance Distribution is maintained. Orderedness is eliminated.
    Direct Output Order Same as the Join operator.

    Merge Join Properties

    Property Description Required
    Left Input A relational input. Required
    Right Input A relational input. Required
    Left Keys References to the fields to join on in the left input. Required
    Right Keys References to the fields to join on in the right input. Required
    Post Join Predicate An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. Optional, defaults to true.
    Join Type One of the join types defined in the Join operator. Required
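
A sketch of the merge strategy for the INNER join type is shown below. It assumes both inputs are lists of tuples already sorted ascending on their join keys and that key values are mutually comparable; indexing into lists stands in for the streaming cursors a real engine would use. The function name is hypothetical.

```python
def merge_equijoin_inner(left, right, left_keys, right_keys):
    lkey = lambda row: tuple(row[k] for k in left_keys)
    rkey = lambda row: tuple(row[k] for k in right_keys)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = lkey(left[i]), rkey(right[j])
        if lk < rk:
            i += 1  # left key is behind; advance the left cursor
        elif lk > rk:
            j += 1  # right key is behind; advance the right cursor
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and rkey(right[j_end]) == lk:
                j_end += 1
            # Pair every left row carrying this key with that run.
            while i < len(left) and lkey(left[i]) == lk:
                out.extend(left[i] + r for r in right[j:j_end])
                i += 1
            j = j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
print(merge_equijoin_inner(left, right, [0], [0]))
```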

    Exchange Operator

    The exchange operator will redistribute data based on an exchange type definition. Applying this operation will lead to an output that presents the desired distribution.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Orderedness is maintained. Distribution is overwritten based on configuration.
    Direct Output Order Order of the input.

    Exchange Types

    Type Description
    Scatter Distribute data using a system defined hashing function that considers one or more fields. For the same type of fields and same ordering of values, the same partition target should be identified for different ExchangeRels
    Single Bucket Define an expression that provides a single i32 bucket number. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition.
    Multi Bucket Define an expression that provides a List<i32> of bucket numbers. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition. The records should be sent to all bucket numbers provided by the expression.
    Broadcast Send all records to all partitions.
    Round Robin Send records to each target in sequence. Can follow either exact or approximate behavior. Approximate will attempt to balance the number of records sent to each destination but may not exactly distribute evenly and may send batches of records to each target before moving to the next.

    Exchange Properties

    Property Description Required
    Input The relational input. Required.
    Distribution Type One of the distribution types defined above. Required.
    Partition Count The number of partitions targeted for output. Optional. If not defined, implementation system should decide the number of partitions. Note that when not defined, single or multi bucket expressions should not be constrained to count.
    Expression Mapping Describes a relationship between each partition ID and the destination that partition should be sent to. Optional. A partition may be sent to 0..N locations. Value can either be a URI or arbitrary value.
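
How a target partition might be chosen for the bucket, broadcast, and round robin exchange types is sketched below. The function names and the `constrained` flag (standing for "the expression only returns values within the valid partition count") are hypothetical; scatter is omitted because its hashing function is system defined.

```python
def single_bucket_target(bucket: int, partition_count: int, constrained: bool = False) -> int:
    """Map a bucket number to a partition, applying modulo when unconstrained."""
    return bucket if constrained else bucket % partition_count


def multi_bucket_targets(buckets, partition_count, constrained=False):
    """A record is sent to every bucket the expression returns."""
    return [single_bucket_target(b, partition_count, constrained) for b in buckets]


def broadcast_targets(partition_count):
    """Every record goes to all partitions."""
    return list(range(partition_count))


def round_robin_targets(num_records, partition_count):
    """Exact round robin: records visit each target in sequence."""
    return [i % partition_count for i in range(num_records)]


print(single_bucket_target(10, 4))
print(multi_bucket_targets([1, 6, 11], 4))
print(broadcast_targets(3))
print(round_robin_targets(5, 3))
```

Python's `%` returns a non-negative result for a positive divisor, so negative bucket numbers still map to a valid partition in this sketch; other languages may need an explicit adjustment.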

    Merging Capture

    A receiving operation that will merge multiple ordered streams to maintain orderedness.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Orderedness and distribution are maintained.
    Direct Output Order Order of the input.

    Merging Capture Properties

    Property Description Required
    Blocking Whether the merging should block incoming data. Blocking should be used carefully, based on whether a deadlock can be produced. Optional, defaults to false
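
The order-preserving merge can be sketched with Python's standard `heapq.merge`, which lazily merges already-sorted iterables. The wrapper name `merging_capture` and the `sort_key` parameter are hypothetical; each incoming stream is assumed to be sorted ascending by that key.

```python
import heapq


def merging_capture(streams, sort_key=None):
    """Lazily merge several individually ordered streams into one ordered stream."""
    return heapq.merge(*streams, key=sort_key)


streams = [[1, 4, 7], [2, 5], [3, 6]]
print(list(merging_capture(streams)))

keyed = [[(1, "a"), (3, "c")], [(2, "b")]]
print(list(merging_capture(keyed, sort_key=lambda rec: rec[0])))
```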

    Simple Capture

    A receiving operation that will merge multiple streams in an arbitrary order.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Orderedness is eliminated after this operation. Distribution is maintained.
    Direct Output Order Order of the input.

    Simple Capture Properties

    Property Description Required
    Input The relational input. Required

    Top-N Operation

    The top-N operator reorders a dataset based on one or more identified sort fields as well as a sorting function. Rather than sort the entire dataset, the top-N will only maintain the total number of records required to ensure a limited output. A top-n is a combination of a logical sort and logical fetch operations.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Will update orderedness property to the output of the sort operation. Distribution property only remapped based on emit.
    Direct Output Order The field order of the input.

    Top-N Properties

    Property Description Required
    Input The relational input. Required
    Sort Fields List of one or more fields to sort by. Uses the same properties as the orderedness property. One sort field required
    Offset A positive integer. Declares the offset for retrieval of records. Optional, defaults to 0.
    Count A positive integer. Declares the number of records that should be returned. Required
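
One way to picture a top-N is the sketch below, which uses the standard `heapq.nsmallest` so that only `offset + count` records are retained rather than sorting the whole input. It handles a single ascending sort key only; the function name and parameters are hypothetical.

```python
import heapq


def top_n(records, sort_key, count, offset=0):
    """Return `count` records after skipping `offset`, in ascending key order."""
    # nsmallest returns its result sorted ascending by the key.
    kept = heapq.nsmallest(offset + count, records, key=sort_key)
    return kept[offset:]


data = [5, 1, 4, 2, 3]
print(top_n(data, sort_key=lambda x: x, count=2))
print(top_n(data, sort_key=lambda x: x, count=2, offset=1))
```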

    Hash Aggregate Operation

    The hash aggregate operation maintains a hash table for each grouping set to coalesce equivalent tuples.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Maintains distribution if all distribution fields are contained in every grouping set. No orderedness guaranteed.
    Direct Output Order Same as defined by Aggregate operation.

    Hash Aggregate Properties

    Property Description Required
    Input The relational input. Required
    Grouping Sets One or more grouping sets. Optional, required if no measures.
    Per Grouping Set A list of grouping expressions that the aggregation measures should be calculated for. Optional, defaults to 0.
    Measures A list of one or more aggregate expressions. Implementations may or may not support aggregate ordering expressions. Optional, required if no grouping sets.
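
A simplified picture of hash aggregation is sketched below, with one dictionary per grouping set. It assumes records are tuples, grouping expressions are plain field ordinals, and there is a single measure expressed as an initial value plus a step function; the output layout (set index, key, value) is chosen for illustration and is not the direct output order defined by the Aggregate operation.

```python
def hash_aggregate(records, grouping_sets, init, step):
    """Aggregate `records` once per grouping set using a hash table per set."""
    results = []
    for set_index, fields in enumerate(grouping_sets):
        table = {}  # key tuple -> running aggregate for this grouping set
        for rec in records:
            key = tuple(rec[f] for f in fields)
            table[key] = step(table.get(key, init), rec)
        results.extend((set_index, key, value) for key, value in table.items())
    return results


records = [("a", "x", 1), ("a", "y", 2), ("b", "x", 3)]
total = lambda acc, rec: acc + rec[2]  # a SUM over field 2
for row in hash_aggregate(records, [[0], [0, 1]], 0, total):
    print(row)
```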

    Streaming Aggregate Operation

    The streaming aggregate operation leverages data ordered by the grouping expressions to calculate each grouping set tuple-by-tuple in streaming fashion. All grouping sets and orderings requested on each aggregate must be compatible to allow multiple grouping sets or aggregate orderings.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Maintains distribution if all distribution fields are contained in every grouping set. Maintains input ordering.
    Direct Output Order Same as defined by Aggregate operation.

    Streaming Aggregate Properties

    Property Description Required
    Input The relational input. Required
    Grouping Sets One or more grouping sets. If multiple grouping sets are declared, sets must all be compatible with the input sortedness. Optional, required if no measures.
    Per Grouping Set A list of grouping expressions that the aggregation measures should be calculated for. Optional, defaults to 0.
    Measures A list of one or more aggregate expressions. Aggregate expressions ordering requirements must be compatible with expected ordering. Optional, required if no grouping sets.
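
For a single grouping set, the streaming approach can be sketched with the standard `itertools.groupby`, which groups consecutive records sharing a key and therefore relies on the input already being ordered by the grouping fields. The function name and the init/step representation of the measure are hypothetical.

```python
from itertools import groupby


def streaming_aggregate(sorted_records, group_fields, init, step):
    """Yield (key, aggregate) pairs while holding only one group's state at a time."""
    key = lambda rec: tuple(rec[f] for f in group_fields)
    for group_key, rows in groupby(sorted_records, key=key):
        acc = init
        for rec in rows:
            acc = step(acc, rec)
        yield group_key, acc


rows = [("a", 1), ("a", 2), ("b", 3)]
print(list(streaming_aggregate(rows, [0], 0, lambda acc, rec: acc + rec[1])))
```

Groups are emitted in the order they arrive, which is why the input ordering is maintained in the output.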

    Consistent Partition Window Operation

    A consistent partition window operation is a special type of project operation where every function is a window function and all of the window functions share the same sorting and partitioning. This allows for the sort and partition to be calculated once and shared between the various function evaluations.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Maintains distribution and ordering.
    Direct Output Order Same as Project operator (input followed by each window expression).

    Window Properties

    Property Description Required
    Input The relational input. Required
    Window Functions One or more window functions. At least one required.

    Expand Operation

    The expand operation creates duplicates of input records based on the Expand Fields. Each Expand Field can be a Switching Field or an expression. Switching Fields are described below. If an Expand Field is an expression then its value is consistent across all duplicate rows.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Distribution is maintained if all the distribution fields are consistent fields with direct references. Ordering can only be maintained down to the level of consistent fields that are kept.
    Direct Output Order The expand fields followed by an i32 column describing the index of the duplicate that the row is derived from.

    Expand Properties

    Property Description Required
    Input The relational input. Required
    Direct Fields Expressions describing the output fields. These refer to the schema of the input. Each Direct Field must be an expression or a Switching Field Required

    Switching Field Properties

    A switching field is a field whose value is different in each duplicated row. All switching fields in an Expand Operation must have the same number of duplicates.

    Property Description Required
    Duplicates List of one or more expressions. The output will contain a row for each expression. Required
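
The expand behavior can be sketched as follows. In this illustration an expression is a Python callable over the input record, a switching field is a list of such callables, and an expand with no switching fields is assumed to emit one row per input; these representations and the function name are hypothetical.

```python
def expand(records, expand_fields):
    """Emit one output row per duplicate, followed by the duplicate index."""
    lengths = {len(f) for f in expand_fields if isinstance(f, list)}
    if len(lengths) > 1:
        raise ValueError("all switching fields must have the same number of duplicates")
    duplicates = lengths.pop() if lengths else 1

    out = []
    for rec in records:
        for d in range(duplicates):
            row = tuple(
                f[d](rec) if isinstance(f, list) else f(rec)  # switching vs consistent
                for f in expand_fields
            )
            out.append(row + (d,))
    return out


fields = [
    lambda r: r[0],                    # consistent across duplicates
    [lambda r: r[1], lambda r: r[2]],  # switching field with two duplicates
]
print(expand([(1, 10, 20), (2, 30, 40)], fields))
```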

    Hashing Window Operation

    A window aggregate operation that will build hash tables for each distinct partition expression.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Maintains distribution. Eliminates ordering.
    Direct Output Order Same as Project operator (input followed by each window expression).

    Hashing Window Properties

    Property Description Required
    Input The relational input. Required
    Window Expressions One or more window expressions. At least one required.

    Streaming Window Operation

    A window aggregate operation that relies on a partition/ordering sorted input.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Maintains distribution. Eliminates ordering.
    Direct Output Order Same as Project operator (input followed by each window expression).

    Streaming Window Properties

    Property Description Required
    Input The relational input. Required
    Window Expressions One or more window expressions. Must be supported by the sortedness of the input. At least one required.

    Physical Relations

    There is no true distinction between logical and physical operations in Substrait. By convention, certain operations are classified as physical, but all operations can be potentially used in any kind of plan. A particular set of transformations or target operators may (by convention) be considered the “physical plan” but this is a characteristic of the system consuming Substrait as opposed to a definition within Substrait.

    Hash Equijoin Operator

    The hash equijoin operator builds a hash table out of the right input based on a set of join keys. It then probes that hash table with rows from the left input to find matches.

    Signature Value
    Inputs 2
    Outputs 1
    Property Maintenance Distribution is maintained. Orderedness of the left set is maintained in INNER join cases, otherwise it is eliminated.
    Direct Output Order Same as the Join operator.

    Hash Equijoin Properties

    Property Description Required
    Left Input A relational input (probe side). Required
    Right Input A relational input (build side). Required
    Left Keys References to the fields to join on in the left input. Required
    Right Keys References to the fields to join on in the right input. Required
    Post Join Predicate An additional expression that can be used to reduce the output of the join operation after the equality condition has been applied. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. Optional, defaults to true.
    Join Type One of the join types defined in the Join operator. Required
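
    The build and probe phases described above can be sketched in Python. This is only an illustration under stated assumptions: records are plain dicts, field names on the two sides do not collide, only the INNER join type is handled, and the name `hash_equijoin` is hypothetical rather than anything defined by Substrait.

```python
from collections import defaultdict


def hash_equijoin(left, right, left_keys, right_keys, post_join_predicate=None):
    """Illustrative INNER hash equijoin over lists of dict records."""
    # Build phase: index the right (build-side) input by its join-key values.
    table = defaultdict(list)
    for r in right:
        table[tuple(r[k] for k in right_keys)].append(r)

    # Probe phase: look up each left (probe-side) row in the hash table.
    out = []
    for l in left:
        for r in table.get(tuple(l[k] for k in left_keys), []):
            # A missing post-join predicate behaves like the default "true".
            if post_join_predicate is None or post_join_predicate(l, r):
                out.append({**l, **r})
    return out
```

    Because the left input is streamed in its original order, this sketch also shows why the orderedness of the left set can be maintained for INNER joins.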

    NLJ (Nested Loop Join) Operator

    The nested loop join operator performs a join by holding the entire right input and then iterating over it for each row of the left input, evaluating the join expression on the Cartesian product of all rows and outputting only the rows where the expression is true. It also includes non-matching rows for the OUTER, LEFT and RIGHT join types, as each join type requires.

    Signature Value
    Inputs 2
    Outputs 1
    Property Maintenance Distribution is maintained. Orderedness is eliminated.
    Direct Output Order Same as the Join operator.

    NLJ Properties

    Property Description Required
    Left Input A relational input. Required
    Right Input A relational input. Required
    Join Expression A boolean condition that describes whether each record from the left set “matches” the record from the right set. Optional. Defaults to true (a Cartesian join).
    Join Type One of the join types defined in the Join operator. Required
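
    A minimal Python sketch of the nested loop strategy follows. It assumes dict records with non-colliding field names and covers only the INNER and LEFT join types; the function name and the string join-type values are hypothetical choices made for this example.

```python
def nested_loop_join(left, right, join_expression=None, join_type="INNER"):
    """Illustrative nested loop join supporting INNER and LEFT only."""
    if join_type not in ("INNER", "LEFT"):
        raise ValueError("this sketch only handles INNER and LEFT joins")

    # Hold the entire right input in memory, as the operator description says.
    right_rows = list(right)
    right_fields = sorted({name for r in right_rows for name in r})

    out = []
    for l in left:
        matched = False
        for r in right_rows:
            # With no join expression the condition defaults to true,
            # which yields the Cartesian product.
            if join_expression is None or join_expression(l, r):
                matched = True
                out.append({**l, **r})
        if not matched and join_type == "LEFT":
            # Emit the unmatched left row padded with nulls for right fields.
            out.append({**l, **{name: None for name in right_fields}})
    return out
```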

    Merge Equijoin Operator

    The merge equijoin does a join by taking advantage of two sets that are sorted on the join keys. This allows the join operation to be done in a streaming fashion.

    Signature Value
    Inputs 2
    Outputs 1
    Property Maintenance Distribution is maintained. Orderedness is eliminated.
    Direct Output Order Same as the Join operator.

    Merge Join Properties

    Property Description Required
    Left Input A relational input. Required
    Right Input A relational input. Required
    Left Keys References to the fields to join on in the left input. Required
    Right Keys References to the fields to join on in the right input. Required
    Post Join Predicate An additional expression that can be used to reduce the output of the join operation after the equality condition has been applied. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. Optional, defaults to true.
    Join Type One of the join types defined in the Join operator. Required
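
    The streaming behavior of a merge equijoin can be illustrated with a two-cursor walk over inputs that are already sorted on their join keys. The sketch below is an assumption-laden example (single join key per side, INNER join only, dict records, hypothetical function name), not the way any particular engine implements it.

```python
def merge_equijoin(left, right, left_key, right_key):
    """Illustrative INNER merge join; both inputs must be sorted on their key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1  # left key is behind: advance the left cursor
        elif lk > rk:
            j += 1  # right key is behind: advance the right cursor
        else:
            # Find the run of right rows that share this key value.
            j_end = j
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            # Pair every left row in the matching run with that right run.
            while i < len(left) and left[i][left_key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out
```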

    Exchange Operator

    The exchange operator will redistribute data based on an exchange type definition. Applying this operation will lead to an output that presents the desired distribution.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Orderedness is maintained. Distribution is overwritten based on configuration.
    Direct Output Order Order of the input.

    Exchange Types

    Type Description
    Scatter Distribute data using a system-defined hashing function that considers one or more fields. For the same type of fields and same ordering of values, the same partition target should be identified for different ExchangeRels.
    Single Bucket Define an expression that provides a single i32 bucket number. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition.
    Multi Bucket Define an expression that provides a List<i32> of bucket numbers. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition. The records should be sent to all bucket numbers provided by the expression.
    Broadcast Send all records to all partitions.
    Round Robin Send records to each target in sequence. Can follow either exact or approximate behavior. Approximate will attempt to balance the number of records sent to each destination but may not exactly distribute evenly and may send batches of records to each target before moving to the next.

    Exchange Properties

    Property Description Required
    Input The relational input. Required.
    Distribution Type One of the distribution types defined above. Required.
    Partition Count The number of partitions targeted for output. Optional. If not defined, implementation system should decide the number of partitions. Note that when not defined, single or multi bucket expressions should not be constrained to count.
    Expression Mapping Describes a relationship between each partition ID and the destination that partition should be sent to. Optional. A partition may be sent to 0..N locations. Value can either be a URI or arbitrary value.
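
    To make the exchange types concrete, the following Python sketch computes target partitions for each of them. All function names are hypothetical, and the use of CRC32 for the scatter case is only a stand-in for whatever "system defined hashing function" an implementation actually chooses.

```python
import zlib


def scatter_target(record, fields, partition_count):
    # Stand-in hash: CRC32 over the repr of the chosen field values.
    key = repr(tuple(record[f] for f in fields)).encode("utf-8")
    return zlib.crc32(key) % partition_count


def single_bucket_target(bucket, partition_count):
    # Apply modulo when the expression is not constrained to the partition count.
    return bucket % partition_count


def multi_bucket_targets(buckets, partition_count):
    # The record is sent to every bucket produced by the expression.
    return [b % partition_count for b in buckets]


def broadcast_targets(partition_count):
    # Every record goes to every partition.
    return list(range(partition_count))


def round_robin_targets(record_count, partition_count):
    # Exact round robin: cycle through the targets one record at a time.
    return [i % partition_count for i in range(record_count)]
```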

    Merging Capture

    A receiving operation that will merge multiple ordered streams to maintain orderedness.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Orderedness and distribution are maintained.
    Direct Output Order Order of the input.

    Merging Capture Properties

    Property Description Required
    Blocking Whether the merging should block incoming data. Blocking should be used carefully, based on whether a deadlock can be produced. Optional, defaults to false
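
    As a rough illustration of merging ordered streams while preserving orderedness, the sketch below uses Python's `heapq.merge`. The function name `merging_capture` and the representation of streams as in-memory lists are assumptions made for this example.

```python
import heapq


def merging_capture(streams, key=None):
    """Merge several individually sorted streams into one sorted output."""
    # heapq.merge consumes the inputs lazily and always yields the smallest
    # pending element, so the combined output stays ordered.
    return list(heapq.merge(*streams, key=key))
```

    A simple capture, by contrast, could concatenate or interleave the streams in any order, which is why it cannot promise an ordering.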

    Simple Capture

    A receiving operation that will merge multiple streams in an arbitrary order.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Orderedness is eliminated by this operation. Distribution is maintained.
    Direct Output Order Order of the input.

    Simple Capture Properties

    Property Description Required
    Input The relational input. Required

    Top-N Operation

    The top-N operator reorders a dataset based on one or more identified sort fields as well as a sorting function. Rather than sort the entire dataset, the top-N will only maintain the total number of records required to ensure a limited output. A top-N is a combination of a logical sort and a logical fetch operation.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Will update orderedness property to the output of the sort operation. Distribution property only remapped based on emit.
    Direct Output Order The field order of the input.

    Top-N Properties

    Property Description Required
    Input The relational input. Required
    Sort Fields List of one or more fields to sort by. Uses the same properties as the orderedness property. One sort field required
    Offset A positive integer. Declares the offset for retrieval of records. Optional, defaults to 0.
    Count A positive integer. Declares the number of records that should be returned. Required
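
    The idea of keeping only as many records as the limited output needs can be sketched with a bounded heap. This example assumes an ascending sort on a single key function; `top_n` is a hypothetical name.

```python
import heapq


def top_n(rows, sort_key, count, offset=0):
    """Return `count` rows after skipping `offset` rows in sort order."""
    # nsmallest tracks only offset + count candidates instead of sorting
    # the entire input, and returns them in ascending order.
    kept = heapq.nsmallest(offset + count, rows, key=sort_key)
    return kept[offset:]
```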

    Hash Aggregate Operation

    The hash aggregate operation maintains a hash table for each grouping set to coalesce equivalent tuples.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Maintains distribution if all distribution fields are contained in every grouping set. No orderedness guaranteed.
    Direct Output Order Same as defined by Aggregate operation.

    Hash Aggregate Properties

    Property Description Required
    Input The relational input. Required
    Grouping Sets One or more grouping sets. Optional, required if no measures.
    Per Grouping Set A list of expression groupings that the aggregation measures should be calculated for. Optional, defaults to 0.
    Measures A list of one or more aggregate expressions. Implementations may or may not support aggregate ordering expressions. Optional, required if no grouping sets.
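
    The following Python sketch shows one hash table per grouping set, with a single SUM measure. The output record shape (`grouping_set`, `key`, `sum`) and the function name are assumptions for illustration only; the actual output layout is defined by the Aggregate operation.

```python
def hash_aggregate(rows, grouping_sets, measure_field):
    """Illustrative SUM aggregation with one hash table per grouping set."""
    out = []
    for set_index, fields in enumerate(grouping_sets):
        table = {}  # hash table for this grouping set
        for row in rows:
            key = tuple(row[f] for f in fields)
            # Coalesce equivalent tuples by accumulating into the same slot.
            table[key] = table.get(key, 0) + row[measure_field]
        for key, total in table.items():
            out.append({"grouping_set": set_index, "key": key, "sum": total})
    return out
```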

    Streaming Aggregate Operation

    The streaming aggregate operation leverages data ordered by the grouping expressions to calculate each grouping set tuple-by-tuple in a streaming fashion. All grouping sets and orderings requested on each aggregate must be compatible to allow multiple grouping sets or aggregate orderings.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Maintains distribution if all distribution fields are contained in every grouping set. Maintains input ordering.
    Direct Output Order Same as defined by Aggregate operation.

    Streaming Aggregate Properties

    Property Description Required
    Input The relational input. Required
    Grouping Sets One or more grouping sets. If multiple grouping sets are declared, sets must all be compatible with the input sortedness. Optional, required if no measures.
    Per Grouping Set A list of expression groupings that the aggregation measures should be calculated for. Optional, defaults to 0.
    Measures A list of one or more aggregate expressions. The ordering requirements of the aggregate expressions must be compatible with the expected ordering. Optional, required if no grouping sets.
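
    When the input is already sorted by the grouping fields, each group can be finished as soon as the key changes, with no hash table. The sketch below assumes a single grouping set and a SUM measure; `streaming_aggregate` is a hypothetical name.

```python
from itertools import groupby


def streaming_aggregate(sorted_rows, group_fields, measure_field):
    """Yield (key, sum) pairs from rows pre-sorted by the grouping fields."""

    def key_of(row):
        return tuple(row[f] for f in group_fields)

    # groupby only groups adjacent rows, which is exactly why the input
    # must already be ordered by the grouping expressions.
    for key, group in groupby(sorted_rows, key=key_of):
        yield key, sum(row[measure_field] for row in group)
```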

    Consistent Partition Window Operation

    A consistent partition window operation is a special type of project operation where every function is a window function and all of the window functions share the same sorting and partitioning. This allows for the sort and partition to be calculated once and shared between the various function evaluations.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Maintains distribution and ordering.
    Direct Output Order Same as Project operator (input followed by each window expression).

    Window Properties

    Property Description Required
    Input The relational input. Required
    Window Functions One or more window functions. At least one required.

    Expand Operation

    The expand operation creates duplicates of input records based on the Expand Fields. Each Expand Field can be a Switching Field or an expression. Switching Fields are described below. If an Expand Field is an expression then its value is consistent across all duplicate rows.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Distribution is maintained if all the distribution fields are consistent fields with direct references. Ordering can only be maintained down to the level of consistent fields that are kept.
    Direct Output Order The expand fields followed by an i32 column describing the index of the duplicate that the row is derived from.

    Expand Properties

    Property Description Required
    Input The relational input. Required
    Direct Fields Expressions describing the output fields. These refer to the schema of the input. Each Direct Field must be an expression or a Switching Field Required

    Switching Field Properties

    A switching field is a field whose value is different in each duplicated row. All switching fields in an Expand Operation must have the same number of duplicates.

    Property Description Required
    Duplicates List of one or more expressions. The output will contain a row for each expression. Required
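
    A small Python sketch may help show how consistent and switching fields combine. Here a consistent field is modeled as a callable and a switching field as a list of callables (one per duplicate); this encoding and the name `expand` are assumptions for the example and not part of the specification.

```python
def expand(rows, expand_fields):
    """Duplicate each row once per switching-field expression."""
    switching = [f for f in expand_fields if isinstance(f, list)]
    counts = {len(f) for f in switching}
    if len(counts) > 1:
        raise ValueError("all switching fields need the same number of duplicates")
    duplicates = counts.pop() if counts else 1

    out = []
    for row in rows:
        for i in range(duplicates):
            values = [
                f[i](row) if isinstance(f, list) else f(row)  # switching vs. consistent
                for f in expand_fields
            ]
            # Output order: the expand fields, then the duplicate index.
            out.append(tuple(values) + (i,))
    return out
```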

    Hashing Window Operation

    A window aggregate operation that will build hash tables for each distinct partition expression.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Maintains distribution. Eliminates ordering.
    Direct Output Order Same as Project operator (input followed by each window expression).

    Hashing Window Properties

    Property Description Required
    Input The relational input. Required
    Window Expressions One or more window expressions. At least one required.

    Streaming Window Operation

    A window aggregate operation that relies on a partition/ordering sorted input.

    Signature Value
    Inputs 1
    Outputs 1
    Property Maintenance Maintains distribution. Eliminates ordering.
    Direct Output Order Same as Project operator (input followed by each window expression).

    Streaming Window Properties

    Property Description Required
    Input The relational input. Required
    Window Expressions One or more window expressions. Must be supported by the sortedness of the input. At least one required.

    Basics

    Substrait is designed to be serialized into various formats. Currently we support a binary serialization for transmission of plans between programs (e.g. IPC or network communication) and a text serialization for debugging and human readability. Other formats may be added in the future.

    These formats serialize a collection of plans. Substrait does not define how a collection of plans is to be interpreted. For example, the following scenarios are all valid uses of a collection of plans:

    • A query engine receives a plan and executes it. It receives a collection of plans with a single root plan. The top-level node of the root plan defines the output of the query. Non-root plans may be included as common subplans which are referenced from the root plan.
    • A transpiler may convert plans from one dialect to another. It could take, as input, a single root plan. Then it could output a serialized binary containing multiple root plans. Each root plan is a representation of the input plan in a different dialect.
    • A distributed scheduler might expect 1+ root plans. Each root plan describes a different stage of computation.

    Libraries should make sure to thoroughly describe the way plan collections will be produced or consumed.

    Root plans

    We often refer to query plans as a graph of nodes (typically a DAG unless the query is recursive). However, we encode this graph as a collection of trees with a single root tree that references other trees (which may also transitively reference other trees). Plan serializations all have some way to indicate which plan(s) are “root” plans. Any plan that is not a root plan and is not referenced (directly or transitively) by some root plan can safely be ignored.
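
    The rule that unreferenced non-root plans can be ignored amounts to a reachability check. The sketch below assumes plans are identified by string names and that references are given as a dict from a plan to the plans it references; both the representation and the function name are hypothetical.

```python
def reachable_plans(references, roots):
    """Return every plan reachable, directly or transitively, from the roots."""
    seen = set()
    stack = list(roots)
    while stack:
        plan = stack.pop()
        if plan in seen:
            continue
        seen.add(plan)
        # Follow references to other trees in the collection.
        stack.extend(references.get(plan, ()))
    return seen
```

    Any plan in the collection that is not in the returned set is neither a root nor referenced by one, so a consumer could safely skip it.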

    \ No newline at end of file diff --git a/serialization/binary_serialization/index.html b/serialization/binary_serialization/index.html index ee6b0c0..8bf27c8 100644 --- a/serialization/binary_serialization/index.html +++ b/serialization/binary_serialization/index.html @@ -1,4 +1,4 @@ - Binary Serialization - Substrait: Cross-Language Serialization for Relational Algebra

    Binary Serialization

    Substrait can be serialized into a protobuf-based binary representation. The proto schema/IDL files can be found on GitHub. Proto files are placed in the io.substrait namespace for C++/Java and the Substrait.Protobuf namespace for C#.

    Plan

    The main top-level object used to communicate a Substrait plan using protobuf is a Plan message (see the ExtendedExpression for an alternative top-level object). The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees.

    message Plan {
    + Binary Serialization - Substrait: Cross-Language Serialization for Relational Algebra      

    Binary Serialization

    Substrait can be serialized into a protobuf-based binary representation. The proto schema/IDL files can be found on GitHub. Proto files are placed in the io.substrait namespace for C++/Java and the Substrait.Protobuf namespace for C#.

    Plan

    The main top-level object used to communicate a Substrait plan using protobuf is a Plan message (see the ExtendedExpression for an alternative top-level object). The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees.

    message Plan {
       // Substrait version of the plan. Optional up to 0.17.0, required for later
       // versions.
       Version version = 6;
    @@ -126,4 +126,4 @@
       LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
       FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
       IN THE SOFTWARE.
    --->   
    \ No newline at end of file +-->
    \ No newline at end of file diff --git a/serialization/text_serialization/index.html b/serialization/text_serialization/index.html index 0ba0a51..dd0ee78 100644 --- a/serialization/text_serialization/index.html +++ b/serialization/text_serialization/index.html @@ -1,4 +1,4 @@ - Text Serialization - Substrait: Cross-Language Serialization for Relational Algebra

    Text Serialization

    To maximize the new user experience, it is important for Substrait to have a text representation of plans. This allows people to experiment with basic tooling. Building simple CLI tools that do things like SQL > Plan and Plan > SQL or REPL plan construction can all be done relatively straightforwardly with a text representation.

    The recommended text serialization format is JSON. Since the text format is not designed for performance, the format can be produced to maximize readability. This also allows nice symmetry between the construction of plans and the configuration of various extensions such as function signatures and user defined types.

    To ensure the JSON is valid, the object will be defined using the OpenAPI 3.1 specification. This not only allows strong validation; the OpenAPI specification also enables code generators to be easily used to produce plans in many languages.

    While JSON will be used for much of the plan serialization, Substrait uses a simple custom grammar for record-level expressions. While one can construct an equation such as (10 + 5)/2 using a tree of function and literal objects, a plan is much more human-readable when the information is written similarly to the way one typically reads scalar expressions. This grammar will be maintained in an ANTLR grammar (targetable to multiple programming languages) and is also planned to be supported via a JSON schema definition format tag so that the grammar can be validated as part of the schema validation.
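
    To see why a tree of function and literal objects is harder to read than (10 + 5)/2, consider the Python sketch below. The object layout (`function`, `args`, `literal`) and the function names `add` and `divide` are invented for this illustration and do not reflect the actual Substrait JSON schema.

```python
import json
import operator

# Hypothetical tree encoding of (10 + 5) / 2.
EXPRESSION = {
    "function": "divide",
    "args": [
        {"function": "add", "args": [{"literal": 10}, {"literal": 5}]},
        {"literal": 2},
    ],
}

# Hypothetical mapping from function names to implementations.
FUNCTIONS = {"add": operator.add, "divide": operator.truediv}


def evaluate(node):
    """Recursively evaluate a function/literal tree."""
    if "literal" in node:
        return node["literal"]
    args = [evaluate(arg) for arg in node["args"]]
    return FUNCTIONS[node["function"]](*args)


# The JSON text form is considerably longer than the infix expression.
EXPRESSION_JSON = json.dumps(EXPRESSION)
```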


    Extending

    Substrait is a community project and requires consensus about new additions to the specification in order to maintain consistency. The best way to get consensus is to discuss ideas. The main ways to communicate are:

    • Substrait Mailing List
    • Substrait Slack
    • Community Meeting

    Minor changes

    Simple changes like typos and bug fixes do not require as much effort. File an issue or send a PR and we can discuss it there.

    Complex changes

    For complex features it is useful to discuss the change first. It will be useful to gather some background information to help get everyone on the same page.

    Outline the issue

    Language

    Every engine has its own terminology. Every Spark user probably knows what an “attribute” is, and Velox users will know what a “RowVector” means. However, Substrait is used by people who come from a variety of backgrounds, and you should generally assume that its users do not know anything about your own implementation. As a result, all PRs and discussion should endeavor to use Substrait terminology wherever possible.

    Motivation

    What problems does this relation solve? If it is a more logical relation then how does it allow users to express new capabilities? If it is more of an internal relation then how does it map to existing logical relations? How is it different than other existing relations? Why do we need this?

    Examples

    Provide example input and output for the relation. Show example plans. Try to motivate your examples, as best as possible, with something that looks like a real world problem. These will go a long way towards helping others understand the purpose of a relation.

    Alternatives

    Discuss what alternatives are out there. Are there other ways to achieve similar results? Do some systems handle this problem differently?

    Survey existing implementation

    It’s unlikely that this is the first time that this has been done. Figuring out

    Prototype the feature

    Novel approaches should be implemented as an extension first.

    Substrait design principles

    Substrait is designed around interoperability, so a feature only used by a single system may not be accepted. But don’t despair! Substrait has a highly developed extension system for this express purpose.

    You don’t have to do it alone

    If you are hoping to add a feature and these criteria seem intimidating then feel free to start a mailing list discussion before you have all the information and ask for help. Investigating other implementations, in particular, is something that can be quite difficult to do on your own.


    Specification

    Status

    The specification has passed the initial design phase and is now in the final stages of being fleshed out. The community is encouraged to identify (and address) any perceived gaps in functionality using GitHub issues and PRs. Once all of the planned implementations have been completed all deprecated fields will be eliminated and version 1.0 will be released.

    Components (Complete)

    Section Description
    Simple Types A way to describe the set of basic types that will be operated on within a plan. Only includes simple types such as integers and doubles (nothing configurable or compound).
    Compound Types Expression of types that go beyond simple scalar values. Key concepts here include: configurable types such as fixed length and numeric types as well as compound types such as structs, maps, lists, etc.
    Type Variations Physical variations to base types.
    User Defined Types Extensions that can be defined for specific IR producers/consumers.
    Field References Expressions to identify which portions of a record should be operated on.
    Scalar Functions Description of how functions are specified. Concepts include arguments, variadic functions, output type derivation, etc.
    Scalar Function List A list of well-known canonical functions in YAML format.
    Specialized Record Expressions Specialized expression types that are more naturally expressed outside the function paradigm. Examples include items such as if/then/else and switch statements.
    Aggregate Functions Functions that are expressed in aggregation operations. Examples include things such as SUM, COUNT, etc. Operations take many records and collapse them into a single (possibly compound) value.
    Window Functions Functions that relate a record to a set of encompassing records. Examples in SQL include RANK, NTILE, etc.
    User Defined Functions Reusable named functions that are built beyond the core specification. Implementations are typically registered through external means (drop a file in a directory, send a special command with implementation, etc.)
    Embedded Functions Function implementations embedded directly within the plan. Frequently used in data science workflows where business logic is interspersed with standard operations.
    Relation Basics Basic concepts around relational algebra, record emit and properties.
    Logical Relations Common relational operations used in compute plans including project, join, aggregation, etc.
    Text Serialization A human producible & consumable representation of the plan specification.
    Binary Serialization A high performance & compact binary representation of the plan specification.

    Components (Designed but not Implemented)

    Section Description
    Table Functions Functions that convert one or more values from an input record into 0..N output records. Examples include operations such as explode, pos-explode, etc.
    User Defined Relations Installed and reusable relational operations customized to a particular platform.
    Embedded Relations Relational operations where plans contain the “machine code” to directly execute the necessary operations.
    Physical Relations Specific execution sub-variations of common relational operations, where multiple unique physical variants are associated with a single logical operation. Examples include hash join, merge join, nested loop join, etc.

    Technology Principles

    • Provide a good suite of well-specified common functionality in databases and data science applications.
    • Make it easy for users to privately or publicly extend the representation to support specialized/custom operations.
    • Produce something that is language agnostic and requires minimal work to start developing against in a new language.
    • Drive towards a common format that avoids specialization for any single favored producer or consumer.
    • Establish clear delineation between specifications that MUST be respected and those that can be optionally ignored.
    • Establish a forgiving compatibility approach and versioning scheme that supports cross-version compatibility in the maximum number of cases.
    • Minimize the need for consumer intelligence by excluding concepts like overloading, type coercion, implicit casting, field name handling, etc. (Note: this is weak and should be better stated.)
    • Decomposability/severability: A particular producer or consumer should be able to produce or consume only a subset of the specification and interact well with any other Substrait system, as long as the specific operations requested fit within the subset of the specification supported by the counterpart system.

    Versioning

    As an interface specification, the goal of Substrait is to reach a point where (breaking) changes will never need to happen again, or at least be few and far between. By analogy, Apache Arrow’s in-memory format specification has stayed functionally constant, despite many major library versions being released. However, we’re not there yet. When we believe that we’ve reached this point, we will signal this by releasing version 1.0.0. Until then, we will remain in the 0.x.x version regime.

    Despite this, we strive to maintain backward compatibility for both the binary representation and the text representation by means of deprecation. When a breaking change cannot be reasonably avoided, we may remove previously deprecated fields. All deprecated fields will be removed for the 1.0.0 release.

    Substrait uses semantic versioning for its version numbers, with the addition that, during 0.x.y, we increment the x digit for breaking changes and new features, and the y digit for fixes and other nonfunctional changes. The release process is currently automated and makes a new release every week, provided something has changed on the main branch since the previous release. This release cadence will likely be slowed down as stability increases over time. Conventional commits are used to distinguish between breaking changes, new features, and fixes, and GitHub Actions are used to verify that there are indeed no breaking protobuf changes in a commit, unless the commit message states this.
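
The 0.x.y bump rule described above can be sketched in a few lines. This is an illustrative sketch only, not the project's release tooling: the function name `next_version` and the change labels `"breaking"`, `"feature"`, and `"fix"` are hypothetical, and resetting y to 0 when x is incremented is an assumption borrowed from ordinary semantic versioning.

```python
def next_version(current: str, change: str) -> str:
    """Return the next 0.x.y version for one change kind (hypothetical helper)."""
    major, minor, patch = (int(part) for part in current.split("."))
    if major != 0:
        raise ValueError("this sketch only covers the 0.x.y regime")
    if change in ("breaking", "feature"):
        # Breaking changes and new features both increment x.
        # Resetting y to 0 is an assumption, not stated in the text.
        return f"0.{minor + 1}.0"
    if change == "fix":
        # Fixes and other nonfunctional changes increment y.
        return f"0.{minor}.{patch + 1}"
    raise ValueError(f"unknown change kind: {change!r}")
```

For example, `next_version("0.41.3", "fix")` gives `"0.41.4"`, while a breaking change or a new feature moves the same version to `"0.42.0"`.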

    Third Party Tools

    Substrait-tools

    The substrait-tools Python package provides a command line interface for producing and consuming Substrait plans by leveraging the APIs of different producers and consumers.

    Substrait Fiddle

    Substrait Fiddle is an online tool to share, debug, and prototype Substrait plans.

    The Substrait Fiddle source is available, allowing it to be run in any environment.

    SQL to Substrait tutorial

    This is an introductory tutorial to learn the basics of Substrait for readers already familiar with SQL. We will look at how to construct a Substrait plan from an example query.

    We’ll present the Substrait in JSON form to make it relatively readable to newcomers. Typically Substrait is exchanged as a protobuf message, but for debugging purposes it is often helpful to look at a serialized form. Plus, it’s not uncommon for unit tests to represent plans as JSON strings. So if you are developing with Substrait, it’s useful to have experience reading them.

    Note

    Substrait is currently only defined with Protobuf. The JSON provided here is the Protobuf JSON output, but it is not the official Substrait text format. Eventually, Substrait will define its own human-readable text format, but for now this tutorial will make do with what Protobuf provides.

    Substrait is designed to communicate plans (mostly logical plans). Those plans contain types, schemas, expressions, extensions, and relations. We’ll look at them in that order, going from simplest to most complex until we can construct full plans.

    This tutorial won’t cover all the details of each piece, but it will give you an idea of how they connect together. For a detailed reference of each individual field, the best place to look is the protobuf definitions. They represent the source-of-truth of the spec and are well-commented to address ambiguities.

    Problem Set up

    To learn Substrait, we’ll build up to a specific query. We’ll be using the tables:

    CREATE TABLE orders (
    + SQL to Substrait tutorial - Substrait: Cross-Language Serialization for Relational Algebra      

    SQL to Substrait tutorial

    This is an introductory tutorial to learn the basics of Substrait for readers already familiar with SQL. We will look at how to construct a Substrait plan from an example query.

    We’ll present the Substrait in JSON form to make it relatively readable to newcomers. Typically Substrait is exchanged as a protobuf message, but for debugging purposes it is often helpful to look at a serialized form. Plus, it’s not uncommon for unit tests to represent plans as JSON strings. So if you are developing with Substrait, it’s useful to have experience reading them.

    Note

    Substrait is currently only defined with Protobuf. The JSON provided here is the Protobuf JSON output, but it is not the official Substrait text format. Eventually, Substrait will define its own human-readable text format, but for now this tutorial will make do with what Protobuf provides.

    Substrait is designed to communicate plans (mostly logical plans). Those plans contain types, schemas, expressions, extensions, and relations. We’ll look at them in that order, going from simplest to most complex until we can construct full plans.

    This tutorial won’t cover all the details of each piece, but it will give you an idea of how they connect together. For a detailed reference of each individual field, the best place to look is the protobuf definitions. They represent the source-of-truth of the spec and are well-commented to address ambiguities.

    Problem Set up

    To learn Substrait, we’ll build up to a specific query. We’ll be using the tables:

    CREATE TABLE orders (
       product_id: i64 NOT NULL,
       quantity: i32 NOT NULL,
       order_date: date NOT NULL,
    diff --git a/types/type_classes/index.html b/types/type_classes/index.html
    index ff8e3fa..dafc746 100644
    --- a/types/type_classes/index.html
    +++ b/types/type_classes/index.html
    @@ -1,4 +1,4 @@
    - Type Classes - Substrait: Cross-Language Serialization for Relational Algebra      

    Type Classes

    In Substrait, the “class” of a type, not to be confused with the concept from object-oriented programming, defines the set of non-null values that instances of a type may assume.

    Implementations of a Substrait type must support at least this set of values, but may include more; for example, an i8 could be represented using the same in-memory format as an i32, as long as functions operating on i8 values within [-128..127] behave as specified (in this case, this means 8-bit overflow must work as expected). Operating on values outside the specified range is unspecified behavior.

    Simple Types

    Simple type classes are those that don’t support any form of configuration. For simplicity, any generic type that has only a small number of discrete implementations is declared directly, as opposed to via configuration.

    Type Name Description Protobuf representation for literals
    boolean A value that is either True or False. bool
    i8 A signed integer within [-128..127], typically represented as an 8-bit two’s complement number. int32
    i16 A signed integer within [-32,768..32,767], typically represented as a 16-bit two’s complement number. int32
    i32 A signed integer within [-2,147,483,648..2,147,483,647], typically represented as a 32-bit two’s complement number. int32
    i64 A signed integer within [−9,223,372,036,854,775,808..9,223,372,036,854,775,807], typically represented as a 64-bit two’s complement number. int64
    fp32 A 4-byte single-precision floating point number with the same range and precision as defined for the IEEE 754 32-bit floating-point format. float
    fp64 An 8-byte double-precision floating point number with the same range and precision as defined for the IEEE 754 64-bit floating-point format. double
    string A unicode string of text, [0..2,147,483,647] UTF-8 bytes in length. string
    binary A binary value, [0..2,147,483,647] bytes in length. binary
    timestamp A naive timestamp with microsecond precision. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. int64 microseconds since 1970-01-01 00:00:00.000000 (in an unspecified timezone)
    timestamp_tz A timezone-aware timestamp with microsecond precision. Similar to aware datetime in Python. int64 microseconds since 1970-01-01 00:00:00.000000 UTC
    date A date within [1000-01-01..9999-12-31]. int32 days since 1970-01-01
    time A time since the beginning of any day. Range of [0..86,399,999,999] microseconds; leap seconds need not be supported. int64 microseconds past midnight
    interval_year Interval year to month. Supports a range of [-10,000..10,000] years with month precision (= [-120,000..120,000] months). Usually stored as separate integers for years and months, but only the total number of months is significant, i.e. 1y 0m is considered equal to 0y 12m or 1001y -12000m. int32 years and int32 months, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. -10000y 200000m is not allowed)
    interval_day Interval day to second. Supports a range of [-3,650,000..3,650,000] days with microsecond precision (= [-315,360,000,000,000,000..315,360,000,000,000,000] microseconds). Usually stored as separate integers for various components, but only the total number of microseconds is significant, i.e. 1d 0s is considered equal to 0d 86400s. int32 days, int32 seconds, and int32 microseconds, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. 3650001d -86400s 0us is not allowed)
    uuid A universally-unique identifier composed of 128 bits. Typically presented to users in the following hexadecimal format: c48ffa9e-64f4-44cb-ae47-152b4e60e77b. Any 128-bit value is allowed, without specific adherence to RFC4122. 16-byte binary

    Compound Types

    Compound type classes are type classes that need to be configured by means of a parameter pack.

    Type Name Description Protobuf representation for literals
    FIXEDCHAR<L> A fixed-length unicode string of L characters. L must be within [1..2,147,483,647]. L-character string
    VARCHAR<L> A unicode string of at most L characters. L must be within [1..2,147,483,647]. string with at most L characters
    FIXEDBINARY<L> A binary string of L bytes. When casting, values shorter than L are padded with zeros, and values longer than L are right-trimmed. L-byte bytes
    DECIMAL<P, S> A fixed-precision decimal value having precision (P, number of digits) <= 38 and scale (S, number of fractional digits) 0 <= S <= P. 16-byte bytes representing a little-endian 128-bit integer, to be divided by 10^S to get the decimal value
    STRUCT<T1,…,Tn> A list of types in a defined order. repeated Literal, types matching T1..Tn
    NSTRUCT<N:T1,…,N:Tn> Pseudo-type: A struct that maps unique names to value types. Each name is a UTF-8-encoded string. Each value can have a distinct type. Note that NSTRUCT is actually a pseudo-type, because Substrait’s core type system is based entirely on ordinal positions, not named fields. Nonetheless, when working with systems outside Substrait, names are important. n/a
    LIST<T> A list of values of type T. The list can be between [0..2,147,483,647] values in length. repeated Literal, all types matching T
    MAP<K, V> An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. repeated KeyValue (in turn two Literals), all key types matching K and all value types matching V
    PRECISIONTIMESTAMP<P> A timestamp with fractional second precision (P, number of digits) 0 <= P <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. int64 seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone)
    PRECISIONTIMESTAMPTZ<P> A timezone-aware timestamp, with fractional second precision (P, number of digits) 0 <= P <= 9. Similar to aware datetime in Python. int64 seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 UTC

    User-Defined Types

    User-defined type classes are defined as part of simple extensions. An extension can declare an arbitrary number of user-defined extension types. Once a type has been declared, it can be used in function declarations.

    For example, the following declares a type named point (namespaced to the associated YAML file) and two scalar functions that operate on it.

    types:
    + Type Classes - Substrait: Cross-Language Serialization for Relational Algebra      

    Type Classes

    In Substrait, the “class” of a type, not to be confused with the concept from object-oriented programming, defines the set of non-null values that instances of a type may assume.

    Implementations of a Substrait type must support at least this set of values, but may include more; for example, an i8 could be represented using the same in-memory format as an i32, as long as functions operating on i8 values within [-128..127] behave as specified (in this case, this means 8-bit overflow must work as expected). Operating on values outside the specified range is unspecified behavior.

    Simple Types

    Simple type classes are those that don’t support any form of configuration. For simplicity, any generic type that has only a small number of discrete implementations is declared directly, as opposed to via configuration.

    Type Name Description Protobuf representation for literals
    boolean A value that is either True or False. bool
    i8 A signed integer within [-128..127], typically represented as an 8-bit two’s complement number. int32
    i16 A signed integer within [-32,768..32,767], typically represented as a 16-bit two’s complement number. int32
    i32 A signed integer within [-2,147,483,648..2,147,483,647], typically represented as a 32-bit two’s complement number. int32
    i64 A signed integer within [−9,223,372,036,854,775,808..9,223,372,036,854,775,807], typically represented as a 64-bit two’s complement number. int64
    fp32 A 4-byte single-precision floating point number with the same range and precision as defined for the IEEE 754 32-bit floating-point format. float
    fp64 An 8-byte double-precision floating point number with the same range and precision as defined for the IEEE 754 64-bit floating-point format. double
    string A unicode string of text, [0..2,147,483,647] UTF-8 bytes in length. string
    binary A binary value, [0..2,147,483,647] bytes in length. binary
    timestamp A naive timestamp with microsecond precision. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. int64 microseconds since 1970-01-01 00:00:00.000000 (in an unspecified timezone)
    timestamp_tz A timezone-aware timestamp with microsecond precision. Similar to aware datetime in Python. int64 microseconds since 1970-01-01 00:00:00.000000 UTC
    date A date within [1000-01-01..9999-12-31]. int32 days since 1970-01-01
    time A time since the beginning of any day. Range of [0..86,399,999,999] microseconds; leap seconds need not be supported. int64 microseconds past midnight
    interval_year Interval year to month. Supports a range of [-10,000..10,000] years with month precision (= [-120,000..120,000] months). Usually stored as separate integers for years and months, but only the total number of months is significant, i.e. 1y 0m is considered equal to 0y 12m or 1001y -12000m. int32 years and int32 months, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. -10000y 200000m is not allowed)
    interval_day Interval day to second. Supports a range of [-3,650,000..3,650,000] days with microsecond precision (= [-315,360,000,000,000,000..315,360,000,000,000,000] microseconds). Usually stored as separate integers for various components, but only the total number of microseconds is significant, i.e. 1d 0s is considered equal to 0d 86400s. int32 days, int32 seconds, and int32 microseconds, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. 3650001d -86400s 0us is not allowed)
    uuid A universally-unique identifier composed of 128 bits. Typically presented to users in the following hexadecimal format: c48ffa9e-64f4-44cb-ae47-152b4e60e77b. Any 128-bit value is allowed, without specific adherence to RFC4122. 16-byte binary
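
The literal encodings listed in the table for date, time, timestamp, and timestamp_tz can be illustrated with Python's standard `datetime` module. This is a sketch under assumptions: the helper names are hypothetical, and it only computes the integers the table describes (days since 1970-01-01, microseconds past midnight, microseconds since the epoch), not the surrounding Protobuf messages.

```python
from datetime import date, datetime, time, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_NAIVE = datetime(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MICROSECOND = timedelta(microseconds=1)


def date_to_days(d: date) -> int:
    """date literal: days since 1970-01-01, limited to [1000-01-01..9999-12-31]."""
    if not date(1000, 1, 1) <= d <= date(9999, 12, 31):
        raise ValueError("date outside the range given in the table")
    return (d - EPOCH_DATE).days


def time_to_micros(t: time) -> int:
    """time literal: microseconds past midnight, in [0..86,399,999,999]."""
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return seconds * 1_000_000 + t.microsecond


def timestamp_to_micros(ts: datetime) -> int:
    """timestamp literal: microseconds since the epoch, no timezone attached."""
    if ts.tzinfo is not None:
        raise ValueError("timestamp expects a naive datetime")
    return (ts - EPOCH_NAIVE) // ONE_MICROSECOND


def timestamp_tz_to_micros(ts: datetime) -> int:
    """timestamp_tz literal: microseconds since 1970-01-01 00:00:00 UTC."""
    if ts.tzinfo is None:
        raise ValueError("timestamp_tz expects an aware datetime")
    return (ts - EPOCH_UTC) // ONE_MICROSECOND
```

Note how the two timestamp helpers differ only in the epoch they subtract: the naive variant cannot be placed on the timeline without extra context, which mirrors the distinction the table draws between timestamp and timestamp_tz.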

    Compound Types

    Compound type classes are type classes that need to be configured by means of a parameter pack.

    Type Name Description Protobuf representation for literals
    FIXEDCHAR<L> A fixed-length unicode string of L characters. L must be within [1..2,147,483,647]. L-character string
    VARCHAR<L> A unicode string of at most L characters. L must be within [1..2,147,483,647]. string with at most L characters
    FIXEDBINARY<L> A binary string of L bytes. When casting, values shorter than L are padded with zeros, and values longer than L are right-trimmed. L-byte bytes
    DECIMAL<P, S> A fixed-precision decimal value having precision (P, number of digits) <= 38 and scale (S, number of fractional digits) 0 <= S <= P. 16-byte bytes representing a little-endian 128-bit integer, to be divided by 10^S to get the decimal value
    STRUCT<T1,…,Tn> A list of types in a defined order. repeated Literal, types matching T1..Tn
    NSTRUCT<N:T1,…,N:Tn> Pseudo-type: A struct that maps unique names to value types. Each name is a UTF-8-encoded string. Each value can have a distinct type. Note that NSTRUCT is actually a pseudo-type, because Substrait’s core type system is based entirely on ordinal positions, not named fields. Nonetheless, when working with systems outside Substrait, names are important. n/a
    LIST<T> A list of values of type T. The list can be between [0..2,147,483,647] values in length. repeated Literal, all types matching T
    MAP<K, V> An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. repeated KeyValue (in turn two Literals), all key types matching K and all value types matching V
    PRECISIONTIMESTAMP<P> A timestamp with fractional second precision (P, number of digits) 0 <= P <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. int64 seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone)
    PRECISIONTIMESTAMPTZ<P> A timezone-aware timestamp, with fractional second precision (P, number of digits) 0 <= P <= 9. Similar to aware datetime in Python. int64 seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 UTC
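
As an illustration of the DECIMAL<P, S> literal layout in the table (a 16-byte little-endian 128-bit integer, divided by 10^S to obtain the value), here is a minimal Python sketch. The function names are hypothetical, and the use of two's complement for negative values is an assumption, since the table only says "little-endian 128-bit integer".

```python
from decimal import Decimal


def encode_decimal(value: Decimal, precision: int, scale: int) -> bytes:
    """Encode a value as the 16-byte unscaled integer for DECIMAL<precision, scale>."""
    if not (0 < precision <= 38 and 0 <= scale <= precision):
        raise ValueError("require precision <= 38 and 0 <= scale <= precision")
    if not value.is_finite():
        raise ValueError("NaN and infinities cannot be encoded")
    sign, digits, exponent = value.as_tuple()
    unscaled = int("".join(str(d) for d in digits))
    shift = exponent + scale  # power of ten needed to reach the unscaled integer
    if shift >= 0:
        unscaled *= 10 ** shift
    else:
        unscaled, remainder = divmod(unscaled, 10 ** -shift)
        if remainder:
            raise ValueError("value has more fractional digits than the scale")
    if unscaled >= 10 ** precision:
        raise ValueError("value does not fit in the given precision")
    if sign:
        unscaled = -unscaled
    # Assumption: negative values use two's complement.
    return unscaled.to_bytes(16, "little", signed=True)


def decode_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode 16 little-endian bytes and divide by 10**scale, exactly."""
    unscaled = int.from_bytes(raw, "little", signed=True)
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    return Decimal((1 if unscaled < 0 else 0, digits, -scale))
```

For DECIMAL<10, 2>, the value 123.45 is stored as the integer 12345, so the first two bytes are 0x39 and 0x30 and the remaining fourteen are zero. Building the result from a digit tuple avoids rounding by the default decimal context, which matters for values near the 38-digit limit.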

    User-Defined Types

    User-defined type classes are defined as part of simple extensions. An extension can declare an arbitrary number of user-defined extension types. Once a type has been declared, it can be used in function declarations.

    For example, the following declares a type named point (namespaced to the associated YAML file) and two scalar functions that operate on it.

    types:
       - name: "point"
     
     scalar_functions:
    diff --git a/types/type_parsing/index.html b/types/type_parsing/index.html
    index 7cc662a..08c65e6 100644
    --- a/types/type_parsing/index.html
    +++ b/types/type_parsing/index.html
    @@ -1,4 +1,4 @@
    - Type Syntax Parsing - Substrait: Cross-Language Serialization for Relational Algebra      

    Type Syntax Parsing

    In many places, it is useful to have a human-readable string representation of data types. Substrait has a custom syntax for type declaration. The basic structure of a type declaration is:

    name?[variation]<param0,...,paramN>
    + Type Syntax Parsing - Substrait: Cross-Language Serialization for Relational Algebra      

    Type Syntax Parsing

    In many places, it is useful to have a human-readable string representation of data types. Substrait has a custom syntax for type declaration. The basic structure of a type declaration is:

    name?[variation]<param0,...,paramN>
     

    The components of this expression are:

    Component Description Required
    Name Each type has a name. A type is expressed by providing a name. This name can be expressed in arbitrary case (e.g. varchar and vArChAr are equivalent) although lowercase is preferred. Required
    Nullability indicator A type is either non-nullable or nullable. To express nullability, a question mark is added after the type name (before any parameters). Optional, defaults to non-nullable
    Variation When expressing a type, a user can define the type based on a type variation. Some systems use type variations to describe different underlying representations of the same data type. This is expressed as a bracketed integer such as [2]. Optional, defaults to [0]
    Parameters Compound types may have one or more configurable properties. The two main types of properties are integer and type properties. The parameters for each type correspond to a list of known properties associated with a type as declared in the order defined in the type specification. For compound types (types that contain types), the data type syntax will include nested type declarations. The one exception is structs, which are further outlined below. Required where parameters are defined
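
To make the `name?[variation]<param0,...,paramN>` structure concrete, the following is a small hand-written parser sketch in Python. It is not the ANTLR grammar that ships with Substrait: the `ParsedType` class and `parse_type` function are hypothetical names, only integer and nested type parameters are handled, and the named-struct form is deliberately left out.

```python
import re
from dataclasses import dataclass, field

# name, optional "?", optional "[integer]" (hypothetical simplification of the syntax)
_HEAD = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*)\s*(\?)?\s*(?:\[(\d+)\])?\s*")


@dataclass
class ParsedType:
    name: str
    nullable: bool = False  # defaults to non-nullable
    variation: int = 0      # defaults to [0]
    params: list = field(default_factory=list)


def _split_top_level(text: str) -> list:
    """Split on commas that are not nested inside <...>."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return [p.strip() for p in parts]


def parse_type(text: str) -> ParsedType:
    match = _HEAD.match(text)
    if match is None:
        raise ValueError(f"not a type declaration: {text!r}")
    name, question, variation = match.groups()
    rest = text[match.end():].strip()
    params = []
    if rest:
        if not (rest.startswith("<") and rest.endswith(">")):
            raise ValueError(f"malformed parameter list: {rest!r}")
        for item in _split_top_level(rest[1:-1]):
            # Integer parameters stay integers; anything else is a nested type.
            params.append(int(item) if item.isdigit() else parse_type(item))
    return ParsedType(name.lower(), question is not None,
                      int(variation) if variation else 0, params)
```

With this sketch, `parse_type("vArChAr?<10>")` yields a nullable `varchar` with the single integer parameter 10, and `parse_type("list<struct?<i32, string>>")` yields a `list` whose only parameter is a nullable `struct` of `i32` and `string`. Names are lowercased on the way in because the table states that case is not significant.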

    Grammars

    It is relatively easy in most languages to produce simple parsers and emitters for the type syntax. To make that easier, Substrait also includes an ANTLR grammar to ease consumption and production of types. (The grammar also supports an entire language for representing plans as text.)

    Structs & Named Structs

    Structs differ from other types because they have an arbitrary number of parameters. The parameters are recursive and may include their own subproperties. Struct parsing is declared in the following two ways:

    # Struct
     struct?[variation]<type0, type1,..., typeN>
     
    diff --git a/types/type_system/index.html b/types/type_system/index.html
    index f30487c..44abd74 100644
    --- a/types/type_system/index.html
    +++ b/types/type_system/index.html
    @@ -1,4 +1,4 @@
    - Type System - Substrait: Cross-Language Serialization for Relational Algebra      

    Type System

    Substrait tries to cover the most common types used in data manipulation. Types beyond this common core may be represented using simple extensions.

    Substrait types fundamentally consist of four components:

    Component Condition Examples Description
    Class Always i8, string, STRUCT, extensions Together with the parameter pack, describes the set of non-null values supported by the type. Subdivided into simple and compound type classes.
    Nullability Always Either NULLABLE (? suffix) or REQUIRED (no suffix) Describes whether values of this type can be null. Note that null is considered to be a special value of a nullable type, rather than the only value of a special null type.
    Variation Always No suffix or explicitly [0] (system-preferred), or an extension Allows different variations of the same type class to exist in a system at a time, usually distinguished by in-memory format.
    Parameters Compound types only <10, 2> (for DECIMAL), <i32, string> (for STRUCT) Some combination of zero or more data types or integers. The expected set of parameters and the significance of each parameter depends on the type class.

    Refer to Type Parsing for a description of the syntax used to describe types.

    Note

    Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via cast expressions.

    Type Variations

    Type variations may be used to represent differences in representation between different consumers. For example, an engine might support dictionary encoding for a string, or could be using either a row-wise or columnar representation of a struct. All variations of a type are expected to have the same semantics when operated on by functions or other expressions.

    All variations except the “system-preferred” variation (a.k.a. [0], see Type Parsing) must be defined using simple extensions. The key properties of these variations are:

    Property Description
    Base Type Class The type class that this variation belongs to.
    Name The name used to reference this type. Should be unique within type variations for this parent type within a simple extension.
    Description A human description of the purpose of this type variation.
    Function Behavior INHERITS or SEPARATE: whether functions that support the system-preferred variation implicitly also support this variation, or whether functions should be resolved independently. For example, if one has the function add(i8,i8) defined and then defines an i8 variation, this determines whether the i8 variation can be bound to the base add operation (inherits) or whether a specialized version of add needs to be defined specifically for this variation (separate). Defaults to inherits.
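
The difference between INHERITS and SEPARATE can be sketched as a function lookup. Everything in this example is hypothetical (the `TypeVariation` class, the `resolve` function, the registry layout, and the variation names); it is one possible reading of the Function Behavior property, not how any particular consumer resolves functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeVariation:
    base_class: str                      # e.g. "i8"
    name: str                            # unique within the extension
    function_behavior: str = "INHERITS"  # the stated default


def resolve(function_name, args, registry):
    """Find a registered signature for the call, or return None.

    args is a list of (type_class, variation_or_None) pairs.
    registry is a set of (function_name, ((type_class, variation_name_or_None), ...)).
    """
    exact = (function_name,
             tuple((cls, var.name if var else None) for cls, var in args))
    if exact in registry:
        return exact
    # A SEPARATE variation never falls back to the base signature.
    if any(var is not None and var.function_behavior == "SEPARATE"
           for _, var in args):
        return None
    base = (function_name, tuple((cls, None) for cls, _ in args))
    return base if base in registry else None
```

Given a registry that contains only the base `add(i8, i8)`, a call whose arguments use an INHERITS variation of i8 binds to that base signature, while the same call with a SEPARATE variation finds nothing until a dedicated `add` is registered for that variation.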