diff --git a/relations/logical_relations/index.html b/relations/logical_relations/index.html
index 3d0de96..cae52ad 100644
--- a/relations/logical_relations/index.html
+++ b/relations/logical_relations/index.html
@@ -199,14 +199,37 @@
 }
 }
-

Fetch Operation

The fetch operation eliminates records outside a desired window. It typically corresponds to a SQL fetch/offset clause and returns only the records between the start offset and the end offset.

Signature Value
Inputs 1
Outputs 1
Property Maintenance Maintains distribution and orderedness.
Direct Output Order Unchanged from input.

Fetch Properties

Property Description Required
Input A relational input, typically with a desired orderedness property. Required
Offset A non-negative integer. Declares the offset for retrieval of records. Optional, defaults to 0.
Count A non-negative integer or -1. Declares the number of records that should be returned. -1 signals that ALL records should be returned. Required
message FetchRel {
+

Fetch Operation

The fetch operation eliminates records outside a desired window. It typically corresponds to a SQL fetch/offset clause and returns only the records between the start offset and the end offset.

Signature Value
Inputs 1
Outputs 1
Property Maintenance Maintains distribution and orderedness.
Direct Output Order Unchanged from input.

Fetch Properties

Property Description Required
Input A relational input, typically with a desired orderedness property. Required
Offset Expression An expression which evaluates to a non-negative integer or null (recommended type is i64). Declares the offset for retrieval of records. An expression evaluating to null is treated as 0. Optional, defaults to a 0 literal.
Count Expression An expression which evaluates to a non-negative integer or null (recommended type is i64). Declares the number of records that should be returned. An expression evaluating to null indicates that all records should be returned. Optional, defaults to a null literal.
message FetchRel {
   RelCommon common = 1;
   Rel input = 2;
-  // the offset expressed in number of records
-  int64 offset = 3;
-  // the amount of records to return
-  // use -1 to signal that ALL records should be returned
-  int64 count = 4;
+  // Note: A oneof field is inherently optional, whereas individual fields
+  // within a oneof cannot be marked as optional. The unset state of offset
+  // should therefore be checked at the oneof level. Unset is treated as 0.
+  oneof offset_mode {
+    // the offset expressed in number of records
+    // Deprecated: use `offset_expr` instead
+    int64 offset = 3 [deprecated = true];
+    // Expression evaluated into a non-negative integer specifying the number
+    // of records to skip. An expression evaluating to null is treated as 0.
+    // Evaluating to a negative integer should result in an error.
+    // Recommended type for offset is int64.
+    Expression offset_expr = 5;
+  }
+  // Note: A oneof field is inherently optional, whereas individual fields
+  // within a oneof cannot be marked as optional. The unset state of count
+  // should therefore be checked at the oneof level. Unset is treated as ALL.
+  oneof count_mode {
+    // the amount of records to return
+    // use -1 to signal that ALL records should be returned
+    // Deprecated: use `count_expr` instead
+    int64 count = 4 [deprecated = true];
+    // Expression evaluated into a non-negative integer specifying the number
+    // of records to return. An expression evaluating to null signals that ALL
+    // records should be returned.
+    // Evaluating to a negative integer should result in an error.
+    // Recommended type for count is int64.
+    Expression count_expr = 6;
+  }
   substrait.extensions.AdvancedExtension advanced_extension = 10;
 
 }
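
For illustration, a FetchRel that skips the first 10 records and returns the next 5 could be written in protobuf text format roughly as follows (a minimal sketch; the input relation and surrounding plan are elided, and the literal values are chosen only for the example):

fetch {
  common { direct { } }
  # input: any relational input, e.g. a read relation (elided)
  offset_expr { literal { i64: 10 } }
  count_expr { literal { i64: 5 } }
}
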
diff --git a/search/search_index.json b/search/search_index.json
index 9f1165b..d8a448f 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Substrait: Cross-Language Serialization for Relational Algebra","text":""},{"location":"#what-is-substrait","title":"What is Substrait?","text":"

Substrait is a format for describing compute operations on structured data. It is designed for interoperability across different languages and systems.

"},{"location":"#how-does-it-work","title":"How does it work?","text":"

Substrait provides a well-defined, cross-language specification for data compute operations. This includes a consistent declaration of common operations, custom operations, and one or more serialized representations of this specification. The spec focuses on the semantics of each operation. In addition to the specification, the Substrait ecosystem also includes a number of libraries and useful tools.

We highly recommend the tutorial to learn how a Substrait plan is constructed.

"},{"location":"#benefits","title":"Benefits","text":"
  • Avoids every system needing to create a communication method between every other system \u2013 each system merely supports ingesting and producing Substrait and it instantly becomes a part of the greater ecosystem.
  • Makes every part of the system upgradable. There\u2019s a new query engine that\u2019s ten times faster? Just plug it in!
  • Enables heterogeneous environments \u2013 run on a cluster of an unknown set of execution engines!
  • The text version of the Substrait plan allows you to quickly see how a plan functions without needing a visualizer (although there are Substrait visualizers as well!).
"},{"location":"#example-use-cases","title":"Example Use Cases","text":"
  • Communicate a compute plan between a SQL parser and an execution engine (e.g. Calcite SQL parsing to Arrow C++ compute kernel)
  • Serialize a plan that represents a SQL view for consistent use in multiple systems (e.g. Iceberg views in Spark and Trino)
  • Submit a plan to different execution engines (e.g. Datafusion and Postgres) and get a consistent interpretation of the semantics.
  • Create an alternative plan generation implementation that can connect an existing end-user compute expression system to an existing end-user processing engine (e.g. Pandas operations executed inside SingleStore)
  • Build a pluggable plan visualization tool (e.g. D3 based plan visualizer)
"},{"location":"about/","title":"Substrait: Cross-Language Serialization for Relational Algebra","text":""},{"location":"about/#project-vision","title":"Project Vision","text":"

The Substrait project aims to create a well-defined, cross-language specification for data compute operations. The specification declares a set of common operations, defines their semantics, and describes their behavior unambiguously. The project also defines extension points and serialized representations of the specification.

In many ways, the goal of this project is similar to that of the Apache Arrow project. Arrow is focused on a standardized memory representation of columnar data. Substrait is focused on what should be done to data.

"},{"location":"about/#why-not-use-sql","title":"Why not use SQL?","text":"

SQL is a well-known language for describing queries against relational data. It is designed to be simple and to allow reading and writing by humans. Substrait is not intended as a replacement for SQL and works alongside SQL to provide capabilities that SQL lacks. SQL is not a great fit for systems that actually satisfy the query because it does not provide sufficient detail and is not represented in a format that is easy to process. Because of this, most modern systems will first translate the SQL query into a query plan, sometimes called the execution plan. There can be multiple levels of a query plan (e.g. physical and logical), a query plan may be split up and distributed across multiple systems, and a query plan often undergoes simplifying or optimizing transformations. The SQL standard does not define the format of the query or execution plan, and there is no open format that is supported by a broad set of systems. Substrait was created to provide a standard and open format for these query plans.

"},{"location":"about/#why-not-just-do-this-within-an-existing-oss-project","title":"Why not just do this within an existing OSS project?","text":"

A key goal of the Substrait project is to not be coupled to any single existing technology. Trying to get people involved in something can be difficult when it seems to be primarily driven by the opinions and habits of a single community. In many ways, this situation is similar to the early situation with Arrow. The precursor to Arrow was the Apache Drill ValueVectors concept. As part of creating Arrow, Wes and Jacques recognized the need to create a new community to build a fresh consensus (beyond just what the Apache Drill community wanted). This separation and new independent community was a key ingredient to Arrow\u2019s current success. The needs here are much the same: many separate communities could benefit from Substrait, but each has its own pain points, type systems, development processes and timelines. To help resolve these tensions, one of the approaches proposed in Substrait is to set a bar that at least two of the top four OSS data technologies (Arrow, Spark, Iceberg, Trino) support something before incorporating it directly into the Substrait specification. (Another goal is to support strong extension points at key locations to avoid this bar being a limiter to broad adoption.)

"},{"location":"about/#related-technologies","title":"Related Technologies","text":"
  • Apache Calcite: Many ideas in Substrait are inspired by the Calcite project. Calcite is a great JVM-based SQL query parsing and optimization framework. A key goal of the Substrait project is to expose Calcite capabilities more easily to non-JVM technologies as well as expose query planning operations as microservices.
  • Apache Arrow: The Arrow format for data is what the Substrait specification attempts to be for compute expressions. A key goal of Substrait is to enable Substrait producers to execute work within the Arrow Rust and C++ compute kernels.
"},{"location":"about/#why-the-name-substrait","title":"Why the name Substrait?","text":"

A strait is a narrow connector of water between two other pieces of water. In analytics, data is often thought of as water. Substrait is focused on instructions related to the data. In other words, what defines or supports the movement of water between one or more larger systems. Thus, the underlayment for the strait connecting different pools of water => sub-strait.

"},{"location":"faq/","title":"Frequently Asked Questions","text":""},{"location":"faq/#what-is-the-purpose-of-the-post-join-filter-field-on-join-relations","title":"What is the purpose of the post-join filter field on Join relations?","text":"

The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join.

See the example here that highlights how the post-join filter behaves differently than a Filter relation in the case of a left join.

"},{"location":"faq/#why-does-the-project-relation-keep-existing-columns","title":"Why does the project relation keep existing columns?","text":"

In several relational algebra systems (DuckDB, Velox, Apache Spark, Apache DataFusion, etc.) the project relation is used both to add new columns and remove existing columns. It is defined by a list of expressions and there is one output column for each expression.

In Substrait, the project relation is only used to add new columns. Any relation can remove columns by using the emit property in RelCommon. This is because it is very common for optimized plans to discard columns once they are no longer needed and this can happen anywhere in a plan. If this discard required a project relation then optimized plans would be cluttered with project relations that only remove columns.

As a result, Substrait\u2019s project relation is a little different. It is also defined by a list of expressions. However, the output columns are a combination of the input columns and one column for each of the expressions.
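
As an illustrative sketch (protobuf text format; the input and expression bodies are elided and the exact shape is an assumption for the example), a project relation over a two-column input that keeps only its newly computed column would use emit as follows:

project {
  # the project's combined output is [column 0, column 1, expression],
  # so output ordinal 2 selects only the newly computed column
  common { emit { output_mapping: 2 } }
  # input: a two-column relational input (elided)
  # expressions: one computed expression (elided)
}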

"},{"location":"faq/#where-are-field-names-represented","title":"Where are field names represented?","text":"

Some relational algebra systems, such as Spark, give names to the output fields of a relation. For example, in PySpark I might run df.withColumn(\"num_chars\", length(\"text\")).filter(\"num_chars > 10\"). This creates a project relation, which calculates a new field named num_chars. This field is then referenced in the filter relation. Spark\u2019s logical plan maps closely to this and includes both the expression (length(\"text\")) and the name of the output field (num_chars) in its project relation.

Substrait does not name intermediate fields in a plan. This is because these field names have no effect on the computation that must be performed. In addition, it opens the door to name-based references, which Substrait also does not support, because these can be a source of errors and confusion. One of the goals of Substrait is to make it very easy for consumers to understand plans. All references in Substrait are done with ordinals.

In order to allow plans that do use named fields to round-trip through Substrait there is a hint that can be used to add field names to a plan. This hint is called output_names and is located in RelCommon. Consumers should not rely on this hint being present in a plan but, if present, it can be used to provide field names to intermediate relations in a plan for round-trip or debugging purposes.
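
For example, the relation computing num_chars above could carry the hint roughly as follows (a minimal sketch in protobuf text format; the exact placement of output_names within the hint message is an assumption here):

common {
  hint { output_names: "num_chars" }
}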

There are a few places where Substrait DOES define field names:

  • Read relations have field names in the base schema. This is because it is quite common for reads to do a name-based lookup to determine the columns that need to be read from source files.
  • The root relation has field names. This is because the root relation is the final output of the plan and it is useful to have names for the fields in the final output.
"},{"location":"governance/","title":"Substrait Project Governance","text":"

The Substrait project is run by volunteers in a collaborative and open way. Its governance is inspired by the Apache Software Foundation. In most cases, people familiar with the ASF model can work with Substrait in the same way. The biggest differences between the models are:

  • Substrait does not have a separate infrastructure governing body that gatekeeps the adoption of new developer tools and technologies.
  • Substrait Management Committee (SMC) members are responsible for recognizing the corporate relationship of its members and ensuring diverse representation and corporate independence.
  • Substrait does not condone private mailing lists. All project business should be discussed in public. The only exceptions to this are security escalations (security@substrait.io) and harassment (harassment@substrait.io).
  • Substrait has an automated continuous release process with no formal voting process per release.

More details about concrete things Substrait looks to avoid can be found below.

"},{"location":"governance/#the-substrait-project","title":"The Substrait Project","text":"

The Substrait project consists of the code and repositories that reside in the substrait-io GitHub organization (consisting of core repositories and -contrib repositories, which have relaxed requirements), the Substrait.io website, the Substrait mailing list, MS-hosted Teams community calls, and the Substrait Slack workspace. (All are open to everyone and recordings/transcripts are made where technology supports it.)

"},{"location":"governance/#substrait-volunteers","title":"Substrait Volunteers","text":"

We recognize four groups of individuals related to the project.

"},{"location":"governance/#user","title":"User","text":"

A user is someone who uses Substrait. They may contribute to Substrait by providing feedback to developers in the form of bug reports and feature suggestions. Users participate in the Substrait community by helping other users on mailing lists and user support forums.

"},{"location":"governance/#contributors","title":"Contributors","text":"

A contributor is a user who contributes to the project in the form of code or documentation. They take extra steps to participate in the project (loosely defined as the set of repositories under the github substrait-io organization), are active on the developer mailing list, participate in discussions, and provide patches, documentation, suggestions, and criticism.

Contributors may be given write access to specific -contrib repositories by an SMC consensus vote per repository. The vote should be open for a week to allow adequate time for other SMC members to voice any concerns prior to providing write access.

"},{"location":"governance/#committer","title":"Committer","text":"

A committer is a developer who has write access to all (i.e., core and -contrib) repositories and has a signed Contributor License Agreement (CLA) on file. Not needing to depend on other people to make patches to the code or documentation, they are actually making short-term decisions for the project. The SMC can (even tacitly) agree and approve the changes into permanency, or they can reject them. Remember that the SMC makes the decisions, not the individual committers.

"},{"location":"governance/#smc-member","title":"SMC Member","text":"

A SMC member is a committer who was elected due to merit for the evolution of the project. They have write access to the code repository, the right to cast binding votes on all proposals on community-related decisions, the right to propose other active contributors for committership, and the right to invite active committers to the SMC. The SMC as a whole is the entity that controls the project, nobody else. They are responsible for the continued shaping of this governance model.

"},{"location":"governance/#substrait-management-and-collaboration","title":"Substrait Management and Collaboration","text":"

The Substrait project is managed using a collaborative, consensus-based process. We do not have a hierarchical structure; rather, different groups of contributors have different rights and responsibilities in the organization.

"},{"location":"governance/#communication","title":"Communication","text":"

Communication must be done via mailing lists, Slack, and/or Github. Communication is always done publicly. There are no private lists and all decisions related to the project are made in public. Communication is frequently done asynchronously since members of the community are distributed across many time zones.

"},{"location":"governance/#substrait-management-committee","title":"Substrait Management Committee","text":"

The Substrait Management Committee is responsible for the active management of Substrait. The main role of the SMC is to further the long-term development and health of the community as a whole, and to ensure that balanced and wide-scale peer review and collaboration takes place. As part of this, the SMC is the primary approver of specification changes, ensuring that proposed changes represent a balanced and thorough examination of possibilities. This doesn\u2019t mean that the SMC has to be involved in the minutiae of a particular specification change, but it should always shepherd a healthy process around specification changes.

"},{"location":"governance/#substrait-voting-process","title":"Substrait Voting Process","text":"

Because one of the fundamental aspects of accomplishing things is doing so by consensus, we need a way to tell whether we have reached consensus. We do this by voting. There are several different types of voting. In all cases, it is recommended that all community members vote. The number of binding votes required to move forward and the community members who have \u201cbinding\u201d votes differs depending on the type of proposal made. In all cases, a veto of a binding voter results in an inability to move forward.

The rules require that a community member registering a negative vote must include an alternative proposal or a detailed explanation of the reasons for the negative vote. The community then tries to gather consensus on an alternative proposal that can resolve the issue. In the great majority of cases, the concerns leading to the negative vote can be addressed. This process is called \u201cconsensus gathering\u201d and we consider it a very important indication of a healthy community.

Action | +1 votes required | Binding voters | Voting Location
Process/Governance modifications & actions (includes promoting new contributors to committer or SMC) | 3 | SMC | Mailing List
Management of -contrib repositories, including adding repositories and giving write access to them | 3 | SMC | Mailing List
Format/Specification modifications (including breaking extension changes) | 2 | SMC | Github PR
Documentation updates (formatting, moves) | 1 | SMC | Github PR
Typos | 1 | Committers | Github PR
Non-breaking function introductions | 1 (not including proposer) | Committers | Github PR
Non-breaking extension additions & non-format code modifications | 1 (not including proposer) | Committers | Github PR
Changes (non-breaking or breaking) to a Substrait library (i.e. substrait-java, substrait-validator) | 1 (not including proposer) | Committers | Github PR
Changes to a Substrait -contrib repository | 1 (not including proposer) | Contributors | Github PR"},{"location":"governance/#review-then-commit","title":"Review-Then-Commit","text":"

Substrait follows a review-then-commit policy. This requires that all changes receive consensus approval before being committed to the code base. The specific vote requirements follow the table above.

"},{"location":"governance/#expressing-votes","title":"Expressing Votes","text":"

The voting process may seem more than a little weird if you\u2019ve never encountered it before. Votes are represented as numbers between -1 and +1, with \u2018-1\u2019 meaning \u2018no\u2019 and \u2018+1\u2019 meaning \u2018yes.\u2019

The in-between values indicate how strongly the voting individual feels. Here are some examples of fractional votes and what the voter might be communicating with them:

  • +0: \u2018I don\u2019t feel strongly about it, but I\u2019m okay with this.\u2019
  • -0: \u2018I won\u2019t get in the way, but I\u2019d rather we didn\u2019t do this.\u2019
  • -0.5: \u2018I don\u2019t like this idea, but I can\u2019t find any rational justification for my feelings.\u2019
  • ++1: \u2018Wow! I like this! Let\u2019s do it!\u2019
  • -0.9: \u2018I really don\u2019t like this, but I\u2019m not going to stand in the way if everyone else wants to go ahead with it.\u2019
  • +0.9: \u2018This is a cool idea and I like it, but I don\u2019t have time/the skills necessary to help out.\u2019
"},{"location":"governance/#votes-on-code-modification","title":"Votes on Code Modification","text":"

For code-modification votes, +1 votes (review approvals in Github are considered equivalent to a +1) are in favor of the proposal, but -1 votes are vetoes and kill the proposal dead until all vetoers withdraw their -1 votes.

"},{"location":"governance/#vetoes","title":"Vetoes","text":"

A -1 (or an unaddressed PR request for changes) vote by a qualified voter stops a code-modification proposal in its tracks. This constitutes a veto, and it cannot be overruled nor overridden by anyone. Vetoes stand until and unless the individual withdraws their veto.

To prevent vetoes from being used capriciously, the voter must provide with the veto a technical or community justification showing why the change is bad.

"},{"location":"governance/#why-do-we-vote","title":"Why do we vote?","text":"

Votes help us to openly resolve conflicts. Without a process, people tend to avoid conflict and thrash around. Votes help to make sure we do the hard work of resolving the conflict.

"},{"location":"governance/#substrait-is-non-commercial-but-commercially-aware","title":"Substrait is non-commercial but commercially-aware","text":"

Substrait\u2019s mission is to produce software for the public good. All Substrait software is always available for free, and solely under the Apache License.

We\u2019re happy to have third parties, including for-profit corporations, take our software and use it for their own purposes. However it is important in these cases to ensure that the third party does not misuse the brand and reputation of the Substrait project for its own purposes. It is important for the longevity and community health of Substrait that the community gets the appropriate credit for producing freely available software.

The SMC actively tracks the corporate allegiances of community members and strives to ensure influence around any particular aspect of the project isn\u2019t overly skewed towards a single corporate entity.

"},{"location":"governance/#substrait-trademark","title":"Substrait Trademark","text":"

The SMC is responsible for protecting the Substrait name and brand. TBD what action is taken to support this.

"},{"location":"governance/#project-roster","title":"Project Roster","text":""},{"location":"governance/#substrait-management-committee-smc","title":"Substrait Management Committee (SMC)","text":"Name Association Phillip Cloud Voltron Data Weston Pace LanceDB Jacques Nadeau Sundeck Victor Barua Datadog David Sisson Voltron Data"},{"location":"governance/#substrait-committers","title":"Substrait Committers","text":"Name Association Jeroen van Straten Qblox Carlo Curino Microsoft James Taylor Sundeck Sutou Kouhei Clearcode Micah Kornfeld Google Jinfeng Ni Sundeck Andy Grove Nvidia Jesus Camacho Rodriguez Microsoft Rich Tia Voltron Data Vibhatha Abeykoon Voltron Data Nic Crane Recast Gil Forsyth Voltron Data ChaoJun Zhang Intel Matthijs Brobbel Voltron Data Matt Topol Voltron Data Ingo M\u00fcller Google Arttu Voutilainen Palantir Technologies Bruno Volpato Datadog Anshul Data Sundeck Chandra Sanapala Sundeck"},{"location":"governance/#additional-detail-about-differences-from-asf","title":"Additional detail about differences from ASF","text":"

Corporate Awareness: The ASF takes a blind-eye approach that has proven to be too slow to correct corporate influence which has substantially undermined many OSS projects. In contrast, Substrait SMC members are responsible for identifying corporate risks and over-representation and adjusting inclusion in the project based on that (limiting committership, SMC membership, etc). Each member of the SMC shares responsibility to expand the community and seek out corporate diversity.

Infrastructure: The ASF shows its age with respect to infrastructure, having been originally built on SVN. Some examples of ASF requirements that Substrait is eschewing include: custom git infrastructure, a manual release process, and project-external gatekeeping around the use of new tools/technologies.

"},{"location":"community/","title":"Community","text":"

Substrait is developed as a consensus-driven open source product under the Apache 2.0 license. Development is done in the open leveraging GitHub issues and PRs.

"},{"location":"community/#get-in-touch","title":"Get In Touch","text":"Mailing List/Google Group We use the mailing list to discuss questions, formulate plans and collaborate asynchronously. Slack Channel The developers of Substrait frequent the Slack channel. You can get an invite to the channel by following this link. GitHub Issues Substrait is developed via GitHub issues and pull requests. If you see a problem or want to enhance the product, we suggest you file a GitHub issue for developers to review. Twitter The @substrait_io account on Twitter is our official account. Follow-up to keep to date on what is happening with Substrait! Docs Our website is all maintained in our source repository. If there is something you think can be improved, feel free to fork our repository and post a pull request. Meetings Our community meets every other week on Wednesday."},{"location":"community/#talks","title":"Talks","text":"

Want to learn more about Substrait? Try the following presentations and slide decks.

  • Substrait: A Common Representation for Data Compute Plans (Jacques Nadeau, April 2022) [slides]
"},{"location":"community/#citation","title":"Citation","text":"

If you use Substrait in your research, please cite it using the following BibTeX entry:

@misc{substrait,\n  author = {substrait-io},\n  title = {Substrait: Cross-Language Serialization for Relational Algebra},\n  year = {2021},\n  month = {8},\n  day = {31},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/substrait-io/substrait}}\n}\n
"},{"location":"community/#contribution","title":"Contribution","text":"

All contributors are welcome to Substrait. If you want to join the project, open a PR or get in touch with us as above.

"},{"location":"community/#principles","title":"Principles","text":"
  • Be inclusive and open to all.
  • Ensure a diverse set of contributors that come from multiple data backgrounds to maximize general utility.
  • Build a specification based on open consensus.
  • Avoid over-reliance/coupling to any single technology.
  • Make the specification and all tools freely available under a permissive license (Apache v2)
"},{"location":"community/powered_by/","title":"Powered by Substrait","text":"

In addition to the work maintained in repositories within the substrait-io GitHub organization, a growing list of other open source projects have adopted Substrait.

Acero Acero is a query execution engine implemented as a part of the Apache Arrow C++ library. Acero provides a Substrait consumer interface. ADBC ADBC (Arrow Database Connectivity) is an API specification for Apache Arrow-based database access. ADBC allows applications to pass queries either as SQL strings or Substrait plans. Arrow Flight SQL Arrow Flight SQL is a client-server protocol for interacting with databases and query engines using the Apache Arrow in-memory columnar format and the Arrow Flight RPC framework. Arrow Flight SQL allows clients to send queries as SQL strings or Substrait plans. DataFusion DataFusion is an extensible query planning, optimization, and execution framework, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion provides a Substrait producer and consumer that can convert DataFusion logical plans to and from Substrait plans. It can be used through the DataFusion Python bindings. DuckDB DuckDB is an in-process SQL OLAP database management system. DuckDB provides a Substrait extension that allows users to produce and consume Substrait plans through DuckDB\u2019s SQL, Python, and R APIs. Gluten Gluten is a plugin for Apache Spark that allows computation to be offloaded to engines that have better performance or efficiency than Spark\u2019s built-in JVM-based engine. Gluten converts Spark physical plans to Substrait plans. Ibis Ibis is a Python library that provides a lightweight, universal interface for data wrangling. It includes a dataframe API for Python with support for more than 10 query execution engines, plus a Substrait producer to enable support for Substrait-consuming execution engines. Substrait R Interface The Substrait R interface package allows users to construct Substrait plans from R for evaluation by Substrait-consuming execution engines. The package provides a dplyr backend as well as lower-level interfaces for creating Substrait plans and integrations with Acero and DuckDB. Velox Velox is a unified execution engine aimed at accelerating data management systems and streamlining their development. Velox provides a Substrait consumer interface.

To add your project to this list, please open a pull request.

"},{"location":"expressions/aggregate_functions/","title":"Aggregate Functions","text":"

Aggregate functions are functions that define an operation which consumes values from multiple records to produce a single output. Aggregate functions in SQL are typically used with GROUP BY clauses. Aggregate function signatures are similar to scalar function signatures, with a small set of additional properties.

Aggregate function signatures contain all the properties defined for scalar functions. Additionally, they contain the properties below:

Property Description Required Inherits All properties defined for scalar function. N/A Ordered Whether the result of this function is sensitive to sort order. Optional, defaults to false Maximum set size Maximum allowed set size as an unsigned integer. Optional, defaults to unlimited Decomposable Whether the function can be executed in one or more intermediate steps. Valid options are: NONE, ONE, MANY, describing how intermediate steps can be taken. Optional, defaults to NONE Intermediate Output Type If the function is decomposable, represents the intermediate output type that is used, if the function is defined as either ONE or MANY decomposable. Will be a struct in many cases. Required for ONE and MANY. Invocation Whether the function uses all or only distinct values in the aggregation calculation. Valid options are: ALL, DISTINCT. Optional, defaults to ALL"},{"location":"expressions/aggregate_functions/#aggregate-binding","title":"Aggregate Binding","text":"

When binding an aggregate function, the binding must include the following additional properties beyond the standard scalar binding properties:

Property Description Phase Describes the input type of the data: [INITIAL_TO_INTERMEDIATE, INTERMEDIATE_TO_INTERMEDIATE, INITIAL_TO_RESULT, INTERMEDIATE_TO_RESULT] describing what portion of the operation is required. For functions that are NOT decomposable, the only valid option will be INITIAL_TO_RESULT. Ordering Zero or more ordering keys along with key order (ASC|DESC|NULL FIRST, etc.), declared similar to the sort keys in an ORDER BY relational operation. If no sorts are specified, the records are not sorted prior to being passed to the aggregate function."},{"location":"expressions/embedded_functions/","title":"Embedded Functions","text":"

Embedded functions are a special kind of function where the implementation is embedded within the actual plan. They are commonly used in tools where a user intersperses business logic within a data pipeline. This is more common in data science workflows than traditional SQL workflows.

Embedded functions are not pre-registered. Embedded functions require that data be consumed and produced with a standard API, may require memory allocation and have determinate error reporting behavior. They may also have specific runtime dependencies. For example, a Python pickle function may depend on pyarrow 5.0 and pynessie 1.0.

Properties for an embedded function include:

Property Description Required Function Type The type of embedded function presented. Required Function Properties Function properties, one of those items defined below. Required Output Type The fully resolved output type for this embedded function. Required

The binary representation of an embedded function is:

Binary Representation | Human Readable Representation
message EmbeddedFunction {\n  repeated Expression arguments = 1;\n  Type output_type = 2;\n  oneof kind {\n    PythonPickleFunction python_pickle_function = 3;\n    WebAssemblyFunction web_assembly_function = 4;\n  }\n\n  message PythonPickleFunction {\n    bytes function = 1;\n    repeated string prerequisite = 2;\n  }\n\n  message WebAssemblyFunction {\n    bytes script = 1;\n    repeated string prerequisite = 2;\n  }\n}\n

As the bytes are opaque to Substrait there is no equivalent human readable form.

"},{"location":"expressions/embedded_functions/#function-details","title":"Function Details","text":"

There are many types of possible stored functions. For each, Substrait works to expose the function in as descriptive a way as possible to support the largest number of consumers.

"},{"location":"expressions/embedded_functions/#python-pickle-function-type","title":"Python Pickle Function Type","text":"Property Description Required Pickle Body binary pickle encoded function using [TBD] API representation to access arguments. True Prereqs A list of specific Python conda packages that are prerequisites for access (a structured version of a requirements.txt file). Optional, defaults to none"},{"location":"expressions/embedded_functions/#webassembly-function-type","title":"WebAssembly Function Type","text":"Property Description Required Script WebAssembly function True Prereqs A list of AssemblyScript prerequisites required to compile the assemblyscript function using NPM coordinates. Optional, defaults to none Discussion Points
  • What are the common embedded function formats?
  • How do we expose the data for a function?
  • How do we express batching capabilities?
  • How do we ensure/declare containerization?
"},{"location":"expressions/extended_expression/","title":"Extended Expression","text":"

Extended Expression messages are provided for expression-level protocols as an alternative to using a Plan. They mainly target expression-only evaluations, such as those computed in Filter/Project/Aggregation rels. Unlike the original Expression defined in the Substrait protocol, Extended Expression messages require more information to completely describe the computation context, including: input data schema, referred function signatures, and output schema.

Since Extended Expression will be used separately from the Plan rel representation, it will need to include basic fields like Version.

ExtendedExpression Message
message ExtendedExpression {\n  // Substrait version of the expression. Optional up to 0.17.0, required for later\n  // versions.\n  Version version = 7;\n\n  // a list of yaml specifications this expression may depend on\n  repeated substrait.extensions.SimpleExtensionURI extension_uris = 1;\n\n  // a list of extensions this expression may depend on\n  repeated substrait.extensions.SimpleExtensionDeclaration extensions = 2;\n\n  // one or more expression trees with same order in plan rel\n  repeated ExpressionReference referred_expr = 3;\n\n  NamedStruct base_schema = 4;\n  // additional extensions associated with this expression.\n  substrait.extensions.AdvancedExtension advanced_extensions = 5;\n\n  // A list of com.google.Any entities that this plan may use. Can be used to\n  // warn if some embedded message types are unknown. Note that this list may\n  // include message types that are ignorable (optimizations) or that are\n  // unused. In many cases, a consumer may be able to work with a plan even if\n  // one or more message types defined here are unknown.\n  repeated string expected_type_urls = 6;\n\n}\n
"},{"location":"expressions/extended_expression/#input-and-output-data-schema","title":"Input and output data schema","text":"

Similar to base_schema defined in ReadRel, the input data schema describes the name/type/nullability and layout info of the input data for the target expression evaluation. It also has a field name to define the name of the output data.

"},{"location":"expressions/extended_expression/#referred-expression","title":"Referred expression","text":"

An Extended Expression will have one or more referred expressions, which can be either Expression or AggregateFunction. Additional types of expressions may be added in the future.

For a message with multiple expressions, users may produce each Extended Expression in the same order as they occur in the original Plan rel. But, the consumer does NOT have to handle them in this order. A consumer needs only to ensure that the columns in the final output are organized in the same order as defined in the message.
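
As a rough sketch (protobuf text format; the extension URI/function declarations are elided and the field shapes are assumptions for illustration), an Extended Expression producing a single output column named num_chars over a one-column input might look like:

base_schema {
  names: "text"
  struct { types { string { nullability: NULLABILITY_NULLABLE } } }
}
referred_expr {
  expression {
    scalar_function {
      function_reference: 0  # e.g. a length(string) function declared in `extensions`
      output_type { i64 { nullability: NULLABILITY_NULLABLE } }
      # arguments: a field reference to column 0 (elided)
    }
  }
  output_names: "num_chars"
}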

"},{"location":"expressions/extended_expression/#function-extensions","title":"Function extensions","text":"

Function extensions work the same for both Extended Expression and the original Expression defined in the Substrait protocol.

"},{"location":"expressions/field_references/","title":"Field References","text":"

In Substrait, all fields are dealt with on a positional basis. Field names are only used at the edge of a plan, for the purposes of naming fields for the outside world. Each operation returns a simple or compound data type. Additional operations can refer to data within that initial operation using field references. To reference a field, you use a reference based on the type of field position you want to reference.

Reference Type Properties Type Applicability Type return Struct Field Ordinal position. Zero-based. Only legal within the range of possible fields within a struct. Selecting an ordinal outside the applicable field range results in an invalid plan. struct Type of field referenced Array Value Array offset. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Negative and positive overflows return null values (no wrapping). list type of list Array Slice Array offset and element count. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Position does not wrap, nor does length. list Same type as original list Map Key A map value that is matched exactly against available map keys and returned. map Value type of map Map KeyExpression A wildcard string that is matched against a simplified form of regular expressions. Requires the key type of the map to be a character type. [Format detail needed, intention to include basic regex concepts such as greedy/non-greedy.] map List of map value type Masked Complex Expression An expression that provides a mask over a schema declaring which portions of the schema should be presented. This allows a user to select a portion of a complex object but mask certain subsections of that same object. any any"},{"location":"expressions/field_references/#compound-references","title":"Compound References","text":"

References are typically constructed as a sequence. For example: [struct position 0, struct position 1, array offset 2, array slice 1..3].

Field references are in the same order they are defined in their schema. For example, let\u2019s consider the following schema:

column a:\n  struct<\n    b: list<\n      struct<\n        c: map<string, \n          struct<\n            x: i32>>>>>\n

If we want to represent the SQL expression:

a.b[2].c['my_map_key'].x\n

We will need to declare the nested field such that:

Struct field reference a\nStruct field b\nList offset 2\nStruct field c\nMap key my_map_key\nStruct field x\n

Or more formally in Protobuf Text, we get:

selection {\n  direct_reference {\n    struct_field {\n      field: 0 # .a\n      child {\n        struct_field {\n          field: 0 # .b\n          child {\n            list_element {\n              offset: 2\n              child {\n                struct_field {\n                  field: 0 # .c\n                  child {\n                    map_key {\n                      map_key {\n                        string: \"my_map_key\" # ['my_map_key']\n                      }\n                      child {\n                        struct_field {\n                          field: 0 # .x\n                        }\n                      }\n                    }\n                  }\n                }\n              }\n            }\n          }\n        }\n      }\n    }\n  }\n  root_reference { }\n}\n
"},{"location":"expressions/field_references/#validation","title":"Validation","text":"

References must validate against the schema of the record being referenced. If not, an error is expected.

"},{"location":"expressions/field_references/#masked-complex-expression","title":"Masked Complex Expression","text":"

A masked complex expression is used to do a subselection of a portion of a complex record. It allows a user to specify the portion of the complex object to consume. Imagine you have the following schema (note that structs are lists of fields here, as they are throughout Substrait, since field names are not used internally):

struct:\n  - struct:\n    - integer\n    - list:\n      struct:\n        - i32\n        - string\n        - string\n     - i32\n  - i16\n  - i32\n  - i64\n

Given this schema, you could declare a mask of fields to include in pseudocode, such as:

0:[0,1:[..5:[0,2]]],2,3\n\nOR\n\n0:\n  - 0\n  - 1:\n    ..5:\n      -0\n      -2\n2\n3\n

This mask states that we would like to include fields 0, 2, and 3 at the top level. Within field 0, we want to include subfields 0 and 1. For subfield 0.1, we want to include only up to the first 5 elements of the array, and only fields 0 and 2 within the struct in that array. The resulting schema would be:

struct:\n  - struct:\n    - integer\n    - list:\n      struct: \n        - i32\n        - string\n  - i32\n  - i64\n
"},{"location":"expressions/field_references/#unwrapping-behavior","title":"Unwrapping Behavior","text":"

By default, when only a single field is selected from a struct, that struct is removed. When only a single element is removed from a list, the list is removed. A user can also configure the mask to avoid unwrapping in these cases. [TBD how we express this in the serialization formats.]

Discussion Points
  • Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.)
"},{"location":"expressions/scalar_functions/","title":"Scalar Functions","text":"

A function is a scalar function if that function takes in values from a single record and produces an output value. To clearly specify the definition of functions, Substrait declares an extensible specification plus binding approach to function resolution. A scalar function signature includes the following properties:

Property Description Required Name One or more user-friendly UTF-8 strings that are used to reference this function. At least one value is required. List of arguments Argument properties are defined below. Arguments can be fully defined or calculated with a type expression. See further details below. Optional, defaults to niladic. Deterministic Whether this function is expected to reproduce the same output when it is invoked multiple times with the same input. This informs a plan consumer on whether it can constant-reduce the defined function. An example would be a random() function, which is typically expected to be evaluated repeatedly despite having the same set of inputs. Optional, defaults to true. Session Dependent Whether this function is influenced by the session context it is invoked within. For example, a function may be influenced by a user who is invoking the function, the time zone of a session, or some other non-obvious parameter. This can inform caching systems on whether a particular function is cacheable. Optional, defaults to false. Variadic Behavior Whether the last argument of the function is variadic or a single argument. If variadic, the argument can optionally have a lower bound (minimum number of instances) and an upper bound (maximum number of instances). Optional, defaults to single value. Nullability Handling Describes how nullability of input arguments maps to nullability of output arguments. Three options are: MIRROR, DECLARED_OUTPUT and DISCRETE. More details about nullability handling are listed below. Optional, defaults to MIRROR Description Additional description of function for implementers or users. Should be written human-readable to allow exposure to end users. Presented as a map with language => description mappings. E.g. { \"en\": \"This adds two numbers together.\", \"fr\": \"cela ajoute deux nombres\"}. Optional Return Value The output type of the expression. Return types can be expressed as a fully-defined type or a type expression. See below for more on type expressions. Required Implementation Map A map of implementation locations for one or more implementations of the given function. Each key is a function implementation type. Implementation types include examples such as: AthenaArrowLambda, TrinoV361Jar, ArrowCppKernelEnum, GandivaEnum, LinkedIn Transport Jar, etc. [Definition TBD]. Implementation type has one or more properties associated with retrieval of that implementation. Optional"},{"location":"expressions/scalar_functions/#argument-types","title":"Argument Types","text":"

There are three main types of arguments: value arguments, type arguments, and enumerations. Every defined argument must be specified in every invocation of the function. When specified, the position of these arguments in the function invocation must match the position of the arguments as defined in the YAML function definition.

  • Value arguments: arguments that refer to a data value. These could be constants (literal expressions defined in the plan) or variables (a reference expression that references data being processed by the plan). This is the most common type of argument. The value of a value argument is not available in output derivation, but its type is. Value arguments can be declared in one of two ways: concrete or parameterized. Concrete types are either simple types or compound types with all parameters fully defined (without referencing any type arguments). Examples include i32, fp32, VARCHAR<20>, List<fp32>, etc. Parameterized types are discussed further below.
  • Type arguments: arguments that are used only to inform the evaluation and/or type derivation of the function. For example, you might have a function which is truncate(<type> DECIMAL<P0,S0>, <value> DECIMAL<P1, S1>, <value> i32). This function declares two value arguments and a type argument. The difference between them is that the type argument has no value at runtime, while the value arguments do.
  • Enumeration: arguments that support a fixed set of declared values as constant arguments. These arguments must be specified as part of an expression. While these could also have been implemented as constant string value arguments, they are formally included to improve validation/contextual help/etc. for frontend processors and IDEs. An example might be extract([DAY|YEAR|MONTH], <date value>). In this example, a producer must specify a type of date part to extract. Note, the value of a required enumeration cannot be used in type derivation.
"},{"location":"expressions/scalar_functions/#value-argument-properties","title":"Value Argument Properties","text":"Property Description Required Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0) Description Additional description of this argument. Optional Value A fully defined type or a type expression. Required Constant Whether this argument is required to be a constant for invocation. For example, in some system a regular expression pattern would only be accepted as a literal and not a column value reference. Optional, defaults to false"},{"location":"expressions/scalar_functions/#type-argument-properties","title":"Type Argument Properties","text":"Property Description Required Type A partially or completely parameterized type. E.g. List<K> or K Required Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0) Description Additional description of this argument. Optional"},{"location":"expressions/scalar_functions/#required-enumeration-properties","title":"Required Enumeration Properties","text":"Property Description Required Options List of valid string options for this argument Required Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0) Description Additional description of this argument. Optional"},{"location":"expressions/scalar_functions/#options","title":"Options","text":"

In addition to arguments, each call may specify zero or more options. These are similar to a required enumeration but more focused on supporting alternative behaviors. Options can be left unspecified and the consumer is free to choose which implementation to use. An example use case might be OVERFLOW_BEHAVIOR:[OVERFLOW, SATURATE, ERROR]. If unspecified, an engine is free to use any of the three choices or even some alternative behavior (e.g. setting the value to null on overflow). If specified, the engine would be expected to behave as specified or fail. Note, the value of an optional enumeration cannot be used in type derivation.

"},{"location":"expressions/scalar_functions/#option-preference","title":"Option Preference","text":"

A producer may specify multiple values for an option. If the producer does so, then the consumer must deliver the first behavior in the list of values that the consumer is capable of delivering. For example, considering overflow as defined above, if a producer specified [ERROR, SATURATE], then the consumer must deliver ERROR if it is capable of doing so. If it is not, then it may deliver SATURATE. If the consumer cannot deliver either behavior, then it is an error and the consumer must reject the plan.
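
In the serialized form, this preference list is carried on the function invocation itself. A minimal sketch in protobuf text format (the function reference and arguments are placeholders for the example):

scalar_function {
  function_reference: 1  # e.g. add(i32, i32)
  options {
    name: "overflow"
    preference: "ERROR"     # must be used if the consumer can deliver it
    preference: "SATURATE"  # acceptable fallback
  }
  # arguments (elided)
}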

"},{"location":"expressions/scalar_functions/#optional-properties","title":"Optional Properties","text":"Property Description Required Values A list of valid strings for this option. Required Name A human-readable name for this option. Required"},{"location":"expressions/scalar_functions/#nullability-handling","title":"Nullability Handling","text":"Mode Description MIRROR This means that the function has the behavior that if at least one of the input arguments are nullable, the return type is also nullable. If all arguments are non-nullable, the return type will be non-nullable. An example might be the + function. DECLARED_OUTPUT Input arguments are accepted of any mix of nullability. The nullability of the output function is whatever the return type expression states. Example use might be the function is_null() where the output is always boolean independent of the nullability of the input. DISCRETE The input and arguments all define concrete nullability and can only be bound to the types that have those nullability. For example, if a type input is declared i64? and one has an i64 literal, the i64 literal must be specifically cast to i64? to allow the operation to bind."},{"location":"expressions/scalar_functions/#parameterized-types","title":"Parameterized Types","text":"

Types are parameterized by two types of values: by inner types (e.g. List<K>) and numeric values (e.g. DECIMAL<P,S>). Parameter names are simple strings (frequently a single character). There are two types of parameters: integer parameters and type parameters.

When the same parameter name is used multiple times in a function definition, the function can only bind if the exact same value is used for all parameters of that name. For example, if one had a function with a signature of fn(VARCHAR<N>, VARCHAR<N>), the function would only be usable if both VARCHAR types had the same length value N. This necessitates that all instances of the same parameter name must be of the same parameter type (all instances are a type parameter or all instances are an integer parameter).

"},{"location":"expressions/scalar_functions/#type-parameter-resolution-in-variadic-functions","title":"Type Parameter Resolution in Variadic Functions","text":"

When the last argument of a function is variadic and declares a type parameter e.g. fn(A, B, C...), the C parameter can be marked as either consistent or inconsistent. If marked as consistent, the function can only be bound to arguments where all the C types are the same concrete type. If marked as inconsistent, each unique C can be bound to a different type within the constraints of what C allows.

"},{"location":"expressions/scalar_functions/#output-type-derivation","title":"Output Type Derivation","text":""},{"location":"expressions/scalar_functions/#concrete-return-types","title":"Concrete Return Types","text":"

A concrete return type is one that is fully known at function definition time. Examples of simple concrete return types would be things such as i32, fp32. For compound types, a concrete return type must be fully declared. Examples of fully defined compound types: VARCHAR<20>, DECIMAL<25,5>

"},{"location":"expressions/scalar_functions/#return-type-expressions","title":"Return Type Expressions","text":"

Any function can declare a return type expression. A return type expression uses a simplified set of expressions to describe how the return type should be derived. For example, a return expression could be as simple as returning a parameter declared in the arguments (e.g. f(List<K>) => K), or it can be a simple mathematical or conditional expression such as add(decimal<a,b>, decimal<c,d>) => decimal<a+c, b+d>. The simple expression language has a very narrow set of types:

  • Integer: 64-bit signed integer (can be a literal or a parameter value)
  • Boolean: True and False
  • Type: A Substrait type (with possibly additional embedded expressions)

These types are evaluated using a small set of operations to support common scenarios. List of valid operations:

Math: +, -, *, /, min, max\nBoolean: &&, ||, !, <, >, ==\nParameters: type, integer\nLiterals: type, integer\n

Fully defined with argument types (a worked sketch follows this list):

  • type_parameter(string name) => type
  • integer_parameter(string name) => integer
  • not(boolean x) => boolean
  • and(boolean a, boolean b) => boolean
  • or(boolean a, boolean b) => boolean
  • multiply(integer a, integer b) => integer
  • divide(integer a, integer b) => integer
  • add(integer a, integer b) => integer
  • subtract(integer a, integer b) => integer
  • min(integer a, integer b) => integer
  • max(integer a, integer b) => integer
  • equal(integer a, integer b) => boolean
  • greater_than(integer a, integer b) => boolean
  • less_than(integer a, integer b) => boolean
  • covers(Type a, Type b) => boolean Covers means that type b matches type a for as much as type b is defined. For example, if type a is VARCHAR<20> and type b is VARCHAR<N>, type b would be considered covering. Similarly, if type a was List<Struct<a:f32, b:f32>> and type b was List<Struct<>>, it would be considered covering. Note that this is directional, as in "b covers a" or "b can be further enhanced to match the definition of a".
  • if(boolean a) then (integer) else (integer)
  • if(boolean a) then (type) else (type)
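
As a sketch of how these operations compose, the YAML extension format (shown later in this document) accepts a return type written as a small program in this expression language, one assignment per line, with the final line giving the resulting type; the function widen_add below is hypothetical:

scalar_functions:\n- name: widen_add\n  impls:\n  - args:\n    - name: a\n      value: decimal<P1,S1>\n    - name: b\n      value: decimal<P2,S2>\n    return: |-\n      scale = max(S1, S2)\n      prec = min(scale + max(P1 - S1, P2 - S2) + 1, 38)\n      DECIMAL<prec, scale>\n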
"},{"location":"expressions/scalar_functions/#example-type-expressions","title":"Example Type Expressions","text":"

For reference, here are some common output type derivations and how they can be expressed with a return type expression:

Operation Definition Add item to list add(List<T>, T) => List<T> Decimal Division divide(Decimal<P1,S1>, Decimal<P2,S2>) => Decimal<P1 - S1 + S2 + MAX(6, S1 + P2 + 1), MAX(6, S1 + P2 + 1)> Select a subset of map keys based on a regular expression (requires stringlike keys) extract_values(regex:string, map:Map<K,V>) => List<V> WHERE K IN [STRING, VARCHAR<N>, FIXEDCHAR<N>] Concatenate two fixed sized character strings concat(FIXEDCHAR<A>, FIXEDCHAR<B>) => FIXEDCHAR<A+B> Make a struct of a set of fields and a struct definition. make_struct(<type> T, K...) => T"},{"location":"expressions/specialized_record_expressions/","title":"Specialized Record Expressions","text":"

While all types of operations could be reduced to functions, in some cases this would be overly simplistic. Instead, it is helpful to define some other expression constructs.

These constructs should be focused on genuinely distinct kinds of expressions rather than mere syntactic sugar. For example, CAST and EXTRACT are SQL operations that are presented using specialized syntax; however, they can easily be modeled using a function paradigm with minimal complexity.

"},{"location":"expressions/specialized_record_expressions/#literal-expressions","title":"Literal Expressions","text":"

For each data type, it is possible to create a literal value for that data type. The representation depends on the serialization format. Literal expressions include both a type literal and a possibly null value.

"},{"location":"expressions/specialized_record_expressions/#nested-type-constructor-expressions","title":"Nested Type Constructor Expressions","text":"

These expressions allow structs, lists, and maps to be constructed from a set of expressions. For example, they allow a struct expression like (field 0 - field 1, field 0 + field 1) to be represented.

"},{"location":"expressions/specialized_record_expressions/#cast-expression","title":"Cast Expression","text":"

To convert a value from one type to another, Substrait defines a cast expression. Cast expressions declare an expected type, an input argument, and an enumeration specifying failure behavior, indicating whether the cast should return null on failure or throw an exception.

Note that Substrait always requires a cast expression whenever the current type is not exactly equal to (one of) the expected types. For example, it is illegal to directly pass a value of type i8[0] to a function that only supports an i8?[0] argument.

"},{"location":"expressions/specialized_record_expressions/#if-expression","title":"If Expression","text":"

An if value expression is an expression composed of one if clause, zero or more else if clauses and an else clause. In pseudocode, they are envisioned as:

if <boolean expression> then <result expression 1>\nelse if <boolean expression> then <result expression 2> (zero or more times)\nelse <result expression 3>\n

When an if expression is declared, all return expressions must be of the same type.

"},{"location":"expressions/specialized_record_expressions/#shortcut-behavior","title":"Shortcut Behavior","text":"

An if expression is expected to logically short-circuit on a positive outcome. This means that a skipped else/elseif expression cannot cause an error. For example, the following should not throw an error, despite the fact that the cast operation would fail:

if 'value' = 'value' then 0\nelse cast('hello' as integer) \n
"},{"location":"expressions/specialized_record_expressions/#switch-expression","title":"Switch Expression","text":"

Switch expressions allow the selection of alternate branches based on the value of a given expression. They are an optimized form of a generic if expression in which all conditions are equality comparisons against the same value. In pseudocode:

switch(value)\n<value 1> => <return 1> (1 or more times)\n<else> => <return default>\n

Return values for a switch expression must all be of identical type.

"},{"location":"expressions/specialized_record_expressions/#shortcut-behavior_1","title":"Shortcut Behavior","text":"

As in if expressions, switch expression evaluation should not be interrupted by "roads not taken".

"},{"location":"expressions/specialized_record_expressions/#or-list-equality-expression","title":"Or List Equality Expression","text":"

A specialized structure that is often used is a large list of possible values. In SQL, these are typically large IN lists. They can be composed from one or more fields. There are two common patterns, single value and multi value. In pseudocode they are represented as:

Single Value:\nexpression, [<value1>, <value2>, ... <valueN>]\n\nMulti Value:\n[expressionA, expressionB], [[value1a, value1b], [value2a, value2b].. [valueNa, valueNb]]\n

For single value expressions, these are a compact equivalent of expression = value1 OR expression = value2 OR ... OR expression = valueN. When using an expression of this type, two things are required: the test expression and all of the value expressions must be of the same type, and a function signature for equality must be available for that type.
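
As a concrete sketch (with a hypothetical i32 field region_id), the single-value form region_id, [10, 20, 30] is equivalent to region_id = 10 OR region_id = 20 OR region_id = 30 and requires that an equality function signature exists for i32.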

"},{"location":"expressions/subqueries/","title":"Subqueries","text":"

Subqueries are scalar expressions comprised of another query.

"},{"location":"expressions/subqueries/#forms","title":"Forms","text":""},{"location":"expressions/subqueries/#scalar","title":"Scalar","text":"

Scalar subqueries are subqueries that return one row and one column.

Property Description Required Input Input relation Yes"},{"location":"expressions/subqueries/#in-predicate","title":"IN predicate","text":"

An IN subquery predicate checks that the left expression is contained in the right subquery.

"},{"location":"expressions/subqueries/#examples","title":"Examples","text":"
SELECT *\nFROM t1\nWHERE x IN (SELECT * FROM t2)\n
SELECT *\nFROM t1\nWHERE (x, y) IN (SELECT a, b FROM t2)\n
Property Description Required Needles Expressions whose existence will be checked Yes Haystack Subquery to check Yes"},{"location":"expressions/subqueries/#set-predicates","title":"Set predicates","text":"

A set predicate is a predicate over a set of rows in the form of a subquery.

EXISTS and UNIQUE are common SQL spellings of these kinds of predicates.

Property Description Required Operation The operation to perform over the set Yes Tuples Set of tuples to check using the operation Yes"},{"location":"expressions/subqueries/#set-comparisons","title":"Set comparisons","text":"

A set comparison subquery is a subquery comparison using ANY or ALL operations.

"},{"location":"expressions/subqueries/#examples_1","title":"Examples","text":"
SELECT *\nFROM t1\nWHERE x < ANY(SELECT y from t2)\n
Property Description Required Reduction operation The kind of reduction to use over the subquery Yes Comparison operation The kind of comparison operation to use Yes Expression Left-hand side expression to check Yes Subquery Subquery to check Yes Protobuf Representation
message Subquery {\n  oneof subquery_type {\n    // Scalar subquery\n    Scalar scalar = 1;\n    // x IN y predicate\n    InPredicate in_predicate = 2;\n    // EXISTS/UNIQUE predicate\n    SetPredicate set_predicate = 3;\n    // ANY/ALL predicate\n    SetComparison set_comparison = 4;\n  }\n\n  // A subquery with one row and one column. This is often an aggregate\n  // though not required to be.\n  message Scalar {\n    Rel input = 1;\n  }\n\n  // Predicate checking that the left expression is contained in the right\n  // subquery\n  //\n  // Examples:\n  //\n  // x IN (SELECT * FROM t)\n  // (x, y) IN (SELECT a, b FROM t)\n  message InPredicate {\n    repeated Expression needles = 1;\n    Rel haystack = 2;\n  }\n\n  // A predicate over a set of rows in the form of a subquery\n  // EXISTS and UNIQUE are common SQL forms of this operation.\n  message SetPredicate {\n    enum PredicateOp {\n      PREDICATE_OP_UNSPECIFIED = 0;\n      PREDICATE_OP_EXISTS = 1;\n      PREDICATE_OP_UNIQUE = 2;\n    }\n    // TODO: should allow expressions\n    PredicateOp predicate_op = 1;\n    Rel tuples = 2;\n  }\n\n  // A subquery comparison using ANY or ALL.\n  // Examples:\n  //\n  // SELECT *\n  // FROM t1\n  // WHERE x < ANY(SELECT y from t2)\n  message SetComparison {\n    enum ComparisonOp {\n      COMPARISON_OP_UNSPECIFIED = 0;\n      COMPARISON_OP_EQ = 1;\n      COMPARISON_OP_NE = 2;\n      COMPARISON_OP_LT = 3;\n      COMPARISON_OP_GT = 4;\n      COMPARISON_OP_LE = 5;\n      COMPARISON_OP_GE = 6;\n    }\n\n    enum ReductionOp {\n      REDUCTION_OP_UNSPECIFIED = 0;\n      REDUCTION_OP_ANY = 1;\n      REDUCTION_OP_ALL = 2;\n    }\n\n    // ANY or ALL\n    ReductionOp reduction_op = 1;\n    // A comparison operator\n    ComparisonOp comparison_op = 2;\n    // left side of the expression\n    Expression left = 3;\n    // right side of the expression\n    Rel right = 4;\n  }\n}\n
"},{"location":"expressions/table_functions/","title":"Table Functions","text":"

Table functions produce zero or more records for each input record. Table functions use a signature similar to scalar functions. However, they are not allowed in the same contexts.

to be completed…

"},{"location":"expressions/user_defined_functions/","title":"User-Defined Functions","text":"

Substrait supports the creation of custom functions via simple extensions, using the facilities described in scalar functions. The functions defined by Substrait itself use the same mechanism. The extension files for standard functions can be found here.

Here's an example function that doubles its input:

Implementation Note

This implementation is only defined on 32-bit floats and integers but could be defined on all numbers (and even lists and strings). The user of the implementation can specify what happens when the resulting value falls outside of the valid range for a 32-bit float (either return NAN or raise an error).

%YAML 1.2\n---\nscalar_functions:\n  -\n    name: \"double\"\n    description: \"Double the value\"\n    impls:\n      - args:\n          - name: x\n            value: fp32\n        options:\n          on_domain_error:\n            values: [ NAN, ERROR ]\n        return: fp32\n      - args:\n          - name: x\n            value: i32\n        options:\n          on_domain_error:\n            values: [ NAN, ERROR ]\n        return: i32\n
"},{"location":"expressions/window_functions/","title":"Window Functions","text":"

Window functions are functions which consume values from multiple records to produce a single output. They are similar to aggregate functions, but they also have a focused window of analysis within their partition. To an end user, window functions appear much like scalar values, producing a single value for each input record; however, producing each of those values may require visibility over many records.

Window function signatures contain all the properties defined for aggregate functions. Additionally, they contain the properties below:

Property Description Required Inherits All properties defined for aggregate functions. N/A Window Type STREAMING or PARTITION. Describes whether the function needs to see all data for the specific partition operation simultaneously. Operations like SUM can produce values in a streaming manner with no complete visibility of the partition. NTILE requires visibility of the entire partition before it can start producing values. Optional, defaults to PARTITION
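
For reference, a minimal sketch of how a window function implementation can declare this property in the YAML extension format (field spellings follow the simple-extensions schema; this is an illustrative definition modeled on ntile, not a standard extension file):

window_functions:\n- name: ntile\n  impls:\n  - args:\n    - name: x\n      value: i32\n    # the full partition must be visible before any output is produced\n    window_type: PARTITION\n    nullability: DECLARED_OUTPUT\n    return: i32?\n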

When binding a window function, the binding must include the following additional properties beyond the standard scalar binding properties:

Property Description Required Partition A list of partitioning expressions. False, defaults to a single partition for the entire dataset Lower Bound Bound Preceding(int64), Bound Following(int64) or CurrentRow. False, defaults to start of partition Upper Bound Bound Preceding(int64), Bound Following(int64) or CurrentRow. False, defaults to end of partition"},{"location":"expressions/window_functions/#aggregate-functions-as-window-functions","title":"Aggregate Functions as Window Functions","text":"

Aggregate functions can be treated as window functions with Window Type set to STREAMING.

AVG, COUNT, MAX, MIN and SUM are examples of aggregate functions that are commonly allowed in window contexts.

"},{"location":"extensions/","title":"Extensions","text":"

In many cases, the existing objects in Substrait will be sufficient to accomplish a particular use case. However, it is sometimes helpful to create a new data type, scalar function signature or some other custom representation within a system. For that, Substrait provides a number of extension points.

"},{"location":"extensions/#simple-extensions","title":"Simple Extensions","text":"

Some kinds of primitives are so frequently extended that Substrait defines a standard YAML format that describes how the extended functionality can be interpreted. This allows different projects/systems to use the YAML definition as a specification so that interoperability isn't constrained to the base Substrait specification. The main types of extensions that are defined in this manner include the following:

  • Data types
  • Type variations
  • Scalar Functions
  • Aggregate Functions
  • Window Functions
  • Table Functions

To extend these items, developers can create one or more YAML files at a defined URI that describe the properties of each of these extensions. The YAML file is constructed according to the YAML Schema. Each definition in the file corresponds to the YAML-based serialization of the relevant data structure. If a developer only wants to extend one of these types of objects (e.g. types), they do not have to provide definitions for the other extension points.

A Substrait plan can reference one or more YAML files via URI for extension. Wherever these extended entities are used, they are referenced using a URI + name reference. The name scheme per type works as follows:

Category Naming scheme Type The name as defined on the type object. Type Variation The name as defined on the type variation object. Function Signature A function signature compound name as described below.

A YAML file can also reference types and type variations defined in another YAML file. To do this, it must declare the YAML file it depends on using a key-value pair in the dependencies key, where the value is the URI to the YAML file, and the key is a valid identifier that can then be used as an identifier-safe alias for the URI. This alias can then be used as a .-separated namespace prefix wherever a type class or type variation name is expected.

For example, if the YAML file at file:///extension_types.yaml defines a type called point, a different YAML file can use the type in a function declaration as follows:

dependencies:\n  ext: file:///extension_types.yaml\nscalar_functions:\n- name: distance\n  description: The distance between two points.\n  impls:\n  - args:\n    - name: a\n      value: ext.point\n    - name: b\n      value: ext.point\n    return: f64\n

Here, the choice for the name ext is arbitrary, as long as it does not conflict with anything else in the YAML file.

"},{"location":"extensions/#function-signature-compound-names","title":"Function Signature Compound Names","text":"

A YAML file may contain one or more functions by the same name. The key used in the function extension declaration to reference a function is a combination of the name of the function along with a list of the required input argument types. The format is as follows:

<function name>:<short_arg_type0>_<short_arg_type1>_..._<short_arg_typeN>\n

Rather than using a full data type representation, the input argument types (short_arg_type) are mapped to single-level short names. The mappings are listed in the table below.
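
For example, using the mappings below, a hypothetical function overlay(string, i32, string) => string would be referenced with the compound name:

overlay:str_i32_str\n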

Note

Every compound function signature must be unique. If two function implementations in a YAML file would generate the same compound function signature, then the YAML file is invalid and behavior is undefined.

| Argument Type | Signature Name |
| --- | --- |
| Required Enumeration | req |
| i8 | i8 |
| i16 | i16 |
| i32 | i32 |
| i64 | i64 |
| fp32 | fp32 |
| fp64 | fp64 |
| string | str |
| binary | vbin |
| boolean | bool |
| timestamp | ts |
| timestamp_tz | tstz |
| date | date |
| time | time |
| interval_year | iyear |
| interval_day | iday |
| interval_compound | icompound |
| uuid | uuid |
| fixedchar<N> | fchar |
| varchar<N> | vchar |
| fixedbinary<N> | fbin |
| decimal<P,S> | dec |
| precision_timestamp<P> | pts |
| precision_timestamp_tz<P> | ptstz |
| struct<T1,T2,…,TN> | struct |
| list<T> | list |
| map<K,V> | map |
| any[\d]? | any |
| user defined type | u!name |
"},{"location":"extensions/#examples","title":"Examples","text":"
| Function Signature | Function Name |
| --- | --- |
| add(optional enumeration, i8, i8) => i8 | add:i8_i8 |
| avg(fp32) => fp32 | avg:fp32 |
| extract(required enumeration, timestamp) => i64 | extract:req_ts |
| sum(any1) => any1 | sum:any |
"},{"location":"extensions/#any-types","title":"Any Types","text":"
scalar_functions:\n- name: foo\n  impls:\n  - args:\n    - name: a\n      value: any\n    - name: b\n      value: any\n    return: int64\n

The any type indicates that the argument can take any possible type. In the foo function above, arguments a and b can be of any type, even different ones in the same function invocation.

scalar_functions:\n- name: bar\n  impls:\n  - args:\n    - name: a\n      value: any1\n    - name: b\n      value: any1\n    return: int64\n
The any[\d] types (i.e. any1, any2, …, any9) impose an additional restriction: within a single function invocation, all any types with the same numeric suffix must be of the same type. In the bar function above, arguments a and b can have any type as long as both types are the same.

"},{"location":"extensions/#advanced-extensions","title":"Advanced Extensions","text":"

Less common extension needs can be addressed using customization support at the serialization level. This includes the following kinds of extensions:

Extension Type Description Relation Modification (semantic) Extensions to an existing relation that will alter the semantics of that relation. These kinds of extensions require that any plan consumer understand the extension to be able to manipulate or execute that operator. Ignoring these extensions will result in an incorrect interpretation of the plan. An example extension might be creating a customized version of Aggregate that can optionally apply a filter before aggregating the data. Note: Semantic-changing extensions shouldn't change the core characteristics of the underlying relation. For example, they should not change the default direct output field ordering, change the number of fields output or change the behavior of physical property characteristics. If one needs to change one of these behaviors, one should define a new relation as described below. Relation Modification (optimization) Extensions to an existing relation that can improve the efficiency of a plan consumer but don't fundamentally change the behavior of the operation. An example might be an estimated amount of memory the relation is expected to use or a particular algorithmic pattern that is perceived to be optimal. New Relations Creates an entirely new kind of relation. It is the most flexible way to extend Substrait but also makes the Substrait plan the least interoperable. In most cases it is better to use a semantic-changing relation as opposed to a new relation, as it means existing code patterns can easily be extended to work with the additional properties. New Read Types Defines a new subcategory of read that can be used in a ReadRel. One of the goals of Substrait is to provide a fairly extensive set of read patterns within the project, as opposed to requiring people to define new types externally. As such, we suggest that you first talk with the Substrait community to determine whether your read type can be incorporated directly in the core specification. New Write Types Similar to a read type but for writes. As with reads, the community recommends that interested extenders first discuss developing new write types with the community before using the extension mechanisms. Plan Extensions Semantic and/or optimization based additions at the plan level.

Because extension mechanisms are different for each serialization format, please refer to the corresponding serialization sections to understand how these extensions are defined in more detail.

"},{"location":"extensions/functions_aggregate_approx/","title":"functions_aggregate_approx.yaml","text":"

This document file is generated for functions_aggregate_approx.yaml

"},{"location":"extensions/functions_aggregate_approx/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_aggregate_approx/#approx_count_distinct","title":"approx_count_distinct","text":"

Implementations: approx_count_distinct(x): -> return_type 0. approx_count_distinct(any): -> i64

Calculates the approximate number of rows that contain distinct values of the expression argument using HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact result.

"},{"location":"extensions/functions_aggregate_decimal_output/","title":"functions_aggregate_decimal_output.yaml","text":"

This document file is generated for functions_aggregate_decimal_output.yaml

"},{"location":"extensions/functions_aggregate_decimal_output/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_aggregate_decimal_output/#count","title":"count","text":"

Implementations: count(x, option:overflow): -> return_type 0. count(any, option:overflow): -> decimal<38,0>

Count a set of values. Result is returned as a decimal instead of i64.

Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_aggregate_decimal_output/#count_1","title":"count","text":"

    Implementations:

    Count a set of records (not field referenced). Result is returned as a decimal instead of i64.

    "},{"location":"extensions/functions_aggregate_decimal_output/#approx_count_distinct","title":"approx_count_distinct","text":"

    Implementations: approx_count_distinct(x): -> return_type 0. approx_count_distinct(any): -> decimal<38,0>

    Calculates the approximate number of rows that contain distinct values of the expression argument using HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact result. Result is returned as a decimal instead of i64.

    "},{"location":"extensions/functions_aggregate_generic/","title":"functions_aggregate_generic.yaml","text":"

    This document file is generated for functions_aggregate_generic.yaml

    "},{"location":"extensions/functions_aggregate_generic/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_aggregate_generic/#count","title":"count","text":"

    Implementations: count(x, option:overflow): -> return_type 0. count(any, option:overflow): -> i64

    Count a set of values

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_aggregate_generic/#count_1","title":"count","text":"

    Implementations:

    Count a set of records (not field referenced)

    "},{"location":"extensions/functions_aggregate_generic/#any_value","title":"any_value","text":"

    Implementations: any_value(x, option:ignore_nulls): -> return_type 0. any_value(any1, option:ignore_nulls): -> any1?

    *Selects an arbitrary value from a group of values. If the input is empty, the function returns null. *

    Options:
  • ignore_nulls ['TRUE', 'FALSE']
  • "},{"location":"extensions/functions_arithmetic/","title":"functions_arithmetic.yaml","text":"

    This document file is generated for functions_arithmetic.yaml

    "},{"location":"extensions/functions_arithmetic/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_arithmetic/#add","title":"add","text":"

    Implementations: add(x, y, option:overflow): -> return_type 0. add(i8, i8, option:overflow): -> i8 1. add(i16, i16, option:overflow): -> i16 2. add(i32, i32, option:overflow): -> i32 3. add(i64, i64, option:overflow): -> i64 4. add(fp32, fp32, option:rounding): -> fp32 5. add(fp64, fp64, option:rounding): -> fp64

    Add two values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#subtract","title":"subtract","text":"

    Implementations: subtract(x, y, option:overflow): -> return_type 0. subtract(i8, i8, option:overflow): -> i8 1. subtract(i16, i16, option:overflow): -> i16 2. subtract(i32, i32, option:overflow): -> i32 3. subtract(i64, i64, option:overflow): -> i64 4. subtract(fp32, fp32, option:rounding): -> fp32 5. subtract(fp64, fp64, option:rounding): -> fp64

    Subtract one value from another.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#multiply","title":"multiply","text":"

    Implementations: multiply(x, y, option:overflow): -> return_type 0. multiply(i8, i8, option:overflow): -> i8 1. multiply(i16, i16, option:overflow): -> i16 2. multiply(i32, i32, option:overflow): -> i32 3. multiply(i64, i64, option:overflow): -> i64 4. multiply(fp32, fp32, option:rounding): -> fp32 5. multiply(fp64, fp64, option:rounding): -> fp64

    Multiply two values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#divide","title":"divide","text":"

    Implementations: divide(x, y, option:overflow, option:on_domain_error, option:on_division_by_zero): -> return_type 0. divide(i8, i8, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i8 1. divide(i16, i16, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i16 2. divide(i32, i32, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i32 3. divide(i64, i64, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i64 4. divide(fp32, fp32, option:rounding, option:on_domain_error, option:on_division_by_zero): -> fp32 5. divide(fp64, fp64, option:rounding, option:on_domain_error, option:on_division_by_zero): -> fp64

    *Divide x by y. In the case of integer division, partial values are truncated (i.e. rounded towards 0). The on_division_by_zero option governs behavior in cases where y is 0. If the option is IEEE then the IEEE 754 standard is followed: all values except ±infinity return NaN and ±infinity are unchanged. If the option is LIMIT then the result is ±infinity in all cases. If either x or y is NaN then behavior will be governed by on_domain_error. If x and y are both ±infinity, behavior will be governed by on_domain_error. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • on_domain_error ['NULL', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_division_by_zero ['IEEE', 'LIMIT', 'NULL', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#negate","title":"negate","text":"

    Implementations: negate(x, option:overflow): -> return_type 0. negate(i8, option:overflow): -> i8 1. negate(i16, option:overflow): -> i16 2. negate(i32, option:overflow): -> i32 3. negate(i64, option:overflow): -> i64 4. negate(fp32): -> fp32 5. negate(fp64): -> fp64

    Negation of the value

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#modulus","title":"modulus","text":"

    Implementations: modulus(x, y, option:division_type, option:overflow, option:on_domain_error): -> return_type 0. modulus(i8, i8, option:division_type, option:overflow, option:on_domain_error): -> i8 1. modulus(i16, i16, option:division_type, option:overflow, option:on_domain_error): -> i16 2. modulus(i32, i32, option:division_type, option:overflow, option:on_domain_error): -> i32 3. modulus(i64, i64, option:division_type, option:overflow, option:on_domain_error): -> i64

    *Calculate the remainder (r) when dividing dividend (x) by divisor (y). In mathematics, many conventions for the modulus (mod) operation exist. The result of a mod operation depends on the software implementation and underlying hardware. Substrait is a format for describing compute operations on structured data and is designed for interoperability. Therefore the user is responsible for determining a definition of division as defined by the quotient (q). The following basic conditions of division are satisfied: (1) q ∈ ℤ (the quotient is an integer), (2) x = y * q + r (division rule), (3) abs(r) < abs(y). The division_type option determines the mathematical definition of quotient to use in the above definition of division. When division_type=TRUNCATE, q = trunc(x/y). When division_type=FLOOR, q = floor(x/y). In the cases of TRUNCATE and FLOOR division, the remainder is r = x - y * q. The on_domain_error option governs behavior in cases where y is 0, y is ±inf, or x is ±inf. In these cases the mod is undefined. The overflow option governs behavior when integer overflow occurs. If x and y are both 0 or both ±infinity, behavior will be governed by on_domain_error. *

    Options:
  • division_type ['TRUNCATE', 'FLOOR']
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • on_domain_error ['NULL', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#power","title":"power","text":"

    Implementations: power(x, y, option:overflow): -> return_type 0. power(i64, i64, option:overflow): -> i64 1. power(fp32, fp32): -> fp32 2. power(fp64, fp64): -> fp64

    Take the power with x as the base and y as exponent.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#sqrt","title":"sqrt","text":"

    Implementations: sqrt(x, option:rounding, option:on_domain_error): -> return_type 0. sqrt(i64, option:rounding, option:on_domain_error): -> fp64 1. sqrt(fp32, option:rounding, option:on_domain_error): -> fp32 2. sqrt(fp64, option:rounding, option:on_domain_error): -> fp64

    Square root of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#exp","title":"exp","text":"

    Implementations: exp(x, option:rounding): -> return_type 0. exp(i64, option:rounding): -> fp64 1. exp(fp32, option:rounding): -> fp32 2. exp(fp64, option:rounding): -> fp64

    The mathematical constant e, raised to the power of the value.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#cos","title":"cos","text":"

    Implementations: cos(x, option:rounding): -> return_type 0. cos(fp32, option:rounding): -> fp32 1. cos(fp64, option:rounding): -> fp64

    Get the cosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#sin","title":"sin","text":"

    Implementations: sin(x, option:rounding): -> return_type 0. sin(fp32, option:rounding): -> fp32 1. sin(fp64, option:rounding): -> fp64

    Get the sine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#tan","title":"tan","text":"

    Implementations: tan(x, option:rounding): -> return_type 0. tan(fp32, option:rounding): -> fp32 1. tan(fp64, option:rounding): -> fp64

    Get the tangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#cosh","title":"cosh","text":"

    Implementations: cosh(x, option:rounding): -> return_type 0. cosh(fp32, option:rounding): -> fp32 1. cosh(fp64, option:rounding): -> fp64

    Get the hyperbolic cosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#sinh","title":"sinh","text":"

    Implementations: sinh(x, option:rounding): -> return_type 0. sinh(fp32, option:rounding): -> fp32 1. sinh(fp64, option:rounding): -> fp64

    Get the hyperbolic sine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#tanh","title":"tanh","text":"

    Implementations: tanh(x, option:rounding): -> return_type 0. tanh(fp32, option:rounding): -> fp32 1. tanh(fp64, option:rounding): -> fp64

    Get the hyperbolic tangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#acos","title":"acos","text":"

    Implementations: acos(x, option:rounding, option:on_domain_error): -> return_type 0. acos(fp32, option:rounding, option:on_domain_error): -> fp32 1. acos(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arccosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#asin","title":"asin","text":"

    Implementations: asin(x, option:rounding, option:on_domain_error): -> return_type 0. asin(fp32, option:rounding, option:on_domain_error): -> fp32 1. asin(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arcsine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#atan","title":"atan","text":"

    Implementations: atan(x, option:rounding): -> return_type 0. atan(fp32, option:rounding): -> fp32 1. atan(fp64, option:rounding): -> fp64

    Get the arctangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#acosh","title":"acosh","text":"

    Implementations: acosh(x, option:rounding, option:on_domain_error): -> return_type 0. acosh(fp32, option:rounding, option:on_domain_error): -> fp32 1. acosh(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the hyperbolic arccosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#asinh","title":"asinh","text":"

    Implementations: asinh(x, option:rounding): -> return_type 0. asinh(fp32, option:rounding): -> fp32 1. asinh(fp64, option:rounding): -> fp64

    Get the hyperbolic arcsine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#atanh","title":"atanh","text":"

    Implementations: atanh(x, option:rounding, option:on_domain_error): -> return_type 0. atanh(fp32, option:rounding, option:on_domain_error): -> fp32 1. atanh(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the hyperbolic arctangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#atan2","title":"atan2","text":"

    Implementations: atan2(x, y, option:rounding, option:on_domain_error): -> return_type 0. atan2(fp32, fp32, option:rounding, option:on_domain_error): -> fp32 1. atan2(fp64, fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arctangent of values given as x/y pairs.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#radians","title":"radians","text":"

    Implementations: radians(x, option:rounding): -> return_type 0. radians(fp32, option:rounding): -> fp32 1. radians(fp64, option:rounding): -> fp64

    *Converts angle x in degrees to radians. *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#degrees","title":"degrees","text":"

    Implementations: degrees(x, option:rounding): -> return_type 0. degrees(fp32, option:rounding): -> fp32 1. degrees(fp64, option:rounding): -> fp64

    *Converts angle x in radians to degrees. *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#abs","title":"abs","text":"

    Implementations: abs(x, option:overflow): -> return_type 0. abs(i8, option:overflow): -> i8 1. abs(i16, option:overflow): -> i16 2. abs(i32, option:overflow): -> i32 3. abs(i64, option:overflow): -> i64 4. abs(fp32): -> fp32 5. abs(fp64): -> fp64

    *Calculate the absolute value of the argument. Integer values allow the specification of overflow behavior to handle the unevenness of the two's complement range, e.g. the Int8 range [-128 : 127]. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#sign","title":"sign","text":"

    Implementations: sign(x): -> return_type 0. sign(i8): -> i8 1. sign(i16): -> i16 2. sign(i32): -> i32 3. sign(i64): -> i64 4. sign(fp32): -> fp32 5. sign(fp64): -> fp64

    *Return the signedness of the argument. Integer values return signedness with the same type as the input. Possible return values are [-1, 0, 1]. Floating point values return signedness with the same type as the input. Possible return values are [-1.0, -0.0, 0.0, 1.0, NaN]. *

    "},{"location":"extensions/functions_arithmetic/#factorial","title":"factorial","text":"

    Implementations: factorial(n, option:overflow): -> return_type 0. factorial(i32, option:overflow): -> i32 1. factorial(i64, option:overflow): -> i64

    *Return the factorial of a given integer input. The factorial of 0! is 1 by convention. Negative inputs will raise an error. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#bitwise_not","title":"bitwise_not","text":"

    Implementations: bitwise_not(x): -> return_type 0. bitwise_not(i8): -> i8 1. bitwise_not(i16): -> i16 2. bitwise_not(i32): -> i32 3. bitwise_not(i64): -> i64

    *Return the bitwise NOT result for one integer input. *

    "},{"location":"extensions/functions_arithmetic/#bitwise_and","title":"bitwise_and","text":"

    Implementations: bitwise_and(x, y): -> return_type 0. bitwise_and(i8, i8): -> i8 1. bitwise_and(i16, i16): -> i16 2. bitwise_and(i32, i32): -> i32 3. bitwise_and(i64, i64): -> i64

    *Return the bitwise AND result for two integer inputs. *

    "},{"location":"extensions/functions_arithmetic/#bitwise_or","title":"bitwise_or","text":"

    Implementations: bitwise_or(x, y): -> return_type 0. bitwise_or(i8, i8): -> i8 1. bitwise_or(i16, i16): -> i16 2. bitwise_or(i32, i32): -> i32 3. bitwise_or(i64, i64): -> i64

    *Return the bitwise OR result for two given integer inputs. *

    "},{"location":"extensions/functions_arithmetic/#bitwise_xor","title":"bitwise_xor","text":"

    Implementations: bitwise_xor(x, y): -> return_type 0. bitwise_xor(i8, i8): -> i8 1. bitwise_xor(i16, i16): -> i16 2. bitwise_xor(i32, i32): -> i32 3. bitwise_xor(i64, i64): -> i64

    *Return the bitwise XOR result for two integer inputs. *

    "},{"location":"extensions/functions_arithmetic/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_arithmetic/#sum","title":"sum","text":"

    Implementations: sum(x, option:overflow): -> return_type 0. sum(i8, option:overflow): -> i64? 1. sum(i16, option:overflow): -> i64? 2. sum(i32, option:overflow): -> i64? 3. sum(i64, option:overflow): -> i64? 4. sum(fp32, option:overflow): -> fp64? 5. sum(fp64, option:overflow): -> fp64?

    Sum a set of values. The sum of zero elements yields null.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#sum0","title":"sum0","text":"

    Implementations: sum0(x, option:overflow): -> return_type 0. sum0(i8, option:overflow): -> i64 1. sum0(i16, option:overflow): -> i64 2. sum0(i32, option:overflow): -> i64 3. sum0(i64, option:overflow): -> i64 4. sum0(fp32, option:overflow): -> fp64 5. sum0(fp64, option:overflow): -> fp64

    *Sum a set of values. The sum of zero elements yields zero. Null values are ignored. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#avg","title":"avg","text":"

    Implementations: avg(x, option:overflow): -> return_type 0. avg(i8, option:overflow): -> i8? 1. avg(i16, option:overflow): -> i16? 2. avg(i32, option:overflow): -> i32? 3. avg(i64, option:overflow): -> i64? 4. avg(fp32, option:overflow): -> fp32? 5. avg(fp64, option:overflow): -> fp64?

    Average a set of values. For integral types, this truncates partial values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#min","title":"min","text":"

    Implementations: min(x): -> return_type 0. min(i8): -> i8? 1. min(i16): -> i16? 2. min(i32): -> i32? 3. min(i64): -> i64? 4. min(fp32): -> fp32? 5. min(fp64): -> fp64?

    Min a set of values.

    "},{"location":"extensions/functions_arithmetic/#max","title":"max","text":"

    Implementations: max(x): -> return_type 0. max(i8): -> i8? 1. max(i16): -> i16? 2. max(i32): -> i32? 3. max(i64): -> i64? 4. max(fp32): -> fp32? 5. max(fp64): -> fp64?

    Max a set of values.

    "},{"location":"extensions/functions_arithmetic/#product","title":"product","text":"

    Implementations: product(x, option:overflow): -> return_type 0. product(i8, option:overflow): -> i8 1. product(i16, option:overflow): -> i16 2. product(i32, option:overflow): -> i32 3. product(i64, option:overflow): -> i64 4. product(fp32, option:rounding): -> fp32 5. product(fp64, option:rounding): -> fp64

    Product of a set of values. Returns 1 for empty input.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#std_dev","title":"std_dev","text":"

    Implementations: std_dev(x, option:rounding, option:distribution): -> return_type 0. std_dev(fp32, option:rounding, option:distribution): -> fp32? 1. std_dev(fp64, option:rounding, option:distribution): -> fp64?

    Calculates standard-deviation for a set of values.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • distribution ['SAMPLE', 'POPULATION']
  • "},{"location":"extensions/functions_arithmetic/#variance","title":"variance","text":"

    Implementations: variance(x, option:rounding, option:distribution): -> return_type 0. variance(fp32, option:rounding, option:distribution): -> fp32? 1. variance(fp64, option:rounding, option:distribution): -> fp64?

    Calculates variance for a set of values.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • distribution ['SAMPLE', 'POPULATION']
  • "},{"location":"extensions/functions_arithmetic/#corr","title":"corr","text":"

    Implementations: corr(x, y, option:rounding): -> return_type 0. corr(fp32, fp32, option:rounding): -> fp32? 1. corr(fp64, fp64, option:rounding): -> fp64?

    *Calculates the value of Pearson's correlation coefficient between x and y. If there is no input, null is returned. *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#mode","title":"mode","text":"

    Implementations: mode(x): -> return_type 0. mode(i8): -> i8? 1. mode(i16): -> i16? 2. mode(i32): -> i32? 3. mode(i64): -> i64? 4. mode(fp32): -> fp32? 5. mode(fp64): -> fp64?

    *Calculates mode for a set of values. If there is no input, null is returned. *

    "},{"location":"extensions/functions_arithmetic/#median","title":"median","text":"

    Implementations: median(precision, x, option:rounding): -> return_type 0. median(precision, i8, option:rounding): -> i8? 1. median(precision, i16, option:rounding): -> i16? 2. median(precision, i32, option:rounding): -> i32? 3. median(precision, i64, option:rounding): -> i64? 4. median(precision, fp32, option:rounding): -> fp32? 5. median(precision, fp64, option:rounding): -> fp64?

    *Calculate the median for a set of values. Returns null if applied to zero records. For the integer implementations, the rounding option determines how the median should be rounded if it ends up midway between two values. For the floating point implementations, they specify the usual floating point rounding mode. *

    Options:
  • precision ['EXACT', 'APPROXIMATE']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#quantile","title":"quantile","text":"

    Implementations: quantile(boundaries, precision, n, distribution, option:rounding): -> return_type

  • n: A positive integer which defines the number of quantile partitions.
  • distribution: The data for which the quantiles should be computed.
  • 0. quantile(boundaries, precision, i64, any, option:rounding): -> LIST?<any>

    *Calculates quantiles for a set of values. This function will divide the aggregated values (passed via the distribution argument) over N equally-sized bins, where N is passed via a constant argument. It will then return the values at the boundaries of these bins in list form. If the input is appropriately sorted, this computes the quantiles of the distribution. The function can optionally return the first and/or last element of the input, as specified by the boundaries argument. If the input is appropriately sorted, this will thus be the minimum and/or maximum values of the distribution. When the boundaries do not lie exactly on elements of the incoming distribution, the function will interpolate between the two nearby elements. If the interpolated value cannot be represented exactly, the rounding option controls how the value should be selected or computed. The function fails and returns null in the following cases: - n is null or less than one; - any value in distribution is null.

    The function returns an empty list if n equals 1 and boundaries is set to NEITHER. *

    Options:
  • boundaries ['NEITHER', 'MINIMUM', 'MAXIMUM', 'BOTH']
  • precision ['EXACT', 'APPROXIMATE']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#window-functions","title":"Window Functions","text":""},{"location":"extensions/functions_arithmetic/#row_number","title":"row_number","text":"

    Implementations: 0. row_number(): -> i64?

    the number of the current row within its partition, starting at 1

    "},{"location":"extensions/functions_arithmetic/#rank","title":"rank","text":"

    Implementations: 0. rank(): -> i64?

    the rank of the current row, with gaps.

    "},{"location":"extensions/functions_arithmetic/#dense_rank","title":"dense_rank","text":"

    Implementations: 0. dense_rank(): -> i64?

    the rank of the current row, without gaps.

    "},{"location":"extensions/functions_arithmetic/#percent_rank","title":"percent_rank","text":"

    Implementations: 0. percent_rank(): -> fp64?

    the relative rank of the current row.

    "},{"location":"extensions/functions_arithmetic/#cume_dist","title":"cume_dist","text":"

    Implementations: 0. cume_dist(): -> fp64?

    the cumulative distribution.

    "},{"location":"extensions/functions_arithmetic/#ntile","title":"ntile","text":"

    Implementations: ntile(x): -> return_type 0. ntile(i32): -> i32? 1. ntile(i64): -> i64?

    Return an integer ranging from 1 to the argument value, dividing the partition as equally as possible.

    "},{"location":"extensions/functions_arithmetic/#first_value","title":"first_value","text":"

    Implementations: first_value(expression): -> return_type 0. first_value(any1): -> any1

    *Returns the first value in the window. *

    "},{"location":"extensions/functions_arithmetic/#last_value","title":"last_value","text":"

    Implementations: last_value(expression): -> return_type 0. last_value(any1): -> any1

    *Returns the last value in the window. *

    "},{"location":"extensions/functions_arithmetic/#nth_value","title":"nth_value","text":"

    Implementations: nth_value(expression, window_offset, option:on_domain_error): -> return_type 0. nth_value(any1, i32, option:on_domain_error): -> any1?

    *Returns a value from the nth row based on the window_offset. window_offset should be a positive integer. If the value of the window_offset is outside the range of the window, null is returned. The on_domain_error option governs behavior in cases where window_offset is not a positive integer or null. *

    Options:
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#lead","title":"lead","text":"

    Implementations: lead(expression): -> return_type 0. lead(any1): -> any1? 1. lead(any1, i32): -> any1? 2. lead(any1, i32, any1): -> any1?

    *Return a value from a following row based on a specified physical offset. This allows you to compare a value in the current row against a following row. The expression is evaluated against a row that comes after the current row based on the row_offset. The row_offset should be a positive integer and is set to 1 if not specified explicitly. If the row_offset is negative, the expression will be evaluated against a row coming before the current row, similar to the lag function. A row_offset of null will return null. The function returns the default input value if row_offset goes beyond the scope of the window. If a default value is not specified, it is set to null. Example comparing the sales of the current year to the following year. row_offset of 1. | year | sales | next_year_sales | | 2019 | 20.50 | 30.00 | | 2020 | 30.00 | 45.99 | | 2021 | 45.99 | null | *

    "},{"location":"extensions/functions_arithmetic/#lag","title":"lag","text":"

    Implementations: lag(expression): -> return_type 0. lag(any1): -> any1? 1. lag(any1, i32): -> any1? 2. lag(any1, i32, any1): -> any1?

    *Return a column value from a previous row based on a specified physical offset. This allows you to compare a value in the current row against a previous row. The expression is evaluated against a row that comes before the current row based on the row_offset. The expression can be a column, expression or subquery that evaluates to a single value. The row_offset should be a positive integer and is set to 1 if not specified explicitly. If the row_offset is negative, the expression will be evaluated against a row coming after the current row, similar to the lead function. A row_offset of null will return null. The function returns the default input value if row_offset goes beyond the scope of the partition. If a default value is not specified, it is set to null. Example comparing the sales of the current year to the previous year. row_offset of 1. | year | sales | previous_year_sales | | 2019 | 20.50 | null | | 2020 | 30.00 | 20.50 | | 2021 | 45.99 | 30.00 | *

    "},{"location":"extensions/functions_arithmetic_decimal/","title":"functions_arithmetic_decimal.yaml","text":"

    This document file is generated for functions_arithmetic_decimal.yaml

    "},{"location":"extensions/functions_arithmetic_decimal/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_arithmetic_decimal/#add","title":"add","text":"

    Implementations: add(x, y, option:overflow): -> return_type 0. add(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = max(S1,S2)\ninit_prec = init_scale + max(P1 - S1, P2 - S2) + 1\nmin_scale = min(init_scale, 6)\ndelta = init_prec - 38\nprec = min(init_prec, 38)\nscale_after_borrow = max(init_scale - delta, min_scale)\nscale = init_prec > 38 ? scale_after_borrow : init_scale\nDECIMAL<prec, scale>  \n

    Add two decimal values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#subtract","title":"subtract","text":"

    Implementations: subtract(x, y, option:overflow): -> return_type 0. subtract(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = max(S1,S2)\ninit_prec = init_scale + max(P1 - S1, P2 - S2) + 1\nmin_scale = min(init_scale, 6)\ndelta = init_prec - 38\nprec = min(init_prec, 38)\nscale_after_borrow = max(init_scale - delta, min_scale)\nscale = init_prec > 38 ? scale_after_borrow : init_scale\nDECIMAL<prec, scale>  \n

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#multiply","title":"multiply","text":"

    Implementations: multiply(x, y, option:overflow): -> return_type 0. multiply(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = S1 + S2\ninit_prec = P1 + P2 + 1\nmin_scale = min(init_scale, 6)\ndelta = init_prec - 38\nprec = min(init_prec, 38)\nscale_after_borrow = max(init_scale - delta, min_scale)\nscale = init_prec > 38 ? scale_after_borrow : init_scale\nDECIMAL<prec, scale>  \n

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#divide","title":"divide","text":"

    Implementations: divide(x, y, option:overflow): -> return_type 0. divide(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = max(6, S1 + P2 + 1)\ninit_prec = P1 - S1 + P2 + init_scale\nmin_scale = min(init_scale, 6)\ndelta = init_prec - 38\nprec = min(init_prec, 38)\nscale_after_borrow = max(init_scale - delta, min_scale)\nscale = init_prec > 38 ? scale_after_borrow : init_scale\nDECIMAL<prec, scale>  \n

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#modulus","title":"modulus","text":"

    Implementations: modulus(x, y, option:overflow): -> return_type 0. modulus(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = max(S1,S2)
    init_prec = min(P1 - S1, P2 - S2) + init_scale
    min_scale = min(init_scale, 6)
    delta = init_prec - 38
    prec = min(init_prec, 38)
    scale_after_borrow = max(init_scale - delta, min_scale)
    scale = init_prec > 38 ? scale_after_borrow : init_scale
    DECIMAL<prec, scale>

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#abs","title":"abs","text":"

    Implementations: abs(x): -> return_type 0. abs(decimal<P,S>): -> decimal<P,S>

    Calculate the absolute value of the argument.

    "},{"location":"extensions/functions_arithmetic_decimal/#bitwise_and","title":"bitwise_and","text":"

    Implementations: bitwise_and(x, y): -> return_type 0. bitwise_and(DECIMAL<P1,0>, DECIMAL<P2,0>): ->

    max_precision = max(P1, P2)
    DECIMAL<max_precision, 0>

    *Return the bitwise AND result for two decimal inputs. The input scales must be 0 (i.e. only integer types are allowed). *

    "},{"location":"extensions/functions_arithmetic_decimal/#bitwise_or","title":"bitwise_or","text":"

    Implementations: bitwise_or(x, y): -> return_type 0. bitwise_or(DECIMAL<P1,0>, DECIMAL<P2,0>): ->

    max_precision = max(P1, P2)
    DECIMAL<max_precision, 0>

    *Return the bitwise OR result for two given decimal inputs. The input scales must be 0 (i.e. only integer types are allowed). *

    "},{"location":"extensions/functions_arithmetic_decimal/#bitwise_xor","title":"bitwise_xor","text":"

    Implementations: bitwise_xor(x, y): -> return_type 0. bitwise_xor(DECIMAL<P1,0>, DECIMAL<P2,0>): ->

    max_precision = max(P1, P2)
    DECIMAL<max_precision, 0>

    *Return the bitwise XOR result for two given decimal inputs. The input scales must be 0 (i.e. only integer types are allowed). *

    "},{"location":"extensions/functions_arithmetic_decimal/#sqrt","title":"sqrt","text":"

    Implementations: sqrt(x): -> return_type 0. sqrt(DECIMAL<P,S>): -> fp64

    Square root of the value. Sqrt of 0 is 0 and sqrt of negative values will raise an error.

    "},{"location":"extensions/functions_arithmetic_decimal/#factorial","title":"factorial","text":"

    Implementations: factorial(n): -> return_type 0. factorial(DECIMAL<P,0>): -> DECIMAL<38,0>

    *Return the factorial of a given decimal input. The scale of the decimal input should be 0. 0! is 1 by convention. Negative inputs will raise an error. Inputs which cause the result to overflow will raise an error. *

    "},{"location":"extensions/functions_arithmetic_decimal/#power","title":"power","text":"

    Implementations: power(x, y, option:overflow, option:complex_number_result): -> return_type 0. power(DECIMAL<P1,S1>, DECIMAL<P2,S2>, option:overflow, option:complex_number_result): -> fp64

    Take the power with x as the base and y as the exponent. Behavior for a complex number result is governed by the complex_number_result option.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • complex_number_result ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_arithmetic_decimal/#sum","title":"sum","text":"

    Implementations: sum(x, option:overflow): -> return_type 0. sum(DECIMAL<P, S>, option:overflow): -> DECIMAL?<38,S>

    Sum a set of values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#avg","title":"avg","text":"

    Implementations: avg(x, option:overflow): -> return_type 0. avg(DECIMAL<P,S>, option:overflow): -> DECIMAL<38,S>

    Average a set of values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#min","title":"min","text":"

    Implementations: min(x): -> return_type 0. min(DECIMAL<P, S>): -> DECIMAL?<P, S>

    Min a set of values.

    "},{"location":"extensions/functions_arithmetic_decimal/#max","title":"max","text":"

    Implementations: max(x): -> return_type 0. max(DECIMAL<P,S>): -> DECIMAL?<P, S>

    Max a set of values.

    "},{"location":"extensions/functions_arithmetic_decimal/#sum0","title":"sum0","text":"

    Implementations: sum0(x, option:overflow): -> return_type 0. sum0(DECIMAL<P, S>, option:overflow): -> DECIMAL<38,S>

    *Sum a set of values. The sum of zero elements yields zero. Null values are ignored. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_boolean/","title":"functions_boolean.yaml","text":"

    This document file is generated for functions_boolean.yaml

    "},{"location":"extensions/functions_boolean/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_boolean/#or","title":"or","text":"

    Implementations: or(a): -> return_type 0. or(boolean?): -> boolean?

    *The boolean or using Kleene logic. This function behaves as follows with nulls:

    true or null = true
    null or true = true
    false or null = null
    null or false = null
    null or null = null

    In other words, in this context a null value really means “unknown”, and an unknown value or true is always true. Behavior for 0 or 1 inputs is as follows: or() -> false or(x) -> x *
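    A small Python sketch of these Kleene semantics (None stands in for null):

    def kleene_or(a, b):
        # Any definite true wins; otherwise an unknown input poisons the result.
        if a is True or b is True:
            return True
        if a is None or b is None:
            return None
        return False

    assert kleene_or(None, True) is True
    assert kleene_or(False, None) is None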

    "},{"location":"extensions/functions_boolean/#and","title":"and","text":"

    Implementations: and(a): -> return_type 0. and(boolean?): -> boolean?

    *The boolean and using Kleene logic. This function behaves as follows with nulls:

    true and null = null
    null and true = null
    false and null = false
    null and false = false
    null and null = null

    In other words, in this context a null value really means “unknown”, and an unknown value and false is always false. Behavior for 0 or 1 inputs is as follows: and() -> true and(x) -> x *

    "},{"location":"extensions/functions_boolean/#and_not","title":"and_not","text":"

    Implementations: and_not(a, b): -> return_type 0. and_not(boolean?, boolean?): -> boolean?

    *The boolean and of one value and the negation of the other using Kleene logic. This function behaves as follows with nulls:

    true and not null = null
    null and not false = null
    false and not null = false
    null and not true = false
    null and not null = null

    In other words, in this context a null value really means “unknown”, and an unknown value and not true is always false, as is false and not an unknown value. *
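    Under the same three-valued model, and_not can be sketched as Kleene AND composed with Kleene NOT (None stands in for null; names are illustrative):

    def kleene_not(a):
        return None if a is None else not a

    def kleene_and(a, b):
        if a is False or b is False:
            return False
        if a is None or b is None:
            return None
        return True

    def kleene_and_not(a, b):
        return kleene_and(a, kleene_not(b))

    assert kleene_and_not(None, True) is False  # null and not true = false
    assert kleene_and_not(True, None) is None   # true and not null = null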

    "},{"location":"extensions/functions_boolean/#xor","title":"xor","text":"

    Implementations: xor(a, b): -> return_type 0. xor(boolean?, boolean?): -> boolean?

    *The boolean xor of two values using Kleene logic. When a null is encountered in either input, a null is output. *

    "},{"location":"extensions/functions_boolean/#not","title":"not","text":"

    Implementations: not(a): -> return_type 0. not(boolean?): -> boolean?

    *The not of a boolean value. When a null is input, a null is output. *

    "},{"location":"extensions/functions_boolean/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_boolean/#bool_and","title":"bool_and","text":"

    Implementations: bool_and(a): -> return_type 0. bool_and(boolean): -> boolean?

    *If any value in the input is false, false is returned. If the input is empty or only contains nulls, null is returned. Otherwise, true is returned. *

    "},{"location":"extensions/functions_boolean/#bool_or","title":"bool_or","text":"

    Implementations: bool_or(a): -> return_type 0. bool_or(boolean): -> boolean?

    *If any value in the input is true, true is returned. If the input is empty or only contains nulls, null is returned. Otherwise, false is returned. *

    "},{"location":"extensions/functions_comparison/","title":"functions_comparison.yaml","text":"

    This document file is generated for functions_comparison.yaml

    "},{"location":"extensions/functions_comparison/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_comparison/#not_equal","title":"not_equal","text":"

    Implementations: not_equal(x, y): -> return_type 0. not_equal(any1, any1): -> boolean

    *Whether two values are not_equal. not_equal(x, y) := (x != y) If either/both of x and y are null, null is returned. *

    "},{"location":"extensions/functions_comparison/#equal","title":"equal","text":"

    Implementations: equal(x, y): -> return_type 0. equal(any1, any1): -> boolean

    *Whether two values are equal. equal(x, y) := (x == y) If either/both of x and y are null, null is returned. *

    "},{"location":"extensions/functions_comparison/#is_not_distinct_from","title":"is_not_distinct_from","text":"

    Implementations: is_not_distinct_from(x, y): -> return_type 0. is_not_distinct_from(any1, any1): -> boolean

    *Whether two values are equal. This function treats null values as comparable, so is_not_distinct_from(null, null) == True This is in contrast to equal, in which null values do not compare. *

    "},{"location":"extensions/functions_comparison/#is_distinct_from","title":"is_distinct_from","text":"

    Implementations: is_distinct_from(x, y): -> return_type 0. is_distinct_from(any1, any1): -> boolean

    *Whether two values are not equal. This function treats null values as comparable, so is_distinct_from(null, null) == False This is in contrast to equal, in which null values do not compare. *

    "},{"location":"extensions/functions_comparison/#lt","title":"lt","text":"

    Implementations: lt(x, y): -> return_type 0. lt(any1, any1): -> boolean

    *Less than. lt(x, y) := (x < y) If either/both of x and y are null, null is returned. *

    "},{"location":"extensions/functions_comparison/#gt","title":"gt","text":"

    Implementations: gt(x, y): -> return_type 0. gt(any1, any1): -> boolean

    *Greater than. gt(x, y) := (x > y) If either/both of x and y are null, null is returned. *

    "},{"location":"extensions/functions_comparison/#lte","title":"lte","text":"

    Implementations: lte(x, y): -> return_type 0. lte(any1, any1): -> boolean

    *Less than or equal to. lte(x, y) := (x <= y) If either/both of x and y are null, null is returned. *

    "},{"location":"extensions/functions_comparison/#gte","title":"gte","text":"

    Implementations: gte(x, y): -> return_type 0. gte(any1, any1): -> boolean

    *Greater than or equal to. gte(x, y) := (x >= y) If either/both of x and y are null, null is returned. *

    "},{"location":"extensions/functions_comparison/#between","title":"between","text":"

    Implementations: between(expression, low, high): -> return_type

  • expression: The expression to test for in the range defined by `low` and `high`.
  • low: The value to check if greater than or equal to.
  • high: The value to check if less than or equal to.
  • 0. between(any1, any1, any1): -> boolean

    Whether the expression is greater than or equal to low and less than or equal to high. expression BETWEEN low AND high If low, high, or expression are null, null is returned.
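    Equivalently, between composes the gte and lte comparisons with the same null propagation; a quick sketch with None standing in for null:

    def between(expression, low, high):
        if expression is None or low is None or high is None:
            return None
        return low <= expression <= high

    assert between(5, 1, 10) is True
    assert between(5, None, 10) is None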

    "},{"location":"extensions/functions_comparison/#is_null","title":"is_null","text":"

    Implementations: is_null(x): -> return_type 0. is_null(any1): -> boolean

    Whether a value is null. NaN is not null.

    "},{"location":"extensions/functions_comparison/#is_not_null","title":"is_not_null","text":"

    Implementations: is_not_null(x): -> return_type 0. is_not_null(any1): -> boolean

    Whether a value is not null. NaN is not null.

    "},{"location":"extensions/functions_comparison/#is_nan","title":"is_nan","text":"

    Implementations: is_nan(x): -> return_type 0. is_nan(fp32): -> boolean 1. is_nan(fp64): -> boolean

    *Whether a value is not a number. If x is null, null is returned. *

    "},{"location":"extensions/functions_comparison/#is_finite","title":"is_finite","text":"

    Implementations: is_finite(x): -> return_type 0. is_finite(fp32): -> boolean 1. is_finite(fp64): -> boolean

    *Whether a value is finite (neither infinite nor NaN). If x is null, null is returned. *

    "},{"location":"extensions/functions_comparison/#is_infinite","title":"is_infinite","text":"

    Implementations: is_infinite(x): -> return_type 0. is_infinite(fp32): -> boolean 1. is_infinite(fp64): -> boolean

    *Whether a value is infinite. If x is null, null is returned. *

    "},{"location":"extensions/functions_comparison/#nullif","title":"nullif","text":"

    Implementations: nullif(x, y): -> return_type 0. nullif(any1, any1): -> any1

    If two values are equal, return null. Otherwise, return the first value.

    "},{"location":"extensions/functions_comparison/#coalesce","title":"coalesce","text":"

    Implementations: 0. coalesce(any1, any1): -> any1

    Evaluate arguments from left to right and return the first argument that is not null. Once a non-null argument is found, the remaining arguments are not evaluated. If all arguments are null, return null.
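    The short-circuit behavior matters: arguments after the first non-null must not be evaluated. A sketch modeling lazy arguments as thunks (None stands in for null):

    def coalesce(*thunks):
        for thunk in thunks:
            value = thunk()
            if value is not None:
                return value
        return None

    # The third argument would raise if evaluated, but never runs:
    assert coalesce(lambda: None, lambda: 42, lambda: 1 // 0) == 42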

    "},{"location":"extensions/functions_comparison/#least","title":"least","text":"

    Implementations: 0. least(any1, any1): -> any1

    Evaluates each argument and returns the smallest one. The function will return null if any argument evaluates to null.

    "},{"location":"extensions/functions_comparison/#least_skip_null","title":"least_skip_null","text":"

    Implementations: 0. least_skip_null(any1, any1): -> any1

    Evaluates each argument and returns the smallest one. The function will return null only if all arguments evaluate to null.

    "},{"location":"extensions/functions_comparison/#greatest","title":"greatest","text":"

    Implementations: 0. greatest(any1, any1): -> any1

    Evaluates each argument and returns the largest one. The function will return null if any argument evaluates to null.

    "},{"location":"extensions/functions_comparison/#greatest_skip_null","title":"greatest_skip_null","text":"

    Implementations: 0. greatest_skip_null(any1, any1): -> any1

    Evaluates each argument and returns the largest one. The function will return null only if all arguments evaluate to null.
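    The four functions differ only in null handling; a compact sketch (None stands in for null):

    def least(*vals):
        return None if any(v is None for v in vals) else min(vals)

    def least_skip_null(*vals):
        non_null = [v for v in vals if v is not None]
        return min(non_null) if non_null else None

    assert least(3, None) is None
    assert least_skip_null(3, None) == 3
    # greatest / greatest_skip_null are identical with max() in place of min()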

    "},{"location":"extensions/functions_datetime/","title":"functions_datetime.yaml","text":"

    This document file is generated for functions_datetime.yaml

    "},{"location":"extensions/functions_datetime/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_datetime/#extract","title":"extract","text":"

    Implementations: extract(component, x, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.
  • 0. extract(component, timestamp_tz, string): -> i64 1. extract(component, precision_timestamp_tz<P>, string): -> i64 2. extract(component, timestamp): -> i64 3. extract(component, precision_timestamp<P>): -> i64 4. extract(component, date): -> i64 5. extract(component, time): -> i64 6. extract(component, indexing, timestamp_tz, string): -> i64 7. extract(component, indexing, precision_timestamp_tz<P>, string): -> i64 8. extract(component, indexing, timestamp): -> i64 9. extract(component, indexing, precision_timestamp<P>): -> i64 10. extract(component, indexing, date): -> i64

    Extract portion of a date/time value.
    * YEAR Return the year.
    * ISO_YEAR Return the ISO 8601 week-numbering year. First week of an ISO year has the majority (4 or more) of its days in January.
    * US_YEAR Return the US epidemiological year. First week of US epidemiological year has the majority (4 or more) of its days in January. Last week of US epidemiological year has the year’s last Wednesday in it. US epidemiological week starts on Sunday.
    * QUARTER Return the number of the quarter within the year. January 1 through March 31 map to the first quarter, April 1 through June 30 map to the second quarter, etc.
    * MONTH Return the number of the month within the year.
    * DAY Return the number of the day within the month.
    * DAY_OF_YEAR Return the number of the day within the year. January 1 maps to the first day, February 1 maps to the thirty-second day, etc.
    * MONDAY_DAY_OF_WEEK Return the number of the day within the week, from Monday (first day) to Sunday (seventh day).
    * SUNDAY_DAY_OF_WEEK Return the number of the day within the week, from Sunday (first day) to Saturday (seventh day).
    * MONDAY_WEEK Return the number of the week within the year. First week starts on first Monday of January.
    * SUNDAY_WEEK Return the number of the week within the year. First week starts on first Sunday of January.
    * ISO_WEEK Return the number of the ISO week within the ISO year. First ISO week has the majority (4 or more) of its days in January. ISO week starts on Monday.
    * US_WEEK Return the number of the US week within the US year. First US week has the majority (4 or more) of its days in January. US week starts on Sunday.
    * HOUR Return the hour (0-23).
    * MINUTE Return the minute (0-59).
    * SECOND Return the second (0-59).
    * MILLISECOND Return number of milliseconds since the last full second.
    * MICROSECOND Return number of microseconds since the last full millisecond.
    * NANOSECOND Return number of nanoseconds since the last full microsecond.
    * SUBSECOND Return number of microseconds since the last full second of the given timestamp.
    * UNIX_TIME Return number of seconds that have elapsed since 1970-01-01 00:00:00 UTC, ignoring leap seconds.
    * TIMEZONE_OFFSET Return number of seconds of timezone offset to UTC.

    The range of values returned for QUARTER, MONTH, DAY, DAY_OF_YEAR, MONDAY_DAY_OF_WEEK, SUNDAY_DAY_OF_WEEK, MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK depends on whether counting starts at 1 or 0. This is governed by the indexing option.

    When indexing is ONE:
    * QUARTER returns values in range 1-4
    * MONTH returns values in range 1-12
    * DAY returns values in range 1-31
    * DAY_OF_YEAR returns values in range 1-366
    * MONDAY_DAY_OF_WEEK and SUNDAY_DAY_OF_WEEK return values in range 1-7
    * MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK return values in range 1-53

    When indexing is ZERO:
    * QUARTER returns values in range 0-3
    * MONTH returns values in range 0-11
    * DAY returns values in range 0-30
    * DAY_OF_YEAR returns values in range 0-365
    * MONDAY_DAY_OF_WEEK and SUNDAY_DAY_OF_WEEK return values in range 0-6
    * MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK return values in range 0-52

    The indexing option must be specified when the component is QUARTER, MONTH, DAY, DAY_OF_YEAR, MONDAY_DAY_OF_WEEK, SUNDAY_DAY_OF_WEEK, MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, or US_WEEK. The indexing option cannot be specified when the component is YEAR, ISO_YEAR, US_YEAR, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, SUBSECOND, UNIX_TIME, or TIMEZONE_OFFSET.

    Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
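    As one concrete case, the effect of the indexing option on a component such as MONTH can be sketched as (illustrative only):

    from datetime import date

    def extract_month(value, indexing):
        # MONTH under ONE-based vs ZERO-based indexing.
        return value.month if indexing == "ONE" else value.month - 1

    assert extract_month(date(2024, 3, 1), "ONE") == 3
    assert extract_month(date(2024, 3, 1), "ZERO") == 2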

    Options:
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND', 'UNIX_TIME', 'TIMEZONE_OFFSET']
  • indexing ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'NANOSECOND', 'SUBSECOND', 'UNIX_TIME', 'TIMEZONE_OFFSET']
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND', 'UNIX_TIME']
  • indexing ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'NANOSECOND', 'SUBSECOND', 'UNIX_TIME']
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'UNIX_TIME']
  • indexing ['HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND']
  • component ['QUARTER', 'MONTH', 'DAY', 'DAY_OF_YEAR', 'MONDAY_DAY_OF_WEEK', 'SUNDAY_DAY_OF_WEEK', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK']
  • indexing ['ONE', 'ZERO']
  • "},{"location":"extensions/functions_datetime/#extract_boolean","title":"extract_boolean","text":"

    Implementations: extract_boolean(component, x): -> return_type 0. extract_boolean(component, timestamp): -> boolean 1. extract_boolean(component, precision_timestamp<P>): -> boolean 2. extract_boolean(component, timestamp_tz, string): -> boolean 3. extract_boolean(component, precision_timestamp_tz<P>, string): -> boolean 4. extract_boolean(component, date): -> boolean

    *Extract boolean values of a date/time value.
    * IS_LEAP_YEAR Return true if the year of the given value is a leap year and false otherwise.
    * IS_DST Return true if DST (Daylight Saving Time) is observed at the given value in the given timezone.

    Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.*

    Options:
  • component ['IS_LEAP_YEAR']
  • component ['IS_LEAP_YEAR', 'IS_DST']
  • "},{"location":"extensions/functions_datetime/#add","title":"add","text":"

    Implementations: add(x, y): -> return_type 0. add(timestamp, interval_year): -> timestamp 1. add(precision_timestamp<P>, interval_year): -> precision_timestamp<P> 2. add(timestamp_tz, interval_year, string): -> timestamp_tz 3. add(precision_timestamp_tz<P>, interval_year, string): -> precision_timestamp_tz<P> 4. add(date, interval_year): -> timestamp 5. add(timestamp, interval_day<P>): -> timestamp 6. add(precision_timestamp<P>, interval_day<P>): -> precision_timestamp<P> 7. add(timestamp_tz, interval_day<P>): -> timestamp_tz 8. add(precision_timestamp_tz<P>, interval_day<P>): -> precision_timestamp_tz<P> 9. add(date, interval_day<P>): -> timestamp

    Add an interval to a date/time type. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    "},{"location":"extensions/functions_datetime/#multiply","title":"multiply","text":"

    Implementations: multiply(x, y): -> return_type 0. multiply(i8, interval_day<P>): -> interval_day<P> 1. multiply(i16, interval_day<P>): -> interval_day<P> 2. multiply(i32, interval_day<P>): -> interval_day<P> 3. multiply(i64, interval_day<P>): -> interval_day<P> 4. multiply(i8, interval_year): -> interval_year 5. multiply(i16, interval_year): -> interval_year 6. multiply(i32, interval_year): -> interval_year 7. multiply(i64, interval_year): -> interval_year

    Multiply an interval by an integral number.

    "},{"location":"extensions/functions_datetime/#add_intervals","title":"add_intervals","text":"

    Implementations: add_intervals(x, y): -> return_type 0. add_intervals(interval_day<P>, interval_day<P>): -> interval_day<P> 1. add_intervals(interval_year, interval_year): -> interval_year

    Add two intervals together.

    "},{"location":"extensions/functions_datetime/#subtract","title":"subtract","text":"

    Implementations: subtract(x, y): -> return_type 0. subtract(timestamp, interval_year): -> timestamp 1. subtract(precision_timestamp<P>, interval_year): -> precision_timestamp<P> 2. subtract(timestamp_tz, interval_year): -> timestamp_tz 3. subtract(precision_timestamp_tz<P>, interval_year): -> precision_timestamp_tz<P> 4. subtract(timestamp_tz, interval_year, string): -> timestamp_tz 5. subtract(precision_timestamp_tz<P>, interval_year, string): -> precision_timestamp_tz<P> 6. subtract(date, interval_year): -> date 7. subtract(timestamp, interval_day<P>): -> timestamp 8. subtract(precision_timestamp<P>, interval_day<P>): -> precision_timestamp<P> 9. subtract(timestamp_tz, interval_day<P>): -> timestamp_tz 10. subtract(precision_timestamp_tz<P>, interval_day<P>): -> precision_timestamp_tz<P> 11. subtract(date, interval_day<P>): -> date

    Subtract an interval from a date/time type. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    "},{"location":"extensions/functions_datetime/#lte","title":"lte","text":"

    Implementations: lte(x, y): -> return_type 0. lte(timestamp, timestamp): -> boolean 1. lte(precision_timestamp<P>, precision_timestamp<P>): -> boolean 2. lte(timestamp_tz, timestamp_tz): -> boolean 3. lte(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean 4. lte(date, date): -> boolean 5. lte(interval_day<P>, interval_day<P>): -> boolean 6. lte(interval_year, interval_year): -> boolean

    less than or equal to

    "},{"location":"extensions/functions_datetime/#lt","title":"lt","text":"

    Implementations: lt(x, y): -> return_type 0. lt(timestamp, timestamp): -> boolean 1. lt(precision_timestamp<P>, precision_timestamp<P>): -> boolean 2. lt(timestamp_tz, timestamp_tz): -> boolean 3. lt(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean 4. lt(date, date): -> boolean 5. lt(interval_day<P>, interval_day<P>): -> boolean 6. lt(interval_year, interval_year): -> boolean

    less than

    "},{"location":"extensions/functions_datetime/#gte","title":"gte","text":"

    Implementations: gte(x, y): -> return_type 0. gte(timestamp, timestamp): -> boolean 1. gte(precision_timestamp<P>, precision_timestamp<P>): -> boolean 2. gte(timestamp_tz, timestamp_tz): -> boolean 3. gte(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean 4. gte(date, date): -> boolean 5. gte(interval_day<P>, interval_day<P>): -> boolean 6. gte(interval_year, interval_year): -> boolean

    greater than or equal to

    "},{"location":"extensions/functions_datetime/#gt","title":"gt","text":"

    Implementations: gt(x, y): -> return_type 0. gt(timestamp, timestamp): -> boolean 1. gt(precision_timestamp<P>, precision_timestamp<P>): -> boolean 2. gt(timestamp_tz, timestamp_tz): -> boolean 3. gt(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean 4. gt(date, date): -> boolean 5. gt(interval_day<P>, interval_day<P>): -> boolean 6. gt(interval_year, interval_year): -> boolean

    greater than

    "},{"location":"extensions/functions_datetime/#assume_timezone","title":"assume_timezone","text":"

    Implementations: assume_timezone(x, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.
  • 0. assume_timezone(timestamp, string): -> timestamp_tz 1. assume_timezone(precision_timestamp<P>, string): -> precision_timestamp_tz<P> 2. assume_timezone(date, string): -> timestamp_tz

    Convert local timestamp to UTC-relative timestamp_tz using given local time’s timezone. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
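    A minimal sketch of the intended semantics using Python's zoneinfo (illustrative, not normative):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    def assume_timezone(local_ts, tz):
        # Interpret a naive local timestamp in tz; return the UTC-relative instant.
        return local_ts.replace(tzinfo=ZoneInfo(tz)).astimezone(timezone.utc)

    assume_timezone(datetime(2024, 3, 1, 12, 0), "Pacific/Marquesas")
    # -> 2024-03-01 21:30:00+00:00 (Marquesas is UTC-09:30)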

    "},{"location":"extensions/functions_datetime/#local_timestamp","title":"local_timestamp","text":"

    Implementations: local_timestamp(x, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.
  • 0. local_timestamp(timestamp_tz, string): -> timestamp 1. local_timestamp(precision_timestamp_tz<P>, string): -> precision_timestamp<P>

    Convert UTC-relative timestamp_tz to local timestamp using given local time’s timezone. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    "},{"location":"extensions/functions_datetime/#strptime_time","title":"strptime_time","text":"

    Implementations: strptime_time(time_string, format): -> return_type 0. strptime_time(string, string): -> time

    Parse string into time using provided format, see https://man7.org/linux/man-pages/man3/strptime.3.html for reference.

    "},{"location":"extensions/functions_datetime/#strptime_date","title":"strptime_date","text":"

    Implementations: strptime_date(date_string, format): -> return_type 0. strptime_date(string, string): -> date

    Parse string into date using provided format, see https://man7.org/linux/man-pages/man3/strptime.3.html for reference.

    "},{"location":"extensions/functions_datetime/#strptime_timestamp","title":"strptime_timestamp","text":"

    Implementations: strptime_timestamp(timestamp_string, format, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.
  • 0. strptime_timestamp(string, string, string): -> timestamp_tz 1. strptime_timestamp(string, string): -> timestamp_tz

    Parse string into timestamp using provided format, see https://man7.org/linux/man-pages/man3/strptime.3.html for reference. If timezone is present in timestamp and provided as parameter an error is thrown. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is supplied as parameter and present in the parsed string the parsed timezone is used. If parameter supplied timezone is invalid an error is thrown.
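    Python's datetime.strptime accepts a similar (though not identical) directive set to the referenced man page, which makes for a quick illustration of the parsing step:

    from datetime import datetime

    datetime.strptime("2024-03-01 12:30:00", "%Y-%m-%d %H:%M:%S")
    # For strptime_timestamp, the parsed value is then anchored to the supplied
    # (or parsed) timezone to yield a timestamp_tz.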

    "},{"location":"extensions/functions_datetime/#strftime","title":"strftime","text":"

    Implementations: strftime(x, format): -> return_type 0. strftime(timestamp, string): -> string 1. strftime(precision_timestamp<P>, string): -> string 2. strftime(timestamp_tz, string, string): -> string 3. strftime(precision_timestamp_tz<P>, string, string): -> string 4. strftime(date, string): -> string 5. strftime(time, string): -> string

    Convert timestamp/date/time to string using provided format, see https://man7.org/linux/man-pages/man3/strftime.3.html for reference. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    "},{"location":"extensions/functions_datetime/#round_temporal","title":"round_temporal","text":"

    Implementations: round_temporal(x, rounding, unit, multiple, origin): -> return_type 0. round_temporal(timestamp, rounding, unit, i64, timestamp): -> timestamp 1. round_temporal(precision_timestamp<P>, rounding, unit, i64, precision_timestamp<P>): -> precision_timestamp<P> 2. round_temporal(timestamp_tz, rounding, unit, i64, string, timestamp_tz): -> timestamp_tz 3. round_temporal(precision_timestamp_tz<P>, rounding, unit, i64, string, precision_timestamp_tz<P>): -> precision_timestamp_tz<P> 4. round_temporal(date, rounding, unit, i64, date): -> date 5. round_temporal(time, rounding, unit, i64, time): -> time

    Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple from the origin in the given timezone, the resulting point is chosen as one of the two nearest multiples. Which of these is chosen is governed by rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.
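    As one concrete case, FLOOR rounding to a multiple of a unit measured from an origin can be sketched as (illustrative only):

    from datetime import datetime, timedelta

    def round_temporal_floor(ts, multiple, origin):
        steps = (ts - origin) // multiple  # timedelta // timedelta yields an int
        return origin + steps * multiple

    round_temporal_floor(datetime(2024, 3, 1, 12, 47), timedelta(minutes=15),
                         datetime(2024, 3, 1))  # -> 2024-03-01 12:45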

    Options:
  • rounding ['FLOOR', 'CEIL', 'ROUND_TIE_DOWN', 'ROUND_TIE_UP']
  • unit ['YEAR', 'MONTH', 'WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • rounding ['YEAR', 'MONTH', 'WEEK', 'DAY']
  • unit ['HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • "},{"location":"extensions/functions_datetime/#round_calendar","title":"round_calendar","text":"

    Implementations: round_calendar(x, rounding, unit, origin, multiple): -> return_type 0. round_calendar(timestamp, rounding, unit, origin, i64): -> timestamp 1. round_calendar(precision_timestamp<P>, rounding, unit, origin, i64): -> precision_timestamp<P> 2. round_calendar(timestamp_tz, rounding, unit, origin, i64, string): -> timestamp_tz 3. round_calendar(precision_timestamp_tz<P>, rounding, unit, origin, i64, string): -> precision_timestamp_tz<P> 4. round_calendar(date, rounding, unit, origin, i64, date): -> date 5. round_calendar(time, rounding, unit, origin, i64, time): -> time

    Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple from the last origin unit in the given timezone, the resulting point is chosen as one of the two nearest multiples. Which of these is chosen is governed by rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: “Pacific/Marquesas”, “Etc/GMT+1”. If timezone is invalid an error is thrown.

    Options:
  • rounding ['FLOOR', 'CEIL', 'ROUND_TIE_DOWN', 'ROUND_TIE_UP']
  • unit ['YEAR', 'MONTH', 'WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • origin ['YEAR', 'MONTH', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND']
  • rounding ['YEAR', 'MONTH', 'WEEK', 'DAY']
  • unit ['YEAR', 'MONTH', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK', 'DAY']
  • origin ['DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • rounding ['DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND']
  • "},{"location":"extensions/functions_datetime/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_datetime/#min","title":"min","text":"

    Implementations: min(x): -> return_type 0. min(date): -> date? 1. min(time): -> time? 2. min(timestamp): -> timestamp? 3. min(precision_timestamp<P>): -> precision_timestamp?<P> 4. min(timestamp_tz): -> timestamp_tz? 5. min(precision_timestamp_tz<P>): -> precision_timestamp_tz?<P> 6. min(interval_day<P>): -> interval_day?<P> 7. min(interval_year): -> interval_year?

    Min a set of values.

    "},{"location":"extensions/functions_datetime/#max","title":"max","text":"

    Implementations: max(x): -> return_type 0. max(date): -> date? 1. max(time): -> time? 2. max(timestamp): -> timestamp? 3. max(timestamp_tz): -> timestamp_tz? 4. max(precision_timestamp_tz<P>): -> precision_timestamp_tz?<P> 5. max(interval_day<P>): -> interval_day?<P> 6. max(interval_year): -> interval_year?

    Max a set of values.

    "},{"location":"extensions/functions_geometry/","title":"functions_geometry.yaml","text":"

    This document file is generated for functions_geometry.yaml

    "},{"location":"extensions/functions_geometry/#data-types","title":"Data Types","text":"

    name: geometry structure: BINARY

    "},{"location":"extensions/functions_geometry/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_geometry/#point","title":"point","text":"

    Implementations: point(x, y): -> return_type 0. point(fp64, fp64): -> u!geometry

    *Returns a 2D point with the given x and y coordinate values. *

    "},{"location":"extensions/functions_geometry/#make_line","title":"make_line","text":"

    Implementations: make_line(geom1, geom2): -> return_type 0. make_line(u!geometry, u!geometry): -> u!geometry

    *Returns a linestring connecting the endpoint of geometry geom1 to the start point of geometry geom2. Repeated points at the beginning of input geometries are collapsed to a single point. A linestring can be closed or simple. A closed linestring starts and ends on the same point. A simple linestring does not cross or touch itself. *

    "},{"location":"extensions/functions_geometry/#x_coordinate","title":"x_coordinate","text":"

    Implementations: x_coordinate(point): -> return_type 0. x_coordinate(u!geometry): -> fp64

    *Return the x coordinate of the point. Return null if not available. *

    "},{"location":"extensions/functions_geometry/#y_coordinate","title":"y_coordinate","text":"

    Implementations: y_coordinate(point): -> return_type 0. y_coordinate(u!geometry): -> fp64

    *Return the y coordinate of the point. Return null if not available. *

    "},{"location":"extensions/functions_geometry/#num_points","title":"num_points","text":"

    Implementations: num_points(geom): -> return_type 0. num_points(u!geometry): -> i64

    *Return the number of points in the geometry. The geometry should be a linestring or circularstring. *

    "},{"location":"extensions/functions_geometry/#is_empty","title":"is_empty","text":"

    Implementations: is_empty(geom): -> return_type 0. is_empty(u!geometry): -> boolean

    *Return true if the geometry is an empty geometry. *

    "},{"location":"extensions/functions_geometry/#is_closed","title":"is_closed","text":"

    Implementations: is_closed(geom): -> return_type 0. is_closed(u!geometry): -> boolean

    *Return true if the geometry’s start and end points are the same. *

    "},{"location":"extensions/functions_geometry/#is_simple","title":"is_simple","text":"

    Implementations: is_simple(geom): -> return_type 0. is_simple(u!geometry): -> boolean

    *Return true if the geometry does not self intersect. *

    "},{"location":"extensions/functions_geometry/#is_ring","title":"is_ring","text":"

    Implementations: is_ring(geom): -> return_type 0. is_ring(u!geometry): -> boolean

    *Return true if the geometry’s start and end points are the same and it does not self intersect. *

    "},{"location":"extensions/functions_geometry/#geometry_type","title":"geometry_type","text":"

    Implementations: geometry_type(geom): -> return_type 0. geometry_type(u!geometry): -> string

    *Return the type of geometry as a string. *

    "},{"location":"extensions/functions_geometry/#envelope","title":"envelope","text":"

    Implementations: envelope(geom): -> return_type 0. envelope(u!geometry): -> u!geometry

    *Return the minimum bounding box for the input geometry as a geometry. The returned geometry is defined by the corner points of the bounding box. If the input geometry is a point or a line, the returned geometry can also be a point or line. *

    "},{"location":"extensions/functions_geometry/#dimension","title":"dimension","text":"

    Implementations: dimension(geom): -> return_type 0. dimension(u!geometry): -> i8

    *Return the dimension of the input geometry. If the input is a collection of geometries, return the largest dimension from the collection. Dimensionality is determined by the complexity of the input and not the coordinate system being used. Type dimensions: POINT - 0 LINE - 1 POLYGON - 2 *

    "},{"location":"extensions/functions_geometry/#is_valid","title":"is_valid","text":"

    Implementations: is_valid(geom): -> return_type 0. is_valid(u!geometry): -> boolean

    *Return true if the input geometry is a valid 2D geometry. For 3 dimensional and 4 dimensional geometries, the validity is still only tested in 2 dimensions. *

    "},{"location":"extensions/functions_geometry/#collection_extract","title":"collection_extract","text":"

    Implementations: collection_extract(geom_collection): -> return_type 0. collection_extract(u!geometry): -> u!geometry 1. collection_extract(u!geometry, i8): -> u!geometry

    *Given the input geometry collection, return a homogeneous multi-geometry. All geometries in the multi-geometry will have the same dimension. If type is not specified, the multi-geometry will only contain geometries of the highest dimension. If type is specified, the multi-geometry will only contain geometries of that type. If there are no geometries of the specified type, an empty geometry is returned. Only points, linestrings, and polygons are supported. Type numbers: POINT - 0 LINE - 1 POLYGON - 2 *

    "},{"location":"extensions/functions_geometry/#flip_coordinates","title":"flip_coordinates","text":"

    Implementations: flip_coordinates(geom_collection): -> return_type 0. flip_coordinates(u!geometry): -> u!geometry

    *Return a version of the input geometry with the X and Y axis flipped. This operation can be performed on geometries with more than 2 dimensions. However, only X and Y axis will be flipped. *

    "},{"location":"extensions/functions_geometry/#remove_repeated_points","title":"remove_repeated_points","text":"

    Implementations: remove_repeated_points(geom): -> return_type 0. remove_repeated_points(u!geometry): -> u!geometry 1. remove_repeated_points(u!geometry, fp64): -> u!geometry

    *Return a version of the input geometry with duplicate consecutive points removed. If the tolerance argument is provided, consecutive points within the tolerance distance of one another are considered to be duplicates. *

    "},{"location":"extensions/functions_geometry/#buffer","title":"buffer","text":"

    Implementations: buffer(geom, buffer_radius): -> return_type 0. buffer(u!geometry, fp64): -> u!geometry

    *Compute and return an expanded version of the input geometry. All the points of the returned geometry are at a distance of buffer_radius away from the points of the input geometry. If a negative buffer_radius is provided, the geometry will shrink instead of expand. A negative buffer_radius may shrink the geometry completely, in which case an empty geometry is returned. For input geometries of points or lines, a negative buffer_radius will always return an empty geometry. *

    "},{"location":"extensions/functions_geometry/#centroid","title":"centroid","text":"

    Implementations: centroid(geom): -> return_type 0. centroid(u!geometry): -> u!geometry

    *Return a point which is the geometric center of mass of the input geometry. *

    "},{"location":"extensions/functions_geometry/#minimum_bounding_circle","title":"minimum_bounding_circle","text":"

    Implementations: minimum_bounding_circle(geom): -> return_type 0. minimum_bounding_circle(u!geometry): -> u!geometry

    *Return the smallest circle polygon that contains the input geometry. *

    "},{"location":"extensions/functions_logarithmic/","title":"functions_logarithmic.yaml","text":"

    This document file is generated for functions_logarithmic.yaml

    "},{"location":"extensions/functions_logarithmic/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_logarithmic/#ln","title":"ln","text":"

    Implementations: ln(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type 0. ln(i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 1. ln(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32 2. ln(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 3. ln(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Natural logarithm of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']
  • "},{"location":"extensions/functions_logarithmic/#log10","title":"log10","text":"

    Implementations: log10(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type 0. log10(i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 1. log10(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32 2. log10(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 3. log10(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Logarithm to base 10 of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']
  • "},{"location":"extensions/functions_logarithmic/#log2","title":"log2","text":"

    Implementations: log2(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type 0. log2(i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 1. log2(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32 2. log2(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 3. log2(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Logarithm to base 2 of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']
  • "},{"location":"extensions/functions_logarithmic/#logb","title":"logb","text":"

    Implementations: logb(x, base, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type

  • x: The number `x` to compute the logarithm of
  • base: The logarithm base `b` to use
  • 0. logb(i64, i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 1. logb(fp32, fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32 2. logb(fp64, fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 3. logb(decimal<P1,S1>, decimal<P1,S1>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    *Logarithm of the value with the given base logb(x, b) => log_{b} (x) *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']
  • "},{"location":"extensions/functions_logarithmic/#log1p","title":"log1p","text":"

    Implementations: log1p(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type 0. log1p(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32 1. log1p(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 2. log1p(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    *Natural logarithm (base e) of 1 + x log1p(x) => log(1+x) *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']
  • "},{"location":"extensions/functions_rounding/","title":"functions_rounding.yaml","text":"

    This document file is generated for functions_rounding.yaml

    "},{"location":"extensions/functions_rounding/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_rounding/#ceil","title":"ceil","text":"

    Implementations: ceil(x): -> return_type 0. ceil(fp32): -> fp32 1. ceil(fp64): -> fp64

    *Rounding to the ceiling of the value x. *

    "},{"location":"extensions/functions_rounding/#floor","title":"floor","text":"

    Implementations: floor(x): -> return_type 0. floor(fp32): -> fp32 1. floor(fp64): -> fp64

    *Rounding to the floor of the value x. *

    "},{"location":"extensions/functions_rounding/#round","title":"round","text":"

    Implementations: round(x, s, option:rounding): -> return_type

  • x: Numerical expression to be rounded.
  • s: Number of decimal places to be rounded to. When `s` is a positive number, nothing will happen since `x` is an integer value. When `s` is a negative number, the rounding is performed to the nearest multiple of `10^(-s)`.
  • 0. round(i8, i32, option:rounding): -> i8? 1. round(i16, i32, option:rounding): -> i16? 2. round(i32, i32, option:rounding): -> i32? 3. round(i64, i32, option:rounding): -> i64? 4. round(fp32, i32, option:rounding): -> fp32? 5. round(fp64, i32, option:rounding): -> fp64?

    *Rounding the value x to s decimal places. *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR', 'AWAY_FROM_ZERO', 'TIE_DOWN', 'TIE_UP', 'TIE_TOWARDS_ZERO', 'TIE_TO_ODD']
  • "},{"location":"extensions/functions_set/","title":"functions_set.yaml","text":"

    This document file is generated for functions_set.yaml

    "},{"location":"extensions/functions_set/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_set/#index_in","title":"index_in","text":"

    Implementations: index_in(needle, haystack, option:nan_equality): -> return_type 0. index_in(any1, list<any1>, option:nan_equality): -> i64?

    *Checks the membership of a value in a list of values. Returns the first 0-based index value of some input needle if needle is equal to any element in haystack. Returns NULL if not found. If needle is NULL, returns NULL. If needle is NaN:
    - Returns the 0-based index of NaN in the input (default)
    - Returns NULL (if NAN_IS_NOT_NAN is specified) *
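    A sketch of the lookup with the nan_equality option (None stands in for null; names are illustrative):

    import math

    def index_in(needle, haystack, nan_equality="NAN_IS_NAN"):
        if needle is None:
            return None
        needle_is_nan = isinstance(needle, float) and math.isnan(needle)
        for i, v in enumerate(haystack):
            if needle_is_nan:
                if nan_equality == "NAN_IS_NAN" and isinstance(v, float) and math.isnan(v):
                    return i
            elif v == needle:
                return i
        return None

    assert index_in(float("nan"), [1.0, float("nan")]) == 1
    assert index_in(float("nan"), [1.0, float("nan")], "NAN_IS_NOT_NAN") is None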

    Options:
  • nan_equality ['NAN_IS_NAN', 'NAN_IS_NOT_NAN']
  • "},{"location":"extensions/functions_string/","title":"functions_string.yaml","text":"

    This document file is generated for functions_string.yaml

    "},{"location":"extensions/functions_string/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_string/#concat","title":"concat","text":"

    Implementations: concat(input, option:null_handling): -> return_type 0. concat(varchar<L1>, option:null_handling): -> varchar<L1> 1. concat(string, option:null_handling): -> string

    Concatenate strings. The null_handling option determines whether or not null values will be recognized by the function. If null_handling is set to IGNORE_NULLS, null value arguments will be ignored when strings are concatenated. If set to ACCEPT_NULLS, the result will be null if any argument passed to the concat function is null.
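    A sketch of the two modes (None stands in for null):

    def concat(args, null_handling="ACCEPT_NULLS"):
        if null_handling == "IGNORE_NULLS":
            return "".join(a for a in args if a is not None)
        return None if any(a is None for a in args) else "".join(args)

    assert concat(["a", None, "c"], "IGNORE_NULLS") == "ac"
    assert concat(["a", None, "c"]) is None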

    Options:
  • null_handling ['IGNORE_NULLS', 'ACCEPT_NULLS']
  • "},{"location":"extensions/functions_string/#like","title":"like","text":"

    Implementations: like(input, match, option:case_sensitivity): -> return_type

  • input: The input string.
  • match: The string to match against the input string.
  • 0. like(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean 1. like(string, string, option:case_sensitivity): -> boolean

    Are two strings like each other. The case_sensitivity option applies to the match argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#substring","title":"substring","text":"

    Implementations: substring(input, start, length, option:negative_start): -> return_type 0. substring(varchar<L1>, i32, i32, option:negative_start): -> varchar<L1> 1. substring(string, i32, i32, option:negative_start): -> string 2. substring(fixedchar<l1>, i32, i32, option:negative_start): -> string 3. substring(varchar<L1>, i32, option:negative_start): -> varchar<L1> 4. substring(string, i32, option:negative_start): -> string 5. substring(fixedchar<l1>, i32, option:negative_start): -> string

    Extract a substring of a specified length starting from position start. A start value of 1 refers to the first character of the string. When length is not specified the function will extract a substring starting from position start and ending at the end of the string. The negative_start option applies to the start parameter. WRAP_FROM_END means the index will start from the end of the input and move backwards. The last character has an index of -1, the second to last character has an index of -2, and so on. LEFT_OF_BEGINNING means the returned substring will start from the left of the first character. A start of -1 will begin 2 characters left of the input, while a start of 0 begins 1 character left of the input.
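    A sketch of the two negative-start behaviors over code points (illustrative; the ERROR option is omitted):

    def substring(s, start, length, negative_start="WRAP_FROM_END"):
        if start < 0 and negative_start == "WRAP_FROM_END":
            begin = len(s) + start  # -1 is the last character
        else:
            begin = start - 1       # 1-based; LEFT_OF_BEGINNING lets this go below 0
        end = begin + length
        return s[max(begin, 0):max(end, 0)]

    assert substring("substrait", -3, 3) == "ait"                     # WRAP_FROM_END
    assert substring("substrait", 0, 3, "LEFT_OF_BEGINNING") == "su"  # starts 1 char left of input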

    Options:
  • negative_start ['WRAP_FROM_END', 'LEFT_OF_BEGINNING', 'ERROR']
  • negative_start ['WRAP_FROM_END', 'LEFT_OF_BEGINNING']
  • "},{"location":"extensions/functions_string/#regexp_match_substring","title":"regexp_match_substring","text":"

    Implementations: regexp_match_substring(input, pattern, position, occurrence, group, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_match_substring(varchar<L1>, varchar<L2>, i64, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> varchar<L1> 1. regexp_match_substring(string, string, i64, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> string

    Extract a substring that matches the given regular expression pattern. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The occurrence of the pattern to be extracted is specified using the occurrence argument. Specifying 1 means the first occurrence will be extracted, 2 means the second occurrence, and so on. The occurrence argument should be a positive non-zero integer. The position from the beginning of the string at which to begin searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The regular expression capture group can be specified using the group argument. Specifying 0 will return the substring matching the full regular expression. Specifying 1 will return the substring matching only the first capture group, and so on. The group argument should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the occurrence value is out of range, the position value is out of range, or the group value is out of range.
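    Python's re module is not an ICU implementation, but it is close enough to sketch the occurrence/position/group mechanics:

    import re

    def regexp_match_substring(s, pattern, position=1, occurrence=1, group=0):
        # Nth match of pattern in s, searching from the 1-based position.
        matches = list(re.compile(pattern).finditer(s, position - 1))
        if len(matches) < occurrence:
            return None
        return matches[occurrence - 1].group(group)

    assert regexp_match_substring("a1b22c333", r"\d+", occurrence=2) == "22"
    assert regexp_match_substring("a1b22c333", r"(\d)(\d)", group=2) == "2"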

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#regexp_match_substring_1","title":"regexp_match_substring","text":"

    Implementations: regexp_match_substring(input, pattern, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_match_substring(string, string, option:case_sensitivity, option:multiline, option:dotall): -> string

    Extract a substring that matches the given regular expression pattern. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The first occurrence of the pattern from the beginning of the string is extracted. It returns the substring matching the full regular expression. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#regexp_match_substring_all","title":"regexp_match_substring_all","text":"

    Implementations: regexp_match_substring_all(input, pattern, position, group, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_match_substring_all(varchar<L1>, varchar<L2>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> List<varchar<L1>> 1. regexp_match_substring_all(string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> List<string>

    Extract all substrings that match the given regular expression pattern. This will return a list of extracted strings with one value for each occurrence of a match. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The position from the beginning of the string at which to begin searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The regular expression capture group can be specified using the group argument. Specifying 0 will return substrings matching the full regular expression. Specifying 1 will return substrings matching only the first capture group, and so on. The group argument should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the position value is out of range, or the group value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
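
    To make the position and group conventions concrete, here is an illustrative sketch (not part of the specification) that emulates them with Python's re module; the Python regex dialect differs from ICU in places, so this only pins down the argument conventions:

    import re

    def regexp_match_substring_all(input_str, pattern, position=1, group=0):
        # `position` is 1-based; slicing is a simplification that ignores how
        # ^ anchoring interacts with a true mid-string search start.
        if position < 1 or group < 0:
            raise ValueError("position must be >= 1 and group must be >= 0")
        return [m.group(group) for m in re.finditer(pattern, input_str[position - 1:])]

    # One extracted value per match; group 0 is the full match.
    assert regexp_match_substring_all("a1b2c3", r"[a-z](\d)") == ["a1", "b2", "c3"]
    assert regexp_match_substring_all("a1b2c3", r"[a-z](\d)", group=1) == ["1", "2", "3"]
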
  • "},{"location":"extensions/functions_string/#starts_with","title":"starts_with","text":"

    Implementations: starts_with(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.
  • 0. starts_with(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean
  • 1. starts_with(varchar<L1>, string, option:case_sensitivity): -> boolean
  • 2. starts_with(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
  • 3. starts_with(string, string, option:case_sensitivity): -> boolean
  • 4. starts_with(string, varchar<L1>, option:case_sensitivity): -> boolean
  • 5. starts_with(string, fixedchar<L1>, option:case_sensitivity): -> boolean
  • 6. starts_with(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
  • 7. starts_with(fixedchar<L1>, string, option:case_sensitivity): -> boolean
  • 8. starts_with(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether the input string starts with the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#ends_with","title":"ends_with","text":"

    Implementations: ends_with(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.
  • 0. ends_with(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean
  • 1. ends_with(varchar<L1>, string, option:case_sensitivity): -> boolean
  • 2. ends_with(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
  • 3. ends_with(string, string, option:case_sensitivity): -> boolean
  • 4. ends_with(string, varchar<L1>, option:case_sensitivity): -> boolean
  • 5. ends_with(string, fixedchar<L1>, option:case_sensitivity): -> boolean
  • 6. ends_with(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
  • 7. ends_with(fixedchar<L1>, string, option:case_sensitivity): -> boolean
  • 8. ends_with(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether the input string ends with the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#contains","title":"contains","text":"

    Implementations: contains(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.
  • 0. contains(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean
  • 1. contains(varchar<L1>, string, option:case_sensitivity): -> boolean
  • 2. contains(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
  • 3. contains(string, string, option:case_sensitivity): -> boolean
  • 4. contains(string, varchar<L1>, option:case_sensitivity): -> boolean
  • 5. contains(string, fixedchar<L1>, option:case_sensitivity): -> boolean
  • 6. contains(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean
  • 7. contains(fixedchar<L1>, string, option:case_sensitivity): -> boolean
  • 8. contains(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether the input string contains the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#strpos","title":"strpos","text":"

    Implementations: strpos(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.
  • 0. strpos(string, string, option:case_sensitivity): -> i64 1. strpos(varchar<L1>, varchar<L1>, option:case_sensitivity): -> i64 2. strpos(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> i64

    Return the position of the first occurrence of a string in another string. The first character of the string is at position 1. If no occurrence is found, 0 is returned. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
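
    As an illustrative sketch (not part of the specification), the 1-based position convention maps neatly onto Python's 0-based find:

    def strpos(input_str, substring):
        # str.find returns a 0-based index, or -1 when absent, so adding 1
        # yields the 1-based position and 0 for "not found".
        return input_str.find(substring) + 1

    assert strpos("substrait", "trait") == 5
    assert strpos("substrait", "xyz") == 0
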
  • "},{"location":"extensions/functions_string/#regexp_strpos","title":"regexp_strpos","text":"

    Implementations: regexp_strpos(input, pattern, position, occurrence, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_strpos(varchar<L1>, varchar<L2>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64 1. regexp_strpos(string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64

    Return the position of an occurrence of the given regular expression pattern in a string. The first character of the string is at position 1. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The position in the string at which to start searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. Which occurrence to return the position of is specified using the occurrence argument. Specifying 1 means the position of the first occurrence will be returned, 2 means the position of the second occurrence, and so on. The occurrence argument should be a positive non-zero integer. If no occurrence is found, 0 is returned. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the occurrence value is out of range, or the position value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#count_substring","title":"count_substring","text":"

    Implementations: count_substring(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to count.
  • 0. count_substring(string, string, option:case_sensitivity): -> i64 1. count_substring(varchar<L1>, varchar<L2>, option:case_sensitivity): -> i64 2. count_substring(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> i64

    Return the number of non-overlapping occurrences of a substring in an input string. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
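
    A minimal sketch of the non-overlapping counting rule in Python (lowercasing here is only a stand-in for the option's real Unicode-aware case handling):

    def count_substring(input_str, substring, case_sensitive=True):
        if not case_sensitive:
            input_str, substring = input_str.lower(), substring.lower()
        # str.count counts non-overlapping occurrences, matching the rule above.
        return input_str.count(substring)

    assert count_substring("aaaa", "aa") == 2  # positions 1 and 3, not 1, 2, 3
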
  • "},{"location":"extensions/functions_string/#regexp_count_substring","title":"regexp_count_substring","text":"

    Implementations: regexp_count_substring(input, pattern, position, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_count_substring(string, string, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64 1. regexp_count_substring(varchar<L1>, varchar<L2>, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64 2. regexp_count_substring(fixedchar<L1>, fixedchar<L2>, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64

    Return the number of non-overlapping occurrences of a regular expression pattern in an input string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The position in the string at which to start searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile or the position value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#regexp_count_substring_1","title":"regexp_count_substring","text":"

    Implementations: regexp_count_substring(input, pattern, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_count_substring(string, string, option:case_sensitivity, option:multiline, option:dotall): -> i64

    Return the number of non-overlapping occurrences of a regular expression pattern in an input string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The match starts at the first character of the input string. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#replace","title":"replace","text":"

    Implementations: replace(input, substring, replacement, option:case_sensitivity): -> return_type

  • input: Input string.
  • substring: The substring to replace.
  • replacement: The replacement string.
  • 0. replace(string, string, string, option:case_sensitivity): -> string 1. replace(varchar<L1>, varchar<L2>, varchar<L3>, option:case_sensitivity): -> varchar<L1>

    Replace all occurrences of the substring with the replacement string. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#concat_ws","title":"concat_ws","text":"

    Implementations: concat_ws(separator, string_arguments): -> return_type

  • separator: Character to separate strings by.
  • string_arguments: Strings to be concatenated.
  • 0. concat_ws(string, string): -> string 1. concat_ws(varchar<L2>, varchar<L1>): -> varchar<L1>

    Concatenate strings together separated by a separator.

    "},{"location":"extensions/functions_string/#repeat","title":"repeat","text":"

    Implementations: repeat(input, count): -> return_type 0. repeat(string, i64): -> string 1. repeat(varchar<L1>, i64): -> varchar<L1>

    Repeat a string count number of times.

    "},{"location":"extensions/functions_string/#reverse","title":"reverse","text":"

    Implementations: reverse(input): -> return_type 0. reverse(string): -> string 1. reverse(varchar<L1>): -> varchar<L1> 2. reverse(fixedchar<L1>): -> fixedchar<L1>

    Returns the string in reverse order.

    "},{"location":"extensions/functions_string/#replace_slice","title":"replace_slice","text":"

    Implementations: replace_slice(input, start, length, replacement): -> return_type

  • input: Input string.
  • start: The position in the string to start deleting/inserting characters.
  • length: The number of characters to delete from the input string.
  • replacement: The new string to insert at the start position.
  • 0. replace_slice(string, i64, i64, string): -> string 1. replace_slice(varchar<L1>, i64, i64, varchar<L2>): -> varchar<L1>

    Replace a slice of the input string. A specified 'length' of characters will be deleted from the input string beginning at the 'start' position and will be replaced by a new string. A start value of 1 indicates the first character of the input string. If start is negative or zero, or greater than the length of the input string, a null string is returned. If 'length' is negative, a null string is returned. If 'length' is zero, insertion of the new string occurs at the specified 'start' position and no characters are deleted. If 'length' is greater than the number of characters remaining in the input string, deletion will occur up to the last character of the input string.
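
    The rules above compose into a short algorithm; the following Python sketch (not normative) returns None for the null cases:

    def replace_slice(input_str, start, length, replacement):
        if start <= 0 or start > len(input_str) or length < 0:
            return None  # the null-producing cases described above
        # 1-based start; slicing clamps deletion to the end of the string.
        return input_str[:start - 1] + replacement + input_str[start - 1 + length:]

    assert replace_slice("hello", 2, 3, "XY") == "hXYo"
    assert replace_slice("hello", 2, 0, "XY") == "hXYello"  # pure insertion
    assert replace_slice("hello", 4, 99, "XY") == "helXY"   # clamped deletion
    assert replace_slice("hello", 0, 1, "XY") is None       # invalid start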

    "},{"location":"extensions/functions_string/#lower","title":"lower","text":"

    Implementations: lower(input, option:char_set): -> return_type 0. lower(string, option:char_set): -> string 1. lower(varchar<L1>, option:char_set): -> varchar<L1> 2. lower(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#upper","title":"upper","text":"

    Implementations: upper(input, option:char_set): -> return_type 0. upper(string, option:char_set): -> string 1. upper(varchar<L1>, option:char_set): -> varchar<L1> 2. upper(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string to upper case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#swapcase","title":"swapcase","text":"

    Implementations: swapcase(input, option:char_set): -> return_type 0. swapcase(string, option:char_set): -> string 1. swapcase(varchar<L1>, option:char_set): -> varchar<L1> 2. swapcase(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string's lowercase characters to uppercase and uppercase characters to lowercase. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#capitalize","title":"capitalize","text":"

    Implementations: capitalize(input, option:char_set): -> return_type 0. capitalize(string, option:char_set): -> string 1. capitalize(varchar<L1>, option:char_set): -> varchar<L1> 2. capitalize(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Capitalize the first character of the input string. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#title","title":"title","text":"

    Implementations: title(input, option:char_set): -> return_type 0. title(string, option:char_set): -> string 1. title(varchar<L1>, option:char_set): -> varchar<L1> 2. title(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Converts the input string into titlecase: capitalizes the first character of each word in the input string except for articles (a, an, the). Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#initcap","title":"initcap","text":"

    Implementations: initcap(input, option:char_set): -> return_type 0. initcap(string, option:char_set): -> string 1. initcap(varchar<L1>, option:char_set): -> varchar<L1> 2. initcap(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Capitalizes the first character of each word in the input string, including articles, and lowercases the rest. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#char_length","title":"char_length","text":"

    Implementations: char_length(input): -> return_type 0. char_length(string): -> i64 1. char_length(varchar<L1>): -> i64 2. char_length(fixedchar<L1>): -> i64

    Return the number of characters in the input string. The length includes trailing spaces.

    "},{"location":"extensions/functions_string/#bit_length","title":"bit_length","text":"

    Implementations: bit_length(input): -> return_type 0. bit_length(string): -> i64 1. bit_length(varchar<L1>): -> i64 2. bit_length(fixedchar<L1>): -> i64

    Return the number of bits in the input string.

    "},{"location":"extensions/functions_string/#octet_length","title":"octet_length","text":"

    Implementations: octet_length(input): -> return_type 0. octet_length(string): -> i64 1. octet_length(varchar<L1>): -> i64 2. octet_length(fixedchar<L1>): -> i64

    Return the number of bytes in the input string.

    "},{"location":"extensions/functions_string/#regexp_replace","title":"regexp_replace","text":"

    Implementations: regexp_replace(input, pattern, replacement, position, occurrence, option:case_sensitivity, option:multiline, option:dotall): -> return_type

  • input: The input string.
  • pattern: The regular expression to search for within the input string.
  • replacement: The replacement string.
  • position: The position to start the search.
  • occurrence: Which occurrence of the match to replace.
  • 0. regexp_replace(string, string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> string 1. regexp_replace(varchar<L1>, varchar<L2>, varchar<L3>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> varchar<L1>

    Search a string for a substring that matches a given regular expression pattern and replace it with a replacement string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The occurrence of the pattern to be replaced is specified using the occurrence argument. Specifying 1 means only the first occurrence will be replaced, 2 means the second occurrence, and so on. Specifying 0 means all occurrences will be replaced. The position in the string at which to start searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The replacement string can capture groups using numbered backreferences. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the replacement contains an illegal back-reference, the occurrence value is out of range, or the position value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
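
    As an illustrative sketch (not the normative definition), the position and occurrence conventions can be emulated with Python's re module. ICU-style $1 backreferences are written \1 in Python replacement strings, and the out-of-range behavior chosen here (returning the input unchanged) is one arbitrary resolution of what the specification leaves undefined:

    import re

    def regexp_replace(input_str, pattern, replacement, position=1, occurrence=0):
        prefix, rest = input_str[:position - 1], input_str[position - 1:]
        if occurrence == 0:  # replace every match
            return prefix + re.sub(pattern, replacement, rest)
        matches = list(re.finditer(pattern, rest))
        if occurrence > len(matches):
            return input_str  # undefined per spec; this sketch leaves input as-is
        m = matches[occurrence - 1]
        return prefix + rest[:m.start()] + m.expand(replacement) + rest[m.end():]

    assert regexp_replace("cat bat rat", r"[cbr]at", "X") == "X X X"
    assert regexp_replace("cat bat rat", r"[cbr]at", "X", occurrence=2) == "cat X rat"
    assert regexp_replace("ab", r"(a)(b)", r"\2\1") == "ba"  # backreferences
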
  • "},{"location":"extensions/functions_string/#regexp_replace_1","title":"regexp_replace","text":"

    Implementations: regexp_replace(input, pattern, replacement, option:case_sensitivity, option:multiline, option:dotall): -> return_type

  • input: The input string.
  • pattern: The regular expression to search for within the input string.
  • replacement: The replacement string.
  • 0. regexp_replace(string, string, string, option:case_sensitivity, option:multiline, option:dotall): -> string

    Search a string for a substring that matches a given regular expression pattern and replace it with a replacement string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The replacement string can capture groups using numbered backreferences. All occurrences of the pattern will be replaced. The search for matches starts at the first character of the input. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile or the replacement contains an illegal back-reference.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#ltrim","title":"ltrim","text":"

    Implementations: ltrim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.
  • 0. ltrim(varchar<L1>, varchar<L2>): -> varchar<L1> 1. ltrim(string, string): -> string

    Remove any occurrence of the characters from the left side of the string. If no characters are specified, spaces are removed.

    "},{"location":"extensions/functions_string/#rtrim","title":"rtrim","text":"

    Implementations: rtrim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.
  • 0. rtrim(varchar<L1>, varchar<L2>): -> varchar<L1> 1. rtrim(string, string): -> string

    Remove any occurrence of the characters from the right side of the string. If no characters are specified, spaces are removed.

    "},{"location":"extensions/functions_string/#trim","title":"trim","text":"

    Implementations: trim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.
  • 0. trim(varchar<L1>, varchar<L2>): -> varchar<L1> 1. trim(string, string): -> string

    Remove any occurrence of the characters from the left and right sides of the string. If no characters are specified, spaces are removed.

    "},{"location":"extensions/functions_string/#lpad","title":"lpad","text":"

    Implementations: lpad(input, length, characters): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • characters: The string of characters to use for padding.
  • 0. lpad(varchar<L1>, i32, varchar<L2>): -> varchar<L1> 1. lpad(string, i32, string): -> string

    Left-pad the input string with the string of 'characters' until the specified length of the string has been reached. If the input string is longer than 'length', remove characters from the right side to shorten it to 'length' characters. If the string of 'characters' is longer than the remaining 'length' needed to be filled, only pad until 'length' has been reached. If 'characters' is not specified, the default value is a single space.
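
    A minimal Python sketch of these rules (rpad mirrors them on the opposite sides):

    def lpad(input_str, length, characters=" "):
        if len(input_str) >= length:
            return input_str[:length]  # too long: truncate from the right
        pad_needed = length - len(input_str)
        # Repeat the pad string, then cut it to exactly the space remaining.
        return (characters * pad_needed)[:pad_needed] + input_str

    assert lpad("abc", 6, "xy") == "xyxabc"
    assert lpad("abcdef", 4) == "abcd"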

    "},{"location":"extensions/functions_string/#rpad","title":"rpad","text":"

    Implementations: rpad(input, length, characters): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • characters: The string of characters to use for padding.
  • 0. rpad(varchar<L1>, i32, varchar<L2>): -> varchar<L1> 1. rpad(string, i32, string): -> string

    Right-pad the input string with the string of 'characters' until the specified length of the string has been reached. If the input string is longer than 'length', remove characters from the left side to shorten it to 'length' characters. If the string of 'characters' is longer than the remaining 'length' needed to be filled, only pad until 'length' has been reached. If 'characters' is not specified, the default value is a single space.

    "},{"location":"extensions/functions_string/#center","title":"center","text":"

    Implementations: center(input, length, character, option:padding): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • character: The character to use for padding.
  • 0. center(varchar<L1>, i32, varchar<L1>, option:padding): -> varchar<L1> 1. center(string, i32, string, option:padding): -> string

    Center the input string by padding the sides with a single character until the specified length of the string has been reached. By default, if reaching the target length requires an odd number of padding characters, the extra character will be applied to the right side. The side that receives the extra padding can be controlled with the padding option. Behavior is undefined if the number of characters passed to the character argument is not 1.

    Options:
  • padding ['RIGHT', 'LEFT']
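
    A short Python sketch of the padding-side rule (the string option values below stand in for the enum values above):

    def center(input_str, length, character=" ", padding="RIGHT"):
        total = max(0, length - len(input_str))
        # With an odd total, the extra character lands on the side named
        # by `padding`; the smaller half goes on the opposite side.
        left = total // 2 if padding == "RIGHT" else (total + 1) // 2
        return character * left + input_str + character * (total - left)

    assert center("ab", 5, "*") == "*ab**"                  # extra pad on the right
    assert center("ab", 5, "*", padding="LEFT") == "**ab*"
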
  • "},{"location":"extensions/functions_string/#left","title":"left","text":"

    Implementations: left(input, count): -> return_type 0. left(varchar<L1>, i32): -> varchar<L1> 1. left(string, i32): -> string

    Extract count characters starting from the left of the string.

    "},{"location":"extensions/functions_string/#right","title":"right","text":"

    Implementations: right(input, count): -> return_type 0. right(varchar<L1>, i32): -> varchar<L1> 1. right(string, i32): -> string

    Extract count characters starting from the right of the string.

    "},{"location":"extensions/functions_string/#string_split","title":"string_split","text":"

    Implementations: string_split(input, separator): -> return_type

  • input: The input string.
  • separator: A character used for splitting the string.
  • 0. string_split(varchar<L1>, varchar<L2>): -> List<varchar<L1>> 1. string_split(string, string): -> List<string>

    Split a string into a list of strings, based on a specified separator character.

    "},{"location":"extensions/functions_string/#regexp_string_split","title":"regexp_string_split","text":"

    Implementations: regexp_string_split(input, pattern, option:case_sensitivity, option:multiline, option:dotall): -> return_type

  • input: The input string.
  • pattern: The regular expression to search for within the input string.
  • 0. regexp_string_split(varchar<L1>, varchar<L2>, option:case_sensitivity, option:multiline, option:dotall): -> List<varchar<L1>> 1. regexp_string_split(string, string, option:case_sensitivity, option:multiline, option:dotall): -> List<string>

    Split a string into a list of strings, based on a regular expression pattern. The substrings matched by the pattern will be used as the separators to split the input string and will not be included in the resulting list. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_string/#string_agg","title":"string_agg","text":"

    Implementations: string_agg(input, separator): -> return_type

  • input: Column of string values.
  • separator: Separator for concatenated strings.
  • 0. string_agg(string, string): -> string

    Concatenates a column of string values with a separator.

    "},{"location":"relations/basics/","title":"Basics","text":"

    Substrait is designed to allow a user to describe arbitrarily complex data transformations. These transformations are composed of one or more relational operations. Relational operations are well-defined transformation operations that work by taking zero or more input datasets and transforming them into zero or more output datasets. Substrait defines a core set of transformations, but users are also able to extend the operations with their own specialized operations.

    "},{"location":"relations/basics/#plans","title":"Plans","text":"

    A plan is a tree of relations. The root of the tree is the final output of the plan. Each node in the tree is a relational operation. The children of a node are the inputs to the operation. The leaves of the tree are the input datasets to the plan.

    Plans can be composed together using reference relations. This allows for the construction of common plans that can be reused in multiple places. If a plan has no cycles (i.e., there is only one plan, or each reference relation only references later plans), then the plan forms a DAG (Directed Acyclic Graph).

    "},{"location":"relations/basics/#relational-operators","title":"Relational Operators","text":"

    Each relational operation is composed of several properties. Common properties for relational operations include the following:

  • Emit: The set of columns output from this operation and the order of those columns. (Logical & Physical)
  • Hints: A set of optionally provided, optionally consumed information about an operation that better informs execution. These might include estimated number of input and output records, estimated record size, likely filter reduction, estimated dictionary size, etc. These can also include implementation-specific pieces of execution information. (Physical)
  • Constraint: A set of runtime constraints around the operation, limiting its consumption based on real-world resources (CPU, memory) as well as virtual resources like number of records produced, the largest record size, etc. (Physical)
  • "},{"location":"relations/basics/#relational-signatures","title":"Relational Signatures","text":"

    In functions, function signatures are declared externally to the use of those signatures (function bindings). In the case of relational operations, signatures are declared directly in the specification. This is due to the speed of change and number of total operations. Relational operations in the specification are expected to be <100 for several years with additions being infrequent. On the other hand, there is an expectation of both a much larger number of functions (1,000s) and a much higher velocity of additions.

    Each relational operation must declare the following:

    • Transformation logic around properties of the data. For example, does a relational operation maintain sortedness of a field? Does an operation change the distribution of data?
    • How many input relations does an operation require?
    • Does the operator produce an output? (By specification, we limit relational operations to a single output at this time.)
    • What is the schema and field ordering of an output (see emit below)?
    "},{"location":"relations/basics/#emit-output-ordering","title":"Emit: Output Ordering","text":"

    A relational operation uses field references to access specific fields of the input stream. Field references are always ordinal based on the order of the incoming streams. Each relational operation must declare the order of its output data. To simplify things, each relational operation can be in one of two modes:

    1. Direct output: The order of outputs is based on the definition declared by the relational operation.
    2. Remap: A listed ordering of the direct outputs. This remapping can also be used to drop columns that are no longer needed (such as a filter field or join keys after a join). Note that remapping/exclusion can only be done at the output's root struct. Filtering of compound values or extracting subsets must be done through other operation types (e.g. projection). See the sketch after this list for an example.
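
    The following Python sketch illustrates the two modes (rows modeled as plain tuples, which is an assumption of this illustration, not part of the specification); field references are 0-based ordinals into the direct output:

    def apply_emit(row, output_mapping=None):
        if output_mapping is None:  # direct output: unchanged
            return row
        # Remap may reorder, duplicate, or drop direct-output columns,
        # but only at the root struct level.
        return tuple(row[i] for i in output_mapping)

    direct = ("left_key", "right_key", "payload")  # e.g. a join's direct output
    assert apply_emit(direct) == ("left_key", "right_key", "payload")
    assert apply_emit(direct, [2]) == ("payload",)  # drop the join keys
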
    "},{"location":"relations/basics/#relation-properties","title":"Relation Properties","text":"

    There are a number of predefined properties that exist in Substrait relations. These include the following.

    "},{"location":"relations/basics/#distribution","title":"Distribution","text":"

    When data is partitioned across multiple sibling sets, distribution describes the set of properties that apply to any one partition. This is based on a set of distribution expression properties. A distribution is declared as a set of one or more fields and a distribution type across all fields.

  • Distribution Fields: List of field references that describe distribution (e.g. [0,2:4,5:0:0]). The order of these references does not impact results. (Required for partitioned distribution type. Disallowed for singleton distribution type.)
  • Distribution Type: PARTITIONED: for a discrete tuple of values for the declared distribution fields, all records with that tuple are located in the same partition. SINGLETON: there will only be a single partition for this operation. (Required)
  • "},{"location":"relations/basics/#orderedness","title":"Orderedness","text":"

    A guarantee that data output from this operation is provided with a sort order. The sort order will be declared based on a set of sort field definitions based on the emitted output of this operation.

  • Sort Fields: A list of fields that the data are ordered by. The list is in order of the sort. If we sort by [0,1] then this means we only consider the data for field 1 to be ordered within each discrete value of field 0. (At least one required.)
  • Per Sort Field: A field reference that the data is sorted by. (Required)
  • Per Sort Direction: The direction of the data. See direction options below. (Required)
  • "},{"location":"relations/basics/#ordering-directions","title":"Ordering Directions","text":"

  • Ascending (nulls first): Returns data in ascending order based on the quality function associated with the type. Nulls are included before any values.
  • Descending (nulls first): Returns data in descending order based on the quality function associated with the type. Nulls are included before any values.
  • Ascending (nulls last): Returns data in ascending order based on the quality function associated with the type. Nulls are included after any values.
  • Descending (nulls last): Returns data in descending order based on the quality function associated with the type. Nulls are included after any values.
  • Custom function identifier (nulls position per function): Returns data using a custom function that returns -1, 0, or 1 depending on the order of the data.
  • Clustered (nulls may appear anywhere but will be coalesced): Ensures that all equal values are coalesced (but no ordering between values is defined). E.g. for values 1,2,3,1,2,3, output could be any of the following: 1,1,2,2,3,3 or 1,1,3,3,2,2 or 2,2,1,1,3,3 or 2,2,3,3,1,1 or 3,3,1,1,2,2 or 3,3,2,2,1,1.

    Discussion Points
    • Should read definition types be more extensible in the same way that function signatures are? Are extensible read definition types necessary if we have custom relational operators?
    • How are decomposed reads expressed? For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object.
    "},{"location":"relations/common_fields/","title":"Common Fields","text":"

    Every relation contains a common section containing optional hints and emit behavior.

    "},{"location":"relations/common_fields/#emit","title":"Emit","text":"

    A relation which has a direct emit kind outputs the relation's output without reordering or selection. A relation that specifies an emit output mapping may emit its output columns in any order and may omit output columns.

    Relation Output
    • Many relations (such as Project) by default provide the list of all their input columns plus any generated columns as their output columns. Review each relation to understand its specific output default.
    "},{"location":"relations/common_fields/#hints","title":"Hints","text":"

    Hints provide information that can improve performance but cannot be used to control the behavior. Table statistics, runtime constraints, name hints, and saved computations all fall into this category.

    Hint Design
    • If a hint is not present or has incorrect data the consumer should be able to ignore it and still arrive at the correct result.
    "},{"location":"relations/common_fields/#saved-computations","title":"Saved Computations","text":"

    Computations can be used to save a data structure to use elsewhere. For instance, let's say we have a plan with a HashEquiJoin and an AggregateDistinct operation. The HashEquiJoin could save its hash table as part of saved computation id number 1 and the AggregateDistinct could read in computation id number 1.

    "},{"location":"relations/embedded_relations/","title":"Embedded Relations","text":"

    Pending.

    Embedded relations allow a Substrait producer to define a set operation that will be embedded in the plan.

    TODO: define lots of details about what interfaces, languages, formats, etc. Should reasonably be an extension of embedded user defined table functions.

    "},{"location":"relations/logical_relations/","title":"Logical Relations","text":""},{"location":"relations/logical_relations/#read-operator","title":"Read Operator","text":"

    The read operator produces one output. A simple example would be the reading of a Parquet file. It is expected that many types of reads will be added over time.

  • Inputs: 0
  • Outputs: 1
  • Property Maintenance: N/A (no inputs)
  • Direct Output Order: Defaults to the schema of the data read after the optional projection (masked complex expression) is applied.
  • "},{"location":"relations/logical_relations/#read-properties","title":"Read Properties","text":"

  • Definition: The contents of the read property definition. (Required)
  • Direct Schema: Defines the schema of the output of the read (before any projection or emit remapping/hiding). (Required)
  • Filter: A boolean Substrait expression that describes a filter that must be applied to the data. The filter should be interpreted against the direct schema. (Optional, defaults to none.)
  • Best Effort Filter: A boolean Substrait expression that describes a filter that may be applied to the data. The filter should be interpreted against the direct schema. (Optional, defaults to none.)
  • Projection: A masked complex expression describing the portions of the content that should be read. (Optional, defaults to all of schema.)
  • Output Properties: Declaration of orderedness and/or distribution properties this read produces. (Optional, defaults to no properties.)
  • Properties: A list of name/value pairs associated with the read. (Optional, defaults to empty.)
  • "},{"location":"relations/logical_relations/#read-filtering","title":"Read Filtering","text":"

    The read relation has two different filter properties: a filter, which must be satisfied by the operator, and a best effort filter, which does not have to be satisfied. This reflects the way that consumers are often implemented. A consumer is often only able to fully apply a limited set of operations in the scan. There can then be an extended set of operations which a consumer can apply in a best effort fashion. A producer, when setting these two fields, should take care to only use expressions that the consumer is capable of handling.

    As an example, a consumer may only be able to fully apply (in the read relation) <, =, and > on integral types. The consumer may be able to apply <, =, and > in a best effort fashion on decimal and string types. Consider the filter expression my_int < 10 && my_string < "x" && upper(my_string) > "B". In this case the filter should be set to my_int < 10, the best_effort_filter should be set to my_string < "x", and the remaining portion (upper(my_string) > "B") should be put into a filter relation.

    A filter expression must be interpreted against the direct schema before the projection expression has been applied. As a result, fields may be referenced by the filter expression which are not included in the relation's output.

    "},{"location":"relations/logical_relations/#read-definition-types","title":"Read Definition Types","text":"Adding new Read Definition Types

    If you have a read definition that's not covered here, see the process for adding new read definition types.

    Read definition types (like the rest of the features in Substrait) are built by the community and added to the specification.

    "},{"location":"relations/logical_relations/#virtual-table","title":"Virtual Table","text":"

    A virtual table is a table whose contents are embedded in the plan itself. The table data is encoded as records consisting of literal values or expressions that can be resolved without referencing any input data. For example, a literal, a function call involving literals, or any other expression that does not require input.

  • Data (Required)
  • "},{"location":"relations/logical_relations/#named-table","title":"Named Table","text":"

    A named table is a reference to data defined elsewhere. For example, there may be a catalog of tables with unique names that both the producer and consumer agree on. This catalog would provide the consumer with more information on how to retrieve the data.

  • Names: A list of namespaced strings that, together, form the table name. (Required, at least one)
  • "},{"location":"relations/logical_relations/#files-type","title":"Files Type","text":"

  • Items: An array of items (path or path glob) associated with the read. (Required)
  • Format per item: Enumeration of available formats. Only current option is PARQUET. (Required)
  • Slicing parameters per item: Information to use when reading a slice of a file. (Optional)
  • "},{"location":"relations/logical_relations/#slicing-files","title":"Slicing Files","text":"

    A read operation is allowed to only read part of a file. This is convenient, for example, when distributing a read operation across several nodes. The slicing parameters are specified as byte offsets into the file.

    Many file formats consist of indivisible "chunks" of data (e.g. Parquet row groups). When a slice boundary falls inside such a chunk, the consumer must decide which slice the chunk belongs to. For example, one possible approach is that a chunk should only be read if the midpoint of the chunk (dividing by 2 and rounding down) is contained within the asked-for byte range, as sketched below.
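
    A Python sketch of that midpoint rule (one possible policy, not a requirement):

    def slice_reads_chunk(slice_start, slice_length, chunk_start, chunk_length):
        # Midpoint of the chunk, dividing by 2 and rounding down.
        midpoint = chunk_start + chunk_length // 2
        return slice_start <= midpoint < slice_start + slice_length

    # A 100-byte chunk starting at offset 0 has midpoint 50, so of two
    # 50-byte slices only the second one reads it.
    assert not slice_reads_chunk(0, 50, 0, 100)
    assert slice_reads_chunk(50, 50, 0, 100)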

    ReadRel Message
    message ReadRel {
      RelCommon common = 1;
      NamedStruct base_schema = 2;
      Expression filter = 3;
      Expression best_effort_filter = 11;
      Expression.MaskExpression projection = 4;
      substrait.extensions.AdvancedExtension advanced_extension = 10;

      // Definition of which type of scan operation is to be performed
      oneof read_type {
        VirtualTable virtual_table = 5;
        LocalFiles local_files = 6;
        NamedTable named_table = 7;
        ExtensionTable extension_table = 8;
      }

      // A base table. The list of string is used to represent namespacing (e.g., mydb.mytable).
      // This assumes shared catalog between systems exchanging a message.
      message NamedTable {
        repeated string names = 1;
        substrait.extensions.AdvancedExtension advanced_extension = 10;
      }

      // A table composed of expressions.
      message VirtualTable {
        repeated Expression.Literal.Struct values = 1 [deprecated = true];
        repeated Expression.Nested.Struct expressions = 2;
      }

      // A stub type that can be used to extend/introduce new table types outside
      // the specification.
      message ExtensionTable {
        google.protobuf.Any detail = 1;
      }

      // Represents a list of files in input of a scan operation
      message LocalFiles {
        repeated FileOrFiles items = 1;
        substrait.extensions.AdvancedExtension advanced_extension = 10;

        // Many files consist of indivisible chunks (e.g. parquet row groups
        // or CSV rows).  If a slice partially selects an indivisible chunk
        // then the consumer should employ some rule to decide which slice to
        // include the chunk in (e.g. include it in the slice that contains
        // the midpoint of the chunk)
        message FileOrFiles {
          oneof path_type {
            // A URI that can refer to either a single folder or a single file
            string uri_path = 1;
            // A URI where the path portion is a glob expression that can
            // identify zero or more paths.
            // Consumers should support the POSIX syntax.  The recursive
            // globstar (**) may not be supported.
            string uri_path_glob = 2;
            // A URI that refers to a single file
            string uri_file = 3;
            // A URI that refers to a single folder
            string uri_folder = 4;
          }

          // Original file format enum, superseded by the file_format oneof.
          reserved 5;
          reserved "format";

          // The index of the partition this item belongs to
          uint64 partition_index = 6;

          // The start position in byte to read from this item
          uint64 start = 7;

          // The length in byte to read from this item
          uint64 length = 8;

          message ParquetReadOptions {}
          message ArrowReadOptions {}
          message OrcReadOptions {}
          message DwrfReadOptions {}
          message DelimiterSeparatedTextReadOptions {
            // Delimiter separated files may be compressed.  The reader should
            // autodetect this and decompress as needed.

            // The character(s) used to separate fields.  Common values are comma,
            // tab, and pipe.  Multiple characters are allowed.
            string field_delimiter = 1;
            // The maximum number of bytes to read from a single line.  If a line
            // exceeds this limit the resulting behavior is undefined.
            uint64 max_line_size = 2;
            // The character(s) used to quote strings.  Common values are single
            // and double quotation marks.
            string quote = 3;
            // The number of lines to skip at the beginning of the file.
            uint64 header_lines_to_skip = 4;
            // The character used to escape characters in strings.  Backslash is
            // a common value.  Note that a double quote mark can also be used as an
            // escape character but the external quotes should be removed first.
            string escape = 5;
            // If this value is encountered (including empty string), the resulting
            // value is null instead.  Leave unset to disable.  If this value is
            // provided, the effective schema of this file is comprised entirely of
            // nullable strings.  If not provided, the effective schema is instead
            // made up of non-nullable strings.
            optional string value_treated_as_null = 6;
          }

          // The format of the files along with options for reading those files.
          oneof file_format {
            ParquetReadOptions parquet = 9;
            ArrowReadOptions arrow = 10;
            OrcReadOptions orc = 11;
            google.protobuf.Any extension = 12;
            DwrfReadOptions dwrf = 13;
            DelimiterSeparatedTextReadOptions text = 14;
          }
        }
      }
    }
    "},{"location":"relations/logical_relations/#filter-operation","title":"Filter Operation","text":"

    The filter operator eliminates one or more records from the input data based on a boolean filter expression.

  • Inputs: 1
  • Outputs: 1
  • Property Maintenance: Orderedness, Distribution, remapped by emit
  • Direct Output Order: The same field order as the input.
  • "},{"location":"relations/logical_relations/#filter-properties","title":"Filter Properties","text":"

  • Input: The relational input. (Required)
  • Expression: A boolean expression which describes which records are included/excluded. (Required)

    FilterRel Message
    message FilterRel {
      RelCommon common = 1;
      Rel input = 2;
      Expression condition = 3;
      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }
    "},{"location":"relations/logical_relations/#sort-operation","title":"Sort Operation","text":"

    The sort operator reorders a dataset based on one or more identified sort fields and a sorting function for each.

  • Inputs: 1
  • Outputs: 1
  • Property Maintenance: Will update the orderedness property to the output of the sort operation. Distribution property only remapped based on emit.
  • Direct Output Order: The field order of the input.
  • "},{"location":"relations/logical_relations/#sort-properties","title":"Sort Properties","text":"

  • Input: The relational input. (Required)
  • Sort Fields: List of one or more fields to sort by. Uses the same properties as the orderedness property. (At least one sort field required)

    SortRel Message
    message SortRel {
      RelCommon common = 1;
      Rel input = 2;
      repeated SortField sorts = 3;
      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }
    "},{"location":"relations/logical_relations/#project-operation","title":"Project Operation","text":"

    The project operation will produce one or more additional expressions based on the inputs of the dataset.

  • Inputs: 1
  • Outputs: 1
  • Property Maintenance: Distribution maintained, mapped by emit. Orderedness: maintained if no window operations; extended to include projection fields if fields are direct references. If window operations are present, no orderedness is maintained.
  • Direct Output Order: The field order of the input + the list of new expressions in the order they are declared in the expressions list.
  • "},{"location":"relations/logical_relations/#project-properties","title":"Project Properties","text":"

  • Input: The relational input. (Required)
  • Expressions: List of one or more expressions to add to the input. (At least one expression required)

    ProjectRel Message
    message ProjectRel {
      RelCommon common = 1;
      Rel input = 2;
      repeated Expression expressions = 3;
      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }
    "},{"location":"relations/logical_relations/#cross-product-operation","title":"Cross Product Operation","text":"

    The cross product operation will combine two separate inputs into a single output. It pairs every record from the left input with every record of the right input.

  • Inputs: 2
  • Outputs: 1
  • Property Maintenance: Distribution is maintained. Orderedness is empty post operation.
  • Direct Output Order: The emit order of the left input followed by the emit order of the right input.
  • "},{"location":"relations/logical_relations/#cross-product-properties","title":"Cross Product Properties","text":"

  • Left Input: A relational input. (Required)
  • Right Input: A relational input. (Required)

    CrossRel Message
    message CrossRel {
      RelCommon common = 1;
      Rel left = 2;
      Rel right = 3;
      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }
    "},{"location":"relations/logical_relations/#join-operation","title":"Join Operation","text":"

    The join operation will combine two separate inputs into a single output, based on a join expression. A common subtype of joins is an equality join where the join expression is constrained to a list of equality (or equality + null equality) conditions between the two inputs of the join.

  • Inputs: 2
  • Outputs: 1
  • Property Maintenance: Distribution is maintained. Orderedness is empty post operation. Physical relations may provide better property maintenance.
  • Direct Output Order: The emit order of the left input followed by the emit order of the right input.
  • "},{"location":"relations/logical_relations/#join-properties","title":"Join Properties","text":"

  • Left Input: A relational input. (Required)
  • Right Input: A relational input. (Required)
  • Join Expression: A boolean condition that describes whether each record from the left set "matches" a record from the right set. Field references correspond to the direct output order of the data. (Required. Can be the literal True.)
  • Post-Join Filter: A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfied the condition. (Optional)
  • Join Type: One of the join types defined below. (Required)
  • "},{"location":"relations/logical_relations/#join-types","title":"Join Types","text":"

  • Inner: Return records from the left side only if they match the right side. Return records from the right side only when they match the left side. For each cross input match, return a record including the data from both sides. Non-matching records are ignored.
  • Outer: Return all records from both the left and right inputs. For each cross input match, return a record including the data from both sides. For any remaining non-matching records, return the record from the corresponding input along with nulls for the opposite input.
  • Left: Return all records from the left input. For each cross input match, return a record including the data from both sides. For any remaining non-matching records from the left input, return the left record along with nulls for the right input.
  • Right: Return all records from the right input. For each cross input match, return a record including the data from both sides. For any remaining non-matching records from the right input, return the right record along with nulls for the left input.
  • Left Semi: Returns records from the left input. These are returned only if the records have a join partner on the right side.
  • Right Semi: Returns records from the right input. These are returned only if the records have a join partner on the left side.
  • Left Anti: Return records from the left input. These are returned only if the records do not have a join partner on the right side.
  • Right Anti: Return records from the right input. These are returned only if the records do not have a join partner on the left side.
  • Left Single: Return all records from the left input with no join expansion. If at least one record from the right input matches the left, return one arbitrary matching record from the right input. For any left records without matching right records, return the left record along with nulls for the right input. Similar to a left outer join but only returns one right match at most. Useful for nested sub-queries where we need exactly one record in output (or throw exception). See Section 3.2 of https://15721.courses.cs.cmu.edu/spring2018/papers/16-optimizer2/hyperjoins-btw2017.pdf for more information.
  • Right Single: Same as left single except that the right and left inputs are switched.
  • Left Mark: Returns one record for each record from the left input. Appends one additional "mark" column to the output of the join. The new column will be listed after all columns from both sides and will be of type nullable boolean. If there is at least one join partner in the right input where the join condition evaluates to true then the mark column will be set to true. Otherwise, if there is at least one join partner in the right input where the join condition evaluates to NULL then the mark column will be set to NULL. Otherwise the mark column will be set to false. (A sketch of this rule follows the JoinRel message below.)
  • Right Mark: Returns records from the right input. Appends one additional "mark" column to the output of the join. The new column will be listed after all columns from both sides and will be of type nullable boolean. If there is at least one join partner in the left input where the join condition evaluates to true then the mark column will be set to true. Otherwise, if there is at least one join partner in the left input where the join condition evaluates to NULL then the mark column will be set to NULL. Otherwise the mark column will be set to false.

    JoinRel Message
message JoinRel {
  RelCommon common = 1;
  Rel left = 2;
  Rel right = 3;
  Expression expression = 4;
  Expression post_join_filter = 5;

  JoinType type = 6;

  enum JoinType {
    JOIN_TYPE_UNSPECIFIED = 0;
    JOIN_TYPE_INNER = 1;
    JOIN_TYPE_OUTER = 2;
    JOIN_TYPE_LEFT = 3;
    JOIN_TYPE_RIGHT = 4;
    JOIN_TYPE_LEFT_SEMI = 5;
    JOIN_TYPE_LEFT_ANTI = 6;
    JOIN_TYPE_LEFT_SINGLE = 7;
    JOIN_TYPE_RIGHT_SEMI = 8;
    JOIN_TYPE_RIGHT_ANTI = 9;
    JOIN_TYPE_RIGHT_SINGLE = 10;
    JOIN_TYPE_LEFT_MARK = 11;
    JOIN_TYPE_RIGHT_MARK = 12;
  }

  substrait.extensions.AdvancedExtension advanced_extension = 10;
}
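To make the message concrete, here is a minimal sketch of a JoinRel instance in protobuf text format. The table names, field indices, and the function anchor are illustrative assumptions; a real plan would declare the equality function in its extensions and would also carry an output type on the scalar function.

# INNER join of two scans on a single (assumed) equality condition.
left  { read { named_table { names: "orders" } } }
right { read { named_table { names: "customers" } } }
expression {
  scalar_function {
    function_reference: 0   # assumed anchor for an equality function (e.g., equal:any_any)
    arguments { value { selection {
      direct_reference { struct_field { field: 1 } }  # a left-input field (assumed)
      root_reference { }
    } } }
    arguments { value { selection {
      direct_reference { struct_field { field: 2 } }  # a right-input field, indexed past the left fields (assumed)
      root_reference { }
    } } }
    # output_type (boolean) elided for brevity
  }
}
type: JOIN_TYPE_INNER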
    "},{"location":"relations/logical_relations/#set-operation","title":"Set Operation","text":"

    The set operation encompasses several set-level operations that support combining datasets, possibly excluding records based on various types of record level matching.

Signature | Value
--- | ---
Inputs | 2 or more
Outputs | 1
Property Maintenance | Maintains distribution if all inputs have the same ordinal distribution. Orderedness is not maintained.
Direct Output Order | The field order of the inputs. All inputs must have identical field types, but field nullabilities may vary.

Set Properties

Property | Description | Required
--- | --- | ---
Primary Input | The primary input of the dataset. | Required
Secondary Inputs | One or more relational inputs. | At least one required
Set Operation Type | From the list below. | Required

Set Operation Types

    The set operation type determines both the records that are emitted and the type of the output record.

For some set operations, whether a specific record is included in the output and whether it appears more than once depends on the number of times it occurs across all inputs. In the following table, let:

• m: the number of times a record occurs in the primary input (p)
• n1: the number of times a record occurs in the 1st secondary input (s1)
• n2: the number of times a record occurs in the 2nd secondary input (s2)
• …
• n: the number of times a record occurs in the nth secondary input

Operation | Description | Examples | Output Nullability
--- | --- | --- | ---
Minus (Primary) | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicates. Each value is treated as a unique member of the set, so duplicates in the first set don't affect the result. This operation maps to SQL EXCEPT DISTINCT. | MINUS p: {1, 2, 2, 3, 3, 3, 4} s1: {1, 2} s2: {3} YIELDS {4} | The same as the primary input.
Minus (Primary All) | Returns all records from the primary input excluding any matching records from secondary inputs. For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies. This operation maps to SQL EXCEPT ALL. | MINUS ALL p: {1, 2, 2, 3, 3, 3, 3} s1: {1, 2, 3, 4} s2: {3} YIELDS {2, 3, 3} | The same as the primary input.
Minus (Multiset) | Returns all records from the primary input excluding any records that are included in all secondary inputs. This operation does not have a direct SQL mapping. | MINUS MULTISET p: {1, 2, 3, 4} s1: {1, 2} s2: {1, 2, 3} YIELDS {3, 4} | The same as the primary input.
Intersection (Primary) | Returns all records from the primary input that are present in any secondary input, removing duplicates. This operation does not have a direct SQL mapping. | INTERSECT p: {1, 2, 2, 3, 3, 3, 4} s1: {1, 2, 3, 5} s2: {2, 3, 6} YIELDS {1, 2, 3} | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output.
Intersection (Multiset) | Returns all records from the primary input that match at least one record from all secondary inputs. This operation maps to SQL INTERSECT DISTINCT. | INTERSECT MULTISET p: {1, 2, 3, 4} s1: {2, 3} s2: {3, 4} YIELDS {3} | If a field is required in any of the inputs, it is required in the output.
Intersection (Multiset All) | Returns all records from the primary input that are present in every secondary input. For each specific record returned, the output contains min(m, n1, n2, …, n) copies. This operation maps to SQL INTERSECT ALL. | INTERSECT ALL p: {1, 2, 2, 3, 3, 3, 4} s1: {1, 2, 3, 3, 5} s2: {2, 3, 3, 6} YIELDS {2, 3, 3} | If a field is required in any of the inputs, it is required in the output.
Union Distinct | Returns all records from each set, removing duplicates. This operation maps to SQL UNION DISTINCT. | UNION p: {1, 2, 2, 3, 3, 3, 4} s1: {2, 3, 5} s2: {1, 6} YIELDS {1, 2, 3, 4, 5, 6} | If a field is nullable in any of the inputs, it is nullable in the output.
Union All | Returns all records from all inputs. For each specific record returned, the output contains (m + n1 + n2 + … + n) copies. This operation maps to SQL UNION ALL. | UNION ALL p: {1, 2, 2, 3, 3, 3, 4} s1: {2, 3, 5} s2: {1, 6} YIELDS {1, 2, 2, 3, 3, 3, 4, 2, 3, 5, 1, 6} | If a field is nullable in any of the inputs, it is nullable in the output.

Note that for set operations, NULL matches NULL. That is:

{NULL, 1, 3} MINUS          {NULL, 2, 4} === (1), (3)
{NULL, 1, 3} INTERSECTION   {NULL, 2, 4} === (NULL)
{NULL, 1, 3} UNION DISTINCT {NULL, 2, 4} === (NULL), (1), (2), (3), (4)

    "},{"location":"relations/logical_relations/#output-type-derivation-examples","title":"Output Type Derivation Examples","text":"

    Given the following inputs, where R is Required and N is Nullable:

Input 1: (R, R, R, R, N, N, N, N)  Primary Input
Input 2: (R, R, N, N, R, R, N, N)  Secondary Input
Input 3: (R, N, R, N, R, N, R, N)  Secondary Input

The output type is as follows for the various operations:

Operation | Output Type
--- | ---
Minus (Primary) | (R, R, R, R, N, N, N, N)
Minus (Primary All) | (R, R, R, R, N, N, N, N)
Minus (Multiset) | (R, R, R, R, N, N, N, N)
Intersection (Primary) | (R, R, R, R, R, N, N, N)
Intersection (Multiset) | (R, R, R, R, R, R, R, N)
Intersection (Multiset All) | (R, R, R, R, R, R, R, N)
Union Distinct | (R, N, N, N, N, N, N, N)
Union All | (R, N, N, N, N, N, N, N)

SetRel Message
message SetRel {
  RelCommon common = 1;
  // The first input is the primary input, the remaining are secondary
  // inputs. There must be at least two inputs.
  repeated Rel inputs = 2;
  SetOp op = 3;
  substrait.extensions.AdvancedExtension advanced_extension = 10;

  enum SetOp {
    SET_OP_UNSPECIFIED = 0;
    SET_OP_MINUS_PRIMARY = 1;
    SET_OP_MINUS_PRIMARY_ALL = 7;
    SET_OP_MINUS_MULTISET = 2;
    SET_OP_INTERSECTION_PRIMARY = 3;
    SET_OP_INTERSECTION_MULTISET = 4;
    SET_OP_INTERSECTION_MULTISET_ALL = 8;
    SET_OP_UNION_DISTINCT = 5;
    SET_OP_UNION_ALL = 6;
  }
}
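As a sketch in protobuf text format, a UNION ALL of a primary input p and two secondary inputs s1 and s2 could be encoded as follows (the table names are illustrative):

inputs { read { named_table { names: "p" } } }    # primary input
inputs { read { named_table { names: "s1" } } }   # first secondary input
inputs { read { named_table { names: "s2" } } }   # second secondary input
op: SET_OP_UNION_ALL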
    "},{"location":"relations/logical_relations/#fetch-operation","title":"Fetch Operation","text":"

The fetch operation eliminates records outside a desired window. It typically corresponds to a SQL FETCH/OFFSET clause and returns only the records between the start offset and the end offset.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Maintains distribution and orderedness.
Direct Output Order | Unchanged from input.

Fetch Properties

Property | Description | Required
--- | --- | ---
Input | A relational input, typically with a desired orderedness property. | Required
Offset | A non-negative integer. Declares the offset for retrieval of records. | Optional, defaults to 0.
Count | A non-negative integer or -1. Declares the number of records that should be returned. -1 signals that ALL records should be returned. | Required

FetchRel Message
message FetchRel {
  RelCommon common = 1;
  Rel input = 2;
  // the offset expressed in number of records
  int64 offset = 3;
  // the amount of records to return
  // use -1 to signal that ALL records should be returned
  int64 count = 4;
  substrait.extensions.AdvancedExtension advanced_extension = 10;
}
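For illustration, a fetch that skips the first 10 records and returns at most 5 might be sketched as follows in protobuf text format (the input scan is an assumed stand-in; in practice the input would usually carry a sort):

input { read { named_table { names: "t" } } }  # typically a sorted input in practice
offset: 10   # skip the first 10 records
count: 5     # return at most 5 records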
    "},{"location":"relations/logical_relations/#aggregate-operation","title":"Aggregate Operation","text":"

    The aggregate operation groups input data on one or more sets of grouping keys, calculating each measure for each combination of grouping key.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Maintains distribution if all distribution fields are contained in every grouping set. No orderedness guaranteed.
Direct Output Order | The list of grouping expressions in declaration order followed by the list of measures in declaration order, followed by an i32 describing the associated particular grouping set the value is derived from (if applicable).

    In its simplest form, an aggregation has only measures. In this case, all records are folded into one, and a column is returned for each aggregate expression in the measures list.

    Grouping sets can be used for finer-grained control over which records are folded. A grouping set consists of zero or more references to the list of grouping expressions. Within a grouping set, two records will be folded together if and only if they have the same values for each of the expressions in the grouping set. The values returned by the grouping expressions will be returned as columns to the left of the columns for the aggregate expressions. Each of the grouping expressions must occur in at least one of the grouping sets. If a grouping set contains no grouping expressions, all rows will be folded for that grouping set. (Having a single grouping set with no grouping expressions is thus equivalent to not having any grouping sets.)

It is possible to specify multiple grouping sets in a single aggregate operation. The grouping sets behave more or less independently, with each returned record belonging to one of the grouping sets. The values for the grouping expression columns that are not part of the grouping set for a particular record will be set to null. The columns for grouping expressions that do not appear in all grouping sets will be nullable (regardless of the nullability of the type returned by the grouping expression) to accommodate the null insertion.

    To further disambiguate which record belongs to which grouping set, an aggregate relation with more than one grouping set receives an extra i32 column on the right-hand side. The value of this field will be the zero-based index of the grouping set that yielded the record.

If at least one grouping expression is present, the aggregation need not have any aggregate expressions. An aggregate relation is invalid if it would yield zero columns.

    "},{"location":"relations/logical_relations/#aggregate-properties","title":"Aggregate Properties","text":"Property Description Required Input The relational input. Required Grouping Sets One or more grouping sets. Optional, required if no measures. Per Grouping Set A list of expression grouping that the aggregation measured should be calculated for. Optional. Measures A list of one or more aggregate expressions along with an optional filter. Optional, required if no grouping sets. AggregateRel Message
message AggregateRel {
  RelCommon common = 1;

  // Input of the aggregation
  Rel input = 2;

  // A list of zero or more grouping sets that the aggregation measures should
  // be calculated for. There must be at least one grouping set if there are no
  // measures (but it can be the empty grouping set).
  repeated Grouping groupings = 3;

  // A list of one or more aggregate expressions along with an optional filter.
  // Required if there are no groupings.
  repeated Measure measures = 4;

  // A list of zero or more grouping expressions that grouping sets (i.e.,
  // `Grouping` messages in the `groupings` field) can reference. Each
  // expression in this list must be referred to by at least one
  // `Grouping.expression_references`.
  repeated Expression grouping_expressions = 5;

  substrait.extensions.AdvancedExtension advanced_extension = 10;

  message Grouping {
    // Deprecated in favor of `expression_references` below.
    repeated Expression grouping_expressions = 1 [deprecated = true];

    // A list of zero or more references to grouping expressions, i.e., indices
    // into the `grouping_expressions` list.
    repeated uint32 expression_references = 2;
  }

  message Measure {
    AggregateFunction measure = 1;

    // An optional boolean expression that acts to filter which records are
    // included in the measure. True means include this record for calculation
    // within the measure.
    // Helps to support SUM(<c>) FILTER(WHERE...) syntax without masking
    // opportunities for optimization.
    Expression filter = 2;
  }
}
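As a sketch in protobuf text format, grouping by the first input column and computing a single measure might look as follows. The table name, field indices, and the function anchor (assumed to be bound to an aggregate such as sum at the plan level) are illustrative:

input { read { named_table { names: "sales" } } }
grouping_expressions { selection {
  direct_reference { struct_field { field: 0 } }   # group key: first input column
  root_reference { }
} }
groupings { expression_references: 0 }   # a single grouping set: (field 0)
measures { measure {
  function_reference: 1   # assumed anchor for an aggregate function (e.g., sum)
  phase: AGGREGATION_PHASE_INITIAL_TO_RESULT
  arguments { value { selection {
    direct_reference { struct_field { field: 1 } }   # the column being aggregated (assumed)
    root_reference { }
  } } }
} }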
    "},{"location":"relations/logical_relations/#reference-operator","title":"Reference Operator","text":"

    The reference operator is used to construct DAGs of operations. In a Plan we can have multiple Rel representing various computations with potentially multiple outputs. The ReferenceRel is used to express the fact that multiple Rel might be sharing subtrees of computation. This can be used to express arbitrary DAGs as well as represent multi-query optimizations.

As a concrete example, consider two queries: SELECT * FROM A JOIN B JOIN C and SELECT * FROM A JOIN B JOIN D. We could use the ReferenceRel to highlight the shared A JOIN B between the two queries by creating a plan with three Rels: one expressing A JOIN B (in position 0 in the plan), one using a reference as follows: ReferenceRel(0) JOIN C, and a third one doing ReferenceRel(0) JOIN D. This avoids the redundancy of A JOIN B (a sketch of this layout follows the message below).

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Maintains all properties of the input
Direct Output Order | Maintains order

Reference Properties

Property | Description | Required
--- | --- | ---
Referred Rel | A zero-indexed positional reference to a Rel defined within the same Plan. | Required

ReferenceRel Message
message ReferenceRel {
  int32 subtree_ordinal = 1;
}
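The A JOIN B example above could be laid out as three entries of Plan.relations, sketched below in protobuf text format. Join conditions and root output names are elided, and the sketch assumes the ReferenceRel member of the Rel oneof is named reference:

relations { rel { join {                        # position 0: A JOIN B
  left  { read { named_table { names: "A" } } }
  right { read { named_table { names: "B" } } }
  type: JOIN_TYPE_INNER
} } }
relations { root { input { join {               # (A JOIN B) JOIN C
  left  { reference { subtree_ordinal: 0 } }    # reuses relation 0
  right { read { named_table { names: "C" } } }
  type: JOIN_TYPE_INNER
} } } }
relations { root { input { join {               # (A JOIN B) JOIN D
  left  { reference { subtree_ordinal: 0 } }
  right { read { named_table { names: "D" } } }
  type: JOIN_TYPE_INNER
} } } }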
    "},{"location":"relations/logical_relations/#write-operator","title":"Write Operator","text":"

    The write operator is an operator that consumes one input and writes it to storage. This can range from writing to a Parquet file, to INSERT/DELETE/UPDATE in a database.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Output depends on OutputMode (none, or modified records)
Direct Output Order | Unchanged from input

Write Properties

Property | Description | Required
--- | --- | ---
Write Type | Definition of which object we are operating on (e.g., a fully-qualified table name). | Required
CTAS Schema | The names of all the columns and their type for a CREATE TABLE AS. | Required only for CTAS
Write Operator | Which type of operation we are performing (INSERT/DELETE/UPDATE/CTAS). | Required
Rel Input | The Rel representing which records we will be operating on (e.g., VALUES for an INSERT, or which records to DELETE, or records and the after-image of their values for UPDATE). | Required
Create Mode | This determines what should happen if the table already exists (ERROR/REPLACE/IGNORE). | Required only for CTAS
Output Mode | For views that modify a DB it is important to control which records to "return". The common default is NO_OUTPUT, where we return nothing. Alternatively, we can return MODIFIED_RECORDS, which can be further manipulated by layering more rels on top of this WriteRel (e.g., to count how many records were updated). This also allows returning the after-image of the change. To return the before-image (or both) one can use the reference mechanisms and have multiple return values. | Required for VIEW CREATE/CREATE_OR_REPLACE/ALTER

Write Definition Types

Adding new Write Definition Types

If you have a write definition that's not covered here, see the process for adding new write definition types.

    Write definition types are built by the community and added to the specification.

    WriteRel Message
message WriteRel {
  // Definition of which TABLE we are operating on
  oneof write_type {
    NamedObjectWrite named_table = 1;
    ExtensionObject extension_table = 2;
  }

  // The schema of the table (must align with Rel input (e.g., number of leaf fields must match))
  NamedStruct table_schema = 3;

  // The type of operation to perform
  WriteOp op = 4;

  // The relation that determines the records to add/remove/modify
  // the schema must match with table_schema. Default values must be explicitly stated
  // in a ProjectRel at the top of the input. The match must also
  // occur in case of DELETE to ensure multi-engine plans are unequivocal.
  Rel input = 5;

  CreateMode create_mode = 8; // Used with CTAS to determine what to do if the table already exists

  // Output mode determines what is the output of executing this rel
  OutputMode output = 6;
  RelCommon common = 7;

  enum WriteOp {
    WRITE_OP_UNSPECIFIED = 0;
    // The insert of new records in a table
    WRITE_OP_INSERT = 1;
    // The removal of records from a table
    WRITE_OP_DELETE = 2;
    // The modification of existing records within a table
    WRITE_OP_UPDATE = 3;
    // The creation of a new table, and the insert of new records in the table
    WRITE_OP_CTAS = 4;
  }

  enum CreateMode {
    CREATE_MODE_UNSPECIFIED = 0;
    CREATE_MODE_APPEND_IF_EXISTS = 1; // Append the data to the table if it already exists
    CREATE_MODE_REPLACE_IF_EXISTS = 2; // Replace the table if it already exists ("OR REPLACE")
    CREATE_MODE_IGNORE_IF_EXISTS = 3; // Ignore the request if the table already exists ("IF NOT EXISTS")
    CREATE_MODE_ERROR_IF_EXISTS = 4; // Throw an error if the table already exists (default behavior)
  }

  enum OutputMode {
    OUTPUT_MODE_UNSPECIFIED = 0;
    // return no records at all
    OUTPUT_MODE_NO_OUTPUT = 1;
    // this mode makes the operator return all the records INSERTED/DELETED/UPDATED by the operator.
    // The operator returns the AFTER-image of any change. This can be further manipulated by operators upstream
    // (e.g., returning the typical "count of modified records").
    // For scenarios in which the BEFORE image is required, the user must implement a spool (via references to
    // subplans in the body of the Rel input) and return those with another PlanRel.relations.
    OUTPUT_MODE_MODIFIED_RECORDS = 2;
  }
}
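A minimal INSERT, sketched in protobuf text format; the table name, schema, and input relation are illustrative assumptions:

named_table { names: "people" }
table_schema {
  names: "id"
  names: "name"
  struct {
    types { i64 { nullability: NULLABILITY_REQUIRED } }
    types { string { nullability: NULLABILITY_NULLABLE } }
    nullability: NULLABILITY_REQUIRED
  }
}
op: WRITE_OP_INSERT
input { read { named_table { names: "staging_people" } } }  # the records to insert
output: OUTPUT_MODE_NO_OUTPUT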
    "},{"location":"relations/logical_relations/#virtual-table_1","title":"Virtual Table","text":"Property Description Required Name The in-memory name to give the dataset. Required Pin Whether it is okay to remove this dataset from memory or it should be kept in memory. Optional, defaults to false."},{"location":"relations/logical_relations/#files-type_1","title":"Files Type","text":"Property Description Required Path A URI to write the data to. Supports the inclusion of field references that are listed as available in properties as a \u201crotation description field\u201d. Required Format Enumeration of available formats. Only current option is PARQUET. Required"},{"location":"relations/logical_relations/#update-operator","title":"Update Operator","text":"

The update operator applies a set of column transformations on a named table and writes the result back to storage.

Signature | Value
--- | ---
Inputs | 0
Outputs | 1
Property Maintenance | Output is the number of modified records

Update Properties

Property | Description | Required
--- | --- | ---
Update Type | Definition of which object we are operating on (e.g., a fully-qualified table name). | Required
Table Schema | The names and types of all the columns of the input table. | Required
Update Condition | The condition that must be met for a record to be updated. | Required
Update Transformations | The set of column updates to be applied to the table. | Required

UpdateRel Message
message UpdateRel {
  oneof update_type {
    NamedTable named_table = 1;
  }

  NamedStruct table_schema = 2; // The full schema of the named_table
  Expression condition = 3; // condition to be met for the update to be applied on a record

  // The list of transformations to apply to the columns of the named_table
  repeated TransformExpression transformations = 4;

  message TransformExpression {
    Expression transformation = 1; // the transformation to apply
    int32 column_target = 2; // index of the column to apply the transformation to
  }
}
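For example, an update that sets one column to a literal for every record matching a condition might be sketched as follows in protobuf text format. The table, field indices, and function anchor are assumptions, and the full schema is elided:

named_table { names: "users" }
# table_schema elided; it must describe the full schema of "users"
condition {
  scalar_function {
    function_reference: 3   # assumed anchor for a comparison such as gt:i64_i64
    arguments { value { selection {
      direct_reference { struct_field { field: 2 } }   # e.g., an "age" column (assumed)
      root_reference { }
    } } }
    arguments { value { literal { i64: 90 } } }
  }
}
transformations {
  transformation { literal { boolean: false } }   # new value for the column
  column_target: 3                                # index of the column to update (assumed)
}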
    "},{"location":"relations/logical_relations/#ddl-data-definition-language-operator","title":"DDL (Data Definition Language) Operator","text":"

    The operator that defines modifications of a database schema (CREATE/DROP/ALTER for TABLE and VIEWS).

Signature | Value
--- | ---
Inputs | 1
Outputs | 0
Property Maintenance | N/A (no output)
Direct Output Order | N/A

DDL Properties

Property | Description | Required
--- | --- | ---
Write Type | Definition of which type of object we are operating on. | Required
Table Schema | The names of all the columns and their type. | Required (except for DROP operations)
Table Defaults | The set of default values for this table. | Required (except for DROP operations)
DDL Object | Which type of object we are operating on (e.g., TABLE or VIEW). | Required
DDL Operator | The operation to be performed (e.g., CREATE/ALTER/DROP). | Required
View Definition | A Rel representing the "body" of a VIEW. | Required for VIEW CREATE/CREATE_OR_REPLACE/ALTER

DdlRel Message
message DdlRel {
  // Definition of which type of object we are operating on
  oneof write_type {
    NamedObjectWrite named_object = 1;
    ExtensionObject extension_object = 2;
  }

  // The columns that will be modified (representing the after-image of a schema change)
  NamedStruct table_schema = 3;
  // The default values for the columns (representing the after-image of a schema change)
  // E.g., in case of an ALTER TABLE that changes some of the column default values, we expect
  // the table_defaults Struct to report a full list of default values reflecting the result of applying
  // the ALTER TABLE operator successfully
  Expression.Literal.Struct table_defaults = 4;

  // Which type of object we operate on
  DdlObject object = 5;

  // The type of operation to perform
  DdlOp op = 6;

  // The body of the CREATE VIEW
  Rel view_definition = 7;
  RelCommon common = 8;

  enum DdlObject {
    DDL_OBJECT_UNSPECIFIED = 0;
    // A Table object in the system
    DDL_OBJECT_TABLE = 1;
    // A View object in the system
    DDL_OBJECT_VIEW = 2;
  }

  enum DdlOp {
    DDL_OP_UNSPECIFIED = 0;
    // A create operation (for any object)
    DDL_OP_CREATE = 1;
    // A create operation if the object does not exist, or replaces it (equivalent to a DROP + CREATE) if the object already exists
    DDL_OP_CREATE_OR_REPLACE = 2;
    // An operation that modifies the schema (e.g., column names, types, default values) for the target object
    DDL_OP_ALTER = 3;
    // An operation that removes an object from the system
    DDL_OP_DROP = 4;
    // An operation that removes an object from the system (without throwing an exception if the object did not exist)
    DDL_OP_DROP_IF_EXIST = 5;
  }
  // TODO: add PK/constraints/indexes/etc.?
}
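A CREATE VIEW, sketched in protobuf text format; the object name and view body are illustrative, and the filter condition is elided:

named_object { names: "active_users" }
object: DDL_OBJECT_VIEW
op: DDL_OP_CREATE
view_definition {
  filter {
    input { read { named_table { names: "users" } } }
    # condition: a boolean expression selecting active users (elided)
  }
}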
    Discussion Points
    • How should correlated operations be handled?
    "},{"location":"relations/physical_relations/","title":"Physical Relations","text":"

There is no true distinction between logical and physical operations in Substrait. By convention, certain operations are classified as physical, but all operations can be potentially used in any kind of plan. A particular set of transformations or target operators may (by convention) be considered the "physical plan" but this is a characteristic of the system consuming Substrait as opposed to a definition within Substrait.

    "},{"location":"relations/physical_relations/#hash-equijoin-operator","title":"Hash Equijoin Operator","text":"

The hash equijoin operator will build a hash table out of the right input based on a set of join keys. It will then probe that hash table with incoming records, finding matches.

Signature | Value
--- | ---
Inputs | 2
Outputs | 1
Property Maintenance | Distribution is maintained. Orderedness of the left set is maintained in INNER join cases, otherwise it is eliminated.
Direct Output Order | Same as the Join operator.

Hash Equijoin Properties

Property | Description | Required
--- | --- | ---
Left Input | A relational input. (Probe-side) | Required
Right Input | A relational input. (Build-side) | Required
Left Keys | References to the fields to join on in the left input. | Required
Right Keys | References to the fields to join on in the right input. | Required
Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults to true.
Join Type | One of the join types defined in the Join operator. | Required

NLJ (Nested Loop Join) Operator

The nested loop join operator does a join by holding the entire right input and then iterating over it using the left input, evaluating the join expression on the Cartesian product of all rows and outputting only the rows where the expression is true. It will also include non-matching rows in the OUTER, LEFT and RIGHT operations per the join type requirements.

Signature | Value
--- | ---
Inputs | 2
Outputs | 1
Property Maintenance | Distribution is maintained. Orderedness is eliminated.
Direct Output Order | Same as the Join operator.

NLJ Properties

Property | Description | Required
--- | --- | ---
Left Input | A relational input. | Required
Right Input | A relational input. | Required
Join Expression | A boolean condition that describes whether each record from the left set "matches" a record from the right set. | Optional. Defaults to true (a Cartesian join).
Join Type | One of the join types defined in the Join operator. | Required

Merge Equijoin Operator

    The merge equijoin does a join by taking advantage of two sets that are sorted on the join keys. This allows the join operation to be done in a streaming fashion.

Signature | Value
--- | ---
Inputs | 2
Outputs | 1
Property Maintenance | Distribution is maintained. Orderedness is eliminated.
Direct Output Order | Same as the Join operator.

Merge Join Properties

Property | Description | Required
--- | --- | ---
Left Input | A relational input. | Required
Right Input | A relational input. | Required
Left Keys | References to the fields to join on in the left input. | Required
Right Keys | References to the fields to join on in the right input. | Required
Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults to true.
Join Type | One of the join types defined in the Join operator. | Required

Exchange Operator

    The exchange operator will redistribute data based on an exchange type definition. Applying this operation will lead to an output that presents the desired distribution.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Orderedness is maintained. Distribution is overwritten based on configuration.
Direct Output Order | Order of the input.

Exchange Types

Type | Description
--- | ---
Scatter | Distribute data using a system-defined hashing function that considers one or more fields. For the same type of fields and same ordering of values, the same partition target should be identified for different ExchangeRels.
Single Bucket | Define an expression that provides a single i32 bucket number. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition.
Multi Bucket | Define an expression that provides a List<i32> of bucket numbers. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition. The records should be sent to all bucket numbers provided by the expression.
Broadcast | Send all records to all partitions.
Round Robin | Send records to each target in sequence. Can follow either exact or approximate behavior. Approximate will attempt to balance the number of records sent to each destination but may not exactly distribute evenly and may send batches of records to each target before moving to the next.

Exchange Properties

Property | Description | Required
--- | --- | ---
Input | The relational input. | Required.
Distribution Type | One of the distribution types defined above. | Required.
Partition Count | The number of partitions targeted for output. | Optional. If not defined, the implementation system should decide the number of partitions. Note that when not defined, single or multi bucket expressions should not be constrained to count.
Expression Mapping | Describes a relationship between each partition ID and the destination that partition should be sent to. | Optional. A partition may be sent to 0..N locations. Value can either be a URI or arbitrary value.

Merging Capture

    A receiving operation that will merge multiple ordered streams to maintain orderedness.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Orderedness and distribution are maintained.
Direct Output Order | Order of the input.

Merging Capture Properties

Property | Description | Required
--- | --- | ---
Blocking | Whether the merging should block incoming data. Blocking should be used carefully, based on whether a deadlock can be produced. | Optional, defaults to false

Simple Capture

    A receiving operation that will merge multiple streams in an arbitrary order.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Orderedness is empty after this operation. Distribution is maintained.
Direct Output Order | Order of the input.

Naive Capture Properties

Property | Description | Required
--- | --- | ---
Input | The relational input. | Required

Top-N Operation

The top-N operator reorders a dataset based on one or more identified sort fields as well as a sorting function. Rather than sort the entire dataset, the top-N will only maintain the total number of records required to ensure a limited output. A top-N is a combination of the logical sort and logical fetch operations.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Will update orderedness property to the output of the sort operation. Distribution property only remapped based on emit.
Direct Output Order | The field order of the input.

Top-N Properties

Property | Description | Required
--- | --- | ---
Input | The relational input. | Required
Sort Fields | List of one or more fields to sort by. Uses the same properties as the orderedness property. | One sort field required
Offset | A non-negative integer. Declares the offset for retrieval of records. | Optional, defaults to 0.
Count | A positive integer. Declares the number of records that should be returned. | Required

Hash Aggregate Operation

    The hash aggregate operation maintains a hash table for each grouping set to coalesce equivalent tuples.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Maintains distribution if all distribution fields are contained in every grouping set. No orderedness guaranteed.
Direct Output Order | Same as defined by the Aggregate operation.

Hash Aggregate Properties

Property | Description | Required
--- | --- | ---
Input | The relational input. | Required
Grouping Sets | One or more grouping sets. | Optional, required if no measures.
Per Grouping Set | A list of expression groupings that the aggregation measures should be calculated for. | Optional.
Measures | A list of one or more aggregate expressions. Implementations may or may not support aggregate ordering expressions. | Optional, required if no grouping sets.

Streaming Aggregate Operation

The streaming aggregate operation leverages data ordered by the grouping expressions to calculate each grouping set tuple-by-tuple in a streaming fashion. All grouping sets and orderings requested on each aggregate must be compatible to allow multiple grouping sets or aggregate orderings.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Maintains distribution if all distribution fields are contained in every grouping set. Maintains input ordering.
Direct Output Order | Same as defined by the Aggregate operation.

Streaming Aggregate Properties

Property | Description | Required
--- | --- | ---
Input | The relational input. | Required
Grouping Sets | One or more grouping sets. If multiple grouping sets are declared, sets must all be compatible with the input sortedness. | Optional, required if no measures.
Per Grouping Set | A list of expression groupings that the aggregation measures should be calculated for. | Optional.
Measures | A list of one or more aggregate expressions. Aggregate expression ordering requirements must be compatible with the expected ordering. | Optional, required if no grouping sets.

Consistent Partition Window Operation

    A consistent partition window operation is a special type of project operation where every function is a window function and all of the window functions share the same sorting and partitioning. This allows for the sort and partition to be calculated once and shared between the various function evaluations.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Maintains distribution and ordering.
Direct Output Order | Same as the Project operator (input followed by each window expression).

Window Properties

Property | Description | Required
--- | --- | ---
Input | The relational input. | Required
Window Functions | One or more window functions. | At least one required.

Expand Operation

    The expand operation creates duplicates of input records based on the Expand Fields. Each Expand Field can be a Switching Field or an expression. Switching Fields are described below. If an Expand Field is an expression then its value is consistent across all duplicate rows.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Distribution is maintained if all the distribution fields are consistent fields with direct references. Ordering can only be maintained down to the level of consistent fields that are kept.
Direct Output Order | The expand fields followed by an i32 column describing the index of the duplicate that the row is derived from.

Expand Properties

Property | Description | Required
--- | --- | ---
Input | The relational input. | Required
Direct Fields | Expressions describing the output fields. These refer to the schema of the input. Each direct field must be an expression or a Switching Field. | Required

Switching Field Properties

    A switching field is a field whose value is different in each duplicated row. All switching fields in an Expand Operation must have the same number of duplicates.

Property | Description | Required
--- | --- | ---
Duplicates | List of one or more expressions. The output will contain a row for each expression. | Required

Hashing Window Operation

    A window aggregate operation that will build hash tables for each distinct partition expression.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Maintains distribution. Eliminates ordering.
Direct Output Order | Same as the Project operator (input followed by each window expression).

Hashing Window Properties

Property | Description | Required
--- | --- | ---
Input | The relational input. | Required
Window Expressions | One or more window expressions. | At least one required.

Streaming Window Operation

    A window aggregate operation that relies on a partition/ordering sorted input.

Signature | Value
--- | ---
Inputs | 1
Outputs | 1
Property Maintenance | Maintains distribution. Eliminates ordering.
Direct Output Order | Same as the Project operator (input followed by each window expression).

Streaming Window Properties

Property | Description | Required
--- | --- | ---
Input | The relational input. | Required
Window Expressions | One or more window expressions. Must be supported by the sortedness of the input. | At least one required.

User Defined Relations

    Pending

    "},{"location":"serialization/basics/","title":"Basics","text":"

    Substrait is designed to be serialized into various different formats. Currently we support a binary serialization for transmission of plans between programs (e.g. IPC or network communication) and a text serialization for debugging and human readability. Other formats may be added in the future.

    These formats serialize a collection of plans. Substrait does not define how a collection of plans is to be interpreted. For example, the following scenarios are all valid uses of a collection of plans:

    • A query engine receives a plan and executes it. It receives a collection of plans with a single root plan. The top-level node of the root plan defines the output of the query. Non-root plans may be included as common subplans which are referenced from the root plan.
    • A transpiler may convert plans from one dialect to another. It could take, as input, a single root plan. Then it could output a serialized binary containing multiple root plans. Each root plan is a representation of the input plan in a different dialect.
    • A distributed scheduler might expect 1+ root plans. Each root plan describes a different stage of computation.

    Libraries should make sure to thoroughly describe the way plan collections will be produced or consumed.

    "},{"location":"serialization/basics/#root-plans","title":"Root plans","text":"

We often refer to query plans as a graph of nodes (typically a DAG unless the query is recursive). However, we encode this graph as a collection of trees with a single root tree that references other trees (which may also transitively reference other trees). Plan serializations all have some way to indicate which plan(s) are "root" plans. Any plan that is not a root plan and is not referenced (directly or transitively) by some root plan can safely be ignored.

    "},{"location":"serialization/binary_serialization/","title":"Binary Serialization","text":"

Substrait can be serialized into a protobuf-based binary representation. The proto schema/IDL files can be found on GitHub. Proto files are placed in the io.substrait namespace for C++/Java and the Substrait.Protobuf namespace for C#.

    "},{"location":"serialization/binary_serialization/#plan","title":"Plan","text":"

The main top-level object used to communicate a Substrait plan using protobuf is a Plan message (see ExtendedExpression for an alternative top-level object). The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) relation trees.

    Plan Message
message Plan {
  // Substrait version of the plan. Optional up to 0.17.0, required for later
  // versions.
  Version version = 6;

  // a list of yaml specifications this plan may depend on
  repeated substrait.extensions.SimpleExtensionURI extension_uris = 1;

  // a list of extensions this plan may depend on
  repeated substrait.extensions.SimpleExtensionDeclaration extensions = 2;

  // one or more relation trees that are associated with this plan.
  repeated PlanRel relations = 3;

  // additional extensions associated with this plan.
  substrait.extensions.AdvancedExtension advanced_extensions = 4;

  // A list of com.google.Any entities that this plan may use. Can be used to
  // warn if some embedded message types are unknown. Note that this list may
  // include message types that are ignorable (optimizations) or that are
  // unused. In many cases, a consumer may be able to work with a plan even if
  // one or more message types defined here are unknown.
  repeated string expected_type_urls = 5;
}
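Tying these pieces together, a minimal plan might be sketched as follows in protobuf text format. The version numbers, URI, and relation body are illustrative assumptions:

version { minor_number: 42 producer: "example-producer" }   # assumed version
extension_uris {
  extension_uri_anchor: 1
  uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml"
}
extensions { extension_function {
  extension_uri_reference: 1   # refers to the URI anchor above
  function_anchor: 0
  name: "equal:any_any"        # a function signature compound name
} }
relations { root {
  input { read { named_table { names: "t" } } }
  names: "a"   # output field name(s) of the root relation (assumed)
} }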
    "},{"location":"serialization/binary_serialization/#extensions","title":"Extensions","text":"

    Protobuf supports both simple and advanced extensions. Simple extensions are declared at the plan level and advanced extensions are declared at multiple levels of messages within the plan.

    "},{"location":"serialization/binary_serialization/#simple-extensions","title":"Simple Extensions","text":"

    For simple extensions, a plan references the URIs associated with the simple extensions to provide additional plan capabilities. These URIs will list additional relevant information for the plan.

    Simple extensions within a plan are split into three components: an extension URI, an extension declaration and a number of references.

    • Extension URI: A unique identifier for the extension pointing to a YAML document specifying one or more specific extensions. Declares an anchor that can be used in extension declarations.
    • Extension Declaration: A specific extension within a single YAML document. The declaration combines a reference to the associated Extension URI along with a unique key identifying the specific item within that YAML document (see Function Signature Compound Names). It also defines a declaration anchor. The anchor is a plan-specific unique value that the producer creates as a key to be referenced elsewhere.
    • Extension Reference: A specific instance or use of an extension declaration within the plan body.

Extension URIs and declarations are encapsulated in the top level of the plan. Extension declarations are then referenced throughout the body of the plan itself. The exact structure of these references will depend on the extension point being used, but they will always include the extension's anchor (or key). For example, all scalar function expressions contain references to an extension declaration which defines the semantics of the function.

    Simple Extension URI
message SimpleExtensionURI {
  // A surrogate key used in the context of a single plan used to reference the
  // URI associated with an extension.
  uint32 extension_uri_anchor = 1;

  // The URI where this extension YAML can be retrieved. This is the "namespace"
  // of this extension.
  string uri = 2;
}

Once the YAML file URI anchor is defined, the anchor will be referenced by zero or more SimpleExtensionDeclarations. For each simple extension declaration, an anchor is defined for that specific extension entity. This anchor is then referenced within lower-level primitives (functions, etc.) to reference that specific extension. Message properties are named *_anchor where the anchor is defined and *_reference when referencing the anchor, for example function_anchor and function_reference.

    Simple Extension Declaration
message SimpleExtensionDeclaration {
  oneof mapping_type {
    ExtensionType extension_type = 1;
    ExtensionTypeVariation extension_type_variation = 2;
    ExtensionFunction extension_function = 3;
  }

  // Describes a Type
  message ExtensionType {
    // references the extension_uri_anchor defined for a specific extension URI.
    uint32 extension_uri_reference = 1;

    // A surrogate key used in the context of a single plan to reference a
    // specific extension type
    uint32 type_anchor = 2;

    // the name of the type in the defined extension YAML.
    string name = 3;
  }

  message ExtensionTypeVariation {
    // references the extension_uri_anchor defined for a specific extension URI.
    uint32 extension_uri_reference = 1;

    // A surrogate key used in the context of a single plan to reference a
    // specific type variation
    uint32 type_variation_anchor = 2;

    // the name of the type in the defined extension YAML.
    string name = 3;
  }

  message ExtensionFunction {
    // references the extension_uri_anchor defined for a specific extension URI.
    uint32 extension_uri_reference = 1;

    // A surrogate key used in the context of a single plan to reference a
    // specific function
    uint32 function_anchor = 2;

    // A function signature compound name
    string name = 3;
  }
}
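To make the anchor/reference mechanics concrete, the following sketch (protobuf text format, illustrative values and URI) shows a URI anchor, a function declaration bound to it, and how an expression in the plan body would point back at the declaration:

extension_uris { extension_uri_anchor: 7 uri: "https://example.com/my_extension.yaml" }
extensions { extension_function {
  extension_uri_reference: 7   # points at extension_uri_anchor 7 above
  function_anchor: 12          # key used by the plan body
  name: "add:i64_i64"          # function signature compound name in that YAML
} }
# ...later, inside an expression in the plan body:
# scalar_function { function_reference: 12 ... }   # points at function_anchor 12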

    Note

    Anchors only have meaning within a single plan and exist simply to reduce plan size. They are not some form of global identifier. Different plans may use different anchors for the same specific functions, types, type variations, etc.

    Note

    It is valid for a plan to include SimpleExtensionURIs and/or SimpleExtensionDeclarations that are not referenced directly.

    "},{"location":"serialization/binary_serialization/#advanced-extensions","title":"Advanced Extensions","text":"

Substrait protobuf exposes a special object in multiple places in the representation to expose extension capabilities. Extensions are done via this object. Extensions are separated into two main concepts:

Advanced Extension Type | Description
--- | ---
Optimization | A change to the plan that may help some consumers work more efficiently with the plan. These properties should be propagated through plan pipelines where possible but do not impact the meaning of the plan. A consumer can safely ignore these properties.
Enhancement | A change to the plan that functionally changes the behavior of the plan. Use these sparingly as they will impact plan interoperability.

Advanced Extension Protobuf
message AdvancedExtension {
  // An optimization is helpful information that doesn't influence semantics. May
  // be ignored by a consumer.
  repeated google.protobuf.Any optimization = 1;

  // An enhancement alters semantics. Cannot be ignored by a consumer.
  google.protobuf.Any enhancement = 2;
}
    "},{"location":"serialization/binary_serialization/#capabilities","title":"Capabilities","text":"

When two systems exchanging Substrait plans want to understand each other's capabilities, they may exchange a Capabilities message. The capabilities message provides information on the set of simple and advanced extensions that the system supports.

    Capabilities Message
message Capabilities {
  // List of Substrait versions this system supports
  repeated string substrait_versions = 1;

  // list of com.google.Any message types this system supports for advanced
  // extensions.
  repeated string advanced_extension_type_urls = 2;

  // list of simple extensions this system supports.
  repeated SimpleExtension simple_extensions = 3;

  message SimpleExtension {
    string uri = 1;
    repeated string function_keys = 2;
    repeated string type_keys = 3;
    repeated string type_variation_keys = 4;
  }
}
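For example, a system might advertise the following (protobuf text format; the version and extension values are illustrative):

substrait_versions: "0.42.0"
simple_extensions {
  uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml"
  function_keys: "add:i64_i64"
  function_keys: "subtract:i64_i64"
}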
    "},{"location":"serialization/binary_serialization/#protobuf-rationale","title":"Protobuf Rationale","text":"

    The binary format of Substrait is designed to be easy to work with in many languages. A key requirement is that someone can take the binary format IDL and use standard tools to build a set of primitives that are easy to work with in any of a number of languages. This allows communities to build and use Substrait using only a binary IDL and the specification (and allows the Substrait project to avoid being required to build libraries for each language to work with the specification).

    There are several binary IDLs that exist today. The key requirements for Substrait are the following:

    • Strongly typed IDL schema language
    • High-quality well-supported and idiomatic bindings/compilers for key languages (Python, Javascript, C++, Go, Rust, Java)
    • Compact serial representation

The primary formats that exist today and roughly qualify under these requirements include: Protobuf, Thrift, Flatbuf, Avro, Cap'n Proto. Protobuf was chosen due to its clean typing system and large number of high quality language bindings.

    The binary serialization IDLs can be found on GitHub and are sampled throughout the documentation.

    "},{"location":"serialization/text_serialization/","title":"Text Serialization","text":"

To maximize the new user experience, it is important for Substrait to have a text representation of plans. This allows people to experiment with basic tooling. Simple CLI tools that do things like SQL > Plan and Plan > SQL, or REPL plan construction, can all be built relatively straightforwardly with a text representation.

    The recommended text serialization format is JSON. Since the text format is not designed for performance, the format can be produced to maximize readability. This also allows nice symmetry between the construction of plans and the configuration of various extensions such as function signatures and user defined types.

To ensure the JSON is valid, the object will be defined using the OpenAPI 3.1 specification. This not only allows strong validation; the OpenAPI specification also enables code generators to be easily used to produce plans in many languages.

While JSON will be used for much of the plan serialization, Substrait uses a custom simplistic grammar for record-level expressions. While one can construct an equation such as (10 + 5)/2 using a tree of function and literal objects, it is much more human-readable to consume a plan when the information is written similarly to the way one typically consumes scalar expressions. This grammar will be maintained in an ANTLR grammar (targetable to multiple programming languages) and is also planned to be supported via a JSON schema definition format tag so that the grammar can be validated as part of the schema validation.

    "},{"location":"spec/extending/","title":"Extending","text":"

    Substrait is a community project and requires consensus about new additions to the specification in order to maintain consistency. The best way to get consensus is to discuss ideas. The main ways to communicate are:

    • Substrait Mailing List
    • Substrait Slack
    • Community Meeting
    "},{"location":"spec/extending/#minor-changes","title":"Minor changes","text":"

    Simple changes like typos and bug fixes do not require as much effort. File an issue or send a PR and we can discuss it there.

    "},{"location":"spec/extending/#complex-changes","title":"Complex changes","text":"

For complex features it is best to discuss the change first. It also helps to gather some background information to get everyone on the same page.

    "},{"location":"spec/extending/#outline-the-issue","title":"Outline the issue","text":""},{"location":"spec/extending/#language","title":"Language","text":"

Every engine has its own terminology. Every Spark user probably knows what an "attribute" is. Velox users will know what a "RowVector" means. Etc. However, Substrait is used by people that come from a variety of backgrounds and you should generally assume that its users do not know anything about your own implementation. As a result, all PRs and discussion should endeavor to use Substrait terminology wherever possible.

    "},{"location":"spec/extending/#motivation","title":"Motivation","text":"

    What problems does this relation solve? If it is a more logical relation then how does it allow users to express new capabilities? If it is more of an internal relation then how does it map to existing logical relations? How is it different than other existing relations? Why do we need this?

    "},{"location":"spec/extending/#examples","title":"Examples","text":"

Provide example input and output for the relation. Show example plans. Try to motivate your examples, as much as possible, with something that looks like a real-world problem. These will go a long way towards helping others understand the purpose of a relation.

    "},{"location":"spec/extending/#alternatives","title":"Alternatives","text":"

    Discuss what alternatives are out there. Are there other ways to achieve similar results? Do some systems handle this problem differently?

    "},{"location":"spec/extending/#survey-existing-implementation","title":"Survey existing implementation","text":"

It's unlikely that this is the first time that this has been done. Figuring out how existing systems have approached the same problem will help ground the discussion.

    "},{"location":"spec/extending/#prototype-the-feature","title":"Prototype the feature","text":"

    Novel approaches should be implemented as an extension first.

    "},{"location":"spec/extending/#substrait-design-principles","title":"Substrait design principles","text":"

Substrait is designed around interoperability, so a feature used by only a single system may not be accepted. But don't despair! Substrait has a highly developed extension system for this express purpose.

    "},{"location":"spec/extending/#you-dont-have-to-do-it-alone","title":"You don\u2019t have to do it alone","text":"

    If you are hoping to add a feature and these criteria seem intimidating then feel free to start a mailing list discussion before you have all the information and ask for help. Investigating other implementations, in particular, is something that can be quite difficult to do on your own.

    "},{"location":"spec/specification/","title":"Specification","text":""},{"location":"spec/specification/#status","title":"Status","text":"

The specification has passed the initial design phase and is now in the final stages of being fleshed out. The community is encouraged to identify (and address) any perceived gaps in functionality using GitHub issues and PRs. Once all of the planned implementations have been completed, all deprecated fields will be eliminated and version 1.0 will be released.

    "},{"location":"spec/specification/#components-complete","title":"Components (Complete)","text":"Section Description Simple Types A way to describe the set of basic types that will be operated on within a plan. Only includes simple types such as integers and doubles (nothing configurable or compound). Compound Types Expression of types that go beyond simple scalar values. Key concepts here include: configurable types such as fixed length and numeric types as well as compound types such as structs, maps, lists, etc. Type Variations Physical variations to base types. User Defined Types Extensions that can be defined for specific IR producers/consumers. Field References Expressions to identify which portions of a record should be operated on. Scalar Functions Description of how functions are specified. Concepts include arguments, variadic functions, output type derivation, etc. Scalar Function List A list of well-known canonical functions in YAML format. Specialized Record Expressions Specialized expression types that are more naturally expressed outside the function paradigm. Examples include items such as if/then/else and switch statements. Aggregate Functions Functions that are expressed in aggregation operations. Examples include things such as SUM, COUNT, etc. Operations take many records and collapse them into a single (possibly compound) value. Window Functions Functions that relate a record to a set of encompassing records. Examples in SQL include RANK, NTILE, etc. User Defined Functions Reusable named functions that are built beyond the core specification. Implementations are typically registered thorough external means (drop a file in a directory, send a special command with implementation, etc.) Embedded Functions Functions implementations embedded directly within the plan. Frequently used in data science workflows where business logic is interspersed with standard operations. Relation Basics Basic concepts around relational algebra, record emit and properties. Logical Relations Common relational operations used in compute plans including project, join, aggregation, etc. Text Serialization A human producible & consumable representation of the plan specification. Binary Serialization A high performance & compact binary representation of the plan specification."},{"location":"spec/specification/#components-designed-but-not-implemented","title":"Components (Designed but not Implemented)","text":"Section Description Table Functions Functions that convert one or more values from an input record into 0..N output records. Example include operations such as explode, pos-explode, etc. User Defined Relations Installed and reusable relational operations customized to a particular platform. Embedded Relations Relational operations where plans contain the \u201cmachine code\u201d to directly execute the necessary operations. Physical Relations Specific execution sub-variations of common relational operations that describe have multiple unique physical variants associated with a single logical operation. Examples include hash join, merge join, nested loop join, etc."},{"location":"spec/technology_principles/","title":"Technology Principles","text":"
    • Provide a good suite of well-specified common functionality in databases and data science applications.
    • Make it easy for users to privately or publicly extend the representation to support specialized/custom operations.
    • Produce something that is language agnostic and requires minimal work to start developing against in a new language.
• Drive towards a common format that avoids specialization for a single favorite producer or consumer.
• Establish a clear delineation between specifications that MUST be respected and those that can be optionally ignored.
• Establish a forgiving compatibility approach and versioning scheme that supports cross-version compatibility in the maximum number of cases.
• Minimize the need for consumer intelligence by excluding concepts like overloading, type coercion, implicit casting, field name handling, etc. (Note: this is weak and should be better stated.)
• Decomposability/severability: A particular producer or consumer should be able to produce or consume only a subset of the specification and interact well with any other Substrait system, as long as the specific operations requested fit within the subset of the specification supported by the counterpart system.
    "},{"location":"spec/versioning/","title":"Versioning","text":"

As an interface specification, the goal of Substrait is to reach a point where (breaking) changes will never need to happen again, or at least be few and far between. By analogy, Apache Arrow's in-memory format specification has stayed functionally constant, despite many major library versions being released. However, we're not there yet. When we believe that we've reached this point, we will signal this by releasing version 1.0.0. Until then, we will remain in the 0.x.x version regime.

    Despite this, we strive to maintain backward compatibility for both the binary representation and the text representation by means of deprecation. When a breaking change cannot be reasonably avoided, we may remove previously deprecated fields. All deprecated fields will be removed for the 1.0.0 release.

Substrait uses semantic versioning for its version numbers, with the addition that, during 0.x.y, we increment the x digit for breaking changes and new features, and the y digit for fixes and other nonfunctional changes. The release process is currently automated and makes a new release every week, provided something has changed on the main branch since the previous release. This release cadence will likely be slowed down as stability increases over time. Conventional commits are used to distinguish between breaking changes, new features, and fixes, and GitHub Actions are used to verify that there are indeed no breaking protobuf changes in a commit, unless the commit message declares the change to be breaking.
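As an illustration (our own example messages, not taken from the release tooling), conventional commit messages during 0.x.y would map to version bumps roughly like this:

feat!: restructure the fetch relation     -> breaking change: bumps x (0.(x+1).0)
feat: add a new logical relation          -> new feature: also bumps x during 0.x.y
fix: correct a typo in a proto comment    -> fix: bumps y (0.x.(y+1))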

    "},{"location":"tools/producer_tools/","title":"Producer Tools","text":""},{"location":"tools/producer_tools/#isthmus","title":"Isthmus","text":"

    Isthmus is an application that serializes SQL to Substrait Protobuf via the Calcite SQL compiler.

    "},{"location":"tools/substrait_validator/","title":"Substrait Validator","text":"

The Substrait Validator is a tool used to validate Substrait plans as well as print diagnostic information regarding plan validity.

    "},{"location":"tools/third_party_tools/","title":"Third Party Tools","text":""},{"location":"tools/third_party_tools/#substrait-tools","title":"Substrait-tools","text":"

The substrait-tools python package provides a command line interface for producing/consuming Substrait plans by leveraging the APIs from different producers and consumers.

    "},{"location":"tools/third_party_tools/#substrait-fiddle","title":"Substrait Fiddle","text":"

    Substrait Fiddle is an online tool to share, debug, and prototype Substrait plans.

The Substrait Fiddle source is available, allowing it to be run in any environment.

    "},{"location":"tutorial/sql_to_substrait/","title":"SQL to Substrait tutorial","text":"

    This is an introductory tutorial to learn the basics of Substrait for readers already familiar with SQL. We will look at how to construct a Substrait plan from an example query.

We'll present the Substrait plan in JSON form to make it relatively readable to newcomers. Typically Substrait is exchanged as a protobuf message, but for debugging purposes it is often helpful to look at a serialized form. Plus, it's not uncommon for unit tests to represent plans as JSON strings. So if you are developing with Substrait, it's useful to have experience reading them.

    Note

Substrait is currently only defined with Protobuf. The JSON provided here is the Protobuf JSON output, but it is not the official Substrait text format. Eventually, Substrait will define its own human-readable text format, but for now this tutorial will make do with what Protobuf provides.
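As a sketch of how such JSON is produced and consumed, the standard Protobuf JSON utilities can round-trip a plan message. The import path below is an assumption; the actual module name depends on how Substrait's plan.proto was compiled for your language (the substrait-python package is one option):

```python
from google.protobuf import json_format

# Assumed binding: module path depends on how plan.proto was compiled.
from substrait.gen.proto.plan_pb2 import Plan

plan = Plan()                                 # an empty plan, just to show the round trip
as_json = json_format.MessageToJson(plan)     # Protobuf JSON, like the snippets below
parsed = json_format.Parse(as_json, Plan())   # and back into a protobuf message
```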

Substrait is designed to communicate plans (mostly logical plans). Those plans contain types, schemas, expressions, extensions, and relations. We'll look at them in that order, going from simplest to most complex until we can construct full plans.

This tutorial won't cover all the details of each piece, but it will give you an idea of how they connect together. For a detailed reference of each individual field, the best place to look is the protobuf definitions. They represent the source-of-truth of the spec and are well-commented to address ambiguities.

    "},{"location":"tutorial/sql_to_substrait/#problem-set-up","title":"Problem Set up","text":"

To learn Substrait, we'll build up to a specific query. We'll be using the tables:

CREATE TABLE orders (
  product_id: i64 NOT NULL,
  quantity: i32 NOT NULL,
  order_date: date NOT NULL,
  price: decimal(10, 2)
);

CREATE TABLE products (
  product_id: i64 NOT NULL,
  categories: list<string NOT NULL> NOT NULL,
  details: struct<manufacturer: string, year_created: i32>,
  product_name: string
);

    This orders table represents events where products were sold, recording how many (quantity) and at what price (price). The products table provides details for each product, with product_id as the primary key.

And we'll try to create the query:

SELECT
  product_name,
  product_id,
  sum(quantity * price) as sales
FROM
  orders
INNER JOIN
  products
ON
  orders.product_id = products.product_id
WHERE
  -- categories does not contain "Computers"
  INDEX_IN("Computers", categories) IS NULL
GROUP BY
  product_name,
  product_id

The query asks the question: For products that aren't in the "Computers" category, how much has each product generated in sales?

However, Substrait doesn't correspond to SQL as much as it does to logical plans. So to be less ambiguous, the plan we are aiming for looks like:

|-+ Aggregate({sales = sum(quantity * price)}, group_by=(product_name, product_id))
  |-+ InnerJoin(on=orders.product_id = products.product_id)
    |- ReadTable(orders)
    |-+ Filter(INDEX_IN("Computers", categories) IS NULL)
      |- ReadTable(products)
    "},{"location":"tutorial/sql_to_substrait/#types-and-schemas","title":"Types and Schemas","text":"

As part of the Substrait plan, we'll need to embed the data types of the input tables. In Substrait, each type is a distinct message, which at a minimum contains a field for nullability. For example, a string field looks like:

{
  "string": {
    "nullability": "NULLABILITY_NULLABLE"
  }
}

Nullability is an enum, not a boolean, since Substrait allows NULLABILITY_UNSPECIFIED as an option, in addition to NULLABILITY_NULLABLE (nullable) and NULLABILITY_REQUIRED (not nullable).
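For instance, a non-nullable i64, as used for the product_id columns in our tables, looks like:

{
  "i64": {
    "nullability": "NULLABILITY_REQUIRED"
  }
}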

    Other types such as VarChar and Decimal have other parameters. For example, our orders.price column will be represented as:

    {\n  \"decimal\": {\n    \"precision\": 10,\n    \"scale\": 2,\n    \"nullability\": \"NULLABILITY_NULLABLE\"\n  }\n}\n

Finally, there are nested compound types such as structs and list types that have other types as parameters. For example, the products.categories column is a list of strings, so it can be represented as:

{
  "list": {
    "type": {
      "string": {
        "nullability": "NULLABILITY_REQUIRED"
      }
    },
    "nullability": "NULLABILITY_REQUIRED"
  }
}

    To know what parameters each type can take, refer to the Protobuf definitions in type.proto.

    Schemas of tables can be represented with a NamedStruct message, which is the combination of a struct type containing all the columns and a list of column names. For the orders table, this will look like:

    {\n  \"names\": [\n    \"product_id\",\n    \"quantity\",\n    \"order_date\",\n    \"price\"\n  ],\n  \"struct\": {\n    \"types\": [\n      {\n        \"i64\": {\n          \"nullability\": \"NULLABILITY_REQUIRED\"\n        }\n      },\n      {\n        \"i32\": {\n          \"nullability\": \"NULLABILITY_REQUIRED\"\n        }\n      },\n      {\n        \"date\": {\n          \"nullability\": \"NULLABILITY_REQUIRED\"\n        }\n      },\n      {\n        \"decimal\": {\n          \"precision\": 10,\n          \"scale\": 2,\n          \"nullability\": \"NULLABILITY_NULLABLE\"\n        }\n      }\n    ],\n    \"nullability\": \"NULLABILITY_REQUIRED\"\n  }\n}\n

Here, names is the names of all fields. In nested schemas, this includes the names of subfields in depth-first order. So for the products table, the details struct field will be included as well as the two subfields (manufacturer and year_created) right after. And because it's depth first, these subfields appear before product_name. The full schema looks like:

{
  "names": [
    "product_id",
    "categories",
    "details",
    "manufacturer",
    "year_created",
    "product_name"
  ],
  "struct": {
    "types": [
      {
        "i64": {
          "nullability": "NULLABILITY_REQUIRED"
        }
      },
      {
        "list": {
          "type": {
            "string": {
              "nullability": "NULLABILITY_REQUIRED"
            }
          },
          "nullability": "NULLABILITY_REQUIRED"
        }
      },
      {
        "struct": {
          "types": [
            {
              "string": {
                "nullability": "NULLABILITY_NULLABLE"
              }
            },
            {
              "i32": {
                "nullability": "NULLABILITY_NULLABLE"
              }
            }
          ],
          "nullability": "NULLABILITY_NULLABLE"
        }
      },
      {
        "string": {
          "nullability": "NULLABILITY_NULLABLE"
        }
      }
    ],
    "nullability": "NULLABILITY_REQUIRED"
  }
}
    "},{"location":"tutorial/sql_to_substrait/#expressions","title":"Expressions","text":"

    The next basic building block we will need is expressions. Expressions can be one of several things, including:

    • Field references
    • Literal values
    • Functions
    • Subqueries
    • Window Functions

    Since some expressions such as functions can contain other expressions, expressions can be represented as a tree. Literal values and field references typically are the leaf nodes.

For the expression INDEX_IN("Computers", categories) IS NULL, we have a field reference categories, a literal string "Computers", and two functions: INDEX_IN and IS NULL.

    The field reference for categories is represented by:

    {\n  \"selection\": {\n    \"directReference\": {\n      \"structField\": {\n        \"field\": 1\n      }\n    },\n    \"rootReference\": {}\n  }\n}\n

Whereas SQL references fields by name, Substrait always references fields numerically. This means that a Substrait expression only makes sense relative to a certain schema. As we'll see later when we discuss relations, for a filter relation this will be relative to the input schema, so the 1 here is referring to the second field of products.

    Note

Protobuf may not serialize fields with integer type and value 0, since 0 is the default. So if you instead saw "structField": {}, know that it is equivalent to "structField": { "field": 0 }.
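Nested fields are referenced by chaining reference segments. As a sketch (not needed for our query): referencing products.details.manufacturer would combine the index of details (2) with the index of manufacturer within it (0, shown explicitly here even though Protobuf JSON may omit it):

{
  "selection": {
    "directReference": {
      "structField": {
        "field": 2,
        "child": {
          "structField": {
            "field": 0
          }
        }
      }
    },
    "rootReference": {}
  }
}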

    \"Computers\" will be translated to a literal expression:

    {\n  \"literal\": {\n    \"string\": \"Computers\"\n  }\n}\n

Both IS NULL and INDEX_IN will be scalar function expressions. Available functions in Substrait are defined in extension YAML files contained in https://github.com/substrait-io/substrait/tree/main/extensions. Additional extensions may be created elsewhere. IS NULL is defined as the is_null function in functions_comparison.yaml and INDEX_IN is defined as the index_in function in functions_set.yaml.

    First, the expression for INDEX_IN(\"Computers\", categories) is:

    {\n  \"scalarFunction\": {\n    \"functionReference\": 1,\n    \"outputType\": {\n      \"i64\": {\n        \"nullability\": \"NULLABILITY_NULLABLE\"\n      }\n    },\n    \"arguments\": [\n      {\n        \"value\": {\n          \"literal\": {\n            \"string\": \"Computers\"\n          }\n        }\n      },\n      {\n        \"value\": {\n          \"selection\": {\n            \"directReference\": {\n              \"structField\": {\n                \"field\": 1\n              }\n            },\n            \"rootReference\": {}\n          }\n        }\n      }\n    ]\n  }\n}\n

functionReference will be explained later in the plans section. For now, understand that it's an ID that corresponds to an entry in a list of function definitions that we will create later.

    outputType defines the type the function outputs. We know this is a nullable i64 type since that is what the function definition declares in the YAML file.

arguments defines the arguments being passed into the function, which are bound positionally based on the function definition in the YAML file. The two arguments will be familiar as the literal and the field reference we constructed earlier.

    To create the final expression, we just need to wrap this in another scalar function expression for IS NULL.

    {\n  \"scalarFunction\": {\n    \"functionReference\": 2,\n    \"outputType\": {\n      \"bool\": {\n        \"nullability\": \"NULLABILITY_REQUIRED\"\n      }\n    },\n    \"arguments\": [\n      {\n        \"value\": {\n          \"scalarFunction\": {\n            \"functionReference\": 1,\n            \"outputType\": {\n              \"i64\": {\n                \"nullability\": \"NULLABILITY_NULLABLE\"\n              }\n            },\n            \"arguments\": [\n              {\n                \"value\": {\n                  \"literal\": {\n                    \"string\": \"Computers\"\n                  }\n                }\n              },\n              {\n                \"value\": {\n                  \"selection\": {\n                    \"directReference\": {\n                      \"structField\": {\n                        \"field\": 1\n                      }\n                    },\n                    \"rootReference\": {}\n                  }\n                }\n              }\n            ]\n          }\n        }\n      }\n    ]\n  }\n}\n

    To see what other types of expressions are available and what fields they take, see the Expression proto definition in algebra.proto.

    "},{"location":"tutorial/sql_to_substrait/#relations","title":"Relations","text":"

    In most SQL engines, a logical or physical plan is represented as a tree of nodes, such as filter, project, scan, or join. The left diagram below may be a familiar representation of our plan, where nodes feed data into each other moving from left to right. In Substrait, each of these nodes is a Relation.

    A relation that takes another relation as input will contain (or refer to) that relation. This is usually a field called input, but sometimes different names are used in relations that take multiple inputs. For example, join relations take two inputs, with field names left and right. In JSON, the rough layout for the relations in our plan will look like:

    {\n    \"aggregate\": {\n        \"input\": {\n            \"join\": {\n                \"left\": {\n                    \"filter\": {\n                        \"input\": {\n                            \"read\": {\n                                ...\n                            }\n                        },\n                        ...\n                    }\n                },\n                \"right\": {\n                    \"read\": {\n                        ...\n                    }\n                },\n                ...\n            }\n        },\n        ...\n    }\n}\n

For our plan, we need to define the read relations for each table, a filter relation to exclude the "Computers" category from the products table, a join relation to perform the inner join, and finally an aggregate relation to compute the total sales.

The read relations are composed of a baseSchema and a namedTable field. The type of read is a named table, so the namedTable field is present with names containing the list of name segments (my_database.my_table). Other types of reads include virtual tables (a table of literal values embedded in the plan) and a list of files. See Read Definition Types for more details. The baseSchema is the schema we defined earlier and namedTable just contains the name of the table. So for reading the orders table, the relation looks like:

    {\n  \"read\": {\n    \"namedTable\": {\n      \"names\": [\n        \"orders\"\n      ]\n    },\n    \"baseSchema\": {\n      \"names\": [\n        \"product_id\",\n        \"quantity\",\n        \"order_date\",\n        \"price\"\n      ],\n      \"struct\": {\n        \"types\": [\n          {\n            \"i64\": {\n              \"nullability\": \"NULLABILITY_REQUIRED\"\n            }\n          },\n          {\n            \"i32\": {\n              \"nullability\": \"NULLABILITY_REQUIRED\"\n            }\n          },\n          {\n            \"date\": {\n              \"nullability\": \"NULLABILITY_REQUIRED\"\n            }\n          },\n          {\n            \"decimal\": {\n              \"scale\": 10,\n              \"precision\": 2,\n              \"nullability\": \"NULLABILITY_NULLABLE\"\n            }\n          }\n        ],\n        \"nullability\": \"NULLABILITY_REQUIRED\"\n      }\n    }\n  }\n}\n

Read relations are leaf nodes. Leaf nodes don't depend on any other node for data and usually represent a source of data in our plan. Leaf nodes are then typically used as input for other nodes that manipulate the data. For example, our filter node will take the products read relation as an input.

    The filter node will also take a condition field, which will just be the expression we constructed earlier.

    {\n  \"filter\": {\n    \"input\": {\n      \"read\": { ... }\n    },\n    \"condition\": {\n      \"scalarFunction\": {\n        \"functionReference\": 2,\n        \"outputType\": {\n          \"bool\": {\n            \"nullability\": \"NULLABILITY_REQUIRED\"\n          }\n        },\n        \"arguments\": [\n          {\n            \"value\": {\n              \"scalarFunction\": {\n                \"functionReference\": 1,\n                \"outputType\": {\n                  \"i64\": {\n                    \"nullability\": \"NULLABILITY_NULLABLE\"\n                  }\n                },\n                \"arguments\": [\n                  {\n                    \"value\": {\n                      \"literal\": {\n                        \"string\": \"Computers\"\n                      }\n                    }\n                  },\n                  {\n                    \"value\": {\n                      \"selection\": {\n                        \"directReference\": {\n                          \"structField\": {\n                            \"field\": 1\n                          }\n                        },\n                        \"rootReference\": {}\n                      }\n                    }\n                  }\n                ]\n              }\n            }\n          }\n        ]\n      }\n    }\n  }\n}\n

The join relation will take two inputs. In the left field will be the read relation for orders and in the right field will be the filter relation (from products). The type field is an enum that allows us to specify we want an inner join. Finally, the expression field contains the expression to use in the join. Since we haven't used the equals() function yet, we use the reference number 3 here. (Again, we'll see at the end with plans how these functions are resolved.) The arguments refer to fields 0 and 4, which are indices into the combined schema formed from the left and right inputs. We'll discuss later in Field Indices where these come from.

    {\n  \"join\": {\n    \"left\": { ... },\n    \"right\": { ... },\n    \"type\": \"JOIN_TYPE_INNER\",\n    \"expression\": {\n      \"scalarFunction\": {\n        \"functionReference\": 3,\n        \"outputType\": {\n          \"bool\": {\n            \"nullability\": \"NULLABILITY_NULLABLE\"\n          }\n        },\n        \"arguments\": [\n          {\n            \"value\": {\n              \"selection\": {\n                \"directReference\": {\n                  \"structField\": {\n                    \"field\": 0\n                  }\n                },\n                \"rootReference\": {}\n              }\n            }\n          },\n          {\n            \"value\": {\n              \"selection\": {\n                \"directReference\": {\n                  \"structField\": {\n                    \"field\": 4\n                  }\n                },\n                \"rootReference\": {}\n              }\n            }\n          }\n        ]\n      }\n    }\n  }\n}\n

The final aggregation requires two things, other than the input. First is the groupings. We'll use a single grouping expression containing the references to the fields product_name and product_id. (Multiple grouping expressions can be used to do cube aggregations.)

For measures, we'll need to define sum(quantity * price) as sales. Substrait is stricter about data types, and quantity is an integer while price is a decimal. So we'll first need to cast quantity to a decimal, making the Substrait expression more like sum(multiply(cast(decimal(10, 2), quantity), price)). Both sum() and multiply() are functions, defined in functions_arithmetic_decimal.yaml. However cast() is a special expression type in Substrait, rather than a function.

Finally, the naming with as sales will be handled at the end as part of the plan, so that's not part of the relation. Since we are always using field indices to refer to fields, Substrait doesn't record any intermediate field names.

    {\n  \"aggregate\": {\n    \"input\": { ... },\n    \"groupings\": [\n      {\n        \"groupingExpressions\": [\n          {\n            \"value\": {\n              \"selection\": {\n                \"directReference\": {\n                  \"structField\": {\n                    \"field\": 0\n                  }\n                },\n                \"rootReference\": {}\n              }\n            }\n          },\n          {\n            \"value\": {\n              \"selection\": {\n                \"directReference\": {\n                  \"structField\": {\n                    \"field\": 7\n                  }\n                },\n                \"rootReference\": {}\n              }\n            }\n          },\n        ]\n      }\n    ],\n    \"measures\": [\n      {\n        \"measure\": {\n          \"functionReference\": 4,\n          \"outputType\": {\n            \"decimal\": {\n              \"precision\": 38,\n              \"scale\": 2,\n              \"nullability\": \"NULLABILITY_NULLABLE\"\n            }\n          },\n          \"arguments\": [\n            {\n              \"value\": {\n                \"scalarFunction\": {\n                  \"functionReference\": 5,\n                  \"outputType\": {\n                    \"decimal\": {\n                      \"precision\": 38,\n                      \"scale\": 2,\n                      \"nullability\": \"NULLABILITY_NULLABLE\"\n                    }\n                  },\n                  \"arguments\": [\n                    {\n                      \"value\": {\n                        \"cast\": {\n                          \"type\": {\n                            \"decimal\": {\n                              \"precision\": 10,\n                              \"scale\": 2,\n                              \"nullability\": \"NULLABILITY_REQUIRED\"\n                            }\n                          },\n                          \"input\": {\n                            \"selection\": {\n                              \"directReference\": {\n                                \"structField\": {\n                                  \"field\": 1\n                                }\n                              },\n                              \"rootReference\": {}\n                            }\n                          }\n                        }\n                      }\n                    },\n                    {\n                      \"value\": {\n                        \"selection\": {\n                          \"directReference\": {\n                            \"structField\": {\n                              \"field\": 3\n                            }\n                          },\n                          \"rootReference\": {}\n                        }\n                      }\n                    }\n                  ]\n                }\n              }\n            }\n          ]\n        }\n      }\n    ]\n  }\n}\n
    "},{"location":"tutorial/sql_to_substrait/#field-indices","title":"Field indices","text":"

So far, we have glossed over the field indices. Now that we've built up each of the relations, it will be a bit easier to explain them.

Throughout the plan, data always has some implicit schema, which is modified by each relation. Often, the schema can change within a relation; we'll discuss an example in the next section. Each relation has its own rules for how schemas are modified, called the output order or emit order. For the purposes of our query, the relevant rules are:

    • For Read relations, their output schema is the schema of the table.
    • For Filter relations, the output schema is the same as in the input schema.
• For Join relations, the input schema is the concatenation of the left and then the right schemas. The output schema is the same (see the worked example after this list).
    • For Aggregate relations, the output schema is the group by fields followed by the measures.
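As a worked example (ours, derived from the rules above): the input schema of the join, and therefore of the aggregate's input, is the concatenation of the orders and products schemas:

0: product_id   (from orders)
1: quantity     (from orders)
2: order_date   (from orders)
3: price        (from orders)
4: product_id   (from products)
5: categories   (from products)
6: details      (from products)
7: product_name (from products)

Note that details counts as a single top-level field here; its subfields are addressed through it rather than occupying their own top-level indices. This is why the join condition compares fields 0 and 4 (the two product_id columns), and why the aggregate groups on fields 0 and 7 (product_id and product_name).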

    Note

    Sometimes it can be hard to tell what the implicit schema is. For help determining that, consider using the substrait-validator tool, described in Next Steps.

The diagram below shows the mapping of field indices within each relation and how each of the field references shows up in each relation's properties.

    "},{"location":"tutorial/sql_to_substrait/#column-selection-and-emit","title":"Column selection and emit","text":"

    As written, the aggregate output schema will be:

0: product_id: i64
1: product_name: string
2: sales: decimal(38, 2)

    But we want product_name to come before product_id in our output. How do we reorder those columns?

    You might be tempted to add a Project relation at the end. However, the project relation only adds columns; it is not responsible for subsetting or reordering columns.

Instead, any relation can reorder or subset columns through the emit property. By default, it is set to direct, which outputs all columns "as is". But it can also be specified as a sequence of field indices.

For simplicity, we will add this to the final aggregate relation. We could also add it to all relations, only selecting the fields we strictly need in later relations. Indeed, a good optimizer would probably do that to our plan. And for some engines, the emit property is only valid within a project relation, so in those cases we would need to add that relation in combination with emit. But to keep things simple, we'll limit the columns at the end within the aggregation relation.

For our final column selection, we'll modify the top-level relation to be:

{
  "aggregate": {
    "input": { ... },
    "groupings": [ ... ],
    "measures": [ ... ],
    "common": {
      "emit": {
        "outputMapping": [1, 0, 2]
      }
    }
  }
}
    "},{"location":"tutorial/sql_to_substrait/#plans","title":"Plans","text":"

Now that we've constructed our relations, we can put it all into a plan. Substrait plans are the only messages that can be sent and received on their own. Recall that earlier, we had function references to those YAML files, but so far there's been no place to tell a consumer what those function reference IDs mean or which extensions we are using. That information belongs at the plan level.

The overall layout for a plan is:

{
  "extensionUris": [ ... ],
  "extensions": [ ... ],
  "relations": [
    {
      "root": {
        "names": [
          "product_name",
          "product_id",
          "sales"
        ],
        "input": { ... }
      }
    }
  ]
}

    The relations field is a list of Root relations. Most queries only have one root relation, but the spec allows for multiple so a common plan could be referenced by other plans, sort of like a CTE (Common Table Expression) from SQL. The root relation provides the final column names for our query. The input to this relation is our aggregate relation (which contains all the other relations as children).

    For extensions, we need to provide extensionUris with the locations of the YAML files we used and extensions with the list of functions we used and which extension they come from.

    In our query, we used:

    • index_in (1), from functions_set.yaml,
    • is_null (2), from functions_comparison.yaml,
    • equal (3), from functions_comparison.yaml,
    • sum (4), from functions_arithmetic_decimal.yaml,
    • multiply (5), from functions_arithmetic_decimal.yaml.

So first we can create the three extension URIs:

[
  {
    "extensionUriAnchor": 1,
    "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_set.yaml"
  },
  {
    "extensionUriAnchor": 2,
    "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml"
  },
  {
    "extensionUriAnchor": 3,
    "uri": "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic_decimal.yaml"
  }
]

    Then we can create the extensions:

    [\n  {\n    \"extensionFunction\": {\n      \"extensionUriReference\": 1,\n      \"functionAnchor\": 1,\n      \"name\": \"index_in\"\n    }\n  },\n  {\n    \"extensionFunction\": {\n      \"extensionUriReference\": 2,\n      \"functionAnchor\": 2,\n      \"name\": \"is_null\"\n    }\n  },\n  {\n    \"extensionFunction\": {\n      \"extensionUriReference\": 2,\n      \"functionAnchor\": 3,\n      \"name\": \"equal\"\n    }\n  },\n  {\n    \"extensionFunction\": {\n      \"extensionUriReference\": 3,\n      \"functionAnchor\": 4,\n      \"name\": \"sum\"\n    }\n  },\n  {\n    \"extensionFunction\": {\n      \"extensionUriReference\": 3,\n      \"functionAnchor\": 5,\n      \"name\": \"multiply\"\n    }\n  }\n]\n

Once we've added our extensions, the plan is complete. Our plan, outputted in full, is: final_plan.json.

    "},{"location":"tutorial/sql_to_substrait/#next-steps","title":"Next steps","text":"

    Validate and introspect plans using substrait-validator. Amongst other things, this tool can show what the current schema and column indices are at each point in the plan. Try downloading the final plan JSON above and generating an HTML report on the plan with:

substrait-validator final_plan.json --out-file output.html
    "},{"location":"types/named_structs/","title":"Named Structs","text":"

A Named Struct is a special type construct that combines:

• A Struct type
• A list of names for the fields in the Struct, in depth-first search order

The depth-first search order for names arises from the ability to nest Structs within other types. All Struct fields must be named, even nested fields.
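As an illustration of the rule, here is a minimal sketch (our own toy model, not part of the spec) that computes how many names a type consumes, following the depth-first convention described above:

```python
# Toy model: a type is a leaf string like "i64", or a tuple:
# ("struct", [field_types]), ("list", elem_type), ("map", key_type, value_type).
def names_needed(t):
    if isinstance(t, str):
        return 0  # leaf types consume no names of their own
    kind = t[0]
    if kind == "struct":
        # one name per field, plus names for anything nested inside each field
        return sum(1 + names_needed(f) for f in t[1])
    if kind == "list":
        return names_needed(t[1])
    if kind == "map":
        # keys are named before values
        return names_needed(t[1]) + names_needed(t[2])
    raise ValueError(f"unknown type kind: {kind}")

# struct<i64, list<struct<i64, i64>>> needs 4 names: one for each of the two
# outer fields, plus two for the struct nested inside the list.
assert names_needed(("struct", ["i64", ("list", ("struct", ["i64", "i64"]))])) == 4
```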

    Named Structs are most commonly used to model the schema of Read relations.

    "},{"location":"types/named_structs/#determining-names","title":"Determining Names","text":"

    When producing/consuming names for a NamedStruct, some types require special handling:

    "},{"location":"types/named_structs/#struct","title":"Struct","text":"

    A struct has names for each of its inner fields.

    For example, the following Struct

struct<i64, i64>
   ↑    ↑
   a    b

has 2 names, one for each of its inner fields.

    "},{"location":"types/named_structs/#structs-within-compound-types","title":"Structs within Compound Types","text":"

Struct types nested in compound types must also be named.

    "},{"location":"types/named_structs/#structs-within-maps","title":"Structs within Maps","text":"

    If a Map contains Structs, either as keys or values or both, the Struct fields must be named. Keys are named before values. For example the following Map

map<struct<i64, i64>, struct<i64, i64, i64>>
       ↑    ↑            ↑    ↑    ↑
       a    b            c    d    e

has 5 named fields:
• 2 names [a, b] for the struct fields used as a key
• 3 names [c, d, e] for the struct fields used as a value

    "},{"location":"types/named_structs/#structs-within-list","title":"Structs within List","text":"

    If a List contains Structs, the Struct fields must be named. For example the following List

list<struct<i64, i64>>
        ↑    ↑
        a    b

has 2 named fields [a, b] for the struct fields.

    "},{"location":"types/named_structs/#structs-within-struct","title":"Structs within Struct","text":"

    Structs can also be embedded within Structs.

    A Struct like

struct<struct<i64, i64>, struct<i64, i64, i64>>
   ↑      ↑    ↑     ↑      ↑    ↑    ↑
   a      b    c     d      e    f    g

has 7 names:
• 1 name [a] for the 1st nested struct field
• 2 names [b, c] for the fields within the 1st nested struct
• 1 name [d] for the 2nd nested struct field
• 3 names [e, f, g] for the fields within the 2nd nested struct

    "},{"location":"types/named_structs/#putting-it-all-together","title":"Putting It All Together","text":""},{"location":"types/named_structs/#simple-named-struct","title":"Simple Named Struct","text":"
NamedStruct {
    names: [a, b, c, d]
    struct: struct<i64, list<i64>, map<i64, i64>, i64>
                   ↑    ↑          ↑              ↑
                   a    b          c              d
}
    "},{"location":"types/named_structs/#structs-in-compound-types","title":"Structs in Compound Types","text":"
NamedStruct {
    names: [a, b, c, d, e, f, g, h]
    struct: struct<i64, list<struct<i64, i64>>, map<i64, struct<i64, i64>>, i64>
                   ↑    ↑          ↑     ↑      ↑               ↑    ↑      ↑
                   a    b          c     d      e               f    g      h
}
    "},{"location":"types/named_structs/#structs-in-structs","title":"Structs in Structs","text":"
NamedStruct {
    names: [a, b, c, d, e, f, g, h, i, j]
    struct: struct<i64, struct<i64, struct<i64, i64>, i64, struct<i64, i64>>>
                   ↑    ↑      ↑    ↑      ↑    ↑     ↑    ↑      ↑    ↑
                   a    b      c    d      e    f     g    h      i    j
}
    "},{"location":"types/type_classes/","title":"Type Classes","text":"

In Substrait, the "class" of a type, not to be confused with the concept from object-oriented programming, defines the set of non-null values that instances of a type may assume.

    Implementations of a Substrait type must support at least this set of values, but may include more; for example, an i8 could be represented using the same in-memory format as an i32, as long as functions operating on i8 values within [-128..127] behave as specified (in this case, this means 8-bit overflow must work as expected). Operating on values outside the specified range is unspecified behavior.
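As an illustration of this point (ours, not from the spec), an implementation might keep i8 values in a wider native integer while making in-range arithmetic wrap as 8-bit two's complement:

```python
# Values live in ordinary Python ints (arbitrary width), but addition is
# wrapped to 8-bit two's complement so behavior within [-128..127] matches
# what the i8 class specifies.
def add_i8(a: int, b: int) -> int:
    return (a + b + 128) % 256 - 128

assert add_i8(100, 27) == 127    # in range: ordinary addition
assert add_i8(127, 1) == -128    # 8-bit overflow wraps as expected
```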

    "},{"location":"types/type_classes/#simple-types","title":"Simple Types","text":"

Simple type classes are those that don't support any form of configuration. For simplicity, any generic type that has only a small number of discrete implementations is declared directly, as opposed to via configuration.

• boolean: A value that is either True or False. Literal representation: bool.
• i8: A signed integer within [-128..127], typically represented as an 8-bit two's complement number. Literal representation: int32.
• i16: A signed integer within [-32,768..32,767], typically represented as a 16-bit two's complement number. Literal representation: int32.
• i32: A signed integer within [-2,147,483,648..2,147,483,647], typically represented as a 32-bit two's complement number. Literal representation: int32.
• i64: A signed integer within [-9,223,372,036,854,775,808..9,223,372,036,854,775,807], typically represented as a 64-bit two's complement number. Literal representation: int64.
• fp32: A 4-byte single-precision floating point number with the same range and precision as defined for the IEEE 754 32-bit floating-point format. Literal representation: float.
• fp64: An 8-byte double-precision floating point number with the same range and precision as defined for the IEEE 754 64-bit floating-point format. Literal representation: double.
• string: A unicode string of text, [0..2,147,483,647] UTF-8 bytes in length. Literal representation: string.
• binary: A binary value, [0..2,147,483,647] bytes in length. Literal representation: binary.
• timestamp: A naive timestamp with microsecond precision. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. Literal representation: int64 microseconds since 1970-01-01 00:00:00.000000 (in an unspecified timezone).
• timestamp_tz: A timezone-aware timestamp with microsecond precision. Similar to aware datetime in Python. Literal representation: int64 microseconds since 1970-01-01 00:00:00.000000 UTC.
• date: A date within [1000-01-01..9999-12-31]. Literal representation: int32 days since 1970-01-01.
• time: A time since the beginning of any day. Range of [0..86,399,999,999] microseconds; leap seconds need not be supported. Literal representation: int64 microseconds past midnight.
• interval_year: Interval year to month. Supports a range of [-10,000..10,000] years with month precision (= [-120,000..120,000] months). Usually stored as separate integers for years and months, but only the total number of months is significant, i.e. 1y 0m is considered equal to 0y 12m or 1001y -12000m. Literal representation: int32 years and int32 months, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. -10000y 200000m is not allowed).
• uuid: A universally-unique identifier composed of 128 bits. Typically presented to users in the following hexadecimal format: c48ffa9e-64f4-44cb-ae47-152b4e60e77b. Any 128-bit value is allowed, without specific adherence to RFC4122. Literal representation: 16-byte binary.
"},{"location":"types/type_classes/#compound-types","title":"Compound Types","text":"

    Compound type classes are type classes that need to be configured by means of a parameter pack.

• FIXEDCHAR<L>: A fixed-length unicode string of L characters. L must be within [1..2,147,483,647]. Literal representation: L-character string.
• VARCHAR<L>: A unicode string of at most L characters. L must be within [1..2,147,483,647]. Literal representation: string with at most L characters.
• FIXEDBINARY<L>: A binary string of L bytes. When casting, values shorter than L are padded with zeros, and values longer than L are right-trimmed. Literal representation: L-byte bytes.
• DECIMAL<P, S>: A fixed-precision decimal value having precision (P, number of digits) <= 38 and scale (S, number of fractional digits) 0 <= S <= P. Literal representation: 16-byte bytes representing a little-endian 128-bit integer, to be divided by 10^S to get the decimal value.
• STRUCT<T1,...,Tn>: A list of types in a defined order. Literal representation: repeated Literal, types matching T1..Tn.
• NSTRUCT<N:T1,...,N:Tn>: Pseudo-type: A struct that maps unique names to value types. Each name is a UTF-8-encoded string. Each value can have a distinct type. Note that NSTRUCT is actually a pseudo-type, because Substrait's core type system is based entirely on ordinal positions, not named fields. Nonetheless, when working with systems outside Substrait, names are important. Literal representation: n/a.
• LIST<T>: A list of values of type T. The list can be between [0..2,147,483,647] values in length. Literal representation: repeated Literal, all types matching T.
• MAP<K, V>: An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. Literal representation: repeated KeyValue (in turn two Literals), all key types matching K and all value types matching V.
• PRECISIONTIMESTAMP<P>: A timestamp with fractional second precision (P, number of digits) 0 <= P <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. Literal representation: int64 seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone).
• PRECISIONTIMESTAMPTZ<P>: A timezone-aware timestamp, with fractional second precision (P, number of digits) 0 <= P <= 9. Similar to aware datetime in Python. Literal representation: int64 seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 UTC.
• INTERVAL_DAY<P>: Interval day to second. Supports a range of [-3,650,000..3,650,000] days with fractional second precision (P, number of digits) 0 <= P <= 9. Usually stored as separate integers for various components, but only the total number of fractional seconds is significant, i.e. 1d 0s is considered equal to 0d 86400s. Literal representation: int32 days, int32 seconds, and int64 fractional seconds, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. 3650001d -86400s 0us is not allowed).
• INTERVAL_COMPOUND<P>: A compound interval type composed of the underlying elements and rules of both interval_month and interval_day to express arbitrary durations across multiple grains. Substrait gives no definition for the conversion of values between independent grains (e.g. months to days).
"},{"location":"types/type_classes/#user-defined-types","title":"User-Defined Types","text":"

    User-defined type classes are defined as part of simple extensions. An extension can declare an arbitrary number of user-defined extension types. Once a type has been declared, it can be used in function declarations.

    For example, the following declares a type named point (namespaced to the associated YAML file) and two scalar functions that operate on it.

    types:\n  - name: \"point\"\n\nscalar_functions:\n  - name: \"lat\"\n    impls:\n      - args:\n        - name: p\n        - value: u!point\n    return: fp64\n  - name: \"lon\"\n    impls:\n      - args:\n        - name: p\n        - value: u!point\n    return: fp64\n
    "},{"location":"types/type_classes/#handling-user-defined-types","title":"Handling User-Defined Types","text":"

Systems without support for a specific user-defined type:

• Cannot generate values of the type.
• Cannot implement functions operating on the type.
• May support consuming and emitting values of the type without modifying them.

    "},{"location":"types/type_classes/#communicating-user-defined-types","title":"Communicating User-Defined Types","text":"

    Specifiers of user-defined types may provide additional structure information for the type to assist in communicating values of the type to and from systems without built-in support.

    For example, the following declares a point type with two i32 values named longitude and latitude:

types:
  - name: point
    structure:
      longitude: i32
      latitude: i32

    The name-type object notation used above is syntactic sugar for NSTRUCT<longitude: i32, latitude: i32>. The following means the same thing:

name: point
structure: "NSTRUCT<longitude: i32, latitude: i32>"

The structure field of a type is only intended to inform systems that don't have built-in support for the type about how they can create and transfer values of that type to systems that do support the type.

    The structure field does not restrict or bind the internal representation of the type in any system.

As such, it's currently not possible to "unpack" a user-defined type into its structure type or components thereof using FieldReferences or any other specialized record expression; if support for this is desired for a particular type, this can be accomplished with an extension function.

    "},{"location":"types/type_classes/#literals","title":"Literals","text":"

Literals for user-defined types can be represented in one of two ways:

• Using protobuf Any messages.
• Using the structure representation of the type.

    "},{"location":"types/type_classes/#compound-user-defined-types","title":"Compound User-Defined Types","text":"

User-defined types may be turned into compound types by requiring parameters to be passed to them. The supported "meta-types" for parameters are data types (like those used in LIST, MAP, and STRUCT), booleans, integers, enumerations, and strings. Using parameters, we could redefine "point" with different types of coordinates. For example:

name: point
parameters:
  - name: T
    description: |
      The type used for the longitude and latitude
      components of the point.
    type: dataType

    or:

name: point
parameters:
  - name: coordinate_type
    type: enumeration
    options:
      - integer
      - double

    or:

name: point
parameters:
  - name: LONG
    type: dataType
  - name: LAT
    type: dataType

We can't specify the internal structure in this case, because there is currently no support for derived types in the structure.

    The allowed range can be limited for integer parameters. For example:

name: vector
parameters:
  - name: T
    type: dataType
  - name: dimensions
    type: integer
    min: 2
    max: 3

This specifies a vector that can be either 2- or 3-dimensional. Note however that it's not currently possible to put constraints on data type, string, or (technically) boolean parameters.

    Similar to function arguments, the last parameter may be specified to be variadic, allowing it to be specified one or more times instead of only once. For example:

name: union
parameters:
  - name: T
    type: dataType
variadic: true

    This defines a type that can be parameterized with one or more other data types, for example union<i32, i64> but also union<bool>. Zero or more is also possible, by making the last argument optional:

name: tuple
parameters:
  - name: T
    type: dataType
    optional: true
variadic: true

    This would also allow for tuple<>, to define a zero-tuple.

    "},{"location":"types/type_parsing/","title":"Type Syntax Parsing","text":"

    In many places, it is useful to have a human-readable string representation of data types. Substrait has a custom syntax for type declaration. The basic structure of a type declaration is:

name?[variation]<param0,...,paramN>

    The components of this expression are:

• Name: Each type has a name. A type is expressed by providing a name. This name can be expressed in arbitrary case (e.g. varchar and vArChAr are equivalent), although lowercase is preferred. Required.
• Nullability indicator: A type is either non-nullable or nullable. To express nullability, a question mark is added after the type name (before any parameters). Optional, defaults to non-nullable.
• Variation: When expressing a type, a user can define the type based on a type variation. Some systems use type variations to describe different underlying representations of the same data type. This is expressed as a bracketed integer such as [2]. Optional, defaults to [0].
• Parameters: Compound types may have one or more configurable properties. The two main types of properties are integer and type properties. The parameters for each type correspond to a list of known properties associated with a type, declared in the order defined in the type specification. For compound types (types that contain types), the data type syntax will include nested type declarations. The one exception is structs, which are further outlined below. Required where parameters are defined.
"},{"location":"types/type_parsing/#grammars","title":"Grammars","text":"
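For illustration, here is a minimal sketch of a parser for the non-struct form of this syntax. This is our own simplification, not the official grammar (which is the ANTLR grammar mentioned below), and the flat parameter split does not handle nesting:

```python
import re

# Matches: name?[variation]<param0,...,paramN> with every part after
# the name optional.
TYPE_PATTERN = re.compile(
    r"^\s*(?P<name>[A-Za-z_][A-Za-z0-9_]*)"   # type class name
    r"(?P<nullable>\?)?"                      # optional nullability marker
    r"(?:\[(?P<variation>\d+)\])?"            # optional variation, e.g. [2]
    r"(?:<(?P<params>.*)>)?\s*$"              # optional parameter list
)

def parse_type(text: str) -> dict:
    m = TYPE_PATTERN.match(text)
    if not m:
        raise ValueError(f"not a valid type declaration: {text!r}")
    params = m.group("params")
    return {
        "class": m.group("name").lower(),     # names are case-insensitive
        "nullable": m.group("nullable") is not None,
        "variation": int(m.group("variation") or 0),
        # A real parser must split parameters respecting nested <...>;
        # a flat split is enough for simple cases like decimal<10, 2>.
        "parameters": [p.strip() for p in params.split(",")] if params else [],
    }

print(parse_type("varchar?<38>"))
# {'class': 'varchar', 'nullable': True, 'variation': 0, 'parameters': ['38']}
print(parse_type("decimal[2]<10, 2>"))
# {'class': 'decimal', 'nullable': False, 'variation': 2, 'parameters': ['10', '2']}
```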

It is relatively easy in most languages to produce simple parsers & emitters for the type syntax. To make that easier, Substrait also includes an ANTLR grammar to ease consumption and production of types. (The grammar also supports an entire language for representing plans as text.)

    "},{"location":"types/type_parsing/#structs-named-structs","title":"Structs & Named Structs","text":"

    Structs are unique from other types because they have an arbitrary number of parameters. The parameters are recursive and may include their own subproperties. Struct parsing is declared in the following two ways:

YAML:

# Struct
struct?[variation]<type0, type1,..., typeN>

# Named Struct
nstruct?[variation]<name0:type0, name1:type1,..., nameN:typeN>

Text format examples:

// Struct
struct?<string, i8, i32?, timestamp_tz>

// Named structs are not yet supported in the text format.

    In the normal (non-named) form, struct declares a set of types that are fields within that struct. In the named struct form, the parameters are formed by tuples of names + types, delineated by a colon. Names that are composed only of numbers and letters can be left unquoted. For other characters, names should be quoted with double quotes and use backslash for double-quote escaping.
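For example, the following named struct (our own illustration of the quoting rules) mixes unquoted, quoted, and escaped names:

nstruct<product_id:i64, "product name":string, "a \"quoted\" name":i32>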

    Note, in core Substrait algebra, fields are unnamed and references are always based on zero-index ordinal positions. However, data inputs must declare name-to-ordinal mappings and outputs must declare ordinal-to-name mappings. As such, Substrait also provides a named struct which is a pseudo-type that is useful for human consumption. Outside these places, most structs in a Substrait plan are structs, not named-structs. The two cannot be used interchangeably.

    "},{"location":"types/type_parsing/#other-complex-types","title":"Other Complex Types","text":"

    Similar to structs, maps and lists can also have a type as one of their parameters. Type references may be recursive. The key for a map is typically a simple type but it is not required.

YAML:

list?<type>
map<type0, type1>

Text format examples:

list?<list<string>>
list<struct<string, i32>>
map<i32?, list<map<i32, string?>>>
    "},{"location":"types/type_system/","title":"Type System","text":"

    Substrait tries to cover the most common types used in data manipulation. Types beyond this common core may be represented using simple extensions.

    Substrait types fundamentally consist of four components:

• Class (always present): e.g. i8, string, STRUCT, extensions. Together with the parameter pack, describes the set of non-null values supported by the type. Subdivided into simple and compound type classes.
• Nullability (always present): either NULLABLE (? suffix) or REQUIRED (no suffix). Describes whether values of this type can be null. Note that null is considered to be a special value of a nullable type, rather than the only value of a special null type.
• Variation (always present): no suffix or explicitly [0] (system-preferred), or an extension. Allows different variations of the same type class to exist in a system at a time, usually distinguished by in-memory format.
• Parameters (compound types only): e.g. <10, 2> (for DECIMAL), <i32, string> (for STRUCT). Some combination of zero or more data types or integers. The expected set of parameters and the significance of each parameter depends on the type class.

    Refer to Type Parsing for a description of the syntax used to describe types.

    Note

    Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via cast expressions.
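For example, turning an i32 into a decimal requires an explicit cast expression in the plan, as in this snippet from the SQL-to-Substrait tutorial (casting the quantity field):

{
  "cast": {
    "type": {
      "decimal": {
        "precision": 10,
        "scale": 2,
        "nullability": "NULLABILITY_REQUIRED"
      }
    },
    "input": {
      "selection": {
        "directReference": {
          "structField": {
            "field": 1
          }
        },
        "rootReference": {}
      }
    }
  }
}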

    "},{"location":"types/type_variations/","title":"Type Variations","text":"

    Type variations may be used to represent differences in representation between different consumers. For example, an engine might support dictionary encoding for a string, or could be using either a row-wise or columnar representation of a struct. All variations of a type are expected to have the same semantics when operated on by functions or other expressions.

All variations except the "system-preferred" variation (a.k.a. [0], see Type Parsing) must be defined using simple extensions. The key properties of these variations are:

    Property Description Base Type Class The type class that this variation belongs to. Name The name used to reference this type. Should be unique within type variations for this parent type within a simple extension. Description A human description of the purpose of this type variation. Function Behavior INHERITS or SEPARATE: whether functions that support the system-preferred variation implicitly also support this variation, or whether functions should be resolved independently. For example, if one has the function add(i8,i8) defined and then defines an i8 variation, this determines whether the i8 variation can be bound to the base add operation (inherits) or whether a specialized version of add needs to be defined specifically for this variation (separate). Defaults to inherits."}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Substrait: Cross-Language Serialization for Relational Algebra","text":""},{"location":"#what-is-substrait","title":"What is Substrait?","text":"

    Substrait is a format for describing compute operations on structured data. It is designed for interoperability across different languages and systems.

    "},{"location":"#how-does-it-work","title":"How does it work?","text":"

    Substrait provides a well-defined, cross-language specification for data compute operations. This includes a consistent declaration of common operations, custom operations, and one or more serialized representations of this specification. The spec focuses on the semantics of each operation. In addition to the specification, the Substrait ecosystem also includes a number of libraries and useful tools.

    We highly recommend the tutorial to learn how a Substrait plan is constructed.

    "},{"location":"#benefits","title":"Benefits","text":"
    • Avoids every system needing to create a communication method between every other system \u2013 each system merely supports ingesting and producing Substrait and it instantly becomes a part of the greater ecosystem.
    • Makes every part of the system upgradable. There\u2019s a new query engine that\u2019s ten times faster? Just plug it in!
    • Enables heterogeneous environments \u2013 run on a cluster of an unknown set of execution engines!
    • The text version of the Substrait plan allows you to quickly see how a plan functions without needing a visualizer (although there are Substrait visualizers as well!).
    "},{"location":"#example-use-cases","title":"Example Use Cases","text":"
    • Communicate a compute plan between a SQL parser and an execution engine (e.g. Calcite SQL parsing to Arrow C++ compute kernel)
    • Serialize a plan that represents a SQL view for consistent use in multiple systems (e.g. Iceberg views in Spark and Trino)
    • Submit a plan to different execution engines (e.g. Datafusion and Postgres) and get a consistent interpretation of the semantics.
    • Create an alternative plan generation implementation that can connect an existing end-user compute expression system to an existing end-user processing engine (e.g. Pandas operations executed inside SingleStore)
    • Build a pluggable plan visualization tool (e.g. D3 based plan visualizer)
    "},{"location":"about/","title":"Substrait: Cross-Language Serialization for Relational Algebra","text":""},{"location":"about/#project-vision","title":"Project Vision","text":"

    The Substrait project aims to create a well-defined, cross-language specification for data compute operations. The specification declares a set of common operations, defines their semantics, and describes their behavior unambiguously. The project also defines extension points and serialized representations of the specification.

    In many ways, the goal of this project is similar to that of the Apache Arrow project. Arrow is focused on a standardized memory representation of columnar data. Substrait is focused on what should be done to data.

    "},{"location":"about/#why-not-use-sql","title":"Why not use SQL?","text":"

    SQL is a well-known language for describing queries against relational data. It is designed to be simple and to allow reading and writing by humans. Substrait is not intended as a replacement for SQL and works alongside SQL to provide capabilities that SQL lacks. SQL is not a great fit for the systems that actually satisfy the query because it does not provide sufficient detail and is not represented in a format that is easy to process. Because of this, most modern systems will first translate the SQL query into a query plan, sometimes called the execution plan. There can be multiple levels of a query plan (e.g. physical and logical), a query plan may be split up and distributed across multiple systems, and a query plan often undergoes simplifying or optimizing transformations. The SQL standard does not define the format of the query or execution plan, and there is no open format that is supported by a broad set of systems. Substrait was created to provide a standard and open format for these query plans.

    "},{"location":"about/#why-not-just-do-this-within-an-existing-oss-project","title":"Why not just do this within an existing OSS project?","text":"

    A key goal of the Substrait project is to not be coupled to any single existing technology. Trying to get people involved in something can be difficult when it seems to be primarily driven by the opinions and habits of a single community. In many ways, this situation is similar to the early situation with Arrow. The precursor to Arrow was the Apache Drill ValueVectors concept. As part of creating Arrow, Wes and Jacques recognized the need to create a new community to build a fresh consensus (beyond just what the Apache Drill community wanted). This separation and new independent community was a key ingredient to Arrow\u2019s current success. The needs here are much the same: many separate communities could benefit from Substrait, but each has its own pain points, type systems, development processes, and timelines. To help resolve these tensions, one of the approaches proposed in Substrait is to set a bar that at least two of the top four OSS data technologies (Arrow, Spark, Iceberg, Trino) support something before incorporating it directly into the Substrait specification. (Another goal is to support strong extension points at key locations to avoid this bar being a limiter to broad adoption.)

    "},{"location":"about/#related-technologies","title":"Related Technologies","text":"
    • Apache Calcite: Many ideas in Substrait are inspired by the Calcite project. Calcite is a great JVM-based SQL query parsing and optimization framework. A key goal of the Substrait project is to expose Calcite capabilities more easily to non-JVM technologies as well as expose query planning operations as microservices.
    • Apache Arrow: The Arrow format for data is what the Substrait specification attempts to be for compute expressions. A key goal of Substrait is to enable Substrait producers to execute work within the Arrow Rust and C++ compute kernels.
    "},{"location":"about/#why-the-name-substrait","title":"Why the name Substrait?","text":"

    A strait is a narrow connector of water between two other pieces of water. In analytics, data is often thought of as water. Substrait is focused on instructions related to the data. In other words, what defines or supports the movement of water between one or more larger systems. Thus, the underlayment for the strait connecting different pools of water => sub-strait.

    "},{"location":"faq/","title":"Frequently Asked Questions","text":""},{"location":"faq/#what-is-the-purpose-of-the-post-join-filter-field-on-join-relations","title":"What is the purpose of the post-join filter field on Join relations?","text":"

    The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join.

    See the example here that highlights how the post-join filter behaves differently than a Filter relation in the case of a left join.
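
    One common way to picture the difference (an illustrative SQL sketch, not the linked example itself): the post-join filter acts like an extra ON-clause condition, while a separate Filter relation acts like a WHERE clause; the two differ for outer joins:

    -- post-join filter: unmatched left rows survive, null-padded\nSELECT * FROM l LEFT JOIN r ON l.id = r.id AND r.v > 10;\n\n-- Filter relation after the join: null-padded rows that fail the\n-- predicate are removed entirely\nSELECT * FROM l LEFT JOIN r ON l.id = r.id WHERE r.v > 10;\n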

    "},{"location":"faq/#why-does-the-project-relation-keep-existing-columns","title":"Why does the project relation keep existing columns?","text":"

    In several relational algebra systems (DuckDB, Velox, Apache Spark, Apache DataFusion, etc.) the project relation is used both to add new columns and remove existing columns. It is defined by a list of expressions and there is one output column for each expression.

    In Substrait, the project relation is only used to add new columns. Any relation can remove columns by using the emit property in RelCommon. This is because it is very common for optimized plans to discard columns once they are no longer needed and this can happen anywhere in a plan. If this discard required a project relation then optimized plans would be cluttered with project relations that only remove columns.

    As a result, Substrait\u2019s project relation is a little different. It is also defined by a list of expressions. However, the output columns are a combination of the input columns and one column for each of the expressions.
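
    For example, a project relation over a two-column input that computes one expression produces output columns 0 through 2; emit can then keep only the computed column. An illustrative protobuf-text sketch:

    project {\n  common {\n    emit {\n      output_mapping: 2  # keep only the computed column\n    }\n  }\n  input { ... }        # a relation producing columns 0 and 1\n  expressions { ... }  # one expression, becoming output column 2\n}\n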

    "},{"location":"faq/#where-are-field-names-represented","title":"Where are field names represented?","text":"

    Some relational algebra systems, such as Spark, give names to the output fields of a relation. For example, in PySpark I might run df.withColumn(\"num_chars\", length(\"text\")).filter(\"num_chars > 10\"). This creates a project relation, which calculates a new field named num_chars. This field is then referenced in the filter relation. Spark\u2019s logical plan maps closely to this and includes both the expression (length(\"text\")) and the name of the output field (num_chars) in its project relation.

    Substrait does not name intermediate fields in a plan. This is because these field names have no effect on the computation that must be performed. In addition, it opens the door to name-based references, which Substrait also does not support, because these can be a source of errors and confusion. One of the goals of Substrait is to make it very easy for consumers to understand plans. All references in Substrait are done with ordinals.

    In order to allow plans that use named fields to round-trip through Substrait, there is a hint that can be used to add field names to a plan. This hint is called output_names and is located in RelCommon. Consumers should not rely on this hint being present in a plan but, if present, it can be used to provide field names to intermediate relations in a plan for round-trip or debugging purposes.

    There are a few places where Substrait DOES define field names:

    • Read relations have field names in the base schema. This is because it is quite common for reads to do a name-based lookup to determine the columns that need to be read from source files.
    • The root relation has field names. This is because the root relation is the final output of the plan and it is useful to have names for the fields in the final output.
    "},{"location":"governance/","title":"Substrait Project Governance","text":"

    The Substrait project is run by volunteers in a collaborative and open way. Its governance is inspired by the Apache Software Foundation. In most cases, people familiar with the ASF model can work with Substrait in the same way. The biggest differences between the models are:

    • Substrait does not have a separate infrastructure governing body that gatekeeps the adoption of new developer tools and technologies.
    • Substrait Management Committee (SMC) members are responsible for recognizing the corporate relationship of its members and ensuring diverse representation and corporate independence.
    • Substrait does not condone private mailing lists. All project business should be discussed in public. The only exceptions to this are security escalations (security@substrait.io) and harassment reports (harassment@substrait.io).
    • Substrait has an automated continuous release process with no formal voting process per release.

    More details about concrete things Substrait looks to avoid can be found below.

    "},{"location":"governance/#the-substrait-project","title":"The Substrait Project","text":"

    The Substrait project consists of the code and repositories that reside in the substrait-io GitHub organization (consisting of core repositories and -contrib repositories, which have relaxed requirements), the Substrait.io website, the Substrait mailing list, Microsoft Teams-hosted community calls, and the Substrait Slack workspace. (All are open to everyone and recordings/transcripts are made where technology supports it.)

    "},{"location":"governance/#substrait-volunteers","title":"Substrait Volunteers","text":"

    We recognize four groups of individuals related to the project.

    "},{"location":"governance/#user","title":"User","text":"

    A user is someone who uses Substrait. They may contribute to Substrait by providing feedback to developers in the form of bug reports and feature suggestions. Users participate in the Substrait community by helping other users on mailing lists and user support forums.

    "},{"location":"governance/#contributors","title":"Contributors","text":"

    A contributor is a user who contributes to the project in the form of code or documentation. They take extra steps to participate in the project (loosely defined as the set of repositories under the github substrait-io organization), are active on the developer mailing list, participate in discussions, and provide patches, documentation, suggestions, and criticism.

    Contributors may be given write access to specific -contrib repositories by an SMC consensus vote per repository. The vote should be open for a week to allow adequate time for other SMC members to voice any concerns prior to providing write access.

    "},{"location":"governance/#committer","title":"Committer","text":"

    A committer is a developer who has write access to all (i.e., core and -contrib) repositories and has a signed Contributor License Agreement (CLA) on file. Because they do not need to depend on other people to patch the code or documentation, committers effectively make the short-term decisions for the project. The SMC can (even tacitly) agree and approve the changes into permanency, or they can reject them. Remember that the SMC makes the decisions, not the individual committers.

    "},{"location":"governance/#smc-member","title":"SMC Member","text":"

    An SMC member is a committer who was elected for their merit in the evolution of the project. They have write access to the code repository, the right to cast binding votes on all proposals on community-related decisions, the right to propose other active contributors for committership, and the right to invite active committers to the SMC. The SMC as a whole is the entity that controls the project, nobody else. They are responsible for the continued shaping of this governance model.

    "},{"location":"governance/#substrait-management-and-collaboration","title":"Substrait Management and Collaboration","text":"

    The Substrait project is managed using a collaborative, consensus-based process. We do not have a hierarchical structure; rather, different groups of contributors have different rights and responsibilities in the organization.

    "},{"location":"governance/#communication","title":"Communication","text":"

    Communication must be done via mailing lists, Slack, and/or Github. Communication is always done publicly. There are no private lists and all decisions related to the project are made in public. Communication is frequently done asynchronously since members of the community are distributed across many time zones.

    "},{"location":"governance/#substrait-management-committee","title":"Substrait Management Committee","text":"

    The Substrait Management Committee is responsible for the active management of Substrait. The main role of the SMC is to further the long-term development and health of the community as a whole, and to ensure that balanced and wide-scale peer review and collaboration take place. As part of this, the SMC is the primary approver of specification changes, ensuring that proposed changes represent a balanced and thorough examination of possibilities. This doesn\u2019t mean that the SMC has to be involved in the minutiae of a particular specification change, but it should always shepherd a healthy process around specification changes.

    "},{"location":"governance/#substrait-voting-process","title":"Substrait Voting Process","text":"

    Because one of the fundamental aspects of accomplishing things is doing so by consensus, we need a way to tell whether we have reached consensus. We do this by voting. There are several different types of voting. In all cases, it is recommended that all community members vote. The number of binding votes required to move forward and the community members who have \u201cbinding\u201d votes differs depending on the type of proposal made. In all cases, a veto of a binding voter results in an inability to move forward.

    The rules require that a community member registering a negative vote must include an alternative proposal or a detailed explanation of the reasons for the negative vote. The community then tries to gather consensus on an alternative proposal that can resolve the issue. In the great majority of cases, the concerns leading to the negative vote can be addressed. This process is called \u201cconsensus gathering\u201d and we consider it a very important indication of a healthy community.

    +1 votes required Binding voters Voting Location Process/Governance modifications & actions. This includes promoting new contributors to committer or SMC. 3 SMC Mailing List Management of -contrib repositories including adding repositories and giving write access to them 3 SMC Mailing List Format/Specification Modifications (including breaking extension changes) 2 SMC Github PR Documentation Updates (formatting, moves) 1 SMC Github PR Typos 1 Committers Github PR Non-breaking function introductions 1 (not including proposer) Committers Github PR Non-breaking extension additions & non-format code modifications 1 (not including proposer) Committers Github PR Changes (non-breaking or breaking) to a Substrait library (i.e. substrait-java, substrait-validator) 1 (not including proposer) Committers Github PR Changes to a Substrait -contrib repository 1 (not including proposer) Contributors Github PR"},{"location":"governance/#review-then-commit","title":"Review-Then-Commit","text":"

    Substrait follows a review-then-commit policy. This requires that all changes receive consensus approval before being committed to the code base. The specific vote requirements follow the table above.

    "},{"location":"governance/#expressing-votes","title":"Expressing Votes","text":"

    The voting process may seem more than a little weird if you\u2019ve never encountered it before. Votes are represented as numbers between -1 and +1, with \u2018-1\u2019 meaning \u2018no\u2019 and \u2018+1\u2019 meaning \u2018yes.\u2019

    The in-between values indicate how strongly the voting individual feels. Here are some examples of fractional votes and what the voter might be communicating with them:

    • +0: \u2018I don\u2019t feel strongly about it, but I\u2019m okay with this.\u2019
    • -0: \u2018I won\u2019t get in the way, but I\u2019d rather we didn\u2019t do this.\u2019
    • -0.5: \u2018I don\u2019t like this idea, but I can\u2019t find any rational justification for my feelings.\u2019
    • ++1: \u2018Wow! I like this! Let\u2019s do it!\u2019
    • -0.9: \u2018I really don\u2019t like this, but I\u2019m not going to stand in the way if everyone else wants to go ahead with it.\u2019
    • +0.9: \u2018This is a cool idea and I like it, but I don\u2019t have time/the skills necessary to help out.\u2019
    "},{"location":"governance/#votes-on-code-modification","title":"Votes on Code Modification","text":"

    For code-modification votes, +1 votes (review approvals in Github are considered equivalent to a +1) are in favor of the proposal, but -1 votes are vetoes and kill the proposal dead until all vetoers withdraw their -1 votes.

    "},{"location":"governance/#vetoes","title":"Vetoes","text":"

    A -1 (or an unaddressed PR request for changes) vote by a qualified voter stops a code-modification proposal in its tracks. This constitutes a veto, and it cannot be overruled nor overridden by anyone. Vetoes stand until and unless the individual withdraws their veto.

    To prevent vetoes from being used capriciously, the voter must provide with the veto a technical or community justification showing why the change is bad.

    "},{"location":"governance/#why-do-we-vote","title":"Why do we vote?","text":"

    Votes help us to openly resolve conflicts. Without a process, people tend to avoid conflict and thrash around. Votes help to make sure we do the hard work of resolving the conflict.

    "},{"location":"governance/#substrait-is-non-commercial-but-commercially-aware","title":"Substrait is non-commercial but commercially-aware","text":"

    Substrait\u2019s mission is to produce software for the public good. All Substrait software is always available for free, and solely under the Apache License.

    We\u2019re happy to have third parties, including for-profit corporations, take our software and use it for their own purposes. However, it is important in these cases to ensure that the third party does not misuse the brand and reputation of the Substrait project for its own purposes. It is important for the longevity and community health of Substrait that the community gets the appropriate credit for producing freely available software.

    The SMC actively tracks the corporate allegiances of community members and strives to ensure influence around any particular aspect of the project isn\u2019t overly skewed towards a single corporate entity.

    "},{"location":"governance/#substrait-trademark","title":"Substrait Trademark","text":"

    The SMC is responsible for protecting the Substrait name and brand. TBD what action is taken to support this.

    "},{"location":"governance/#project-roster","title":"Project Roster","text":""},{"location":"governance/#substrait-management-committee-smc","title":"Substrait Management Committee (SMC)","text":"Name Association Phillip Cloud Voltron Data Weston Pace LanceDB Jacques Nadeau Sundeck Victor Barua Datadog David Sisson Voltron Data"},{"location":"governance/#substrait-committers","title":"Substrait Committers","text":"Name Association Jeroen van Straten Qblox Carlo Curino Microsoft James Taylor Sundeck Sutou Kouhei Clearcode Micah Kornfeld Google Jinfeng Ni Sundeck Andy Grove Nvidia Jesus Camacho Rodriguez Microsoft Rich Tia Voltron Data Vibhatha Abeykoon Voltron Data Nic Crane Recast Gil Forsyth Voltron Data ChaoJun Zhang Intel Matthijs Brobbel Voltron Data Matt Topol Voltron Data Ingo M\u00fcller Google Arttu Voutilainen Palantir Technologies Bruno Volpato Datadog Anshul Data Sundeck Chandra Sanapala Sundeck"},{"location":"governance/#additional-detail-about-differences-from-asf","title":"Additional detail about differences from ASF","text":"

    Corporate Awareness: The ASF takes a blind-eye approach that has proven to be too slow to correct corporate influence which has substantially undermined many OSS projects. In contrast, Substrait SMC members are responsible for identifying corporate risks and over-representation and adjusting inclusion in the project based on that (limiting committership, SMC membership, etc). Each member of the SMC shares responsibility to expand the community and seek out corporate diversity.

    Infrastructure: The ASF shows its age with respect to infrastructure, having been originally built on SVN. Some examples of ASF requirements that Substrait is eschewing include: custom git infrastructure, a manual release process, and project-external gatekeeping around the use of new tools/technologies.

    "},{"location":"community/","title":"Community","text":"

    Substrait is developed as a consensus-driven open source product under the Apache 2.0 license. Development is done in the open leveraging GitHub issues and PRs.

    "},{"location":"community/#get-in-touch","title":"Get In Touch","text":"Mailing List/Google Group We use the mailing list to discuss questions, formulate plans and collaborate asynchronously. Slack Channel The developers of Substrait frequent the Slack channel. You can get an invite to the channel by following this link. GitHub Issues Substrait is developed via GitHub issues and pull requests. If you see a problem or want to enhance the product, we suggest you file a GitHub issue for developers to review. Twitter The @substrait_io account on Twitter is our official account. Follow-up to keep to date on what is happening with Substrait! Docs Our website is all maintained in our source repository. If there is something you think can be improved, feel free to fork our repository and post a pull request. Meetings Our community meets every other week on Wednesday."},{"location":"community/#talks","title":"Talks","text":"

    Want to learn more about Substrait? Try the following presentations and slide decks.

    • Substrait: A Common Representation for Data Compute Plans (Jacques Nadeau, April 2022) [slides]
    "},{"location":"community/#citation","title":"Citation","text":"

    If you use Substrait in your research, please cite it using the following BibTeX entry:

    @misc{substrait,\n  author = {substrait-io},\n  title = {Substrait: Cross-Language Serialization for Relational Algebra},\n  year = {2021},\n  month = {8},\n  day = {31},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/substrait-io/substrait}}\n}\n
    "},{"location":"community/#contribution","title":"Contribution","text":"

    All contributors are welcome to Substrait. If you want to join the project, open a PR or get in touch with us as above.

    "},{"location":"community/#principles","title":"Principles","text":"
    • Be inclusive and open to all.
    • Ensure a diverse set of contributors that come from multiple data backgrounds to maximize general utility.
    • Build a specification based on open consensus.
    • Avoid over-reliance/coupling to any single technology.
    • Make the specification and all tools freely available under a permissive license (Apache v2).
    "},{"location":"community/powered_by/","title":"Powered by Substrait","text":"

    In addition to the work maintained in repositories within the substrait-io GitHub organization, a growing list of other open source projects have adopted Substrait.

    Acero Acero is a query execution engine implemented as a part of the Apache Arrow C++ library. Acero provides a Substrait consumer interface. ADBC ADBC (Arrow Database Connectivity) is an API specification for Apache Arrow-based database access. ADBC allows applications to pass queries either as SQL strings or Substrait plans. Arrow Flight SQL Arrow Flight SQL is a client-server protocol for interacting with databases and query engines using the Apache Arrow in-memory columnar format and the Arrow Flight RPC framework. Arrow Flight SQL allows clients to send queries as SQL strings or Substrait plans. DataFusion DataFusion is an extensible query planning, optimization, and execution framework, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion provides a Substrait producer and consumer that can convert DataFusion logical plans to and from Substrait plans. It can be used through the DataFusion Python bindings. DuckDB DuckDB is an in-process SQL OLAP database management system. DuckDB provides a Substrait extension that allows users to produce and consume Substrait plans through DuckDB\u2019s SQL, Python, and R APIs. Gluten Gluten is a plugin for Apache Spark that allows computation to be offloaded to engines that have better performance or efficiency than Spark\u2019s built-in JVM-based engine. Gluten converts Spark physical plans to Substrait plans. Ibis Ibis is a Python library that provides a lightweight, universal interface for data wrangling. It includes a dataframe API for Python with support for more than 10 query execution engines, plus a Substrait producer to enable support for Substrait-consuming execution engines. Substrait R Interface The Substrait R interface package allows users to construct Substrait plans from R for evaluation by Substrait-consuming execution engines. The package provides a dplyr backend as well as lower-level interfaces for creating Substrait plans and integrations with Acero and DuckDB. Velox Velox is a unified execution engine aimed at accelerating data management systems and streamlining their development. Velox provides a Substrait consumer interface.

    To add your project to this list, please open a pull request.

    "},{"location":"expressions/aggregate_functions/","title":"Aggregate Functions","text":"

    Aggregate functions are functions that define an operation which consumes values from multiple records to produce a single output. Aggregate functions in SQL are typically used in GROUP BY queries. Aggregate functions are similar to scalar functions, with function signatures that carry a small set of additional properties.
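
    For example, an average function is typically decomposable: partial aggregations emit an intermediate sum/count struct that is later merged and reduced to the final average. An illustrative YAML sketch of such a declaration (following the simple extension format; the properties used here are described below):

    aggregate_functions:\n  - name: \"avg\"\n    description: Calculates the average of a set of values.\n    impls:\n      - args:\n          - name: x\n            value: fp64\n        decomposable: MANY\n        intermediate: \"STRUCT<fp64,i64>\"  # running sum and count\n        return: fp64?\n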

    Aggregate function signatures contain all the properties defined for scalar functions. Additionally, they contain the properties below:

    Property Description Required Inherits All properties defined for scalar function. N/A Ordered Whether the result of this function is sensitive to sort order. Optional, defaults to false Maximum set size Maximum allowed set size as an unsigned integer. Optional, defaults to unlimited Decomposable Whether the function can be executed in one or more intermediate steps. Valid options are: NONE, ONE, MANY, describing how intermediate steps can be taken. Optional, defaults to NONE Intermediate Output Type If the function is decomposable, represents the intermediate output type that is used, if the function is defined as either ONE or MANY decomposable. Will be a struct in many cases. Required for ONE and MANY. Invocation Whether the function uses all or only distinct values in the aggregation calculation. Valid options are: ALL, DISTINCT. Optional, defaults to ALL"},{"location":"expressions/aggregate_functions/#aggregate-binding","title":"Aggregate Binding","text":"

    When binding an aggregate function, the binding must include the following additional properties beyond the standard scalar binding properties:

    Property Description Phase Describes the input type of the data: [INITIAL_TO_INTERMEDIATE, INTERMEDIATE_TO_INTERMEDIATE, INITIAL_TO_RESULT, INTERMEDIATE_TO_RESULT] describing what portion of the operation is required. For functions that are NOT decomposable, the only valid option will be INITIAL_TO_RESULT. Ordering Zero or more ordering keys along with key order (ASC|DESC|NULL FIRST, etc.), declared similar to the sort keys in an ORDER BY relational operation. If no sorts are specified, the records are not sorted prior to being passed to the aggregate function."},{"location":"expressions/embedded_functions/","title":"Embedded Functions","text":"

    Embedded functions are a special kind of function where the implementation is embedded within the actual plan. They are commonly used in tools where a user intersperses business logic within a data pipeline. This is more common in data science workflows than traditional SQL workflows.

    Embedded functions are not pre-registered. Embedded functions require that data be consumed and produced with a standard API, may require memory allocation and have determinate error reporting behavior. They may also have specific runtime dependencies. For example, a Python pickle function may depend on pyarrow 5.0 and pynessie 1.0.

    Properties for an embedded function include:

    Property Description Required Function Type The type of embedded function presented. Required Function Properties Function properties, one of those items defined below. Required Output Type The fully resolved output type for this embedded function. Required

    The binary representation of an embedded function is:

    Binary Representation / Human Readable Representation
    message EmbeddedFunction {\n  repeated Expression arguments = 1;\n  Type output_type = 2;\n  oneof kind {\n    PythonPickleFunction python_pickle_function = 3;\n    WebAssemblyFunction web_assembly_function = 4;\n  }\n\n  message PythonPickleFunction {\n    bytes function = 1;\n    repeated string prerequisite = 2;\n  }\n\n  message WebAssemblyFunction {\n    bytes script = 1;\n    repeated string prerequisite = 2;\n  }\n}\n

    As the bytes are opaque to Substrait there is no equivalent human readable form.

    "},{"location":"expressions/embedded_functions/#function-details","title":"Function Details","text":"

    There are many types of possible stored functions. For each, Substrait works to expose the function in as descriptive a way as possible to support the largest number of consumers.

    "},{"location":"expressions/embedded_functions/#python-pickle-function-type","title":"Python Pickle Function Type","text":"Property Description Required Pickle Body binary pickle encoded function using [TBD] API representation to access arguments. True Prereqs A list of specific Python conda packages that are prerequisites for access (a structured version of a requirements.txt file). Optional, defaults to none"},{"location":"expressions/embedded_functions/#webassembly-function-type","title":"WebAssembly Function Type","text":"Property Description Required Script WebAssembly function True Prereqs A list of AssemblyScript prerequisites required to compile the assemblyscript function using NPM coordinates. Optional, defaults to none Discussion Points
    • What are the common embedded function formats?
    • How do we expose the data for a function?
    • How do we express batching capabilities?
    • How do we ensure/declare containerization?
    "},{"location":"expressions/extended_expression/","title":"Extended Expression","text":"

    Extended Expression messages are provided for expression-level protocols as an alternative to using a Plan. They mainly target expression-only evaluations, such as those computed in Filter/Project/Aggregation rels. Unlike the original Expression defined in the Substrait protocol, Extended Expression messages require more information to completely describe the computation context, including: input data schema, referred function signatures, and output schema.

    Since Extended Expression will be used separately from the Plan rel representation, it will need to include basic fields like Version.

    ExtendedExpression Message
    message ExtendedExpression {\n  // Substrait version of the expression. Optional up to 0.17.0, required for later\n  // versions.\n  Version version = 7;\n\n  // a list of yaml specifications this expression may depend on\n  repeated substrait.extensions.SimpleExtensionURI extension_uris = 1;\n\n  // a list of extensions this expression may depend on\n  repeated substrait.extensions.SimpleExtensionDeclaration extensions = 2;\n\n  // one or more expression trees with same order in plan rel\n  repeated ExpressionReference referred_expr = 3;\n\n  NamedStruct base_schema = 4;\n  // additional extensions associated with this expression.\n  substrait.extensions.AdvancedExtension advanced_extensions = 5;\n\n  // A list of com.google.Any entities that this plan may use. Can be used to\n  // warn if some embedded message types are unknown. Note that this list may\n  // include message types that are ignorable (optimizations) or that are\n  // unused. In many cases, a consumer may be able to work with a plan even if\n  // one or more message types defined here are unknown.\n  repeated string expected_type_urls = 6;\n\n}\n
    "},{"location":"expressions/extended_expression/#input-and-output-data-schema","title":"Input and output data schema","text":"

    Similar to base_schema defined in ReadRel, the input data schema describes the name/type/nullability and layout info of the input data for the target expression evaluation. The message also includes field names that define the names of the output data.
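
    An illustrative protobuf-text sketch of a base_schema describing a two-column input, a nullable string named text and a nullable i64 named num_chars (field names follow the NamedStruct message):

    base_schema {\n  names: \"text\"\n  names: \"num_chars\"\n  struct {\n    types { string { nullability: NULLABILITY_NULLABLE } }\n    types { i64 { nullability: NULLABILITY_NULLABLE } }\n    nullability: NULLABILITY_REQUIRED\n  }\n}\n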

    "},{"location":"expressions/extended_expression/#referred-expression","title":"Referred expression","text":"

    An Extended Expression will have one or more referred expressions, which can be either an Expression or an AggregateFunction. Additional types of expressions may be added in the future.

    For a message with multiple expressions, users may produce each Extended Expression in the same order as they occur in the original Plan rel. However, the consumer does NOT have to handle them in this order. A consumer needs only to ensure that the columns in the final output are organized in the same order as defined in the message.

    "},{"location":"expressions/extended_expression/#function-extensions","title":"Function extensions","text":"

    Function extensions work the same for both Extended Expression and the original Expression defined in the Substrait protocol.

    "},{"location":"expressions/field_references/","title":"Field References","text":"

    In Substrait, all fields are dealt with on a positional basis. Field names are only used at the edge of a plan, for the purposes of naming fields for the outside world. Each operation returns a simple or compound data type. Additional operations can refer to data within that initial operation using field references. To reference a field, you use a reference based on the type of field position you want to reference.

    Reference Type Properties Type Applicability Type return Struct Field Ordinal position. Zero-based. Only legal within the range of possible fields within a struct. Selecting an ordinal outside the applicable field range results in an invalid plan. struct Type of field referenced Array Value Array offset. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Negative and positive overflows return null values (no wrapping). list type of list Array Slice Array offset and element count. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Position does not wrap, nor does length. list Same type as original list Map Key A map value that is matched exactly against available map keys and returned. map Value type of map Map KeyExpression A wildcard string that is matched against a simplified form of regular expressions. Requires the key type of the map to be a character type. [Format detail needed, intention to include basic regex concepts such as greedy/non-greedy.] map List of map value type Masked Complex Expression An expression that provides a mask over a schema declaring which portions of the schema should be presented. This allows a user to select a portion of a complex object but mask certain subsections of that same object. any any"},{"location":"expressions/field_references/#compound-references","title":"Compound References","text":"

    References are typically constructed as a sequence. For example: [struct position 0, struct position 1, array offset 2, array slice 1..3].

    Field references are in the same order they are defined in their schema. For example, let\u2019s consider the following schema:

    column a:\n  struct<\n    b: list<\n      struct<\n        c: map<string, \n          struct<\n            x: i32>>>>>\n

    If we want to represent the SQL expression:

    a.b[2].c['my_map_key'].x\n

    We will need to declare the nested field such that:

    Struct field reference a\nStruct field b\nList offset 2\nStruct field c\nMap key my_map_key\nStruct field x\n

    Or more formally in Protobuf Text, we get:

    selection {\n  direct_reference {\n    struct_field {\n      field: 0 # .a\n      child {\n        struct_field {\n          field: 0 # .b\n          child {\n            list_element {\n              offset: 2\n              child {\n                struct_field {\n                  field: 0 # .c\n                  child {\n                    map_key {\n                      map_key {\n                        string: \"my_map_key\" # ['my_map_key']\n                      }\n                      child {\n                        struct_field {\n                          field: 0 # .x\n                        }\n                      }\n                    }\n                  }\n                }\n              }\n            }\n          }\n        }\n      }\n    }\n  }\n  root_reference { }\n}\n
    "},{"location":"expressions/field_references/#validation","title":"Validation","text":"

    References must validate against the schema of the record being referenced. If not, an error is expected.

    "},{"location":"expressions/field_references/#masked-complex-expression","title":"Masked Complex Expression","text":"

    A masked complex expression is used to do a subselection of a portion of a complex record. It allows a user to specify the portion of the complex object to consume. Imagine you have the following schema (note that structs are lists of fields here, as they are throughout Substrait, since field names are not used internally):

    struct:\n  - struct:\n    - integer\n    - list:\n      struct:\n        - i32\n        - string\n        - string\n    - i32\n  - i16\n  - i32\n  - i64\n

    Given this schema, you could declare a mask of fields to include in pseudocode, such as:

    0:[0,1:[..5:[0,2]]],2,3\n\nOR\n\n0:\n  - 0\n  - 1:\n    ..5:\n      - 0\n      - 2\n2\n3\n

    This mask states that we would like to include fields 0, 2, and 3 at the top level. Within field 0, we want to include subfields 0 and 1. For subfield 0.1, we want to include up to only the first 5 records in the array and include only fields 0 and 2 within the struct within that array. The resulting schema would be:

    struct:\n  - struct:\n    - integer\n    - list:\n      struct: \n        - i32\n        - string\n  - i32\n  - i64\n
    "},{"location":"expressions/field_references/#unwrapping-behavior","title":"Unwrapping Behavior","text":"

    By default, when only a single field is selected from a struct, that struct is removed. When only a single element is removed from a list, the list is removed. A user can also configure the mask to avoid unwrapping in these cases. [TBD how we express this in the serialization formats.]

    Discussion Points
    • Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.)
    "},{"location":"expressions/scalar_functions/","title":"Scalar Functions","text":"

    A function is a scalar function if that function takes in values from a single record and produces an output value. To clearly specify the definition of functions, Substrait declares an extensible specification plus binding approach to function resolution. A scalar function signature includes the following properties:

    Property Description Required Name One or more user-friendly UTF-8 strings that are used to reference this function. At least one value is required. List of arguments Argument properties are defined below. Arguments can be fully defined or calculated with a type expression. See further details below. Optional, defaults to niladic. Deterministic Whether this function is expected to reproduce the same output when it is invoked multiple times with the same input. This informs a plan consumer on whether it can constant-reduce the defined function. An example would be a random() function, which is typically expected to be evaluated repeatedly despite having the same set of inputs. Optional, defaults to true. Session Dependent Whether this function is influenced by the session context it is invoked within. For example, a function may be influenced by a user who is invoking the function, the time zone of a session, or some other non-obvious parameter. This can inform caching systems on whether a particular function is cacheable. Optional, defaults to false. Variadic Behavior Whether the last argument of the function is variadic or a single argument. If variadic, the argument can optionally have a lower bound (minimum number of instances) and an upper bound (maximum number of instances). Optional, defaults to single value. Nullability Handling Describes how nullability of input arguments maps to nullability of output arguments. Three options are: MIRROR, DECLARED_OUTPUT and DISCRETE. More details about nullability handling are listed below. Optional, defaults to MIRROR Description Additional description of function for implementers or users. Should be written human-readable to allow exposure to end users. Presented as a map with language => description mappings. E.g. { \"en\": \"This adds two numbers together.\", \"fr\": \"cela ajoute deux nombres\"}. Optional Return Value The output type of the expression. Return types can be expressed as a fully-defined type or a type expression. See below for more on type expressions. Required Implementation Map A map of implementation locations for one or more implementations of the given function. Each key is a function implementation type. Implementation types include examples such as: AthenaArrowLambda, TrinoV361Jar, ArrowCppKernelEnum, GandivaEnum, LinkedIn Transport Jar, etc. [Definition TBD]. Implementation type has one or more properties associated with retrieval of that implementation. Optional"},{"location":"expressions/scalar_functions/#argument-types","title":"Argument Types","text":"

    There are three main types of arguments: value arguments, type arguments, and enumerations. Every defined argument must be specified in every invocation of the function. When specified, the position of these arguments in the function invocation must match the position of the arguments as defined in the YAML function definition.

    • Value arguments: arguments that refer to a data value. These could be constants (literal expressions defined in the plan) or variables (a reference expression that references data being processed by the plan). This is the most common type of argument. The value of a value argument is not available in output derivation, but its type is. Value arguments can be declared in one of two ways: concrete or parameterized. Concrete types are either simple types or compound types with all parameters fully defined (without referencing any type arguments). Examples include i32, fp32, VARCHAR<20>, List<fp32>, etc. Parameterized types are discussed further below.
    • Type arguments: arguments that are used only to inform the evaluation and/or type derivation of the function. For example, you might have a function which is truncate(<type> DECIMAL<P0,S0>, <value> DECIMAL<P1, S1>, <value> i32). This function declares two value arguments and a type argument. The difference between them is that the type argument has no value at runtime, while the value arguments do.
    • Enumeration: arguments that support a fixed set of declared values as constant arguments. These arguments must be specified as part of an expression. While these could also have been implemented as constant string value arguments, they are formally included to improve validation/contextual help/etc. for frontend processors and IDEs. An example might be extract([DAY|YEAR|MONTH], <date value>). In this example, a producer must specify a type of date part to extract. Note, the value of a required enumeration cannot be used in type derivation.
    "},{"location":"expressions/scalar_functions/#value-argument-properties","title":"Value Argument Properties","text":"Property Description Required Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0) Description Additional description of this argument. Optional Value A fully defined type or a type expression. Required Constant Whether this argument is required to be a constant for invocation. For example, in some system a regular expression pattern would only be accepted as a literal and not a column value reference. Optional, defaults to false"},{"location":"expressions/scalar_functions/#type-argument-properties","title":"Type Argument Properties","text":"Property Description Required Type A partially or completely parameterized type. E.g. List<K> or K Required Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0) Description Additional description of this argument. Optional"},{"location":"expressions/scalar_functions/#required-enumeration-properties","title":"Required Enumeration Properties","text":"Property Description Required Options List of valid string options for this argument Required Name A human-readable name for this argument to help clarify use. Optional, defaults to a name based on position (e.g. arg0) Description Additional description of this argument. Optional"},{"location":"expressions/scalar_functions/#options","title":"Options","text":"

    In addition to arguments, each call may specify zero or more options. These are similar to a required enumeration but more focused on supporting alternative behaviors. Options can be left unspecified, in which case the consumer is free to choose which implementation to use. An example use case might be OVERFLOW_BEHAVIOR:[OVERFLOW, SATURATE, ERROR]. If unspecified, an engine is free to use any of the three choices or even some alternative behavior (e.g. setting the value to null on overflow). If specified, the engine would be expected to behave as specified or fail. Note, the value of an optional enumeration cannot be used in type derivation.
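
    In the YAML definition of a function, such an option might be declared along these lines (an illustrative sketch reusing the overflow example above; the add function shown is an assumption):

    scalar_functions:\n  - name: \"add\"\n    impls:\n      - args:\n          - name: x\n            value: i8\n          - name: y\n            value: i8\n        options:\n          overflow:\n            values: [ OVERFLOW, SATURATE, ERROR ]\n        return: i8\n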

    "},{"location":"expressions/scalar_functions/#option-preference","title":"Option Preference","text":"

    A producer may specify multiple values for an option. If the producer does so then the consumer must deliver the first behavior in the list of values that the consumer is capable of delivering. For example, considering overflow as defined above, if a producer specified [ERROR, SATURATE] then the consumer must deliver ERROR if it is capable of doing so. If it is not then it may deliver SATURATE. If the consumer cannot deliver either behavior then it is an error and the consumer must reject the plan.
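
    In the protobuf representation, a producer can express this ordered preference list on the function invocation along these lines (an illustrative sketch; the overflow option name is an assumption):

    scalar_function {\n  function_reference: 1\n  options {\n    name: \"overflow\"\n    preference: \"ERROR\"     # deliver this behavior if possible\n    preference: \"SATURATE\"  # otherwise fall back to this\n  }\n}\n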

    "},{"location":"expressions/scalar_functions/#optional-properties","title":"Optional Properties","text":"Property Description Required Values A list of valid strings for this option. Required Name A human-readable name for this option. Required"},{"location":"expressions/scalar_functions/#nullability-handling","title":"Nullability Handling","text":"Mode Description MIRROR This means that the function has the behavior that if at least one of the input arguments are nullable, the return type is also nullable. If all arguments are non-nullable, the return type will be non-nullable. An example might be the + function. DECLARED_OUTPUT Input arguments are accepted of any mix of nullability. The nullability of the output function is whatever the return type expression states. Example use might be the function is_null() where the output is always boolean independent of the nullability of the input. DISCRETE The input and arguments all define concrete nullability and can only be bound to the types that have those nullability. For example, if a type input is declared i64? and one has an i64 literal, the i64 literal must be specifically cast to i64? to allow the operation to bind."},{"location":"expressions/scalar_functions/#parameterized-types","title":"Parameterized Types","text":"

    Types are parameterized by two types of values: by inner types (e.g. List<K>) and numeric values (e.g. DECIMAL<P,S>). Parameter names are simple strings (frequently a single character). There are two types of parameters: integer parameters and type parameters.

    When the same parameter name is used multiple times in a function definition, the function can only bind if the exact same value is used for all parameters of that name. For example, if one had a function with a signature of fn(VARCHAR<N>, VARCHAR<N>), the function would only be usable if both VARCHAR types had the same length value N. This necessitates that all instances of the same parameter name must be of the same parameter type (all instances are a type parameter or all instances are an integer parameter).

    "},{"location":"expressions/scalar_functions/#type-parameter-resolution-in-variadic-functions","title":"Type Parameter Resolution in Variadic Functions","text":"

    When the last argument of a function is variadic and declares a type parameter e.g. fn(A, B, C...), the C parameter can be marked as either consistent or inconsistent. If marked as consistent, the function can only be bound to arguments where all the C types are the same concrete type. If marked as inconsistent, each unique C can be bound to a different type within the constraints of what C allows.
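
    As an illustrative sketch of the difference, consider the declaration fn(A, B, C...) bound to concrete argument types:

    # consistent C: every variadic argument must bind to one concrete type\nfn(a, b, c1:i32, c2:i32, c3:i32)   # binds\nfn(a, b, c1:i32, c2:i64)           # does not bind\n\n# inconsistent C: each variadic argument may bind separately\nfn(a, b, c1:i32, c2:i64)           # binds\n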

    "},{"location":"expressions/scalar_functions/#output-type-derivation","title":"Output Type Derivation","text":""},{"location":"expressions/scalar_functions/#concrete-return-types","title":"Concrete Return Types","text":"

    A concrete return type is one that is fully known at function definition time. Examples of simple concrete return types would be things such as i32 or fp32. For compound types, a concrete return type must be fully declared. Examples of fully defined compound types: VARCHAR<20>, DECIMAL<25,5>

    "},{"location":"expressions/scalar_functions/#return-type-expressions","title":"Return Type Expressions","text":"

    Any function can declare a return type expression. A return type expression uses a simplified set of expressions to describe how the return type should be derived. For example, a return expression could be as simple as returning a parameter declared in the arguments, e.g. f(List<K>) => K, or it can be a simple mathematical or conditional expression such as add(decimal<a,b>, decimal<c,d>) => decimal<a+c, b+d>. For the simple expression language, there is a very narrow set of types:

    • Integer: 64-bit signed integer (can be a literal or a parameter value)
    • Boolean: True and False
    • Type: A Substrait type (with possibly additional embedded expressions)

    These types are evaluated using a small set of operations to support common scenarios. List of valid operations:

    Math: +, -, *, /, min, max\nBoolean: &&, ||, !, <, >, ==\nParameters: type, integer\nLiterals: type, integer\n

    Fully defined with argument types:

    • type_parameter(string name) => type
    • integer_parameter(string name) => integer
    • not(boolean x) => boolean
    • and(boolean a, boolean b) => boolean
    • or(boolean a, boolean b) => boolean
    • multiply(integer a, integer b) => integer
    • divide(integer a, integer b) => integer
    • add(integer a, integer b) => integer
    • subtract(integer a, integer b) => integer
    • min(integer a, integer b) => integer
    • max(integer a, integer b) => integer
    • equal(integer a, integer b) => boolean
    • greater_than(integer a, integer b) => boolean
    • less_than(integer a, integer b) => boolean
    • covers(Type a, Type b) => boolean Covers means that type b matches type a for as much as type b is defined. For example, if type a is VARCHAR<20> and type b is VARCHAR<N>, type b would be considered covering. Similarly, if type a was List<Struct<a:f32, b:f32>> and type b was List<Struct<>>, it would be considered covering. Note that this is directional, as in \u201cb covers a\u201d or \u201cb can be further enhanced to match the definition of a\u201d.
    • if(boolean a) then (integer) else (integer)
    • if(boolean a) then (type) else (type)
    "},{"location":"expressions/scalar_functions/#example-type-expressions","title":"Example Type Expressions","text":"

    For reference, here are some common output type derivations and how they can be expressed with a return type expression:

    Operation Definition Add item to list add(List<T>, T) => List<T> Decimal Division divide(Decimal<P1,S1>, Decimal<P2,S2>) => Decimal<P1 -S1 + S2 + MAX(6, S1 + P2 + 1), MAX(6, S1 + P2 + 1)> Select a subset of map keys based on a regular expression (requires stringlike keys) extract_values(regex:string, map:Map<K,V>) => List<V> WHERE K IN [STRING, VARCHAR<N>, FIXEDCHAR<N>] Concatenate two fixed sized character strings concat(FIXEDCHAR<A>, FIXEDCHAR<B>) => FIXEDCHAR<A+B> Make a struct of a set of fields and a struct definition. make_struct(<type> T, K...) => T"},{"location":"expressions/specialized_record_expressions/","title":"Specialized Record Expressions","text":"

    While all types of operations could be reduced to functions, in some cases this would be overly simplistic. Instead, it is helpful to construct some other expression constructs.

These constructs should be reserved for genuinely different kinds of expressions, as opposed to anything that is merely syntactic sugar. For example, CAST and EXTRACT are SQL operations that are presented using specialized syntax, but they can easily be modeled using a function paradigm with minimal complexity.

    "},{"location":"expressions/specialized_record_expressions/#literal-expressions","title":"Literal Expressions","text":"

    For each data type, it is possible to create a literal value for that data type. The representation depends on the serialization format. Literal expressions include both a type literal and a possibly null value.

    "},{"location":"expressions/specialized_record_expressions/#nested-type-constructor-expressions","title":"Nested Type Constructor Expressions","text":"

    These expressions allow structs, lists, and maps to be constructed from a set of expressions. For example, they allow a struct expression like (field 0 - field 1, field 0 + field 1) to be represented.

    "},{"location":"expressions/specialized_record_expressions/#cast-expression","title":"Cast Expression","text":"

To convert a value from one type to another, Substrait defines a cast expression. Cast expressions declare an expected type, an input argument, and an enumeration specifying failure behavior: whether the cast should return null or throw an exception on failure.

    Note that Substrait always requires a cast expression whenever the current type is not exactly equal to (one of) the expected types. For example, it is illegal to directly pass a value of type i8[0] to a function that only supports an i8?[0] argument.
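
For reference, in the protobuf serialization a cast expression can be sketched roughly as follows (a sketch mirroring the published message style; consult the algebra definition for the authoritative field numbers):

message Cast {\n  // the type to cast the input to\n  Type type = 1;\n  // the expression producing the value to convert\n  Expression input = 2;\n  // whether a failed cast returns null or raises an error\n  FailureBehavior failure_behavior = 3;\n\n  enum FailureBehavior {\n    FAILURE_BEHAVIOR_UNSPECIFIED = 0;\n    FAILURE_BEHAVIOR_RETURN_NULL = 1;\n    FAILURE_BEHAVIOR_THROW_EXCEPTION = 2;\n  }\n}\n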

    "},{"location":"expressions/specialized_record_expressions/#if-expression","title":"If Expression","text":"

    An if value expression is an expression composed of one if clause, zero or more else if clauses and an else clause. In pseudocode, they are envisioned as:

    if <boolean expression> then <result expression 1>\nelse if <boolean expression> then <result expression 2> (zero or more times)\nelse <result expression 3>\n

When an if expression is declared, all result expressions must be of the same type.

    "},{"location":"expressions/specialized_record_expressions/#shortcut-behavior","title":"Shortcut Behavior","text":"

An if expression is expected to logically short-circuit on a positive outcome. This means that a skipped else/elseif expression cannot cause an error. For example, the following should not throw an error, even though the cast operation would fail if evaluated:

    if 'value' = 'value' then 0\nelse cast('hello' as integer) \n
    "},{"location":"expressions/specialized_record_expressions/#switch-expression","title":"Switch Expression","text":"

Switch expressions allow selection among alternate branches based on the value of a given expression. They are an optimized form of a generic if expression in which all conditions are equality comparisons against the same value. In pseudocode:

    switch(value)\n<value 1> => <return 1> (1 or more times)\n<else> => <return default>\n

    Return values for a switch expression must all be of identical type.

    "},{"location":"expressions/specialized_record_expressions/#shortcut-behavior_1","title":"Shortcut Behavior","text":"

    As in if expressions, switch expression evaluation should not be interrupted by \u201croads not taken\u201d.

    "},{"location":"expressions/specialized_record_expressions/#or-list-equality-expression","title":"Or List Equality Expression","text":"

    A specialized structure that is often used is a large list of possible values. In SQL, these are typically large IN lists. They can be composed from one or more fields. There are two common patterns, single value and multi value. In pseudocode they are represented as:

    Single Value:\nexpression, [<value1>, <value2>, ... <valueN>]\n\nMulti Value:\n[expressionA, expressionB], [[value1a, value1b], [value2a, value2b].. [valueNa, valueNb]]\n

For single value expressions, these are a compact equivalent of expression = value1 OR expression = value2 OR ... OR expression = valueN. When using an expression of this type, two things are required: the test expression and all value expressions must be of the same type, and a function signature for equality must be available for that type.
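
For example (hypothetical column and values), the single-value form on the first line is a compact equivalent of the expanded disjunction on the last line:

x, [1, 2, 3]\n-- equivalent to\nx = 1 OR x = 2 OR x = 3\n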

    "},{"location":"expressions/subqueries/","title":"Subqueries","text":"

Subqueries are scalar expressions composed of another query.

    "},{"location":"expressions/subqueries/#forms","title":"Forms","text":""},{"location":"expressions/subqueries/#scalar","title":"Scalar","text":"

    Scalar subqueries are subqueries that return one row and one column.
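
For example (hypothetical tables), the subquery below returns a single aggregate value and can therefore be used as a scalar expression:

SELECT *\nFROM t1\nWHERE x = (SELECT max(y) FROM t2)\n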

    Property Description Required Input Input relation Yes"},{"location":"expressions/subqueries/#in-predicate","title":"IN predicate","text":"

    An IN subquery predicate checks that the left expression is contained in the right subquery.

    "},{"location":"expressions/subqueries/#examples","title":"Examples","text":"
    SELECT *\nFROM t1\nWHERE x IN (SELECT * FROM t2)\n
    SELECT *\nFROM t1\nWHERE (x, y) IN (SELECT a, b FROM t2)\n
    Property Description Required Needles Expressions whose existence will be checked Yes Haystack Subquery to check Yes"},{"location":"expressions/subqueries/#set-predicates","title":"Set predicates","text":"

    A set predicate is a predicate over a set of rows in the form of a subquery.

    EXISTS and UNIQUE are common SQL spellings of these kinds of predicates.
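
For example (hypothetical tables), an EXISTS set predicate in SQL:

SELECT *\nFROM t1\nWHERE EXISTS (SELECT 1 FROM t2 WHERE t2.a = t1.x)\n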

    Property Description Required Operation The operation to perform over the set Yes Tuples Set of tuples to check using the operation Yes"},{"location":"expressions/subqueries/#set-comparisons","title":"Set comparisons","text":"

    A set comparison subquery is a subquery comparison using ANY or ALL operations.

    "},{"location":"expressions/subqueries/#examples_1","title":"Examples","text":"
    SELECT *\nFROM t1\nWHERE x < ANY(SELECT y from t2)\n
    Property Description Required Reduction operation The kind of reduction to use over the subquery Yes Comparison operation The kind of comparison operation to use Yes Expression Left-hand side expression to check Yes Subquery Subquery to check Yes Protobuf Representation
    message Subquery {\n  oneof subquery_type {\n    // Scalar subquery\n    Scalar scalar = 1;\n    // x IN y predicate\n    InPredicate in_predicate = 2;\n    // EXISTS/UNIQUE predicate\n    SetPredicate set_predicate = 3;\n    // ANY/ALL predicate\n    SetComparison set_comparison = 4;\n  }\n\n  // A subquery with one row and one column. This is often an aggregate\n  // though not required to be.\n  message Scalar {\n    Rel input = 1;\n  }\n\n  // Predicate checking that the left expression is contained in the right\n  // subquery\n  //\n  // Examples:\n  //\n  // x IN (SELECT * FROM t)\n  // (x, y) IN (SELECT a, b FROM t)\n  message InPredicate {\n    repeated Expression needles = 1;\n    Rel haystack = 2;\n  }\n\n  // A predicate over a set of rows in the form of a subquery\n  // EXISTS and UNIQUE are common SQL forms of this operation.\n  message SetPredicate {\n    enum PredicateOp {\n      PREDICATE_OP_UNSPECIFIED = 0;\n      PREDICATE_OP_EXISTS = 1;\n      PREDICATE_OP_UNIQUE = 2;\n    }\n    // TODO: should allow expressions\n    PredicateOp predicate_op = 1;\n    Rel tuples = 2;\n  }\n\n  // A subquery comparison using ANY or ALL.\n  // Examples:\n  //\n  // SELECT *\n  // FROM t1\n  // WHERE x < ANY(SELECT y from t2)\n  message SetComparison {\n    enum ComparisonOp {\n      COMPARISON_OP_UNSPECIFIED = 0;\n      COMPARISON_OP_EQ = 1;\n      COMPARISON_OP_NE = 2;\n      COMPARISON_OP_LT = 3;\n      COMPARISON_OP_GT = 4;\n      COMPARISON_OP_LE = 5;\n      COMPARISON_OP_GE = 6;\n    }\n\n    enum ReductionOp {\n      REDUCTION_OP_UNSPECIFIED = 0;\n      REDUCTION_OP_ANY = 1;\n      REDUCTION_OP_ALL = 2;\n    }\n\n    // ANY or ALL\n    ReductionOp reduction_op = 1;\n    // A comparison operator\n    ComparisonOp comparison_op = 2;\n    // left side of the expression\n    Expression left = 3;\n    // right side of the expression\n    Rel right = 4;\n  }\n}\n
    "},{"location":"expressions/table_functions/","title":"Table Functions","text":"

    Table functions produce zero or more records for each input record. Table functions use a signature similar to scalar functions. However, they are not allowed in the same contexts.

    to be completed\u2026

    "},{"location":"expressions/user_defined_functions/","title":"User-Defined Functions","text":"

Substrait supports the creation of custom functions through simple extensions, using the facilities described in scalar functions. The functions defined by Substrait itself use the same mechanism. The extension files for standard functions can be found here.

    Here\u2019s an example function that doubles its input:

    Implementation Note

    This implementation is only defined on 32-bit floats and integers but could be defined on all numbers (and even lists and strings). The user of the implementation can specify what happens when the resulting value falls outside of the valid range for a 32-bit float (either return NAN or raise an error).

    %YAML 1.2\n---\nscalar_functions:\n  -\n    name: \"double\"\n    description: \"Double the value\"\n    impls:\n      - args:\n          - name: x\n            value: fp32\n        options:\n          on_domain_error:\n            values: [ NAN, ERROR ]\n        return: fp32\n      - args:\n          - name: x\n            value: i32\n        options:\n          on_domain_error:\n            values: [ NAN, ERROR ]\n        return: i32\n
    "},{"location":"expressions/window_functions/","title":"Window Functions","text":"

Window functions are functions which consume values from multiple records to produce a single output. They are similar to aggregate functions, but they operate on a focused window of analysis within their partition. To an end user, window functions behave like scalar expressions, producing a single value for each input record; however, each of those values may be computed from many input records.

Window function signatures contain all the properties defined for aggregate functions. Additionally, they contain the properties below:

    Property Description Required Inherits All properties defined for aggregate functions. N/A Window Type STREAMING or PARTITION. Describes whether the function needs to see all data for the specific partition operation simultaneously. Operations like SUM can produce values in a streaming manner with no complete visibility of the partition. NTILE requires visibility of the entire partition before it can start producing values. Optional, defaults to PARTITION
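
To illustrate, a window function declaration in a YAML extension file might set this property as follows (the function name and argument are hypothetical):

%YAML 1.2\n---\nwindow_functions:\n  -\n    name: \"my_window_fn\"\n    description: \"Hypothetical function that needs the whole partition\"\n    impls:\n      - args:\n          - name: x\n            value: i64\n        return: i64?\n        # requires full visibility of the partition before producing values\n        window_type: PARTITION\n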

When binding a window function, the binding must include the following additional properties beyond the standard scalar binding properties:

Property Description Required Partition A list of partitioning expressions. False, defaults to a single partition for the entire dataset Lower Bound Bound Preceding(int64), Bound Following(int64) or CurrentRow. False, defaults to start of partition Upper Bound Bound Preceding(int64), Bound Following(int64) or CurrentRow. False, defaults to end of partition"},{"location":"expressions/window_functions/#aggregate-functions-as-window-functions","title":"Aggregate Functions as Window Functions","text":"

Aggregate functions can be treated as window functions with Window Type set to STREAMING.

    AVG, COUNT, MAX, MIN and SUM are examples of aggregate functions that are commonly allowed in window contexts.

    "},{"location":"extensions/","title":"Extensions","text":"

    In many cases, the existing objects in Substrait will be sufficient to accomplish a particular use case. However, it is sometimes helpful to create a new data type, scalar function signature or some other custom representation within a system. For that, Substrait provides a number of extension points.

    "},{"location":"extensions/#simple-extensions","title":"Simple Extensions","text":"

    Some kinds of primitives are so frequently extended that Substrait defines a standard YAML format that describes how the extended functionality can be interpreted. This allows different projects/systems to use the YAML definition as a specification so that interoperability isn\u2019t constrained to the base Substrait specification. The main types of extensions that are defined in this manner include the following:

    • Data types
    • Type variations
    • Scalar Functions
    • Aggregate Functions
    • Window Functions
    • Table Functions

    To extend these items, developers can create one or more YAML files at a defined URI that describes the properties of each of these extensions. The YAML file is constructed according to the YAML Schema. Each definition in the file corresponds to the YAML-based serialization of the relevant data structure. If a user only wants to extend one of these types of objects (e.g. types), a developer does not have to provide definitions for the other extension points.

    A Substrait plan can reference one or more YAML files via URI for extension. In the places where these entities are referenced, they will be referenced using a URI + name reference. The name scheme per type works as follows:

    Category Naming scheme Type The name as defined on the type object. Type Variation The name as defined on the type variation object. Function Signature A function signature compound name as described below.

    A YAML file can also reference types and type variations defined in another YAML file. To do this, it must declare the YAML file it depends on using a key-value pair in the dependencies key, where the value is the URI to the YAML file, and the key is a valid identifier that can then be used as an identifier-safe alias for the URI. This alias can then be used as a .-separated namespace prefix wherever a type class or type variation name is expected.

    For example, if the YAML file at file:///extension_types.yaml defines a type called point, a different YAML file can use the type in a function declaration as follows:

    dependencies:\n  ext: file:///extension_types.yaml\nscalar_functions:\n- name: distance\n  description: The distance between two points.\n  impls:\n  - args:\n    - name: a\n      value: ext.point\n    - name: b\n      value: ext.point\n    return: f64\n

    Here, the choice for the name ext is arbitrary, as long as it does not conflict with anything else in the YAML file.

    "},{"location":"extensions/#function-signature-compound-names","title":"Function Signature Compound Names","text":"

A YAML file may contain one or more functions with the same name. The key used in the function extension declaration to reference a function is a combination of the function name and a list of the required input argument types. The format is as follows:

    <function name>:<short_arg_type0>_<short_arg_type1>_..._<short_arg_typeN>\n

Rather than using a full data type representation, the input argument types (short_arg_type) are mapped to a single-level short name. The mappings are listed in the table below.

    Note

    Every compound function signature must be unique. If two function implementations in a YAML file would generate the same compound function signature, then the YAML file is invalid and behavior is undefined.

    Argument Type Signature Name Required Enumeration req i8 i8 i16 i16 i32 i32 i64 i64 fp32 fp32 fp64 fp64 string str binary vbin boolean bool timestamp ts timestamp_tz tstz date date time time interval_year iyear interval_day iday interval_compound icompound uuid uuid fixedchar<N> fchar varchar<N> vchar fixedbinary<N> fbin decimal<P,S> dec precision_timestamp<P> pts precision_timestamp_tz<P> ptstz struct<T1,T2,\u2026,TN> struct list<T> list map<K,V> map any[\\d]? any user defined type u!name"},{"location":"extensions/#examples","title":"Examples","text":"Function Signature Function Name add(optional enumeration, i8, i8) => i8 add:i8_i8 avg(fp32) => fp32 avg:fp32 extract(required enumeration, timestamp) => i64 extract:req_ts sum(any1) => any1 sum:any"},{"location":"extensions/#any-types","title":"Any Types","text":"
    scalar_functions:\n- name: foo\n  impls:\n  - args:\n    - name: a\n      value: any\n    - name: b\n      value: any\n    return: int64\n

    The any type indicates that the argument can take any possible type. In the foo function above, arguments a and b can be of any type, even different ones in the same function invocation.

    scalar_functions:\n- name: bar\n  impls:\n  - args:\n    - name: a\n      value: any1\n    - name: b\n      value: any1\n    return: int64\n
The any[\\d] types (i.e. any1, any2, \u2026, any9) impose an additional restriction. Within a single function invocation, all any types with the same numeric suffix must be of the same type. In the bar function above, arguments a and b can have any type as long as both types are the same.

    "},{"location":"extensions/#advanced-extensions","title":"Advanced Extensions","text":"

Less common extension needs can be addressed using customization support at the serialization level. This includes the following kinds of extensions:

Extension Type Description Relation Modification (semantic) Extensions to an existing relation that will alter the semantics of that relation. These kinds of extensions require that any plan consumer understand the extension to be able to manipulate or execute that operator. Ignoring these extensions will result in an incorrect interpretation of the plan. An example extension might be creating a customized version of Aggregate that can optionally apply a filter before aggregating the data. Note: Semantic-changing extensions shouldn\u2019t change the core characteristics of the underlying relation. For example, they should not change the default direct output field ordering, change the number of fields output or change the behavior of physical property characteristics. If one needs to change one of these behaviors, one should define a new relation as described below. Relation Modification (optimization) Extensions to an existing relation that can improve the efficiency of a plan consumer but don\u2019t fundamentally change the behavior of the operation. An example might be an estimated amount of memory the relation is expected to use or a particular algorithmic pattern that is perceived to be optimal. New Relations Creates an entirely new kind of relation. This is the most flexible way to extend Substrait, but it also makes the Substrait plan the least interoperable. In most cases it is better to use a semantic-changing relation as opposed to a new relation, as it means existing code patterns can easily be extended to work with the additional properties. New Read Types Defines a new subcategory of read that can be used in a ReadRel. One of the goals of Substrait is to provide a fairly extensive set of read patterns within the project, as opposed to requiring people to define new types externally. As such, we suggest that you first talk with the Substrait community to determine whether your read type can be incorporated directly in the core specification. New Write Types Similar to a read type but for writes. As with reads, the community recommends that interested extenders first discuss new write types with the Substrait community before using the extension mechanisms. Plan Extensions Semantic and/or optimization based additions at the plan level.

    Because extension mechanisms are different for each serialization format, please refer to the corresponding serialization sections to understand how these extensions are defined in more detail.

    "},{"location":"extensions/functions_aggregate_approx/","title":"functions_aggregate_approx.yaml","text":"

    This document file is generated for functions_aggregate_approx.yaml

    "},{"location":"extensions/functions_aggregate_approx/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_aggregate_approx/#approx_count_distinct","title":"approx_count_distinct","text":"

    Implementations: approx_count_distinct(x): -> return_type 0. approx_count_distinct(any): -> i64

    Calculates the approximate number of rows that contain distinct values of the expression argument using HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact result.

    "},{"location":"extensions/functions_aggregate_decimal_output/","title":"functions_aggregate_decimal_output.yaml","text":"

    This document file is generated for functions_aggregate_decimal_output.yaml

    "},{"location":"extensions/functions_aggregate_decimal_output/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_aggregate_decimal_output/#count","title":"count","text":"

    Implementations: count(x, option:overflow): -> return_type 0. count(any, option:overflow): -> decimal<38,0>

    Count a set of values. Result is returned as a decimal instead of i64.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_aggregate_decimal_output/#count_1","title":"count","text":"

    Implementations:

Count a set of records (no field referenced). Result is returned as a decimal instead of i64.

    "},{"location":"extensions/functions_aggregate_decimal_output/#approx_count_distinct","title":"approx_count_distinct","text":"

    Implementations: approx_count_distinct(x): -> return_type 0. approx_count_distinct(any): -> decimal<38,0>

    Calculates the approximate number of rows that contain distinct values of the expression argument using HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact result. Result is returned as a decimal instead of i64.

    "},{"location":"extensions/functions_aggregate_generic/","title":"functions_aggregate_generic.yaml","text":"

    This document file is generated for functions_aggregate_generic.yaml

    "},{"location":"extensions/functions_aggregate_generic/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_aggregate_generic/#count","title":"count","text":"

    Implementations: count(x, option:overflow): -> return_type 0. count(any, option:overflow): -> i64

    Count a set of values

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_aggregate_generic/#count_1","title":"count","text":"

    Implementations:

Count a set of records (no field referenced)

    "},{"location":"extensions/functions_aggregate_generic/#any_value","title":"any_value","text":"

    Implementations: any_value(x, option:ignore_nulls): -> return_type 0. any_value(any1, option:ignore_nulls): -> any1?

    *Selects an arbitrary value from a group of values. If the input is empty, the function returns null. *

    Options:
  • ignore_nulls ['TRUE', 'FALSE']
  • "},{"location":"extensions/functions_arithmetic/","title":"functions_arithmetic.yaml","text":"

    This document file is generated for functions_arithmetic.yaml

    "},{"location":"extensions/functions_arithmetic/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_arithmetic/#add","title":"add","text":"

    Implementations: add(x, y, option:overflow): -> return_type 0. add(i8, i8, option:overflow): -> i8 1. add(i16, i16, option:overflow): -> i16 2. add(i32, i32, option:overflow): -> i32 3. add(i64, i64, option:overflow): -> i64 4. add(fp32, fp32, option:rounding): -> fp32 5. add(fp64, fp64, option:rounding): -> fp64

    Add two values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#subtract","title":"subtract","text":"

    Implementations: subtract(x, y, option:overflow): -> return_type 0. subtract(i8, i8, option:overflow): -> i8 1. subtract(i16, i16, option:overflow): -> i16 2. subtract(i32, i32, option:overflow): -> i32 3. subtract(i64, i64, option:overflow): -> i64 4. subtract(fp32, fp32, option:rounding): -> fp32 5. subtract(fp64, fp64, option:rounding): -> fp64

    Subtract one value from another.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#multiply","title":"multiply","text":"

    Implementations: multiply(x, y, option:overflow): -> return_type 0. multiply(i8, i8, option:overflow): -> i8 1. multiply(i16, i16, option:overflow): -> i16 2. multiply(i32, i32, option:overflow): -> i32 3. multiply(i64, i64, option:overflow): -> i64 4. multiply(fp32, fp32, option:rounding): -> fp32 5. multiply(fp64, fp64, option:rounding): -> fp64

    Multiply two values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#divide","title":"divide","text":"

    Implementations: divide(x, y, option:overflow, option:on_domain_error, option:on_division_by_zero): -> return_type 0. divide(i8, i8, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i8 1. divide(i16, i16, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i16 2. divide(i32, i32, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i32 3. divide(i64, i64, option:overflow, option:on_domain_error, option:on_division_by_zero): -> i64 4. divide(fp32, fp32, option:rounding, option:on_domain_error, option:on_division_by_zero): -> fp32 5. divide(fp64, fp64, option:rounding, option:on_domain_error, option:on_division_by_zero): -> fp64

*Divide x by y. In the case of integer division, partial values are truncated (i.e. rounded towards 0). The on_division_by_zero option governs behavior in cases where y is 0. If the option is IEEE then the IEEE754 standard is followed: all values except \u00b1infinity return NaN and \u00b1infinity are unchanged. If the option is LIMIT then the result is \u00b1infinity in all cases. If either x or y is NaN then behavior will be governed by on_domain_error. If x and y are both \u00b1infinity, behavior will be governed by on_domain_error. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • on_domain_error ['NULL', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_division_by_zero ['IEEE', 'LIMIT', 'NULL', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#negate","title":"negate","text":"

    Implementations: negate(x, option:overflow): -> return_type 0. negate(i8, option:overflow): -> i8 1. negate(i16, option:overflow): -> i16 2. negate(i32, option:overflow): -> i32 3. negate(i64, option:overflow): -> i64 4. negate(fp32): -> fp32 5. negate(fp64): -> fp64

    Negation of the value

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#modulus","title":"modulus","text":"

    Implementations: modulus(x, y, option:division_type, option:overflow, option:on_domain_error): -> return_type 0. modulus(i8, i8, option:division_type, option:overflow, option:on_domain_error): -> i8 1. modulus(i16, i16, option:division_type, option:overflow, option:on_domain_error): -> i16 2. modulus(i32, i32, option:division_type, option:overflow, option:on_domain_error): -> i32 3. modulus(i64, i64, option:division_type, option:overflow, option:on_domain_error): -> i64

*Calculate the remainder (r) when dividing dividend (x) by divisor (y). In mathematics, many conventions for the modulus (mod) operation exist. The result of a mod operation depends on the software implementation and underlying hardware. Substrait is a format for describing compute operations on structured data and designed for interoperability. Therefore the user is responsible for determining a definition of division as defined by the quotient (q). The following basic conditions of division are satisfied: (1) q \u2208 \u2124 (the quotient is an integer) (2) x = y * q + r (division rule) (3) abs(r) < abs(y). The division_type option determines the mathematical definition of quotient to use in the above definition of division. When division_type=TRUNCATE, q = trunc(x/y). When division_type=FLOOR, q = floor(x/y). In the cases of TRUNCATE and FLOOR division, the remainder is r = x - y * round_func(x/y). The on_domain_error option governs behavior in cases where y is 0, y is \u00b1inf, or x is \u00b1inf. In these cases the mod is undefined. The overflow option governs behavior when integer overflow occurs. If x and y are both 0 or both \u00b1infinity, behavior will be governed by on_domain_error. *
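
A small worked example (values chosen for illustration) showing how the division_type option changes the result:

x = 7, y = -3\nTRUNCATE: q = trunc(7 / -3) = -2, r = x - y * q = 7 - (-3 * -2) = 1\nFLOOR:    q = floor(7 / -3) = -3, r = x - y * q = 7 - (-3 * -3) = -2\nboth results satisfy x = y * q + r and abs(r) < abs(y)\n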

    Options:
  • division_type ['TRUNCATE', 'FLOOR']
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • on_domain_error ['NULL', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#power","title":"power","text":"

    Implementations: power(x, y, option:overflow): -> return_type 0. power(i64, i64, option:overflow): -> i64 1. power(fp32, fp32): -> fp32 2. power(fp64, fp64): -> fp64

Take the power with x as the base and y as the exponent.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#sqrt","title":"sqrt","text":"

    Implementations: sqrt(x, option:rounding, option:on_domain_error): -> return_type 0. sqrt(i64, option:rounding, option:on_domain_error): -> fp64 1. sqrt(fp32, option:rounding, option:on_domain_error): -> fp32 2. sqrt(fp64, option:rounding, option:on_domain_error): -> fp64

    Square root of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#exp","title":"exp","text":"

    Implementations: exp(x, option:rounding): -> return_type 0. exp(i64, option:rounding): -> fp64 1. exp(fp32, option:rounding): -> fp32 2. exp(fp64, option:rounding): -> fp64

    The mathematical constant e, raised to the power of the value.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#cos","title":"cos","text":"

    Implementations: cos(x, option:rounding): -> return_type 0. cos(fp32, option:rounding): -> fp32 1. cos(fp64, option:rounding): -> fp64

    Get the cosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#sin","title":"sin","text":"

    Implementations: sin(x, option:rounding): -> return_type 0. sin(fp32, option:rounding): -> fp32 1. sin(fp64, option:rounding): -> fp64

    Get the sine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#tan","title":"tan","text":"

    Implementations: tan(x, option:rounding): -> return_type 0. tan(fp32, option:rounding): -> fp32 1. tan(fp64, option:rounding): -> fp64

    Get the tangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#cosh","title":"cosh","text":"

    Implementations: cosh(x, option:rounding): -> return_type 0. cosh(fp32, option:rounding): -> fp32 1. cosh(fp64, option:rounding): -> fp64

    Get the hyperbolic cosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#sinh","title":"sinh","text":"

    Implementations: sinh(x, option:rounding): -> return_type 0. sinh(fp32, option:rounding): -> fp32 1. sinh(fp64, option:rounding): -> fp64

    Get the hyperbolic sine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#tanh","title":"tanh","text":"

    Implementations: tanh(x, option:rounding): -> return_type 0. tanh(fp32, option:rounding): -> fp32 1. tanh(fp64, option:rounding): -> fp64

    Get the hyperbolic tangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#acos","title":"acos","text":"

    Implementations: acos(x, option:rounding, option:on_domain_error): -> return_type 0. acos(fp32, option:rounding, option:on_domain_error): -> fp32 1. acos(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arccosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#asin","title":"asin","text":"

    Implementations: asin(x, option:rounding, option:on_domain_error): -> return_type 0. asin(fp32, option:rounding, option:on_domain_error): -> fp32 1. asin(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arcsine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#atan","title":"atan","text":"

    Implementations: atan(x, option:rounding): -> return_type 0. atan(fp32, option:rounding): -> fp32 1. atan(fp64, option:rounding): -> fp64

    Get the arctangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#acosh","title":"acosh","text":"

    Implementations: acosh(x, option:rounding, option:on_domain_error): -> return_type 0. acosh(fp32, option:rounding, option:on_domain_error): -> fp32 1. acosh(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the hyperbolic arccosine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#asinh","title":"asinh","text":"

    Implementations: asinh(x, option:rounding): -> return_type 0. asinh(fp32, option:rounding): -> fp32 1. asinh(fp64, option:rounding): -> fp64

    Get the hyperbolic arcsine of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#atanh","title":"atanh","text":"

    Implementations: atanh(x, option:rounding, option:on_domain_error): -> return_type 0. atanh(fp32, option:rounding, option:on_domain_error): -> fp32 1. atanh(fp64, option:rounding, option:on_domain_error): -> fp64

    Get the hyperbolic arctangent of a value in radians.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#atan2","title":"atan2","text":"

    Implementations: atan2(x, y, option:rounding, option:on_domain_error): -> return_type 0. atan2(fp32, fp32, option:rounding, option:on_domain_error): -> fp32 1. atan2(fp64, fp64, option:rounding, option:on_domain_error): -> fp64

    Get the arctangent of values given as x/y pairs.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#radians","title":"radians","text":"

    Implementations: radians(x, option:rounding): -> return_type 0. radians(fp32, option:rounding): -> fp32 1. radians(fp64, option:rounding): -> fp64

    *Converts angle x in degrees to radians. *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#degrees","title":"degrees","text":"

    Implementations: degrees(x, option:rounding): -> return_type 0. degrees(fp32, option:rounding): -> fp32 1. degrees(fp64, option:rounding): -> fp64

    *Converts angle x in radians to degrees. *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#abs","title":"abs","text":"

    Implementations: abs(x, option:overflow): -> return_type 0. abs(i8, option:overflow): -> i8 1. abs(i16, option:overflow): -> i16 2. abs(i32, option:overflow): -> i32 3. abs(i64, option:overflow): -> i64 4. abs(fp32): -> fp32 5. abs(fp64): -> fp64

*Calculate the absolute value of the argument. Integer values allow the specification of overflow behavior to handle the unevenness of two\u2019s complement, e.g. the Int8 range [-128 : 127]. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#sign","title":"sign","text":"

    Implementations: sign(x): -> return_type 0. sign(i8): -> i8 1. sign(i16): -> i16 2. sign(i32): -> i32 3. sign(i64): -> i64 4. sign(fp32): -> fp32 5. sign(fp64): -> fp64

*Return the signedness of the argument. Integer values return signedness with the same type as the input. Possible return values are [-1, 0, 1]. Floating point values return signedness with the same type as the input. Possible return values are [-1.0, -0.0, 0.0, 1.0, NaN]. *

    "},{"location":"extensions/functions_arithmetic/#factorial","title":"factorial","text":"

    Implementations: factorial(n, option:overflow): -> return_type 0. factorial(i32, option:overflow): -> i32 1. factorial(i64, option:overflow): -> i64

*Return the factorial of a given integer input. The factorial of 0 is 1 by convention (0! = 1). Negative inputs will raise an error. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#bitwise_not","title":"bitwise_not","text":"

    Implementations: bitwise_not(x): -> return_type 0. bitwise_not(i8): -> i8 1. bitwise_not(i16): -> i16 2. bitwise_not(i32): -> i32 3. bitwise_not(i64): -> i64

    *Return the bitwise NOT result for one integer input. *

    "},{"location":"extensions/functions_arithmetic/#bitwise_and","title":"bitwise_and","text":"

    Implementations: bitwise_and(x, y): -> return_type 0. bitwise_and(i8, i8): -> i8 1. bitwise_and(i16, i16): -> i16 2. bitwise_and(i32, i32): -> i32 3. bitwise_and(i64, i64): -> i64

    *Return the bitwise AND result for two integer inputs. *

    "},{"location":"extensions/functions_arithmetic/#bitwise_or","title":"bitwise_or","text":"

    Implementations: bitwise_or(x, y): -> return_type 0. bitwise_or(i8, i8): -> i8 1. bitwise_or(i16, i16): -> i16 2. bitwise_or(i32, i32): -> i32 3. bitwise_or(i64, i64): -> i64

    *Return the bitwise OR result for two given integer inputs. *

    "},{"location":"extensions/functions_arithmetic/#bitwise_xor","title":"bitwise_xor","text":"

    Implementations: bitwise_xor(x, y): -> return_type 0. bitwise_xor(i8, i8): -> i8 1. bitwise_xor(i16, i16): -> i16 2. bitwise_xor(i32, i32): -> i32 3. bitwise_xor(i64, i64): -> i64

    *Return the bitwise XOR result for two integer inputs. *

    "},{"location":"extensions/functions_arithmetic/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_arithmetic/#sum","title":"sum","text":"

    Implementations: sum(x, option:overflow): -> return_type 0. sum(i8, option:overflow): -> i64? 1. sum(i16, option:overflow): -> i64? 2. sum(i32, option:overflow): -> i64? 3. sum(i64, option:overflow): -> i64? 4. sum(fp32, option:overflow): -> fp64? 5. sum(fp64, option:overflow): -> fp64?

    Sum a set of values. The sum of zero elements yields null.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#sum0","title":"sum0","text":"

    Implementations: sum0(x, option:overflow): -> return_type 0. sum0(i8, option:overflow): -> i64 1. sum0(i16, option:overflow): -> i64 2. sum0(i32, option:overflow): -> i64 3. sum0(i64, option:overflow): -> i64 4. sum0(fp32, option:overflow): -> fp64 5. sum0(fp64, option:overflow): -> fp64

    *Sum a set of values. The sum of zero elements yields zero. Null values are ignored. *

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#avg","title":"avg","text":"

    Implementations: avg(x, option:overflow): -> return_type 0. avg(i8, option:overflow): -> i8? 1. avg(i16, option:overflow): -> i16? 2. avg(i32, option:overflow): -> i32? 3. avg(i64, option:overflow): -> i64? 4. avg(fp32, option:overflow): -> fp32? 5. avg(fp64, option:overflow): -> fp64?

    Average a set of values. For integral types, this truncates partial values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#min","title":"min","text":"

    Implementations: min(x): -> return_type 0. min(i8): -> i8? 1. min(i16): -> i16? 2. min(i32): -> i32? 3. min(i64): -> i64? 4. min(fp32): -> fp32? 5. min(fp64): -> fp64?

    Min a set of values.

    "},{"location":"extensions/functions_arithmetic/#max","title":"max","text":"

    Implementations: max(x): -> return_type 0. max(i8): -> i8? 1. max(i16): -> i16? 2. max(i32): -> i32? 3. max(i64): -> i64? 4. max(fp32): -> fp32? 5. max(fp64): -> fp64?

    Max a set of values.

    "},{"location":"extensions/functions_arithmetic/#product","title":"product","text":"

    Implementations: product(x, option:overflow): -> return_type 0. product(i8, option:overflow): -> i8 1. product(i16, option:overflow): -> i16 2. product(i32, option:overflow): -> i32 3. product(i64, option:overflow): -> i64 4. product(fp32, option:rounding): -> fp32 5. product(fp64, option:rounding): -> fp64

    Product of a set of values. Returns 1 for empty input.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#std_dev","title":"std_dev","text":"

    Implementations: std_dev(x, option:rounding, option:distribution): -> return_type 0. std_dev(fp32, option:rounding, option:distribution): -> fp32? 1. std_dev(fp64, option:rounding, option:distribution): -> fp64?

Calculates the standard deviation for a set of values.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • distribution ['SAMPLE', 'POPULATION']
  • "},{"location":"extensions/functions_arithmetic/#variance","title":"variance","text":"

    Implementations: variance(x, option:rounding, option:distribution): -> return_type 0. variance(fp32, option:rounding, option:distribution): -> fp32? 1. variance(fp64, option:rounding, option:distribution): -> fp64?

    Calculates variance for a set of values.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • distribution ['SAMPLE', 'POPULATION']
  • "},{"location":"extensions/functions_arithmetic/#corr","title":"corr","text":"

    Implementations: corr(x, y, option:rounding): -> return_type 0. corr(fp32, fp32, option:rounding): -> fp32? 1. corr(fp64, fp64, option:rounding): -> fp64?

    *Calculates the value of Pearson\u2019s correlation coefficient between x and y. If there is no input, null is returned. *

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#mode","title":"mode","text":"

    Implementations: mode(x): -> return_type 0. mode(i8): -> i8? 1. mode(i16): -> i16? 2. mode(i32): -> i32? 3. mode(i64): -> i64? 4. mode(fp32): -> fp32? 5. mode(fp64): -> fp64?

    *Calculates mode for a set of values. If there is no input, null is returned. *

    "},{"location":"extensions/functions_arithmetic/#median","title":"median","text":"

    Implementations: median(precision, x, option:rounding): -> return_type 0. median(precision, i8, option:rounding): -> i8? 1. median(precision, i16, option:rounding): -> i16? 2. median(precision, i32, option:rounding): -> i32? 3. median(precision, i64, option:rounding): -> i64? 4. median(precision, fp32, option:rounding): -> fp32? 5. median(precision, fp64, option:rounding): -> fp64?

    *Calculate the median for a set of values. Returns null if applied to zero records. For the integer implementations, the rounding option determines how the median should be rounded if it ends up midway between two values. For the floating point implementations, they specify the usual floating point rounding mode. *

    Options:
  • precision ['EXACT', 'APPROXIMATE']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#quantile","title":"quantile","text":"

    Implementations: quantile(boundaries, precision, n, distribution, option:rounding): -> return_type

  • n: A positive integer which defines the number of quantile partitions.
  • distribution: The data for which the quantiles should be computed.
0. quantile(boundaries, precision, i64, any, option:rounding): -> LIST?<any>

    *Calculates quantiles for a set of values. This function will divide the aggregated values (passed via the distribution argument) over N equally-sized bins, where N is passed via a constant argument. It will then return the values at the boundaries of these bins in list form. If the input is appropriately sorted, this computes the quantiles of the distribution. The function can optionally return the first and/or last element of the input, as specified by the boundaries argument. If the input is appropriately sorted, this will thus be the minimum and/or maximum values of the distribution. When the boundaries do not lie exactly on elements of the incoming distribution, the function will interpolate between the two nearby elements. If the interpolated value cannot be represented exactly, the rounding option controls how the value should be selected or computed. The function fails and returns null in the following cases: - n is null or less than one; - any value in distribution is null.

    The function returns an empty list if n equals 1 and boundaries is set to NEITHER. *

    Options:
  • boundaries ['NEITHER', 'MINIMUM', 'MAXIMUM', 'BOTH']
  • precision ['EXACT', 'APPROXIMATE']
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • "},{"location":"extensions/functions_arithmetic/#window-functions","title":"Window Functions","text":""},{"location":"extensions/functions_arithmetic/#row_number","title":"row_number","text":"

    Implementations: 0. row_number(): -> i64?

    the number of the current row within its partition, starting at 1

    "},{"location":"extensions/functions_arithmetic/#rank","title":"rank","text":"

    Implementations: 0. rank(): -> i64?

    the rank of the current row, with gaps.

    "},{"location":"extensions/functions_arithmetic/#dense_rank","title":"dense_rank","text":"

    Implementations: 0. dense_rank(): -> i64?

    the rank of the current row, without gaps.

    "},{"location":"extensions/functions_arithmetic/#percent_rank","title":"percent_rank","text":"

    Implementations: 0. percent_rank(): -> fp64?

    the relative rank of the current row.

    "},{"location":"extensions/functions_arithmetic/#cume_dist","title":"cume_dist","text":"

    Implementations: 0. cume_dist(): -> fp64?

    the cumulative distribution.

    "},{"location":"extensions/functions_arithmetic/#ntile","title":"ntile","text":"

    Implementations: ntile(x): -> return_type 0. ntile(i32): -> i32? 1. ntile(i64): -> i64?

Return an integer ranging from 1 to the argument value, dividing the partition as equally as possible.

    "},{"location":"extensions/functions_arithmetic/#first_value","title":"first_value","text":"

    Implementations: first_value(expression): -> return_type 0. first_value(any1): -> any1

    *Returns the first value in the window. *

    "},{"location":"extensions/functions_arithmetic/#last_value","title":"last_value","text":"

    Implementations: last_value(expression): -> return_type 0. last_value(any1): -> any1

    *Returns the last value in the window. *

    "},{"location":"extensions/functions_arithmetic/#nth_value","title":"nth_value","text":"

    Implementations: nth_value(expression, window_offset, option:on_domain_error): -> return_type 0. nth_value(any1, i32, option:on_domain_error): -> any1?

    *Returns a value from the nth row based on the window_offset. window_offset should be a positive integer. If the value of the window_offset is outside the range of the window, null is returned. The on_domain_error option governs behavior in cases where window_offset is not a positive integer or null. *

    Options:
  • on_domain_error ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic/#lead","title":"lead","text":"

    Implementations: lead(expression): -> return_type 0. lead(any1): -> any1? 1. lead(any1, i32): -> any1? 2. lead(any1, i32, any1): -> any1?

    *Return a value from a following row based on a specified physical offset. This allows you to compare a value in the current row against a following row. The expression is evaluated against a row that comes after the current row based on the row_offset. The row_offset should be a positive integer and is set to 1 if not specified explicitly. If the row_offset is negative, the expression will be evaluated against a row coming before the current row, similar to the lag function. A row_offset of null will return null. The function returns the default input value if row_offset goes beyond the scope of the window. If a default value is not specified, it is set to null. Example comparing the sales of the current year to the following year. row_offset of 1. | year | sales | next_year_sales | | 2019 | 20.50 | 30.00 | | 2020 | 30.00 | 45.99 | | 2021 | 45.99 | null | *

    "},{"location":"extensions/functions_arithmetic/#lag","title":"lag","text":"

    Implementations: lag(expression): -> return_type 0. lag(any1): -> any1? 1. lag(any1, i32): -> any1? 2. lag(any1, i32, any1): -> any1?

    *Return a column value from a previous row based on a specified physical offset. This allows you to compare a value in the current row against a previous row. The expression is evaluated against a row that comes before the current row based on the row_offset. The expression can be a column, expression or subquery that evaluates to a single value. The row_offset should be a positive integer and is set to 1 if not specified explicitly. If the row_offset is negative, the expression will be evaluated against a row coming after the current row, similar to the lead function. A row_offset of null will return null. The function returns the default input value if row_offset goes beyond the scope of the partition. If a default value is not specified, it is set to null. Example comparing the sales of the current year to the previous year. row_offset of 1. | year | sales | previous_year_sales | | 2019 | 20.50 | null | | 2020 | 30.00 | 20.50 | | 2021 | 45.99 | 30.00 | *

    "},{"location":"extensions/functions_arithmetic_decimal/","title":"functions_arithmetic_decimal.yaml","text":"

    This document file is generated for functions_arithmetic_decimal.yaml

    "},{"location":"extensions/functions_arithmetic_decimal/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_arithmetic_decimal/#add","title":"add","text":"

    Implementations: add(x, y, option:overflow): -> return_type 0. add(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = max(S1,S2)\ninit_prec = init_scale + max(P1 - S1, P2 - S2) + 1\nmin_scale = min(init_scale, 6)\ndelta = init_prec - 38\nprec = min(init_prec, 38)\nscale_after_borrow = max(init_scale - delta, min_scale)\nscale = init_prec > 38 ? scale_after_borrow : init_scale\nDECIMAL<prec, scale>  \n

    Add two decimal values.
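
For illustration, applying this derivation to two arbitrarily chosen input types:

add(DECIMAL<10,2>, DECIMAL<5,4>):\n  init_scale = max(2,4) = 4\n  init_prec  = 4 + max(10-2, 5-4) + 1 = 13   # <= 38, no scale borrowing\n  result: DECIMAL<13,4>\n\nadd(DECIMAL<38,10>, DECIMAL<38,10>):\n  init_scale = 10\n  init_prec  = 10 + 28 + 1 = 39              # > 38, scale borrows\n  prec  = min(39, 38) = 38\n  scale = max(10 - (39 - 38), min(10, 6)) = 9\n  result: DECIMAL<38,9>\n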

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#subtract","title":"subtract","text":"

    Implementations: subtract(x, y, option:overflow): -> return_type 0. subtract(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = max(S1,S2)\ninit_prec = init_scale + max(P1 - S1, P2 - S2) + 1\nmin_scale = min(init_scale, 6)\ndelta = init_prec - 38\nprec = min(init_prec, 38)\nscale_after_borrow = max(init_scale - delta, min_scale)\nscale = init_prec > 38 ? scale_after_borrow : init_scale\nDECIMAL<prec, scale>  \n

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#multiply","title":"multiply","text":"

    Implementations: multiply(x, y, option:overflow): -> return_type 0. multiply(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = S1 + S2\ninit_prec = P1 + P2 + 1\nmin_scale = min(init_scale, 6)\ndelta = init_prec - 38\nprec = min(init_prec, 38)\nscale_after_borrow = max(init_scale - delta, min_scale)\nscale = init_prec > 38 ? scale_after_borrow : init_scale\nDECIMAL<prec, scale>  \n

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#divide","title":"divide","text":"

    Implementations: divide(x, y, option:overflow): -> return_type 0. divide(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = max(6, S1 + P2 + 1)\ninit_prec = P1 - S1 + P2 + init_scale\nmin_scale = min(init_scale, 6)\ndelta = init_prec - 38\nprec = min(init_prec, 38)\nscale_after_borrow = max(init_scale - delta, min_scale)\nscale = init_prec > 38 ? scale_after_borrow : init_scale\nDECIMAL<prec, scale>  \n

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#modulus","title":"modulus","text":"

    Implementations: modulus(x, y, option:overflow): -> return_type 0. modulus(decimal<P1,S1>, decimal<P2,S2>, option:overflow): ->

    init_scale = max(S1,S2)\ninit_prec = min(P1 - S1, P2 - S2) + init_scale\nmin_scale = min(init_scale, 6)\ndelta = init_prec - 38\nprec = min(init_prec, 38)\nscale_after_borrow = max(init_scale - delta, min_scale)\nscale = init_prec > 38 ? scale_after_borrow : init_scale\nDECIMAL<prec, scale>  \n

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#abs","title":"abs","text":"

    Implementations: abs(x): -> return_type 0. abs(decimal<P,S>): -> decimal<P,S>

    Calculate the absolute value of the argument.

    "},{"location":"extensions/functions_arithmetic_decimal/#bitwise_and","title":"bitwise_and","text":"

    Implementations: bitwise_and(x, y): -> return_type 0. bitwise_and(DECIMAL<P1,0>, DECIMAL<P2,0>): ->

    max_precision = max(P1, P2)\nDECIMAL<max_precision, 0>  \n

*Return the bitwise AND result for two decimal inputs. The scale of the inputs must be 0 (i.e. only integer values are allowed). *

    "},{"location":"extensions/functions_arithmetic_decimal/#bitwise_or","title":"bitwise_or","text":"

    Implementations: bitwise_or(x, y): -> return_type 0. bitwise_or(DECIMAL<P1,0>, DECIMAL<P2,0>): ->

    max_precision = max(P1, P2)\nDECIMAL<max_precision, 0>  \n

*Return the bitwise OR result for two given decimal inputs. The scale of the inputs must be 0 (i.e. only integer values are allowed). *

    "},{"location":"extensions/functions_arithmetic_decimal/#bitwise_xor","title":"bitwise_xor","text":"

    Implementations: bitwise_xor(x, y): -> return_type 0. bitwise_xor(DECIMAL<P1,0>, DECIMAL<P2,0>): ->

    max_precision = max(P1, P2)\nDECIMAL<max_precision, 0>  \n

*Return the bitwise XOR result for two given decimal inputs. The scale of the inputs must be 0 (i.e. only integer values are allowed). *

    "},{"location":"extensions/functions_arithmetic_decimal/#sqrt","title":"sqrt","text":"

    Implementations: sqrt(x): -> return_type 0. sqrt(DECIMAL<P,S>): -> fp64

    Square root of the value. Sqrt of 0 is 0 and sqrt of negative values will raise an error.

    "},{"location":"extensions/functions_arithmetic_decimal/#factorial","title":"factorial","text":"

    Implementations: factorial(n): -> return_type 0. factorial(DECIMAL<P,0>): -> DECIMAL<38,0>

    Return the factorial of a given decimal input. The scale of the input must be 0. By convention, 0! is 1. Negative inputs raise an error, as do inputs whose result would overflow.

    "},{"location":"extensions/functions_arithmetic_decimal/#power","title":"power","text":"

    Implementations: power(x, y, option:overflow, option:complex_number_result): -> return_type 0. power(DECIMAL<P1,S1>, DECIMAL<P2,S2>, option:overflow, option:complex_number_result): -> fp64

    Take the power with x as the base and y as the exponent. Behavior when the result would be a complex number is governed by the complex_number_result option.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • complex_number_result ['NAN', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_arithmetic_decimal/#sum","title":"sum","text":"

    Implementations: sum(x, option:overflow): -> return_type 0. sum(DECIMAL<P, S>, option:overflow): -> DECIMAL?<38,S>

    Sum a set of values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#avg","title":"avg","text":"

    Implementations: avg(x, option:overflow): -> return_type 0. avg(DECIMAL<P,S>, option:overflow): -> DECIMAL<38,S>

    Average a set of values.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_arithmetic_decimal/#min","title":"min","text":"

    Implementations: min(x): -> return_type 0. min(DECIMAL<P, S>): -> DECIMAL?<P, S>

    Return the minimum of a set of values.

    "},{"location":"extensions/functions_arithmetic_decimal/#max","title":"max","text":"

    Implementations: max(x): -> return_type 0. max(DECIMAL<P,S>): -> DECIMAL?<P, S>

    Return the maximum of a set of values.

    "},{"location":"extensions/functions_arithmetic_decimal/#sum0","title":"sum0","text":"

    Implementations: sum0(x, option:overflow): -> return_type 0. sum0(DECIMAL<P, S>, option:overflow): -> DECIMAL<38,S>

    Sum a set of values. The sum of zero elements yields zero. Null values are ignored.

    Options:
  • overflow ['SILENT', 'SATURATE', 'ERROR']
  • "},{"location":"extensions/functions_boolean/","title":"functions_boolean.yaml","text":"

    This document file is generated for functions_boolean.yaml

    "},{"location":"extensions/functions_boolean/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_boolean/#or","title":"or","text":"

    Implementations: or(a): -> return_type 0. or(boolean?): -> boolean?

    The boolean or using Kleene logic. This function behaves as follows with nulls:

    true or null = true
    null or true = true
    false or null = null
    null or false = null
    null or null = null

    In other words, in this context a null value really means "unknown", and an unknown value or true is always true. Behavior for 0 or 1 inputs is as follows: or() -> false; or(x) -> x.

    "},{"location":"extensions/functions_boolean/#and","title":"and","text":"

    Implementations: and(a): -> return_type 0. and(boolean?): -> boolean?

    The boolean and using Kleene logic. This function behaves as follows with nulls:

    true and null = null
    null and true = null
    false and null = false
    null and false = false
    null and null = null

    In other words, in this context a null value really means "unknown", and an unknown value and false is always false. Behavior for 0 or 1 inputs is as follows: and() -> true; and(x) -> x.
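
    A minimal Python sketch of these Kleene semantics for or and and, using None to stand in for null (function names are illustrative):

    # Kleene three-valued logic; None stands in for null.
    def kleene_or(*args):
        if any(a is True for a in args):
            return True
        return None if any(a is None for a in args) else False  # or() -> false

    def kleene_and(*args):
        if any(a is False for a in args):
            return False
        return None if any(a is None for a in args) else True  # and() -> true

    assert kleene_or(False, None) is None and kleene_or(None, True) is True
    assert kleene_and(False, None) is False and kleene_and(None, True) is None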

    "},{"location":"extensions/functions_boolean/#and_not","title":"and_not","text":"

    Implementations: and_not(a, b): -> return_type 0. and_not(boolean?, boolean?): -> boolean?

    The boolean and of one value and the negation of the other using Kleene logic. This function behaves as follows with nulls:

    true and not null = null
    null and not false = null
    false and not null = false
    null and not true = false
    null and not null = null

    In other words, in this context a null value really means "unknown", and an unknown value and not true is always false, as is false and not an unknown value.

    "},{"location":"extensions/functions_boolean/#xor","title":"xor","text":"

    Implementations: xor(a, b): -> return_type 0. xor(boolean?, boolean?): -> boolean?

    The boolean xor of two values using Kleene logic. When a null is encountered in either input, a null is output.

    "},{"location":"extensions/functions_boolean/#not","title":"not","text":"

    Implementations: not(a): -> return_type 0. not(boolean?): -> boolean?

    The not of a boolean value. When a null is input, a null is output.

    "},{"location":"extensions/functions_boolean/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_boolean/#bool_and","title":"bool_and","text":"

    Implementations: bool_and(a): -> return_type 0. bool_and(boolean): -> boolean?

    If any value in the input is false, false is returned. If the input is empty or only contains nulls, null is returned. Otherwise, true is returned.

    "},{"location":"extensions/functions_boolean/#bool_or","title":"bool_or","text":"

    Implementations: bool_or(a): -> return_type 0. bool_or(boolean): -> boolean?

    If any value in the input is true, true is returned. If the input is empty or only contains nulls, null is returned. Otherwise, false is returned.

    "},{"location":"extensions/functions_comparison/","title":"functions_comparison.yaml","text":"

    This document file is generated for functions_comparison.yaml

    "},{"location":"extensions/functions_comparison/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_comparison/#not_equal","title":"not_equal","text":"

    Implementations: not_equal(x, y): -> return_type 0. not_equal(any1, any1): -> boolean

    Whether two values are not equal. not_equal(x, y) := (x != y). If either or both of x and y are null, null is returned.

    "},{"location":"extensions/functions_comparison/#equal","title":"equal","text":"

    Implementations: equal(x, y): -> return_type 0. equal(any1, any1): -> boolean

    Whether two values are equal. equal(x, y) := (x == y). If either or both of x and y are null, null is returned.

    "},{"location":"extensions/functions_comparison/#is_not_distinct_from","title":"is_not_distinct_from","text":"

    Implementations: is_not_distinct_from(x, y): -> return_type 0. is_not_distinct_from(any1, any1): -> boolean

    Whether two values are equal. This function treats null values as comparable, so is_not_distinct_from(null, null) == True. This is in contrast to equal, in which null values do not compare.

    "},{"location":"extensions/functions_comparison/#is_distinct_from","title":"is_distinct_from","text":"

    Implementations: is_distinct_from(x, y): -> return_type 0. is_distinct_from(any1, any1): -> boolean

    Whether two values are not equal. This function treats null values as comparable, so is_distinct_from(null, null) == False. This is in contrast to equal, in which null values do not compare.
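
    A minimal Python sketch of this null-safe comparison, with None standing in for null (names are illustrative):

    def is_not_distinct_from(x, y):
        if x is None or y is None:
            return x is None and y is None  # nulls compare equal here
        return x == y

    def is_distinct_from(x, y):
        return not is_not_distinct_from(x, y)

    assert is_not_distinct_from(None, None) is True
    assert is_distinct_from(None, 1) is True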

    "},{"location":"extensions/functions_comparison/#lt","title":"lt","text":"

    Implementations: lt(x, y): -> return_type 0. lt(any1, any1): -> boolean

    Less than. lt(x, y) := (x < y). If either or both of x and y are null, null is returned.

    "},{"location":"extensions/functions_comparison/#gt","title":"gt","text":"

    Implementations: gt(x, y): -> return_type 0. gt(any1, any1): -> boolean

    Greater than. gt(x, y) := (x > y). If either or both of x and y are null, null is returned.

    "},{"location":"extensions/functions_comparison/#lte","title":"lte","text":"

    Implementations: lte(x, y): -> return_type 0. lte(any1, any1): -> boolean

    Less than or equal to. lte(x, y) := (x <= y). If either or both of x and y are null, null is returned.

    "},{"location":"extensions/functions_comparison/#gte","title":"gte","text":"

    Implementations: gte(x, y): -> return_type 0. gte(any1, any1): -> boolean

    Greater than or equal to. gte(x, y) := (x >= y). If either or both of x and y are null, null is returned.

    "},{"location":"extensions/functions_comparison/#between","title":"between","text":"

    Implementations: between(expression, low, high): -> return_type

  • expression: The expression to test for in the range defined by `low` and `high`.
  • low: The value to check if greater than or equal to.
  • high: The value to check if less than or equal to.
  • 0. between(any1, any1, any1): -> boolean

    Whether the expression is greater than or equal to low and less than or equal to high (expression BETWEEN low AND high). If low, high, or expression is null, null is returned.

    "},{"location":"extensions/functions_comparison/#is_null","title":"is_null","text":"

    Implementations: is_null(x): -> return_type 0. is_null(any1): -> boolean

    Whether a value is null. NaN is not null.

    "},{"location":"extensions/functions_comparison/#is_not_null","title":"is_not_null","text":"

    Implementations: is_not_null(x): -> return_type 0. is_not_null(any1): -> boolean

    Whether a value is not null. NaN is not null.

    "},{"location":"extensions/functions_comparison/#is_nan","title":"is_nan","text":"

    Implementations: is_nan(x): -> return_type 0. is_nan(fp32): -> boolean 1. is_nan(fp64): -> boolean

    Whether a value is not a number. If x is null, null is returned.

    "},{"location":"extensions/functions_comparison/#is_finite","title":"is_finite","text":"

    Implementations: is_finite(x): -> return_type 0. is_finite(fp32): -> boolean 1. is_finite(fp64): -> boolean

    Whether a value is finite (neither infinite nor NaN). If x is null, null is returned.

    "},{"location":"extensions/functions_comparison/#is_infinite","title":"is_infinite","text":"

    Implementations: is_infinite(x): -> return_type 0. is_infinite(fp32): -> boolean 1. is_infinite(fp64): -> boolean

    Whether a value is infinite. If x is null, null is returned.

    "},{"location":"extensions/functions_comparison/#nullif","title":"nullif","text":"

    Implementations: nullif(x, y): -> return_type 0. nullif(any1, any1): -> any1

    If two values are equal, return null. Otherwise, return the first value.

    "},{"location":"extensions/functions_comparison/#coalesce","title":"coalesce","text":"

    Implementations: 0. coalesce(any1, any1): -> any1

    Evaluate arguments from left to right and return the first argument that is not null. Once a non-null argument is found, the remaining arguments are not evaluated. If all arguments are null, return null.

    "},{"location":"extensions/functions_comparison/#least","title":"least","text":"

    Implementations: 0. least(any1, any1): -> any1

    Evaluates each argument and returns the smallest one. The function will return null if any argument evaluates to null.

    "},{"location":"extensions/functions_comparison/#least_skip_null","title":"least_skip_null","text":"

    Implementations: 0. least_skip_null(any1, any1): -> any1

    Evaluates each argument and returns the smallest one. The function will return null only if all arguments evaluate to null.

    "},{"location":"extensions/functions_comparison/#greatest","title":"greatest","text":"

    Implementations: 0. greatest(any1, any1): -> any1

    Evaluates each argument and returns the largest one. The function will return null if any argument evaluates to null.

    "},{"location":"extensions/functions_comparison/#greatest_skip_null","title":"greatest_skip_null","text":"

    Implementations: 0. greatest_skip_null(any1, any1): -> any1

    Evaluates each argument and returns the largest one. The function will return null only if all arguments evaluate to null.
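
    A minimal Python sketch of the null-handling differences among coalesce, least, and least_skip_null (None stands in for null; greatest and greatest_skip_null mirror the least variants with max):

    def coalesce(*args):
        return next((a for a in args if a is not None), None)

    def least(*args):
        return None if any(a is None for a in args) else min(args)

    def least_skip_null(*args):
        non_null = [a for a in args if a is not None]
        return min(non_null) if non_null else None

    assert coalesce(None, 3, 5) == 3
    assert least(None, 3) is None
    assert least_skip_null(None, 3) == 3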

    "},{"location":"extensions/functions_datetime/","title":"functions_datetime.yaml","text":"

    This document file is generated for functions_datetime.yaml

    "},{"location":"extensions/functions_datetime/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_datetime/#extract","title":"extract","text":"

    Implementations: extract(component, x, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.
  • 0. extract(component, timestamp_tz, string): -> i64 1. extract(component, precision_timestamp_tz<P>, string): -> i64 2. extract(component, timestamp): -> i64 3. extract(component, precision_timestamp<P>): -> i64 4. extract(component, date): -> i64 5. extract(component, time): -> i64 6. extract(component, indexing, timestamp_tz, string): -> i64 7. extract(component, indexing, precision_timestamp_tz<P>, string): -> i64 8. extract(component, indexing, timestamp): -> i64 9. extract(component, indexing, precision_timestamp<P>): -> i64 10. extract(component, indexing, date): -> i64

    Extract portion of a date/time value.
  • YEAR Return the year.
  • ISO_YEAR Return the ISO 8601 week-numbering year. First week of an ISO year has the majority (4 or more) of its days in January.
  • US_YEAR Return the US epidemiological year. First week of US epidemiological year has the majority (4 or more) of its days in January. Last week of US epidemiological year has the year's last Wednesday in it. US epidemiological week starts on Sunday.
  • QUARTER Return the number of the quarter within the year. January 1 through March 31 map to the first quarter, April 1 through June 30 map to the second quarter, etc.
  • MONTH Return the number of the month within the year.
  • DAY Return the number of the day within the month.
  • DAY_OF_YEAR Return the number of the day within the year. January 1 maps to the first day, February 1 maps to the thirty-second day, etc.
  • MONDAY_DAY_OF_WEEK Return the number of the day within the week, from Monday (first day) to Sunday (seventh day).
  • SUNDAY_DAY_OF_WEEK Return the number of the day within the week, from Sunday (first day) to Saturday (seventh day).
  • MONDAY_WEEK Return the number of the week within the year. First week starts on first Monday of January.
  • SUNDAY_WEEK Return the number of the week within the year. First week starts on first Sunday of January.
  • ISO_WEEK Return the number of the ISO week within the ISO year. First ISO week has the majority (4 or more) of its days in January. ISO week starts on Monday.
  • US_WEEK Return the number of the US week within the US year. First US week has the majority (4 or more) of its days in January. US week starts on Sunday.
  • HOUR Return the hour (0-23).
  • MINUTE Return the minute (0-59).
  • SECOND Return the second (0-59).
  • MILLISECOND Return number of milliseconds since the last full second.
  • MICROSECOND Return number of microseconds since the last full millisecond.
  • NANOSECOND Return number of nanoseconds since the last full microsecond.
  • SUBSECOND Return number of microseconds since the last full second of the given timestamp.
  • UNIX_TIME Return number of seconds that have elapsed since 1970-01-01 00:00:00 UTC, ignoring leap seconds.
  • TIMEZONE_OFFSET Return number of seconds of timezone offset to UTC.

    The range of values returned for QUARTER, MONTH, DAY, DAY_OF_YEAR, MONDAY_DAY_OF_WEEK, SUNDAY_DAY_OF_WEEK, MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK depends on whether counting starts at 1 or 0. This is governed by the indexing option.

    When indexing is ONE:
  • QUARTER returns values in range 1-4
  • MONTH returns values in range 1-12
  • DAY returns values in range 1-31
  • DAY_OF_YEAR returns values in range 1-366
  • MONDAY_DAY_OF_WEEK and SUNDAY_DAY_OF_WEEK return values in range 1-7
  • MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK return values in range 1-53

    When indexing is ZERO:
  • QUARTER returns values in range 0-3
  • MONTH returns values in range 0-11
  • DAY returns values in range 0-30
  • DAY_OF_YEAR returns values in range 0-365
  • MONDAY_DAY_OF_WEEK and SUNDAY_DAY_OF_WEEK return values in range 0-6
  • MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, and US_WEEK return values in range 0-52

    The indexing option must be specified when the component is QUARTER, MONTH, DAY, DAY_OF_YEAR, MONDAY_DAY_OF_WEEK, SUNDAY_DAY_OF_WEEK, MONDAY_WEEK, SUNDAY_WEEK, ISO_WEEK, or US_WEEK. The indexing option cannot be specified when the component is YEAR, ISO_YEAR, US_YEAR, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, SUBSECOND, UNIX_TIME, or TIMEZONE_OFFSET.

    Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: "Pacific/Marquesas", "Etc/GMT+1". If timezone is invalid an error is thrown.

    Options:
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND', 'UNIX_TIME', 'TIMEZONE_OFFSET']
  • indexing ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'NANOSECOND', 'SUBSECOND', 'UNIX_TIME', 'TIMEZONE_OFFSET']
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND', 'UNIX_TIME']
  • indexing ['YEAR', 'ISO_YEAR', 'US_YEAR', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'NANOSECOND', 'SUBSECOND', 'UNIX_TIME']
  • component ['YEAR', 'ISO_YEAR', 'US_YEAR', 'UNIX_TIME']
  • indexing ['HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND', 'SUBSECOND']
  • component ['QUARTER', 'MONTH', 'DAY', 'DAY_OF_YEAR', 'MONDAY_DAY_OF_WEEK', 'SUNDAY_DAY_OF_WEEK', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK']
  • indexing ['ONE', 'ZERO']
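
    As an illustration of the indexing option, a small Python sketch for a few of the components (the extract helper here is hypothetical, not the Substrait function itself):

    from datetime import date

    def extract(component, d, indexing="ONE"):
        base = {"MONTH": d.month, "DAY": d.day,
                "QUARTER": (d.month - 1) // 3 + 1}[component]
        return base if indexing == "ONE" else base - 1

    assert extract("MONTH", date(2024, 2, 29)) == 2
    assert extract("MONTH", date(2024, 2, 29), indexing="ZERO") == 1
    assert extract("QUARTER", date(2024, 2, 29)) == 1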
  • "},{"location":"extensions/functions_datetime/#extract_boolean","title":"extract_boolean","text":"

    Implementations: extract_boolean(component, x): -> return_type 0. extract_boolean(component, timestamp): -> boolean 1. extract_boolean(component, precision_timestamp<P>): -> boolean 2. extract_boolean(component, timestamp_tz, string): -> boolean 3. extract_boolean(component, precision_timestamp_tz<P>, string): -> boolean 4. extract_boolean(component, date): -> boolean

    Extract boolean values of a date/time value.
  • IS_LEAP_YEAR Return true if year of the given value is a leap year and false otherwise.
  • IS_DST Return true if DST (Daylight Savings Time) is observed at the given value in the given timezone.

    Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: "Pacific/Marquesas", "Etc/GMT+1". If timezone is invalid an error is thrown.

    Options:
  • component ['IS_LEAP_YEAR']
  • component ['IS_LEAP_YEAR', 'IS_DST']
  • "},{"location":"extensions/functions_datetime/#add","title":"add","text":"

    Implementations: add(x, y): -> return_type 0. add(timestamp, interval_year): -> timestamp 1. add(precision_timestamp<P>, interval_year): -> precision_timestamp<P> 2. add(timestamp_tz, interval_year, string): -> timestamp_tz 3. add(precision_timestamp_tz<P>, interval_year, string): -> precision_timestamp_tz<P> 4. add(date, interval_year): -> timestamp 5. add(timestamp, interval_day<P>): -> timestamp 6. add(precision_timestamp<P>, interval_day<P>): -> precision_timestamp<P> 7. add(timestamp_tz, interval_day<P>): -> timestamp_tz 8. add(precision_timestamp_tz<P>, interval_day<P>): -> precision_timestamp_tz<P> 9. add(date, interval_day<P>): -> timestamp

    Add an interval to a date/time type. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: "Pacific/Marquesas", "Etc/GMT+1". If timezone is invalid an error is thrown.

    "},{"location":"extensions/functions_datetime/#multiply","title":"multiply","text":"

    Implementations: multiply(x, y): -> return_type 0. multiply(i8, interval_day<P>): -> interval_day<P> 1. multiply(i16, interval_day<P>): -> interval_day<P> 2. multiply(i32, interval_day<P>): -> interval_day<P> 3. multiply(i64, interval_day<P>): -> interval_day<P> 4. multiply(i8, interval_year): -> interval_year 5. multiply(i16, interval_year): -> interval_year 6. multiply(i32, interval_year): -> interval_year 7. multiply(i64, interval_year): -> interval_year

    Multiply an interval by an integral number.

    "},{"location":"extensions/functions_datetime/#add_intervals","title":"add_intervals","text":"

    Implementations: add_intervals(x, y): -> return_type 0. add_intervals(interval_day<P>, interval_day<P>): -> interval_day<P> 1. add_intervals(interval_year, interval_year): -> interval_year

    Add two intervals together.

    "},{"location":"extensions/functions_datetime/#subtract","title":"subtract","text":"

    Implementations: subtract(x, y): -> return_type 0. subtract(timestamp, interval_year): -> timestamp 1. subtract(precision_timestamp<P>, interval_year): -> precision_timestamp<P> 2. subtract(timestamp_tz, interval_year): -> timestamp_tz 3. subtract(precision_timestamp_tz<P>, interval_year): -> precision_timestamp_tz<P> 4. subtract(timestamp_tz, interval_year, string): -> timestamp_tz 5. subtract(precision_timestamp_tz<P>, interval_year, string): -> precision_timestamp_tz<P> 6. subtract(date, interval_year): -> date 7. subtract(timestamp, interval_day<P>): -> timestamp 8. subtract(precision_timestamp<P>, interval_day<P>): -> precision_timestamp<P> 9. subtract(timestamp_tz, interval_day<P>): -> timestamp_tz 10. subtract(precision_timestamp_tz<P>, interval_day<P>): -> precision_timestamp_tz<P> 11. subtract(date, interval_day<P>): -> date

    Subtract an interval from a date/time type. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: "Pacific/Marquesas", "Etc/GMT+1". If timezone is invalid an error is thrown.

    "},{"location":"extensions/functions_datetime/#lte","title":"lte","text":"

    Implementations: lte(x, y): -> return_type 0. lte(timestamp, timestamp): -> boolean 1. lte(precision_timestamp<P>, precision_timestamp<P>): -> boolean 2. lte(timestamp_tz, timestamp_tz): -> boolean 3. lte(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean 4. lte(date, date): -> boolean 5. lte(interval_day<P>, interval_day<P>): -> boolean 6. lte(interval_year, interval_year): -> boolean

    less than or equal to

    "},{"location":"extensions/functions_datetime/#lt","title":"lt","text":"

    Implementations: lt(x, y): -> return_type 0. lt(timestamp, timestamp): -> boolean 1. lt(precision_timestamp<P>, precision_timestamp<P>): -> boolean 2. lt(timestamp_tz, timestamp_tz): -> boolean 3. lt(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean 4. lt(date, date): -> boolean 5. lt(interval_day<P>, interval_day<P>): -> boolean 6. lt(interval_year, interval_year): -> boolean

    less than

    "},{"location":"extensions/functions_datetime/#gte","title":"gte","text":"

    Implementations: gte(x, y): -> return_type 0. gte(timestamp, timestamp): -> boolean 1. gte(precision_timestamp<P>, precision_timestamp<P>): -> boolean 2. gte(timestamp_tz, timestamp_tz): -> boolean 3. gte(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean 4. gte(date, date): -> boolean 5. gte(interval_day<P>, interval_day<P>): -> boolean 6. gte(interval_year, interval_year): -> boolean

    greater than or equal to

    "},{"location":"extensions/functions_datetime/#gt","title":"gt","text":"

    Implementations: gt(x, y): -> return_type 0. gt(timestamp, timestamp): -> boolean 1. gt(precision_timestamp<P>, precision_timestamp<P>): -> boolean 2. gt(timestamp_tz, timestamp_tz): -> boolean 3. gt(precision_timestamp_tz<P>, precision_timestamp_tz<P>): -> boolean 4. gt(date, date): -> boolean 5. gt(interval_day<P>, interval_day<P>): -> boolean 6. gt(interval_year, interval_year): -> boolean

    greater than

    "},{"location":"extensions/functions_datetime/#assume_timezone","title":"assume_timezone","text":"

    Implementations: assume_timezone(x, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.
  • 0. assume_timezone(timestamp, string): -> timestamp_tz 1. assume_timezone(precision_timestamp<P>, string): -> precision_timestamp_tz<P> 2. assume_timezone(date, string): -> timestamp_tz

    Convert local timestamp to UTC-relative timestamp_tz using given local time's timezone. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: "Pacific/Marquesas", "Etc/GMT+1". If timezone is invalid an error is thrown.

    "},{"location":"extensions/functions_datetime/#local_timestamp","title":"local_timestamp","text":"

    Implementations: local_timestamp(x, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.
  • 0. local_timestamp(timestamp_tz, string): -> timestamp 1. local_timestamp(precision_timestamp_tz<P>, string): -> precision_timestamp<P>

    Convert UTC-relative timestamp_tz to local timestamp using given local time's timezone. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: "Pacific/Marquesas", "Etc/GMT+1". If timezone is invalid an error is thrown.

    "},{"location":"extensions/functions_datetime/#strptime_time","title":"strptime_time","text":"

    Implementations: strptime_time(time_string, format): -> return_type 0. strptime_time(string, string): -> time

    Parse string into time using the provided format; see https://man7.org/linux/man-pages/man3/strptime.3.html for reference.

    "},{"location":"extensions/functions_datetime/#strptime_date","title":"strptime_date","text":"

    Implementations: strptime_date(date_string, format): -> return_type 0. strptime_date(string, string): -> date

    Parse string into date using the provided format; see https://man7.org/linux/man-pages/man3/strptime.3.html for reference.

    "},{"location":"extensions/functions_datetime/#strptime_timestamp","title":"strptime_timestamp","text":"

    Implementations: strptime_timestamp(timestamp_string, format, timezone): -> return_type

  • timezone: Timezone string from IANA tzdb.
  • 0. strptime_timestamp(string, string, string): -> timestamp_tz 1. strptime_timestamp(string, string): -> timestamp_tz

    Parse string into timestamp using the provided format; see https://man7.org/linux/man-pages/man3/strptime.3.html for reference. If timezone is present in the timestamp and also provided as a parameter, an error is thrown. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: "Pacific/Marquesas", "Etc/GMT+1". If timezone is supplied as a parameter and present in the parsed string, the parsed timezone is used. If the parameter-supplied timezone is invalid, an error is thrown.
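
    For intuition, a Python analogue using a strptime-style format plus an IANA timezone parameter (a sketch of the semantics, not the engine implementation):

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # Parse a local timestamp string, then attach the parameter-supplied zone.
    ts = datetime.strptime("2024-03-01 12:30:00", "%Y-%m-%d %H:%M:%S")
    ts_tz = ts.replace(tzinfo=ZoneInfo("Pacific/Marquesas"))
    print(ts_tz.isoformat())  # 2024-03-01T12:30:00-09:30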

    "},{"location":"extensions/functions_datetime/#strftime","title":"strftime","text":"

    Implementations: strftime(x, format): -> return_type 0. strftime(timestamp, string): -> string 1. strftime(precision_timestamp<P>, string): -> string 2. strftime(timestamp_tz, string, string): -> string 3. strftime(precision_timestamp_tz<P>, string, string): -> string 4. strftime(date, string): -> string 5. strftime(time, string): -> string

    Convert timestamp/date/time to string using provided format, see https://man7.org/linux/man-pages/man3/strftime.3.html for reference. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: "Pacific/Marquesas", "Etc/GMT+1". If timezone is invalid an error is thrown.

    "},{"location":"extensions/functions_datetime/#round_temporal","title":"round_temporal","text":"

    Implementations: round_temporal(x, rounding, unit, multiple, origin): -> return_type 0. round_temporal(timestamp, rounding, unit, i64, timestamp): -> timestamp 1. round_temporal(precision_timestamp<P>, rounding, unit, i64, precision_timestamp<P>): -> precision_timestamp<P> 2. round_temporal(timestamp_tz, rounding, unit, i64, string, timestamp_tz): -> timestamp_tz 3. round_temporal(precision_timestamp_tz<P>, rounding, unit, i64, string, precision_timestamp_tz<P>): -> precision_timestamp_tz<P> 4. round_temporal(date, rounding, unit, i64, date): -> date 5. round_temporal(time, rounding, unit, i64, time): -> time

    Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple from the origin in the given timezone, the resulting point is chosen as one of the two nearest multiples. Which of these is chosen is governed by rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: "Pacific/Marquesas", "Etc/GMT+1". If timezone is invalid an error is thrown.

    Options:
  • rounding ['FLOOR', 'CEIL', 'ROUND_TIE_DOWN', 'ROUND_TIE_UP']
  • unit ['YEAR', 'MONTH', 'WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • rounding ['YEAR', 'MONTH', 'WEEK', 'DAY']
  • unit ['HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
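
    A minimal Python sketch of the FLOOR and CEIL cases for fixed-width units (ROUND_TIE_DOWN/ROUND_TIE_UP and timezone handling are omitted; names are illustrative):

    from datetime import datetime, timedelta

    def round_temporal(x, rounding, unit, multiple, origin):
        step = unit * multiple
        elapsed = (x - origin) % step  # distance past the previous multiple
        if elapsed == timedelta(0):
            return x                   # already an exact multiple
        floor = x - elapsed
        return floor if rounding == "FLOOR" else floor + step  # CEIL

    t, origin = datetime(2024, 3, 1, 10, 17), datetime(2024, 3, 1)
    assert round_temporal(t, "FLOOR", timedelta(minutes=1), 15, origin) == datetime(2024, 3, 1, 10, 15)
    assert round_temporal(t, "CEIL", timedelta(minutes=1), 15, origin) == datetime(2024, 3, 1, 10, 30)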
  • "},{"location":"extensions/functions_datetime/#round_calendar","title":"round_calendar","text":"

    Implementations: round_calendar(x, rounding, unit, origin, multiple): -> return_type 0. round_calendar(timestamp, rounding, unit, origin, i64): -> timestamp 1. round_calendar(precision_timestamp<P>, rounding, unit, origin, i64): -> precision_timestamp<P> 2. round_calendar(timestamp_tz, rounding, unit, origin, i64, string): -> timestamp_tz 3. round_calendar(precision_timestamp_tz<P>, rounding, unit, origin, i64, string): -> precision_timestamp_tz<P> 4. round_calendar(date, rounding, unit, origin, i64, date): -> date 5. round_calendar(time, rounding, unit, origin, i64, time): -> time

    Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple from the last origin unit in the given timezone, the resulting point is chosen as one of the two nearest multiples. Which of these is chosen is governed by rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). Examples: "Pacific/Marquesas", "Etc/GMT+1". If timezone is invalid an error is thrown.

    Options:
  • rounding ['FLOOR', 'CEIL', 'ROUND_TIE_DOWN', 'ROUND_TIE_UP']
  • unit ['YEAR', 'MONTH', 'WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • origin ['YEAR', 'MONTH', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND']
  • rounding ['YEAR', 'MONTH', 'WEEK', 'DAY']
  • unit ['YEAR', 'MONTH', 'MONDAY_WEEK', 'SUNDAY_WEEK', 'ISO_WEEK', 'US_WEEK', 'DAY']
  • origin ['DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND', 'MICROSECOND']
  • rounding ['DAY', 'HOUR', 'MINUTE', 'SECOND', 'MILLISECOND']
  • "},{"location":"extensions/functions_datetime/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_datetime/#min","title":"min","text":"

    Implementations: min(x): -> return_type 0. min(date): -> date? 1. min(time): -> time? 2. min(timestamp): -> timestamp? 3. min(precision_timestamp<P>): -> precision_timestamp?<P> 4. min(timestamp_tz): -> timestamp_tz? 5. min(precision_timestamp_tz<P>): -> precision_timestamp_tz?<P> 6. min(interval_day<P>): -> interval_day?<P> 7. min(interval_year): -> interval_year?

    Return the minimum of a set of values.

    "},{"location":"extensions/functions_datetime/#max","title":"max","text":"

    Implementations: max(x): -> return_type 0. max(date): -> date? 1. max(time): -> time? 2. max(timestamp): -> timestamp? 3. max(timestamp_tz): -> timestamp_tz? 4. max(precision_timestamp_tz<P>): -> precision_timestamp_tz?<P> 5. max(interval_day<P>): -> interval_day?<P> 6. max(interval_year): -> interval_year?

    Return the maximum of a set of values.

    "},{"location":"extensions/functions_geometry/","title":"functions_geometry.yaml","text":"

    This document file is generated for functions_geometry.yaml

    "},{"location":"extensions/functions_geometry/#data-types","title":"Data Types","text":"

    name: geometry
    structure: BINARY

    "},{"location":"extensions/functions_geometry/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_geometry/#point","title":"point","text":"

    Implementations: point(x, y): -> return_type 0. point(fp64, fp64): -> u!geometry

    Returns a 2D point with the given x and y coordinate values.

    "},{"location":"extensions/functions_geometry/#make_line","title":"make_line","text":"

    Implementations: make_line(geom1, geom2): -> return_type 0. make_line(u!geometry, u!geometry): -> u!geometry

    Returns a linestring connecting the end point of geometry geom1 to the start point of geometry geom2. Repeated points at the beginning of input geometries are collapsed to a single point. A linestring can be closed or simple. A closed linestring starts and ends on the same point. A simple linestring does not cross or touch itself.

    "},{"location":"extensions/functions_geometry/#x_coordinate","title":"x_coordinate","text":"

    Implementations: x_coordinate(point): -> return_type 0. x_coordinate(u!geometry): -> fp64

    Return the x coordinate of the point. Return null if not available.

    "},{"location":"extensions/functions_geometry/#y_coordinate","title":"y_coordinate","text":"

    Implementations: y_coordinate(point): -> return_type 0. y_coordinate(u!geometry): -> fp64

    Return the y coordinate of the point. Return null if not available.

    "},{"location":"extensions/functions_geometry/#num_points","title":"num_points","text":"

    Implementations: num_points(geom): -> return_type 0. num_points(u!geometry): -> i64

    Return the number of points in the geometry. The geometry should be a linestring or circularstring.

    "},{"location":"extensions/functions_geometry/#is_empty","title":"is_empty","text":"

    Implementations: is_empty(geom): -> return_type 0. is_empty(u!geometry): -> boolean

    Return true if the geometry is an empty geometry.

    "},{"location":"extensions/functions_geometry/#is_closed","title":"is_closed","text":"

    Implementations: is_closed(geom): -> return_type 0. is_closed(u!geometry): -> boolean

    Return true if the geometry's start and end points are the same.

    "},{"location":"extensions/functions_geometry/#is_simple","title":"is_simple","text":"

    Implementations: is_simple(geom): -> return_type 0. is_simple(u!geometry): -> boolean

    Return true if the geometry does not self-intersect.

    "},{"location":"extensions/functions_geometry/#is_ring","title":"is_ring","text":"

    Implementations: is_ring(geom): -> return_type 0. is_ring(u!geometry): -> boolean

    Return true if the geometry's start and end points are the same and it does not self-intersect.

    "},{"location":"extensions/functions_geometry/#geometry_type","title":"geometry_type","text":"

    Implementations: geometry_type(geom): -> return_type 0. geometry_type(u!geometry): -> string

    Return the type of geometry as a string.

    "},{"location":"extensions/functions_geometry/#envelope","title":"envelope","text":"

    Implementations: envelope(geom): -> return_type 0. envelope(u!geometry): -> u!geometry

    Return the minimum bounding box for the input geometry as a geometry. The returned geometry is defined by the corner points of the bounding box. If the input geometry is a point or a line, the returned geometry can also be a point or line.

    "},{"location":"extensions/functions_geometry/#dimension","title":"dimension","text":"

    Implementations: dimension(geom): -> return_type 0. dimension(u!geometry): -> i8

    Return the dimension of the input geometry. If the input is a collection of geometries, return the largest dimension from the collection. Dimensionality is determined by the complexity of the input and not the coordinate system being used. Type dimensions: POINT - 0, LINE - 1, POLYGON - 2.

    "},{"location":"extensions/functions_geometry/#is_valid","title":"is_valid","text":"

    Implementations: is_valid(geom): -> return_type 0. is_valid(u!geometry): -> boolean

    Return true if the input geometry is a valid 2D geometry. For 3-dimensional and 4-dimensional geometries, the validity is still only tested in 2 dimensions.

    "},{"location":"extensions/functions_geometry/#collection_extract","title":"collection_extract","text":"

    Implementations: collection_extract(geom_collection): -> return_type 0. collection_extract(u!geometry): -> u!geometry 1. collection_extract(u!geometry, i8): -> u!geometry

    Given the input geometry collection, return a homogeneous multi-geometry. All geometries in the multi-geometry will have the same dimension. If type is not specified, the multi-geometry will only contain geometries of the highest dimension. If type is specified, the multi-geometry will only contain geometries of that type. If there are no geometries of the specified type, an empty geometry is returned. Only points, linestrings, and polygons are supported. Type numbers: POINT - 0, LINE - 1, POLYGON - 2.

    "},{"location":"extensions/functions_geometry/#flip_coordinates","title":"flip_coordinates","text":"

    Implementations: flip_coordinates(geom_collection): -> return_type 0. flip_coordinates(u!geometry): -> u!geometry

    Return a version of the input geometry with the X and Y axes flipped. This operation can be performed on geometries with more than 2 dimensions; however, only the X and Y axes will be flipped.

    "},{"location":"extensions/functions_geometry/#remove_repeated_points","title":"remove_repeated_points","text":"

    Implementations: remove_repeated_points(geom): -> return_type 0. remove_repeated_points(u!geometry): -> u!geometry 1. remove_repeated_points(u!geometry, fp64): -> u!geometry

    Return a version of the input geometry with duplicate consecutive points removed. If the tolerance argument is provided, consecutive points within the tolerance distance of one another are considered to be duplicates.

    "},{"location":"extensions/functions_geometry/#buffer","title":"buffer","text":"

    Implementations: buffer(geom, buffer_radius): -> return_type 0. buffer(u!geometry, fp64): -> u!geometry

    Compute and return an expanded version of the input geometry. All the points of the returned geometry are at a distance of buffer_radius away from the points of the input geometry. If a negative buffer_radius is provided, the geometry will shrink instead of expand. A negative buffer_radius may shrink the geometry completely, in which case an empty geometry is returned. For point or line input geometries, a negative buffer_radius will always return an empty geometry.

    "},{"location":"extensions/functions_geometry/#centroid","title":"centroid","text":"

    Implementations: centroid(geom): -> return_type 0. centroid(u!geometry): -> u!geometry

    Return a point which is the geometric center of mass of the input geometry.

    "},{"location":"extensions/functions_geometry/#minimum_bounding_circle","title":"minimum_bounding_circle","text":"

    Implementations: minimum_bounding_circle(geom): -> return_type 0. minimum_bounding_circle(u!geometry): -> u!geometry

    Return the smallest circle polygon that contains the input geometry.

    "},{"location":"extensions/functions_logarithmic/","title":"functions_logarithmic.yaml","text":"

    This document file is generated for functions_logarithmic.yaml

    "},{"location":"extensions/functions_logarithmic/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_logarithmic/#ln","title":"ln","text":"

    Implementations: ln(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type 0. ln(i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 1. ln(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32 2. ln(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 3. ln(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Natural logarithm of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']
  • "},{"location":"extensions/functions_logarithmic/#log10","title":"log10","text":"

    Implementations: log10(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type 0. log10(i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 1. log10(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32 2. log10(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 3. log10(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Logarithm to base 10 of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']
  • "},{"location":"extensions/functions_logarithmic/#log2","title":"log2","text":"

    Implementations: log2(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type 0. log2(i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 1. log2(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32 2. log2(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 3. log2(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Logarithm to base 2 of the value

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']
  • "},{"location":"extensions/functions_logarithmic/#logb","title":"logb","text":"

    Implementations: logb(x, base, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type

  • x: The number `x` to compute the logarithm of
  • base: The logarithm base `b` to use
  • 0. logb(i64, i64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 1. logb(fp32, fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32 2. logb(fp64, fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 3. logb(decimal<P1,S1>, decimal<P1,S1>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Logarithm of the value with the given base: logb(x, b) => log_{b}(x).

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']
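
    A Python sketch of the basic logb semantics (only the NAN/NULL domain-error options are modeled; rounding and on_log_zero handling are omitted):

    import math

    def logb(x, base, on_domain_error="NAN"):
        if x < 0:  # negative input is a domain error
            return float("nan") if on_domain_error == "NAN" else None
        return math.log(x, base)

    assert math.isclose(logb(8.0, 2.0), 3.0)
    assert math.isnan(logb(-1.0, 2.0))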
  • "},{"location":"extensions/functions_logarithmic/#log1p","title":"log1p","text":"

    Implementations: log1p(x, option:rounding, option:on_domain_error, option:on_log_zero): -> return_type 0. log1p(fp32, option:rounding, option:on_domain_error, option:on_log_zero): -> fp32 1. log1p(fp64, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64 2. log1p(decimal<P,S>, option:rounding, option:on_domain_error, option:on_log_zero): -> fp64

    Natural logarithm (base e) of 1 + x: log1p(x) => log(1 + x).

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR']
  • on_domain_error ['NAN', 'NULL', 'ERROR']
  • on_log_zero ['NAN', 'ERROR', 'MINUS_INFINITY']
  • "},{"location":"extensions/functions_rounding/","title":"functions_rounding.yaml","text":"

    This document file is generated for functions_rounding.yaml

    "},{"location":"extensions/functions_rounding/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_rounding/#ceil","title":"ceil","text":"

    Implementations: ceil(x): -> return_type 0. ceil(fp32): -> fp32 1. ceil(fp64): -> fp64

    Rounding to the ceiling of the value x.

    "},{"location":"extensions/functions_rounding/#floor","title":"floor","text":"

    Implementations: floor(x): -> return_type 0. floor(fp32): -> fp32 1. floor(fp64): -> fp64

    Rounding to the floor of the value x.

    "},{"location":"extensions/functions_rounding/#round","title":"round","text":"

    Implementations: round(x, s, option:rounding): -> return_type

  • x: Numerical expression to be rounded.
  • s: Number of decimal places to be rounded to. When `s` is a positive number, nothing will happen since `x` is an integer value. When `s` is a negative number, the rounding is performed to the nearest multiple of `10^(-s)`.
  • 0. round(i8, i32, option:rounding): -> i8? 1. round(i16, i32, option:rounding): -> i16? 2. round(i32, i32, option:rounding): -> i32? 3. round(i64, i32, option:rounding): -> i64? 4. round(fp32, i32, option:rounding): -> fp32? 5. round(fp64, i32, option:rounding): -> fp64?

    Rounding the value x to s decimal places.

    Options:
  • rounding ['TIE_TO_EVEN', 'TIE_AWAY_FROM_ZERO', 'TRUNCATE', 'CEILING', 'FLOOR', 'AWAY_FROM_ZERO', 'TIE_DOWN', 'TIE_UP', 'TIE_TOWARDS_ZERO', 'TIE_TO_ODD']
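
    For intuition about two of the rounding options, a sketch using Python's decimal module (TIE_TO_EVEN corresponds to ROUND_HALF_EVEN and TIE_AWAY_FROM_ZERO to ROUND_HALF_UP):

    from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

    x = Decimal("2.5")  # rounding to s = 0 decimal places
    assert x.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN) == 2  # TIE_TO_EVEN
    assert x.quantize(Decimal("1"), rounding=ROUND_HALF_UP) == 3    # TIE_AWAY_FROM_ZERO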
  • "},{"location":"extensions/functions_set/","title":"functions_set.yaml","text":"

    This document file is generated for functions_set.yaml

    "},{"location":"extensions/functions_set/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_set/#index_in","title":"index_in","text":"

    Implementations: index_in(needle, haystack, option:nan_equality): -> return_type 0. index_in(any1, list<any1>, option:nan_equality): -> i64?

    Checks the membership of a value in a list of values. Returns the first 0-based index of needle if needle is equal to any element in haystack; returns NULL if not found. If needle is NULL, returns NULL. If needle is NaN, returns the 0-based index of NaN in the input by default, or NULL if NAN_IS_NOT_NAN is specified.

    Options:
  • nan_equality ['NAN_IS_NAN', 'NAN_IS_NOT_NAN']
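
    A minimal Python sketch of these semantics, with None standing in for NULL (the helper is illustrative):

    import math

    def index_in(needle, haystack, nan_is_nan=True):
        if needle is None:
            return None
        is_nan = isinstance(needle, float) and math.isnan(needle)
        if is_nan and not nan_is_nan:
            return None  # NAN_IS_NOT_NAN
        for i, v in enumerate(haystack):
            if is_nan:
                if isinstance(v, float) and math.isnan(v):
                    return i
            elif v == needle:
                return i
        return None

    assert index_in(3, [1, 3, 5]) == 1
    assert index_in(float("nan"), [1.0, float("nan")]) == 1
    assert index_in(float("nan"), [1.0, float("nan")], nan_is_nan=False) is None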
  • "},{"location":"extensions/functions_string/","title":"functions_string.yaml","text":"

    This document file is generated for functions_string.yaml

    "},{"location":"extensions/functions_string/#scalar-functions","title":"Scalar Functions","text":""},{"location":"extensions/functions_string/#concat","title":"concat","text":"

    Implementations: concat(input, option:null_handling): -> return_type 0. concat(varchar<L1>, option:null_handling): -> varchar<L1> 1. concat(string, option:null_handling): -> string

    Concatenate strings. The null_handling option determines whether or not null values will be recognized by the function. If null_handling is set to IGNORE_NULLS, null value arguments will be ignored when strings are concatenated. If set to ACCEPT_NULLS, the result will be null if any argument passed to the concat function is null.

    Options:
  • null_handling ['IGNORE_NULLS', 'ACCEPT_NULLS']
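
    A minimal Python sketch of the two null_handling behaviors (None stands in for null):

    def concat(*args, null_handling="ACCEPT_NULLS"):
        if null_handling == "IGNORE_NULLS":
            return "".join(a for a in args if a is not None)
        return None if any(a is None for a in args) else "".join(args)

    assert concat("a", None, "c", null_handling="IGNORE_NULLS") == "ac"
    assert concat("a", None, "c") is None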
  • "},{"location":"extensions/functions_string/#like","title":"like","text":"

    Implementations: like(input, match, option:case_sensitivity): -> return_type

  • input: The input string.
  • match: The string to match against the input string.
  • 0. like(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean 1. like(string, string, option:case_sensitivity): -> boolean

    Whether two strings are like each other. The case_sensitivity option applies to the match argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#substring","title":"substring","text":"

    Implementations: substring(input, start, length, option:negative_start): -> return_type 0. substring(varchar<L1>, i32, i32, option:negative_start): -> varchar<L1> 1. substring(string, i32, i32, option:negative_start): -> string 2. substring(fixedchar<l1>, i32, i32, option:negative_start): -> string 3. substring(varchar<L1>, i32, option:negative_start): -> varchar<L1> 4. substring(string, i32, option:negative_start): -> string 5. substring(fixedchar<l1>, i32, option:negative_start): -> string

    Extract a substring of a specified length starting from position start. A start value of 1 refers to the first character of the string. When length is not specified, the function will extract a substring starting from position start and ending at the end of the string. The negative_start option applies to the start parameter. WRAP_FROM_END means the index will start from the end of the input and move backwards. The last character has an index of -1, the second to last character has an index of -2, and so on. LEFT_OF_BEGINNING means the returned substring will start from the left of the first character. A start of -1 will begin 2 characters left of the input, while a start of 0 begins 1 character left of the input.

    Options:
  • negative_start ['WRAP_FROM_END', 'LEFT_OF_BEGINNING', 'ERROR']
  • negative_start ['WRAP_FROM_END', 'LEFT_OF_BEGINNING']
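
    A minimal Python sketch of the 1-based indexing and the two negative_start behaviors (the helper is illustrative):

    def substring(s, start, length, negative_start="WRAP_FROM_END"):
        if start > 0:
            begin = start - 1               # 1 refers to the first character
        elif negative_start == "WRAP_FROM_END":
            begin = len(s) + start          # -1 is the last character
        else:                               # LEFT_OF_BEGINNING
            begin = start - 1               # 0 begins 1 character left of s
        end = begin + length
        return s[max(begin, 0):max(end, 0)]

    assert substring("abcdef", 2, 3) == "bcd"
    assert substring("abcdef", -2, 2) == "ef"
    assert substring("abcdef", 0, 3, negative_start="LEFT_OF_BEGINNING") == "ab"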
  • "},{"location":"extensions/functions_string/#regexp_match_substring","title":"regexp_match_substring","text":"

    Implementations: regexp_match_substring(input, pattern, position, occurrence, group, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_match_substring(varchar<L1>, varchar<L2>, i64, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> varchar<L1> 1. regexp_match_substring(string, string, i64, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> string

    Extract a substring that matches the given regular expression pattern. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The occurrence of the pattern to be extracted is specified using the occurrence argument. Specifying 1 means the first occurrence will be extracted, 2 means the second occurrence, and so on. The occurrence argument should be a positive non-zero integer. The number of characters from the beginning of the string to begin starting to search for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The regular expression capture group can be specified using the group argument. Specifying 0 will return the substring matching the full regular expression. Specifying 1 will return the substring matching only the first capture group, and so on. The group argument should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the occurrence value is out of range, the position value is out of range, or the group value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
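
    An approximation of the position/occurrence/group arguments using Python's re module (the spec references ICU regular expressions, so this is only a sketch):

    import re

    def regexp_match_substring(s, pattern, position, occurrence, group):
        matches = list(re.finditer(pattern, s[position - 1:]))
        return matches[occurrence - 1].group(group)

    # second occurrence of the pattern, capture group 2
    assert regexp_match_substring("ab12cd34", r"([a-z]+)(\d+)", 1, 2, 2) == "34"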
  • "},{"location":"extensions/functions_string/#regexp_match_substring_1","title":"regexp_match_substring","text":"

    Implementations: regexp_match_substring(input, pattern, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_match_substring(string, string, option:case_sensitivity, option:multiline, option:dotall): -> string

    Extract a substring that matches the given regular expression pattern. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The first occurrence of the pattern from the beginning of the string is extracted. It returns the substring matching the full regular expression. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#regexp_match_substring_all","title":"regexp_match_substring_all","text":"

    Implementations: regexp_match_substring_all(input, pattern, position, group, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_match_substring_all(varchar<L1>, varchar<L2>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> List<varchar<L1>> 1. regexp_match_substring_all(string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> List<string>

    Extract all substrings that match the given regular expression pattern. This will return a list of extracted strings with one value for each occurrence of a match. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The number of characters from the beginning of the string to begin starting to search for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The regular expression capture group can be specified using the group argument. Specifying 0 will return substrings matching the full regular expression. Specifying 1 will return substrings matching only the first capture group, and so on. The group argument should be a non-negative integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the position value is out of range, or the group value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#starts_with","title":"starts_with","text":"

    Implementations: starts_with(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.
  • 0. starts_with(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean 1. starts_with(varchar<L1>, string, option:case_sensitivity): -> boolean 2. starts_with(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean 3. starts_with(string, string, option:case_sensitivity): -> boolean 4. starts_with(string, varchar<L1>, option:case_sensitivity): -> boolean 5. starts_with(string, fixedchar<L1>, option:case_sensitivity): -> boolean 6. starts_with(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean 7. starts_with(fixedchar<L1>, string, option:case_sensitivity): -> boolean 8. starts_with(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether the input string starts with the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#ends_with","title":"ends_with","text":"

    Implementations: ends_with(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.
  • 0. ends_with(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean 1. ends_with(varchar<L1>, string, option:case_sensitivity): -> boolean 2. ends_with(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean 3. ends_with(string, string, option:case_sensitivity): -> boolean 4. ends_with(string, varchar<L1>, option:case_sensitivity): -> boolean 5. ends_with(string, fixedchar<L1>, option:case_sensitivity): -> boolean 6. ends_with(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean 7. ends_with(fixedchar<L1>, string, option:case_sensitivity): -> boolean 8. ends_with(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether input string ends with the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#contains","title":"contains","text":"

    Implementations: contains(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.
  • 0. contains(varchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean 1. contains(varchar<L1>, string, option:case_sensitivity): -> boolean 2. contains(varchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean 3. contains(string, string, option:case_sensitivity): -> boolean 4. contains(string, varchar<L1>, option:case_sensitivity): -> boolean 5. contains(string, fixedchar<L1>, option:case_sensitivity): -> boolean 6. contains(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> boolean 7. contains(fixedchar<L1>, string, option:case_sensitivity): -> boolean 8. contains(fixedchar<L1>, varchar<L2>, option:case_sensitivity): -> boolean

    Whether the input string contains the substring. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#strpos","title":"strpos","text":"

    Implementations: strpos(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to search for.
  • 0. strpos(string, string, option:case_sensitivity): -> i64 1. strpos(varchar<L1>, varchar<L1>, option:case_sensitivity): -> i64 2. strpos(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> i64

    Return the position of the first occurrence of a string in another string. The first character of the string is at position 1. If no occurrence is found, 0 is returned. The case_sensitivity option applies to the substring argument.
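
    A minimal Python sketch of this 1-based convention (the helper name is ours):

    def strpos(input_str, substring):
        # str.find is 0-based and returns -1 when absent; adding 1 yields
        # the 1-based position, or 0 when no occurrence is found.
        return input_str.find(substring) + 1

    assert strpos("hello world", "world") == 7
    assert strpos("hello", "xyz") == 0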

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#regexp_strpos","title":"regexp_strpos","text":"

    Implementations: regexp_strpos(input, pattern, position, occurrence, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_strpos(varchar<L1>, varchar<L2>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64 1. regexp_strpos(string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64

    Return the position of an occurrence of the given regular expression pattern in a string. The first character of the string is at position 1. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The position in the string at which to begin searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. Which occurrence to return the position of is specified using the occurrence argument. Specifying 1 means the position of the first occurrence will be returned, 2 means the position of the second occurrence, and so on. The occurrence argument should be a positive non-zero integer. If no occurrence is found, 0 is returned. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the occurrence value is out of range, or the position value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#count_substring","title":"count_substring","text":"

    Implementations: count_substring(input, substring, option:case_sensitivity): -> return_type

  • input: The input string.
  • substring: The substring to count.
  • 0. count_substring(string, string, option:case_sensitivity): -> i64 1. count_substring(varchar<L1>, varchar<L2>, option:case_sensitivity): -> i64 2. count_substring(fixedchar<L1>, fixedchar<L2>, option:case_sensitivity): -> i64

    Return the number of non-overlapping occurrences of a substring in an input string. The case_sensitivity option applies to the substring argument.
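
    For example, occurrences are counted without overlap, as in this Python sketch (helper name ours):

    def count_substring(input_str, substring):
        # str.count counts non-overlapping occurrences, matching the
        # semantics described above.
        return input_str.count(substring)

    assert count_substring("abcabcabc", "abc") == 3
    assert count_substring("aaaa", "aa") == 2  # not 3: matches may not overlap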

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#regexp_count_substring","title":"regexp_count_substring","text":"

    Implementations: regexp_count_substring(input, pattern, position, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_count_substring(string, string, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64 1. regexp_count_substring(varchar<L1>, varchar<L2>, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64 2. regexp_count_substring(fixedchar<L1>, fixedchar<L2>, i64, option:case_sensitivity, option:multiline, option:dotall): -> i64

    Return the number of non-overlapping occurrences of a regular expression pattern in an input string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The position in the string at which to begin searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile or the position value is out of range.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#regexp_count_substring_1","title":"regexp_count_substring","text":"

    Implementations: regexp_count_substring(input, pattern, option:case_sensitivity, option:multiline, option:dotall): -> return_type 0. regexp_count_substring(string, string, option:case_sensitivity, option:multiline, option:dotall): -> i64

    Return the number of non-overlapping occurrences of a regular expression pattern in an input string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The match starts at the first character of the input string. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#replace","title":"replace","text":"

    Implementations: replace(input, substring, replacement, option:case_sensitivity): -> return_type

  • input: Input string.
  • substring: The substring to replace.
  • replacement: The replacement string.
  • 0. replace(string, string, string, option:case_sensitivity): -> string 1. replace(varchar<L1>, varchar<L2>, varchar<L3>, option:case_sensitivity): -> varchar<L1>

    Replace all occurrences of the substring with the replacement string. The case_sensitivity option applies to the substring argument.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • "},{"location":"extensions/functions_string/#concat_ws","title":"concat_ws","text":"

    Implementations: concat_ws(separator, string_arguments): -> return_type

  • separator: Character to separate strings by.
  • string_arguments: Strings to be concatenated.
  • 0. concat_ws(string, string): -> string 1. concat_ws(varchar<L2>, varchar<L1>): -> varchar<L1>

    Concatenate strings together separated by a separator.

    "},{"location":"extensions/functions_string/#repeat","title":"repeat","text":"

    Implementations: repeat(input, count): -> return_type 0. repeat(string, i64): -> string 1. repeat(varchar<L1>, i64): -> varchar<L1>

    Repeat a string count number of times.

    "},{"location":"extensions/functions_string/#reverse","title":"reverse","text":"

    Implementations: reverse(input): -> return_type 0. reverse(string): -> string 1. reverse(varchar<L1>): -> varchar<L1> 2. reverse(fixedchar<L1>): -> fixedchar<L1>

    Returns the string in reverse order.

    "},{"location":"extensions/functions_string/#replace_slice","title":"replace_slice","text":"

    Implementations: replace_slice(input, start, length, replacement): -> return_type

  • input: Input string.
  • start: The position in the string to start deleting/inserting characters.
  • length: The number of characters to delete from the input string.
  • replacement: The new string to insert at the start position.
  • 0. replace_slice(string, i64, i64, string): -> string 1. replace_slice(varchar<L1>, i64, i64, varchar<L2>): -> varchar<L1>

    Replace a slice of the input string. A specified ‘length’ of characters will be deleted from the input string beginning at the ‘start’ position and will be replaced by a new string. A start value of 1 indicates the first character of the input string. If start is negative or zero, or greater than the length of the input string, a null string is returned. If ‘length’ is negative, a null string is returned. If ‘length’ is zero, insertion of the new string occurs at the specified ‘start’ position and no characters are deleted. If ‘length’ is greater than the number of characters remaining after ‘start’, deletion will occur up to the last character of the input string.
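
    A Python sketch of these rules, modeling the null string as None (the helper name is ours, and out-of-range inputs simply return None here rather than erroring):

    def replace_slice(input_str, start, length, replacement):
        # start is 1-based; a non-positive or past-the-end start, or a
        # negative length, yields a null (None) result.
        if start <= 0 or start > len(input_str) or length < 0:
            return None
        i = start - 1
        # Delete `length` characters at `i` (clamped to the end of the
        # string) and insert the replacement there.
        return input_str[:i] + replacement + input_str[i + length:]

    assert replace_slice("hello", 2, 3, "XY") == "hXYo"
    assert replace_slice("hello", 2, 0, "XY") == "hXYello"   # pure insertion
    assert replace_slice("hello", 3, 99, "!") == "he!"       # delete to end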

    "},{"location":"extensions/functions_string/#lower","title":"lower","text":"

    Implementations: lower(input, option:char_set): -> return_type 0. lower(string, option:char_set): -> string 1. lower(varchar<L1>, option:char_set): -> varchar<L1> 2. lower(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#upper","title":"upper","text":"

    Implementations: upper(input, option:char_set): -> return_type 0. upper(string, option:char_set): -> string 1. upper(varchar<L1>, option:char_set): -> varchar<L1> 2. upper(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string to upper case characters. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#swapcase","title":"swapcase","text":"

    Implementations: swapcase(input, option:char_set): -> return_type 0. swapcase(string, option:char_set): -> string 1. swapcase(varchar<L1>, option:char_set): -> varchar<L1> 2. swapcase(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Transform the string’s lowercase characters to uppercase and uppercase characters to lowercase. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#capitalize","title":"capitalize","text":"

    Implementations: capitalize(input, option:char_set): -> return_type 0. capitalize(string, option:char_set): -> string 1. capitalize(varchar<L1>, option:char_set): -> varchar<L1> 2. capitalize(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Capitalize the first character of the input string. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#title","title":"title","text":"

    Implementations: title(input, option:char_set): -> return_type 0. title(string, option:char_set): -> string 1. title(varchar<L1>, option:char_set): -> varchar<L1> 2. title(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Converts the input string into titlecase. Capitalizes the first character of each word in the input string except for articles (a, an, the). Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#initcap","title":"initcap","text":"

    Implementations: initcap(input, option:char_set): -> return_type 0. initcap(string, option:char_set): -> string 1. initcap(varchar<L1>, option:char_set): -> varchar<L1> 2. initcap(fixedchar<L1>, option:char_set): -> fixedchar<L1>

    Capitalizes the first character of each word in the input string, including articles, and lowercases the rest. Implementation should follow the utf8_unicode_ci collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.

    Options:
  • char_set ['UTF8', 'ASCII_ONLY']
  • "},{"location":"extensions/functions_string/#char_length","title":"char_length","text":"

    Implementations: char_length(input): -> return_type 0. char_length(string): -> i64 1. char_length(varchar<L1>): -> i64 2. char_length(fixedchar<L1>): -> i64

    Return the number of characters in the input string. The length includes trailing spaces.

    "},{"location":"extensions/functions_string/#bit_length","title":"bit_length","text":"

    Implementations: bit_length(input): -> return_type 0. bit_length(string): -> i64 1. bit_length(varchar<L1>): -> i64 2. bit_length(fixedchar<L1>): -> i64

    Return the number of bits in the input string.

    "},{"location":"extensions/functions_string/#octet_length","title":"octet_length","text":"

    Implementations: octet_length(input): -> return_type 0. octet_length(string): -> i64 1. octet_length(varchar<L1>): -> i64 2. octet_length(fixedchar<L1>): -> i64

    Return the number of bytes in the input string.

    "},{"location":"extensions/functions_string/#regexp_replace","title":"regexp_replace","text":"

    Implementations: regexp_replace(input, pattern, replacement, position, occurrence, option:case_sensitivity, option:multiline, option:dotall): -> return_type

  • input: The input string.
  • pattern: The regular expression to search for within the input string.
  • replacement: The replacement string.
  • position: The position to start the search.
  • occurrence: Which occurrence of the match to replace.
  • 0. regexp_replace(string, string, string, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> string 1. regexp_replace(varchar<L1>, varchar<L2>, varchar<L3>, i64, i64, option:case_sensitivity, option:multiline, option:dotall): -> varchar<L1>

    Search a string for a substring that matches a given regular expression pattern and replace it with a replacement string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The occurrence of the pattern to be replaced is specified using the occurrence argument. Specifying 1 means only the first occurrence will be replaced, 2 means the second occurrence, and so on. Specifying 0 means all occurrences will be replaced. The position in the string at which to begin searching for pattern matches can be specified using the position argument. Specifying 1 means to search for matches starting at the first character of the input string, 2 means the second character, and so on. The position argument should be a positive non-zero integer. The replacement string can capture groups using numbered backreferences. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile, the replacement contains an illegal back-reference, the occurrence value is out of range, or the position value is out of range.
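
    The occurrence/position handling can be sketched in Python as follows (Python's re is not ICU, and the spec leaves out-of-range occurrences undefined; this sketch simply returns the input unchanged in that case):

    import re

    def regexp_replace(input_str, pattern, replacement, position=1, occurrence=0):
        # Searching begins at the 1-based `position`; text before it is
        # never modified.
        head, tail = input_str[:position - 1], input_str[position - 1:]
        if occurrence == 0:
            return head + re.sub(pattern, replacement, tail)
        matches = list(re.finditer(pattern, tail))
        if occurrence > len(matches):
            return input_str  # undefined per spec; unchanged in this sketch
        m = matches[occurrence - 1]
        # expand() applies numbered backreferences in the replacement.
        return head + tail[:m.start()] + m.expand(replacement) + tail[m.end():]

    assert regexp_replace("a1b2c3", r"[0-9]", "#") == "a#b#c#"
    assert regexp_replace("a1b2c3", r"[0-9]", "#", occurrence=2) == "a1b#c3"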

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#regexp_replace_1","title":"regexp_replace","text":"

    Implementations: regexp_replace(input, pattern, replacement, option:case_sensitivity, option:multiline, option:dotall): -> return_type

  • input: The input string.
  • pattern: The regular expression to search for within the input string.
  • replacement: The replacement string.
  • 0. regexp_replace(string, string, string, option:case_sensitivity, option:multiline, option:dotall): -> string

    Search a string for a substring that matches a given regular expression pattern and replace it with a replacement string. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The replacement string can capture groups using numbered backreferences. All occurrences of the pattern will be replaced. The search for matches starts at the first character of the input. The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string. Behavior is undefined if the regex fails to compile or the replacement contains an illegal back-reference.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#ltrim","title":"ltrim","text":"

    Implementations: ltrim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.
  • 0. ltrim(varchar<L1>, varchar<L2>): -> varchar<L1> 1. ltrim(string, string): -> string

    Remove any occurrence of the characters from the left side of the string. If no characters are specified, spaces are removed.

    "},{"location":"extensions/functions_string/#rtrim","title":"rtrim","text":"

    Implementations: rtrim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.
  • 0. rtrim(varchar<L1>, varchar<L2>): -> varchar<L1> 1. rtrim(string, string): -> string

    Remove any occurrence of the characters from the right side of the string. If no characters are specified, spaces are removed.

    "},{"location":"extensions/functions_string/#trim","title":"trim","text":"

    Implementations: trim(input, characters): -> return_type

  • input: The string to remove characters from.
  • characters: The set of characters to remove.
  • 0. trim(varchar<L1>, varchar<L2>): -> varchar<L1> 1. trim(string, string): -> string

    Remove any occurrence of the characters from the left and right sides of the string. If no characters are specified, spaces are removed.

    "},{"location":"extensions/functions_string/#lpad","title":"lpad","text":"

    Implementations: lpad(input, length, characters): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • characters: The string of characters to use for padding.
  • 0. lpad(varchar<L1>, i32, varchar<L2>): -> varchar<L1> 1. lpad(string, i32, string): -> string

    Left-pad the input string with the string of ‘characters’ until the specified length of the string has been reached. If the input string is longer than ‘length’, remove characters from the right side to shorten it to ‘length’ characters. If the string of ‘characters’ is longer than the remaining ‘length’ needed to be filled, only pad until ‘length’ has been reached. If ‘characters’ is not specified, the default value is a single space.
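
    These truncation and partial-padding rules can be expressed compactly in Python (helper name ours):

    def lpad(input_str, length, characters=" "):
        # Too-long inputs are truncated from the right side.
        if len(input_str) >= length:
            return input_str[:length]
        # Repeat the pad string, then cut it to exactly the width still needed.
        pad_width = length - len(input_str)
        return (characters * pad_width)[:pad_width] + input_str

    assert lpad("abc", 6, "xy") == "xyxabc"   # pad string cut mid-repeat
    assert lpad("abcdef", 4) == "abcd"        # input truncated to length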

    "},{"location":"extensions/functions_string/#rpad","title":"rpad","text":"

    Implementations: rpad(input, length, characters): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • characters: The string of characters to use for padding.
  • 0. rpad(varchar<L1>, i32, varchar<L2>): -> varchar<L1> 1. rpad(string, i32, string): -> string

    Right-pad the input string with the string of ‘characters’ until the specified length of the string has been reached. If the input string is longer than ‘length’, remove characters from the left side to shorten it to ‘length’ characters. If the string of ‘characters’ is longer than the remaining ‘length’ needed to be filled, only pad until ‘length’ has been reached. If ‘characters’ is not specified, the default value is a single space.

    "},{"location":"extensions/functions_string/#center","title":"center","text":"

    Implementations: center(input, length, character, option:padding): -> return_type

  • input: The string to pad.
  • length: The length of the output string.
  • character: The character to use for padding.
  • 0. center(varchar<L1>, i32, varchar<L1>, option:padding): -> varchar<L1> 1. center(string, i32, string, option:padding): -> string

    Center the input string by padding the sides with a single character until the specified length of the string has been reached. By default, if reaching the length requires an uneven amount of padding, the extra padding is applied to the right side. The side that receives the extra padding can be controlled with the padding option. Behavior is undefined if the number of characters passed to the character argument is not 1.

    Options:
  • padding ['RIGHT', 'LEFT']
  • "},{"location":"extensions/functions_string/#left","title":"left","text":"

    Implementations: left(input, count): -> return_type 0. left(varchar<L1>, i32): -> varchar<L1> 1. left(string, i32): -> string

    Extract count characters starting from the left of the string.

    "},{"location":"extensions/functions_string/#right","title":"right","text":"

    Implementations: right(input, count): -> return_type 0. right(varchar<L1>, i32): -> varchar<L1> 1. right(string, i32): -> string

    Extract count characters starting from the right of the string.

    "},{"location":"extensions/functions_string/#string_split","title":"string_split","text":"

    Implementations: string_split(input, separator): -> return_type

  • input: The input string.
  • separator: A character used for splitting the string.
  • 0. string_split(varchar<L1>, varchar<L2>): -> List<varchar<L1>> 1. string_split(string, string): -> List<string>

    Split a string into a list of strings, based on a specified separator character.

    "},{"location":"extensions/functions_string/#regexp_string_split","title":"regexp_string_split","text":"

    Implementations: regexp_string_split(input, pattern, option:case_sensitivity, option:multiline, option:dotall): -> return_type

  • input: The input string.
  • pattern: The regular expression to search for within the input string.
  • 0. regexp_string_split(varchar<L1>, varchar<L2>, option:case_sensitivity, option:multiline, option:dotall): -> List<varchar<L1>> 1. regexp_string_split(string, string, option:case_sensitivity, option:multiline, option:dotall): -> List<string>

    Split a string into a list of strings, based on a regular expression pattern. The substrings matched by the pattern will be used as the separators to split the input string and will not be included in the resulting list. The regular expression pattern should follow the International Components for Unicode implementation (https://unicode-org.github.io/icu/userguide/strings/regexp.html). The case_sensitivity option specifies case-sensitive or case-insensitive matching. Enabling the multiline option will treat the input string as multiple lines. This makes the ^ and $ characters match at the beginning and end of any line, instead of just the beginning and end of the input string. Enabling the dotall option makes the . character match line terminator characters in a string.

    Options:
  • case_sensitivity ['CASE_SENSITIVE', 'CASE_INSENSITIVE', 'CASE_INSENSITIVE_ASCII']
  • multiline ['MULTILINE_DISABLED', 'MULTILINE_ENABLED']
  • dotall ['DOTALL_DISABLED', 'DOTALL_ENABLED']
  • "},{"location":"extensions/functions_string/#aggregate-functions","title":"Aggregate Functions","text":""},{"location":"extensions/functions_string/#string_agg","title":"string_agg","text":"

    Implementations: string_agg(input, separator): -> return_type

  • input: Column of string values.
  • separator: Separator for concatenated strings
  • 0. string_agg(string, string): -> string

    Concatenates a column of string values with a separator.

    "},{"location":"relations/basics/","title":"Basics","text":"

    Substrait is designed to allow a user to describe arbitrarily complex data transformations. These transformations are composed of one or more relational operations. Relational operations are well-defined transformation operations that work by taking zero or more input datasets and transforming them into zero or more output datasets. Substrait defines a core set of transformations, but users are also able to extend the operations with their own specialized operations.

    "},{"location":"relations/basics/#plans","title":"Plans","text":"

    A plan is a tree of relations. The root of the tree is the final output of the plan. Each node in the tree is a relational operation. The children of a node are the inputs to the operation. The leaves of the tree are the input datasets to the plan.

    Plans can be composed together using reference relations. This allows for the construction of common plans that can be reused in multiple places. If a plan has no cycles (there is only one plan or each reference relation only references later plans) then the plan will form a DAG (Directed Acyclic Graph).

    "},{"location":"relations/basics/#relational-operators","title":"Relational Operators","text":"

    Each relational operation is composed of several properties. Common properties for relational operations include the following:

  • Emit (Logical & Physical): The set of columns output from this operation and the order of those columns.
  • Hints (Physical): A set of optionally provided, optionally consumed information about an operation that better informs execution. These might include estimated number of input and output records, estimated record size, likely filter reduction, estimated dictionary size, etc. These can also include implementation specific pieces of execution information.
  • Constraint (Physical): A set of runtime constraints around the operation, limiting its consumption based on real-world resources (CPU, memory) as well as virtual resources like number of records produced, the largest record size, etc.
    "},{"location":"relations/basics/#relational-signatures","title":"Relational Signatures","text":"

    In functions, function signatures are declared externally to the use of those signatures (function bindings). In the case of relational operations, signatures are declared directly in the specification. This is due to the speed of change and number of total operations. Relational operations in the specification are expected to be <100 for several years with additions being infrequent. On the other hand, there is an expectation of both a much larger number of functions (1,000s) and a much higher velocity of additions.

    Each relational operation must declare the following:

    • Transformation logic around properties of the data. For example, does a relational operation maintain sortedness of a field? Does an operation change the distribution of data?
    • How many input relations does an operation require?
  • Does the operator produce an output (by specification, we limit relational operations to a single output at this time)?
    • What is the schema and field ordering of an output (see emit below)?
    "},{"location":"relations/basics/#emit-output-ordering","title":"Emit: Output Ordering","text":"

    A relational operation uses field references to access specific fields of the input stream. Field references are always ordinal based on the order of the incoming streams. Each relational operation must declare the order of its output data. To simplify things, each relational operation can be in one of two modes:

    1. Direct output: The order of outputs is based on the definition declared by the relational operation.
    2. Remap: A listed ordering of the direct outputs. This remapping can also be used to drop columns no longer used (such as a filter field or join keys after a join). Note that remapping/exclusion can only be done at the output’s root struct. Filtering of compound values or extracting subsets must be done through other operation types (e.g. projection). A small sketch of remap follows this list.
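
    Conceptually, a remap is just a list of ordinals applied to each record of the relation's direct output, as in this Python sketch (helper name ours):

    def apply_emit(row, emit):
        # `row` is the relation's direct output as a tuple; `emit` is a
        # list of 0-based ordinals that selects and reorders columns
        # (columns not listed are dropped).
        return tuple(row[i] for i in emit)

    # Direct output of a join: left columns followed by right columns.
    joined = ("l_key", "l_value", "r_key", "r_value")
    # Drop the duplicate join key and reorder the rest.
    assert apply_emit(joined, [1, 3, 0]) == ("l_value", "r_value", "l_key")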
    "},{"location":"relations/basics/#relation-properties","title":"Relation Properties","text":"

    There are a number of predefined properties that exist in Substrait relations. These include the following.

    "},{"location":"relations/basics/#distribution","title":"Distribution","text":"

    When data is partitioned across multiple sibling sets, distribution describes the set of properties that apply to any one partition. This is based on a set of distribution expression properties. A distribution is declared as a set of one or more fields and a distribution type across all fields.

  • Distribution Fields: List of field references that describe distribution (e.g. [0,2:4,5:0:0]). The order of these references does not impact results. (Required for partitioned distribution type; disallowed for singleton distribution type.)
  • Distribution Type: PARTITIONED: for a discrete tuple of values for the declared distribution fields, all records with that tuple are located in the same partition. SINGLETON: there will only be a single partition for this operation. (Required)
    "},{"location":"relations/basics/#orderedness","title":"Orderedness","text":"

    A guarantee that data output from this operation is provided with a sort order. The sort order will be declared based on a set of sort field definitions based on the emitted output of this operation.

  • Sort Fields: A list of fields that the data are ordered by. The list is in order of the sort. If we sort by [0,1] then this means we only consider the data for field 1 to be ordered within each discrete value of field 0. (At least one required.)
  • Per Sort Field: A field reference that the data is sorted by. (Required)
  • Per Sort Direction: The direction of the data. See direction options below. (Required)
    "},{"location":"relations/basics/#ordering-directions","title":"Ordering Directions","text":"
  • Ascending (nulls first): Returns data in ascending order based on the quality function associated with the type. Nulls are included before any values.
  • Descending (nulls first): Returns data in descending order based on the quality function associated with the type. Nulls are included before any values.
  • Ascending (nulls last): Returns data in ascending order based on the quality function associated with the type. Nulls are included after any values.
  • Descending (nulls last): Returns data in descending order based on the quality function associated with the type. Nulls are included after any values.
  • Custom function identifier (nulls position per function): Returns data using a custom function that returns -1, 0, or 1 depending on the order of the data.
  • Clustered (nulls may appear anywhere but will be coalesced): Ensures that all equal values are coalesced (but no ordering between values is defined). E.g. for values 1,2,3,1,2,3, output could be any of the following: 1,1,2,2,3,3 or 1,1,3,3,2,2 or 2,2,1,1,3,3 or 2,2,3,3,1,1 or 3,3,1,1,2,2 or 3,3,2,2,1,1.

    Discussion Points
    • Should read definition types be more extensible in the same way that function signatures are? Are extensible read definition types necessary if we have custom relational operators?
    • How are decomposed reads expressed? For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object.
    "},{"location":"relations/common_fields/","title":"Common Fields","text":"

    Every relation contains a common section containing optional hints and emit behavior.

    "},{"location":"relations/common_fields/#emit","title":"Emit","text":"

    A relation which has a direct emit kind outputs the relation’s columns without reordering or selection. A relation that specifies an emit output mapping can output its columns in any order and may leave columns out.

    Relation Output
  • Many relations (such as Project) by default output all of their input columns, plus any generated columns, as their output columns. Review each relation to understand its specific output default.
    "},{"location":"relations/common_fields/#hints","title":"Hints","text":"

    Hints provide information that can improve performance but cannot be used to control the behavior. Table statistics, runtime constraints, name hints, and saved computations all fall into this category.

    Hint Design
    • If a hint is not present or has incorrect data the consumer should be able to ignore it and still arrive at the correct result.
    "},{"location":"relations/common_fields/#saved-computations","title":"Saved Computations","text":"

    Computations can be used to save a data structure to use elsewhere. For instance, let’s say we have a plan with a HashEquiJoin and an AggregateDistinct operation. The HashEquiJoin could save its hash table as part of saved computation id number 1 and the AggregateDistinct could read in computation id number 1.

    "},{"location":"relations/embedded_relations/","title":"Embedded Relations","text":"

    Pending.

    Embedded relations allow a Substrait producer to define a set operation that will be embedded in the plan.

    TODO: define lots of details about what interfaces, languages, formats, etc. Should reasonably be an extension of embedded user defined table functions.

    "},{"location":"relations/logical_relations/","title":"Logical Relations","text":""},{"location":"relations/logical_relations/#read-operator","title":"Read Operator","text":"

    The read operator is an operator that produces one output. A simple example would be the reading of a Parquet file. It is expected that many types of reads will be added over time.

  • Inputs: 0
  • Outputs: 1
  • Property Maintenance: N/A (no inputs)
  • Direct Output Order: Defaults to the schema of the data read after the optional projection (masked complex expression) is applied.
    "},{"location":"relations/logical_relations/#read-properties","title":"Read Properties","text":"
  • Definition: The contents of the read property definition. (Required)
  • Direct Schema: Defines the schema of the output of the read (before any projection or emit remapping/hiding). (Required)
  • Filter: A boolean Substrait expression that describes a filter that must be applied to the data. The filter should be interpreted against the direct schema. (Optional, defaults to none.)
  • Best Effort Filter: A boolean Substrait expression that describes a filter that may be applied to the data. The filter should be interpreted against the direct schema. (Optional, defaults to none.)
  • Projection: A masked complex expression describing the portions of the content that should be read. (Optional, defaults to all of schema.)
  • Output Properties: Declaration of orderedness and/or distribution properties this read produces. (Optional, defaults to no properties.)
  • Properties: A list of name/value pairs associated with the read. (Optional, defaults to empty.)
    "},{"location":"relations/logical_relations/#read-filtering","title":"Read Filtering","text":"

    The read relation has two different filter properties: a filter, which must be satisfied by the operator, and a best effort filter, which does not have to be satisfied. This reflects the way that consumers are often implemented. A consumer is often only able to fully apply a limited set of operations in the scan. There can then be an extended set of operations which a consumer can apply in a best effort fashion. A producer, when setting these two fields, should take care to only use expressions that the consumer is capable of handling.

    As an example, a consumer may only be able to fully apply (in the read relation) <, =, and > on integral types. The consumer may be able to apply <, =, and > in a best effort fashion on decimal and string types. Consider the filter expression my_int < 10 && my_string < "x" && upper(my_string) > "B". In this case the filter should be set to my_int < 10, the best_effort_filter should be set to my_string < "x", and the remaining portion (upper(my_string) > "B") should be put into a filter relation.
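
    A producer might implement that decomposition along these lines (a Python sketch over AND-ed conjuncts; the helper and the two capability predicates are hypothetical, not part of the specification):

    def split_scan_filters(conjuncts, fully_supported, best_effort_supported):
        # Route each conjunct to `filter`, `best_effort_filter`, or a
        # filter relation placed above the read.
        filter_, best_effort, remaining = [], [], []
        for c in conjuncts:
            if fully_supported(c):
                filter_.append(c)
            elif best_effort_supported(c):
                best_effort.append(c)
            else:
                remaining.append(c)
        return filter_, best_effort, remaining

    conjuncts = ['my_int < 10', 'my_string < "x"', 'upper(my_string) > "B"']
    f, b, r = split_scan_filters(
        conjuncts,
        fully_supported=lambda c: c == 'my_int < 10',
        best_effort_supported=lambda c: c == 'my_string < "x"')
    assert (f, b, r) == (['my_int < 10'], ['my_string < "x"'],
                         ['upper(my_string) > "B"'])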

    A filter expression must be interpreted against the direct schema before the projection expression has been applied. As a result, fields may be referenced by the filter expression which are not included in the relation’s output.

    "},{"location":"relations/logical_relations/#read-definition-types","title":"Read Definition Types","text":"Adding new Read Definition Types

    If you have a read definition that’s not covered here, see the process for adding new read definition types.

    Read definition types (like the rest of the features in Substrait) are built by the community and added to the specification.

    "},{"location":"relations/logical_relations/#virtual-table","title":"Virtual Table","text":"

    A virtual table is a table whose contents are embedded in the plan itself. The table data is encoded as records consisting of literal values or expressions that can be resolved without referencing any input data. For example, a literal, a function call involving literals, or any other expression that does not require input.

  • Data: The embedded records of the virtual table, expressed as literals or as expressions that require no input. (Required)
    "},{"location":"relations/logical_relations/#named-table","title":"Named Table","text":"

    A named table is a reference to data defined elsewhere. For example, there may be a catalog of tables with unique names that both the producer and consumer agree on. This catalog would provide the consumer with more information on how to retrieve the data.

  • Names: A list of namespaced strings that, together, form the table name. (Required, at least one)
    "},{"location":"relations/logical_relations/#files-type","title":"Files Type","text":"
  • Items: An array of Items (path or path glob) associated with the read. (Required)
  • Format per item: Enumeration of available formats; see the file_format options (e.g. PARQUET, ARROW, ORC, DWRF, delimiter-separated text) in the message below. (Required)
  • Slicing parameters per item: Information to use when reading a slice of a file. (Optional)
    "},{"location":"relations/logical_relations/#slicing-files","title":"Slicing Files","text":"

    A read operation is allowed to only read part of a file. This is convenient, for example, when distributing a read operation across several nodes. The slicing parameters are specified as byte offsets into the file.

    Many file formats consist of indivisible “chunks” of data (e.g. Parquet row groups). If this happens the consumer can determine which slice a particular chunk belongs to. For example, one possible approach is that a chunk should only be read if the midpoint of the chunk (dividing by 2 and rounding down) is contained within the asked-for byte range.
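
    That midpoint rule can be stated precisely in a few lines of Python (one possible policy, not mandated by the specification):

    def chunk_in_slice(chunk_start, chunk_length, slice_start, slice_length):
        # Read the chunk only if its midpoint (rounding down) falls inside
        # the slice's half-open byte range [start, start + length).
        midpoint = chunk_start + chunk_length // 2
        return slice_start <= midpoint < slice_start + slice_length

    # A 100-byte row group starting at byte 950 has midpoint 1000, so it
    # belongs to the slice starting at byte 1000, not the slice covering 0-999.
    assert chunk_in_slice(950, 100, 1000, 1000)
    assert not chunk_in_slice(950, 100, 0, 1000)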

    ReadRel Message
    message ReadRel {
      RelCommon common = 1;
      NamedStruct base_schema = 2;
      Expression filter = 3;
      Expression best_effort_filter = 11;
      Expression.MaskExpression projection = 4;
      substrait.extensions.AdvancedExtension advanced_extension = 10;

      // Definition of which type of scan operation is to be performed
      oneof read_type {
        VirtualTable virtual_table = 5;
        LocalFiles local_files = 6;
        NamedTable named_table = 7;
        ExtensionTable extension_table = 8;
      }

      // A base table. The list of string is used to represent namespacing (e.g., mydb.mytable).
      // This assumes shared catalog between systems exchanging a message.
      message NamedTable {
        repeated string names = 1;
        substrait.extensions.AdvancedExtension advanced_extension = 10;
      }

      // A table composed of expressions.
      message VirtualTable {
        repeated Expression.Literal.Struct values = 1 [deprecated = true];
        repeated Expression.Nested.Struct expressions = 2;
      }

      // A stub type that can be used to extend/introduce new table types outside
      // the specification.
      message ExtensionTable {
        google.protobuf.Any detail = 1;
      }

      // Represents a list of files in input of a scan operation
      message LocalFiles {
        repeated FileOrFiles items = 1;
        substrait.extensions.AdvancedExtension advanced_extension = 10;

        // Many files consist of indivisible chunks (e.g. parquet row groups
        // or CSV rows).  If a slice partially selects an indivisible chunk
        // then the consumer should employ some rule to decide which slice to
        // include the chunk in (e.g. include it in the slice that contains
        // the midpoint of the chunk)
        message FileOrFiles {
          oneof path_type {
            // A URI that can refer to either a single folder or a single file
            string uri_path = 1;
            // A URI where the path portion is a glob expression that can
            // identify zero or more paths.
            // Consumers should support the POSIX syntax.  The recursive
            // globstar (**) may not be supported.
            string uri_path_glob = 2;
            // A URI that refers to a single file
            string uri_file = 3;
            // A URI that refers to a single folder
            string uri_folder = 4;
          }

          // Original file format enum, superseded by the file_format oneof.
          reserved 5;
          reserved "format";

          // The index of the partition this item belongs to
          uint64 partition_index = 6;

          // The start position in byte to read from this item
          uint64 start = 7;

          // The length in byte to read from this item
          uint64 length = 8;

          message ParquetReadOptions {}
          message ArrowReadOptions {}
          message OrcReadOptions {}
          message DwrfReadOptions {}
          message DelimiterSeparatedTextReadOptions {
            // Delimiter separated files may be compressed.  The reader should
            // autodetect this and decompress as needed.

            // The character(s) used to separate fields.  Common values are comma,
            // tab, and pipe.  Multiple characters are allowed.
            string field_delimiter = 1;
            // The maximum number of bytes to read from a single line.  If a line
            // exceeds this limit the resulting behavior is undefined.
            uint64 max_line_size = 2;
            // The character(s) used to quote strings.  Common values are single
            // and double quotation marks.
            string quote = 3;
            // The number of lines to skip at the beginning of the file.
            uint64 header_lines_to_skip = 4;
            // The character used to escape characters in strings.  Backslash is
            // a common value.  Note that a double quote mark can also be used as an
            // escape character but the external quotes should be removed first.
            string escape = 5;
            // If this value is encountered (including empty string), the resulting
            // value is null instead.  Leave unset to disable.  If this value is
            // provided, the effective schema of this file is comprised entirely of
            // nullable strings.  If not provided, the effective schema is instead
            // made up of non-nullable strings.
            optional string value_treated_as_null = 6;
          }

          // The format of the files along with options for reading those files.
          oneof file_format {
            ParquetReadOptions parquet = 9;
            ArrowReadOptions arrow = 10;
            OrcReadOptions orc = 11;
            google.protobuf.Any extension = 12;
            DwrfReadOptions dwrf = 13;
            DelimiterSeparatedTextReadOptions text = 14;
          }
        }
      }
    }
    "},{"location":"relations/logical_relations/#filter-operation","title":"Filter Operation","text":"

    The filter operator eliminates one or more records from the input data based on a boolean filter expression.

  • Inputs: 1
  • Outputs: 1
  • Property Maintenance: Orderedness and distribution are maintained, remapped by emit.
  • Direct Output Order: The same field order as the input.
    "},{"location":"relations/logical_relations/#filter-properties","title":"Filter Properties","text":"
  • Input: The relational input. (Required)
  • Expression: A boolean expression which describes which records are included/excluded. (Required)

    FilterRel Message
    message FilterRel {
      RelCommon common = 1;
      Rel input = 2;
      Expression condition = 3;
      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }
    "},{"location":"relations/logical_relations/#sort-operation","title":"Sort Operation","text":"

    The sort operator reorders a dataset based on one or more identified sort fields and a sorting function for each.

  • Inputs: 1
  • Outputs: 1
  • Property Maintenance: Updates the orderedness property to the output of the sort operation. The distribution property is only remapped based on emit.
  • Direct Output Order: The field order of the input.
    "},{"location":"relations/logical_relations/#sort-properties","title":"Sort Properties","text":"
  • Input: The relational input. (Required)
  • Sort Fields: List of one or more fields to sort by. Uses the same properties as the orderedness property. (At least one sort field required)

    SortRel Message
    message SortRel {
      RelCommon common = 1;
      Rel input = 2;
      repeated SortField sorts = 3;
      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }
    "},{"location":"relations/logical_relations/#project-operation","title":"Project Operation","text":"

    The project operation will produce one or more additional expressions based on the inputs of the dataset.

  • Inputs: 1
  • Outputs: 1
  • Property Maintenance: Distribution maintained, mapped by emit. Orderedness: maintained if there are no window operations, and extended to include projection fields if fields are direct references. If window operations are present, no orderedness is maintained.
  • Direct Output Order: The field order of the input plus the list of new expressions in the order they are declared in the expressions list.
    "},{"location":"relations/logical_relations/#project-properties","title":"Project Properties","text":"
  • Input: The relational input. (Required)
  • Expressions: List of one or more expressions to add to the input. (At least one expression required)

    ProjectRel Message
    message ProjectRel {
      RelCommon common = 1;
      Rel input = 2;
      repeated Expression expressions = 3;
      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }
    "},{"location":"relations/logical_relations/#cross-product-operation","title":"Cross Product Operation","text":"

    The cross product operation will combine two separate inputs into a single output. It pairs every record from the left input with every record of the right input.

  • Inputs: 2
  • Outputs: 1
  • Property Maintenance: Distribution is maintained. Orderedness is empty post operation.
  • Direct Output Order: The emit order of the left input followed by the emit order of the right input.
    "},{"location":"relations/logical_relations/#cross-product-properties","title":"Cross Product Properties","text":"
  • Left Input: A relational input. (Required)
  • Right Input: A relational input. (Required)

    CrossRel Message
    message CrossRel {
      RelCommon common = 1;
      Rel left = 2;
      Rel right = 3;

      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }
    "},{"location":"relations/logical_relations/#join-operation","title":"Join Operation","text":"

    The join operation will combine two separate inputs into a single output, based on a join expression. A common subtype of joins is an equality join where the join expression is constrained to a list of equality (or equality + null equality) conditions between the two inputs of the join.

  • Inputs: 2
  • Outputs: 1
  • Property Maintenance: Distribution is maintained. Orderedness is empty post operation. Physical relations may provide better property maintenance.
  • Direct Output Order: The emit order of the left input followed by the emit order of the right input.
    "},{"location":"relations/logical_relations/#join-properties","title":"Join Properties","text":"
  • Left Input: A relational input. (Required)
  • Right Input: A relational input. (Required)
  • Join Expression: A boolean condition that describes whether each record from the left set “matches” the record from the right set. Field references correspond to the direct output order of the data. (Required. Can be the literal True.)
  • Post-Join Filter: A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfy the condition. (Optional)
  • Join Type: One of the join types defined below. (Required)
    "},{"location":"relations/logical_relations/#join-types","title":"Join Types","text":"
  • Inner: Return records from the left side only if they match the right side. Return records from the right side only when they match the left side. For each cross input match, return a record including the data from both sides. Non-matching records are ignored.
  • Outer: Return all records from both the left and right inputs. For each cross input match, return a record including the data from both sides. For any remaining non-matching records, return the record from the corresponding input along with nulls for the opposite input.
  • Left: Return all records from the left input. For each cross input match, return a record including the data from both sides. For any remaining non-matching records from the left input, return the left record along with nulls for the right input.
  • Right: Return all records from the right input. For each cross input match, return a record including the data from both sides. For any remaining non-matching records from the right input, return the right record along with nulls for the left input.
  • Left Semi: Returns records from the left input. These are returned only if the records have a join partner on the right side.
  • Right Semi: Returns records from the right input. These are returned only if the records have a join partner on the left side.
  • Left Anti: Return records from the left input. These are returned only if the records do not have a join partner on the right side.
  • Right Anti: Return records from the right input. These are returned only if the records do not have a join partner on the left side.
  • Left Single: Return all records from the left input with no join expansion. If at least one record from the right input matches the left, return one arbitrary matching record from the right input. For any left records without matching right records, return the left record along with nulls for the right input. Similar to a left outer join but returns at most one right match. Useful for nested sub-queries where we need exactly one record in output (or throw an exception). See Section 3.2 of https://15721.courses.cs.cmu.edu/spring2018/papers/16-optimizer2/hyperjoins-btw2017.pdf for more information.
  • Right Single: Same as left single except that the right and left inputs are switched.
  • Left Mark: Returns one record for each record from the left input. Appends one additional “mark” column to the output of the join. The new column will be listed after all columns from both sides and will be of type nullable boolean. If there is at least one join partner in the right input where the join condition evaluates to true then the mark column will be set to true. Otherwise, if there is at least one join partner in the right input where the join condition evaluates to NULL then the mark column will be set to NULL. Otherwise the mark column will be set to false.
  • Right Mark: Returns records from the right input. Appends one additional “mark” column to the output of the join. The new column will be listed after all columns from both sides and will be of type nullable boolean. If there is at least one join partner in the left input where the join condition evaluates to true then the mark column will be set to true. Otherwise, if there is at least one join partner in the left input where the join condition evaluates to NULL then the mark column will be set to NULL. Otherwise the mark column will be set to false.

    JoinRel Message
    message JoinRel {
      RelCommon common = 1;
      Rel left = 2;
      Rel right = 3;
      Expression expression = 4;
      Expression post_join_filter = 5;

      JoinType type = 6;

      enum JoinType {
        JOIN_TYPE_UNSPECIFIED = 0;
        JOIN_TYPE_INNER = 1;
        JOIN_TYPE_OUTER = 2;
        JOIN_TYPE_LEFT = 3;
        JOIN_TYPE_RIGHT = 4;
        JOIN_TYPE_LEFT_SEMI = 5;
        JOIN_TYPE_LEFT_ANTI = 6;
        JOIN_TYPE_LEFT_SINGLE = 7;
        JOIN_TYPE_RIGHT_SEMI = 8;
        JOIN_TYPE_RIGHT_ANTI = 9;
        JOIN_TYPE_RIGHT_SINGLE = 10;
        JOIN_TYPE_LEFT_MARK = 11;
        JOIN_TYPE_RIGHT_MARK = 12;
      }

      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }
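
    The mark-join semantics described above (true if any match is true, else NULL if any comparison is NULL, else false) can be sketched as a naive nested loop in Python, with None standing in for NULL (helper name ours):

    def left_mark_join(left_rows, right_rows, condition):
        # `condition` returns True, False, or None (NULL) for a pair of rows.
        # Each left row yields exactly one output row with a mark column.
        out = []
        for l in left_rows:
            results = [condition(l, r) for r in right_rows]
            if any(v is True for v in results):
                mark = True
            elif any(v is None for v in results):
                mark = None
            else:
                mark = False
            out.append(l + (mark,))
        return out

    # An x IN (subquery)-style check where comparisons with NULL are NULL.
    left = [(1,), (2,), (None,)]
    right = [(1,), (None,)]
    cond = lambda l, r: None if l[0] is None or r[0] is None else l[0] == r[0]
    assert left_mark_join(left, right, cond) == [(1, True), (2, None), (None, None)]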
    "},{"location":"relations/logical_relations/#set-operation","title":"Set Operation","text":"

    The set operation encompasses several set-level operations that support combining datasets, possibly excluding records based on various types of record-level matching.

    Signature Value Inputs 2 or more Outputs 1 Property Maintenance Maintains distribution if all inputs have the same ordinal distribution. Orderedness is not maintained. Direct Output Order The field order of the inputs. All inputs must have identical field types, but field nullabilities may vary."},{"location":"relations/logical_relations/#set-properties","title":"Set Properties","text":"Property Description Required Primary Input The primary input of the dataset. Required Secondary Inputs One or more relational inputs. At least one required Set Operation Type From the list below. Required"},{"location":"relations/logical_relations/#set-operation-types","title":"Set Operation Types","text":"

    The set operation type determines both the records that are emitted and the type of the output record.

    For some set operations, whether a specific record is included in the output and whether it appears more than once depends on the number of times it occurs across all inputs. In the following table, treat:

    • m: the number of times a record occurs in the primary input (p)
    • n1: the number of times a record occurs in the 1st secondary input (s1)
    • n2: the number of times a record occurs in the 2nd secondary input (s2)
    • \u2026
    • n: the number of times a record occurs in the nth secondary input

    Operation Description Examples Output Nullability Minus (Primary) Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicates. Each value is treated as a unique member of the set, so duplicates in the first set don\u2019t affect the result. This operation maps to SQL EXCEPT DISTINCT. MINUS\u00a0\u00a0p: {1, 2, 2, 3, 3, 3, 4}\u00a0\u00a0s1: {1, 2}\u00a0\u00a0s2: {3} YIELDS {4} The same as the primary input. Minus (Primary All) Returns all records from the primary input excluding any matching records from secondary inputs. For each specific record returned, the output contains max(0, m - sum(n1, n2, \u2026, n)) copies. This operation maps to SQL EXCEPT ALL. MINUS ALL\u00a0\u00a0p: {1, 2, 2, 3, 3, 3, 3}\u00a0\u00a0s1: {1, 2, 3, 4}\u00a0\u00a0s2: {3} YIELDS {2, 3, 3} The same as the primary input. Minus (Multiset) Returns all records from the primary input excluding any records that are included in all secondary inputs. This operation does not have a direct SQL mapping. MINUS MULTISET\u00a0\u00a0p: {1, 2, 3, 4}\u00a0\u00a0s1: {1, 2}\u00a0\u00a0s2: {1, 2, 3} YIELDS {3, 4} The same as the primary input. Intersection (Primary) Returns all records from the primary input that are present in any secondary input, removing duplicates. This operation does not have a direct SQL mapping. INTERSECT\u00a0\u00a0p: {1, 2, 2, 3, 3, 3, 4}\u00a0\u00a0s1: {1, 2, 3, 5}\u00a0\u00a0s2: {2, 3, 6} YIELDS {1, 2, 3} If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output. Intersection (Multiset) Returns all records from the primary input that match at least one record from all secondary inputs. This operation maps to SQL INTERSECT DISTINCT. INTERSECT MULTISET\u00a0\u00a0p: {1, 2, 3, 4}\u00a0\u00a0s1: {2, 3}\u00a0\u00a0s2: {3, 4} YIELDS {3} If a field is required in any of the inputs, it is required in the output. Intersection (Multiset All) Returns all records from the primary input that are present in every secondary input. For each specific record returned, the output contains min(m, n1, n2, \u2026, n) copies. This operation maps to SQL INTERSECT ALL. INTERSECT ALL\u00a0\u00a0p: {1, 2, 2, 3, 3, 3, 4}\u00a0\u00a0s1: {1, 2, 3, 3, 5}\u00a0\u00a0s2: {2, 3, 3, 6} YIELDS {2, 3, 3} If a field is required in any of the inputs, it is required in the output. Union Distinct Returns all records from each set, removing duplicates. This operation maps to SQL UNION DISTINCT. UNION\u00a0\u00a0p: {1, 2, 2, 3, 3, 3, 4}\u00a0\u00a0s1: {2, 3, 5}\u00a0\u00a0s2: {1, 6} YIELDS {1, 2, 3, 4, 5, 6} If a field is nullable in any of the inputs, it is nullable in the output. Union All Returns all records from all inputs. For each specific record returned, the output contains (m + n1 + n2 + \u2026 + n) copies. This operation maps to SQL UNION ALL. UNION ALL\u00a0\u00a0p: {1, 2, 2, 3, 3, 3, 4}\u00a0\u00a0s1: {2, 3, 5}\u00a0\u00a0s2: {1, 6} YIELDS {1, 2, 2, 3, 3, 3, 4, 2, 3, 5, 1, 6} If a field is nullable in any of the inputs, it is nullable in the output.

    Note that for set operations, NULL matches NULL. That is:

    {NULL, 1, 3} MINUS          {NULL, 2, 4} === (1), (3)\n{NULL, 1, 3} INTERSECTION   {NULL, 2, 3} === (NULL), (3)\n{NULL, 1, 3} UNION DISTINCT {NULL, 2, 4} === (NULL), (1), (2), (3), (4)\n
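
    To make the multiplicity rules above concrete, the following illustrative Python sketch (not part of the specification) computes the per-record output counts for the ALL variants. Records are modeled as hashable values; because None compares equal to None in Python, the NULL-matches-NULL rule falls out naturally.

    Set Operation Multiplicity Sketch (Python)
    from collections import Counter\n\n# illustrative sketch of the multiplicity rules in the table above\ndef minus_all(p, secondaries):\n    m = Counter(p)\n    n = sum((Counter(s) for s in secondaries), Counter())  # n1 + n2 + ... + n\n    return {r: max(0, m[r] - n[r]) for r in m}\n\ndef intersect_all(p, secondaries):\n    counts = [Counter(s) for s in secondaries]\n    return {r: min([m_r] + [c[r] for c in counts])\n            for r, m_r in Counter(p).items()}\n\ndef union_all(inputs):\n    return sum((Counter(i) for i in inputs), Counter())  # m + n1 + ... + n\n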

    "},{"location":"relations/logical_relations/#output-type-derivation-examples","title":"Output Type Derivation Examples","text":"

    Given the following inputs, where R is Required and N is Nullable:

    Input 1: (R, R, R, R, N, N, N, N)  Primary Input\nInput 2: (R, R, N, N, R, R, N, N)  Secondary Input\nInput 3: (R, N, R, N, R, N, R, N)  Secondary Input\n

    The output type is as follows for the various operations

    Operation Output Type Minus (Primary) (R, R, R, R, N, N, N, N) Minus (Primary All) (R, R, R, R, N, N, N, N) Minus (Multiset) (R, R, R, R, N, N, N, N) Intersection (Primary) (R, R, R, R, R, N, N, N) Intersection (Multiset) (R, R, R, R, R, R, R, N) Intersection (Multiset All) (R, R, R, R, R, R, R, N) Union Distinct (R, N, N, N, N, N, N, N) Union All (R, N, N, N, N, N, N, N) SetRel Message
    message SetRel {\n  RelCommon common = 1;\n  // The first input is the primary input, the remaining are secondary\n  // inputs.  There must be at least two inputs.\n  repeated Rel inputs = 2;\n  SetOp op = 3;\n  substrait.extensions.AdvancedExtension advanced_extension = 10;\n\n  enum SetOp {\n    SET_OP_UNSPECIFIED = 0;\n    SET_OP_MINUS_PRIMARY = 1;\n    SET_OP_MINUS_PRIMARY_ALL = 7;\n    SET_OP_MINUS_MULTISET = 2;\n    SET_OP_INTERSECTION_PRIMARY = 3;\n    SET_OP_INTERSECTION_MULTISET = 4;\n    SET_OP_INTERSECTION_MULTISET_ALL = 8;\n    SET_OP_UNION_DISTINCT = 5;\n    SET_OP_UNION_ALL = 6;\n  }\n\n}\n
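
    The nullability rules used in the derivation examples above can also be sketched directly. This illustrative Python sketch (not part of the specification) applies the rules from the set operation types table; each input is a list of per-field nullabilities with inputs[0] as the primary input.

    Output Nullability Derivation Sketch (Python)
    R, N = \"R\", \"N\"\n\n# illustrative sketch: each input is a list of per-field nullabilities and\n# inputs[0] is the primary input\ndef output_nullability(op, inputs):\n    if op.startswith(\"minus\"):\n        return list(inputs[0])  # same as the primary input\n    if op == \"intersection_primary\":\n        # nullable only if nullable in the primary AND in any secondary\n        return [N if p == N and N in rest else R for p, *rest in zip(*inputs)]\n    if op.startswith(\"intersection_multiset\"):\n        # required if required in any input\n        return [N if all(f == N for f in fs) else R for fs in zip(*inputs)]\n    if op.startswith(\"union\"):\n        # nullable if nullable in any input\n        return [N if N in fs else R for fs in zip(*inputs)]\n    raise ValueError(op)\n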
    "},{"location":"relations/logical_relations/#fetch-operation","title":"Fetch Operation","text":"

    The fetch operation eliminates records outside a desired window. It typically corresponds to a SQL fetch/offset clause and returns only the records between the start offset and the end offset.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Maintains distribution and orderedness. Direct Output Order Unchanged from input."},{"location":"relations/logical_relations/#fetch-properties","title":"Fetch Properties","text":"Property Description Required Input A relational input, typically with a desired orderedness property. Required Offset Expression An expression which evaluates to a non-negative integer or null (recommended type is i64). Declares the offset for retrieval of records. An expression evaluating to null is treated as 0. Optional, defaults to a 0 literal. Count Expression An expression which evaluates to a non-negative integer or null (recommended type is i64). Declares the number of records that should be returned. An expression evaluating to null indicates that all records should be returned. Optional, defaults to a null literal. FetchRel Message
    message FetchRel {\n  RelCommon common = 1;\n  Rel input = 2;\n  // Note: A oneof field is inherently optional, whereas individual fields\n  // within a oneof cannot be marked as optional. The unset state of offset\n  // should therefore be checked at the oneof level. Unset is treated as 0.\n  oneof offset_mode {\n    // the offset expressed in number of records\n    // Deprecated: use `offset_expr` instead\n    int64 offset = 3 [deprecated = true];\n    // Expression evaluated into a non-negative integer specifying the number\n    // of records to skip. An expression evaluating to null is treated as 0.\n    // Evaluating to a negative integer should result in an error.\n    // Recommended type for offset is int64.\n    Expression offset_expr = 5;\n  }\n  // Note: A oneof field is inherently optional, whereas individual fields\n  // within a oneof cannot be marked as optional. The unset state of count\n  // should therefore be checked at the oneof level. Unset is treated as ALL.\n  oneof count_mode {\n    // the amount of records to return\n    // use -1 to signal that ALL records should be returned\n    // Deprecated: use `count_expr` instead\n    int64 count = 4 [deprecated = true];\n    // Expression evaluated into a non-negative integer specifying the number\n    // of records to return. An expression evaluating to null signals that ALL\n    // records should be returned.\n    // Evaluating to a negative integer should result in an error.\n    // Recommended type for count is int64.\n    Expression count_expr = 6;\n  }\n  substrait.extensions.AdvancedExtension advanced_extension = 10;\n\n}\n
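
    The window semantics above can be summarized with a small illustrative Python sketch (not part of the specification), where None stands in for a null offset (treated as 0) or a null count (return all records):

    Fetch Window Sketch (Python)
    # illustrative sketch of the fetch window semantics; None stands in for a\n# null offset (treated as 0) or a null count (return all records)\ndef fetch(records, offset=None, count=None):\n    offset = 0 if offset is None else offset\n    if offset < 0 or (count is not None and count < 0):\n        raise ValueError(\"offset and count must be non-negative\")\n    end = None if count is None else offset + count\n    return records[offset:end]\n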
    "},{"location":"relations/logical_relations/#aggregate-operation","title":"Aggregate Operation","text":"

    The aggregate operation groups input data on one or more sets of grouping keys, calculating each measure for each combination of grouping keys.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Maintains distribution if all distribution fields are contained in every grouping set. No orderedness guaranteed. Direct Output Order The list of grouping expressions in declaration order followed by the list of measures in declaration order, followed by an i32 identifying the grouping set the value is derived from (if applicable).

    In its simplest form, an aggregation has only measures. In this case, all records are folded into one, and a column is returned for each aggregate expression in the measures list.

    Grouping sets can be used for finer-grained control over which records are folded. A grouping set consists of zero or more references to the list of grouping expressions. Within a grouping set, two records will be folded together if and only if they have the same values for each of the expressions in the grouping set. The values returned by the grouping expressions will be returned as columns to the left of the columns for the aggregate expressions. Each of the grouping expressions must occur in at least one of the grouping sets. If a grouping set contains no grouping expressions, all rows will be folded for that grouping set. (Having a single grouping set with no grouping expressions is thus equivalent to not having any grouping sets.)

    It is possible to specify multiple grouping sets in a single aggregate operation. The grouping sets behave more or less independently, with each returned record belonging to one of the grouping sets. The values for the grouping expression columns that are not part of the grouping set for a particular record will be set to null. The columns for grouping expressions that do not appear in all grouping sets will be nullable (regardless of the nullability of the type returned by the grouping expression) to accommodate the null insertion.

    To further disambiguate which record belongs to which grouping set, an aggregate relation with more than one grouping set receives an extra i32 column on the right-hand side. The value of this field will be the zero-based index of the grouping set that yielded the record.

    If at least one grouping expression is present, the aggregation need not have any aggregate expressions. An aggregate relation is invalid if it would yield zero columns.
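
    The folding behavior described above can be sketched as follows. This is an illustrative Python sketch (not part of the specification): grouping_exprs and measures are hypothetical callables where an expression maps one record to a value and a measure folds a list of records into a value, and each grouping set is a list of indices into grouping_exprs.

    Grouping Set Folding Sketch (Python)
    from collections import defaultdict\n\n# illustrative sketch: each grouping set is a list of indices into\n# grouping_exprs; measures fold a list of records into a single value\ndef aggregate(records, grouping_exprs, grouping_sets, measures):\n    out = []\n    for set_index, gset in enumerate(grouping_sets):\n        groups = defaultdict(list)\n        for rec in records:\n            groups[tuple(grouping_exprs[i](rec) for i in gset)].append(rec)\n        for key, recs in groups.items():\n            # grouping columns not in this set are nulled out\n            cols = [key[gset.index(i)] if i in gset else None\n                    for i in range(len(grouping_exprs))]\n            row = cols + [measure(recs) for measure in measures]\n            if len(grouping_sets) > 1:\n                row.append(set_index)  # trailing i32 grouping set index\n            out.append(row)\n    return out\n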

    "},{"location":"relations/logical_relations/#aggregate-properties","title":"Aggregate Properties","text":"Property Description Required Input The relational input. Required Grouping Sets One or more grouping sets. Optional, required if no measures. Per Grouping Set A list of expression grouping that the aggregation measured should be calculated for. Optional. Measures A list of one or more aggregate expressions along with an optional filter. Optional, required if no grouping sets. AggregateRel Message
    message AggregateRel {\n  RelCommon common = 1;\n\n  // Input of the aggregation\n  Rel input = 2;\n\n  // A list of zero or more grouping sets that the aggregation measures should\n  // be calculated for. There must be at least one grouping set if there are no\n  // measures (but it can be the empty grouping set).\n  repeated Grouping groupings = 3;\n\n  // A list of one or more aggregate expressions along with an optional filter.\n  // Required if there are no groupings.\n  repeated Measure measures = 4;\n\n  // A list of zero or more grouping expressions that grouping sets (i.e.,\n  // `Grouping` messages in the `groupings` field) can reference. Each\n  // expression in this list must be referred to by at least one\n  // `Grouping.expression_references`.\n  repeated Expression grouping_expressions = 5;\n\n  substrait.extensions.AdvancedExtension advanced_extension = 10;\n\n  message Grouping {\n    // Deprecated in favor of `expression_references` below.\n    repeated Expression grouping_expressions = 1 [deprecated = true];\n\n    // A list of zero or more references to grouping expressions, i.e., indices\n    // into the `grouping_expression` list.\n    repeated uint32 expression_references = 2;\n  }\n\n  message Measure {\n    AggregateFunction measure = 1;\n\n    // An optional boolean expression that acts to filter which records are\n    // included in the measure. True means include this record for calculation\n    // within the measure.\n    // Helps to support SUM(<c>) FILTER(WHERE...) syntax without masking opportunities for optimization\n    Expression filter = 2;\n  }\n\n}\n
    "},{"location":"relations/logical_relations/#reference-operator","title":"Reference Operator","text":"

    The reference operator is used to construct DAGs of operations. In a Plan we can have multiple Rels representing various computations with potentially multiple outputs. The ReferenceRel is used to express the fact that multiple Rels might share subtrees of computation. This can be used to express arbitrary DAGs as well as represent multi-query optimizations.

    As a concrete example, think about two queries SELECT * FROM A JOIN B JOIN C and SELECT * FROM A JOIN B JOIN D. We could use the ReferenceRel to highlight the shared A JOIN B between the two queries by creating a plan with three Rels: one expressing A JOIN B (in position 0 in the plan), one using a reference as follows: ReferenceRel(0) JOIN C, and a third one doing ReferenceRel(0) JOIN D. This avoids the redundancy of computing A JOIN B twice.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Maintains all properties of the input. Direct Output Order Maintains order."},{"location":"relations/logical_relations/#reference-properties","title":"Reference Properties","text":"Property Description Required Referred Rel A zero-indexed positional reference to a Rel defined within the same Plan. Required ReferenceRel Message
    message ReferenceRel {\n  int32 subtree_ordinal = 1;\n\n}\n
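
    The A JOIN B example above can be sketched with a plan-like structure. This is an illustrative Python sketch (not part of the specification) in which {\"reference\": i} stands in for ReferenceRel(i) and a consumer resolves references by index into the plan\u2019s relation list:

    ReferenceRel Resolution Sketch (Python)
    # illustrative sketch: {\"reference\": i} stands in for ReferenceRel(i)\nrelations = [\n    {\"join\": {\"left\": \"A\", \"right\": \"B\"}},               # shared A JOIN B\n    {\"join\": {\"left\": {\"reference\": 0}, \"right\": \"C\"}},  # (A JOIN B) JOIN C\n    {\"join\": {\"left\": {\"reference\": 0}, \"right\": \"D\"}},  # (A JOIN B) JOIN D\n]\n\ndef resolve(rel):\n    if isinstance(rel, dict) and \"reference\" in rel:\n        return resolve(relations[rel[\"reference\"]])\n    if isinstance(rel, dict):\n        return {k: resolve(v) for k, v in rel.items()}\n    return rel\n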
    "},{"location":"relations/logical_relations/#write-operator","title":"Write Operator","text":"

    The write operator consumes one input and writes it to storage. This can range from writing to a Parquet file, to INSERT/DELETE/UPDATE in a database.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Output depends on OutputMode (none, or modified records) Direct Output Order Unchanged from input"},{"location":"relations/logical_relations/#write-properties","title":"Write Properties","text":"Property Description Required Write Type Definition of which object we are operating on (e.g., a fully-qualified table name). Required CTAS Schema The names of all the columns and their types for a CREATE TABLE AS. Required only for CTAS Write Operator Which type of operation we are performing (INSERT/DELETE/UPDATE/CTAS). Required Rel Input The Rel representing which records we will be operating on (e.g., VALUES for an INSERT, or which records to DELETE, or records and after-image of their values for UPDATE). Required Create Mode This determines what should happen if the table already exists (ERROR/REPLACE/IGNORE) Required only for CTAS Output Mode For views that modify a DB it is important to control which records to \u201creturn\u201d. The common default is NO_OUTPUT, where we return nothing. Alternatively, we can return MODIFIED_RECORDS, which can be further manipulated by layering more rels on top of this WriteRel (e.g., to \u201ccount how many records were updated\u201d). This also allows returning the after-image of the change. To return the before-image (or both), one can use the reference mechanisms and have multiple return values. Required for VIEW CREATE/CREATE_OR_REPLACE/ALTER"},{"location":"relations/logical_relations/#write-definition-types","title":"Write Definition Types","text":"Adding new Write Definition Types

    If you have a write definition that\u2019s not covered here, see the process for adding new write definition types.

    Write definition types are built by the community and added to the specification.

    WriteRel Message
    message WriteRel {\n  // Definition of which TABLE we are operating on\n  oneof write_type {\n    NamedObjectWrite named_table = 1;\n    ExtensionObject extension_table = 2;\n  }\n\n  // The schema of the table (must align with Rel input (e.g., number of leaf fields must match))\n  NamedStruct table_schema = 3;\n\n  // The type of operation to perform\n  WriteOp op = 4;\n\n  // The relation that determines the records to add/remove/modify\n  // the schema must match with table_schema. Default values must be explicitly stated\n  // in a ProjectRel at the top of the input. The match must also\n  // occur in case of DELETE to ensure multi-engine plans are unequivocal.\n  Rel input = 5;\n\n  CreateMode create_mode = 8; // Used with CTAS to determine what to do if the table already exists\n\n  // Output mode determines the output of executing this rel\n  OutputMode output = 6;\n  RelCommon common = 7;\n\n  enum WriteOp {\n    WRITE_OP_UNSPECIFIED = 0;\n    // The insert of new records in a table\n    WRITE_OP_INSERT = 1;\n    // The removal of records from a table\n    WRITE_OP_DELETE = 2;\n    // The modification of existing records within a table\n    WRITE_OP_UPDATE = 3;\n    // The creation of a new table, and the insert of new records in the table\n    WRITE_OP_CTAS = 4;\n  }\n\n  enum CreateMode {\n    CREATE_MODE_UNSPECIFIED = 0;\n    CREATE_MODE_APPEND_IF_EXISTS = 1; // Append the data to the table if it already exists\n    CREATE_MODE_REPLACE_IF_EXISTS = 2; // Replace the table if it already exists (\"OR REPLACE\")\n    CREATE_MODE_IGNORE_IF_EXISTS = 3; // Ignore the request if the table already exists (\"IF NOT EXISTS\")\n    CREATE_MODE_ERROR_IF_EXISTS = 4; // Throw an error if the table already exists (default behavior)\n  }\n\n  enum OutputMode {\n    OUTPUT_MODE_UNSPECIFIED = 0;\n    // return no records at all\n    OUTPUT_MODE_NO_OUTPUT = 1;\n    // this mode makes the operator return all the records INSERTED/DELETED/UPDATED by the operator.\n    // The operator returns the AFTER-image of any change. This can be further manipulated by operators upstream\n    // (e.g., returning the typical \"count of modified records\").\n    // For scenarios in which the BEFORE image is required, the user must implement a spool (via references to\n    // subplans in the body of the Rel input) and return those with another PlanRel.relations.\n    OUTPUT_MODE_MODIFIED_RECORDS = 2;\n  }\n\n}\n
    "},{"location":"relations/logical_relations/#virtual-table_1","title":"Virtual Table","text":"Property Description Required Name The in-memory name to give the dataset. Required Pin Whether it is okay to remove this dataset from memory or it should be kept in memory. Optional, defaults to false."},{"location":"relations/logical_relations/#files-type_1","title":"Files Type","text":"Property Description Required Path A URI to write the data to. Supports the inclusion of field references that are listed as available in properties as a \u201crotation description field\u201d. Required Format Enumeration of available formats. Only current option is PARQUET. Required"},{"location":"relations/logical_relations/#update-operator","title":"Update Operator","text":"

    The update operator applies a set of column transformations on a named table and writes the result back to storage.

    Signature Value Inputs 0 Outputs 1 Property Maintenance Output is number of modified records"},{"location":"relations/logical_relations/#update-properties","title":"Update Properties","text":"Property Description Required Update Type Definition of which object we are operating on (e.g., a fully-qualified table name). Required Table Schema The names and types of all the columns of the input table Required Update Condition The condition that must be met for a record to be updated. Required Update Transformations The set of column updates to be applied to the table. Required UpdateRel Message
    message UpdateRel {\n  oneof update_type {\n    NamedTable named_table = 1;\n  }\n\n  NamedStruct table_schema = 2; // The full schema of the named_table\n  Expression condition = 3; // condition to be met for the update to be applied on a record\n\n  // The list of transformations to apply to the columns of the named_table\n  repeated TransformExpression transformations = 4;\n\n  message TransformExpression {\n    Expression transformation = 1; // the transformation to apply\n    int32 column_target = 2; // index of the column to apply the transformation to\n  }\n\n}\n
    "},{"location":"relations/logical_relations/#ddl-data-definition-language-operator","title":"DDL (Data Definition Language) Operator","text":"

    The DDL operator defines modifications of a database schema (CREATE/DROP/ALTER for TABLEs and VIEWs).

    Signature Value Inputs 1 Outputs 0 Property Maintenance N/A (no output) Direct Output Order N/A"},{"location":"relations/logical_relations/#ddl-properties","title":"DDL Properties","text":"Property Description Required Write Type Definition of which type of object we are operating on. Required Table Schema The names of all the columns and their type. Required (except for DROP operations) Table Defaults The set of default values for this table. Required (except for DROP operations) DDL Object Which type of object we are operating on (e.g., TABLE or VIEW). Required DDL Operator The operation to be performed (e.g., CREATE/ALTER/DROP). Required View Definition A Rel representing the \u201cbody\u201d of a VIEW. Required for VIEW CREATE/CREATE_OR_REPLACE/ALTER DdlRel Message
    message DdlRel {\n  // Definition of which type of object we are operating on\n  oneof write_type {\n    NamedObjectWrite named_object = 1;\n    ExtensionObject extension_object = 2;\n  }\n\n  // The columns that will be modified (representing after-image of a schema change)\n  NamedStruct table_schema = 3;\n  // The default values for the columns (representing after-image of a schema change)\n  // E.g., in case of an ALTER TABLE that changes some of the column default values, we expect\n  // the table_defaults Struct to report a full list of default values reflecting the result of applying\n  // the ALTER TABLE operator successfully\n  Expression.Literal.Struct table_defaults = 4;\n\n  // Which type of object we operate on\n  DdlObject object = 5;\n\n  // The type of operation to perform\n  DdlOp op = 6;\n\n  // The body of the CREATE VIEW\n  Rel view_definition = 7;\n  RelCommon common = 8;\n\n  enum DdlObject {\n    DDL_OBJECT_UNSPECIFIED = 0;\n    // A Table object in the system\n    DDL_OBJECT_TABLE = 1;\n    // A View object in the system\n    DDL_OBJECT_VIEW = 2;\n  }\n\n  enum DdlOp {\n    DDL_OP_UNSPECIFIED = 0;\n    // A create operation (for any object)\n    DDL_OP_CREATE = 1;\n    // A create operation if the object does not exist, or replaces it (equivalent to a DROP + CREATE) if the object already exists\n    DDL_OP_CREATE_OR_REPLACE = 2;\n    // An operation that modifies the schema (e.g., column names, types, default values) for the target object\n    DDL_OP_ALTER = 3;\n    // An operation that removes an object from the system\n    DDL_OP_DROP = 4;\n    // An operation that removes an object from the system (without throwing an exception if the object did not exist)\n    DDL_OP_DROP_IF_EXIST = 5;\n  }\n  //TODO add PK/constraints/indexes/etc..?\n\n}\n
    Discussion Points
    • How should correlated operations be handled?
    "},{"location":"relations/physical_relations/","title":"Physical Relations","text":"

    There is no true distinction between logical and physical operations in Substrait. By convention, certain operations are classified as physical, but all operations can be potentially used in any kind of plan. A particular set of transformations or target operators may (by convention) be considered the \u201cphysical plan\u201d but this is a characteristic of the system consuming Substrait as opposed to a definition within Substrait.

    "},{"location":"relations/physical_relations/#hash-equijoin-operator","title":"Hash Equijoin Operator","text":"

    The hash equijoin operator will build a hash table out of the right input based on a set of join keys. It will then probe that hash table using the left input, finding matches.
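
    The build/probe flow can be sketched as follows. This is an illustrative Python sketch (not part of the specification) of an inner hash equijoin over tuple records; lkey and rkey are the join key field indices, and NULL (None) keys never match, mirroring SQL equality:

    Hash Equijoin Sketch (Python)
    from collections import defaultdict\n\n# illustrative sketch of an inner hash equijoin over tuple records; lkey and\n# rkey are join key field indices. NULL (None) keys never match.\ndef hash_join(left, right, lkey, rkey):\n    table = defaultdict(list)\n    for r in right:  # build side\n        if r[rkey] is not None:\n            table[r[rkey]].append(r)\n    out = []\n    for l in left:  # probe side\n        if l[lkey] is not None:\n            out.extend(l + r for r in table[l[lkey]])\n    return out\n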

    Signature Value Inputs 2 Outputs 1 Property Maintenance Distribution is maintained. Orderedness of the left set is maintained in INNER join cases, otherwise it is eliminated. Direct Output Order Same as the Join operator."},{"location":"relations/physical_relations/#hash-equijoin-properties","title":"Hash Equijoin Properties","text":"Property Description Required Left Input A relational input (probe side). Required Right Input A relational input (build side). Required Left Keys References to the fields to join on in the left input. Required Right Keys References to the fields to join on in the right input. Required Post Join Predicate An additional expression that can be used to reduce the output of the join operation after the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. Optional, defaults to true. Join Type One of the join types defined in the Join operator. Required"},{"location":"relations/physical_relations/#nlj-nested-loop-join-operator","title":"NLJ (Nested Loop Join) Operator","text":"

    The nested loop join operator does a join by holding the entire right input and then iterating over it using the left input, evaluating the join expression on the Cartesian product of all rows, only outputting rows where the expression is true. It will also include non-matching rows in the OUTER, LEFT and RIGHT operations per the join type requirements.

    Signature Value Inputs 2 Outputs 1 Property Maintenance Distribution is maintained. Orderedness is eliminated. Direct Output Order Same as the Join operator."},{"location":"relations/physical_relations/#nlj-properties","title":"NLJ Properties","text":"Property Description Required Left Input A relational input. Required Right Input A relational input. Required Join Expression A boolean condition that describes whether each record from the left set \u201cmatches\u201d the record from the right set. Optional. Defaults to true (a Cartesian join). Join Type One of the join types defined in the Join operator. Required"},{"location":"relations/physical_relations/#merge-equijoin-operator","title":"Merge Equijoin Operator","text":"

    The merge equijoin does a join by taking advantage of two sets that are sorted on the join keys. This allows the join operation to be done in a streaming fashion.
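
    The streaming behavior can be sketched as follows. This illustrative Python sketch (not part of the specification) performs an inner merge equijoin over two inputs already sorted ascending on their join keys:

    Merge Equijoin Sketch (Python)
    # illustrative sketch of an inner merge equijoin; both inputs must already\n# be sorted ascending on their join keys (field indices lkey and rkey)\ndef merge_join(left, right, lkey, rkey):\n    out, i, j = [], 0, 0\n    while i < len(left) and j < len(right):\n        if left[i][lkey] < right[j][rkey]:\n            i += 1\n        elif left[i][lkey] > right[j][rkey]:\n            j += 1\n        else:\n            k = j  # emit all right rows sharing this key, then advance left\n            while k < len(right) and right[k][rkey] == left[i][lkey]:\n                out.append(left[i] + right[k])\n                k += 1\n            i += 1\n    return out\n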

    Signature Value Inputs 2 Outputs 1 Property Maintenance Distribution is maintained. Orderedness is eliminated. Direct Output Order Same as the Join operator."},{"location":"relations/physical_relations/#merge-join-properties","title":"Merge Join Properties","text":"Property Description Required Left Input A relational input. Required Right Input A relational input. Required Left Keys References to the fields to join on in the left input. Required Right Keys References to the fields to join on in the right input. Required Post Join Predicate An additional expression that can be used to reduce the output of the join operation after the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. Optional, defaults to true. Join Type One of the join types defined in the Join operator. Required"},{"location":"relations/physical_relations/#exchange-operator","title":"Exchange Operator","text":"

    The exchange operator will redistribute data based on an exchange type definition. Applying this operation will lead to an output that presents the desired distribution.
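
    As a small illustration of the bucketing behavior described below, the following Python sketch (not part of the specification) routes a record using a hypothetical single-bucket expression, applying a modulo when the expression is not guaranteed to stay within the partition count:

    Exchange Routing Sketch (Python)
    # illustrative sketch: bucket_expr is a hypothetical expression evaluating\n# a record to an i32 bucket; when the expression is not guaranteed to stay\n# in range, the system wraps the value with a modulo\ndef route(record, bucket_expr, partition_count, in_range=False):\n    bucket = bucket_expr(record)\n    return bucket if in_range else bucket % partition_count\n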

    Signature Value Inputs 1 Outputs 1 Property Maintenance Orderedness is maintained. Distribution is overwritten based on configuration. Direct Output Order Order of the input."},{"location":"relations/physical_relations/#exchange-types","title":"Exchange Types","text":"Type Description Scatter Distribute data using a system defined hashing function that considers one or more fields. For the same type of fields and same ordering of values, the same partition target should be identified for different ExchangeRels. Single Bucket Define an expression that provides a single i32 bucket number. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition. Multi Bucket Define an expression that provides a List<i32> of bucket numbers. Optionally define whether the expression will only return values within the valid number of partition counts. If not, the system should modulo the return value to determine a target partition. The records should be sent to all bucket numbers provided by the expression. Broadcast Send all records to all partitions. Round Robin Send records to each target in sequence. Can follow either exact or approximate behavior. Approximate will attempt to balance the number of records sent to each destination but may not exactly distribute evenly and may send batches of records to each target before moving to the next."},{"location":"relations/physical_relations/#exchange-properties","title":"Exchange Properties","text":"Property Description Required Input The relational input. Required. Distribution Type One of the distribution types defined above. Required. Partition Count The number of partitions targeted for output. Optional. If not defined, implementation system should decide the number of partitions. Note that when not defined, single or multi bucket expressions should not be constrained to the partition count. Expression Mapping Describes a relationship between each partition ID and the destination that partition should be sent to. Optional. A partition may be sent to 0..N locations. The value can be either a URI or an arbitrary value."},{"location":"relations/physical_relations/#merging-capture","title":"Merging Capture","text":"

    A receiving operation that will merge multiple ordered streams to maintain orderedness.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Orderedness and distribution are maintained. Direct Output Order Order of the input."},{"location":"relations/physical_relations/#merging-capture-properties","title":"Merging Capture Properties","text":"Property Description Required Blocking Whether the merging should block incoming data. Blocking should be used carefully, based on whether a deadlock can be produced. Optional, defaults to false"},{"location":"relations/physical_relations/#simple-capture","title":"Simple Capture","text":"

    A receiving operation that will merge multiple streams in an arbitrary order.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Orderedness is empty after this operation. Distribution is maintained. Direct Output Order Order of the input."},{"location":"relations/physical_relations/#naive-capture-properties","title":"Naive Capture Properties","text":"Property Description Required Input The relational input. Required"},{"location":"relations/physical_relations/#top-n-operation","title":"Top-N Operation","text":"

    The top-N operator reorders a dataset based on one or more identified sort fields as well as a sorting function. Rather than sort the entire dataset, the top-N will only maintain the total number of records required to ensure a limited output. A top-N is a combination of the logical sort and logical fetch operations.
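
    The bounded-memory behavior can be sketched as follows. This illustrative Python sketch (not part of the specification) keeps only offset + count records rather than sorting the entire input; key is a hypothetical function encoding the sort fields:

    Top-N Sketch (Python)
    import heapq\n\n# illustrative sketch: keep only offset + count records rather than sorting\n# the entire input; key is a hypothetical function encoding the sort fields\ndef top_n(records, count, key, offset=0):\n    kept = heapq.nsmallest(offset + count, records, key=key)\n    return kept[offset:]\n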

    Signature Value Inputs 1 Outputs 1 Property Maintenance Will update orderedness property to the output of the sort operation. Distribution property only remapped based on emit. Direct Output Order The field order of the input."},{"location":"relations/physical_relations/#top-n-properties","title":"Top-N Properties","text":"Property Description Required Input The relational input. Required Sort Fields List of one or more fields to sort by. Uses the same properties as the orderedness property. One sort field required Offset A non-negative integer. Declares the offset for retrieval of records. Optional, defaults to 0. Count A positive integer. Declares the number of records that should be returned. Required"},{"location":"relations/physical_relations/#hash-aggregate-operation","title":"Hash Aggregate Operation","text":"

    The hash aggregate operation maintains a hash table for each grouping set to coalesce equivalent tuples.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Maintains distribution if all distribution fields are contained in every grouping set. No orderedness guaranteed. Direct Output Order Same as defined by the Aggregate operation."},{"location":"relations/physical_relations/#hash-aggregate-properties","title":"Hash Aggregate Properties","text":"Property Description Required Input The relational input. Required Grouping Sets One or more grouping sets. Optional, required if no measures. Per Grouping Set A list of references to grouping expressions for which the aggregation measures should be calculated. Optional, defaults to 0. Measures A list of one or more aggregate expressions. Implementations may or may not support aggregate ordering expressions. Optional, required if no grouping sets."},{"location":"relations/physical_relations/#streaming-aggregate-operation","title":"Streaming Aggregate Operation","text":"

    The streaming aggregate operation leverages data ordered by the grouping expressions to calculate each grouping set tuple by tuple in a streaming fashion. All grouping sets and orderings requested on each aggregate must be compatible to allow multiple grouping sets or aggregate orderings.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Maintains distribution if all distribution fields are contained in every grouping set. Maintains input ordering. Direct Output Order Same as defined by the Aggregate operation."},{"location":"relations/physical_relations/#streaming-aggregate-properties","title":"Streaming Aggregate Properties","text":"Property Description Required Input The relational input. Required Grouping Sets One or more grouping sets. If multiple grouping sets are declared, sets must all be compatible with the input sortedness. Optional, required if no measures. Per Grouping Set A list of references to grouping expressions for which the aggregation measures should be calculated. Optional, defaults to 0. Measures A list of one or more aggregate expressions. Aggregate expression ordering requirements must be compatible with the expected ordering. Optional, required if no grouping sets."},{"location":"relations/physical_relations/#consistent-partition-window-operation","title":"Consistent Partition Window Operation","text":"

    A consistent partition window operation is a special type of project operation where every function is a window function and all of the window functions share the same sorting and partitioning. This allows for the sort and partition to be calculated once and shared between the various function evaluations.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Maintains distribution and ordering. Direct Output Order Same as Project operator (input followed by each window expression)."},{"location":"relations/physical_relations/#window-properties","title":"Window Properties","text":"Property Description Required Input The relational input. Required Window Functions One or more window functions. At least one required."},{"location":"relations/physical_relations/#expand-operation","title":"Expand Operation","text":"

    The expand operation creates duplicates of input records based on the Expand Fields. Each Expand Field can be a Switching Field or an expression. Switching Fields are described below. If an Expand Field is an expression then its value is consistent across all duplicate rows.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Distribution is maintained if all the distribution fields are consistent fields with direct references. Ordering can only be maintained down to the level of consistent fields that are kept. Direct Output Order The expand fields followed by an i32 column describing the index of the duplicate that the row is derived from."},{"location":"relations/physical_relations/#expand-properties","title":"Expand Properties","text":"Property Description Required Input The relational input. Required Direct Fields Expressions describing the output fields. These refer to the schema of the input. Each Direct Field must be an expression or a Switching Field Required"},{"location":"relations/physical_relations/#switching-field-properties","title":"Switching Field Properties","text":"

    A switching field is a field whose value is different in each duplicated row. All switching fields in an Expand Operation must have the same number of duplicates.
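
    The duplication behavior can be sketched as follows. In this illustrative Python sketch (not part of the specification), each field is either a plain expression (one callable, consistent across duplicates) or a switching field (a tuple of callables, one per duplicate):

    Expand Sketch (Python)
    # illustrative sketch: a plain expression is one callable (value consistent\n# across duplicates); a switching field is a tuple of callables, one per\n# duplicate, and all switching fields must have the same length\ndef expand(records, fields):\n    n = max((len(f) for f in fields if isinstance(f, tuple)), default=1)\n    out = []\n    for rec in records:\n        for i in range(n):\n            row = [(f[i] if isinstance(f, tuple) else f)(rec) for f in fields]\n            out.append(row + [i])  # trailing i32 duplicate index\n    return out\n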

    Property Description Required Duplicates List of one or more expressions. The output will contain a row for each expression. Required"},{"location":"relations/physical_relations/#hashing-window-operation","title":"Hashing Window Operation","text":"

    A window aggregate operation that will build hash tables for each distinct partition expression.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Maintains distribution. Eliminates ordering. Direct Output Order Same as Project operator (input followed by each window expression)."},{"location":"relations/physical_relations/#hashing-window-properties","title":"Hashing Window Properties","text":"Property Description Required Input The relational input. Required Window Expressions One or more window expressions. At least one required."},{"location":"relations/physical_relations/#streaming-window-operation","title":"Streaming Window Operation","text":"

    A window aggregate operation that relies on a partition/ordering sorted input.

    Signature Value Inputs 1 Outputs 1 Property Maintenance Maintains distribution. Eliminates ordering. Direct Output Order Same as Project operator (input followed by each window expression)."},{"location":"relations/physical_relations/#streaming-window-properties","title":"Streaming Window Properties","text":"Property Description Required Input The relational input. Required Window Expressions One or more window expressions. Must be supported by the sortedness of the input. At least one required."},{"location":"relations/user_defined_relations/","title":"User Defined Relations","text":"

    Pending

    "},{"location":"serialization/basics/","title":"Basics","text":"

    Substrait is designed to be serialized into various different formats. Currently we support a binary serialization for transmission of plans between programs (e.g. IPC or network communication) and a text serialization for debugging and human readability. Other formats may be added in the future.

    These formats serialize a collection of plans. Substrait does not define how a collection of plans is to be interpreted. For example, the following scenarios are all valid uses of a collection of plans:

    • A query engine receives a plan and executes it. It receives a collection of plans with a single root plan. The top-level node of the root plan defines the output of the query. Non-root plans may be included as common subplans which are referenced from the root plan.
    • A transpiler may convert plans from one dialect to another. It could take, as input, a single root plan. Then it could output a serialized binary containing multiple root plans. Each root plan is a representation of the input plan in a different dialect.
    • A distributed scheduler might expect 1+ root plans. Each root plan describes a different stage of computation.

    Libraries should make sure to thoroughly describe the way plan collections will be produced or consumed.

    "},{"location":"serialization/basics/#root-plans","title":"Root plans","text":"

    We often refer to query plans as a graph of nodes (typically a DAG unless the query is recursive). However, we encode this graph as a collection of trees with a single root tree that references other trees (which may also transitively reference other trees). Plan serializations all have some way to indicate which plan(s) are \u201croot\u201d plans. Any plan that is not a root plan and is not referenced (directly or transitively) by some root plan can safely be ignored.

    "},{"location":"serialization/binary_serialization/","title":"Binary Serialization","text":"

    Substrait can be serialized into a protobuf-based binary representation. The proto schema/IDL files can be found on GitHub. Proto files are placed in the io.substrait namespace for C++/Java and the Substrait.Protobuf namespace for C#.

    "},{"location":"serialization/binary_serialization/#plan","title":"Plan","text":"

    The main top-level object used to communicate a Substrait plan using protobuf is a Plan message (see ExtendedExpression for an alternative top-level object). The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees.

    Plan Message
    message Plan {\n  // Substrait version of the plan. Optional up to 0.17.0, required for later\n  // versions.\n  Version version = 6;\n\n  // a list of yaml specifications this plan may depend on\n  repeated substrait.extensions.SimpleExtensionURI extension_uris = 1;\n\n  // a list of extensions this plan may depend on\n  repeated substrait.extensions.SimpleExtensionDeclaration extensions = 2;\n\n  // one or more relation trees that are associated with this plan.\n  repeated PlanRel relations = 3;\n\n  // additional extensions associated with this plan.\n  substrait.extensions.AdvancedExtension advanced_extensions = 4;\n\n  // A list of com.google.Any entities that this plan may use. Can be used to\n  // warn if some embedded message types are unknown. Note that this list may\n  // include message types that are ignorable (optimizations) or that are\n  // unused. In many cases, a consumer may be able to work with a plan even if\n  // one or more message types defined here are unknown.\n  repeated string expected_type_urls = 5;\n\n}\n
    "},{"location":"serialization/binary_serialization/#extensions","title":"Extensions","text":"

    Protobuf supports both simple and advanced extensions. Simple extensions are declared at the plan level and advanced extensions are declared at multiple levels of messages within the plan.

    "},{"location":"serialization/binary_serialization/#simple-extensions","title":"Simple Extensions","text":"

    For simple extensions, a plan references the URIs associated with the simple extensions to provide additional plan capabilities. These URIs will list additional relevant information for the plan.

    Simple extensions within a plan are split into three components: an extension URI, an extension declaration and a number of references.

    • Extension URI: A unique identifier for the extension pointing to a YAML document specifying one or more specific extensions. Declares an anchor that can be used in extension declarations.
    • Extension Declaration: A specific extension within a single YAML document. The declaration combines a reference to the associated Extension URI along with a unique key identifying the specific item within that YAML document (see Function Signature Compound Names). It also defines a declaration anchor. The anchor is a plan-specific unique value that the producer creates as a key to be referenced elsewhere.
    • Extension Reference: A specific instance or use of an extension declaration within the plan body.

    Extension URIs and declarations are encapsulated in the top level of the plan. Extension declarations are then referenced throughout the body of the plan itself. The exact structure of these references will depend on the extension point being used, but they will always include the extension\u2019s anchor (or key). For example, all scalar function expressions contain references to an extension declaration which defines the semantics of the function.

    Simple Extension URI
    message SimpleExtensionURI {\n  // A surrogate key used in the context of a single plan used to reference the\n  // URI associated with an extension.\n  uint32 extension_uri_anchor = 1;\n\n  // The URI where this extension YAML can be retrieved. This is the \"namespace\"\n  // of this extension.\n  string uri = 2;\n\n}\n

    Once the YAML file URI anchor is defined, the anchor will be referenced by zero or more SimpleExtensionDeclarations. For each simple extension declaration, an anchor is defined for that specific extension entity. This anchor is then referenced within lower-level primitives (functions, etc.) to identify that specific extension. Message properties are named *_anchor where the anchor is defined and *_reference when referencing the anchor. For example, function_anchor and function_reference.
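
    Anchor resolution is therefore a plan-local lookup. The following illustrative Python sketch (not part of the specification) shows the idea; the URI and function name below are hypothetical placeholders:

    Anchor Resolution Sketch (Python)
    # illustrative sketch: anchors are surrogate keys local to one plan, so\n# resolution is a plain dictionary lookup; the URI and function name below\n# are hypothetical placeholders\nuris = {1: \"https://example.com/extensions/functions_arithmetic.yaml\"}\nfunctions = {7: (1, \"add:i64_i64\")}  # function_anchor -> (uri ref, name)\n\ndef resolve_function(function_reference):\n    uri_reference, name = functions[function_reference]\n    return uris[uri_reference], name\n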

    Simple Extension Declaration
    message SimpleExtensionDeclaration {\n  oneof mapping_type {\n    ExtensionType extension_type = 1;\n    ExtensionTypeVariation extension_type_variation = 2;\n    ExtensionFunction extension_function = 3;\n  }\n\n  // Describes a Type\n  message ExtensionType {\n    // references the extension_uri_anchor defined for a specific extension URI.\n    uint32 extension_uri_reference = 1;\n\n    // A surrogate key used in the context of a single plan to reference a\n    // specific extension type\n    uint32 type_anchor = 2;\n\n    // the name of the type in the defined extension YAML.\n    string name = 3;\n  }\n\n  message ExtensionTypeVariation {\n    // references the extension_uri_anchor defined for a specific extension URI.\n    uint32 extension_uri_reference = 1;\n\n    // A surrogate key used in the context of a single plan to reference a\n    // specific type variation\n    uint32 type_variation_anchor = 2;\n\n    // the name of the type in the defined extension YAML.\n    string name = 3;\n  }\n\n  message ExtensionFunction {\n    // references the extension_uri_anchor defined for a specific extension URI.\n    uint32 extension_uri_reference = 1;\n\n    // A surrogate key used in the context of a single plan to reference a\n    // specific function\n    uint32 function_anchor = 2;\n\n    // A function signature compound name\n    string name = 3;\n  }\n\n}\n

    Note

    Anchors only have meaning within a single plan and exist simply to reduce plan size. They are not some form of global identifier. Different plans may use different anchors for the same specific functions, types, type variations, etc.

    Note

    It is valid for a plan to include SimpleExtensionURIs and/or SimpleExtensionDeclarations that are not referenced directly.

    "},{"location":"serialization/binary_serialization/#advanced-extensions","title":"Advanced Extensions","text":"

    Substrait protobuf exposes a special object in multiple places in the representation to provide extension capabilities. Extensions are expressed via this object and are separated into two main concepts:

    Advanced Extension Type Description Optimization A change to the plan that may help some consumers work more efficiently with the plan. These properties should be propagated through plan pipelines where possible but do not impact the meaning of the plan. A consumer can safely ignore these properties. Enhancement A change to the plan that functionally changes the behavior of the plan. Use these sparingly as they will impact plan interoperability. Advanced Extension Protobuf
    message AdvancedExtension {\n  // An optimization is helpful information that doesn't influence semantics. May\n  // be ignored by a consumer.\n  repeated google.protobuf.Any optimization = 1;\n\n  // An enhancement alters semantics. Cannot be ignored by a consumer.\n  google.protobuf.Any enhancement = 2;\n\n}\n
    "},{"location":"serialization/binary_serialization/#capabilities","title":"Capabilities","text":"

    When two systems exchanging Substrait plans want to understand each other\u2019s capabilities, they may exchange a Capabilities message. The capabilities message provides information on the set of simple and advanced extensions that the system supports.

    Capabilities Message
    message Capabilities {\n  // List of Substrait versions this system supports\n  repeated string substrait_versions = 1;\n\n  // list of com.google.Any message types this system supports for advanced\n  // extensions.\n  repeated string advanced_extension_type_urls = 2;\n\n  // list of simple extensions this system supports.\n  repeated SimpleExtension simple_extensions = 3;\n\n  message SimpleExtension {\n    string uri = 1;\n    repeated string function_keys = 2;\n    repeated string type_keys = 3;\n    repeated string type_variation_keys = 4;\n  }\n\n}\n
    "},{"location":"serialization/binary_serialization/#protobuf-rationale","title":"Protobuf Rationale","text":"

    The binary format of Substrait is designed to be easy to work with in many languages. A key requirement is that someone can take the binary format IDL and use standard tools to build a set of primitives that are easy to work with in any of a number of languages. This allows communities to build and use Substrait using only a binary IDL and the specification (and allows the Substrait project to avoid being required to build libraries for each language to work with the specification).

    There are several binary IDLs that exist today. The key requirements for Substrait are the following:

    • Strongly typed IDL schema language
    • High-quality, well-supported and idiomatic bindings/compilers for key languages (Python, JavaScript, C++, Go, Rust, Java)
    • Compact serial representation

    The primary formats that roughly qualify under these requirements include: Protobuf, Thrift, Flatbuf, Avro, Cap\u2019n Proto. Protobuf was chosen due to its clean typing system and large number of high-quality language bindings.

    The binary serialization IDLs can be found on GitHub and are sampled throughout the documentation.

    "},{"location":"serialization/text_serialization/","title":"Text Serialization","text":"

    To maximize the new user experience, it is important for Substrait to have a text representation of plans. This allows people to experiment with basic tooling. Building simple CLI tools that do things like SQL > Plan and Plan > SQL or REPL plan construction can all be done relatively straightforwardly with a text representation.

    The recommended text serialization format is JSON. Since the text format is not designed for performance, the format can be produced to maximize readability. This also allows nice symmetry between the construction of plans and the configuration of various extensions such as function signatures and user defined types.

    To ensure the JSON is valid, the object will be defined using the OpenAPI 3.1 specification. Not only does this allow strong validation, the OpenAPI specification also enables code generators to be easily used to produce plans in many languages.

    While JSON will be used for much of the plan serialization, Substrait uses a simple custom grammar for record-level expressions. While one can construct an equation such as (10 + 5)/2 using a tree of function and literal objects, it is much more human-readable to consume a plan when the information is written similarly to the way one typically consumes scalar expressions. This grammar will be maintained in an ANTLR grammar (targetable to multiple programming languages) and is also planned to be supported via a JSON schema definition format tag so that the grammar can be validated as part of the schema validation.

    "},{"location":"spec/extending/","title":"Extending","text":"

    Substrait is a community project and requires consensus about new additions to the specification in order to maintain consistency. The best way to get consensus is to discuss ideas. The main ways to communicate are:

    • Substrait Mailing List
    • Substrait Slack
    • Community Meeting
    "},{"location":"spec/extending/#minor-changes","title":"Minor changes","text":"

    Simple changes like typos and bug fixes do not require as much effort. File an issue or send a PR and we can discuss it there.

    "},{"location":"spec/extending/#complex-changes","title":"Complex changes","text":"

    For complex features, it is best to discuss the change first. Gathering some background information ahead of time will help get everyone on the same page.

    "},{"location":"spec/extending/#outline-the-issue","title":"Outline the issue","text":""},{"location":"spec/extending/#language","title":"Language","text":"

    Every engine has its own terminology. Every Spark user probably knows what an \u201cattribute\u201d is. Velox users will know what a \u201cRowVector\u201d means. Etc. However, Substrait is used by people that come from a variety of backgrounds and you should generally assume that its users do not know anything about your own implementation. As a result, all PRs and discussion should endeavor to use Substrait terminology wherever possible.

    "},{"location":"spec/extending/#motivation","title":"Motivation","text":"

    What problems does this relation solve? If it is a more logical relation then how does it allow users to express new capabilities? If it is more of an internal relation then how does it map to existing logical relations? How is it different than other existing relations? Why do we need this?

    "},{"location":"spec/extending/#examples","title":"Examples","text":"

    Provide example input and output for the relation. Show example plans. Try to motivate your examples, as best as possible, with something that looks like a real-world problem. These will go a long way toward helping others understand the purpose of a relation.

    "},{"location":"spec/extending/#alternatives","title":"Alternatives","text":"

    Discuss what alternatives are out there. Are there other ways to achieve similar results? Do some systems handle this problem differently?

    "},{"location":"spec/extending/#survey-existing-implementation","title":"Survey existing implementation","text":"

    It\u2019s unlikely that this is the first time that this has been done. Figuring out how existing systems have approached the same problem will help ground the discussion and the design.

    "},{"location":"spec/extending/#prototype-the-feature","title":"Prototype the feature","text":"

    Novel approaches should be implemented as an extension first.

    "},{"location":"spec/extending/#substrait-design-principles","title":"Substrait design principles","text":"

    Substrait is designed around interoperability so a feature only used by a single system may not be accepted. But don\u2019t despair! Substrait has a highly developed extension system for this express purpose.

    "},{"location":"spec/extending/#you-dont-have-to-do-it-alone","title":"You don\u2019t have to do it alone","text":"

    If you are hoping to add a feature and these criteria seem intimidating then feel free to start a mailing list discussion before you have all the information and ask for help. Investigating other implementations, in particular, is something that can be quite difficult to do on your own.

    "},{"location":"spec/specification/","title":"Specification","text":""},{"location":"spec/specification/#status","title":"Status","text":"

    The specification has passed the initial design phase and is now in the final stages of being fleshed out. The community is encouraged to identify (and address) any perceived gaps in functionality using GitHub issues and PRs. Once all of the planned implementations have been completed all deprecated fields will be eliminated and version 1.0 will be released.

    "},{"location":"spec/specification/#components-complete","title":"Components (Complete)","text":"Section Description Simple Types A way to describe the set of basic types that will be operated on within a plan. Only includes simple types such as integers and doubles (nothing configurable or compound). Compound Types Expression of types that go beyond simple scalar values. Key concepts here include: configurable types such as fixed length and numeric types as well as compound types such as structs, maps, lists, etc. Type Variations Physical variations to base types. User Defined Types Extensions that can be defined for specific IR producers/consumers. Field References Expressions to identify which portions of a record should be operated on. Scalar Functions Description of how functions are specified. Concepts include arguments, variadic functions, output type derivation, etc. Scalar Function List A list of well-known canonical functions in YAML format. Specialized Record Expressions Specialized expression types that are more naturally expressed outside the function paradigm. Examples include items such as if/then/else and switch statements. Aggregate Functions Functions that are expressed in aggregation operations. Examples include things such as SUM, COUNT, etc. Operations take many records and collapse them into a single (possibly compound) value. Window Functions Functions that relate a record to a set of encompassing records. Examples in SQL include RANK, NTILE, etc. User Defined Functions Reusable named functions that are built beyond the core specification. Implementations are typically registered thorough external means (drop a file in a directory, send a special command with implementation, etc.) Embedded Functions Functions implementations embedded directly within the plan. Frequently used in data science workflows where business logic is interspersed with standard operations. Relation Basics Basic concepts around relational algebra, record emit and properties. Logical Relations Common relational operations used in compute plans including project, join, aggregation, etc. Text Serialization A human producible & consumable representation of the plan specification. Binary Serialization A high performance & compact binary representation of the plan specification."},{"location":"spec/specification/#components-designed-but-not-implemented","title":"Components (Designed but not Implemented)","text":"Section Description Table Functions Functions that convert one or more values from an input record into 0..N output records. Example include operations such as explode, pos-explode, etc. User Defined Relations Installed and reusable relational operations customized to a particular platform. Embedded Relations Relational operations where plans contain the \u201cmachine code\u201d to directly execute the necessary operations. Physical Relations Specific execution sub-variations of common relational operations that describe have multiple unique physical variants associated with a single logical operation. Examples include hash join, merge join, nested loop join, etc."},{"location":"spec/technology_principles/","title":"Technology Principles","text":"
    • Provide a good suite of well-specified common functionality in databases and data science applications.
    • Make it easy for users to privately or publicly extend the representation to support specialized/custom operations.
    • Produce something that is language agnostic and requires minimal work to start developing against in a new language.
    • Drive towards a common format that avoids specialization for a single favorite producer or consumer.
    • Establish clear delineation between specifications that MUST be respected and those that can be optionally ignored.
    • Establish a forgiving compatibility approach and versioning scheme that supports cross-version compatibility in the maximum number of cases.
    • Minimize the need for consumer intelligence by excluding concepts like overloading, type coercion, implicit casting, field name handling, etc. (Note: this is weak and should be better stated.)
    • Decomposability/severability: A particular producer or consumer should be able to produce or consume only a subset of the specification and interact well with any other Substrait system, as long as the specific operations requested fit within the subset of the specification supported by the counterpart system.
    "},{"location":"spec/versioning/","title":"Versioning","text":"

    As an interface specification, the goal of Substrait is to reach a point where (breaking) changes will never need to happen again, or at least be few and far between. By analogy, Apache Arrow\u2019s in-memory format specification has stayed functionally constant, despite many major library versions being released. However, we\u2019re not there yet. When we believe that we\u2019ve reached this point, we will signal this by releasing version 1.0.0. Until then, we will remain in the 0.x.x version regime.

    Despite this, we strive to maintain backward compatibility for both the binary representation and the text representation by means of deprecation. When a breaking change cannot be reasonably avoided, we may remove previously deprecated fields. All deprecated fields will be removed for the 1.0.0 release.

    Substrait uses semantic versioning for its version numbers, with the addition that, during 0.x.y, we increment the x digit for breaking changes and new features, and the y digit for fixes and other nonfunctional changes. The release process is currently automated and makes a new release every week, provided something has changed on the main branch since the previous release. This release cadence will likely be slowed down as stability increases over time. Conventional commits are used to distinguish between breaking changes, new features, and fixes, and GitHub Actions are used to verify that there are indeed no breaking protobuf changes in a commit, unless the commit message states this.

    "},{"location":"tools/producer_tools/","title":"Producer Tools","text":""},{"location":"tools/producer_tools/#isthmus","title":"Isthmus","text":"

    Isthmus is an application that serializes SQL to Substrait Protobuf via the Calcite SQL compiler.

    "},{"location":"tools/substrait_validator/","title":"Substrait Validator","text":"

    The Substrait Validator is a tool used to validate Substrait plans as well as print diagnostic information regarding plan validity.

    "},{"location":"tools/third_party_tools/","title":"Third Party Tools","text":""},{"location":"tools/third_party_tools/#substrait-tools","title":"Substrait-tools","text":"

    The substrait-tools python package provides a command line interface for producing/consuming Substrait plans by leveraging the APIs from different producers and consumers.

    "},{"location":"tools/third_party_tools/#substrait-fiddle","title":"Substrait Fiddle","text":"

    Substrait Fiddle is an online tool to share, debug, and prototype Substrait plans.

    The Substrait Fiddle Source is available, allowing it to be run in any environment.

    "},{"location":"tutorial/sql_to_substrait/","title":"SQL to Substrait tutorial","text":"

    This is an introductory tutorial to learn the basics of Substrait for readers already familiar with SQL. We will look at how to construct a Substrait plan from an example query.

    We\u2019ll present the Substrait in JSON form to make it relatively readable to newcomers. Typically Substrait is exchanged as a protobuf message, but for debugging purposes it is often helpful to look at a serialized form. Plus, it\u2019s not uncommon for unit tests to represent plans as JSON strings. So if you are developing with Substrait, it\u2019s useful to have experience reading them.

    Note

    Substrait is currently only defined with Protobuf. The JSON provided here is the Protobuf JSON output, but it is not the official Substrait text format. Eventually, Substrait will define its own human-readable text format, but for now this tutorial will make do with what Protobuf provides.

    Substrait is designed to communicate plans (mostly logical plans). Those plans contain types, schemas, expressions, extensions, and relations. We\u2019ll look at them in that order, going from simplest to most complex until we can construct full plans.

    This tutorial won\u2019t cover all the details of each piece, but it will give you an idea of how they connect together. For a detailed reference of each individual field, the best place to look is the protobuf definitions themselves. They represent the source-of-truth of the spec and are well-commented to address ambiguities.

    "},{"location":"tutorial/sql_to_substrait/#problem-set-up","title":"Problem Set up","text":"

    To learn Substrait, we\u2019ll build up to a specific query. We\u2019ll be using the tables:

    CREATE TABLE orders (\n  product_id: i64 NOT NULL,\n  quantity: i32 NOT NULL,\n  order_date: date NOT NULL,\n  price: decimal(10, 2)\n);\n
    CREATE TABLE products (\n  product_id: i64 NOT NULL,\n  categories: list<string NOT NULL> NOT NULL,\n  details: struct<manufacturer: string, year_created: i32>,\n  product_name: string\n);\n

    This orders table represents events where products were sold, recording how many (quantity) and at what price (price). The products table provides details for each product, with product_id as the primary key.

    And we\u2019ll try to create the query:

    SELECT\n  product_name,\n  product_id,\n  sum(quantity * price) as sales\nFROM\n  orders\nINNER JOIN\n  products\nON\n  orders.product_id = products.product_id\nWHERE\n  -- categories does not contain \"Computers\"\n  INDEX_IN(\"Computers\", categories) IS NULL\nGROUP BY\n  product_name,\n  product_id\n

    The query asks the question: For products that aren\u2019t in the \"Computers\" category, how much has each product generated in sales?

    However, Substrait doesn\u2019t correspond to SQL as much as it does to logical plans. So to be less ambiguous, the plan we are aiming for looks like:

    |-+ Aggregate({sales = sum(quantity * price)}, group_by=(product_name, product_id))\n  |-+ InnerJoin(on=orders.product_id = products.product_id)\n    |- ReadTable(orders)\n    |-+ Filter(INDEX_IN(\"Computers\", categories) IS NULL)\n      |- ReadTable(products)\n
    "},{"location":"tutorial/sql_to_substrait/#types-and-schemas","title":"Types and Schemas","text":"

    As part of the Substrait plan, we\u2019ll need to embed the data types of the input tables. In Substrait, each type is a distinct message, which at a minimum contains a field for nullability. For example, a string field looks like:

    {\n  \"string\": {\n    \"nullability\": \"NULLABILITY_NULLABLE\"\n  }\n}\n

    Nullability is an enum not a boolean, since Substrait allows NULLABILITY_UNSPECIFIED as an option, in addition to NULLABILITY_NULLABLE (nullable) and NULLABILITY_REQUIRED (not nullable).

    Other types such as VarChar and Decimal have other parameters. For example, our orders.price column will be represented as:

    {\n  \"decimal\": {\n    \"precision\": 10,\n    \"scale\": 2,\n    \"nullability\": \"NULLABILITY_NULLABLE\"\n  }\n}\n

    Finally, there are nested compound types such as structs and list types that have other types as parameters. For example, the products.categories column is a list of strings, so it can be represented as:

    {\n  \"list\": {\n    \"type\": {\n      \"string\": {\n        \"nullability\": \"NULLABILITY_REQUIRED\"\n      }\n    },\n    \"nullability\": \"NULLABILITY_REQUIRED\"\n  }\n}\n

    To know what parameters each type can take, refer to the Protobuf definitions in type.proto.

    Schemas of tables can be represented with a NamedStruct message, which is the combination of a struct type containing all the columns and a list of column names. For the orders table, this will look like:

    {\n  \"names\": [\n    \"product_id\",\n    \"quantity\",\n    \"order_date\",\n    \"price\"\n  ],\n  \"struct\": {\n    \"types\": [\n      {\n        \"i64\": {\n          \"nullability\": \"NULLABILITY_REQUIRED\"\n        }\n      },\n      {\n        \"i32\": {\n          \"nullability\": \"NULLABILITY_REQUIRED\"\n        }\n      },\n      {\n        \"date\": {\n          \"nullability\": \"NULLABILITY_REQUIRED\"\n        }\n      },\n      {\n        \"decimal\": {\n          \"precision\": 10,\n          \"scale\": 2,\n          \"nullability\": \"NULLABILITY_NULLABLE\"\n        }\n      }\n    ],\n    \"nullability\": \"NULLABILITY_REQUIRED\"\n  }\n}\n

    Here, names is the names of all fields. In nested schemas, this includes the names of subfields in depth-first order. So for the products table, the details struct field will be included as well as the two subfields (manufacturer and year_created) right after. And because it\u2019s depth-first, these subfields appear before product_name. The full schema looks like:

    {\n  \"names\": [\n    \"product_id\",\n    \"categories\",\n    \"details\",\n    \"manufacturer\",\n    \"year_created\",\n    \"product_name\"\n  ],\n  \"struct\": {\n    \"types\": [\n      {\n        \"i64\": {\n          \"nullability\": \"NULLABILITY_REQUIRED\"\n        }\n      },\n      {\n        \"list\": {\n          \"type\": {\n            \"string\": {\n              \"nullability\": \"NULLABILITY_REQUIRED\"\n            }\n          },\n          \"nullability\": \"NULLABILITY_REQUIRED\"\n        }\n      },\n      {\n        \"struct\": {\n          \"types\": [\n            {\n              \"string\": {\n                \"nullability\": \"NULLABILITY_NULLABLE\"\n              },\n              \"i32\": {\n                \"nullability\": \"NULLABILITY_NULLABLE\"\n              }\n            }\n          ],\n          \"nullability\": \"NULLABILITY_NULLABLE\"\n        }\n      },\n      {\n        \"string\": {\n          \"nullability\": \"NULLABILITY_NULLABLE\"\n        }\n      }\n    ],\n    \"nullability\": \"NULLABILITY_REQUIRED\"\n  }\n}\n
    "},{"location":"tutorial/sql_to_substrait/#expressions","title":"Expressions","text":"

    The next basic building block we will need is expressions. Expressions can be one of several things, including:

    • Field references
    • Literal values
    • Functions
    • Subqueries
    • Window Functions

    Since some expressions such as functions can contain other expressions, expressions can be represented as a tree. Literal values and field references typically are the leaf nodes.

    For the expression INDEX_IN(\"Computers\", categories) IS NULL, we have a field reference categories, a literal string \"Computers\", and two functions: INDEX_IN and IS NULL.

    The field reference for categories is represented by:

    {\n  \"selection\": {\n    \"directReference\": {\n      \"structField\": {\n        \"field\": 1\n      }\n    },\n    \"rootReference\": {}\n  }\n}\n

    Whereas SQL references fields by name, Substrait always references fields numerically. This means that a Substrait expression only makes sense relative to a certain schema. As we\u2019ll see later when we discuss relations, for a filter relation this will be relative to the input schema, so the 1 here is referring to the second field of products.

    Note

    Protobuf may not serialize fields with integer type and value 0, since 0 is the default. So if you instead saw \"structField\": {}, know that it is equivalent to \"structField\": { \"field\": 0 }.

    \"Computers\" will be translated to a literal expression:

    {\n  \"literal\": {\n    \"string\": \"Computers\"\n  }\n}\n

    Both IS NULL and INDEX_IN will be scalar function expressions. Available functions in Substrait are defined in extension YAML files contained in https://github.com/substrait-io/substrait/tree/main/extensions. Additional extensions may be created elsewhere. IS NULL is defined as the is_null function in functions_comparison.yaml and INDEX_IN is defined as the index_in function in functions_set.yaml.

    First, the expression for INDEX_IN(\"Computers\", categories) is:

    {\n  \"scalarFunction\": {\n    \"functionReference\": 1,\n    \"outputType\": {\n      \"i64\": {\n        \"nullability\": \"NULLABILITY_NULLABLE\"\n      }\n    },\n    \"arguments\": [\n      {\n        \"value\": {\n          \"literal\": {\n            \"string\": \"Computers\"\n          }\n        }\n      },\n      {\n        \"value\": {\n          \"selection\": {\n            \"directReference\": {\n              \"structField\": {\n                \"field\": 1\n              }\n            },\n            \"rootReference\": {}\n          }\n        }\n      }\n    ]\n  }\n}\n

    functionReference will be explained later in the plans section. For now, understand that it\u2019s an ID that corresponds to an entry in a list of function definitions that we will create later.

    outputType defines the type the function outputs. We know this is a nullable i64 type since that is what the function definition declares in the YAML file.

    arguments defines the arguments being passed into the function, which are all done positionally based on the function definition in the YAML file. The two arguments will be familiar as the literal and the field reference we constructed earlier.

    To create the final expression, we just need to wrap this in another scalar function expression for IS NULL.

    {\n  \"scalarFunction\": {\n    \"functionReference\": 2,\n    \"outputType\": {\n      \"bool\": {\n        \"nullability\": \"NULLABILITY_REQUIRED\"\n      }\n    },\n    \"arguments\": [\n      {\n        \"value\": {\n          \"scalarFunction\": {\n            \"functionReference\": 1,\n            \"outputType\": {\n              \"i64\": {\n                \"nullability\": \"NULLABILITY_NULLABLE\"\n              }\n            },\n            \"arguments\": [\n              {\n                \"value\": {\n                  \"literal\": {\n                    \"string\": \"Computers\"\n                  }\n                }\n              },\n              {\n                \"value\": {\n                  \"selection\": {\n                    \"directReference\": {\n                      \"structField\": {\n                        \"field\": 1\n                      }\n                    },\n                    \"rootReference\": {}\n                  }\n                }\n              }\n            ]\n          }\n        }\n      }\n    ]\n  }\n}\n

    To see what other types of expressions are available and what fields they take, see the Expression proto definition in algebra.proto.

    "},{"location":"tutorial/sql_to_substrait/#relations","title":"Relations","text":"

    In most SQL engines, a logical or physical plan is represented as a tree of nodes, such as filter, project, scan, or join, where nodes feed data into one another. In Substrait, each of these nodes is a Relation.

    A relation that takes another relation as input will contain (or refer to) that relation. This is usually a field called input, but sometimes different names are used in relations that take multiple inputs. For example, join relations take two inputs, with field names left and right. In JSON, the rough layout for the relations in our plan will look like:

    {\n    \"aggregate\": {\n        \"input\": {\n            \"join\": {\n                \"left\": {\n                    \"filter\": {\n                        \"input\": {\n                            \"read\": {\n                                ...\n                            }\n                        },\n                        ...\n                    }\n                },\n                \"right\": {\n                    \"read\": {\n                        ...\n                    }\n                },\n                ...\n            }\n        },\n        ...\n    }\n}\n

    For our plan, we need to define the read relations for each table, a filter relation to exclude the \"Computers\" category from the products table, a join relation to perform the inner join, and finally an aggregate relation to compute the total sales.

    The read relations are composed of a baseSchema and a namedTable field. The type of read here is a named table, so the namedTable field is present, with names containing the list of name segments (such as my_database.my_table). Other types of reads include virtual tables (a table of literal values embedded in the plan) and a list of files. See Read Definition Types for more details. The baseSchema is the schema we defined earlier, and namedTable just holds the name of the table. So for reading the orders table, the relation looks like:

    {\n  \"read\": {\n    \"namedTable\": {\n      \"names\": [\n        \"orders\"\n      ]\n    },\n    \"baseSchema\": {\n      \"names\": [\n        \"product_id\",\n        \"quantity\",\n        \"order_date\",\n        \"price\"\n      ],\n      \"struct\": {\n        \"types\": [\n          {\n            \"i64\": {\n              \"nullability\": \"NULLABILITY_REQUIRED\"\n            }\n          },\n          {\n            \"i32\": {\n              \"nullability\": \"NULLABILITY_REQUIRED\"\n            }\n          },\n          {\n            \"date\": {\n              \"nullability\": \"NULLABILITY_REQUIRED\"\n            }\n          },\n          {\n            \"decimal\": {\n              \"scale\": 10,\n              \"precision\": 2,\n              \"nullability\": \"NULLABILITY_NULLABLE\"\n            }\n          }\n        ],\n        \"nullability\": \"NULLABILITY_REQUIRED\"\n      }\n    }\n  }\n}\n

    Read relations are leaf nodes. Leaf nodes don\u2019t depend on any other node for data and usually represent a source of data in our plan. Leaf nodes are then typically used as input for other nodes that manipulate the data. For example, our filter node will take the products read relation as an input.

    The filter node will also take a condition field, which will just be the expression we constructed earlier.

    {\n  \"filter\": {\n    \"input\": {\n      \"read\": { ... }\n    },\n    \"condition\": {\n      \"scalarFunction\": {\n        \"functionReference\": 2,\n        \"outputType\": {\n          \"bool\": {\n            \"nullability\": \"NULLABILITY_REQUIRED\"\n          }\n        },\n        \"arguments\": [\n          {\n            \"value\": {\n              \"scalarFunction\": {\n                \"functionReference\": 1,\n                \"outputType\": {\n                  \"i64\": {\n                    \"nullability\": \"NULLABILITY_NULLABLE\"\n                  }\n                },\n                \"arguments\": [\n                  {\n                    \"value\": {\n                      \"literal\": {\n                        \"string\": \"Computers\"\n                      }\n                    }\n                  },\n                  {\n                    \"value\": {\n                      \"selection\": {\n                        \"directReference\": {\n                          \"structField\": {\n                            \"field\": 1\n                          }\n                        },\n                        \"rootReference\": {}\n                      }\n                    }\n                  }\n                ]\n              }\n            }\n          }\n        ]\n      }\n    }\n  }\n}\n

    The join relation will take two inputs. In the left field will be the read relation for orders and in the right field will be the filter relation (from products). The type field is an enum that allows us to specify we want an inner join. Finally, the expression field contains the expression to use in the join. Since we haven\u2019t used the equals() function yet, we use the reference number 3 here. (Again, we\u2019ll see at the end with plans how these functions are resolved.) The arguments refer to fields 0 and 4, which are indices into the combined schema formed from the left and right inputs. We\u2019ll discuss later in Field Indices where these come from.

    {\n  \"join\": {\n    \"left\": { ... },\n    \"right\": { ... },\n    \"type\": \"JOIN_TYPE_INNER\",\n    \"expression\": {\n      \"scalarFunction\": {\n        \"functionReference\": 3,\n        \"outputType\": {\n          \"bool\": {\n            \"nullability\": \"NULLABILITY_NULLABLE\"\n          }\n        },\n        \"arguments\": [\n          {\n            \"value\": {\n              \"selection\": {\n                \"directReference\": {\n                  \"structField\": {\n                    \"field\": 0\n                  }\n                },\n                \"rootReference\": {}\n              }\n            }\n          },\n          {\n            \"value\": {\n              \"selection\": {\n                \"directReference\": {\n                  \"structField\": {\n                    \"field\": 4\n                  }\n                },\n                \"rootReference\": {}\n              }\n            }\n          }\n        ]\n      }\n    }\n  }\n}\n

    The final aggregation requires two things, other than the input. First is the groupings. We\u2019ll use a single grouping containing expressions that reference the fields product_name and product_id. (Multiple groupings can be used to compute cube-style aggregations.)
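    For illustration only (this is not part of our plan), a cube-style aggregation that also computes a grand total would list two groupings, the second one empty:

    \"groupings\": [\n  {\n    \"groupingExpressions\": [\n      { \"selection\": { ... } },\n      { \"selection\": { ... } }\n    ]\n  },\n  {\n    \"groupingExpressions\": []\n  }\n]\n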

    For measures, we\u2019ll need to define sum(quantity * price) as sales. Substrait is stricter about data types, and quantity is an integer while price is a decimal. So we\u2019ll first need to cast quantity to a decimal, making the Substrait expression more like sum(multiply(cast(decimal(10, 2), quantity), price)). Both sum() and multiply() are functions, defined in functions_arithmetic_decimal.yaml. However, cast() is a special expression type in Substrait, rather than a function.
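    Pulled out on its own, the cast piece of that expression looks like the following; it reappears inside the full aggregate relation below:

    {\n  \"cast\": {\n    \"type\": {\n      \"decimal\": {\n        \"precision\": 10,\n        \"scale\": 2,\n        \"nullability\": \"NULLABILITY_REQUIRED\"\n      }\n    },\n    \"input\": {\n      \"selection\": {\n        \"directReference\": {\n          \"structField\": {\n            \"field\": 1\n          }\n        },\n        \"rootReference\": {}\n      }\n    }\n  }\n}\n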

    Finally, the naming with as sales will be handled at the end as part of the plan, so that\u2019s not part of the relation. Since we are always using field indices to refer to fields, Substrait doesn\u2019t record any intermediate field names.

    {\n  \"aggregate\": {\n    \"input\": { ... },\n    \"groupings\": [\n      {\n        \"groupingExpressions\": [\n          {\n            \"value\": {\n              \"selection\": {\n                \"directReference\": {\n                  \"structField\": {\n                    \"field\": 0\n                  }\n                },\n                \"rootReference\": {}\n              }\n            }\n          },\n          {\n            \"value\": {\n              \"selection\": {\n                \"directReference\": {\n                  \"structField\": {\n                    \"field\": 7\n                  }\n                },\n                \"rootReference\": {}\n              }\n            }\n          },\n        ]\n      }\n    ],\n    \"measures\": [\n      {\n        \"measure\": {\n          \"functionReference\": 4,\n          \"outputType\": {\n            \"decimal\": {\n              \"precision\": 38,\n              \"scale\": 2,\n              \"nullability\": \"NULLABILITY_NULLABLE\"\n            }\n          },\n          \"arguments\": [\n            {\n              \"value\": {\n                \"scalarFunction\": {\n                  \"functionReference\": 5,\n                  \"outputType\": {\n                    \"decimal\": {\n                      \"precision\": 38,\n                      \"scale\": 2,\n                      \"nullability\": \"NULLABILITY_NULLABLE\"\n                    }\n                  },\n                  \"arguments\": [\n                    {\n                      \"value\": {\n                        \"cast\": {\n                          \"type\": {\n                            \"decimal\": {\n                              \"precision\": 10,\n                              \"scale\": 2,\n                              \"nullability\": \"NULLABILITY_REQUIRED\"\n                            }\n                          },\n                          \"input\": {\n                            \"selection\": {\n                              \"directReference\": {\n                                \"structField\": {\n                                  \"field\": 1\n                                }\n                              },\n                              \"rootReference\": {}\n                            }\n                          }\n                        }\n                      }\n                    },\n                    {\n                      \"value\": {\n                        \"selection\": {\n                          \"directReference\": {\n                            \"structField\": {\n                              \"field\": 3\n                            }\n                          },\n                          \"rootReference\": {}\n                        }\n                      }\n                    }\n                  ]\n                }\n              }\n            }\n          ]\n        }\n      }\n    ]\n  }\n}\n
    "},{"location":"tutorial/sql_to_substrait/#field-indices","title":"Field indices","text":"

    So far, we have glossed over the field indices. Now that we\u2019ve built up each of the relations, it will be a bit easier to explain them.

    Throughout the plan, data always has some implicit schema, which is modified by each relation. Often, the schema can change within a relation\u2013we\u2019ll discuss an example in the next section. Each relation has its own rules for how schemas are modified, called the output order or emit order; a worked example for our query follows the list below. For the purposes of our query, the relevant rules are:

    • For Read relations, their output schema is the schema of the table.
    • For Filter relations, the output schema is the same as the input schema.
    • For Join relations, the input schema is the concatenation of the left and then the right schemas. The output schema is the same.
    • For Aggregate relations, the output schema is the group by fields followed by the measures.
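    Applying these rules to our query, the join output schema is the four orders fields followed by the four top-level products fields. This is why the join condition compares fields 0 and 4, and why the aggregate groups on fields 0 and 7:

    0: product_id   (orders)\n1: quantity     (orders)\n2: order_date   (orders)\n3: price        (orders)\n4: product_id   (products)\n5: categories   (products)\n6: details      (products)\n7: product_name (products)\n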

    Note

    Sometimes it can be hard to tell what the implicit schema is. For help determining that, consider using the substrait-validator tool, described in Next Steps.

    [Diagram: the mapping of field indices within each relation, and how each field reference appears in each relation\u2019s properties.]

    "},{"location":"tutorial/sql_to_substrait/#column-selection-and-emit","title":"Column selection and emit","text":"

    As written, the aggregate output schema will be:

    0: product_id: i64\n1: product_name: string\n2: sales: decimal(38, 2)\n

    But we want product_name to come before product_id in our output. How do we reorder those columns?

    You might be tempted to add a Project relation at the end. However, the project relation only adds columns; it is not responsible for subsetting or reordering columns.

    Instead, any relation can reorder or subset columns through the emit property. By default, it is set to direct, which outputs all columns \u201cas is\u201d. But it can also be specified as a sequence of field indices.
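    As a small sketch in protobuf JSON, the default behavior is an empty direct message inside the relation\u2019s common field, while the reordering form uses an outputMapping (shown for our aggregate below):

    \"common\": {\n  \"direct\": {}\n}\n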

    For simplicity, we will add this to the final aggregate relation. We could also add it to all relations, only selecting the fields we strictly need in later relations. Indeed, a good optimizer would probably do that to our plan. And for some engines, the emit property is only valid within a project relation, so in those cases we would need to add that relation in combination with emit. But to keep things simple, we\u2019ll limit the columns at the end within the aggregation relation.

    For our final column selection, we\u2019ll modify the top-level relation to be:

    {\n  \"aggregate\": {\n    \"input\": { ... },\n    \"groupings\": [ ... ],\n    \"measures\": [ ... ],\n    \"common\": {\n      \"emit\": {\n        \"outputMapping\": [1, 0, 2]\n      }\n    }\n}\n
    "},{"location":"tutorial/sql_to_substrait/#plans","title":"Plans","text":"

    Now that we\u2019ve constructed our relations, we can put it all into a plan. Substrait plans are the only messages that can be sent and received on their own. Recall that earlier, we had function references to those YAML files, but so far there\u2019s been no place to tell a consumer what those function reference IDs mean or which extensions we are using. That information belongs at the plan level.

    The overall layout for a plan is

    {\n  \"extensionUris\": [ ... ],\n  \"extensions\": [ ... ],\n  \"relations\": [\n    {\n      \"root\": {\n        \"names\": [\n          \"product_name\",\n          \"product_id\",\n          \"sales\"\n        ],\n        \"input\": { ... }\n      }\n    }\n  ]\n}\n

    The relations field is a list of Root relations. Most queries only have one root relation, but the spec allows for multiple so a common plan could be referenced by other plans, sort of like a CTE (Common Table Expression) from SQL. The root relation provides the final column names for our query. The input to this relation is our aggregate relation (which contains all the other relations as children).

    For extensions, we need to provide extensionUris with the locations of the YAML files we used and extensions with the list of functions we used and which extension they come from.

    In our query, we used:

    • index_in (1), from functions_set.yaml,
    • is_null (2), from functions_comparison.yaml,
    • equal (3), from functions_comparison.yaml,
    • sum (4), from functions_arithmetic_decimal.yaml,
    • multiply (5), from functions_arithmetic_decimal.yaml.

    So first we can create the three extension URIs:

    [\n  {\n    \"extensionUriAnchor\": 1,\n    \"uri\": \"https://github.com/substrait-io/substrait/blob/main/extensions/functions_set.yaml\"\n  },\n  {\n    \"extensionUriAnchor\": 2,\n    \"uri\": \"https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml\"\n  },\n  {\n    \"extensionUriAnchor\": 3,\n    \"uri\": \"https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic_decimal.yaml\"\n  }\n]\n

    Then we can create the extensions:

    [\n  {\n    \"extensionFunction\": {\n      \"extensionUriReference\": 1,\n      \"functionAnchor\": 1,\n      \"name\": \"index_in\"\n    }\n  },\n  {\n    \"extensionFunction\": {\n      \"extensionUriReference\": 2,\n      \"functionAnchor\": 2,\n      \"name\": \"is_null\"\n    }\n  },\n  {\n    \"extensionFunction\": {\n      \"extensionUriReference\": 2,\n      \"functionAnchor\": 3,\n      \"name\": \"equal\"\n    }\n  },\n  {\n    \"extensionFunction\": {\n      \"extensionUriReference\": 3,\n      \"functionAnchor\": 4,\n      \"name\": \"sum\"\n    }\n  },\n  {\n    \"extensionFunction\": {\n      \"extensionUriReference\": 3,\n      \"functionAnchor\": 5,\n      \"name\": \"multiply\"\n    }\n  }\n]\n

    Once we\u2019ve added our extensions, the plan is complete. The full plan output is available at final_plan.json.

    "},{"location":"tutorial/sql_to_substrait/#next-steps","title":"Next steps","text":"

    Validate and introspect plans using substrait-validator. Amongst other things, this tool can show what the current schema and column indices are at each point in the plan. Try downloading the final plan JSON above and generating an HTML report on the plan with:

    substrait-validator final_plan.json --out-file output.html\n
    "},{"location":"types/named_structs/","title":"Named Structs","text":"

    A Named Struct is a special type construct that combines:

    • A Struct type
    • A list of names for the fields in the Struct, in depth-first search order

    The depth-first search order for names arises from the ability to nest Structs within other types. All Struct fields must be named, even nested fields.

    Named Structs are most commonly used to model the schema of Read relations.

    "},{"location":"types/named_structs/#determining-names","title":"Determining Names","text":"

    When producing/consuming names for a NamedStruct, some types require special handling:

    "},{"location":"types/named_structs/#struct","title":"Struct","text":"

    A struct has names for each of its inner fields.

    For example, the following Struct

    struct<i64, i64>\n       \u2191    \u2191\n       a    b\n
    has 2 names, one for each of its inner fields.

    "},{"location":"types/named_structs/#structs-within-compound-types","title":"Structs within Compound Types","text":"

    Struct types nested in compound types must also be named.

    "},{"location":"types/named_structs/#structs-within-maps","title":"Structs within Maps","text":"

    If a Map contains Structs, either as keys or values or both, the Struct fields must be named. Keys are named before values. For example, the following Map

    map<struct<i64, i64>, struct<i64, i64, i64>>\n           \u2191    \u2191            \u2191    \u2191    \u2191\n           a    b            c    d    e\n
    has 5 named fields:

    • 2 names [a, b] for the struct fields used as a key
    • 3 names [c, d, e] for the struct fields used as a value

    "},{"location":"types/named_structs/#structs-within-list","title":"Structs within List","text":"

    If a List contains Structs, the Struct fields must be named. For example, the following List

    list<struct<i64, i64>>\n            \u2191    \u2191\n            a    b\n
    has 2 named fields [a, b] for the struct fields.

    "},{"location":"types/named_structs/#structs-within-struct","title":"Structs within Struct","text":"

    Structs can also be embedded within Structs.

    A Struct like

    struct<struct<i64, i64>, struct<i64, i64, i64>>\n       \u2191      \u2191    \u2191     \u2191      \u2191    \u2191    \u2191\n       a      b    c     d      e    f    g\n
    has 7 names:

    • 1 name [a] for the 1st nested struct field
    • 2 names [b, c] for the fields within the 1st nested struct
    • 1 name [d] for the 2nd nested struct field
    • 3 names [e, f, g] for the fields within the 2nd nested struct

    "},{"location":"types/named_structs/#putting-it-all-together","title":"Putting It All Together","text":""},{"location":"types/named_structs/#simple-named-struct","title":"Simple Named Struct","text":"
    NamedStruct {\n    names: [a, b, c, d]\n    struct: struct<i64, list<i64>, map<i64, i64>, i64>\n                   \u2191    \u2191          \u2191              \u2191\n                   a    b          c              d\n}\n
    "},{"location":"types/named_structs/#structs-in-compound-types","title":"Structs in Compound Types","text":"
    NamedStruct {\n    names: [a, b, c, d, e, f, g, h]\n    struct: struct<i64, list<struct<i64, i64>>, map<i64, struct<i64, i64>>, i64>\n                   \u2191    \u2191          \u2191     \u2191      \u2191               \u2191    \u2191      \u2191\n                   a    b          c     d      e               f    g      h\n}\n
    "},{"location":"types/named_structs/#structs-in-structs","title":"Structs in Structs","text":"
    NamedStruct {\n    names: [a, b, c, d, e, f, g, h, i, j]\n    struct: struct<i64, struct<i64, struct<i64, i64>, i64, struct<i64, i64>>>\n                   \u2191    \u2191      \u2191    \u2191      \u2191    \u2191     \u2191    \u2191      \u2191    \u2191\n                   a    b      c    d      e    f     g    h      i    j\n}\n
    "},{"location":"types/type_classes/","title":"Type Classes","text":"

    In Substrait, the \u201cclass\u201d of a type, not to be confused with the concept from object-oriented programming, defines the set of non-null values that instances of a type may assume.

    Implementations of a Substrait type must support at least this set of values, but may include more; for example, an i8 could be represented using the same in-memory format as an i32, as long as functions operating on i8 values within [-128..127] behave as specified (in this case, this means 8-bit overflow must work as expected). Operating on values outside the specified range is unspecified behavior.

    "},{"location":"types/type_classes/#simple-types","title":"Simple Types","text":"

    Simple type classes are those that don\u2019t support any form of configuration. For simplicity, any generic type that has only a small number of discrete implementations is declared directly, as opposed to via configuration.

    Type Name | Description | Protobuf representation for literals
    boolean | A value that is either True or False. | bool
    i8 | A signed integer within [-128..127], typically represented as an 8-bit two\u2019s complement number. | int32
    i16 | A signed integer within [-32,768..32,767], typically represented as a 16-bit two\u2019s complement number. | int32
    i32 | A signed integer within [-2,147,483,648..2,147,483,647], typically represented as a 32-bit two\u2019s complement number. | int32
    i64 | A signed integer within [\u22129,223,372,036,854,775,808..9,223,372,036,854,775,807], typically represented as a 64-bit two\u2019s complement number. | int64
    fp32 | A 4-byte single-precision floating point number with the same range and precision as defined for the IEEE 754 32-bit floating-point format. | float
    fp64 | An 8-byte double-precision floating point number with the same range and precision as defined for the IEEE 754 64-bit floating-point format. | double
    string | A unicode string of text, [0..2,147,483,647] UTF-8 bytes in length. | string
    binary | A binary value, [0..2,147,483,647] bytes in length. | binary
    timestamp | A naive timestamp with microsecond precision. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | int64 microseconds since 1970-01-01 00:00:00.000000 (in an unspecified timezone)
    timestamp_tz | A timezone-aware timestamp with microsecond precision. Similar to aware datetime in Python. | int64 microseconds since 1970-01-01 00:00:00.000000 UTC
    date | A date within [1000-01-01..9999-12-31]. | int32 days since 1970-01-01
    time | A time since the beginning of any day. Range of [0..86,399,999,999] microseconds; leap seconds need not be supported. | int64 microseconds past midnight
    interval_year | Interval year to month. Supports a range of [-10,000..10,000] years with month precision (= [-120,000..120,000] months). Usually stored as separate integers for years and months, but only the total number of months is significant, i.e. 1y 0m is considered equal to 0y 12m or 1001y -12000m. | int32 years and int32 months, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. -10000y 200000m is not allowed)
    uuid | A universally-unique identifier composed of 128 bits. Typically presented to users in the following hexadecimal format: c48ffa9e-64f4-44cb-ae47-152b4e60e77b. Any 128-bit value is allowed, without specific adherence to RFC4122. | 16-byte binary

    "},{"location":"types/type_classes/#compound-types","title":"Compound Types","text":"

    Compound type classes are type classes that need to be configured by means of a parameter pack.

    Type Name | Description | Protobuf representation for literals
    FIXEDCHAR<L> | A fixed-length unicode string of L characters. L must be within [1..2,147,483,647]. | L-character string
    VARCHAR<L> | A unicode string of at most L characters. L must be within [1..2,147,483,647]. | string with at most L characters
    FIXEDBINARY<L> | A binary string of L bytes. When casting, values shorter than L are padded with zeros, and values longer than L are right-trimmed. | L-byte bytes
    DECIMAL<P, S> | A fixed-precision decimal value having precision (P, number of digits) <= 38 and scale (S, number of fractional digits) 0 <= S <= P. | 16-byte bytes representing a little-endian 128-bit integer, to be divided by 10^S to get the decimal value
    STRUCT<T1,\u2026,Tn> | A list of types in a defined order. | repeated Literal, types matching T1..Tn
    NSTRUCT<N:T1,\u2026,N:Tn> | Pseudo-type: A struct that maps unique names to value types. Each name is a UTF-8-encoded string. Each value can have a distinct type. Note that NSTRUCT is actually a pseudo-type, because Substrait\u2019s core type system is based entirely on ordinal positions, not named fields. Nonetheless, when working with systems outside Substrait, names are important. | n/a
    LIST<T> | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | repeated Literal, all types matching T
    MAP<K, V> | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | repeated KeyValue (in turn two Literals), all key types matching K and all value types matching V
    PRECISIONTIMESTAMP<P> | A timestamp with fractional second precision (P, number of digits) 0 <= P <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | int64 seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone)
    PRECISIONTIMESTAMPTZ<P> | A timezone-aware timestamp, with fractional second precision (P, number of digits) 0 <= P <= 9. Similar to aware datetime in Python. | int64 seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 UTC
    INTERVAL_DAY<P> | Interval day to second. Supports a range of [-3,650,000..3,650,000] days with fractional second precision (P, number of digits) 0 <= P <= 9. Usually stored as separate integers for various components, but only the total number of fractional seconds is significant, i.e. 1d 0s is considered equal to 0d 86400s. | int32 days, int32 seconds, and int64 fractional seconds, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. 3650001d -86400s 0us is not allowed)
    INTERVAL_COMPOUND<P> | A compound interval type that combines the underlying elements and rules of both interval_year and interval_day to express arbitrary durations across multiple grains. Substrait gives no definition for the conversion of values between independent grains (e.g. months to days). |

    "},{"location":"types/type_classes/#user-defined-types","title":"User-Defined Types","text":"

    User-defined type classes are defined as part of simple extensions. An extension can declare an arbitrary number of user-defined extension types. Once a type has been declared, it can be used in function declarations.

    For example, the following declares a type named point (namespaced to the associated YAML file) and two scalar functions that operate on it.

    types:\n  - name: \"point\"\n\nscalar_functions:\n  - name: \"lat\"\n    impls:\n      - args:\n          - name: p\n            value: u!point\n        return: fp64\n  - name: \"lon\"\n    impls:\n      - args:\n          - name: p\n            value: u!point\n        return: fp64\n
    "},{"location":"types/type_classes/#handling-user-defined-types","title":"Handling User-Defined Types","text":"

    Systems without support for a specific user-defined type:

    • Cannot generate values of the type.
    • Cannot implement functions operating on the type.
    • May support consuming and emitting values of the type without modifying them.

    "},{"location":"types/type_classes/#communicating-user-defined-types","title":"Communicating User-Defined Types","text":"

    Specifiers of user-defined types may provide additional structure information for the type to assist in communicating values of the type to and from systems without built-in support.

    For example, the following declares a point type with two i32 values named longitude and latitude:

    types:\n  - name: point\n    structure:\n      longitude: i32\n      latitude: i32\n

    The name-type object notation used above is syntactic sugar for NSTRUCT<longitude: i32, latitude: i32>. The following means the same thing:

    name: point\nstructure: \"NSTRUCT<longitude: i32, latitude: i32>\"\n

    The structure field of a type is only intended to inform systems that don\u2019t have built-in support for the type about how they can create and transfer values of that type to systems that do support the type.

    The structure field does not restrict or bind the internal representation of the type in any system.

    As such, it\u2019s currently not possible to \u201cunpack\u201d a user-defined type into its structure type or components thereof using FieldReferences or any other specialized record expression; if support for this is desired for a particular type, this can be accomplished with an extension function.

    "},{"location":"types/type_classes/#literals","title":"Literals","text":"

    Literals for user-defined types can be represented in one of two ways:

    • Using protobuf Any messages.
    • Using the structure representation of the type.
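    A rough sketch of the first form in protobuf JSON: the userDefined literal carries a typeReference (the anchor of the user-defined type, as declared in the plan extensions) and an Any value. The example.Point message and its fields here are hypothetical:

    {\n  \"literal\": {\n    \"userDefined\": {\n      \"typeReference\": 1,\n      \"value\": {\n        \"@type\": \"type.googleapis.com/example.Point\",\n        \"longitude\": 42,\n        \"latitude\": 12\n      }\n    }\n  }\n}\n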

    "},{"location":"types/type_classes/#compound-user-defined-types","title":"Compound User-Defined Types","text":"

    User-defined types may be turned into compound types by requiring parameters to be passed to them. The supported \u201cmeta-types\u201d for parameters are data types (like those used in LIST, MAP, and STRUCT), booleans, integers, enumerations, and strings. Using parameters, we could redefine \u201cpoint\u201d with different types of coordinates. For example:

    name: point\nparameters:\n  - name: T\n    description: |\n      The type used for the longitude and latitude\n      components of the point.\n    type: dataType\n

    or:

    name: point\nparameters:\n  - name: coordinate_type\n    type: enumeration\n    options:\n      - integer\n      - double\n

    or:

    name: point\nparameters:\n  - name: LONG\n    type: dataType\n  - name: LAT\n    type: dataType\n

    We can\u2019t specify the internal structure in this case, because there is currently no support for derived types in the structure.

    The allowed range can be limited for integer parameters. For example:

    name: vector\nparameters:\n  - name: T\n    type: dataType\n  - name: dimensions\n    type: integer\n    min: 2\n    max: 3\n

    This specifies a vector that can be either 2- or 3-dimensional. Note however that it\u2019s not currently possible to put constraints on data type, string, or (technically) boolean parameters.

    Similar to function arguments, the last parameter may be specified to be variadic, allowing it to be specified one or more times instead of only once. For example:

    name: union\nparameters:\n  - name: T\n    type: dataType\nvariadic: true\n

    This defines a type that can be parameterized with one or more other data types, for example union<i32, i64> but also union<bool>. Zero or more is also possible, by making the last parameter optional:

    name: tuple\nparameters:\n  - name: T\n    type: dataType\n    optional: true\nvariadic: true\n

    This would also allow for tuple<>, to define a zero-tuple.

    "},{"location":"types/type_parsing/","title":"Type Syntax Parsing","text":"

    In many places, it is useful to have a human-readable string representation of data types. Substrait has a custom syntax for type declaration. The basic structure of a type declaration is:

    name?[variation]<param0,...,paramN>\n

    The components of this expression are:

    Component | Description | Required
    Name | Each type has a name. A type is expressed by providing a name. This name can be expressed in arbitrary case (e.g. varchar and vArChAr are equivalent) although lowercase is preferred. | Required
    Nullability indicator | A type is either non-nullable or nullable. To express nullability, a question mark is added after the type name (before any parameters). | Optional, defaults to non-nullable
    Variation | When expressing a type, a user can define the type based on a type variation. Some systems use type variations to describe different underlying representations of the same data type. This is expressed as a bracketed integer such as [2]. | Optional, defaults to [0]
    Parameters | Compound types may have one or more configurable properties. The two main types of properties are integer and type properties. The parameters for each type correspond to a list of known properties associated with a type as declared in the order defined in the type specification. For compound types (types that contain types), the data type syntax will include nested type declarations. The one exception is structs, which are further outlined below. | Required where parameters are defined

    "},{"location":"types/type_parsing/#grammars","title":"Grammars","text":"

    It is relatively easy in most languages to produce simple parser & emitters for the type syntax. To make that easier, Substrait also includes an ANTLR grammar to ease consumption and production of types. (The grammar also supports an entire language for representing plans as text.)
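    Putting the components together, here are a few examples of the syntax; the specific types and parameters are chosen for illustration:

    i32                  // non-nullable 32-bit integer, system-preferred variation\ni32?                 // nullable 32-bit integer\nvarchar<10>          // compound type with one integer parameter\ndecimal?[2]<38, 10>  // nullable decimal in variation 2, with precision 38 and scale 10\nlist<string?>        // compound type with a nested type parameter\n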

    "},{"location":"types/type_parsing/#structs-named-structs","title":"Structs & Named Structs","text":"

    Structs are unique from other types because they have an arbitrary number of parameters. The parameters are recursive and may include their own subproperties. Struct parsing is declared in the following two ways:

    # Struct\nstruct?[variation]<type0, type1,..., typeN>\n\n# Named Struct\nnstruct?[variation]<name0:type0, name1:type1,..., nameN:typeN>\n
    // Struct\nstruct?<string, i8, i32?, timestamp_tz>\n\n// Named structs are not yet supported in the text format.\n

    In the normal (non-named) form, struct declares a set of types that are fields within that struct. In the named struct form, the parameters are formed by tuples of names + types, delineated by a colon. Names that are composed only of numbers and letters can be left unquoted. For other characters, names should be quoted with double quotes and use backslash for double-quote escaping.
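    For example, with hypothetical field names, an unquoted name and a quoted name look like:

    nstruct<productId:i64, \"product name\":string>\n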

    Note, in core Substrait algebra, fields are unnamed and references are always based on zero-index ordinal positions. However, data inputs must declare name-to-ordinal mappings and outputs must declare ordinal-to-name mappings. As such, Substrait also provides a named struct which is a pseudo-type that is useful for human consumption. Outside these places, most structs in a Substrait plan are structs, not named-structs. The two cannot be used interchangeably.

    "},{"location":"types/type_parsing/#other-complex-types","title":"Other Complex Types","text":"

    Similar to structs, maps and lists can also have a type as one of their parameters. Type references may be recursive. The key for a map is typically a simple type but it is not required.

    list?<type>\nmap<type0, type1>\n
    list?<list<string>>\nlist<struct<string, i32>>\nmap<i32?, list<map<i32, string?>>>\n
    "},{"location":"types/type_system/","title":"Type System","text":"

    Substrait tries to cover the most common types used in data manipulation. Types beyond this common core may be represented using simple extensions.

    Substrait types fundamentally consist of four components:

    Component | Condition | Examples | Description
    Class | Always | i8, string, STRUCT, extensions | Together with the parameter pack, describes the set of non-null values supported by the type. Subdivided into simple and compound type classes.
    Nullability | Always | Either NULLABLE (? suffix) or REQUIRED (no suffix) | Describes whether values of this type can be null. Note that null is considered to be a special value of a nullable type, rather than the only value of a special null type.
    Variation | Always | No suffix or explicitly [0] (system-preferred), or an extension | Allows different variations of the same type class to exist in a system at a time, usually distinguished by in-memory format.
    Parameters | Compound types only | <10, 2> (for DECIMAL), <i32, string> (for STRUCT) | Some combination of zero or more data types or integers. The expected set of parameters and the significance of each parameter depends on the type class.

    Refer to Type Parsing for a description of the syntax used to describe types.

    Note

    Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via cast expressions.

    "},{"location":"types/type_variations/","title":"Type Variations","text":"

    Type variations may be used to represent differences in representation between different consumers. For example, an engine might support dictionary encoding for a string, or could be using either a row-wise or columnar representation of a struct. All variations of a type are expected to have the same semantics when operated on by functions or other expressions.
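    As a sketch, a variation might be declared in a simple extension YAML like the following; the type_variations section and field spellings are assumptions to be checked against the simple extensions schema, and the dictionary-encoded string is a hypothetical example:

    type_variations:\n  - parent: string\n    name: dictionary_encoded\n    description: A string variation using a dictionary-encoded in-memory format.\n    functions: INHERITS\n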

    All variations except the \u201csystem-preferred\u201d variation (a.k.a. [0], see Type Parsing) must be defined using simple extensions. The key properties of these variations are:

    Property | Description
    Base Type Class | The type class that this variation belongs to.
    Name | The name used to reference this type. Should be unique within type variations for this parent type within a simple extension.
    Description | A human description of the purpose of this type variation.
    Function Behavior | INHERITS or SEPARATE: whether functions that support the system-preferred variation implicitly also support this variation, or whether functions should be resolved independently. For example, if one has the function add(i8,i8) defined and then defines an i8 variation, this determines whether the i8 variation can be bound to the base add operation (inherits) or whether a specialized version of add needs to be defined specifically for this variation (separate). Defaults to inherits.
    "}]}