01 Jun 22:01

mwylde

ff6c985

v0.3.0

We're thrilled to announce the 0.3.0 release of Arroyo, our second minor release as an open-source project. Arroyo is a state-of-the-art stream processing engine designed to allow anyone to build complex, stateful real-time data pipelines with SQL.

Overview

The Arroyo 0.3 release focused on improving the flexibility of the system and completeness of SQL support, with the MVP for UDF support, DDL statements, and custom event time and watermarks. There have also many substantial improvements to the Web UI, including error reporting, backpressure monitoring, and under-the-hood infrastructure improvements.

We've also greatly expanded our docs since the last release. Check them out at https://doc.arroyo.dev.

New contributors

We are excited to welcome three new contributors to the project with this release:

@haoxins made their first contribution in #100
@edmondop made their first contribution in #122
@chenquan made their first contribution in #147

Thanks to all new and existing contributors!

What's next

Looking forward to the 0.4 release, we have a lot of exciting changes planned. We're adding the ability to create updating tables with native support for Debezium, allowing users to connect Arroyo to relational databases like MySQL and Postgres. Other planned features include external joins, session windows, and Delta Lake integration.

Excited to be part of the future of stream processing? Come chat with the team on our discord, check out a starter issue and submit a PR, and let us know what you'd like to see next in Arroyo!

Features

UDFs

With this release we are shipping initial support for writing user-defined functions (UDFs) in Rust, allowing users to extend SQL with custom business logic. See the udf docs for full details.

For example, we can register a Rust function:

// Returns the great-circle distance between two coordinates
fn gcd(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    let radius = 6371.0;

    let dlat = (lat2 - lat1).to_radians();
    let dlon = (lon2 - lon1).to_radians();

    let a = (dlat / 2.0).sin().powi(2) +
        lat1.to_radians().cos() *
            lat2.to_radians().cos() *
                (dlon / 2.0).sin().powi(2);
    let c = 2.0 * a.sqrt().atan2((1.0 - a).sqrt());

    radius * c
}

and call it from SQL:

SELECT gcd(src_lat, src_long, dst_lat, dst_long)
FROM orders;

MVP of Rust UDFs by @mwylde in #104

SQL DDL statements

It's now possible to define sources and sinks directly in SQL via CREATE TABLE statements:

CREATE TABLE orders (
  customer_id INT,
  order_id INT,
  date_string TEXT
) WITH (
  connection = 'my_kafka',
  topic = 'order_topic',
  serialization_mode = 'json'
);

These tables can then be selected from and inserted into to read and write from those systems. For example, we can duplicate the orders topic by inserting from it into a new table:

CREATE TABLE orders_copy (
  customer_id INT,
  order_id INT,
  date_string TEXT
) WITH (
  connection = 'my_kafka',
  topic = 'order_topic',
  serialization_mode = 'json'
);


INSERT INTO orders_copy SELECT * FROM orders;

In addition to connection tables, this release also adds support for views and virtual tables, which are helpful for splitting up complex queries into smaller components.

Feature/inline create table by @jacksonrnewhouse in #101
Rework sources and sinks to allow for creating tables/views in SQL queries by @jacksonrnewhouse in #107

Custom event time and watermarks

Arroyo now supports custom event time fields and watermarks, allowing users to define their own event time fields and watermarks based on the data in their streams.

When creating a connection table in SQL, it is now possible to define a virtual field generated from the data in the stream and then assign that to be the event time. We can then generate a watermark from that event time field as well.

A complete example looks like this:

CREATE TABLE orders (
  customer_id INT,
  order_id INT,
  date_string TEXT,
  event_time TIMESTAMP GENERATED ALWAYS AS (CAST(date_string as TIMESTAMP)),
  watermark TIMESTAMP GENERATED ALWAYS AS (event_time - INTERVAL '15' SECOND)
) WITH (
  connection = 'my_kafka',
  topic = 'order_topic',
  serialization_mode = 'json',
  event_time_field = 'event_time',
  watermark_field = 'watermark'
);

For more on the underlying concepts of event times and watermarks, see the concept docs.

Support virtual fields and overriding timestamp via event_time_field by @jacksonrnewhouse in #127
Add ability to configure watermark by specifying a specific override column by @jacksonrnewhouse in #142

Additional SQL features

Beyond UDFs and DDL statements, we have continued to expand the completeness of our SQL support with addition of case statements and regex functions:

Allow filters on join computations by @jacksonrnewhouse in #131
Implement CASE statements by @jacksonrnewhouse in #146
Adding support for regex_replace and regex_match by @edmondop in #122
Rework top N window functions by @jacksonrnewhouse in #136

Server-Sent Events source

We've added a new source which allows reading from Server-Sent Events APIs (also called EventSource). SSE is a simple protocol for streaming data from HTTP servers and is a common choice for web applications. See the SSE source documentation for more details, and take a look at the new Mastodon trends tutorial that makes
uses of it

Add event source source operator by @mwylde in #106
Add HTTP connections and add support for event source tables in SQL by @mwylde in #119

Web UI

This release has seen a ton of improvements to the web UI.

Show SQL names instead of primitive types in catalog by @jbeisen in#84
Add backpressure metric by @jbeisen in #109
Add backpressure graph and color pipeline nodes by @jbeisen in #110
Add page not found page by @jbeisen in #130
Use SWR for fetching data for job details page by @jbeisen in #129
Show operator checkpoint sizes by @jbeisen in #139
Write eventsource and kafka source errors to db by @jbeisen in #140
Add Errors tab to job details page by @jbeisen in #149

Improvements

Improvements to Kafka consumer/producer reliability and correctness by @mwylde in #132
Implement full_pipeline_codegen proc macro to test pipeline codegen by @jacksonrnewhouse in #135
Bump datafusion to 25.0.0 by @jacksonrnewhouse in #145
Add docker.yaml to build and push docker images. by @jacksonrnewhouse in #150
Add basic end-to-end integration test by @mwylde in #108
Add event tracking by @mwylde #144
Helm: Create service account for Postgres deployment by @haoxins in #100
Enforce prettier and eslint in the github pipeline by @jbeisen in #120
Check formatting on PRs by[@jacksonrnewhouse](http...

Assets 2

02 May 21:30

mwylde

release-0.2.0

fdb1560

v0.2.0

Arroyo 0.2.0

Arroyo is a new, state-of-the-art stream processing engine that makes it easy to build complex real-time data pipelines with SQL. This release marks our first versioned release of Arroyo since we open-sourced the engine in April.

We're excited to welcome three new contributors to the project:

@rtyler made their first contribution in #8
@akennedy4155 made their first contribution in #49
@jbeisen made their first contribution in #77

With the 0.2.0 release, we are continuing to push forward on features, stability, and productionization. We’ve added native Kubernetes support and easy deployment via a Helm chart, expanded our SQL support with features like JSON functions and windowless joins, and made many more fixes and improvements detailed below.

Looking forward to the 0.3.0 release, we will continue to improve our SQL support with the ability to create sources and sinks directly as SQL tables, views, UDFs and external joins. We will also be adding a native Pulsar connector and making continued improvements in performance and reliability.

Excited to be part of the future of stream processing? Come chat with the team on our discord, check out a starter issue and submit a PR, and let us know what you’d like to see next in Arroyo!

Features

Native Kubernetes support

As of release 0.2.0, Arroyo can natively target Kubernetes as a scheduler for running pipelines. We now also support easily running the Arroyo control plane on Kubernetes using our new helm chart.

Getting started is as easy as

$ helm repo add arroyo https://arroyosystems.github.io/helm-repo
$ helm install arroyo arroyo/arroyo \
  --set s3.bucket=my-bucket,s3.region=us-east-1

See the docs for all the details.

Add Kubernetes scheduler by @mwylde in #79
K8s deployment and helm chart by @mwylde in #91

Nomad deployments

Arroyo has long had first-class support for Nomad as a scheduler, where we take advantage of the very low-latency and lightweight scheduling support. Now we also support Nomad as an easy deploy target for the control plane as well via a nomad pack.

See the docs for more details.

Support for deploying Arroyo to a nomad cluster by @mwylde in #50

SQL features

With this release we are making big improvements in SQL completeness. Notably, we’ve made our JSON support much more flexible with the introduction of SQL JSON functions including get_json_objects, get_first_json_object, and extract_json_string.

We’ve also added support for windowless joins.

Here are some of the highlights:

Initial JSON functions and raw Kafka Source by @jacksonrnewhouse in #86
Windowless Joins by @jacksonrnewhouse in #61
String functions by @jacksonrnewhouse in #17
Hashing Functions by @akennedy4155 in #49
Casting between numeric types and strings by @jacksonrnewhouse in #5
Casting timestamps to text by @jacksonrnewhouse in #32
String Concat Operator || in SQL by @akennedy4155 in #55
Add COALESCE, NULLIF, MAKE_ARRAY by @jacksonrnewhouse in #89

Connectors, Web UI, and platform support

Arroyo now supports SASL authentication for Kafka and FreeBSD

Add FreeBSD support by @rtyler in #8, #19
SASL authentication support to kafka connections by @jacksonrnewhouse in #20
Add support for changing pipeline parallelism in the Web UI by @jbeisen in #77

Fixes

Fix filter on partition_by parsing. by @jacksonrnewhouse in #27
Make parquet state management more reliable by @jacksonrnewhouse in #23
Fix the quoting of types in the sql package by @jacksonrnewhouse in #64

Improvements

SQL macro testing by @jacksonrnewhouse in #10
Add a SQL IR and factor out optimizations by @jacksonrnewhouse in #80
Multi-arch builds for Docker by @jacksonrnewhouse in #11
Prometheus and pushgateway in the docker image for working metrics by @mwylde in #16
Bump datafusion to 23.0, arrow to 37.0 by @jacksonrnewhouse in #92
Run compiler service locally, compile in debug mode if DEBUG is set by @jacksonrnewhouse in #83
Replace shelling out to rustfmt with prettyplease by @jacksonrnewhouse in #87

See the full changelog: https://github.com/ArroyoSystems/arroyo/commits/release-0.2.0

Contributors

rtyler, mwylde, and 3 other contributors

Assets 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overview

New contributors

What's next

Features

UDFs

SQL DDL statements

Custom event time and watermarks

Additional SQL features

Server-Sent Events source

Web UI

Improvements

Arroyo 0.2.0

Features

Native Kubernetes support

Nomad deployments

SQL features

Connectors, Web UI, and platform support

Fixes

Improvements

Contributors

Releases: ArroyoSystems/arroyo

v0.3.0

Overview

New contributors

What's next

Features

UDFs

SQL DDL statements

Custom event time and watermarks

Additional SQL features

Server-Sent Events source

Web UI

Improvements

v0.2.0

Arroyo 0.2.0

Features

Native Kubernetes support

Nomad deployments

SQL features

Connectors, Web UI, and platform support

Fixes

Improvements

Contributors