
v0.6.0


These release notes are also available on our blog.

Arroyo 0.6 is focused on quality. After a breakneck pace of new features in the past few releases, we're taking a step back to focus on testing, correctness, and stability. I'm particularly excited about our new SQL test suite, which has uncovered a number of bugs and edge cases in our SQL support.

That said, it wouldn't be a new Arroyo release without some new features! We've added support for Google Cloud Storage (GCS) and improved our Helm chart so that Arroyo now runs well on Google Kubernetes Engine (GKE). We've also included a couple of new connectors (a polling HTTP source and a webhook sink) that help Arroyo fit in at companies that are earlier in their streaming journey. And we've shipped the first version of our UDAF support, which allows you to write your own SQL aggregate functions in Rust.

Thanks to all our contributors for this release.

What's next

Now that 0.6 is out the door, we're already hard at work on the next release, planned for mid-October. We're working on a number of exciting features and improvements. We're adding support for accumulator-based UDAFs, which will allow for efficient incremental calculation of aggregates. We're also working on improvements to our checkpoint storage to support compaction and more efficient deletes. And we'll be releasing a Postgres connector to allow writing directly to Postgres tables without having to use Debezium.

Anything you'd like to see? Let us know on Discord!

Now on to the details.


Features

GCS/GKE

Arroyo has long supported Amazon S3 as a storage backend for pipeline artifacts and checkpoints. With this release we're greatly expanding our storage options, adding support for Google Cloud Storage (GCS) and alternative S3-compatible systems like MinIO and LocalStack.

We've also made storage configuration more flexible and consistent via two environment variables: CHECKPOINT_URL and ARTIFACT_URL. These variables accept a variety of URL forms, like:

  • s3://my-bucket
  • s3::https://my-custom-s3:1234/my-bucket
  • https://s3.us-east-1.amazonaws.com/my-bucket
  • file:///my/local/filesystem
  • gs://my-gcs-bucket

See the docs for how to configure Arroyo on GKE.

  • Add support for GCS and minio/localstack by @mwylde in #296

User-defined aggregate functions

SQL aggregate functions allow you to compute summary statistics over a group of rows, like SUM, AVG, and COUNT. Arroyo 0.6 adds initial support for user-defined aggregate functions (UDAFs), which allow you to write your own aggregate functions in Rust.

For example, Arroyo doesn't include an aggregate function for computing Exponential Moving Average (EMA), but now you can write your own:

fn ema(data: Vec<f64>) -> f64 {
    // Smoothing factor: higher values weight recent data more heavily
    const ALPHA: f64 = 0.1;

    let mut ema = data[0];
    for &value in &data[1..] {
        ema = ALPHA * value + (1.0 - ALPHA) * ema;
    }

    ema
}
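
Once defined, ema can be called like a built-in aggregate. Here's a minimal sketch of invoking it over a tumbling window; the trades table, price column, and window size are hypothetical, for illustration only:

SELECT ema(price) AS price_ema   -- price is an assumed f64 column
FROM trades                      -- hypothetical source table
GROUP BY tumble(interval '1 minute');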

Currently only UDAFs over vectors are supported, but we plan to expand this to support UDAFs that rely on partial aggregates and two-phase aggregations.

Connectors

We've added two new connectors that are well suited for companies integrating Arroyo with HTTP-based systems, complementing our existing Server-Sent Events and WebSocket sources.

Polling HTTP source

The new polling HTTP source lets you turn any HTTP API into a streaming source. It periodically polls the endpoint and emits any new data into the stream.

For example, we get excited whenever a new PR is opened on GitHub, and we want to know about it as soon as possible. We can use the polling HTTP source to periodically fetch our PRs and emit an event whenever the response changes:

CREATE TABLE prs (
    value TEXT
) WITH (
    connector = 'polling_http',
    endpoint = 'https://api.github.com/repos/ArroyoSystems/arroyo/pulls?per_page=1&state=all',
    poll_interval_ms = '5000',
    emit_behavior = 'changed',
    headers = 'User-Agent:arroyo/0.6',
    format = 'json',
    'json.unstructured' = 'true'
);

SELECT extract_json_string(value, '$[0].url') AS url
FROM prs;

Webhook sink

On the other end of your application, we now have a webhook sink that lets you send pipeline results to any HTTP endpoint. This allows you to consume outputs without needing to adopt streaming technologies like Kafka.

For example, we can build a simple Slack notification for our PRs using the webhook sink and the source we defined above:

CREATE TABLE slack (
    value TEXT
) WITH (
    connector = 'webhook',
    endpoint = 'https://hooks.slack.com/services/XXXXX/XXXXX/XXXXX',
    method = 'POST',
    headers = 'Content-Type:application/json',
    format = 'json',
    'json.unstructured' = 'true'
);

INSERT INTO slack
SELECT concat('A new PR was created ', extract_json_string(value, '$[0].url')) AS text
FROM prs;

REST API

We've finished our migration from gRPC to REST for the public-facing Arroyo API. As of this release, all endpoints are available via REST, and we've removed the gRPC endpoint from the API service. New endpoints in this release allow you to create connections and fetch full job details, including checkpoint information.

  • Switch all connection endpoints to REST API by @jbeisen in #242
  • Add pagination to pipelines and connection tables by @jbeisen in #255
  • Switch checkpoint details to REST API by @jbeisen in #265
  • Delete the GRPC API from the API service by @jbeisen in #266

Raw string serialization

Arroyo supports a number of serialization formats for handling data coming into and out of pipelines, including JSON, Parquet, and raw strings. In this release we've added support for serializing arbitrary string data. This is particularly powerful when combined with UDFs, which make it easy to construct complex string-based formats that Arroyo doesn't natively support.

For example, it's now possible to write a UDF that produces complex, nested JSON that Arroyo's table schemas can't represent:

fn my_to_json(f: f64) -> String {
    let v = serde_json::json!({
        "my_complex": {
            "nested_format": f
        }
    });

    serde_json::to_string(&v).unwrap()
}
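
Paired with the new raw_string format, the output of that UDF can be sent to a sink verbatim. Here's a minimal sketch assuming a hypothetical Kafka sink; the broker address, topic, source table, and column name are illustrative, not from the release:

CREATE TABLE nested_sink (
    value TEXT
) WITH (
    connector = 'kafka',                  -- hypothetical sink; any sink supporting raw_string works
    bootstrap_servers = 'localhost:9092', -- illustrative broker address
    topic = 'nested-json',                -- illustrative topic
    type = 'sink',
    format = 'raw_string'
);

INSERT INTO nested_sink
SELECT my_to_json(my_value) AS value      -- my_value is an assumed column on the source
FROM source_table;
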
  • Add support for serializing raw_string formats by @mwylde in #302

Improvements & Fixes

SQL

As I mentioned in the introduction, this is really the meat of this release. We've performed a major refactor to our SQL compiler that has fixed a number of bugs and inconsistencies. And we've added a new set of SQL correctness tests that perform end-to-end validation of our SQL implementation.

This work has uncovered a number of bugs and edge cases (fixed in this release) and gives us much more confidence in the quality of our SQL support going forward.

Console & API

Beyond SQL, we've also made a number of quality-of-life improvements to the console, including an improved preview experience, better error handling, and pagination.

Connectors & Schemas

  • Don't generate schemas with unsigned types for Kafka connect by @mwylde in #294
  • Fix compilation for nested json schema fields by @mwylde in #278
  • Fix connection tables with connection profiles by @mwylde in #303
  • Enable configuration of aws_region for Kinesis sources and sinks by @jacksonrnewhouse in #313
  • Fix fluvio table by @mwylde in #316
  • Skip lines from file in single_file source by @rcjmurillo in #321

Engine

Deployment

  • Revamping the helm Chart and implementing CI/CD by @edmondop in #271
  • Better support for service accounts in helm and other helm improvements by @mwylde in #314

Full Changelog: v0.5.1...v0.6.0