This repository implements connectors to read and write Pravega Streams with Apache Spark, a high-performance analytics engine for batch and streaming data.
Build end-to-end stream processing and batch pipelines that use Pravega as the stream storage and message bus, and Apache Spark for computation over the streams.
Pravega is an open source distributed storage service implementing Streams. Its main primitive, the Stream, serves as a foundation for reliable storage systems: a high-performance, durable, elastic, and unlimited append-only byte stream with strict ordering and consistency.
To learn more about Pravega, visit https://pravega.io
- Exactly-once processing guarantees for both Reader and Writer, supporting end-to-end exactly-once processing pipelines
- A Spark micro-batch reader connector allows Spark streaming applications to read Pravega Streams. Pravega stream cuts (i.e. offsets) are used to reliably recover from failures and provide exactly-once semantics.
- A Spark batch reader connector allows Spark batch applications to read Pravega Streams.
- A Spark writer allows Spark batch and streaming applications to write to Pravega Streams. Writes are optionally contained within Pravega transactions, providing exactly-once semantics.
- Seamless integration with Spark's checkpoints.
- Parallel Readers and Writers supporting high throughput and low latency processing.
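
The reader and writer connectors listed above could be wired into a PySpark job roughly as follows. This is a minimal sketch, not the project's documented API: it assumes the connector registers the data source name `pravega` and accepts options named `controller`, `scope`, and `stream`, and the helper functions are hypothetical names introduced here for illustration. Only `format`, `options`, `option`, `load`, `start`, and `checkpointLocation` are standard Spark calls.

```python
def pravega_options(controller, scope, stream, **extra):
    """Collect connector options in a plain dict.

    The option names "controller", "scope", and "stream" are assumptions
    about the connector; check its documentation before relying on them.
    """
    options = {"controller": controller, "scope": scope, "stream": stream}
    options.update(extra)
    return options


def read_micro_batches(spark, options):
    """Micro-batch (streaming) read; `spark` is an existing SparkSession."""
    return spark.readStream.format("pravega").options(**options).load()


def read_batch(spark, options):
    """Batch read of the same stream."""
    return spark.read.format("pravega").options(**options).load()


def write_stream(df, options, checkpoint_dir):
    """Streaming write; the Spark checkpoint directory is what lets the
    query resume from its last committed position after a failure."""
    return (
        df.writeStream.format("pravega")
        .options(**options)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )


# Example values only: the controller URI, scope, and stream are placeholders.
example = pravega_options("tcp://localhost:9090", "examples", "sensors")
```

The option dictionary is built separately from the Spark calls so that the same settings can be reused for the micro-batch reader, the batch reader, and the writer. Running the read and write helpers requires a Spark session with the connector on its classpath and a reachable Pravega cluster.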
The master branch will always have the most recent supported versions of Spark and Pravega.
| Spark Version | Pravega Version | Java Version To Build Connector | Java Version To Run Connector | Git Branch |
|---|---|---|---|---|
| 3.4 | 0.14 | Java 11 | Java 8 or 11 | master |
| 3.4 | 0.13 | Java 11 | Java 8 or 11 | r0.13 |
Don’t hesitate to ask! Contact the developers and community on Slack if you need any help. If you find a bug, open an issue on GitHub Issues.
Spark Connectors for Pravega is 100% open source and community-driven. All components are available under the Apache License 2.0 on GitHub.