Debussy is an opinionated Data Architecture and Engineering framework, enabling data analysts and engineers to build better platforms and pipelines.

debussy-labs/debussy_concert

Debussy Concert

Debussy is a free, open-source, opinionated Data Architecture and Engineering framework. It enables data analysts and engineers to build better data platforms through first-class data pipelines, following a low-code and self-service approach.

Description · Key Features · Key Benefits · Quick Start · Integrations
Full Documentation · Communication · Contributions · License


Description

In the data engineering field, everyone is reinventing the wheel all the time: it's still rare to see the adoption of software engineering best practices such as DRY, KISS or YAGNI. Despite the existence of several tools for data orchestration (e.g. Apache Airflow, Prefect, Dagster) and distributed data processing (e.g. Apache Spark, Apache Beam), each new data pipeline demand usually implies a lengthy development project. Think of developing a web application without the help of a web framework such as Django or Flask!

What's even worse, although sharing key concepts, these data orchestration tools have very distinct syntaxes and features, making migrations a daunting task! Moreover, simply adopting these tools does not guarantee that best practices are being followed, including with regard to data architecture (think of data modeling, data management lifecycle, among others).

While lots of companies have faced these same issues, most of them have decided to develop their own in-house solutions, missing the opportunity for collaboration and wider adoption of data architecture and software engineering best practices.

With that in mind, we created Debussy! Debussy Concert is the core component of Debussy. It's a code generation engine for orchestration tools, currently supporting only Airflow, but with others on the Roadmap. It provides abstraction layers in the form of a musically themed semantic model, decoupling the pipeline logic from the underlying orchestration tool and enabling a low-code approach to data engineering. We also provide pipeline templates (e.g. data ingestion, data transformation and reverse ETL) built with our engine, while always striving to offer the aforementioned best practices.
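To make the idea of a code generation engine concrete, here is a minimal sketch of describing a pipeline in an orchestrator-agnostic model and then rendering it to Airflow 2 DAG source code. It is not Debussy's actual API: `Step`, `Pipeline` and `render_airflow_dag` are hypothetical names invented for this illustration, and the sketch only handles a linear chain of shell commands.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One unit of work in the orchestrator-agnostic model (hypothetical)."""
    name: str      # must be a valid Python identifier, it becomes a variable name
    command: str   # shell command the step runs


@dataclass
class Pipeline:
    """An ordered chain of steps, described without importing Airflow."""
    name: str
    steps: List[Step] = field(default_factory=list)


def render_airflow_dag(pipeline: Pipeline) -> str:
    """Generate Airflow 2 DAG source code for a linear pipeline."""
    for step in pipeline.steps:
        if not step.name.isidentifier():
            raise ValueError(f"step name is not a valid identifier: {step.name!r}")
    lines = [
        "from datetime import datetime",
        "from airflow import DAG",
        "from airflow.operators.bash import BashOperator",
        "",
        f"with DAG(dag_id={pipeline.name!r}, start_date=datetime(2022, 1, 1), "
        "schedule_interval=None) as dag:",
    ]
    if not pipeline.steps:
        lines.append("    pass")  # keep the generated module syntactically valid
    for step in pipeline.steps:
        lines.append(
            f"    {step.name} = BashOperator("
            f"task_id={step.name!r}, bash_command={step.command!r})"
        )
    if len(pipeline.steps) > 1:
        # Chain the steps in declaration order: first >> second >> ...
        lines.append("    " + " >> ".join(s.name for s in pipeline.steps))
    return "\n".join(lines) + "\n"


example = Pipeline(
    name="ingest_orders",
    steps=[Step("extract", "echo extract"), Step("load", "echo load")],
)
print(render_airflow_dag(example))
```

Because the model itself never imports Airflow, supporting another orchestrator in a design like this would mean adding another render function rather than rewriting the pipeline definitions.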

Key Features

  • Dynamic data pipeline generation from YAML configuration files or directly through Python
  • Provides a semantic model for data pipeline development, abstracting the inner orchestration engine
  • Enables seamless integration of first-class data projects, such as Airflow, Spark, and dbt
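As a rough illustration of configuration-driven pipeline generation, the sketch below expands one declarative config into per-table tasks. The config keys (`name`, `source`, `tables`) and the function `expand_ingestion_config` are assumptions made up for this example and do not reflect Debussy's real configuration schema; the dict stands in for what a YAML parser would return from a config file.

```python
from typing import List, Optional, Tuple

# Hypothetical config, equivalent to this YAML:
#   name: crm_ingestion
#   source: mysql
#   tables: [customers, orders]
config = {
    "name": "crm_ingestion",
    "source": "mysql",
    "tables": ["customers", "orders"],
}


def expand_ingestion_config(cfg: dict) -> List[Tuple[str, Optional[str]]]:
    """Expand a config into (task_id, upstream_task_id) pairs, two per table."""
    missing = {"name", "source", "tables"} - cfg.keys()
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    tasks: List[Tuple[str, Optional[str]]] = []
    for table in cfg["tables"]:
        extract = f"extract_{table}"
        load = f"load_{table}"
        tasks.append((extract, None))   # extraction has no upstream task
        tasks.append((load, extract))   # loading waits for its extraction
    return tasks


for task_id, upstream in expand_ingestion_config(config):
    print(task_id, "<-", upstream)
```

Adding a table to the list yields two more tasks without any new pipeline code, which is the kind of low-code workflow the features above describe.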

Key Benefits

✔ Reduce the time to delivery and cost of data pipeline development, enabling higher ROI
✔ Avoid pipeline debt by following sound software engineering design principles
✔ Ensure your platform is following data architecture best practices

Quick Start

Debussy works on any installation of Apache Airflow 2.0, but since we currently support only GCP-based data platforms as the target Data Lakehouse, we recommend deploying to Cloud Composer.

In order to use Debussy, you first need to go through the following steps:

  1. Select or create a Google Cloud Platform project.
  2. Enable billing for your project.
  3. Create a Cloud Composer 2 environment.
  4. Install Debussy on your Cloud Composer instance: just upload the project to your plugins/ folder.
  5. Check our User's Guide and examples to learn how to use it!
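For step 4, one possible way to upload the project to the environment's plugins/ folder is the `gcloud composer environments storage plugins import` command. The sketch below only builds that command from Python; the environment name, location and local path are placeholders, and it assumes the `gcloud` CLI is installed and authenticated if you choose to run the result.

```python
import shlex
import subprocess
from typing import List


def build_plugin_upload_command(environment: str, location: str, source_dir: str) -> List[str]:
    """Build the gcloud command that copies a local directory into plugins/."""
    return [
        "gcloud", "composer", "environments", "storage", "plugins", "import",
        "--environment", environment,
        "--location", location,
        "--source", source_dir,
    ]


def upload_plugins(environment: str, location: str, source_dir: str, dry_run: bool = True) -> List[str]:
    """Print the command and, unless dry_run is set, execute it with gcloud."""
    command = build_plugin_upload_command(environment, location, source_dir)
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return command


# Placeholder values: replace with your own environment, region and checkout path.
upload_plugins("my-composer-env", "us-central1", "./debussy_concert")
```

The default `dry_run=True` keeps the example safe to run anywhere; pass `dry_run=False` once the printed command looks right for your project.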

Integrations

Debussy works with the tools and systems that you're already using with your data, including:

| Integration | Notes |
| --- | --- |
| Apache Airflow | Open source orchestration engine |
| Spark | Open source distributed processing engine, used for the data ingestion pipelines |
| dbt | Open source data transformation tool, used for the data transformation pipelines |
| Google Cloud Storage | Cloud-based blob storage, supported as a data source or destination |
| BigQuery | Google's serverless, massive-scale SQL analytics platform, supported as the analytical environment (a.k.a. Data Lakehouse) |
| MySQL | Leading open source database, supported as a data source or destination |
| PostgreSQL | Leading open source database, supported as a data source or destination |
| Other SQL Relational DBs | Most RDBMS are supported as data sources via JDBC drivers through Spark |
| AWS S3 | Cloud-based blob storage, supported as a data source or destination |

Full Documentation

See the Wiki for full documentation, examples, operational details and other information.

Communication

GitHub Issues

Discord Server

Contributions

We welcome all community contributions!

In order to have a more open and welcoming community, Debussy adheres to a code of conduct adapted from Contributor Covenant.

Please read through our contributing guidelines. Included are directions for opening issues, coding standards, and notes on development.

License

Copyright 2022 Dotz, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.