Commit in a fork of apache/datafusion
Extract catalog API to separate crate
This moves `CatalogProvider`, `TableProvider`, and `SchemaProvider` to a new `datafusion-catalog` crate. The circular dependency between core `SessionState` and the provider implementations is broken by introducing a `CatalogSession` dyn trait. Implementations of `TableProvider` that reside under core currently gain access to `CatalogSession` by downcasting. This is intended as an intermediate step.
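A minimal, self-contained sketch of the downcasting pattern described above, assuming the `CatalogSession` trait exposes an `as_any` accessor (the trait surface and the `SessionState` fields here are illustrative, not the actual DataFusion definitions):

use std::any::Any;

// Stand-in for the dyn trait that the catalog crate can depend on without
// seeing core's concrete session type.
pub trait CatalogSession: Send + Sync {
    // Return `self` as `Any` so core code can recover the concrete type.
    fn as_any(&self) -> &dyn Any;
}

// Stand-in for core's `SessionState` (fields are hypothetical).
pub struct SessionState {
    pub session_id: String,
}

impl CatalogSession for SessionState {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// A `TableProvider` implementation living in core would receive
// `&dyn CatalogSession` and downcast when it needs the concrete state.
fn core_only_details(session: &dyn CatalogSession) -> Option<&SessionState> {
    session.as_any().downcast_ref::<SessionState>()
}

fn main() {
    let state = SessionState { session_id: "demo".into() };
    assert!(core_only_details(&state).is_some());
}

This keeps the new crate free of a dependency on core while core-resident providers retain full access, at the cost of a runtime downcast.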
Showing 42 changed files with 726 additions and 506 deletions.
New file: `Cargo.toml` for the new `datafusion-catalog` crate
@@ -0,0 +1,21 @@
[package]
name = "datafusion-catalog"
authors.workspace = true
edition.workspace = true
homepage.workspace = true
license.workspace = true
readme.workspace = true
repository.workspace = true
rust-version.workspace = true
version.workspace = true

[dependencies]
arrow-schema = { workspace = true }
async-trait = "0.1.41"
datafusion-expr = { workspace = true }
datafusion-common = { workspace = true }
datafusion-execution = { workspace = true }
datafusion-physical-plan = { workspace = true }

[lints]
workspace = true
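A rough sketch of why each dependency above is needed: the catalog traits hand out Arrow schemas (arrow-schema), accept pushed-down expressions (datafusion-expr), return DataFusion errors (datafusion-common), and produce physical plans (datafusion-physical-plan), with async-trait making async trait methods object-safe; datafusion-execution (not used below) presumably supplies session and runtime types such as `TaskContext`. The trait here is a hypothetical stand-in, not the real `TableProvider`:

use std::any::Any;
use std::sync::Arc;

use arrow_schema::SchemaRef;                 // arrow-schema
use async_trait::async_trait;                // async-trait
use datafusion_common::Result;               // datafusion-common
use datafusion_expr::Expr;                   // datafusion-expr
use datafusion_physical_plan::ExecutionPlan; // datafusion-physical-plan

// Hypothetical stand-in for the table abstraction the catalog crate exposes.
#[async_trait]
pub trait ToyTableProvider: Send + Sync {
    fn as_any(&self) -> &dyn Any;

    // The table's schema comes from arrow-schema.
    fn schema(&self) -> SchemaRef;

    // Planning may consult pushed-down filters (datafusion-expr) and yields
    // a physical plan (datafusion-physical-plan).
    async fn scan(&self, filters: &[Expr]) -> Result<Arc<dyn ExecutionPlan>>;
}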
New file: `catalog.rs` in the `datafusion-catalog` crate (the `CatalogProvider` trait)
@@ -0,0 +1,136 @@
use std::any::Any;
use std::sync::Arc;

pub use crate::schema::SchemaProvider;
use datafusion_common::not_impl_err;
use datafusion_common::Result;

/// Represents a catalog, comprising a number of named schemas.
///
/// # Catalog Overview
///
/// To plan and execute queries, DataFusion needs a "Catalog" that provides
/// metadata such as which schemas and tables exist, their columns and data
/// types, and how to access the data.
///
/// The Catalog API consists of:
/// * [`CatalogProviderList`]: a collection of `CatalogProvider`s
/// * [`CatalogProvider`]: a collection of `SchemaProvider`s (sometimes called a "database" in other systems)
/// * [`SchemaProvider`]: a collection of `TableProvider`s (often called a "schema" in other systems)
/// * [`TableProvider`]: individual tables
///
/// # Implementing Catalogs
///
/// To implement a catalog, you implement at least one of the
/// [`CatalogProviderList`], [`CatalogProvider`] and [`SchemaProvider`] traits
/// and register them appropriately with the [`SessionContext`].
///
/// [`SessionContext`]: crate::execution::context::SessionContext
///
/// DataFusion comes with a simple in-memory catalog implementation,
/// [`MemoryCatalogProvider`], that is used by default and has no persistence.
/// DataFusion does not include more complex Catalog implementations because
/// catalog management is a key design choice for most data systems, and thus
/// it is unlikely that any general-purpose catalog implementation will work
/// well across many use cases.
///
/// # Implementing "Remote" catalogs
///
/// Sometimes catalog information is stored remotely and requires a network call
/// to retrieve. For example, the [Delta Lake] table format stores table
/// metadata in files on S3 that must first be downloaded to discover what
/// schemas and tables exist.
///
/// [Delta Lake]: https://delta.io/
///
/// The [`CatalogProvider`] can support this use case, but it takes some care.
/// The planning APIs in DataFusion are not `async`, and thus network IO cannot
/// be performed "lazily" / "on demand" during query planning. The rationale for
/// this design is that using remote procedure calls for all catalog accesses
/// required for query planning would likely result in multiple network calls
/// per plan, resulting in very poor planning performance.
///
/// To implement [`CatalogProvider`] and [`SchemaProvider`] for remote catalogs,
/// you need to provide an in-memory snapshot of the required metadata. Most
/// systems typically either already have this information cached locally or can
/// batch access to the remote catalog to retrieve multiple schemas and tables
/// in a single network call.
///
/// Note that [`SchemaProvider::table`] is an `async` function in order to
/// simplify implementing simple [`SchemaProvider`]s. For many table formats it
/// is easy to list all available tables, but reading table details (e.g.
/// statistics) requires additional, non-trivial access.
///
/// The pattern that DataFusion itself uses to plan SQL queries is to walk over
/// the query to [find all table references], perform the required remote
/// catalog lookups in parallel, and then plan the query using that snapshot.
///
/// [find all table references]: resolve_table_references
///
/// # Example Catalog Implementations
///
/// Here are some examples of how to implement custom catalogs:
///
/// * [`datafusion-cli`]: [`DynamicFileCatalogProvider`] catalog provider
///   that treats files and directories on a filesystem as tables.
///
/// * [`catalog.rs`]: a simple directory-based catalog.
///
/// * [delta-rs]: [`UnityCatalogProvider`] implementation that can
///   read from Delta Lake tables.
///
/// [`datafusion-cli`]: https://datafusion.apache.org/user-guide/cli/index.html
/// [`DynamicFileCatalogProvider`]: https://github.com/apache/datafusion/blob/31b9b48b08592b7d293f46e75707aad7dadd7cbc/datafusion-cli/src/catalog.rs#L75
/// [`catalog.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/catalog.rs
/// [delta-rs]: https://github.com/delta-io/delta-rs
/// [`UnityCatalogProvider`]: https://github.com/delta-io/delta-rs/blob/951436ecec476ce65b5ed3b58b50fb0846ca7b91/crates/deltalake-core/src/data_catalog/unity/datafusion.rs#L111-L123
///
/// [`TableProvider`]: crate::datasource::TableProvider

pub trait CatalogProvider: Sync + Send {
    /// Returns the catalog provider as [`Any`]
    /// so that it can be downcast to a specific implementation.
    fn as_any(&self) -> &dyn Any;

    /// Retrieves the list of available schema names in this catalog.
    fn schema_names(&self) -> Vec<String>;

    /// Retrieves a specific schema from the catalog by name, provided it exists.
    fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>>;

    /// Adds a new schema to this catalog.
    ///
    /// If a schema of the same name existed before, it is replaced in
    /// the catalog and returned.
    ///
    /// By default returns a "Not Implemented" error
    fn register_schema(
        &self,
        name: &str,
        schema: Arc<dyn SchemaProvider>,
    ) -> Result<Option<Arc<dyn SchemaProvider>>> {
        // use variables to avoid unused variable warnings
        let _ = name;
        let _ = schema;
        not_impl_err!("Registering new schemas is not supported")
    }

    /// Removes a schema from this catalog. Implementations of this method should return
    /// errors if the schema exists but cannot be dropped. For example, in DataFusion's
    /// default in-memory catalog, [`MemoryCatalogProvider`], a non-empty schema
    /// will only be successfully dropped when `cascade` is true.
    /// This is equivalent to how DROP SCHEMA works in PostgreSQL.
    ///
    /// Implementations of this method should return None if schema with `name`
    /// does not exist.
    ///
    /// By default returns a "Not Implemented" error
    fn deregister_schema(
        &self,
        _name: &str,
        _cascade: bool,
    ) -> Result<Option<Arc<dyn SchemaProvider>>> {
        not_impl_err!("Deregistering schemas is not supported")
    }
}
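A minimal sketch of implementing the trait above, assuming the flattened `datafusion_catalog::` re-exports shown in the `lib.rs` that follows; the `MapCatalog` type and its in-memory map are illustrative, not part of this commit:

use std::any::Any;
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

use datafusion_catalog::{CatalogProvider, SchemaProvider};

// Hypothetical catalog backed by an in-memory map of schemas.
#[derive(Default)]
pub struct MapCatalog {
    schemas: RwLock<HashMap<String, Arc<dyn SchemaProvider>>>,
}

impl CatalogProvider for MapCatalog {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema_names(&self) -> Vec<String> {
        self.schemas.read().unwrap().keys().cloned().collect()
    }

    fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>> {
        self.schemas.read().unwrap().get(name).cloned()
    }

    // register_schema and deregister_schema keep their default
    // "not implemented" behavior.
}

Only the three accessor methods are required; the mutating methods fall back to the trait's defaults, which return a "Not Implemented" error.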
New file: `lib.rs` of the `datafusion-catalog` crate
@@ -0,0 +1,12 @@
// // re-export dependencies from arrow-rs to minimize version maintenance for crate users
// pub use arrow;

mod catalog;
mod schema;
mod session;
mod table;

pub use catalog::*;
pub use schema::*;
pub use session::*;
pub use table::*;
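To close the loop, a sketch of registering a catalog with a `SessionContext` using the in-memory `MemoryCatalogProvider` mentioned in the docs above. Import paths are an assumption and have shifted across DataFusion versions; the catalog and schema names are arbitrary:

use std::sync::Arc;

use datafusion::catalog::{CatalogProvider, MemoryCatalogProvider, MemorySchemaProvider};
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Build an in-memory catalog containing one (empty) schema.
    let catalog = MemoryCatalogProvider::new();
    let _ = catalog.register_schema("my_schema", Arc::new(MemorySchemaProvider::new()))?;

    // Register the catalog under a name; tables then resolve as
    // my_catalog.my_schema.<table> in SQL.
    let _ = ctx.register_catalog("my_catalog", Arc::new(catalog));
    Ok(())
}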