From 937f7908277e3dd63f812c992e73b05c1344114e Mon Sep 17 00:00:00 2001
From: Will Jones
Date: Wed, 24 Jan 2024 15:27:22 -0800
Subject: [PATCH] docs: document basics of configuring object storage (#832)

Created based on upstream PR https://github.com/lancedb/lance/pull/1849

Closes #681

---------

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
---
 docs/mkdocs.yml            |  2 +
 docs/src/guides/storage.md | 91 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)
 create mode 100644 docs/src/guides/storage.md

diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index fd8dc17b93..2ad10263bd 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -90,6 +90,7 @@ nav:
     - Full-text search: fts.md
     - Filtering: sql.md
     - Versioning & Reproducibility: notebooks/reproducibility.ipynb
+    - Configuring Storage: guides/storage.md
   - 🧬 Managing embeddings:
     - Overview: embeddings/index.md
     - Explicit management: embeddings/embedding_explicit.md
@@ -149,6 +150,7 @@ nav:
     - Full-text search: fts.md
     - Filtering: sql.md
     - Versioning & Reproducibility: notebooks/reproducibility.ipynb
+    - Configuring Storage: guides/storage.md
   - Managing Embeddings:
     - Overview: embeddings/index.md
     - Explicit management: embeddings/embedding_explicit.md
diff --git a/docs/src/guides/storage.md b/docs/src/guides/storage.md
new file mode 100644
index 0000000000..c08f3956b0
--- /dev/null
+++ b/docs/src/guides/storage.md
@@ -0,0 +1,91 @@
+# Configuring cloud storage
+
+
+
+When using LanceDB OSS, you can choose where to store your data. The tradeoffs between different storage options are discussed in the [storage concepts guide](../concepts/storage.md). This guide shows how to configure LanceDB to use different storage options.
+
+## Object Stores
+
+LanceDB OSS supports object stores such as AWS S3 (and compatible stores), Azure Blob Storage, and Google Cloud Storage. The object store is selected by the URI scheme of the dataset path: `s3://` is used for AWS S3, `az://` is used for Azure Blob Storage, and `gs://` is used for Google Cloud Storage. These URIs are passed to the `connect` function:
+
+=== "Python"
+
+    AWS S3:
+
+    ```python
+    import lancedb
+    db = lancedb.connect("s3://bucket/path")
+    ```
+
+    Google Cloud Storage:
+
+    ```python
+    import lancedb
+    db = lancedb.connect("gs://bucket/path")
+    ```
+
+    Azure Blob Storage:
+
+    ```python
+    import lancedb
+    db = lancedb.connect("az://bucket/path")
+    ```
+
+=== "JavaScript"
+
+    AWS S3:
+
+    ```javascript
+    const lancedb = require("lancedb");
+    const db = await lancedb.connect("s3://bucket/path");
+    ```
+
+    Google Cloud Storage:
+
+    ```javascript
+    const lancedb = require("lancedb");
+    const db = await lancedb.connect("gs://bucket/path");
+    ```
+
+    Azure Blob Storage:
+
+    ```javascript
+    const lancedb = require("lancedb");
+    const db = await lancedb.connect("az://bucket/path");
+    ```
+
+In most cases, when running in the respective cloud with permissions set up correctly, no additional configuration is required. When running outside of the respective cloud, authentication credentials must be provided using environment variables. In general, these environment variables are the same as those used by the respective cloud SDKs. The sections below describe the environment variables that can be used to configure each object store.
+
+LanceDB OSS uses the [object_store](https://docs.rs/object_store/latest/object_store/) Rust crate for object store access. Some general settings, such as the request timeout and proxy configuration, apply to every object store. See the [object_store ClientConfigKey](https://docs.rs/object_store/latest/object_store/enum.ClientConfigKey.html) doc for available configuration options. The environment variables that can be set are the upper snake-case versions of these key names. For example, to set `ProxyUrl` use the environment variable `PROXY_URL`. (Don't let the Rust docs intimidate you! We link to them so you can see an up-to-date list of the available options.)
+
+
+### AWS S3
+
+To configure credentials for AWS S3, you can use the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` environment variables.
+
+Alternatively, if you are using AWS SSO, you can use the `AWS_PROFILE` and `AWS_DEFAULT_REGION` environment variables.
+
+You can see a full list of environment variables [here](https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.from_env).
+
+#### S3-compatible stores
+
+LanceDB can also connect to S3-compatible stores, such as MinIO. To do so, you must specify two environment variables: `AWS_ENDPOINT` and `AWS_DEFAULT_REGION`. `AWS_ENDPOINT` should be the URL of the S3-compatible store, and `AWS_DEFAULT_REGION` should be the region to use.
+
+
+
+### Google Cloud Storage
+
+GCS credentials are configured by setting the `GOOGLE_SERVICE_ACCOUNT` environment variable to the path of a JSON file containing the service account credentials. There are several aliases for this environment variable, documented [here](https://docs.rs/object_store/latest/object_store/gcp/struct.GoogleCloudStorageBuilder.html#method.from_env).
+
+
+!!! info "HTTP/2 support"
+
+    By default, connections to GCS use HTTP/1 rather than HTTP/2, which improves maximum throughput significantly. If you need HTTP/2 for some reason, set the environment variable `HTTP1_ONLY` to `false`.
+
+### Azure Blob Storage
+
+Azure Blob Storage credentials can be configured by setting the `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY` environment variables. The full list of environment variables that can be set is documented [here](https://docs.rs/object_store/latest/object_store/azure/struct.MicrosoftAzureBuilder.html#method.from_env).
+
+
+
\ No newline at end of file
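
The guide added by this patch states two rules that are easy to get wrong by hand: the `ClientConfigKey` names map to upper snake-case environment variables, and S3-compatible stores need both `AWS_ENDPOINT` and `AWS_DEFAULT_REGION`. The Python sketch below illustrates both using only the standard library. The helper names `client_config_env_name` and `s3_compatible_env`, the endpoint `http://localhost:9000`, and the region `us-east-1` are assumptions made for illustration; they are not part of LanceDB or of this patch.

```python
import os
import re


def client_config_env_name(key: str) -> str:
    """Map an object_store ClientConfigKey name to an environment variable name.

    Hypothetical helper: it applies the rule stated in the guide
    (CamelCase key -> upper snake case), e.g. ProxyUrl -> PROXY_URL.
    """
    # Insert "_" wherever a lowercase letter or digit is followed by an
    # uppercase letter, then upper-case the whole string.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", key).upper()


def s3_compatible_env(endpoint: str, region: str) -> dict:
    """Build the two variables the guide requires for S3-compatible stores."""
    return {"AWS_ENDPOINT": endpoint, "AWS_DEFAULT_REGION": region}


# Placeholder values for a local S3-compatible deployment (assumed).
os.environ.update(s3_compatible_env("http://localhost:9000", "us-east-1"))

# With the variables exported, the connection call is the one from the guide:
#   import lancedb
#   db = lancedb.connect("s3://bucket/path")
```

Setting the variables from Python before calling `connect` has the same effect as exporting them in the shell, provided it happens before the connection is opened. Credentials such as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` would be supplied the same way if the store requires them.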