Add documentation for standalone (sparkless) GC #8307
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
<div class="toc-block"> | ||
## Table of contents | ||
{: .no_toc .text-delta } | ||
|
||
1. TOC | ||
{:toc} | ||
{::options toc_levels="2..4" /} | ||
</div> |
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,244 @@ | ||||||||||
--- | ||||||||||
title: Standalone Garbage Collection | ||||||||||
description: Run a limited version of garbage collection without any external dependencies | ||||||||||
parent: Garbage Collection | ||||||||||
nav_order: 5 | ||||||||||
grand_parent: How-To | ||||||||||
redirect_from: | ||||||||||
- /cloud/standalone-gc.html | ||||||||||
--- | ||||||||||
|
||||||||||
# Standalone Garbage Collection | ||||||||||
{: .d-inline-block } | ||||||||||
lakeFS Enterprise | ||||||||||
{: .label .label-green } | ||||||||||
|
||||||||||
{: .d-inline-block } | ||||||||||
experimental | ||||||||||
{: .label .label-red } | ||||||||||
|
||||||||||
{: .note } | ||||||||||
> Standalone GC is only available for [lakeFS Enterprise]({% link enterprise/index.md %}). | ||||||||||
|
||||||||||
{: .note .warning } | ||||||||||
> Standalone GC is experimental and offers limited capabilities compared to the [Spark-backed GC]({% link howto/garbage-collection/gc.md %}). Read through the [limitations](./standalone-gc.md#limitations) carefully before using it. | ||||||||||
|
||||||||||
{% include toc_2-4.html %} | ||||||||||
|
||||||||||
## About | ||||||||||
|
||||||||||
Standalone GC is a limited version of the Spark-backed GC that runs without any external dependencies, as a standalone docker image. | ||||||||||
|
||||||||||
## Limitations | ||||||||||
|
||||||||||
1. Except for the [lab tests](./standalone-gc.md#lab-tests) performed, there are no further guarantees about the performance profile of Standalone GC.
2. Horizontal scaling is not supported: only a single instance of `lakefs-sgc` can operate on a given repository at a time.
3. It only marks objects and does not delete them, which is equivalent to the GC's [mark only mode]({% link howto/garbage-collection/gc.md %}#mark-only-mode). \
   More about that in the [Output](./standalone-gc.md#output) section.
|
||||||||||
### Lab tests | ||||||||||
|
||||||||||
Repository spec: | ||||||||||
|
||||||||||
- 100k objects | ||||||||||
- < 200 commits | ||||||||||
- 1 branch | ||||||||||
|
||||||||||
Machine spec: | ||||||||||
- 4GiB RAM | ||||||||||
- 8 CPUs | ||||||||||
|
||||||||||
In this setup, we measured: | ||||||||||
|
||||||||||
- Time: < 5m | ||||||||||
- Disk space: 120MiB | ||||||||||
|
||||||||||
Review comment: We should add a limitation that says that sgc only implements the mark stage without sweeping, and sweep requires user action

Reply: Done - added a bullet to "Limitations", and a new "Output" section describing this.
||||||||||
## Installation | ||||||||||
|
||||||||||
### Step 1: Obtain Dockerhub token | ||||||||||
As an enterprise customer, you should already have a dockerhub token for the `externallakefs` user. | ||||||||||
If not, contact us at [[email protected]](mailto:[email protected]). | ||||||||||
|
||||||||||
### Step 2: Login to Dockerhub with this token | ||||||||||
```bash | ||||||||||
docker login -u <your token> | ||||||||||
``` | ||||||||||
|
||||||||||
### Step 3: Download the docker image | ||||||||||
Download the image from the [lakefs-sgc](https://hub.docker.com/repository/docker/treeverse/lakefs-sgc/general) repository: | ||||||||||
```bash | ||||||||||
docker pull treeverse/lakefs-sgc:<tag> | ||||||||||
``` | ||||||||||
|
||||||||||
## Usage | ||||||||||
Review comment: Can you please add these two steps here:

Reply: I already added an example - take a look at "Example - docker run command". Done - in the new "Output" section. Not sure WDYM by a "CTA", I just added a sentence explaining that the user should read the report and delete manually.

Reply: Update: I added a dedicated section for "Deleting marked objects" with the same sentence ^
||||||||||
|
||||||||||
### AWS Credentials | ||||||||||
Currently, `lakefs-sgc` does not provide an option to set AWS credentials explicitly. It relies on the host machine being set up correctly and reads the AWS credentials from it.

This means you should set up your machine the way AWS expects, for example by following the guide on [configuring the AWS CLI](https://docs.aws.amazon.com/cli/v1/userguide/cli-chap-configure.html).

Review comment: How configurations work for on-prem users who use Minio?

Reply: Done - added "S3-compatible clients" section and example (cc @itaiad200)
|
||||||||||
#### S3-compatible clients | ||||||||||
Naturally, this method of configuration allows for `lakefs-sgc` to work with any S3-compatible client (such as [MinIO](https://min.io/)). \ | ||||||||||
An example setup for working with MinIO: | ||||||||||
1. Add a profile to your `~/.aws/config` file:
   ```
   [profile minio]
   region = us-east-1
   endpoint_url = <your MinIO URL>
   s3 =
       signature_version = s3v4
   ```

2. Add your access key and secret key to your `~/.aws/credentials` file:
   ```
   [minio]
   aws_access_key_id = <your MinIO access key>
   aws_secret_access_key = <your MinIO secret key>
   ```
3. Run the `lakefs-sgc` docker image and pass it the `minio` profile - see [example](./standalone-gc.md#mounting-the-aws-directory) below.
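Before starting the container, it can help to confirm that the profile name appears in both files. The following is a minimal sketch and not part of `lakefs-sgc`: `has_aws_profile` is a hypothetical helper, and the demo writes sample files to a scratch directory instead of touching `~/.aws`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of lakefs-sgc): succeeds only if the named
# profile has a section in both the AWS config and credentials files.
has_aws_profile() {
  local name="$1" config_file="$2" credentials_file="$3"
  # The config file uses "[profile NAME]"; the credentials file uses "[NAME]".
  grep -qxF "[profile ${name}]" "$config_file" &&
    grep -qxF "[${name}]" "$credentials_file"
}

# Demo against sample files in a scratch directory.
scratch_dir="$(mktemp -d)"
printf '%s\n' '[profile minio]' 'region = us-east-1' > "$scratch_dir/config"
printf '%s\n' '[minio]' 'aws_access_key_id = example' > "$scratch_dir/credentials"

if has_aws_profile minio "$scratch_dir/config" "$scratch_dir/credentials"; then
  profile_status="found"
else
  profile_status="missing"
fi
echo "profile minio: $profile_status"
```

To check a real setup, the same function could be pointed at `~/.aws/config` and `~/.aws/credentials`.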
|
||||||||||
### Configuration | ||||||||||
The following configuration keys are available: | ||||||||||
|
||||||||||
| Key | Description | Default value | Possible values |
|---|---|---|---|
| `logging.format` | Logs output format | "text" | "text","json" |
| `logging.level` | Logs level | "info" | "error","warn","info","debug","trace" |
| `logging.output` | Where to output the logs to | "-" | "-" (stdout), "=" (stderr), or any string for file path |
| `cache_dir` | Directory to use for caching data during run | ~/.lakefs-sgc/data | string |
| `aws.max_page_size` | Max number of items per page when listing objects in AWS | 1000 | number |
| `aws.s3.addressing_path_style` | Whether or not to use [path-style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#path-style-access) when reading objects from AWS | true | boolean |
| `objects_min_age`* | Ignore any object that is last modified within this time frame ("cutoff time") | "6h" | duration |
| `lakefs.endpoint_url` | The URL to the lakeFS installation - should end with `/api/v1` | NOT SET | URL |
| `lakefs.access_key_id` | Access key to the lakeFS installation | NOT SET | string |
| `lakefs.secret_access_key` | Secret access key to the lakeFS installation | NOT SET | string |

Review comment (on `objects_min_age`): Why do we need this if we have retention policy? and if it is risky to change it, why it is configurable?

Reply: Removed. It's on top of the retention policy, this configuration exists in the GC as well.
|
||||||||||
{: .note } | ||||||||||
> **WARNING:** Changing `objects_min_age` is dangerous and can lead to undesired behaviour, such as causing ongoing writes to fail. | ||||||||||
It's recommended to not change this property. | ||||||||||
|
||||||||||
These keys can be provided in the following ways: | ||||||||||
1. Config file: Create a YAML file with the keys, where each `.` is a new nesting level. \
   For example, `logging.level` will be:
   ```yaml
   logging:
     level: <value> # info,debug...
   ```
   Then, pass it to the program using the `--config path/to/config.yaml` argument.

Review comment: There isn't a default location it expects for? that's ok, just to clarify

Reply: There is, it's mentioned in the "Command line reference" section:
|
||||||||||
2. Environment variables: Set `LAKEFS_SGC_<KEY>`, with uppercase letters and `.`s converted to `_`s. \
   For example, `logging.level` will be:
   ```bash
   export LAKEFS_SGC_LOGGING_LEVEL=info
   ```
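The naming rule for environment variables can be expressed as a small shell function. This is an illustrative sketch only; `sgc_env_name` is a hypothetical helper and is not shipped with `lakefs-sgc`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a configuration key to its environment variable
# name by upper-casing it, replacing "." with "_", and adding the prefix.
sgc_env_name() {
  printf 'LAKEFS_SGC_%s\n' "$(printf '%s' "$1" | tr 'a-z.' 'A-Z_')"
}

sgc_env_name logging.level                 # LAKEFS_SGC_LOGGING_LEVEL
sgc_env_name aws.s3.addressing_path_style  # LAKEFS_SGC_AWS_S3_ADDRESSING_PATH_STYLE
```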
|
||||||||||
Example (minimal) config file:
```yaml
logging:
  level: debug
lakefs:
  endpoint_url: https://your.url/api/v1
  access_key_id: <your lakeFS access key>
  secret_access_key: <your lakeFS secret key>
```
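Following the environment variable rule described earlier, the same minimal configuration might be expressed as shown below. This is a sketch: the URL and keys are placeholders to replace with your own values, and the final check is only an illustration of the "should end with `/api/v1`" requirement from the table.

```bash
#!/usr/bin/env bash
# Same settings as the example config file, as environment variables.
# The values below are placeholders, not working credentials.
export LAKEFS_SGC_LOGGING_LEVEL=debug
export LAKEFS_SGC_LAKEFS_ENDPOINT_URL="https://your.url/api/v1"
export LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID="<your lakeFS access key>"
export LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY="<your lakeFS secret key>"

# Sanity check: the endpoint should end with /api/v1.
case "$LAKEFS_SGC_LAKEFS_ENDPOINT_URL" in
  */api/v1) endpoint_ok=yes ;;
  *)        endpoint_ok=no ;;
esac
echo "endpoint ends with /api/v1: $endpoint_ok"
```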
|
||||||||||
### Command line reference | ||||||||||
|
||||||||||
#### Flags: | ||||||||||
- `-c, --config`: config file to use (default is $HOME/.lakefs-sgc.yaml) | ||||||||||
|
||||||||||
#### Commands: | ||||||||||
**run** | ||||||||||
|
||||||||||
Usage: \ | ||||||||||
`lakefs-sgc run <repository>` | ||||||||||
|
||||||||||
Flags: | ||||||||||
- `--cache-dir`: directory to cache read files and metadataDir (default is $HOME/.lakefs-sgc/data/)
- `--parallelism`: number of parallel downloads for metadataDir (default 10)
- `--presign`: use pre-signed URLs when downloading/uploading data (recommended) (default true)

Review comment (on `--cache-dir`): Isn't this a config value? How can it be both?

Reply: It can 🙂
|
||||||||||
### Example run commands | ||||||||||
||||||||||
|
||||||||||
#### Directly passing in credentials parsed from `~/.aws/credentials` | ||||||||||
|
||||||||||
```bash
docker run \
  -e AWS_REGION=<region> \
  -e AWS_SESSION_TOKEN="$(grep 'aws_session_token' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
  -e AWS_ACCESS_KEY_ID="$(grep 'aws_access_key_id' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
  -e AWS_SECRET_ACCESS_KEY="$(grep 'aws_secret_access_key' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
  -e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your lakefs URL> \
  -e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs access key> \
  -e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
  -e LAKEFS_SGC_LOGGING_LEVEL=debug \
  treeverse/lakefs-sgc:<tag> run <repository>
```

Review comment: Did we mention somewhere which lakeFS user this user is?

Reply: Done - Added a "Permissions" section
|
||||||||||
#### Mounting the `~/.aws` directory | ||||||||||
|
||||||||||
When working with S3-compatible clients, it's often more convenient to mount the `~/.aws` directory and pass in the desired profile.
|
||||||||||
First, change the permissions for `~/.aws/*` to allow the docker container to read this directory: | ||||||||||
```bash | ||||||||||
chmod 644 ~/.aws/* | ||||||||||
``` | ||||||||||
|
||||||||||
Then, run the docker image and mount `~/.aws` to the `lakefs-sgc` home directory on the docker container: | ||||||||||
```bash
docker run \
  --network=host \
  -v ~/.aws:/home/lakefs-sgc/.aws \
  -e AWS_REGION=us-east-1 \
  -e AWS_PROFILE=<your profile> \
  -e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your endpoint URL> \
  -e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs access key> \
  -e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
  -e LAKEFS_SGC_LOGGING_LEVEL=debug \
  treeverse/lakefs-sgc:<tag> run <repository>
```

Review comment: Nit; Here and elsewhere, drop the

Reply: Done

### Output | ||||||||||
||||||||||
`lakefs-sgc` will write its reports to `<REPOSITORY_STORAGE_NAMESPACE>/_lakefs/retention/gc/reports/<RUN_ID>/`. \ | ||||||||||
_RUN_ID_ is generated during runtime by the Standalone GC. You can find it in the logs: | ||||||||||
``` | ||||||||||
"Marking objects for deletion" ... run_id=gcoca17haabs73f2gtq0 | ||||||||||
``` | ||||||||||
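If the logs are saved or piped, the run ID can be pulled out with standard tools. The sketch below assumes the `run_id=<value>` field appears on the "Marking objects for deletion" line as shown above. The other fields in the sample log line and the storage namespace are made up for illustration, since the full log format is not specified here.

```bash
#!/usr/bin/env bash
# Sample log line: only the message and the run_id field come from the docs,
# the remaining fields are illustrative.
log_line='level=info msg="Marking objects for deletion" run_id=gcoca17haabs73f2gtq0'

# Capture the value that follows "run_id=" up to the next space.
run_id="$(printf '%s\n' "$log_line" | sed -n 's/.*run_id=\([^ ]*\).*/\1/p')"

# Hypothetical storage namespace; replace with your repository's namespace.
storage_ns="s3://example-bucket/example-repo"
report_prefix="$storage_ns/_lakefs/retention/gc/reports/$run_id/"
echo "$report_prefix"
```

With `logging.format` set to "json" the field would be encoded differently, so the `sed` expression would need adjusting.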
|
||||||||||
In this prefix, you'll find 2 objects:
- `deleted.csv` - Contains all marked objects in a CSV with a single `address` column. Example:
  ```csv
  address
  "data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa69g,_e7P9j-1ahTXtofw7tWwJUIhTfL0rEs_dvBrClzc_QE"
  "data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa78g,mKZnS-5YbLzmK0pKsGGimdxxBlt8QZzCyw1QeQrFvFE"
  ...
  ```
- `summary.json` - A small JSON file summarizing the GC run. Example:
  ```json
  {
    "run_id": "gcoca17haabs73f2gtq0",
    "success": true,
    "first_slice": "gcss5tpsrurs73cqi6e0",
    "start_time": "2024-10-27T13:19:26.890099059Z",
    "cutoff_time": "2024-10-27T07:19:26.890099059Z",
    "num_deleted_objects": 33000
  }
  ```

Review comment (on `deleted.csv`): Not a docs question, but why this file called deleted if it contains objects that are marked for deletion?

Reply: It's aligned with the GC's output
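Before acting on a report, it is reasonable to check that the run succeeded. This is a minimal sketch, assuming `summary.json` has been downloaded locally and is formatted like the example above. It uses `grep` and `sed` for brevity; a JSON-aware tool would be more robust.

```bash
#!/usr/bin/env bash
# Write a local copy of the example summary for demonstration.
summary_file="$(mktemp)"
cat > "$summary_file" <<'EOF'
{
  "run_id": "gcoca17haabs73f2gtq0",
  "success": true,
  "first_slice": "gcss5tpsrurs73cqi6e0",
  "start_time": "2024-10-27T13:19:26.890099059Z",
  "cutoff_time": "2024-10-27T07:19:26.890099059Z",
  "num_deleted_objects": 33000
}
EOF

# Treat the run as usable only if "success" is true.
if grep -Eq '"success"[[:space:]]*:[[:space:]]*true' "$summary_file"; then
  run_ok=yes
else
  run_ok=no
fi

# Pull out the number of marked objects.
num_marked="$(sed -n 's/.*"num_deleted_objects"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$summary_file")"
echo "success=$run_ok marked=$num_marked"
```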
|
||||||||||
### Deleting marked objects | ||||||||||
||||||||||
|
||||||||||
To delete the objects marked by the GC, you'll need to read the `deleted.csv` file, and manually delete each address from AWS. | ||||||||||
|
||||||||||
Review comment: I would add a note about playing it safe and moving the objects instead of deleting them

Reply: Done

Example bash command to move all the marked objects to a different bucket on S3:
```bash
# Change these to your correct values
storage_ns=<your storage namespace (s3://...)>
output_bucket=<your output bucket (s3://...)>
run_id=<GC run id>

# Download the CSV file
aws s3 cp "$storage_ns/_lakefs/retention/gc/reports/$run_id/deleted.csv" "./run_id-$run_id.csv"

# Move addresses to the output bucket under the run_id prefix.
# `tail -n +2` skips the CSV header. `head -n 10` limits the move to the
# first 10 addresses; remove it to move all of them.
tail -n +2 "run_id-$run_id.csv" | head -n 10 | xargs -I {} aws s3 mv "$storage_ns/{}" "$output_bucket/run_id=$run_id/"
```
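Because moving or deleting objects is hard to undo, a dry run that only prints the commands can be a useful first step. The sketch below is an illustration and not part of `lakefs-sgc`: the namespace and bucket names are hypothetical, the sample CSV reuses the addresses from the `deleted.csv` example, and it assumes every address is wrapped in double quotes with no embedded quotes.

```bash
#!/usr/bin/env bash
storage_ns="s3://example-bucket/example-repo"   # hypothetical
output_bucket="s3://example-backup-bucket"      # hypothetical
run_id="gcoca17haabs73f2gtq0"

# Sample report standing in for the downloaded deleted.csv.
csv_file="$(mktemp)"
cat > "$csv_file" <<'EOF'
address
"data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa69g,_e7P9j-1ahTXtofw7tWwJUIhTfL0rEs_dvBrClzc_QE"
"data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa78g,mKZnS-5YbLzmK0pKsGGimdxxBlt8QZzCyw1QeQrFvFE"
EOF

# Skip the header, strip the surrounding quotes, and print (not run) each move.
plan_file="$(mktemp)"
tail -n +2 "$csv_file" | while IFS= read -r line; do
  address="${line#\"}"
  address="${address%\"}"
  printf 'aws s3 mv "%s/%s" "%s/run_id=%s/"\n' \
    "$storage_ns" "$address" "$output_bucket" "$run_id"
done > "$plan_file"
cat "$plan_file"
```

Reviewing the printed plan before running any of the commands keeps the operation reversible until you are confident in the report.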

Review comment: nit; add a whitespace after every markdown heading

Reply: Done