From 10fcb19358f30b1075951701e0c938ae67c2e190 Mon Sep 17 00:00:00 2001
From: yonipeleg33 <51454184+yonipeleg33@users.noreply.github.com>
Date: Tue, 29 Oct 2024 18:33:00 +0200
Subject: [PATCH] Add documentation for standalone (sparkless) GC (#8307)

* Sparkless GC - Add documentation
* Add explanation about the output and specify concrete lab tests
* review comments
* add toc, dedicate a section for deletion
* some review comments (WIP)
* add warning on objects_min_age
* add bash script to copy out deleted objects
* add documentation for s3-compatible clients
* document `aws.s3.addressing_path_style` config key, fix mounting example
* formatting fix
* more flexible time measurement (upper bound on the worst run i've seen)
* update lab tests and add permissions
* drop "your"s
* recommend moving the objects instead of deleting them
* limitations grammar fix
* remove objects_min_age config key from docs
* title change
* fix csv example
* clarify minimal permissions
---
 docs/_includes/toc_2-4.html                   |   8 +
 .../howto/garbage-collection/standalone-gc.md | 303 ++++++++++++++++++
 2 files changed, 311 insertions(+)
 create mode 100644 docs/_includes/toc_2-4.html
 create mode 100644 docs/howto/garbage-collection/standalone-gc.md

diff --git a/docs/_includes/toc_2-4.html b/docs/_includes/toc_2-4.html
new file mode 100644
index 00000000000..64c2f63c70c
--- /dev/null
+++ b/docs/_includes/toc_2-4.html
@@ -0,0 +1,8 @@
## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}
{::options toc_levels="2..4" /}

diff --git a/docs/howto/garbage-collection/standalone-gc.md b/docs/howto/garbage-collection/standalone-gc.md
new file mode 100644
index 00000000000..1b182005213
--- /dev/null
+++ b/docs/howto/garbage-collection/standalone-gc.md
@@ -0,0 +1,303 @@
---
title: Standalone Garbage Collection
description: Run a limited version of garbage collection without any external dependencies
parent: Garbage Collection
nav_order: 5
grand_parent: How-To
redirect_from:
  - /cloud/standalone-gc.html
---

# Standalone Garbage Collection
{: .d-inline-block }
lakeFS Enterprise
{: .label .label-green }

{: .d-inline-block }
experimental
{: .label .label-red }

{: .note }
> Standalone GC is only available for [lakeFS Enterprise]({% link enterprise/index.md %}).

{: .note .warning }
> Standalone GC is experimental and offers limited capabilities compared to the [Spark-backed GC]({% link howto/garbage-collection/gc.md %}). Read through the [limitations](./standalone-gc.md#limitations) carefully before using it.

{% include toc_2-4.html %}

## About

Standalone GC is a limited version of the Spark-backed GC that runs as a standalone Docker image, without any external dependencies.

## Limitations

1. Beyond the [Lab tests](./standalone-gc.md#lab-tests) described below, there are no further guarantees about the performance profile of Standalone GC.
2. Horizontal scaling is not supported - only a single instance of `lakefs-sgc` can operate on a given repository at a time.
3. Standalone GC only marks objects and does not delete them - equivalent to the GC's [mark only mode]({% link howto/garbage-collection/gc.md %}#mark-only-mode). \
   More about that in the [Get the List of Objects Marked for Deletion](./standalone-gc.md#get-the-list-of-objects-marked-for-deletion) section.

### Lab tests

Repository spec:

- 100k objects
- 250 commits
- 100 branches

Machine spec:

- 4GiB RAM
- 8 CPUs

In this setup, we measured:

- Time: under 5 minutes
- Disk space: 123MB

## Installation

### Step 1: Obtain a Docker Hub token
As an enterprise customer, you should already have a Docker Hub token for the `externallakefs` user.
If not, contact us at [support@treeverse.io](mailto:support@treeverse.io).

### Step 2: Log in to Docker Hub with this token
Use the token as the password:
```bash
docker login -u externallakefs
```

### Step 3: Download the Docker image
Download the image from the [lakefs-sgc](https://hub.docker.com/repository/docker/treeverse/lakefs-sgc/general) repository:
```bash
docker pull treeverse/lakefs-sgc:<tag>
```

## Usage

### Permissions
To run `lakefs-sgc`, you'll need an AWS user and a lakeFS user with the following permissions:

#### AWS
The minimal required permissions on AWS are:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::some-bucket/some/prefix/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::some-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    }
  ]
}
```
In this permissions file, the example repository storage namespace is `s3://some-bucket/some/prefix`.
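
If you manage IAM with the AWS CLI, a policy like the one above can be attached to a dedicated user with `put-user-policy`. A minimal sketch, assuming a user named `lakefs-sgc-runner` and the policy saved as `sgc-policy.json` (both names are hypothetical):
```bash
# Hypothetical names - substitute your own IAM user and policy file.
aws iam put-user-policy \
  --user-name lakefs-sgc-runner \
  --policy-name lakefs-sgc-s3-access \
  --policy-document file://sgc-policy.json
```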

#### lakeFS
The minimal required permissions on lakeFS are:
```json
{
  "statement": [
    {
      "action": [
        "fs:ReadConfig",
        "fs:ReadRepository",
        "retention:PrepareGarbageCollectionCommits",
        "retention:PrepareGarbageCollectionUncommitted",
        "fs:ListObjects"
      ],
      "effect": "allow",
      "resource": "arn:lakefs:fs:::repository/<repository-name>"
    }
  ]
}
```
### AWS Credentials
Currently, `lakefs-sgc` does not provide an option to explicitly set AWS credentials. It relies on the hosting machine
being set up correctly, and reads the AWS credentials from it.

This means you should set up the machine the same way you would for the AWS CLI - \
for example, by following the AWS guide on [configuring the AWS CLI](https://docs.aws.amazon.com/cli/v1/userguide/cli-chap-configure.html).

#### S3-compatible clients
Naturally, this method of configuration means `lakefs-sgc` can work with any S3-compatible storage (such as [MinIO](https://min.io/)). \
An example setup for working with MinIO:
1. Add a profile to your `~/.aws/config` file:
   ```
   [profile minio]
   region = us-east-1
   endpoint_url = <MINIO_ENDPOINT>
   s3 =
       signature_version = s3v4
   ```

2. Add access and secret keys to your `~/.aws/credentials` file:
   ```
   [minio]
   aws_access_key_id = <MINIO_ACCESS_KEY_ID>
   aws_secret_access_key = <MINIO_SECRET_ACCESS_KEY>
   ```
3. Run the `lakefs-sgc` Docker image and pass it the `minio` profile - see the [example](./standalone-gc.md#mounting-the-aws-directory) below.

### Configuration
The following configuration keys are available:

| Key                            | Description                                                                                                                                                  | Default value        | Possible values                                           |
|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|-----------------------------------------------------------|
| `logging.format`               | Logs output format                                                                                                                                             | "text"               | "text","json"                                             |
| `logging.level`                | Logs level                                                                                                                                                     | "info"               | "error","warn","info","debug","trace"                     |
| `logging.output`               | Where to output the logs to                                                                                                                                    | "-"                  | "-" (stdout), "=" (stderr), or any string for a file path |
| `cache_dir`                    | Directory to use for caching data during the run                                                                                                               | `~/.lakefs-sgc/data` | string                                                    |
| `aws.max_page_size`            | Max number of items per page when listing objects in AWS                                                                                                       | 1000                 | number                                                    |
| `aws.s3.addressing_path_style` | Whether to use [path-style addressing](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#path-style-access) when reading objects from AWS | true                 | boolean                                                   |
| `lakefs.endpoint_url`          | The URL of the lakeFS installation - should end with `/api/v1`                                                                                                 | NOT SET              | URL                                                       |
| `lakefs.access_key_id`         | Access key ID for the lakeFS installation                                                                                                                      | NOT SET              | string                                                    |
| `lakefs.secret_access_key`     | Secret access key for the lakeFS installation                                                                                                                  | NOT SET              | string                                                    |

These keys can be provided in the following ways:
1. Config file: Create a YAML file with the keys, where each `.` introduces a new nesting level. \
   For example, `logging.level` will be:
   ```yaml
   logging:
     level: # info,debug...
   ```
   Then, pass it to the program using the `--config path/to/config.yaml` argument.
2. Environment variables: Set `LAKEFS_SGC_<KEY>`, with uppercase letters and `.`s converted to `_`s. \
   For example, `logging.level` will be:
   ```bash
   export LAKEFS_SGC_LOGGING_LEVEL=info
   ```

Example (minimalistic) config file:
```yaml
logging:
  level: debug
lakefs:
  endpoint_url: https://your.url/api/v1
  access_key_id: <lakeFS access key ID>
  secret_access_key: <lakeFS secret access key>
```

### Command line reference

#### Flags:
- `-c, --config`: config file to use (default is `$HOME/.lakefs-sgc.yaml`)

#### Commands:
**run**

Usage: \
`lakefs-sgc run <repository>`

Flags:
- `--cache-dir`: directory to cache read files and metadata (default is `$HOME/.lakefs-sgc/data/`)
- `--parallelism`: number of parallel metadata downloads (default 10)
- `--presign`: use pre-signed URLs when downloading/uploading data (recommended) (default true)

### How to Run Standalone GC

#### Directly passing in credentials parsed from `~/.aws/credentials`

```bash
docker run \
-e AWS_REGION=<region> \
-e AWS_SESSION_TOKEN="$(grep 'aws_session_token' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e AWS_ACCESS_KEY_ID="$(grep 'aws_access_key_id' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e AWS_SECRET_ACCESS_KEY="$(grep 'aws_secret_access_key' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<lakeFS endpoint URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<lakeFS access key ID> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<lakeFS secret access key> \
-e LAKEFS_SGC_LOGGING_LEVEL=debug \
treeverse/lakefs-sgc:<tag> run <repository>
```

#### Mounting the `~/.aws` directory

When working with S3-compatible clients, it's often more convenient to mount the `~/.aws` directory and pass in the desired profile.

First, change the permissions of `~/.aws/*` to allow the Docker container to read this directory:
```bash
chmod 644 ~/.aws/*
```

Then, run the Docker image and mount `~/.aws` to the `lakefs-sgc` home directory on the container:
```bash
docker run \
--network=host \
-v ~/.aws:/home/lakefs-sgc/.aws \
-e AWS_REGION=us-east-1 \
-e AWS_PROFILE=<profile> \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<lakeFS endpoint URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<lakeFS access key ID> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<lakeFS secret access key> \
-e LAKEFS_SGC_LOGGING_LEVEL=debug \
treeverse/lakefs-sgc:<tag> run <repository>
```
### Get the List of Objects Marked for Deletion
`lakefs-sgc` writes its reports to `<STORAGE_NAMESPACE>/_lakefs/retention/gc/reports/<RUN_ID>/`. \
`<RUN_ID>` is generated at runtime by the Standalone GC. You can find it in the logs:
```
"Marking objects for deletion" ... run_id=gcoca17haabs73f2gtq0
```

Under this prefix, you'll find 2 objects:
- `deleted.csv` - a CSV file listing all marked objects, with a single `address` column. Example:
  ```
  address
  "data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa69g,_e7P9j-1ahTXtofw7tWwJUIhTfL0rEs_dvBrClzc_QE"
  "data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa78g,mKZnS-5YbLzmK0pKsGGimdxxBlt8QZzCyw1QeQrFvFE"
  ...
  ```
- `summary.json` - a small JSON file summarizing the GC run. Example:
  ```json
  {
    "run_id": "gcoca17haabs73f2gtq0",
    "success": true,
    "first_slice": "gcss5tpsrurs73cqi6e0",
    "start_time": "2024-10-27T13:19:26.890099059Z",
    "cutoff_time": "2024-10-27T07:19:26.890099059Z",
    "num_deleted_objects": 33000
  }
  ```

### Delete marked objects

To delete the objects marked by the GC, you'll need to read the `deleted.csv` file and manually delete each address from the underlying bucket.

It is recommended to move all the marked objects to a different bucket instead of deleting them directly.
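
Before moving or deleting anything, it can be worth confirming that the run completed successfully. A minimal sketch using the AWS CLI and `jq` - `<storage namespace>` and `<run ID>` are placeholders for your own values:
```bash
# Stream the run summary to stdout and print whether the run succeeded
# and how many objects were marked. Placeholder values - substitute your own.
aws s3 cp "<storage namespace>/_lakefs/retention/gc/reports/<run ID>/summary.json" - \
  | jq '{success, num_deleted_objects}'
```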

Here's an example bash script that moves the marked objects to a different bucket:
```bash
# Change these to your correct values
storage_ns=<storage namespace>
output_bucket=<output bucket>
run_id=<run ID>

# Download the CSV file
aws s3 cp "$storage_ns/_lakefs/retention/gc/reports/$run_id/deleted.csv" "./run_id-$run_id.csv"

# Move all addresses to the output bucket under the run_id prefix (skipping the CSV header)
tail -n +2 "run_id-$run_id.csv" | xargs -I {} aws s3 mv "$storage_ns/{}" "$output_bucket/run_id=$run_id/"
```
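
To double-check that the move completed, a quick sketch (reusing the variables from the script above) compares the number of addresses in the report with the number of objects that landed under the run's prefix - the two counts should match:
```bash
# Number of addresses listed in the report (skipping the CSV header)
report_count=$(tail -n +2 "run_id-$run_id.csv" | wc -l)

# Number of objects now present under the run's prefix in the output bucket
moved_count=$(aws s3 ls "$output_bucket/run_id=$run_id/" | wc -l)

echo "marked: $report_count, moved: $moved_count"
```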