rsync-sjtug is an open-source project designed to provide an efficient method of mirroring remote repositories to s3 storage, with atomic updates and periodic garbage collection.
This project implements the rsync wire protocol, and is compatible with the rsync protocol version 27. All rsyncd versions older than 2.6.0 are supported.
rsync-sjtug
is currently powering the sjtug mirror.
- Atomic repository update: users never see a partially updated repository.
- Periodic garbage collection: old versions of files can be removed from the storage.
- Delta transfer: only the changed parts of files are transferred.
- rsync-fetcher - fetches the repository from the remote server, and uploads it to s3.
- rsync-gateway - serves the mirrored repository from s3 in http protocol.
- rsync-gc - periodically removes old versions of files from s3.
- rsync-migration - see Migration section for more details.
- Sync rsync repository to S3.
$ RUST_LOG=info RUST_BACKTRACE=1 AWS_ACCESS_KEY_ID=<ID> AWS_SECRET_ACCESS_KEY=<KEY> \ rsync-fetcher \ --src rsync://upstream/path \ --s3-url https://s3_api_endpoint --s3-region region --s3-bucket bucket --s3-prefix prefix \ --pg-url postgres://user@localhost/db \ --namespace repo_name
- Serve the repository over HTTP.
$ cat > config.toml <<-EOF bind = ["localhost:8081"] s3_url = "https://s3_api_endpoint" s3_region = "region" [endpoints."out"] namespace = "repo_name" s3_bucket = "bucket" s3_prefix = "prefix" EOF $ RUST_LOG=info RUST_BACKTRACE=1 rsync-gateway <optional config file>
- GC old versions of files manually.
$ RUST_LOG=info RUST_BACKTRACE=1 AWS_ACCESS_KEY_ID=<ID> AWS_SECRET_ACCESS_KEY=<KEY> \ rsync-gc \ --s3-url https://s3_api_endpoint --s3-region region --s3-bucket bucket --s3-prefix repo_name \ --pg-url postgres://user@localhost/db
It's recommended to keep at least 2 revisions in case a gateway is still using an old revision.
File data and their metadata are stored separately.
Files are stored in S3 storage, named by their blake2b-160 hash (<prefix>/<namespace>/<hash>
).
Metadata is stored in Postgres.
An object is the smallest unit of metadata. There are three types of objects:
- File - a regular file, with its hash, size and mtime
- Directory - a directory, and its size and mtime
- Symlink - a symlink, with its size, mtime and target
Objects (files, directories and symlinks) are organized into revisions, which are immutable. Each revision has a unique
id, while an object may appear in multiple revisions. Revisions are further organized into repositories (namespaces),
like debian
, ubuntu
, etc. Repositories are mutable.
A revision can be in one of the following states:
- Live - a live revision is a revision in production, which is ready to be served. There can be multiple live revisions, but only the latest one is served by the gateway.
- Partial - a partial revision is a revision that is still being updated. It's not ready to be served yet.
- Stale - a stale revision is a revision that is no longer in production, and is ready to be garbage collected.
v0.4.x switched from Redis to Postgres for storing metadata, greatly improving the performance of many operations and reducing the storage usage.
Use rsync-migration redis-to-pg
to migrate old metadata to the new database. Note that you can only migrate from
v0.3.x to v0.4.x, and you can't migrate from v0.2.x to v0.4.x directly.
The old Redis database is not modified.
v0.3.x uses a new encoding for file metadata, which is incompatible with v0.2.x. Trying to use v0.3.x on old data will fail.
Use rsync-migration upgrade-encoding
to upgrade the encoding.
This is a destructive operation, so make sure you have a backup of the database before running it. It does nothing
without the --do
flag.
The new encoding is actually introduced in v0.2.12 by accident. rsync-gateway
between v0.2.12 and v0.3.0 can't parse
old metadata correctly and return garbage data. No data is lost though, so if you used any version between v0.2.12 and
v0.3.0, you can still use rsync-migration
to migrate to the new encoding.