Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document src/zarrman/ and its API requirements #163

Merged
merged 2 commits into from
Jul 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 20 additions & 0 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -118,4 +118,24 @@ jobs:
- name: Check formatting
run: cargo fmt --check

docs:
runs-on: ubuntu-latest
steps:
- name: Check out repository
uses: actions/checkout@v4

- name: Install nightly Rust
uses: dtolnay/rust-toolchain@master
with:
toolchain: nightly

- name: Activate cache
if: "!startsWith(github.head_ref, 'renovate/')"
uses: Swatinem/rust-cache@v2

- name: Check docs
run: cargo doc --no-deps --all-features --document-private-items
env:
RUSTDOCFLAGS: -Dwarnings

# vim:set et sts=2:
10 changes: 10 additions & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -138,6 +138,16 @@ unused_comparisons = "deny"
useless_ptr_null_checks = "deny"
while_true = "deny"

[lints.rustdoc]
bare_urls = "allow"
broken_intra_doc_links = "deny"
invalid_codeblock_attributes = "deny"
invalid_html_tags = "deny"
invalid_rust_codeblocks = "deny"
private_intra_doc_links = "deny"
redundant_explicit_links = "deny"
unescaped_backticks = "deny"

[lints.clippy]
# Deny all warn-by-default lints:
all = { level = "deny", priority = -1 }
Expand Down
115 changes: 115 additions & 0 deletions doc/zarrman.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
Requirements for the Zarr Manifest API
======================================

*This document is up-to-date as of 2024 July 13.*

The implementation of the `/zarrs/` hierarchy served by `dandidav` works by
fetching documents called *Zarr manifests* from a URL hierarchy, hereafter
referred to as the *manifest tree*, the base URL of which is hereafter referred
to as the *manifest root*. Currently, the manifest root is hardcoded to
<https://datasets.datalad.org/dandi/zarr-manifests/zarr-manifests-v2-sorted/>,
a subdirectory of a mirror of <https://github.com/dandi/zarr-manifests> (but
see [issue #161](https://github.com/dandi/dandidav/issues/161)). The manifest
tree and the Zarr manifests are expected to meet the following requirements:

- An HTTP `GET` request to any extant directory in the manifest tree (including
the manifest root) must return a JSON object containing the following keys:

- `"files"` — an array of strings listing the names (i.e., final path
components) of the files (if any) available immediately within the
directory

- `"directories"` — an array of strings listing the names (i.e., final path
components) of the subdirectories (if any) available immediately within
the directory

- Each Zarr manifest describes a single Zarr in a DANDI Archive instance at a
certain point in time. The URL for each Zarr manifest must be of the form
`{manifest_root}/{prefix1}/{prefix2}/{zarr_id}/{checksum}.json`, where:

- `{prefix1}` is the first three characters of `{zarr_id}`
- `{prefix2}` is the next three characters of `{zarr_id}`
- `{zarr_id}` is the Zarr ID of the Zarr on the Archive instance
- `{checksum}` is the [Zarr checksum][] of the Zarr's contents at point in
time that the manifest represents

- The manifest tree should not contain any files that are not Zarr manifests
nor any directories that are not a parent directory of a Zarr manifest.
`dandidav`'s behavior should it encounter any such "extra" resources is
currently an implementation detail and may change in the future.

- A Zarr manifest is a JSON object containing an `"entries"` key whose value is
a tree of objects mirroring the directory & entry structure of the Zarr.

- Each entry in the Zarr is represented as an array of the following four
elements, in order, describing the entry as of the point in time
represented by the Zarr manifest:

- The S3 version ID (as a string) of the then-current version of the S3
object in which the entry is stored in the Archive instance's S3
bucket

- The `LastModified` timestamp of the entry's S3 object as a string of
the form `"YYYY-MM-DDTHH:MM:SS±HH:MM"`

- The size in bytes of the entry as an integer

- The `ETag` of the entry's S3 object as a string with leading &
trailing double quotation marks (U+0022) removed (not counting the
double quotation marks used by the JSON serialization)

- Each directory in the Zarr is represented as an object in which each key
is the name of an entry or subdirectory inside the directory and the
corresponding value is either an entry array or a directory object.

- The `entries` object itself represents the top level directory of the
Zarr.

For example, a Zarr with the following structure:

```text
.
├── .zgroup
├── foo/
│   ├── .zgroup
│   ├── bar/
│   │   ├── .zarray
│   │   └── 0
│   └── baz/
│   ├── .zarray
│   └── 0
└── quux/
├── .zarray
└── 0
```

would have an `entries` field as follows (with elements of the entry arrays
omitted):

```json
{
".zgroup": [ ... ],
"foo": {
".zgroup": [ ... ],
"bar": {
".zarray": [ ... ],
"0": [ ... ]
},
"baz": {
".zarray": [ ... ],
"0": [ ... ]
}
},
"quux": {
".zarray": [ ... ],
"0": [ ... ]
}
}
```

- For a Zarr with Zarr ID `zarr_id` and an entry therein at path `entry_path`,
the download URL for the entry is expected to be
`{base_url}/{zarr_id}/{entry_path}`, where `base_url` is currently hardcoded
for all entries to <https://dandiarchive.s3.amazonaws.com/zarr/>.

[Zarr checksum]: https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md#zarr-entry-checksum-format
19 changes: 17 additions & 2 deletions src/zarrman/manifest.rs
Original file line number Diff line number Diff line change
Expand Up @@ -4,12 +4,16 @@ use serde::Deserialize;
use std::collections::BTreeMap;
use time::OffsetDateTime;

/// A parsed Zarr manifest
#[derive(Clone, Debug, Deserialize, Eq, PartialEq)]
pub(super) struct Manifest {
/// A tree of the Zarr's entries
pub(super) entries: ManifestFolder,
}

impl Manifest {
/// Retrieve a reference to the folder or entry in the manifest at `path`,
/// if any
pub(super) fn get(&self, path: &PurePath) -> Option<EntryRef<'_>> {
let mut folder = &self.entries;
for (pos, p) in path.components().with_position() {
Expand All @@ -31,6 +35,8 @@ pub(super) enum EntryRef<'a> {
Entry(&'a ManifestEntry),
}

/// A representation of a folder within a Zarr manifest: a mapping from entry &
/// subdirectory names to the entries & subdirectories
pub(super) type ManifestFolder = BTreeMap<Component, FolderEntry>;

#[derive(Clone, Debug, Deserialize, Eq, PartialEq)]
Expand All @@ -40,14 +46,23 @@ pub(super) enum FolderEntry {
Entry(ManifestEntry),
}

/// Information on a Zarr entry in a manifest as of the point in time
/// represented by the manifest
#[derive(Clone, Debug, Deserialize, Eq, PartialEq)]
pub(super) struct ManifestEntry {
// Keep these fields in this order so that deserialization will work
// properly!
// IMPORTANT: Keep these fields in this order so that deserialization will
// work properly!
/// The S3 version ID of the entry's S3 object
pub(super) version_id: String,

/// The entry's S3 object's modification time
#[serde(with = "time::serde::rfc3339")]
pub(super) modified: OffsetDateTime,

/// The size of the entry in bytes
pub(super) size: i64,

/// The ETag of the entry's S3 object
pub(super) etag: String,
}

Expand Down
Loading