Storage service

The storage service manages the storage of our digital collections, including:

  • Uploading files to cloud storage providers like Amazon S3 and Azure Blob
  • Verifying fixity information on our files (checksums, sizes, filenames); there's a short sketch of this after the list
  • Reporting on the contents of our digital archive through machine-readable APIs and search tools
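In the BagIt model, fixity checking boils down to recomputing each file's checksum and comparing it against the bag's manifest. The service's real implementation lives in the repo linked below; this is just a minimal Python sketch of the idea:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(bag_root: Path) -> list[str]:
    """Check every entry in manifest-sha256.txt; return the paths that fail."""
    failures = []
    for line in (bag_root / "manifest-sha256.txt").read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        if sha256_of(bag_root / rel_path) != expected:
            failures.append(rel_path)
    return failures
```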

Requirements

The storage service is designed to:

  • Ensure the safe, long-term (i.e. decades) storage of our digital assets
  • Provide a scalable mechanism for identifying, retrieving, and storing content
  • Support bulk processing of content, e.g. for file format migrations or batch analysis
  • Follow industry best-practices around file integrity and audit trails
  • Enable us to meet NDSA Level 4 for both digitised and "born-digital" assets

High-level design

This is the basic architecture:

Workflow systems (Goobi, Archivematica) create "bags", which are collections of files stored in the BagIt packaging format. They upload these bags to a temporary S3 bucket, and call the storage service APIs to ask it to store the bags permanently.
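For reference, a bag in the BagIt format (RFC 8493) is just a directory containing a declaration file, one or more checksum manifests, and the payload files under data/. The file names below are invented for illustration:

```
my-bag/
├── bagit.txt             # BagIt version declaration
├── bag-info.txt          # optional bag-level metadata
├── manifest-sha256.txt   # checksum + path for every payload file
└── data/
    ├── b12345678.xml
    └── b12345678_0001.jp2
```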

The storage service reads the bags, verifies their contents, and replicates them to our permanent storage (S3 buckets/Azure containers). It is the only component that writes to our permanent storage, which ensures everything is stored and labelled consistently.

Delivery systems (e.g. DLCS) can then read objects back out of permanent storage to provide access to users.
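Reading back is ordinary object storage access. As a hypothetical example with boto3 (the bucket and key here are invented, and the real key layout may differ):

```python
import boto3

s3 = boto3.client("s3")

# Fetch one file from a stored bag; bucket and key are invented examples.
obj = s3.get_object(
    Bucket="example-permanent-storage",
    Key="digitised/b12345678/v1/data/b12345678.xml",
)
content = obj["Body"].read()
```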

Documentation

This GitBook space includes:

  • How-to guides explaining how to do common operations, e.g. upload new files into the storage service
  • Reference material explaining how the storage service is designed, and why we made those choices
  • Notes for Wellcome developers who need to administer or debug our storage service deployment

Repo

All our storage service code is in https://github.com/wellcomecollection/storage-service

The READMEs in the repo have instructions for specific procedures, e.g. how to create new Docker images. This GitBook is meant to be a bit higher-level.


Concepts

The unit of storage in the storage service is a bag: a collection of files packaged with the BagIt format, which are ingested and stored as a unit.

An ingest is a record of processing applied to a bag, such as creating a new bag or storing a new version of an existing bag.

Each bag is identified by a space (a broad category), an external identifier (a specific identifier), and a version. Read more about identifiers.
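Together these three values pin down exactly one version of one bag. As a sketch of how they compose (the key layout shown is an assumption for illustration, not a documented guarantee):

```python
def bag_prefix(space: str, external_identifier: str, version: int) -> str:
    """Build a storage prefix for one version of a bag.

    The space/external-identifier/vN layout is an illustrative
    assumption, not a guarantee of the real key scheme.
    """
    return f"{space}/{external_identifier}/v{version}"

# e.g. bag_prefix("digitised", "b12345678", 2) -> "digitised/b12345678/v2"
```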

Getting started: use Terraform and AWS to run the storage service

We have a Terraform configuration that spins up an instance of the storage service. You can use this to try the storage service in your own AWS account.

How-to

Once you have a running instance of the storage service, you can use it to store bags. These guides walk you through some basic operations:

You can read the API reference for more detailed information about how to use the storage service.
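To give a feel for the shape of the API, here is a hypothetical sketch of creating an ingest and polling it with Python's requests library. The base URL, endpoint path, payload fields, and status values are all assumptions for illustration; the API reference is authoritative:

```python
import time
import requests

API = "https://storage.example.org"  # hypothetical base URL

# Ask the service to ingest a bag previously uploaded to a temporary
# S3 bucket. All field names here are illustrative assumptions.
resp = requests.post(
    f"{API}/ingests",
    json={
        "type": "Ingest",
        "ingestType": {"id": "create", "type": "IngestType"},
        "space": {"id": "digitised", "type": "Space"},
        "sourceLocation": {
            "type": "Location",
            "provider": {"type": "Provider", "id": "amazon-s3"},
            "bucket": "example-uploads",
            "path": "digitised/b12345678.tar.gz",
        },
        "bag": {
            "type": "Bag",
            "info": {"type": "BagInfo", "externalIdentifier": "b12345678"},
        },
    },
)
resp.raise_for_status()
ingest_url = resp.headers["Location"]

# Poll the ingest record until it reaches a terminal status.
while True:
    status = requests.get(ingest_url).json()["status"]["id"]
    if status in ("succeeded", "failed"):
        break
    time.sleep(5)
print("Ingest finished:", status)
```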

Once you're comfortable storing individual bags, you can read about more advanced topics:

  • Storing multiple versions of the same bag
  • Sending a partial update to a bag
  • Storing preservation and access copies in different storage classes
  • Reporting on the contents of the storage service
  • Getting callback notifications from the storage service
  • Getting notifications of newly stored bags

and some information about what to do when things go wrong:

Reference

These topics explain how the storage service works, and why it's designed the way it is:

We also have the storage service RFC, the original design document. It isn't actively updated, and some of the details have changed in the implementation.

Developer information

These topics are useful for a developer looking to modify or extend the storage service.

Developer workflow: