The storage service manages the storage of our digital collections, including:
- Uploading files to cloud storage providers like Amazon S3 and Azure Blob
- Verifying fixity information on our files (checksums, sizes, filenames), as sketched below
- Reporting on the contents of our digital archive through machine-readable APIs and search tools
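To make the fixity idea concrete, here is a minimal Python sketch of the kind of check involved: computing a file's checksum and size and comparing them against the values recorded for it. The function name and parameters are illustrative; the storage service's real verifier lives in the repository linked below and reads these values from a bag's BagIt manifests.

```python
import hashlib


def passes_fixity_check(path: str, expected_sha256: str, expected_size: int) -> bool:
    """Return True if the file matches the recorded checksum and size.

    Illustrative only: the storage service's bag verifier does the real work,
    using the checksums declared in the bag's BagIt manifests.
    """
    digest = hashlib.sha256()
    actual_size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
            actual_size += len(chunk)
    return digest.hexdigest() == expected_sha256 and actual_size == expected_size
```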
The storage service is designed to:
- Ensure the safe, long-term (i.e. decades) storage of our digital assets
- Provide a scalable mechanism for identifying, retrieving, and storing content
- Support bulk processing of content, e.g. for file format migrations or batch analysis
- Follow industry best practices around file integrity and audit trails
- Enable us to meet NDSA Level 4 for both digitised and "born-digital" assets
This is the basic architecture:
Workflow systems (Goobi, Archivematica) create "bags", which are collections of files stored in the BagIt packaging format. They upload these bags to a temporary S3 bucket, and call the storage service APIs to ask it to store the bags permanently.
The storage service reads the bags, verifies their contents, and replicates the bags to our permanent storage (S3 buckets/Azure containers). It is the only thing which writes to our permanent storage; this ensures everything is stored and labelled consistently.
Delivery systems (e.g. DLCS) can then read objects back out of permanent storage, to provide access to users.
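To make this flow concrete, here is a hedged Python sketch of what a workflow system does: upload a packaged bag to the temporary bucket, then ask the ingests API to store it. The bucket name, API URL and credentials are placeholders, authentication is omitted, and the payload fields are indicative; the API reference is the authoritative description of the request format.

```python
import boto3
import requests

# Placeholder values: substitute your own bucket and API endpoint.
UPLOADS_BUCKET = "example-storage-uploads"
API_URL = "https://api.example.org/storage/v1/ingests"

# 1. Upload the packaged bag (a tar.gz of a BagIt bag) to the temporary bucket.
s3 = boto3.client("s3")
s3.upload_file("b12345678.tar.gz", UPLOADS_BUCKET, "b12345678.tar.gz")

# 2. Ask the storage service to store it permanently.  The payload shape below
#    is indicative; see the API reference for the exact schema.
response = requests.post(
    API_URL,
    json={
        "type": "Ingest",
        "ingestType": {"id": "create", "type": "IngestType"},
        "space": {"id": "digitised", "type": "Space"},
        "sourceLocation": {
            "provider": {"id": "amazon-s3", "type": "Provider"},
            "bucket": UPLOADS_BUCKET,
            "path": "b12345678.tar.gz",
            "type": "Location",
        },
        "bag": {
            "info": {"externalIdentifier": "b12345678", "type": "BagInfo"},
            "type": "Bag",
        },
    },
)
response.raise_for_status()

# If present, the Location header points at the new ingest, which can be
# polled to follow the bag's progress through verification and replication.
print(response.headers.get("Location"))
```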
This GitBook space includes:
- How-to guides explaining how to do common operations, e.g. upload new files into the storage service
- Reference material explaining how the storage service is designed, and why we made those choices
- Notes for Wellcome developers who need to administer or debug our storage service deployment
All our storage service code is in https://github.com/wellcomecollection/storage-service
The READMEs in the repo have instructions for specific procedures, e.g. how to create new Docker images. This GitBook is meant to be a bit higher-level.
The unit of storage in the storage service is a bag: a collection of files packaged together in the BagIt format, which are ingested and stored as a single unit.
An ingest is a record of some processing on a bag, such as creating a new bag or adding a new version of a bag.
Each bag is identified with a space (a broad category), an external identifier (a specific identifier), and a version. Read more about identifiers.
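For example, with purely illustrative values, the three parts of an identifier might look like this; the API path printed at the end assumes the URL structure described in the API reference.

```python
# Illustrative identifier for one version of one bag:
space = "digitised"                # broad category, e.g. digitised vs born-digital
external_identifier = "b12345678"  # identifier assigned by the source system
version = "v2"                     # assigned by the storage service, starting at v1

# Together these pick out exactly one version of one bag, e.g. in a bags API call
# (the path shape is indicative; see the API reference for the exact form):
print(f"/bags/{space}/{external_identifier}?version={version}")
```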
We have a Terraform configuration that spins up an instance of the storage service. You can use this to try the storage service in your own AWS account.
Once you have a running instance of the storage service, you can use it to store bags. These guides walk you through some basic operations:
- Ingest a bag into the storage service
- Look up an already-stored bag in the storage service
- Look up the versions of a bag in the storage service
You can read the API reference for more detailed information about how to use the storage service.
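As a taste of what those guides cover, here is a hedged Python sketch of looking up a stored bag. The base URL is a placeholder, authentication is omitted, and the response fields are indicative; the API reference documents the exact schema.

```python
import requests

# Placeholder base URL: point this at your own storage service deployment,
# and add whatever authentication headers your deployment requires.
API_URL = "https://api.example.org/storage/v1"

space = "digitised"
external_identifier = "b12345678"

# Fetch the latest version of the bag.  (Field names below are indicative;
# see the API reference for the authoritative response schema.)
resp = requests.get(f"{API_URL}/bags/{space}/{external_identifier}")
resp.raise_for_status()
bag = resp.json()

print("version:", bag.get("version"))
for entry in bag.get("manifest", {}).get("files", [])[:5]:
    print(entry.get("name"), entry.get("size"))
```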
Once you're comfortable storing individual bags, you can read about more advanced topics:
- Storing multiple versions of the same bag
- Sending a partial update to a bag
- Storing preservation and access copies in different storage classes
- Reporting on the contents of the storage service
- Getting callback notifications from the storage service
- Getting notifications of newly stored bags
and some information about what to do when things go wrong:
- Why ingests fail: understanding ingest errors
- Operational monitoring of the storage service
- Manually marking ingests as failed
These topics explain how the storage service works, and why it's designed the way it is:
- The semantics of bags, ingests and ingest types
- Detailed architecture: what do the different services do?
- How identifiers work in the storage service
- How files are laid out in the underlying storage
- How bags are verified
- How bags are versioned
- Compressed vs uncompressed bags, and the choice of tar.gz
We also have the storage service RFC, the original design document. Note that the RFC isn't actively updated, and some of the details have changed in the implementation.
These topics are useful for a developer looking to modify or extend the storage service.
- An API reference for the user-facing storage service APIs
- Key technologies
- Adding support for another replica location (e.g. Google Cloud)
- Inter-app messaging with SQS and SNS
- How requests are routed from the API to app containers
- Locking around operations in S3 and Azure Blob
Developer workflow: