Currently the storage service supports replicating to two storage providers:
Amazon S3
Azure Blob
We chose these because they're the two providers used by Wellcome, but the service is designed to be extensible, e.g. to support Google Cloud Storage. What would it take?
These are some rough notes, not a comprehensive work list; they're meant to give a finger-in-the-air estimate.
Assumptions
Every new bag will be replicated to every storage provider. Supporting mixed locations (where different bags go to different sets of providers) would be additional work.
Where it fits in the ingest process
This is where a new storage provider would sit in the ingest process:
Everything before the pre-replication verifier is happening in S3; it doesn't care about replication locations.
The pre-replication verifier sends a message "please replicate bag X", which gets fanned out to as many replicators as exist.
You'd need a new replicator that can copy objects from S3 to the new storage provider.
You'd need a new verifier that can read objects in the new storage provider.
The replica aggregator doesn't care as much about exact storage providers; it works in terms of "primary" and "secondary" replicas (alternatively: warm/cold, active/backup). It expects to see exactly one primary replica and N secondary replicas, where N is configurable in the app config. It would need to be able to interpret a message from the bag verifier "I have verified a bag in location X in provider P", but that's it. (There's a sketch of this primary/secondary model after this list.)
The bag register also doesn't care much about exact providers; it just needs to know how to serialise the provider location as JSON for the storage manifest
The bag tagger might need to care about the new location, depending on how you're using tags.
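To make the primary/secondary model concrete, here's a minimal sketch of how the aggregator's view of a replica could be modelled. The names (Replica, PrimaryReplica, isComplete, and so on) are hypothetical, for illustration only, not the real storage-service types:

// Hypothetical types, for illustration only.
sealed trait Replica {
  def provider: String   // e.g. "amazon-s3", "azure-blob-storage", "google-cloud-storage"
  def namespace: String  // bucket / container
  def path: String       // root of the replicated bag within that namespace
}

case class PrimaryReplica(provider: String, namespace: String, path: String) extends Replica
case class SecondaryReplica(provider: String, namespace: String, path: String) extends Replica

// The aggregator treats a bag as fully replicated once it has seen exactly one primary replica
// and the configured number of secondary replicas, whichever providers they live in.
def isComplete(replicas: Seq[Replica], expectedSecondaryCount: Int): Boolean =
  replicas.count(_.isInstanceOf[PrimaryReplica]) == 1 &&
    replicas.count(_.isInstanceOf[SecondaryReplica]) == expectedSecondaryCount

A new provider only shows up here as another value of the provider field; the aggregator's counting logic doesn't change.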
App code: implement a generic storage provider trait
To allow extensibility, the Scala code in the services is designed around generic traits. Off the top of my head, this is the rough set of operations it uses:
import java.io.InputStream

trait StorageLocation {
  val namespace: String
  val key: String

  def join(suffix: String): StorageLocation
}

trait StorageProvider {
  def get(location: StorageLocation): InputStream

  def put(location: StorageLocation, inputStream: InputStream): Unit

  // Note: this is used for verification; we write checksum tags to the object after it's been verified.
  // If the storage provider doesn't support tags natively, these can be written to a sidecar database.
  def putTags(location: StorageLocation, tags: Map[String, String]): Unit

  def listPrefix(prefix: StorageLocation): List[StorageLocation]

  def copyFromS3(s3Location: StorageLocation, dst: StorageLocation): Unit

  // Used by the storage service to retry when it's safe to do so, because often even big
  // storage providers get flaky under heavy load. E.g. timeouts are retryable; object not found is terminal.
  def isRetryable(error: Throwable): Boolean
}
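To make the retryable/terminal distinction concrete, here's an illustrative sketch; the exception types are generic JVM ones chosen for the example, not the ones the real provider SDKs throw:

import java.io.FileNotFoundException
import java.net.SocketTimeoutException

// Illustrative classification only; real implementations map each provider SDK's own error types.
def isRetryable(error: Throwable): Boolean =
  error match {
    case _: SocketTimeoutException => true   // transient; safe to retry
    case _: FileNotFoundException  => false  // terminal; the object isn't there
    case _                         => false  // default to terminal so unexpected failures surface
  }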
We already have S3 and Azure implementations of these traits. Adding a third provider would require implementing this trait and plumbing it into the storage service apps that work with these locations/providers (there's a rough sketch of what a Google Cloud Storage implementation might look like after this list), including:
the bag replicator (which replicates into the location)
the bag verifier (which verifies content written into the location)
the replica aggregator and bag register (which need to know they might get replicas in that location)
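For a sense of scale, here's a partial sketch of what a Google Cloud Storage implementation of the traits above might look like, using the google-cloud-storage Java client. The names (GCSObjectLocation, GCSStorageProvider) and the exact client calls are my best guess for illustration, not working storage-service code; only get and put are shown.

import java.io.InputStream
import java.nio.channels.Channels

import com.google.cloud.storage.{BlobId, BlobInfo, StorageOptions}

// Hypothetical GCS location: the namespace is the bucket, the key is the object name.
case class GCSObjectLocation(namespace: String, key: String) extends StorageLocation {
  def join(suffix: String): StorageLocation =
    GCSObjectLocation(namespace, key.stripSuffix("/") + "/" + suffix)
}

// Partial sketch of a GCS-backed provider; tags, listing, copy-from-S3 and retry
// classification would be filled in the same way as the S3 and Azure implementations.
abstract class GCSStorageProvider extends StorageProvider {
  private val storage = StorageOptions.getDefaultInstance.getService

  override def get(location: StorageLocation): InputStream = {
    val blob = storage.get(BlobId.of(location.namespace, location.key))
    Channels.newInputStream(blob.reader())
  }

  override def put(location: StorageLocation, inputStream: InputStream): Unit = {
    val blobInfo = BlobInfo.newBuilder(BlobId.of(location.namespace, location.key)).build()
    storage.createFrom(blobInfo, inputStream)
  }
}

Everything left unimplemented in that sketch is roughly the work the compiler would force you to do before the replicator and verifier could use the new provider.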
Note that some of this generic code isn't in the storage service repo, but in the storage library of the scala-libs repo.
Infra code: configuring the new replicator in Terraform
You'd need to modify the Terraform to plumb in the new replicator/verifier.
Ideally we'd do this in a way that minimised divergence from the existing Terraform modules, so it would be easy for users to stay in step with the core modules.
Time estimate
I think this would take months, not weeks.
It took months to add support for Azure replication.
Going from 2 -> 3 should be faster than going from 1 -> 2, because we've removed a lot of the hard-coded S3 calls, but it might expose other ways in which the code isn't as extensible as we'd like.