Currently the storage service supports replicating to two storage providers:
Amazon S3
Azure Blob
We chose these because they're the two providers used by Wellcome, but the service is designed to be extensible, e.g. to support Google Cloud Storage. What would it take?
These are some rough notes, not a comprehensive work list; they're meant to give a finger-in-the-air estimate.
Assumptions
Every new bag will be replicated to every storage provider. Supporting mixed locations (where different bags go to different sets of providers) would be additional work.
Where it fits in the ingest process
This is where a new storage provider would sit in the ingest process:
Everything before the pre-replication verifier is happening in S3; it doesn't care about replication locations.
The pre-replication verifier sends a message "please replicate bag X", which gets fanned out to as many replicators as exist.
You'd need a new replicator that can copy objects from S3 to the new storage provider.
You'd need a new verifier that can read objects in the new storage provider.
The replica aggregator doesn't care as much about exact storage providers; it works in terms of "primary" and "secondary" replicas (alternatively: warm/cold, active/backup). It expects to see exactly one primary replica and N secondary replicas, where N is configurable in the app config. It would need to be able to interpret a message from the bag verifier "I have verified a bag in location X in provider P", but that's it. (There's a sketch of this primary/secondary model after this list.)
The bag register also doesn't care much about exact providers; it just needs to know how to serialise the provider location as JSON for the storage manifest
The bag tagger might need to care about the new location, depending on how you're using tags.
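To make the primary/secondary model concrete, here's a minimal sketch of how the aggregator's view of a replica could be modelled. The names (Replica, PrimaryReplica, isComplete, and so on) are hypothetical, for illustration only, not the real storage-service types:

// Hypothetical types, for illustration only.
sealed trait Replica {
  def provider: String   // e.g. "amazon-s3", "azure-blob-storage", "google-cloud-storage"
  def namespace: String  // bucket / container
  def path: String       // root of the replicated bag within that namespace
}

case class PrimaryReplica(provider: String, namespace: String, path: String) extends Replica
case class SecondaryReplica(provider: String, namespace: String, path: String) extends Replica

// The aggregator treats a bag as fully replicated once it has seen exactly one primary replica
// and the configured number of secondary replicas, whichever providers they live in.
def isComplete(replicas: Seq[Replica], expectedSecondaryCount: Int): Boolean =
  replicas.count(_.isInstanceOf[PrimaryReplica]) == 1 &&
    replicas.count(_.isInstanceOf[SecondaryReplica]) == expectedSecondaryCount

A new provider only shows up here as another value of the provider field; the aggregator's counting logic doesn't change.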
App code: implement a generic storage provider trait
To allow extensibility, the Scala code in the services is designed around generic traits. Off the top of my head, this is the rough set of operations it uses:
import java.io.InputStream

trait StorageLocation {
  val namespace: String
  val key: String

  def join(suffix: String): StorageLocation
}

trait StorageProvider {
  def get(location: StorageLocation): InputStream

  def put(location: StorageLocation, inputStream: InputStream): Unit

  // Note: this is used for verification; we write checksum tags to the object after it's been verified.
  // If the storage provider doesn't support tags natively, these can be written to a sidecar database.
  def putTags(location: StorageLocation, tags: Map[String, String]): Unit

  def listPrefix(prefix: StorageLocation): List[StorageLocation]

  def copyFromS3(s3Location: StorageLocation, dst: StorageLocation): Unit

  // Used by the storage service to retry when it's safe to do so, because often even big
  // storage providers get flaky under heavy load. E.g. timeouts are retryable; object not found is terminal.
  def isRetryable(error: Throwable): Boolean
}
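To make the retryable/terminal distinction concrete, here's an illustrative sketch; the exception types are generic JVM ones chosen for the example, not the ones the real provider SDKs throw:

import java.io.FileNotFoundException
import java.net.SocketTimeoutException

// Illustrative classification only; real implementations map each provider SDK's own error types.
def isRetryable(error: Throwable): Boolean =
  error match {
    case _: SocketTimeoutException => true   // transient; safe to retry
    case _: FileNotFoundException  => false  // terminal; the object isn't there
    case _                         => false  // default to terminal so unexpected failures surface
  }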
We already have S3 and Azure implementations of these traits. Adding a third provider would require implementing this trait and plumbing it into the storage service apps that work with these locations/providers (there's a rough sketch of what a Google Cloud Storage implementation might look like after this list), including:
the bag replicator (which replicates into the location)
the bag verifier (which verifies content written into the location)
the replica aggregator and bag register (which need to know they might get replicas in that location)
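For a sense of scale, here's a partial sketch of what a Google Cloud Storage implementation of the traits above might look like, using the google-cloud-storage Java client. The names (GCSObjectLocation, GCSStorageProvider) and the exact client calls are my best guess for illustration, not working storage-service code; only get and put are shown.

import java.io.InputStream
import java.nio.channels.Channels

import com.google.cloud.storage.{BlobId, BlobInfo, StorageOptions}

// Hypothetical GCS location: the namespace is the bucket, the key is the object name.
case class GCSObjectLocation(namespace: String, key: String) extends StorageLocation {
  def join(suffix: String): StorageLocation =
    GCSObjectLocation(namespace, key.stripSuffix("/") + "/" + suffix)
}

// Partial sketch of a GCS-backed provider; tags, listing, copy-from-S3 and retry
// classification would be filled in the same way as the S3 and Azure implementations.
abstract class GCSStorageProvider extends StorageProvider {
  private val storage = StorageOptions.getDefaultInstance.getService

  override def get(location: StorageLocation): InputStream = {
    val blob = storage.get(BlobId.of(location.namespace, location.key))
    Channels.newInputStream(blob.reader())
  }

  override def put(location: StorageLocation, inputStream: InputStream): Unit = {
    val blobInfo = BlobInfo.newBuilder(BlobId.of(location.namespace, location.key)).build()
    storage.createFrom(blobInfo, inputStream)
  }
}

Everything left unimplemented in that sketch is roughly the work the compiler would force you to do before the replicator and verifier could use the new provider.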
Note that some of this generic code isn't in the storage service repo, but in the storage library of the scala-libs repo.
Infra code: configuring the new replicator in Terraform
You'd need to modify the Terraform to plumb in the new replicator/verifier.
Ideally we'd do this in a way that minimised divergence from the existing Terraform modules, so it would be easy for users to stay in step with the core modules.
Time estimate
I think this would take months, not weeks.
It took months to add support for Azure replication.
Going from 2 -> 3 should be faster than going from 1 -> 2, because we've removed a lot of the hard-coded S3 calls, but it might expose other ways in which the code isn't as extensible as we'd like.