How do we delete bags from the storage service? #1004

Open
alexwlchan opened this issue Jun 10, 2022 · 0 comments
At some point, we'll need to delete content from the storage service, e.g. the Miro images in cold store, born-digital accessions that have been catalogued, or material we need to de-accession.

The current process for deleting bags involves a script I wrote several years ago, which obliterates all knowledge of the bag from the universe. It's not a great solution, and we should try to build a better implementation.

These are my initial notes on the topic, based on my thinking and some chats with Robert before he left. When we implement this properly, we should write it up as a proper RFC.

Desirable properties

  • Deletion is strongly access-controlled – it should not be possible without prior approval, including by developers
  • There should be an audit trail of all deletions, including the timestamp, the name of the person who initiated the deletion, and why they did it
  • We should have a way to roll back if we discover a bag was deleted accidentally or maliciously
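
To make the audit requirement concrete, here is a minimal sketch of what one audit entry could hold. The class name, field names and the `new_audit_record` helper are all hypothetical; nothing like this exists in the storage service yet, and the real schema would be settled in the RFC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DeletionAuditRecord:
    """One entry in a hypothetical audit table of deletions."""

    bag_id: str           # assumed identifier for the bag
    version: str          # exact version that was deleted
    deleted_by: str       # name of the person who initiated the deletion
    reason: str           # why they did it
    deleted_at: datetime  # when the deletion was recorded (UTC)


def new_audit_record(bag_id: str, version: str, deleted_by: str, reason: str) -> DeletionAuditRecord:
    """Build an audit record, refusing anonymous or unexplained deletions."""
    if not deleted_by.strip() or not reason.strip():
        raise ValueError("a deletion needs both a named person and a reason")
    return DeletionAuditRecord(
        bag_id=bag_id,
        version=version,
        deleted_by=deleted_by,
        reason=reason,
        deleted_at=datetime.now(timezone.utc),
    )
```

Making the record frozen reflects the idea that an audit entry should not be edited after the fact; how that is enforced in a real table is a separate question.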

High-level proposal

Within the storage service, an ingest is a record of some processing on a bag. Currently we support two ingest types:

  • create a new copy of a bag
  • update an existing copy of a bag

Deleting a bag could be another type of ingest.
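
As a rough illustration of treating deletion as a third ingest type, the sketch below models the two existing types plus a proposed `DELETE`. The names `IngestType`, `IngestRequest`, `bag_id` and the string values are assumptions made for this example, not the storage service's actual model. It also encodes the idea that a delete must name an exact version.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IngestType(Enum):
    CREATE = "create"  # create a new copy of a bag
    UPDATE = "update"  # update an existing copy of a bag
    DELETE = "delete"  # proposed in these notes, not implemented


@dataclass(frozen=True)
class IngestRequest:
    """Hypothetical request shape shared by all ingest types."""

    ingest_type: IngestType
    bag_id: str
    version: Optional[str] = None  # only required for deletions in this sketch

    def __post_init__(self) -> None:
        # A delete must name the exact version, so nothing is removed by accident.
        if self.ingest_type is IngestType.DELETE and not self.version:
            raise ValueError("a delete ingest must name the exact bag version")
```

For example, `IngestRequest(IngestType.DELETE, "b1234", "v1")` is accepted, while the same request without a version raises `ValueError`.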

The rough approach

  1. We set S3 bucket policies on the two main storage buckets that:

    • Prevent anybody from deleting objects from the bucket
    • Prevent anybody from updating the bucket policy

    This means only the root user would be able to modify the bucket policy (and thus allow objects to be deleted). This is stronger than what we have today.

  2. If you want to delete an object from the bucket, you get the root user credentials and modify this policy to allow deleting the specific bags you want to delete.

    (I'm not entirely sure what this would look like, but I think such a hole-punched policy is possible. You might need a condition on the Deny statement, e.g. if prefix is not "to_be_deleted/v1" then Deny any calls to "s3:DeleteObject". We'd need to investigate this further.)

  3. Modify the Azure container to allow deletions.

    (I don't remember the specifics, just that deletions are currently prevented and need to be unlocked.)

  4. Then you make an API call to the storage service. This would be a DELETE to the /ingests endpoint, specifying the bag you want to delete.

    (We might put separate authentication on this endpoint. We might want to investigate tying it to Azure AD SSO, so we can trace delete requests to specific people. We will require you to specify the exact version of the bag you want to delete.)

  5. The storage service deletes the bag. This involves:

    • Replicating and verifying the bag to a working storage area which has a 90-day expiry policy. (This gives us easier rollbacks, but we already have versioning on the main bucket, so do we need this? How would it interact with Glacier'd objects?)
    • Deleting the bag from all its storage locations (this needs to include our Azure tag cache)
    • Updating the bags API to mark this bag as deleted
    • Recording this deletion in an audit table somewhere
    • (Optional) Sending an email or Slack notification, so the deletion is broadly visible and we can see if an accidental/unauthorised deletion occurred that needs rolling back
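
The hole-punched policy from steps 1 and 2 might look something like the sketch below. It uses `NotResource` on the Deny statement rather than a condition, which is one way to express "deny deletes everywhere except these prefixes". The function name, bucket name and statement IDs are made up for illustration, and this has not been tested against a real bucket, so treat it as a starting point for the investigation rather than a working policy.

```python
import json
from typing import Sequence


def build_bucket_policy(bucket: str, deletable_prefixes: Sequence[str]) -> dict:
    """Build a policy that denies deletes except under the given prefixes.

    With no prefixes, deletes are denied across the whole bucket.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"

    deny_deletes = {
        "Sid": "DenyDeletesOutsideApprovedPrefixes",
        "Effect": "Deny",
        "Principal": "*",
        # DeleteObjectVersion matters because the main bucket is versioned.
        "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
    }
    if deletable_prefixes:
        # Everything that is NOT under an approved prefix stays protected.
        deny_deletes["NotResource"] = [
            f"{bucket_arn}/{prefix.rstrip('/')}/*" for prefix in deletable_prefixes
        ]
    else:
        deny_deletes["Resource"] = f"{bucket_arn}/*"

    deny_policy_changes = {
        "Sid": "DenyPolicyChanges",
        "Effect": "Deny",
        "Principal": "*",
        "Action": ["s3:PutBucketPolicy", "s3:DeleteBucketPolicy"],
        "Resource": bucket_arn,
    }

    return {"Version": "2012-10-17", "Statement": [deny_deletes, deny_policy_changes]}


if __name__ == "__main__":
    # Example: temporarily allow deletes under the prefix used in the notes.
    print(json.dumps(build_bucket_policy("example-bucket", ["to_be_deleted/v1"]), indent=2))
```

Calling the function with an empty list gives the locked-down policy from step 1. Calling it with a prefix gives the temporary policy from step 2, which the root user would apply and then revert once the deletion is done.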

Properties of this approach

  • We'd get stronger overall protection on the buckets than we have now – although our dev roles are prevented from deleting objects, it's possible for us to override them (albeit hard to do unintentionally)
  • It would be a documented, tested approach with an audit trail, rather than the manual process we have now
  • Versions are truly immutable – once bag V1 is written, it will always contain the same files, or be deleted. The current approach also obliterates the version record, so bag V1 could refer to different files at different times.

Questions

  • How do we prevent "races" between bag verification/deletion?

    e.g. I store a V1 bag. Then I store a V2 bag that refers back to the V1 bag (using fetch.txt) at the same time as deleting the V1 bag. The V2 bag may pass verification, but once the V1 bag is deleted it's broken. How do we prevent this?

    We probably start by saying "you can only delete the latest version of a bag". How do we stop a new version being written at the same time?
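
    One possible direction, offered only as a sketch: give each bag a single "operation in progress" slot, so a write and a delete can never overlap, and only accept deletes for the latest live version. Everything below (the class, its method names, the in-memory dictionaries) is hypothetical, and a real implementation would need a lock shared across services rather than an in-process one.

```python
import threading


class BagOperationTable:
    """In-memory stand-in for a shared per-bag lock table (hypothetical)."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._highest = {}  # bag_id -> highest version number ever assigned
        self._live = {}     # bag_id -> ascending list of versions not deleted
        self._busy = {}     # bag_id -> "write" or "delete" while in progress

    def _claim(self, bag_id: str, operation: str) -> None:
        # Caller must already hold self._mutex.
        current = self._busy.get(bag_id)
        if current is not None:
            raise RuntimeError(f"{bag_id} is busy with a {current}")
        self._busy[bag_id] = operation

    def begin_write(self, bag_id: str) -> int:
        """Reserve the next version number; numbers are never reused."""
        with self._mutex:
            self._claim(bag_id, "write")
            version = self._highest.get(bag_id, 0) + 1
            self._highest[bag_id] = version
            return version

    def finish_write(self, bag_id: str, version: int) -> None:
        with self._mutex:
            self._live.setdefault(bag_id, []).append(version)
            del self._busy[bag_id]

    def begin_delete(self, bag_id: str, version: int) -> None:
        """Allow a delete only for the latest live version of an idle bag."""
        with self._mutex:
            live = self._live.get(bag_id, [])
            if not live or live[-1] != version:
                raise ValueError("only the latest version can be deleted")
            self._claim(bag_id, "delete")

    def finish_delete(self, bag_id: str, version: int) -> None:
        with self._mutex:
            self._live[bag_id].remove(version)
            del self._busy[bag_id]
```

    In this sketch, a V2 write that starts before the V1 delete blocks the delete, and once V2 is written, V1 is no longer the latest version and cannot be deleted at all. Whether that restriction is acceptable is still an open question.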