File management in S3 buckets
(Historical, from original implementation)
All our derivatives are currently stored on S3 with public access control, even if the object is not marked public in the app. To access one, someone would need to guess a non-obvious URL, but this should not be considered protection against anything but the most casual interest.
Rather, we have just decided that none of our materials are particularly sensitive or confidential; non-public materials are mostly just those that haven't been finalized yet. So we have not prioritized the extra app dev time to ensure derivatives are protected on S3. (The trick is mostly around making things with cacheable URLs that are still protected on S3.) We may do so in the future as priorities and materials change.
The originals are secure in S3, and only available via a signed URL delivered by the app to an authorized user, because that was easier to do without caching or other performance concerns.
References to "env" or "config" variables are to keys controlled by ScihistDigicoll::Env, which allows setting via a ./config/local_env.yml file (set on our AWS machines by ansible) as lowercase, or on a dev machine in ./config/local_env_development.yml, or in shell ENV as uppercase.
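As a rough illustration of that lookup, here is a minimal sketch. It is not the real ScihistDigicoll::Env API: the helper name `lookup_setting` is hypothetical, and the precedence shown (shell ENV wins over the YAML file) is an assumption, since the order is not stated here.

```ruby
require "yaml"

# Hypothetical helper, NOT the real ScihistDigicoll::Env implementation.
# Checks the shell environment under the UPPERCASE name first, then falls
# back to values loaded from a local_env YAML file under the lowercase name.
# The precedence order is an assumption made for this sketch.
def lookup_setting(key, shell_env: ENV, file_values: {})
  shell_value = shell_env[key.to_s.upcase]
  return shell_value unless shell_value.nil?

  file_values[key.to_s.downcase]
end

# Stand-in for the parsed contents of ./config/local_env.yml
file_values = YAML.safe_load(<<~YAML)
  s3_bucket_originals: scihi-kithe-stage-originals
  s3_bucket_uploads: scihi-kithe-stage-uploads
YAML

# Value comes from the file, since the (empty) shell env has no override.
puts lookup_setting(:s3_bucket_originals, shell_env: {}, file_values: file_values)

# Value comes from the shell env, using the uppercase form of the key.
puts lookup_setting(:s3_bucket_uploads,
                    shell_env: { "S3_BUCKET_UPLOADS" => "my-local-uploads" },
                    file_values: file_values)
```

The bucket name `my-local-uploads` is made up for the example; the other two values are the staging bucket names listed on this page.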
Different classes of objects we want in S3, divided into categories depending on access control, lifecycle management, and/or CORS setting differences. I may be missing some, hopefully not, but other categories may come up as we develop!
- No public ACLs, S3 policies preventing
- local_env key: s3_bucket_originals
- staging bucket: scihi-kithe-stage-originals
- special backup/lifecycle treatment (documented (in text or code) where?)
- DO have public ACLs right now, we'd need to do more app work to keep them private
- local_env key: s3_bucket_derivatives
- staging bucket: scihi-kithe-stage-derivatives
- Can be re-generated at any time, loss would at most be temporary (although can take a while to regenerate)
- backup details: TBD
- No public ACLs, S3 policies preventing
- Uploaded from a browser directly to an S3 bucket. Corresponds to the "uploaded files" location in sufia, but in the new app this doesn't need to be mounted block storage, it can be all S3.
- Don't need to be backed up at all.
- local_env key: s3_bucket_uploads
- staging bucket: scihi-kithe-stage-uploads
- DO need lifecycle rules to purge files older than X days. The app can't clean up after itself in all cases, as files can be abandoned by a browser mid-process with no way for the app to know the next ingest step won't be taken.
- Does need special CORS settings
- no public ACL needed, S3 permissions limiting such recommended
- The one that gets an icon on Windows desktops, used for ingest process
- Currently shared with production sufia app, one windows-mounted ingest bucket
- local_env key: s3_bucket_ingest
- bucket_name: scih-uploads
- DO need to have public S3 ACLs, as we still haven't figured out any good access controls that work with our setup
- Are otherwise derivatives, where backups are not important
- See the note about regeneration time in derivatives
- Don't need public acl
- Can be re-generated at any time, in fact our front-end code will automatically re-generate on demand
- We treat these as 'cache', don't actually want to hold onto them forever, so also need lifecycle rules to delete files that haven't been accessed in X days
- Currently on S3 mostly because it avoids the need for another persistent file system (that can be served by the web)
- Doesn't need to be backed up at all, doesn't need any lifecycle management
- public ACL
- Since this is a single file, we can look at this on the file level rather than a bucket/prefix level if we want.
Any other categories/types of S3 files? Not that I'm thinking of now!
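The "purge files older than X days" requirement for the uploads bucket could be expressed as an S3 lifecycle rule. The sketch below only builds the rule as a Ruby hash and prints it. The hash shape follows my understanding of what the aws-sdk-s3 gem's `put_bucket_lifecycle_configuration` accepts as `lifecycle_configuration`, so check it against the SDK documentation before use. The helper name, the rule id, and the 7-day value are all placeholders, not decisions recorded on this page.

```ruby
require "json"

# Hypothetical helper: builds one lifecycle rule that expires every object
# under `prefix` (empty string = whole bucket) once it is `days` days old.
def expire_after_days_rule(id:, days:, prefix: "")
  {
    id: id,
    status: "Enabled",
    filter: { prefix: prefix },
    expiration: { days: days }
  }
end

# Placeholder for the "X days" mentioned above; the real value is undecided.
UPLOAD_RETENTION_DAYS = 7

lifecycle_configuration = {
  rules: [
    expire_after_days_rule(id: "purge-abandoned-uploads", days: UPLOAD_RETENTION_DAYS)
  ]
}

# Print the configuration for review; applying it to a bucket is left out.
puts JSON.pretty_generate(lifecycle_configuration)
```

One caveat for the cache-like category: S3 lifecycle expiration counts days since the object was created, not since it was last accessed, so a rule like this would approximate "delete files not accessed in X days" only in the sense of deleting anything older than X days and letting the app regenerate it on demand.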
The app has a mode for using S3 in development, where it puts all files in a shared S3 dev bucket. This is triggered by env storage_mode: dev_s3, which is the default in dev.
It will use the bucket in env s3_dev_bucket (default scih-uploads-dev), and put a given dev app's files in a segregated prefix, by default created from the username and hostname where the app is running, but which can be set with env s3_dev_prefix.
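A minimal sketch of how such a per-developer prefix might be derived is below. The helper name `dev_s3_prefix` and the `username.hostname` format are assumptions for illustration; the format the app actually uses is not documented here and may differ.

```ruby
require "etc"
require "socket"

# Hypothetical helper, not the app's real code. Returns the explicit
# override (standing in for env s3_dev_prefix) when one is given, otherwise
# builds a prefix from the current username and hostname.
# The "username.hostname" format is an assumption.
def dev_s3_prefix(env_override: nil, username: Etc.getlogin, hostname: Socket.gethostname)
  return env_override unless env_override.nil? || env_override.empty?

  "#{username}.#{hostname}"
end

# Default behaviour, with fixed values so the result is predictable.
puts dev_s3_prefix(username: "alice", hostname: "dev-laptop.local")

# An explicit override takes priority over the derived value.
puts dev_s3_prefix(env_override: "shared-test", username: "alice", hostname: "dev-laptop.local")
```

Keeping each developer's files under a distinct prefix is what lets several dev apps share the one bucket without overwriting each other's objects.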