Design question — multiple storage backends or locations #255

Open
alexdutton opened this issue Aug 5, 2020 · 1 comment

Comments

@alexdutton
Member

As I understand it, the current design/implementation has the following:

  • a single swappable storage factory, configured by FILES_REST_STORAGE_FACTORY, the provided implementation being pyfs_storage_factory. As far as I can tell, this implementation isn't particularly/actually PyFS-specific.
  • invenio-s3 has its own storage factory, which implies that it's not currently possible to have multiple storage backends in use at the same time, without creating a new storage factory that hands off to the right storage class.
  • potentially many storage class implementations (with the default configured by FILES_REST_DEFAULT_STORAGE_CLASS)
  • the storage class can apparently be specified/overridden as Bucket.default_storage_class, FileInstance.storage_class (both a db.Column(db.String(1)), which implies it's never been used? see source)
  • according to the entity relationship diagram, a bucket has a default location, which is used to construct the FileInstance.uri. The location is not used after that, and so file URIs for filesystem-stored files are absolute paths on disk.
  • the storage class is passed into the factory by the models API user, e.g. invenio_files_rest.tasks:merge_multipartobject calls mp.merge_parts(), which has a **kwargs that gets passed all the way through to FileInstance.storage(), where storage_class defaults to the one specified by FILES_REST_DEFAULT_STORAGE_CLASS. In this scenario, the storage_class on the FileInstance gets ignored.

This seems a bit complicated, and doesn't yet necessarily meet its objectives of having a tidy interface and support for multiple storage backends.

What I would have expected:

  • A Location defines a storage class and a base path, e.g. (abstractly) (pyfs, /some/path), (s3, /bucket-name/some-path) or (s3bucket, /some/path)
  • A Bucket provides for a default location for new files, defaulting to the global default location
  • A FileInstance has a Location and a path. The location defaults to the bucket's default location at creation time. The base path for the bucket and the path for the file instance are combined and used by the storage backend defined on the location
  • Storage backends are declared using entrypoints, with the entrypoint name being used as the key for the storage backend on the Location model. This key→class mapping can be overridden in config
  • FileInstance.storage() takes no **kwargs, and creates the storage class instance based on the above. Maybe it's a (cached?) property. Its behaviour is contained in a function that replaces the storage class factory and can be overridden, but generally an implementer would leave this alone unless they want to do something special.
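
To make the expected design above concrete, here is a minimal sketch in plain Python. It is not Invenio code: the dataclasses stand in for the database models, `STORAGE_BACKENDS` stands in for an entrypoint-populated (and config-overridable) mapping, and `Storage`, `register_backend` and the `storage_backend` / `base_path` field names are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Storage:
    """Hypothetical stand-in for a storage backend bound to one full path."""
    backend: str
    full_path: str


# Hypothetical key -> factory mapping. In the proposal this would be filled
# from entrypoints and be overridable in config; here it is a plain dict.
STORAGE_BACKENDS: Dict[str, Callable[[str], Storage]] = {}


def register_backend(key: str):
    """Register a storage factory under the key stored on a Location."""
    def decorator(factory):
        STORAGE_BACKENDS[key] = factory
        return factory
    return decorator


@register_backend("pyfs")
def make_pyfs_storage(full_path: str) -> Storage:
    return Storage("pyfs", full_path)


@register_backend("s3")
def make_s3_storage(full_path: str) -> Storage:
    return Storage("s3", full_path)


@dataclass
class Location:
    name: str
    storage_backend: str  # key into STORAGE_BACKENDS
    base_path: str


@dataclass
class Bucket:
    default_location: Location


@dataclass
class FileInstance:
    path: str
    location: Location

    @classmethod
    def create(cls, bucket: Bucket, path: str,
               location: Optional[Location] = None) -> "FileInstance":
        # The location falls back to the bucket's default at creation time.
        return cls(path=path, location=location or bucket.default_location)

    @property
    def storage(self) -> Storage:
        # No **kwargs: everything needed is derived from the location.
        factory = STORAGE_BACKENDS[self.location.storage_backend]
        base = self.location.base_path.rstrip("/")
        return factory(base + "/" + self.path.lstrip("/"))
```

Under these assumptions, hooking up multiple S3 buckets would amount to creating one `Location` per bucket, each with `storage_backend="s3"` and a different `base_path`, for example `Location("archive", "s3", "/bucket-name/some-path")`.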

This GitHub issue came about because I was trying to work out how one would support hooking up to multiple S3 buckets. My assumption was that there would be one location per bucket, but it wasn't immediately clear how to integrate this into the existing framework.

Could we improve on how this all works before we have a public release?

@lnielsen
Member

lnielsen commented Aug 5, 2020

A couple of comments about some of the design choices:

  • storage_class was originally intended as a property to be interpreted by the storage factory, used e.g. to distinguish offline/online files or reduced redundancy within the same storage system. This way, an integrity checker would know that it had to issue a fetch to get the offline file first, before being able to check integrity. Similarly, you might optimize storage costs by having fewer replicas. That said, it never really got into use and is now always the same. So it should probably either be removed, or we need a real use case.

  • FileInstance is storing the full URI to a file. It was a deliberate design choice to not tie it to a specific location, to ensure that files were not bound to a specific location. E.g. you could point a FileInstance URI into an old file hierarchy from another repository.

  • Location was by design only meant as a way to seed the initial location of a file (because we wanted a FileInstance to point directly to the file).

  • PyFilesystem is a storage abstraction layer that's able to access multiple backends, which makes it a bit confusing to also have a Files-REST FileStorage class and a property on the db models called storage_class. Thus a bit of clarification:

    • storage_class is not actually pointing to e.g. a specific FileStorage class, but it was supposed to be interpreted by an implementation of FileStorage.
    • Files-REST FileStorage class defines the interface by which Invenio can get access to bits and bytes and defines the minimal operations (like open, send_file, checksum, copy).
    • PyFilesystem is used internally by PyFSFileStorage. This makes it easy to provide access to multiple backends, without having to implement a new Invenio FileStorage class for each backend. However, implementing a new PyFilesystem backend is a big task that also involves e.g. implementing listdir and tons of other methods we don't need. Also, some PyFilesystem implementations are not very efficient, in that they store a file in temporary space first, before sending it to remote storage (this essentially kills performance and possibly makes a server run out of memory for GB/TB sized files).
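
As a rough illustration of the "minimal operations" idea, the sketch below shows how small such an interface can be compared with a full filesystem backend. It is not the actual Files-REST `FileStorage` class: `MinimalFileStorage` and `LocalFileStorage` are hypothetical names, `send_file` is left out because it needs a web framework, and the MD5 checksum format is an assumption. Backends only implement `open`; `checksum` and `copy` stream in chunks so a large file is never held in memory or staged in temporary space.

```python
import hashlib
from abc import ABC, abstractmethod

CHUNK_SIZE = 1024 * 1024  # read and write 1 MiB at a time


class MinimalFileStorage(ABC):
    """Hypothetical minimal interface; a backend only has to provide open()."""

    @abstractmethod
    def open(self, mode="rb"):
        """Return a binary file-like object usable as a context manager."""

    def checksum(self):
        # Stream the file through the hash instead of loading it whole.
        digest = hashlib.md5()
        with self.open("rb") as fp:
            for chunk in iter(lambda: fp.read(CHUNK_SIZE), b""):
                digest.update(chunk)
        return "md5:" + digest.hexdigest()

    def copy(self, src):
        # Chunked copy from another storage object into this one.
        with src.open("rb") as fp_src, self.open("wb") as fp_dst:
            for chunk in iter(lambda: fp_src.read(CHUNK_SIZE), b""):
                fp_dst.write(chunk)


class LocalFileStorage(MinimalFileStorage):
    """Toy backend for a single file on the local disk."""

    def __init__(self, path):
        self.path = path

    def open(self, mode="rb"):
        # Inside a method body, open() resolves to the builtin.
        return open(self.path, mode)
```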

I think the issue you point to is that it's hard today in Invenio to use multiple Invenio FileStorage classes. You would have to implement your own custom storage factory that inspects the URI. I'm not sure what an ideal solution looks like; we can discuss that further IRL.
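
For reference, a custom storage factory that inspects the URI could look roughly like the sketch below. It is an assumption-laden illustration rather than working Invenio integration: the factory signature (`fileinstance`, `fileurl`, extra keyword arguments) is assumed, and `LocalStorageStub` and `S3StorageStub` are hypothetical stand-ins for real FileStorage implementations.

```python
from urllib.parse import urlparse


class LocalStorageStub:
    """Hypothetical stand-in for a filesystem-backed FileStorage class."""

    def __init__(self, fileurl, **kwargs):
        self.fileurl = fileurl


class S3StorageStub:
    """Hypothetical stand-in for an S3-backed FileStorage class."""

    def __init__(self, fileurl, **kwargs):
        self.fileurl = fileurl


# URI scheme -> storage class. Absolute paths on disk have no scheme, so
# the empty string maps to the local backend.
STORAGE_BY_SCHEME = {
    "": LocalStorageStub,
    "file": LocalStorageStub,
    "s3": S3StorageStub,
}


def dispatching_storage_factory(fileinstance=None, fileurl=None, **kwargs):
    """Pick a storage class from the scheme of the file URI."""
    uri = fileurl or fileinstance.uri
    scheme = urlparse(uri).scheme
    try:
        storage_cls = STORAGE_BY_SCHEME[scheme]
    except KeyError:
        raise ValueError("No storage backend for scheme %r" % scheme)
    return storage_cls(uri, **kwargs)
```

Such a function would then be the one named in FILES_REST_STORAGE_FACTORY. Note that this only distinguishes backends by scheme; telling several S3 buckets apart would additionally require looking at the rest of the URI.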
