File storage interface improvements #256

Draft
wants to merge 41 commits into
base: master
41 commits
7dd691d
Entry points for storage backends
alexdutton Aug 19, 2020
0bf2176
Add a class for computing checksums, counting bytes and reporting pro…
alexdutton Aug 19, 2020
7096e97
Add a new extensible storage factory
alexdutton Aug 19, 2020
9715111
_FilesRESTState.storage_factory deprecates old storage factory, can i…
alexdutton Aug 19, 2020
7774b7d
Add some type annotations
alexdutton Aug 19, 2020
bca7ee2
Deprecate passing **kwargs to the storage factory
alexdutton Aug 19, 2020
f5244b3
Use the passthrough checksummer
alexdutton Aug 19, 2020
c39871a
Add a metaclass for FileStorage to return the backend name
alexdutton Aug 19, 2020
37bd798
Simplify the pyfs storage backend implementation
alexdutton Aug 19, 2020
568cf75
fixup! Add a new extensible storage factory
alexdutton Aug 19, 2020
9bcd900
fixup! Add a new extensible storage factory
alexdutton Aug 20, 2020
6e94128
intermediate
alexdutton Sep 8, 2020
5899f79
Lots of things. Needs pulling apart before merging.
alexdutton Oct 6, 2020
2e1a82a
Remove debugging print statements
alexdutton Oct 6, 2020
282e618
Fix mypy errors
alexdutton Oct 6, 2020
c62cfa0
Remove StorageBackend metaclass
alexdutton Oct 6, 2020
88e5fa8
Fix typo
alexdutton Oct 6, 2020
705b95e
Reinstate _write_stream, and use context managers
alexdutton Oct 6, 2020
93bd542
Improve checksumming
alexdutton Oct 6, 2020
a8fbd15
Add the beginning of documentation for writing storage backends
alexdutton Oct 6, 2020
8a25500
Opening via the open() method should only be for reading
alexdutton Oct 13, 2020
7820f84
Add delete() to list of methods in docs that need defining
alexdutton Oct 13, 2020
d4e660e
Simplify by removing now-unused PassthroughChecksum
alexdutton Oct 13, 2020
a508e3e
No longer need to get class object to look up storage name
alexdutton Oct 13, 2020
429504f
docstrings for the storage factory
alexdutton Oct 13, 2020
0aeefc4
pydocstyle fixes
alexdutton Oct 13, 2020
8a55a0b
Remove unused method on base StorageBackend
alexdutton Oct 13, 2020
f03154a
list to tuple for __all__, to placate pydocstyle
alexdutton Oct 13, 2020
501f35c
docstring fixes for pydocstyle
alexdutton Oct 14, 2020
5ac96fe
Remove unused imports and unused Py3.7+ annotations future
alexdutton Oct 14, 2020
5546899
PEP-8 style fixes. Mostly line-length-related.
alexdutton Oct 14, 2020
6b33fd8
More PEP-8 fixes
alexdutton Oct 14, 2020
da397f8
The last PEP-8 warnings
alexdutton Oct 14, 2020
f56a7c3
Run isort, for CI happiness
alexdutton Oct 14, 2020
3b0868f
Remove more extraneous code
alexdutton Oct 14, 2020
985d64d
Add migration for backend_name
alexdutton Oct 14, 2020
55d059f
Coalesce on `storage_backend`, not `backend_name` for model attr
alexdutton Oct 14, 2020
0fede69
fixup! Add migration for backend_name
alexdutton Oct 14, 2020
dd1a557
More style fixes in the alembic migration
alexdutton Oct 15, 2020
142660d
global: copyright headers update
lnielsen Nov 20, 2020
4879f59
models: remove python 2 support
lnielsen Nov 20, 2020
3 changes: 2 additions & 1 deletion LICENSE
@@ -1,6 +1,7 @@
 MIT License

-Copyright (C) 2015-2019 CERN.
+Copyright (C) 2015-2020 CERN.
+Copyright (C) 2020 Cottage Labs LLP.

 Permission is hereby granted, free of charge, to any person obtaining a copy of
 this software and associated documentation files (the "Software"), to deal in
1 change: 1 addition & 0 deletions docs/index.rst
@@ -22,6 +22,7 @@ Invenio-Files-REST.
    installation
    configuration
    usage
+   new-storage-backend


 API Reference
98 changes: 98 additions & 0 deletions docs/new-storage-backend.rst
@@ -0,0 +1,98 @@
..
    This file is part of Invenio.
    Copyright (C) 2020 Cottage Labs LLP

    Invenio is free software; you can redistribute it and/or modify it
    under the terms of the MIT License; see LICENSE file for more details.


Developing a new storage backend
================================

A storage backend should subclass ``invenio_files_rest.storage.StorageBackend`` and should minimally implement the
``open()``, ``delete()``, ``_initialize(size)``, ``get_save_stream()`` and ``update_stream(seek)`` methods. You should
register the backend with an entry point in your ``setup.py``:

.. code-block:: python

    setup(
        ...,
        entry_points={
            ...,
            'invenio_files_rest.storage': [
                'mybackend = mypackage.storage:MyStorageBackend',
            ],
            ...
        },
        ...
    )

Implementation
--------------

The base class handles reporting progress, file size limits and checksumming.

Here's an example implementation of a storage backend that stores files remotely over HTTP with no authentication.

.. code-block:: python

    import contextlib
    import http.client
    import urllib.parse
    import urllib.request

    from invenio_files_rest.storage import StorageBackend


    class HTTPStorageBackend(StorageBackend):
        def open(self):
            # Open the remote file for reading; `closing` ensures the response is closed.
            return contextlib.closing(urllib.request.urlopen(self.uri))

        def _initialize(self, size=0):
            # Allocate space for the file. You can use `self.uri` as a suggested location, or return
            # a new location as e.g. `{"uri": the_new_uri}`.
            request = urllib.request.Request(
                self.uri, method='POST', headers={'X-Expected-Size': str(size)}
            )
            urllib.request.urlopen(request).close()
            return {}

        @contextlib.contextmanager
        def get_save_stream(self):
            # This should be a context manager (i.e. something that can be used in a `with` statement)
            # which closes the file when exiting the context manager and performs any clear-up if
            # an error occurs.
            parsed_uri = urllib.parse.urlparse(self.uri)
            # Assume the URI is HTTP, not HTTPS, and doesn't have a port or querystring
            conn = http.client.HTTPConnection(parsed_uri.netloc)

            conn.putrequest('PUT', parsed_uri.path)
            conn.endheaders()

            # HTTPConnections have a `send` method, whereas a file-like object should have `write`
            conn.write = conn.send

            try:
                # Yield the connection itself, which now exposes `write`
                yield conn
                response = conn.getresponse()
                if not 200 <= response.status < 300:
                    raise IOError("HTTP error")
            finally:
                conn.close()
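The clean-up contract described for ``get_save_stream()`` (close on exit, clear up on error) can be sketched without any network access. The following is a minimal illustration only, assuming a local file path as the target; ``local_save_stream`` is a hypothetical helper and not part of invenio-files-rest.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def local_save_stream(path):
    """Yield a writable binary file; delete the partial file if writing fails."""
    fp = open(path, 'wb')
    try:
        yield fp
    except BaseException:
        # Clear-up on error: close the handle, remove the partial file, re-raise.
        fp.close()
        os.remove(path)
        raise
    else:
        # Normal exit: just close the file.
        fp.close()


# Usage: a successful write leaves the file in place.
target = os.path.join(tempfile.mkdtemp(), 'example.bin')
with local_save_stream(target) as stream:
    stream.write(b'hello')
```

If the body of the ``with`` block raises, the partially written file is removed before the exception propagates, which is the behaviour the comment in the example above asks a backend to provide.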


Checksumming
------------

The base class computes checksums by default, using the ``checksum_hash_name`` class or instance attribute to select
the hashlib hash function. If your underlying storage system provides its own checksumming, you can set this to
``None`` and override ``checksum()``:

.. code-block:: python

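    import urllib.request

    class RemoteChecksumStorageBackend(StorageBackend):
        checksum_hash_name = None

        def checksum(self, chunk_size=None):
            # Ask the remote service for the checksum instead of hashing locally.
            with urllib.request.urlopen(self.uri + '?checksum=sha256') as response:
                # read() returns bytes, so decode before formatting.
                checksum = response.read().decode('ascii').strip()
            return f'sha256:{checksum}'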
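For readers unfamiliar with what "handles reporting progress, file size limits and checksumming" might involve, here is a rough, self-contained sketch of a write-through wrapper that hashes and counts bytes as they are written. It is not the base class's actual implementation: ``ChecksummingWriter``, its ``size_limit`` parameter and the ``ValueError`` it raises are all assumptions made for illustration. Only the ``algorithm:hexdigest`` result format follows the example above.

```python
import hashlib
import io


class ChecksummingWriter:
    """Wrap a writable stream; hash and count every byte written (hypothetical)."""

    def __init__(self, stream, hash_name='sha256', size_limit=None):
        self._stream = stream
        self._hash = hashlib.new(hash_name)
        self._size_limit = size_limit
        self.bytes_written = 0

    def write(self, chunk):
        # Enforce the optional size limit before touching the underlying stream.
        if (self._size_limit is not None
                and self.bytes_written + len(chunk) > self._size_limit):
            raise ValueError('size limit exceeded')
        self._stream.write(chunk)
        self._hash.update(chunk)
        self.bytes_written += len(chunk)
        return len(chunk)

    @property
    def checksum(self):
        # Same "<algorithm>:<hexdigest>" shape as the checksum() example.
        return '{}:{}'.format(self._hash.name, self._hash.hexdigest())


# Usage: wrap any file-like object that has a write() method.
buffer = io.BytesIO()
writer = ChecksummingWriter(buffer, size_limit=10)
writer.write(b'hello')
```

A backend that sets ``checksum_hash_name = None`` would skip this kind of local hashing and rely on the remote system instead.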

41 changes: 41 additions & 0 deletions invenio_files_rest/alembic/0999e27defd5_add_backend_name.py
@@ -0,0 +1,41 @@
#
# This file is part of Invenio.
# Copyright (C) 2020 Cottage Labs LLP
#
# Invenio is free software; you can redistribute it and/or modify it
# under the terms of the MIT License; see LICENSE file for more details.

"""Add backend_name."""

import sqlalchemy as sa
from alembic import op

# revision identifiers, used by Alembic.
revision = '0999e27defd5'
down_revision = '8ae99b034410'
branch_labels = ()
depends_on = None


def upgrade():
    """Upgrade database."""
    # ### commands auto generated by Alembic - please adjust! ###
    op.add_column(
        'files_files',
        sa.Column('storage_backend',
                  sa.String(length=32),
                  nullable=True))
    op.add_column(
        'files_location',
        sa.Column('storage_backend',
                  sa.String(length=32),
                  nullable=True))
    # ### end Alembic commands ###


def downgrade():
    """Downgrade database."""
    # ### commands auto generated by Alembic - please adjust! ###
    op.drop_column('files_location', 'storage_backend')
    op.drop_column('files_files', 'storage_backend')
    # ### end Alembic commands ###
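As a rough illustration of what the upgrade does to existing data, the sketch below applies an equivalent ``ALTER TABLE`` to an in-memory SQLite database using only the standard library. The table is a simplified stand-in (the ``id`` and ``uri`` columns are assumptions, and the real schema has more columns); the point is only that a nullable column can be added without touching existing rows, which then read back as ``NULL``.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# Simplified stand-in for the real table; only the table name comes from the migration.
conn.execute('CREATE TABLE files_files (id INTEGER PRIMARY KEY, uri TEXT)')
conn.execute("INSERT INTO files_files (uri) VALUES ('file:///tmp/a')")

# Approximate SQL equivalent of op.add_column(..., nullable=True).
conn.execute('ALTER TABLE files_files ADD COLUMN storage_backend VARCHAR(32)')

# Pre-existing rows have no value for the new column.
rows = conn.execute('SELECT uri, storage_backend FROM files_files').fetchall()
```

Because existing rows carry no backend name after the upgrade, code reading the column presumably needs a fallback, such as the ``FILES_REST_DEFAULT_STORAGE_BACKEND`` setting introduced in this pull request.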
13 changes: 12 additions & 1 deletion invenio_files_rest/config.py
@@ -1,13 +1,15 @@
 # -*- coding: utf-8 -*-
 #
 # This file is part of Invenio.
-# Copyright (C) 2015-2019 CERN.
+# Copyright (C) 2015-2020 CERN.
+# Copyright (C) 2020 Cottage Labs LLP
 #
 # Invenio is free software; you can redistribute it and/or modify it
 # under the terms of the MIT License; see LICENSE file for more details.

 """Invenio Files Rest module configuration file."""

+import pkg_resources
 from datetime import timedelta

 from invenio_files_rest.helpers import create_file_streaming_redirect_response
@@ -57,6 +59,15 @@
 FILES_REST_STORAGE_FACTORY = 'invenio_files_rest.storage.pyfs_storage_factory'
 """Import path of factory used to create a storage instance."""

+FILES_REST_DEFAULT_STORAGE_BACKEND = 'pyfs'
+"""The default storage backend name."""
+
+FILES_REST_STORAGE_BACKENDS = {
Member


Minor: I see some potential troubles here.

How are users expected to provide storage backends?

  1. Only via specifying entry points? In this case, we should probably not make it config, but just a method on the Flask extension.
  2. Allow developers/instances to specify a dict of storage backends? In this case, it makes sense to have a config. However, when FILES_REST_STORAGE_BACKENDS is overwritten, the code below will still execute, causing all storage backends to be loaded, even if we don't want them.

Thus in both cases, I think it makes sense to move the entry point iteration to the Flask extension to avoid unnecessarily loading storage backends that we don't need.

Member Author


I was supporting both so that there was an obvious default behaviour (entry points), but that they could override this if desired. The "overriding" use case was that they wanted to change the behaviour of a backend they'd already used, but which they don't maintain (e.g. one in invenio-files-rest, or invenio-s3). In this case defining a new entrypoint with the same name wouldn't work, so they'd have to define the name→backend mapping explicitly as config.

However, yes, it's not great to do the entrypoint dereferencing here regardless of whether it's overridden. If you're happy with the rationale above I can move this to the extension loading code.

I should also write documentation for this.

+    ep.name: ep.load()
+    for ep in pkg_resources.iter_entry_points('invenio_files_rest.storage')
+}
+"""A mapping from storage backend names to classes."""
+
 FILES_REST_PERMISSION_FACTORY = \
     'invenio_files_rest.permissions.permission_factory'
 """Permission factory to control the files access from the REST interface."""
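The review thread on this hunk proposes moving entry point iteration out of the config module so that backends are only loaded when no explicit mapping is configured. A minimal sketch of that idea follows. It is not code from this pull request: the function names are hypothetical, it uses ``importlib.metadata`` from the standard library rather than ``pkg_resources``, and treating a missing or ``None`` config value as "not overridden" is an assumption.

```python
from importlib import metadata


def load_entry_point_backends(group='invenio_files_rest.storage'):
    """Load every backend class registered under the entry point group."""
    eps = metadata.entry_points()
    if hasattr(eps, 'select'):
        # Python 3.10+ exposes a selection API.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


def resolve_storage_backends(config):
    """Prefer an explicit name-to-class mapping; fall back to entry points."""
    backends = config.get('FILES_REST_STORAGE_BACKENDS')
    if backends is None:
        # Entry points are only dereferenced when nothing was configured.
        backends = load_entry_point_backends()
    return backends
```

Called from the Flask extension rather than at import time, a helper like this would let an instance override a backend it does not maintain while avoiding the unconditional loading that the reviewer points out.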
17 changes: 15 additions & 2 deletions invenio_files_rest/ext.py
@@ -1,7 +1,8 @@
 # -*- coding: utf-8 -*-
 #
 # This file is part of Invenio.
-# Copyright (C) 2015-2019 CERN.
+# Copyright (C) 2015-2020 CERN.
+# Copyright (C) 2020 Cottage Labs LLP.
 #
 # Invenio is free software; you can redistribute it and/or modify it
 # under the terms of the MIT License; see LICENSE file for more details.
@@ -10,6 +11,7 @@

 from __future__ import absolute_import, print_function

+import warnings
 from flask import abort
 from werkzeug.exceptions import UnprocessableEntity
 from werkzeug.utils import cached_property
@@ -30,9 +32,20 @@ def __init__(self, app):
     @cached_property
     def storage_factory(self):
         """Load default storage factory."""
-        return load_or_import_from_config(
+        if self.app.config['FILES_REST_STORAGE_FACTORY'] in [
+            'invenio_files_rest.storage.pyfs_storage_factory',
+        ]:
+            warnings.warn(DeprecationWarning(
+                "The " + self.app.config['FILES_REST_STORAGE_FACTORY'] +
+                " storage factory has been deprecated in favour of"
+                " 'invenio_files_rest.storage:StorageFactory'"
+            ))
+        storage_factory = load_or_import_from_config(
             'FILES_REST_STORAGE_FACTORY', app=self.app
         )
+        if isinstance(storage_factory, type):
+            storage_factory = storage_factory(self.app)
+        return storage_factory

     @cached_property
     def permission_factory(self):
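The ``storage_factory`` change above combines two behaviours: warn when a deprecated factory path is configured, and instantiate the factory with the app when it is a class rather than a plain callable. The standalone sketch below isolates that pattern so it can be run without Flask. ``ExampleStorageFactory``, ``legacy_factory`` and ``prepare_factory`` are hypothetical stand-ins, and a bare ``object()`` plays the role of the app.

```python
import warnings

DEPRECATED_FACTORIES = ('invenio_files_rest.storage.pyfs_storage_factory',)


class ExampleStorageFactory:
    """Hypothetical class-based factory, constructed once with the app."""

    def __init__(self, app):
        self.app = app

    def __call__(self, fileinstance=None):
        return ('storage-for', fileinstance)


def legacy_factory(fileinstance=None):
    """Hypothetical function-based factory, used as-is."""
    return ('legacy-storage-for', fileinstance)


def prepare_factory(factory, import_path, app):
    """Warn about deprecated paths and instantiate class-based factories."""
    if import_path in DEPRECATED_FACTORIES:
        warnings.warn(import_path + ' is deprecated',
                      DeprecationWarning, stacklevel=2)
    if isinstance(factory, type):
        # A class was configured: build an instance bound to the app.
        factory = factory(app)
    return factory


app = object()
factory = prepare_factory(
    ExampleStorageFactory, 'mypackage.storage:ExampleStorageFactory', app)
```

Checking ``isinstance(factory, type)`` keeps function-based factories working unchanged while allowing class-based ones to receive the app at construction time.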