Feature/zenodo uploader #214

Merged · 41 commits · Dec 22, 2023
Changes from 13 commits
Commits
ef1d395
Started implementation with basic test to understand better the requi…
jsmatias Nov 21, 2023
525233e
Endpoint created
jsmatias Nov 22, 2023
242a35b
Tests were refactored and the logic for the zenodo uploader included.
jsmatias Nov 27, 2023
8b81edc
more tests added to zenodo uploader
jsmatias Nov 27, 2023
392f776
Clean up and minor rearrangement of the files.
jsmatias Nov 27, 2023
893bc3f
Some docstrings added
jsmatias Nov 27, 2023
ede20bf
Parameter as an option to publish included and more tests written.
jsmatias Nov 29, 2023
f9b3965
Merge branch 'develop' into feature/zenodo_uploader
jsmatias Dec 8, 2023
0459759
Resolving PR comments: Replaced engine and with DbSession on tests.
jsmatias Dec 8, 2023
f397043
Resolving PR comments: corrected and improved description of the zeno…
jsmatias Dec 8, 2023
bf46cdb
Included a validator for zenodo id and refactored part of the code fo…
jsmatias Dec 11, 2023
dbc48fd
Corrected status code 423 -> 409
jsmatias Dec 11, 2023
3055689
Created an abstract class for general functionalities of the uploaders.
jsmatias Dec 12, 2023
acec6f1
corrected code duplication
jsmatias Dec 12, 2023
06bcaf0
improved readability of the tests of the zenodo uploader
jsmatias Dec 13, 2023
0e7ef9e
replaced set_up function for pytest.fixtures
jsmatias Dec 13, 2023
e181286
improved error handling logic
jsmatias Dec 14, 2023
6347b10
improved metadata
jsmatias Dec 16, 2023
986afb9
made the method in the abstract
jsmatias Dec 16, 2023
6572200
included a license validator to upload content to zenodo
jsmatias Dec 17, 2023
b631288
validation of the license value was moved to the beginning of the upl…
jsmatias Dec 17, 2023
eada74c
included validation for the metadata contact name, required to publis…
jsmatias Dec 18, 2023
d8aae43
minor correction on docstring
jsmatias Dec 18, 2023
042742f
corrected the url for browser access of the dataset on zenodo
jsmatias Dec 18, 2023
243066d
Merge branch 'develop' into feature/zenodo_uploader
jsmatias Dec 19, 2023
c0a47a4
added zenodo validator for platform id to concept.py
jsmatias Dec 20, 2023
c94faa0
renamed error_handling function
jsmatias Dec 20, 2023
3c49371
Minor changes
jsmatias Dec 20, 2023
f3e9c37
added a version requirement check
jsmatias Dec 20, 2023
2cd454e
improved documentation
jsmatias Dec 20, 2023
9896c7f
added the functionality of auto generating repo_id for hugging face
jsmatias Dec 20, 2023
ff81fca
created abstract method for the validation of the platform_resource_i…
jsmatias Dec 20, 2023
f98de88
Improved error tracing back for an unexpected zenodo response status …
jsmatias Dec 21, 2023
7518cc7
removed token and repo_id from class instance
jsmatias Dec 21, 2023
729ece7
minor changes on docstrings
jsmatias Dec 21, 2023
2d5816f
Added authorisation to the upload endpoints
jsmatias Dec 21, 2023
a33725b
renamed error_handlers -> error_handling
jsmatias Dec 21, 2023
c303f22
Merge branch 'develop' into feature/zenodo_uploader
jsmatias Dec 21, 2023
d837e21
improved documentation and renamed function
jsmatias Dec 21, 2023
4665fbe
corrected regex and added tests for platform_resource_identifier
jsmatias Dec 22, 2023
c552cd6
extra tests for zenodo and hugging face uploaders
jsmatias Dec 22, 2023
7 changes: 5 additions & 2 deletions src/database/model/concept/concept.py
@@ -14,14 +14,13 @@
from database.model.platform.platform_names import PlatformName
from database.model.relationships import OneToOne
from database.model.serializers import CastDeserializer
from database.validators import huggingface_validators, openml_validators
from database.validators import huggingface_validators, openml_validators, zenodo_validators

IS_SQLITE = os.getenv("DB") == "SQLite"
CONSTRAINT_LOWERCASE = f"{'platform' if IS_SQLITE else 'BINARY(platform)'} = LOWER(platform)"


class AIoDConceptBase(SQLModel):

platform: str | None = Field(
max_length=SHORT,
default=None,
@@ -60,6 +59,10 @@ def platform_resource_identifier_valid(cls, platform_resource_identifier: str, v
openml_validators.throw_error_on_invalid_identifier(
platform_resource_identifier
)
case PlatformName.zenodo:
zenodo_validators.throw_error_on_invalid_identifier(
platform_resource_identifier
)
return platform_resource_identifier


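
For orientation, here is a minimal, self-contained sketch of the per-platform dispatch this hunk adds. The real code lives in a Pydantic/SQLModel validator and matches on `PlatformName` enum members; the plain strings and the simplified Zenodo check below are stand-ins.

```python
import re


def zenodo_throw_error_on_invalid_identifier(identifier: str) -> None:
    # Repository ids look like "zenodo.org:<positive integer>" (see the validator diff below).
    if not re.fullmatch(r"zenodo\.org:[1-9]\d*", identifier):
        raise ValueError(f"Invalid Zenodo identifier: {identifier!r}")


def platform_resource_identifier_valid(platform: str, platform_resource_identifier: str) -> str:
    # Mirrors the match/case added above: each platform delegates to its own validator module.
    match platform:
        case "zenodo":
            zenodo_throw_error_on_invalid_identifier(platform_resource_identifier)
        case "huggingface" | "openml":
            pass  # the corresponding validators are omitted in this sketch
    return platform_resource_identifier


platform_resource_identifier_valid("zenodo", "zenodo.org:100")  # returns the identifier
# platform_resource_identifier_valid("zenodo", "100")           # would raise ValueError
```
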
23 changes: 16 additions & 7 deletions src/database/validators/zenodo_validators.py
@@ -1,19 +1,28 @@
import re

MSG_PREFIX = "The platform_resource_identifier for Zenodo should be a valid repo_id. "
MSG_PREFIX = "The platform_resource_identifier for Zenodo should be "
"a valid repository identifier or a valid file identifier. "

repo_id_pattern = r"(?:\Azenodo.org:[1-9]\d*\Z)"
file_id_pattern = r"(?:\A[a-z\d]{8}-[a-z\d]{4}-[a-z\d]{4}-[a-z\d]{4}-[a-z\d]{12}\Z)"
pattern = re.compile("|".join([repo_id_pattern, file_id_pattern]))


def throw_error_on_invalid_identifier(platform_resource_identifier: str):
"""
Throw a ValueError on an invalid repository identifier.
Throw a ValueError on an invalid repository or distribution identifier.

Valid repo_id:
Valid repository identifier:
zenodo.org:<int>
Valid distribution identifier:
a UUID-like string of lowercase letters and digits in 8-4-4-4-12 groups,
e.g., abcde123-abcd-0000-ab00-abcdef000000
"""
repo_id = platform_resource_identifier
pattern = re.compile(r"^zenodo.org:[1-9][0-9]*$")
if not pattern.match(repo_id):
msg = "A repo_id has the following pattern: "
"the string 'zenodo.org:' followed by an integer."
"E.g., zenodo.org:100"
msg = "A repository identifier has the following pattern: "
"the string 'zenodo.org:' followed by an integer: e.g., zenodo.org:100. \n"
"A file identifier is a string composed by a group of 8 characters, "
"3 groups of 4 characters, and a group of 12 characters, where the characters "
"include letters and numbers and the groups are separated by a dash '-': "
"e.g, abcde123-abcd-0000-ab00-abcdef000000."
raise ValueError(MSG_PREFIX + msg)
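
One caveat with the hunk as shown: `MSG_PREFIX` and `msg` are each split across separate statements, so Python's implicit string concatenation does not apply and only the first literal ends up in the message. A hedged sketch of the same validator with the literals parenthesised (patterns copied from the diff, with the dot escaped in the repository pattern; the message wording is illustrative):

```python
import re

MSG_PREFIX = (
    "The platform_resource_identifier for Zenodo should be "
    "a valid repository identifier or a valid file identifier. "
)

repo_id_pattern = r"(?:\Azenodo\.org:[1-9]\d*\Z)"
file_id_pattern = r"(?:\A[a-z\d]{8}-[a-z\d]{4}-[a-z\d]{4}-[a-z\d]{4}-[a-z\d]{12}\Z)"
pattern = re.compile("|".join([repo_id_pattern, file_id_pattern]))


def throw_error_on_invalid_identifier(platform_resource_identifier: str) -> None:
    if not pattern.match(platform_resource_identifier):
        msg = (
            "A repository identifier has the following pattern: the string "
            "'zenodo.org:' followed by an integer, e.g., zenodo.org:100. "
            "A file identifier is a UUID-like string of lowercase letters and digits "
            "in 8-4-4-4-12 groups, e.g., abcde123-abcd-0000-ab00-abcdef000000."
        )
        raise ValueError(MSG_PREFIX + msg)


throw_error_on_invalid_identifier("zenodo.org:7902672")                    # repository id: OK
throw_error_on_invalid_identifier("abcde123-abcd-0000-ab00-abcdef000000")  # file id: OK
```
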
1 change: 0 additions & 1 deletion src/error_handlers/__init__.py

This file was deleted.

1 change: 1 addition & 0 deletions src/error_handling/__init__.py
@@ -0,0 +1 @@
from error_handling.error_handling import as_http_exception # noqa:F401
src/error_handling/error_handling.py
@@ -2,7 +2,7 @@
from fastapi import HTTPException, status


def _wrap_as_http_exception(exception: Exception) -> HTTPException:
def as_http_exception(exception: Exception) -> HTTPException:
if isinstance(exception, HTTPException):
return exception
traceback.print_exc()
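
The diff is truncated after `traceback.print_exc()`, so the fallback branch is not visible here. A minimal sketch of how such a wrapper is typically completed, assuming a generic 500 response for unexpected errors:

```python
import traceback

from fastapi import HTTPException, status


def as_http_exception(exception: Exception) -> HTTPException:
    # Pass real HTTPExceptions through unchanged.
    if isinstance(exception, HTTPException):
        return exception
    # Log the unexpected error and hide the details from the client.
    traceback.print_exc()
    return HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        detail="Unexpected exception while processing your request.",
    )
```
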
4 changes: 2 additions & 2 deletions src/routers/resource_ai_asset_router.py
@@ -5,7 +5,7 @@
from fastapi.responses import Response

from database.model.ai_asset.ai_asset import AIAsset
from error_handlers import _wrap_as_http_exception
from error_handling import as_http_exception
from .resource_router import ResourceRouter


@@ -101,7 +101,7 @@ def get_resource_content(
return Response(content=content, headers=headers)

except Exception as exc:
raise _wrap_as_http_exception(exc)
raise as_http_exception(exc)

def get_resource_content_default(
identifier: Annotated[
12 changes: 6 additions & 6 deletions src/routers/resource_router.py
@@ -27,7 +27,7 @@
)
from database.model.serializers import deserialize_resource_relationships
from database.session import DbSession
from error_handlers import _wrap_as_http_exception
from error_handling import as_http_exception


class Pagination(BaseModel):
@@ -229,7 +229,7 @@ def get_resources(self, schema: str, pagination: Pagination, platform: str | Non
[convert_schema(resource) for resource in session.scalars(query).all()]
)
except Exception as e:
raise _wrap_as_http_exception(e)
raise as_http_exception(e)

def get_resource(self, identifier: str, schema: str, platform: str | None = None):
"""
@@ -244,7 +244,7 @@ def get_resource(self, identifier: str, schema: str, platform: str | None = None
return self.schema_converters[schema].convert(session, resource)
return self._wrap_with_headers(self.resource_class_read.from_orm(resource))
except Exception as e:
raise _wrap_as_http_exception(e)
raise as_http_exception(e)

def get_resources_func(self):
"""
@@ -297,7 +297,7 @@ def get_resource_count(
for platform, count in count_list
}
except Exception as e:
raise _wrap_as_http_exception(e)
raise as_http_exception(e)

return get_resource_count

@@ -396,7 +396,7 @@ def register_resource(
except Exception as e:
self._raise_clean_http_exception(e, session, resource_create)
except Exception as e:
raise _wrap_as_http_exception(e)
raise as_http_exception(e)

return register_resource

@@ -491,7 +491,7 @@ def delete_resource(
session.commit()
return self._wrap_with_headers(None)
except Exception as e:
raise _wrap_as_http_exception(e)
raise as_http_exception(e)

return delete_resource

6 changes: 3 additions & 3 deletions src/routers/search_router.py
@@ -12,7 +12,7 @@
from database.model.platform.platform import Platform
from database.model.resource_read_and_create import resource_read
from database.session import DbSession
from error_handlers import _wrap_as_http_exception
from error_handling import as_http_exception
from .search_routers.elasticsearch import ElasticsearchSingleton

SORT = {"identifier": "asc"}
@@ -115,7 +115,7 @@ def search(
database_platforms = session.scalars(query).all()
platform_names = {p.name for p in database_platforms}
except Exception as e:
raise _wrap_as_http_exception(e)
raise as_http_exception(e)

if platforms and not set(platforms).issubset(platform_names):
raise HTTPException(
@@ -176,7 +176,7 @@ def _db_query(
)
return [read_class.from_orm(resource) for resource in resources]
except Exception as e:
raise _wrap_as_http_exception(e)
raise as_http_exception(e)

def _cast_resource(
self, read_class: Type[SQLModel], resource_dict: dict[str, Any]
22 changes: 20 additions & 2 deletions src/routers/uploader_routers/upload_router_huggingface.py
@@ -1,6 +1,7 @@
from fastapi import APIRouter
from fastapi import APIRouter, Depends
from fastapi import File, Query, UploadFile

from authentication import User, get_current_user
from routers.uploader_router import UploaderRouter
from uploaders.hugging_face_uploader import HuggingfaceUploader

@@ -23,7 +24,24 @@ def huggingFaceUpload(
username: str = Query(
..., title="Huggingface username", description="The username of HuggingFace"
),
user: User = Depends(get_current_user),
) -> int:
return hugging_face_uploader.handle_upload(identifier, file, token, username)
"""
Use this endpoint to upload a file (content) to Hugging Face using
the AIoD metadata identifier of the dataset.

Before uploading dataset content, its metadata must exist in the AIoD metadata catalogue.

1. **Create Metadata**
- If the metadata doesn't exist in the AIoD catalogue, you can create it by sending a `POST`
request to `/datasets/{version}/`.
- Make sure to set `platform = "huggingface"` and
`platform_resource_identifier` to a string representing the repository name.

2. **Upload File**
- Use this `POST` endpoint to upload a file to Hugging Face using the AIoD
metadata identifier of the dataset.
"""
return hugging_face_uploader.handle_upload(identifier, file, token, username, user=user)

return router
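
A hedged usage sketch for this endpoint, assuming a locally running AIoD instance. The `/upload/datasets/1/huggingface` path matches the test suite below; the base URL and credential values are placeholders.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: local AIoD API instance

with open("example.csv", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/upload/datasets/1/huggingface",  # 1 = AIoD metadata identifier
        params={"username": "my-hf-username", "token": "hf_xxx"},
        headers={"Authorization": "Bearer <keycloak-token>"},
        files={"file": f},
    )
response.raise_for_status()
print(response.json())  # the AIoD identifier of the dataset
```
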
39 changes: 33 additions & 6 deletions src/routers/uploader_routers/upload_router_zenodo.py
@@ -1,8 +1,9 @@
from typing import Annotated

from fastapi import APIRouter
from fastapi import APIRouter, Depends
from fastapi import File, Query, UploadFile, Path

from authentication import User, get_current_user
from uploaders.zenodo_uploader import ZenodoUploader
from routers.uploader_router import UploaderRouter

@@ -31,13 +32,39 @@ def zenodo_upload(
),
] = False,
token: str = Query(title="Zenodo Token", description="The access token of Zenodo"),
user: User = Depends(get_current_user),
) -> int:
"""
Uploads a dataset to Zenodo using the AIoD metadata identifier.
If the metadata does not exist on Zenodo
(i.e., the platform_resource_identifier is None),
a new repository will be created on Zenodo.
Use this endpoint to upload a file (content) to Zenodo using
the AIoD metadata identifier of the dataset.

Before uploading dataset content, its metadata must exist in the AIoD metadata catalogue
and contain at least the following required fields:
`name`, `description`, `creator`, `version`, and `license`.

1. **Create Metadata**
- If the metadata doesn't exist in the AIoD catalogue, you can create it by sending a `POST`
request to `/datasets/{version}/`.
- If the metadata already exists on Zenodo, set `platform = "zenodo"` and
`platform_resource_identifier = "zenodo.org:{id}"`, where `{id}` is the identifier
of this dataset on Zenodo.
If you don't set these fields, a new repository will be created
on Zenodo when you upload the first file in the following step.

2. **Upload Files**
- Use this `POST` endpoint to upload a file to Zenodo using the AIoD metadata identifier
of the dataset.
- Zenodo accepts multiple files for each dataset. Thus, repeat this step for each file.

3. **Publish Dataset**
- To make the dataset and all its content public to the AI community on Zenodo, perform
a new `POST` request setting `publish` to `True` only when posting the last file.

**Note:**
- Zenodo supports multiple files within the same dataset.
- You can replace an existing file on Zenodo by uploading another one with the same name.

"""
return zenodo_uploader.handle_upload(identifier, publish, token, file)
return zenodo_uploader.handle_upload(identifier, file, token, publish, user=user)

return router
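
A hedged sketch of the three-step flow described in the docstring. The `/upload/datasets/{identifier}/zenodo` path is assumed by analogy with the Hugging Face route, `{version}` is assumed to be `v1`, and the metadata payload and response keys are simplified placeholders for the real dataset schema.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: local AIoD API instance
HEADERS = {"Authorization": "Bearer <keycloak-token>"}

# 1. Create the metadata if it does not exist yet (payload simplified).
metadata = {
    "name": "My dataset",
    "description": "Example dataset",
    "creator": [{"name": "Jane Doe"}],
    "version": "1.0.0",
    "license": "CC BY 4.0",
}
created = requests.post(f"{BASE_URL}/datasets/v1/", json=metadata, headers=HEADERS)
identifier = created.json()["identifier"]  # assumption: response exposes the AIoD identifier

# 2. Upload each file; 3. set publish=True only for the last one.
for path, publish in [("train.csv", False), ("test.csv", True)]:
    with open(path, "rb") as f:
        response = requests.post(
            f"{BASE_URL}/upload/datasets/{identifier}/zenodo",
            params={"token": "<zenodo-token>", "publish": publish},
            headers=HEADERS,
            files={"file": f},
        )
    response.raise_for_status()
```
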
85 changes: 76 additions & 9 deletions src/tests/uploader/huggingface/test_dataset_uploader.py
@@ -11,7 +11,7 @@
from database.model.dataset.dataset import Dataset
from database.session import DbSession
from tests.testutils.paths import path_test_resources
from uploaders.hugging_face_uploader import _throw_error_on_invalid_repo_id
from uploaders.hugging_face_uploader import HuggingfaceUploader


def test_happy_path_new_repository(
@@ -44,9 +44,76 @@ def test_happy_path_new_repository(
files=files,
)

assert response.status_code == 200, response.json()
id_response = response.json()
assert id_response == 1
assert response.status_code == 200, response.json()
id_response = response.json()
assert id_response == 1


def test_happy_path_generating_repo_id(
client: TestClient, mocked_privileged_token: Mock, dataset: Dataset
):
dataset = copy.deepcopy(dataset)
dataset.platform = None
dataset.platform_resource_identifier = None
dataset.name = "Repo Test Name 1"

keycloak_openid.introspect = mocked_privileged_token
with DbSession() as session:
session.add(dataset)
session.commit()

with open(path_test_resources() / "uploaders" / "huggingface" / "example.csv", "rb") as f:
files = {"file": f.read()}

with responses.RequestsMock() as mocked_requests:
mocked_requests.add(
responses.POST,
"https://huggingface.co/api/repos/create",
json={"url": "url"},
status=200,
)
huggingface_hub.upload_file = Mock(return_value=None)
response = client.post(
"/upload/datasets/1/huggingface",
params={"username": "Fake-username", "token": "Fake-token"},
headers={"Authorization": "Fake token"},
files=files,
)

assert response.status_code == 200, response.json()
id_response = response.json()
assert id_response == 1


def test_failed_generating_repo_id(
client: TestClient, mocked_privileged_token: Mock, dataset: Dataset
):
dataset = copy.deepcopy(dataset)
dataset.platform = None
dataset.platform_resource_identifier = None
dataset.name = "Repo inv@lid name"

keycloak_openid.introspect = mocked_privileged_token
with DbSession() as session:
session.add(dataset)
session.commit()

with open(path_test_resources() / "uploaders" / "huggingface" / "example.csv", "rb") as f:
files = {"file": f.read()}

huggingface_hub.upload_file = Mock(return_value=None)
response = client.post(
"/upload/datasets/1/huggingface",
params={"username": "Fake-username", "token": "Fake-token"},
headers={"Authorization": "Fake token"},
files=files,
)

assert response.status_code == 400, response.json()
error_msg = response.json()["detail"]
assert (
"We derived an invalid HuggingFace identifier: Fake-username/Repo_inv@lid_name" in error_msg
)
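
For context, a hedged sketch of how the auto-generated repo_id appears to be derived, inferred from the error message asserted above (`Fake-username/Repo_inv@lid_name`): spaces in the dataset name become underscores and the name is prefixed with the username. The validation rule below is an assumption based on Hugging Face's repo-id constraints, not the project's actual validator.

```python
import re


def derive_repo_id(username: str, dataset_name: str) -> str:
    repo_id = f"{username}/{dataset_name.replace(' ', '_')}"
    # Assumed constraint: alphanumerics plus '.', '_' and '-' on both sides of the slash.
    if not re.fullmatch(r"[A-Za-z0-9._-]+/[A-Za-z0-9._-]+", repo_id):
        raise ValueError(f"We derived an invalid HuggingFace identifier: {repo_id}")
    return repo_id


derive_repo_id("Fake-username", "Repo Test Name 1")    # -> "Fake-username/Repo_Test_Name_1"
# derive_repo_id("Fake-username", "Repo inv@lid name") # raises: '@' is not allowed
```
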


def test_repo_already_exists(client: TestClient, mocked_privileged_token: Mock, dataset: Dataset):
@@ -80,9 +147,9 @@ def test_repo_already_exists(client: TestClient, mocked_privileged_token: Mock,
headers={"Authorization": "Fake token"},
files=files,
)
assert response.status_code == 200, response.json()
id_response = response.json()
assert id_response == 1
assert response.status_code == 200, response.json()
id_response = response.json()
assert id_response == 1


def test_wrong_platform(client: TestClient, mocked_privileged_token: Mock, dataset: Dataset):
@@ -164,8 +231,8 @@ def test_wrong_platform(client: TestClient, mocked_privileged_token: Mock, datas
)
def test_repo_id(username: str, dataset_name: str, expected_error: ValueError | None):
if expected_error is None:
_throw_error_on_invalid_repo_id(username, dataset_name)
HuggingfaceUploader._platform_resource_id_validator(dataset_name, username)
else:
with pytest.raises(type(expected_error)) as exception_info:
_throw_error_on_invalid_repo_id(username, dataset_name)
HuggingfaceUploader._platform_resource_id_validator(dataset_name, username)
assert exception_info.value.args[0] == expected_error.args[0]