Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datamodel v1 #49

Merged
merged 14 commits into from
Feb 11, 2020
Merged

datamodel v1 #49

merged 14 commits into from
Feb 11, 2020

Conversation

ppanero
Copy link
Member

@ppanero ppanero commented Jan 27, 2020

Contains fields according to discussed in the InvenioRDM Project meeting in January:

Notes/Questions:

  • All administrative fields are at the beginning preceded by underscore. Warning: There are some posts online about ES/Kibana not accepting fields prefixed with underscore. However, I did not find any official documentation except the one for Beats and it does allow them.

  • No enumeration has been set in the JSONSchema. The ones from Datacite are enforced/validated via Marshamllow schemas. However, they will be made configurable in the near future.

  • ES mappings contain only object. Since the current querying system does not support nested queries, even if the results are not optimal the user can still query the fields. Moreover, once nested queries are supported re-indexing with nested where necessary is not costly since the schema per se does not change.

  • ES fields might require sub-fields (e.g. text/keyword to implement better search).

  • Opted for fields.List() rather than many=True in marshmallow due to RFC: Remove Nested(many=True) to keep only the List(Nested()) form marshmallow-code/marshmallow#779.

  • Both attribute and data_key are needed. See attribute used instead of data_key during deserialization marshmallow-code/marshmallow#791

  • Creators type changed to name_type to avoid confusion with contributors type (role).

  • Should owner be required (e.g. at least one)?

  • Should the recid be repeated? i.e. once (id) as part of the record, not the metadata, and then again recid inside the metadata.

  • Titles: Shall we add type "MainTitle" to be able to identify which is the main one in an easier way?

  • Shall _created_by field be required in Marshmallow (i.e. having a default value even if not required in the JSONSchema). This could be done with a pre_loader as is the case for publication_date.

  • Dependencies:

    • Pinned Marshmallow

Missing:

Migrated here #54

Closes:

Closes #23
Closes #22
Closes #38 <-- I think it gets closed. Since it is slowly getting solved.
Closes #43 <-- I agree that a resource should have a resource_type. However @lnielsen mentioned that for workflow reasons (e.g. create a record when the user starts submitting files to they can be uploaded while the user fills the form) the required fields are minimal. However, we can put a default value. WDYT?

@ppanero ppanero marked this pull request as ready for review February 3, 2020 13:11
@ppanero ppanero changed the title [WIP] datamodel v1 datamodel v1 Feb 3, 2020
Copy link
Contributor

@fenekku fenekku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all: thanks for writing up this big PR - it's really good to have it all laid out and that was a lot of work.

I only went through 2 files -(jsonschema + mapping) so far (took most of my day). I will finish reviewing the other files tomorrow with a fresh mind, because this is important. However I wanted to get the accumulated comments out right away to get the process moving.

Here are my thoughts on some of the questions / notes you had:

Both attribute and data_key are needed

I haven't looked at the marshmallow code yet, but marshmallow 2 doesn't know about data_key...

Creators type changed to name_type to avoid confusion with contributors type (role).

Contributors type is a name type just like creators's type. role is a different field. So I would keep it as it is but add a "description" to it to make it clear what type means.

Should the recid be repeated? i.e. once (id) as part of the record, not the metadata, and then again recid inside the metadata.

How I see it: if a field is both metadata and administrative in some sense, it should be in the metadata only. This way there is no repetition / unsyncing and no hand-wringing about where to place fields.

Titles: Shall we add type "MainTitle" to be able to identify which is the main one in an easier way?

First title should be the main one. We can add 'MainTitle' to it, but shouldn't have to is my thinking there.

Shall _created_by field be required in Marshmallow (i.e. having a default value even if not required in the JSONSchema)

Yes, I think it should be required... but I also think it should be required in the JSONSchema :P

Closes 43:

I think resource_type should be required and the case of file upload without metadata should be the exception
-> this should apply to the other fields anyway. That goes to a discussion point in the review below about the required fields in the jsonschema.

Thanks again, I will finish up looking at things tomorrow.

invenio_rdm_records/jsonschemas/records/record-v1.0.0.json Outdated Show resolved Hide resolved
invenio_rdm_records/jsonschemas/records/record-v1.0.0.json Outdated Show resolved Hide resolved
invenio_rdm_records/jsonschemas/records/record-v1.0.0.json Outdated Show resolved Hide resolved
invenio_rdm_records/jsonschemas/records/record-v1.0.0.json Outdated Show resolved Hide resolved
invenio_rdm_records/jsonschemas/records/record-v1.0.0.json Outdated Show resolved Hide resolved
invenio_rdm_records/mappings/v6/records/record-v1.0.0.json Outdated Show resolved Hide resolved
invenio_rdm_records/mappings/v6/records/record-v1.0.0.json Outdated Show resolved Hide resolved
invenio_rdm_records/mappings/v6/records/record-v1.0.0.json Outdated Show resolved Hide resolved
@ppanero
Copy link
Member Author

ppanero commented Feb 4, 2020

@fenekku Thanks a lot for the PR, I know its humongous :`)

I am answering most of your comments in this one. I answered some specif ones above. However, I will wait for others to review before fixing the rest (the ones not answered, I mostly agree and will implement the change) just waiting in order no to double work, just in case.

I haven't looked at the marshmallow code yet, but marshmallow 2 doesn't know about data_key...

Since Marshmallow 2.x was for Python 2.x and it is not supported anymore, InvenioRDM should come with Marshmallow 3.x. Even though Invenio will support Marshmallow 2.x for a transition period we do not have to do so :hands_up:.

Contributors type is a name type just like creators's type. role is a different field. So I would keep it as it is but add a "description" to it to make it clear what type means.

Noted, will change it.

How I see it: if a field is both metadata and administrative in some sense, it should be in the metadata only. This way there is no repetition / unsyncing and no hand-wringing about where to place fields.

Agree also, that's why I was "confused" lets see what others think about it.

First title should be the main one. We can add 'MainTitle' to it, but shouldn't have to is my thinking there.

I see, but if we do not add it. What type do users select for it? Depends on how we do deposit UI tho, maybe is a different part in the end.

Yes, I think it should be required... but I also think it should be required in the JSONSchema :P

Agree

Thanks once again!

@ppanero
Copy link
Member Author

ppanero commented Feb 4, 2020

After IRL dicsussion we have agreed (@fenekku , @zzacharo and me):

1- Keep the JsonSchema with required fields for realistic (currently in mind) use cases. Therefore, requireing resource_type etc. and approach the different type of validation (maybe at API level as in ILS) later on when we start with deposit.

2- In a similar fashion, not to change the access related fields until a consensus is reached. Mainly to have something that is working.

Copy link
Contributor

@fenekku fenekku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished full review. Really good to have something concrete to discuss. Thanks!

invenio_rdm_records/marshmallow/json.py Show resolved Hide resolved
invenio_rdm_records/marshmallow/json.py Outdated Show resolved Hide resolved
invenio_rdm_records/marshmallow/json.py Show resolved Hide resolved
invenio_rdm_records/marshmallow/json.py Outdated Show resolved Hide resolved
invenio_rdm_records/marshmallow/json.py Outdated Show resolved Hide resolved
tests/unit/test_schemas_json_load.py Show resolved Hide resolved
tests/unit/test_schemas_json_load.py Outdated Show resolved Hide resolved
tests/unit/test_schemas_json_load.py Outdated Show resolved Hide resolved
tests/unit/test_schemas_json_load.py Outdated Show resolved Hide resolved
tests/unit/test_schemas_json_load.py Outdated Show resolved Hide resolved
Pablo Panero added 2 commits February 10, 2020 14:04
Re-order administrative fields alphabetically
Re-order metadata fields into datacite's order
Move back to initial access fields (until we have clear which ones to use)
Remove globally required fields from JSONschema. Using only Marshmallow.
Copy link
Contributor

@fenekku fenekku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with caveats for these outstanding points:

  • required fields for the subjects jsonschema :what is the status on this? (obviously I am rooting for my take :) )

  • removal of required=True on the Method field (doesn't do anything)

  • _embargo_date... stuff. I understand now. I would remove the comment about extra non-administrative fields (because they are -- anything that is not datacite core is administrative I think)

  • a test for Creator that is of type "Organizational" : we should be clear about how an organization would be filled

  • test for minimum record : given what you said we need to amend the test for minimal record, because a minimal record is bigger than what we have right now.

@ppanero
Copy link
Member Author

ppanero commented Feb 10, 2020

@fenekku Thanks!

I think I answered all your questions in the corresponding comment. Except subject which I am leaving open for now.

I think I have all fields that are minimal according to ours/datacite model, but please let me know if I forgot anything.

As for:

a test for Creator that is of type "Organizational" : we should be clear about how an organization would be filled

I shelved to do further validation later on: #52

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants