Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Affiliations OpenAIRE OpenOrgs PIC #392

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ptamarit
Copy link
Member

@ptamarit ptamarit commented Aug 27, 2024

❤️ Thank you for your contribution!

Closes #393

Description

Recommended to be reviewed commit by commit.

This Pull Request configures a datastream which "augments" the existing affiliations vocabulary by adding the Participant Identification Code (PIC) based on data coming from OpenAIRE Graph Dataset (which is based on OpenAIRE's OpenOrgs data).

Updating existing affiliations can be achieved by running:

invenio vocabularies update -v affiliations:openaire -o full

Checklist

Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:

Frontend

Reminder

By using GitHub, you have already agreed to the GitHub’s Terms of Service including that:

  1. You license your contribution under the same terms as the current repository’s license.
  2. You agree that you have the right to license your contribution under the current repository’s license.

@ptamarit ptamarit force-pushed the affiliations-openaire-openorgs-pic branch 2 times, most recently from 9bc08bb to 2ea4434 Compare August 27, 2024 14:30
and isinstance(value, list)
):
for value_item in value:
# TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix this problem, we would need to not only check for equality like we do here, but we would need to check the scheme of each identifier to know if it's a new scheme or an existing scheme being updated. This would mean that the writer logic would be specific to a given vocabulary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we think duplicates would be problematic? maybe better to just accumulate IDs 😅

"OpenAIREHTTPReader downloads one file and therefore does not iterate through items"
)

def read(self, item=None, *args, **kwargs):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remark: this reader still does not do a comparison of the publication date with the last successful run.
The only reader implementing such a logic so far is RORHTTPReader.

def get_vocabulary_config(vocabulary):
"""Factory function to get the appropriate Vocabulary Config."""
vocab_config = {
"names": NamesVocabularyConfig,
"funders": FundersVocabularyConfig,
"awards": AwardsVocabularyConfig,
"affiliations": AffiliationsVocabularyConfig,
"affiliations:openaire": AffiliationsOpenAIREVocabularyConfig,
Copy link
Member Author

@ptamarit ptamarit Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remark: here we are introducing the notion of different datasources for a given vocabulary type, using : as a kind of namespacing.

return super().write(stream_entry, *args, **kwargs)

def write_many(self, stream_entries, *args, **kwargs):
"""Writes the input entries using a given service."""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write_many in ServiceWriter already does not really handle the update flag logic. Now that I also reject entries in write, I am not sure how to reuse all this logic here.

@ptamarit ptamarit force-pushed the affiliations-openaire-openorgs-pic branch from 8db63c1 to 9d54a2a Compare August 29, 2024 07:48
Comment on lines +97 to +111
if self._insert:
try:
return StreamEntry(self._service.create(self._identity, entry))
except PIDAlreadyExists:
if not self._update:
raise WriterError([f"Vocabulary entry already exists: {entry}"])
return self._do_update(entry)
elif self._update:
try:
return self._do_update(entry)
except (NoResultFound, PIDDoesNotExistError):
raise WriterError([f"Vocabulary entry does not exist: {entry}"])
else:
raise WriterError(
["Writer wrongly configured to not insert and to not update"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: maybe we're confused but what if you want to insert and update? it seems like you can only do one or the other

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Try to simplify or add comments

@@ -0,0 +1,84 @@
# -*- coding: utf-8 -*-
#
# Copyright (C) 2024 CERN.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: maintain 2022-2024?

Comment on lines +47 to +51
def is_pic(val):
"""Test if argument is a Participant Identification Code (PIC)."""
if len(val) != 9:
return False
return val.isdigit()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def is_pic(val):
"""Test if argument is a Participant Identification Code (PIC)."""
if len(val) != 9:
return False
return val.isdigit()
def is_pic(val):
"""Test if argument is a Participant Identification Code (PIC)."""
return len(val) == 9 and val.isdigit()


for pid in record["pid"]:
if pid["scheme"] == "ROR":
organization["id"] = pid["value"].removeprefix("https://ror.org/")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: does this have to be id? would it not be better to be ror/rorid for clarity? or be pid to match the column in affiliation_metadata

Comment on lines +92 to +94
if not entry["openaire_id"].startswith("openorgs____::"):
raise WriterError([f"Not valid OpenAIRE OpenOrgs id for: {entry}"])
del entry["openaire_id"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why do we validate the id and then delete it? would it be better to have the check in the transformer instead?

Comment on lines +94 to +103
for key, value in entry.items():
if (
key in updated
and isinstance(updated[key], list)
and isinstance(value, list)
):
for value_item in value:
# TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries.
if value_item not in updated[key]:
updated[key].append(value_item)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for key, value in entry.items():
if (
key in updated
and isinstance(updated[key], list)
and isinstance(value, list)
):
for value_item in value:
# TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries.
if value_item not in updated[key]:
updated[key].append(value_item)
for key, values in entry.items():
if (
key in updated
and isinstance(updated[key], list)
and isinstance(value, list)
):
for value in values:
# TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries.
if value not in updated[key]:
updated[key].append(value)

minor: maybe better to have values instead of value_items in value

and isinstance(value, list)
):
for value_item in value:
# TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we think duplicates would be problematic? maybe better to just accumulate IDs 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Affiliations OpenAIRE OpenOrgs PIC #392
2 participants