-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Affiliations OpenAIRE OpenOrgs PIC #392
base: master
Are you sure you want to change the base?
Affiliations OpenAIRE OpenOrgs PIC #392
Conversation
9bc08bb
to
2ea4434
Compare
and isinstance(value, list) | ||
): | ||
for value_item in value: | ||
# TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fix this problem, we would need to not only check for equality like we do here, but we would need to check the scheme of each identifier to know if it's a new scheme or an existing scheme being updated. This would mean that the writer logic would be specific to a given vocabulary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we think duplicates would be problematic? maybe better to just accumulate IDs 😅
"OpenAIREHTTPReader downloads one file and therefore does not iterate through items" | ||
) | ||
|
||
def read(self, item=None, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remark: this reader still does not do a comparison of the publication date with the last successful run.
The only reader implementing such a logic so far is RORHTTPReader
.
def get_vocabulary_config(vocabulary): | ||
"""Factory function to get the appropriate Vocabulary Config.""" | ||
vocab_config = { | ||
"names": NamesVocabularyConfig, | ||
"funders": FundersVocabularyConfig, | ||
"awards": AwardsVocabularyConfig, | ||
"affiliations": AffiliationsVocabularyConfig, | ||
"affiliations:openaire": AffiliationsOpenAIREVocabularyConfig, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remark: here we are introducing the notion of different datasources for a given vocabulary type, using :
as a kind of namespacing.
2ea4434
to
8db63c1
Compare
return super().write(stream_entry, *args, **kwargs) | ||
|
||
def write_many(self, stream_entries, *args, **kwargs): | ||
"""Writes the input entries using a given service.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write_many
in ServiceWriter
already does not really handle the update
flag logic. Now that I also reject entries in write
, I am not sure how to reuse all this logic here.
8db63c1
to
9d54a2a
Compare
if self._insert: | ||
try: | ||
return StreamEntry(self._service.create(self._identity, entry)) | ||
except PIDAlreadyExists: | ||
if not self._update: | ||
raise WriterError([f"Vocabulary entry already exists: {entry}"]) | ||
return self._do_update(entry) | ||
elif self._update: | ||
try: | ||
return self._do_update(entry) | ||
except (NoResultFound, PIDDoesNotExistError): | ||
raise WriterError([f"Vocabulary entry does not exist: {entry}"]) | ||
else: | ||
raise WriterError( | ||
["Writer wrongly configured to not insert and to not update"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: maybe we're confused but what if you want to insert and update? it seems like you can only do one or the other
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Try to simplify or add comments
@@ -0,0 +1,84 @@ | |||
# -*- coding: utf-8 -*- | |||
# | |||
# Copyright (C) 2024 CERN. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: maintain 2022-2024?
def is_pic(val): | ||
"""Test if argument is a Participant Identification Code (PIC).""" | ||
if len(val) != 9: | ||
return False | ||
return val.isdigit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def is_pic(val): | |
"""Test if argument is a Participant Identification Code (PIC).""" | |
if len(val) != 9: | |
return False | |
return val.isdigit() | |
def is_pic(val): | |
"""Test if argument is a Participant Identification Code (PIC).""" | |
return len(val) == 9 and val.isdigit() |
|
||
for pid in record["pid"]: | ||
if pid["scheme"] == "ROR": | ||
organization["id"] = pid["value"].removeprefix("https://ror.org/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: does this have to be id
? would it not be better to be ror/rorid for clarity? or be pid to match the column in affiliation_metadata
if not entry["openaire_id"].startswith("openorgs____::"): | ||
raise WriterError([f"Not valid OpenAIRE OpenOrgs id for: {entry}"]) | ||
del entry["openaire_id"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why do we validate the id and then delete it? would it be better to have the check in the transformer instead?
for key, value in entry.items(): | ||
if ( | ||
key in updated | ||
and isinstance(updated[key], list) | ||
and isinstance(value, list) | ||
): | ||
for value_item in value: | ||
# TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries. | ||
if value_item not in updated[key]: | ||
updated[key].append(value_item) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for key, value in entry.items(): | |
if ( | |
key in updated | |
and isinstance(updated[key], list) | |
and isinstance(value, list) | |
): | |
for value_item in value: | |
# TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries. | |
if value_item not in updated[key]: | |
updated[key].append(value_item) | |
for key, values in entry.items(): | |
if ( | |
key in updated | |
and isinstance(updated[key], list) | |
and isinstance(value, list) | |
): | |
for value in values: | |
# TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries. | |
if value not in updated[key]: | |
updated[key].append(value) |
minor: maybe better to have values
instead of value_item
s in value
and isinstance(value, list) | ||
): | ||
for value_item in value: | ||
# TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we think duplicates would be problematic? maybe better to just accumulate IDs 😅
❤️ Thank you for your contribution!
Closes #393
Description
Recommended to be reviewed commit by commit.
This Pull Request configures a datastream which "augments" the existing affiliations vocabulary by adding the Participant Identification Code (PIC) based on data coming from OpenAIRE Graph Dataset (which is based on OpenAIRE's OpenOrgs data).
Updating existing affiliations can be achieved by running:
Checklist
Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:
Frontend
Reminder
By using GitHub, you have already agreed to the GitHub’s Terms of Service including that: