Affiliations OpenAIRE OpenOrgs PIC #392

ptamarit · 2024-08-27T14:24:38Z

❤️ Thank you for your contribution!

Closes #393

Description

Recommended to be reviewed commit by commit.

This Pull Request configures a datastream which "augments" the existing affiliations vocabulary by adding the Participant Identification Code (PIC), based on data coming from the OpenAIRE Graph Dataset (which is itself based on OpenAIRE's OpenOrgs data).

Updating existing affiliations can be achieved by running:

invenio vocabularies update -v affiliations:openaire -o full

Checklist

Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:

I'm aware of the code of conduct.
I've created logical separate commits and followed the commit message format.
I've added relevant test cases.
I've added relevant documentation.
I've marked translation strings.
I've identified the copyright holder(s) and updated copyright headers for touched files (>15 lines contributions).
I've NOT included third-party code (copy/pasted source code or new dependencies).
- If you have added third-party code (copy/pasted or new dependencies), please reach out to an architect.

Frontend

I've followed the CSS/JS and React guidelines.
I've followed the web accessibility guidelines.
I've followed the user interface guidelines.

Reminder

By using GitHub, you have already agreed to the GitHub’s Terms of Service including that:

You license your contribution under the same terms as the current repository’s license.
You agree that you have the right to license your contribution under the current repository’s license.

ptamarit · 2024-08-27T14:32:32Z

invenio_vocabularies/datastreams/writers.py

+                and isinstance(value, list)
+            ):
+                for value_item in value:
+                    # TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries.


To fix this problem, we would need to not only check for equality like we do here, but we would need to check the scheme of each identifier to know if it's a new scheme or an existing scheme being updated. This would mean that the writer logic would be specific to a given vocabulary.
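The difference between the equality check used here and a scheme-aware merge can be sketched as follows. This is only an illustration: `merge_by_equality` and `merge_by_scheme` are hypothetical helpers, the `scheme`/`identifier` dict shape is an assumption, and the scheme-aware variant further assumes at most one identifier per scheme.

```python
def merge_by_equality(existing, incoming):
    """Append incoming items that are not already present (equality check only)."""
    merged = list(existing)
    for item in incoming:
        if item not in merged:
            merged.append(item)
    return merged


def merge_by_scheme(existing, incoming):
    """Replace the identifier of an already-known scheme, append new schemes.

    Assumption: a record holds at most one identifier per scheme.
    """
    by_scheme = {item["scheme"]: item for item in existing}
    for item in incoming:
        by_scheme[item["scheme"]] = item
    return list(by_scheme.values())


# Made-up values: a wrong PIC that is later corrected upstream.
existing = [{"scheme": "pic", "identifier": "999999999"}]
corrected = [{"scheme": "pic", "identifier": "123456789"}]

print(merge_by_equality(existing, corrected))  # keeps both the wrong and the corrected PIC
print(merge_by_scheme(existing, corrected))    # keeps only the corrected PIC
```

The second helper is what would make the writer vocabulary-specific, since it has to know which key carries the scheme.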


ptamarit · 2024-08-27T14:34:22Z

invenio_vocabularies/contrib/common/openaire/datastreams.py

+            "OpenAIREHTTPReader downloads one file and therefore does not iterate through items"
+        )
+
+    def read(self, item=None, *args, **kwargs):


Remark: this reader does not yet compare the publication date of the dataset with the date of the last successful run.
The only reader implementing such logic so far is RORHTTPReader.
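A minimal sketch of what such a comparison could look like, not taken from RORHTTPReader: `should_download` is a hypothetical helper, and it assumes both dates are available as timezone-aware datetimes.

```python
from datetime import datetime, timezone


def should_download(publication_date, last_successful_run):
    """Return True when the dataset is newer than the last successful run.

    `last_successful_run` may be None when the datastream has never run.
    """
    if last_successful_run is None:
        return True
    return publication_date > last_successful_run


# Made-up dates for illustration.
published = datetime(2024, 8, 1, tzinfo=timezone.utc)
last_run = datetime(2024, 8, 27, tzinfo=timezone.utc)

print(should_download(published, last_run))  # False: nothing new since the last run
print(should_download(published, None))      # True: first run
```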

ptamarit · 2024-08-27T14:36:12Z

invenio_vocabularies/factories.py

 def get_vocabulary_config(vocabulary):
    """Factory function to get the appropriate Vocabulary Config."""
    vocab_config = {
        "names": NamesVocabularyConfig,
        "funders": FundersVocabularyConfig,
        "awards": AwardsVocabularyConfig,
        "affiliations": AffiliationsVocabularyConfig,
+        "affiliations:openaire": AffiliationsOpenAIREVocabularyConfig,


Remark: here we are introducing the notion of different data sources for a given vocabulary type, using ":" as a kind of namespacing.
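The convention can be illustrated with a small helper. Note that the factory in this diff looks up the full string as a dictionary key and does not split it; `split_vocabulary_key` is hypothetical and only shows how the two parts relate.

```python
def split_vocabulary_key(key):
    """Split "type:source" into (type, source); source is None when absent."""
    vocabulary_type, _, source = key.partition(":")
    return vocabulary_type, source or None


print(split_vocabulary_key("affiliations:openaire"))  # ('affiliations', 'openaire')
print(split_vocabulary_key("affiliations"))           # ('affiliations', None)
```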

ptamarit · 2024-08-27T14:47:48Z

invenio_vocabularies/contrib/affiliations/datastreams.py

+        return super().write(stream_entry, *args, **kwargs)
+
+    def write_many(self, stream_entries, *args, **kwargs):
+        """Writes the input entries using a given service."""


write_many in ServiceWriter already does not really handle the update flag logic. Now that I also reject entries in write, I am not sure how to reuse all this logic here.

carlinmack · 2024-08-30T12:33:36Z

invenio_vocabularies/datastreams/writers.py

+            if self._insert:
+                try:
+                    return StreamEntry(self._service.create(self._identity, entry))
+                except PIDAlreadyExists:
+                    if not self._update:
+                        raise WriterError([f"Vocabulary entry already exists: {entry}"])
+                    return self._do_update(entry)
+            elif self._update:
+                try:
+                    return self._do_update(entry)
+                except (NoResultFound, PIDDoesNotExistError):
+                    raise WriterError([f"Vocabulary entry does not exist: {entry}"])
+            else:
+                raise WriterError(
+                    ["Writer wrongly configured to not insert and to not update"]


question: maybe we're confused, but what if you want to insert and update? It seems like you can only do one or the other.

Try to simplify or add comments.
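One way to read the branches in the diff above is that setting both flags means "create, and fall back to an update if the entry already exists". The sketch below reproduces that branch structure against a toy in-memory service. Everything in it is a stand-in: `InMemoryService`, `EntryExists`, `EntryMissing`, the `write` function and its `ValueError` are hypothetical simplifications of the real service, PID exceptions and `WriterError`.

```python
class EntryExists(Exception):
    """Stand-in for PIDAlreadyExists."""


class EntryMissing(Exception):
    """Stand-in for NoResultFound / PIDDoesNotExistError."""


class InMemoryService:
    """Toy service storing entries in a dict keyed by entry["id"]."""

    def __init__(self):
        self.records = {}

    def create(self, entry):
        if entry["id"] in self.records:
            raise EntryExists(entry["id"])
        self.records[entry["id"]] = dict(entry)
        return "created"

    def update(self, entry):
        if entry["id"] not in self.records:
            raise EntryMissing(entry["id"])
        self.records[entry["id"]].update(entry)
        return "updated"


def write(service, entry, insert=True, update=False):
    """Mirror the insert/update branches of the diff with simplified errors."""
    if insert:
        try:
            return service.create(entry)
        except EntryExists:
            if not update:
                raise ValueError(f"Vocabulary entry already exists: {entry}")
            # insert and update both set: fall back to updating the entry.
            return service.update(entry)
    elif update:
        try:
            return service.update(entry)
        except EntryMissing:
            raise ValueError(f"Vocabulary entry does not exist: {entry}")
    raise ValueError("Writer wrongly configured to not insert and to not update")


service = InMemoryService()
entry = {"id": "org-1", "name": "Example Org"}  # made-up entry
print(write(service, entry, insert=True, update=True))  # created
print(write(service, entry, insert=True, update=True))  # updated
```

With `insert=False, update=True`, the same call on an unknown entry raises instead of creating it, which is the "augment only" mode this PR relies on.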

carlinmack · 2024-08-30T12:36:50Z

invenio_vocabularies/contrib/common/openaire/datastreams.py

@@ -0,0 +1,84 @@
+# -*- coding: utf-8 -*-
+#
+# Copyright (C) 2024 CERN.


nitpick: maintain 2022-2024?

carlinmack · 2024-08-30T12:39:25Z

invenio_vocabularies/config.py

+def is_pic(val):
+    """Test if argument is a Participant Identification Code (PIC)."""
+    if len(val) != 9:
+        return False
+    return val.isdigit()


Suggested change

-def is_pic(val):
-    """Test if argument is a Participant Identification Code (PIC)."""
-    if len(val) != 9:
-        return False
-    return val.isdigit()
+def is_pic(val):
+    """Test if argument is a Participant Identification Code (PIC)."""
+    return len(val) == 9 and val.isdigit()
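As a quick sanity check, the two forms can be compared on a few inputs. The function names `is_pic_original` and `is_pic_suggested` and the sample values are made up for this comparison, and `val` is assumed to be a `str`.

```python
def is_pic_original(val):
    """Two-step form from the diff."""
    if len(val) != 9:
        return False
    return val.isdigit()


def is_pic_suggested(val):
    """One-line form from the suggestion."""
    return len(val) == 9 and val.isdigit()


# Made-up sample values, not real PICs.
samples = ["999999999", "12345678", "1234567890", "12345678a", ""]
for sample in samples:
    assert is_pic_original(sample) == is_pic_suggested(sample)

print(is_pic_suggested("999999999"))  # True
print(is_pic_suggested("12345678a"))  # False
```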

carlinmack · 2024-08-30T12:46:45Z

invenio_vocabularies/contrib/affiliations/datastreams.py

+
+        for pid in record["pid"]:
+            if pid["scheme"] == "ROR":
+                organization["id"] = pid["value"].removeprefix("https://ror.org/")


nitpick: does this have to be id? Would it not be clearer as ror/rorid? Or pid, to match the column in affiliation_metadata?

carlinmack · 2024-08-30T12:53:02Z

invenio_vocabularies/contrib/affiliations/datastreams.py

+        if not entry["openaire_id"].startswith("openorgs____::"):
+            raise WriterError([f"Not valid OpenAIRE OpenOrgs id for: {entry}"])
+        del entry["openaire_id"]


question: why do we validate the id and then delete it? Would it be better to have the check in the transformer instead?
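A rough sketch of what the check could look like if it lived in the transformer, so that the writer never sees `openaire_id`. This is not the PR's code: `transform`, `TransformError`, and the assumption that the OpenAIRE identifier sits under `record["id"]` are all hypothetical; only the `pid` list shape and the `openorgs____::` prefix come from the diffs above.

```python
OPENORGS_PREFIX = "openorgs____::"


class TransformError(Exception):
    """Hypothetical error type for records rejected during transformation."""


def transform(record):
    """Validate the OpenAIRE id, then keep only the fields the writer needs."""
    # Assumption: the OpenAIRE identifier is stored under record["id"].
    if not record["id"].startswith(OPENORGS_PREFIX):
        raise TransformError(f"Not valid OpenAIRE OpenOrgs id for: {record}")

    organization = {}
    for pid in record.get("pid", []):
        if pid["scheme"] == "ROR":
            # str.removeprefix requires Python 3.9 or later.
            organization["id"] = pid["value"].removeprefix("https://ror.org/")
    return organization


# Made-up record for illustration.
record = {
    "id": "openorgs____::abc",
    "pid": [{"scheme": "ROR", "value": "https://ror.org/012345678"}],
}
print(transform(record))  # {'id': '012345678'}
```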

carlinmack · 2024-08-30T12:58:42Z

invenio_vocabularies/datastreams/writers.py

+        for key, value in entry.items():
+            if (
+                key in updated
+                and isinstance(updated[key], list)
+                and isinstance(value, list)
+            ):
+                for value_item in value:
+                    # TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries.
+                    if value_item not in updated[key]:
+                        updated[key].append(value_item)


Suggested change

-        for key, value in entry.items():
-            if (
-                key in updated
-                and isinstance(updated[key], list)
-                and isinstance(value, list)
-            ):
-                for value_item in value:
-                    # TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries.
-                    if value_item not in updated[key]:
-                        updated[key].append(value_item)
+        for key, values in entry.items():
+            if (
+                key in updated
+                and isinstance(updated[key], list)
+                and isinstance(values, list)
+            ):
+                for value in values:
+                    # TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries.
+                    if value not in updated[key]:
+                        updated[key].append(value)

minor: maybe better to have value in values instead of value_item in value

carlinmack · 2024-08-30T13:01:25Z

invenio_vocabularies/datastreams/writers.py

+                and isinstance(value, list)
+            ):
+                for value_item in value:
+                    # TODO: If an identifier was wrong and is then corrected, this will cause duplicated entries.


Do we think duplicates would be problematic? Maybe better to just accumulate IDs 😅

ptamarit force-pushed the affiliations-openaire-openorgs-pic branch 2 times, most recently from 9bc08bb to 2ea4434 Compare August 27, 2024 14:30

This was referenced Aug 27, 2024

Link EC Projects (Awards vocabulary) to EuroSciVoc subjects and participating organizations with data from CORDIS #382

Open

config: vocabularies Datastream common OpenAIRE inveniosoftware/invenio-app-rdm#2813

Open

ptamarit force-pushed the affiliations-openaire-openorgs-pic branch from 2ea4434 to 8db63c1 Compare August 27, 2024 14:45

ptamarit added 4 commits August 29, 2024 09:48

cli: make the update command work for writers without args

42d6323

datastreams: writers: add option to not insert

0e1fa04

datastreams: move OpenAIREProjectHTTPReader to generic OpenAIREHTTPReader

88ba835

datastreams: affiliations: OpenAIRE transformer and writer adding PIC identifier

9d54a2a

ptamarit force-pushed the affiliations-openaire-openorgs-pic branch from 8db63c1 to 9d54a2a Compare August 29, 2024 07:48

carlinmack reviewed Aug 30, 2024

inveniosoftware deleted a comment from carlinmack Aug 30, 2024

ptamarit mentioned this pull request Sep 6, 2024

Link EC Projects (Awards vocabulary) to EuroSciVoc subjects and participating organizations with data from CORDIS #399

Draft
