Helmholtz Kernel Information Profile integration #138

Closed
1 task done
christian-rli opened this issue Feb 21, 2024 · 14 comments · Fixed by #150
Labels
enhancement New feature or request

Comments

@christian-rli
Contributor

Description of the issue

As pointed out by @carstenhoyerklick there should be a way to handle the Helmholtz Kernel Information Profile in oemetadata or to at least map it.

Ideas of solution

New field? Reference in an existing one? Please discuss.

Workflow checklist

@carstenhoyerklick

In chapter 3 there are a number of data fields which should be part of the metadata. We should look at the keys there and check which ones we are still missing.

e.g.
• underEmbargoUntil

For most of them we have an equivalent in OEMetadata; we should list a mapping there.

@carstenhoyerklick

I started a mapping between the Helmholtz KIP and OE Metadata.

https://docs.google.com/spreadsheets/d/1Q0tWNRujw3taKw4f-jUVjlFl2anZe38WtVpcD9IlO2s/edit?usp=sharing

The legend is roughly:

  • White: Existing in OEMetadata
  • Yellow: not existing in OEMetadata
  • Orange: Maybe we can use information from the databus Metadata

@jh-RLI
Contributor

jh-RLI commented Apr 16, 2024

Great, that's a very helpful first step. I will provide a full example of the new oemetadata version proposal in advance of our meeting next week.

@christian-rli
Contributor Author

Thank you for a helpful start, @carstenhoyerklick. Do I understand correctly that the proposed course of action is to implement all the fields highlighted in yellow and ignore the orange ones, because they are covered by the databus? Or should we implement the orange ones as well, so that the information can be shown in the regular metadata? Either way it's quite a big extension of the current standard. Do we agree that all of the fields should show up there? If yes, I'm happy to implement them in the example files and schemas.

@carstenhoyerklick

My personal preference would be to implement them all and to use the metadata string as a master source to populate the databus. On the other hand, if you have information in two places, there is the danger of contradictory information. That, again, might be a reason to have it all in the metadata string as the authoritative source.

@christian-rli
Contributor Author

@jh-RLI and I agree. We will implement them in the next version. The resulting list of fields will be quite long and intimidating. Therefore we also decided that the tooling will take that into account. The conversion and export software will return the metadata with only the populated fields by default - empty fields are provided optionally.
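
The "only populated fields by default" behaviour described above could be sketched roughly as follows in Python. This is an illustration only: the function name `prune_empty` and the `keep_empty` flag are hypothetical and not part of any existing OEMetadata tooling, and the sketch assumes that "empty" means an empty string, `None`, or an empty list or dict.

```python
from typing import Any

# Values treated as "not populated" (an assumption made for this sketch).
_EMPTY = ("", None, [], {})


def prune_empty(metadata: Any, keep_empty: bool = False) -> Any:
    """Return `metadata` without unpopulated fields.

    Hypothetical helper, not part of the actual OEMetadata tooling.
    With keep_empty=True the input is returned unchanged.
    """
    if keep_empty:
        return metadata
    if isinstance(metadata, dict):
        # Prune nested values first, then drop keys whose value ended up empty.
        pruned = {key: prune_empty(value) for key, value in metadata.items()}
        return {key: value for key, value in pruned.items() if value not in _EMPTY}
    if isinstance(metadata, list):
        pruned_items = [prune_empty(item) for item in metadata]
        return [item for item in pruned_items if item not in _EMPTY]
    return metadata


example = {
    "name": "example_table",
    "description": "",
    "keywords": [],
    "contributors": [{"title": "John Doe", "comment": ""}],
}
print(prune_empty(example))
```

Pruning bottom-up means a nested object that contains only empty fields disappears as a whole, which keeps the default export short.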

jh-RLI added a commit that referenced this issue Jun 18, 2024
- add new trace field for traceability
christian-rli added a commit that referenced this issue Jun 18, 2024
@christian-rli
Contributor Author

@jh-RLI and I sorted through the new tags and came up with a structure. We thought it made sense to group almost all new keys together on resource level.

"trace": {
  "alternateOf": "",
  "checksum": "",
  "dateModified": "",
  "digitalObjectLocationAccessProtocol": "",
  "digitalObjectType": "",
  "hadPrimarySource": "",
  "hasMetadata": "",
  "isMetadataFor": "",
  "policy": "",
  "provenanceGraph": "",
  "specializationOf": "",
  "version": "",
  "wasDerivedFrom": "",
  "wasGeneratedBy": "",
  "wasRevisionOf": "",
  "wasQuotedFrom": "",
  "contributors": [
    {
      "title": "John Doe",
      "email": "[email protected]",
      "date": "2016-06-16",
      "object": "data and metadata",
      "comment": "Fix typo in the title."
    }
  ]
},

The key name is open for debate. We were looking for something that encompasses things you would need for provenance and reproducibility. Other candidates were 'track', 'trail', 'linked data' or 'provenance'. Currently we like 'trace', but feel free to convince us otherwise.

Other notes:

I understand "isMetadataFor" such that by default it would describe the resource on the OEP. In other words the key would be a duplicate "id" most of the time. Therefore on the OEP it should basically be hidden virtually all the time.

There is no explanation for "locationPreview". Can you help out @carstenhoyerklick ?

"underEmbargoUntil" can go next to date. It's a bit awkward to implement, because one turns into the other, ideally, but if it's not actually published on the planned date there has to be a logic on the OEP to deal with that.
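
The logic mentioned above could look roughly like the following sketch. It is only an illustration under assumptions: the function name `is_under_embargo` is hypothetical, "underEmbargoUntil" is assumed to hold an ISO 8601 date (YYYY-MM-DD), an empty value is assumed to mean "no embargo", and the data is assumed to become available on the embargo date itself.

```python
from datetime import date
from typing import Optional


def is_under_embargo(under_embargo_until: str, today: Optional[date] = None) -> bool:
    """Tell whether a resource is still embargoed (hypothetical helper).

    `under_embargo_until` is assumed to be "" or an ISO date such as "2025-01-31".
    `today` can be passed explicitly, which makes the check easy to test.
    """
    if not under_embargo_until:
        # Assumption: an empty field means there is no embargo at all.
        return False
    today = today or date.today()
    # Assumption: the embargo ends at the start of the given date.
    return today < date.fromisoformat(under_embargo_until)


print(is_under_embargo("2030-01-01", today=date(2024, 6, 18)))
```

A platform could run such a check whenever the metadata is displayed, instead of rewriting the metadata on the planned publication date.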

@carsten can you maybe elaborate on the "wasQuotedFrom" field? What's the difference between sources? Does this concern the entire dataset (i.e. this whole table is actually a quote from another resource) or is it meant to reflect sources for parts of the data. Maybe another key within sources or a redefinition to a URI would help here. I assume it's not a "quotedBy" that lists where the resource has been quoted.

@carstenhoyerklick

> I understand "isMetadataFor" such that by default it would describe the resource on the OEP. In other words the key would be a duplicate "id" most of the time. Therefore on the OEP it should basically be hidden virtually all the time.

I think we should think beyond the OEP here. On the OEP it duplicates "id", but for other repositories it may not. I think it is fair to hide it on the OEP.

> There is no explanation for "locationPreview". Can you help out @carstenhoyerklick ?

According to the HMC document "HMC Kernel Information Profile", page 22, it is a web-resolvable pointer to a preview, e.g. a low-resolution image of the object referenced. It comes from an RDA recommendation.

This may be relevant for non-tabular data, e.g. GIS data sets, which can be connected to a preview.

> "underEmbargoUntil" can go next to date. It's a bit awkward to implement, because one turns into the other, ideally, but if it's not actually published on the planned date there has to be a logic on the OEP to deal with that.

I think it is safe to ignore it on the OEP, as it takes only published data. But it may be relevant for other platforms.

> @carsten can you maybe elaborate on the "wasQuotedFrom" field? What's the difference between sources? Does this concern the entire dataset (i.e. this whole table is actually a quote from another resource) or is it meant to reflect sources for parts of the data. Maybe another key within sources or a redefinition to a URI would help here. I assume it's not a "quotedBy" that lists where the resource has been quoted.

What it means is that the data set being documented is quoted in another data set. It is also an RDA recommendation. It could be that the documented data set is a subset of a larger data set which has been divided. IsQuotedFrom could be an umbrella data set which references this data set as a subset. It is a kind of backpointer.

@carstenhoyerklick

> @jh-RLI and I sorted through the new tags and came up with a structure. We thought it made sense to group almost all new keys together on resource level.

I thought about it for a while, and I think we need to consider this carefully.

Some of the keys, such as alternateOf, checksum, `digitalObjectLocationAccessProtocol` or `digitalObjectType`, may belong more in the general part.

We have to think about what sources are and what revisions are. In general, if a data set is revised, the original data set is a source.
But you could also think of sources as the data sets that were used to produce the data set: the new data set has been created by a fusion/modeling process, and these are its data sources. These sources may have very different characteristics than the target data set.

Revisions are a bit different. The characteristics of the data stay basically the same. A revision may also change some of the structures of the data.

The Helmholtz Kernel Information Profile differentiates between different types of sources. wasDerivedFrom is probably closest to the sources we have. specializationOf could be a subset of a larger data set, or something similar that makes this data set more specific than the original, or a data set that has been specifically enriched. wasRevisionOf is probably more of an update of the data set. These characteristics come from RDA or PROV-DM (the PROV Data Model). Therefore I think we cannot ignore them. But we have to find a way to handle the different source-target relations that come from the PROV Data Model.
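
To make the different source-target relations a bit more tangible, here is a minimal Python sketch that picks the populated relation keys out of a "trace"-like dict and labels each with its PROV-DM concept. The helper name `populated_relations` and the output layout are hypothetical; the key names follow the draft structure posted earlier in this thread. In PROV-DM, revision, quotation and primary source are defined as special kinds of derivation, which is what the labels reflect.

```python
from typing import Dict

# Relation keys from the draft "trace" block and the PROV-DM concept behind them.
# The label texts are informal wording chosen for this sketch.
PROV_RELATIONS: Dict[str, str] = {
    "wasDerivedFrom": "derivation",
    "wasRevisionOf": "revision (a kind of derivation)",
    "wasQuotedFrom": "quotation (a kind of derivation)",
    "hadPrimarySource": "primary source (a kind of derivation)",
    "specializationOf": "specialization",
    "alternateOf": "alternate",
}


def populated_relations(trace: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Return the populated relation keys with their target and PROV-DM concept.

    Hypothetical helper; keys that are missing or empty are skipped, and keys
    that are not source-target relations (e.g. "checksum") are ignored.
    """
    return {
        key: {"target": trace[key], "prov_concept": concept}
        for key, concept in PROV_RELATIONS.items()
        if trace.get(key)
    }


trace = {
    "wasRevisionOf": "https://example.org/dataset/v1",  # placeholder URL
    "wasDerivedFrom": "",
    "checksum": "abc123",
}
print(populated_relations(trace))
```

Keeping the relation type explicit like this would let a platform treat a revision differently from a source that merely fed into a modelling step.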

@jh-RLI
Contributor

jh-RLI commented Oct 11, 2024

@Ludee We should take another look at the last two comments.

@carstenhoyerklick

I have implemented the Helmholtz KIP information a bit differently in the Open Transport Metadata. Maybe we could try to align this.
(three attached images)

@Ludee
Member

Ludee commented Oct 11, 2024

From my point of view this is a huge overload of the metadata standard. One major principle of OEMetadata was to keep it as simple as possible. Each topic and key should be discussed separately in order to be added.
For now I will remove all keys because none of them are relevant to the OEP at the moment.

Ludee added a commit that referenced this issue Oct 11, 2024
Ludee added a commit that referenced this issue Oct 11, 2024
@jh-RLI
Contributor

jh-RLI commented Oct 23, 2024

I think most of the keys (the more technical ones) are already included in the metadata layer that the MOSS tool provides on top of the oemetadata. They will be available as soon as the data is registered there. Keeping the oemetadata leaner and then linking to other resources is a good idea, I think. For now, I think we can close this issue.

@carstenhoyerklick

Fine with me.

@Ludee Ludee closed this as completed Oct 23, 2024
Projects
Status: Done

5 participants