Rework Data Nodes #93

Merged
merged 4 commits into from
Feb 8, 2024

Conversation

HLWeil
Member

@HLWeil HLWeil commented Jan 24, 2024

Data Selectors

This PR adds the specification for annotating not only full data resources but also parts of them. To do so, a selector can be appended after the resource location, separated by a #.
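
As a minimal sketch of how such a value could be read (not the reference implementation), the snippet below splits a data node value at the first # into resource location and selector. The function name, the type, and the example path are hypothetical; the selector `col=2` borrows the column syntax of CSV fragment identifiers (RFC 7111) purely for illustration.

```python
from typing import NamedTuple, Optional


class DataReference(NamedTuple):
    """A data node value split into its two parts (hypothetical helper type)."""
    resource: str             # location of the data resource
    selector: Optional[str]   # fragment selector, or None if the whole resource is meant


def parse_data_reference(value: str) -> DataReference:
    """Split a data node value at the first '#' into resource and selector."""
    resource, separator, selector = value.partition("#")
    # No '#' present, or nothing after it: the annotation targets the full resource.
    if not separator or not selector:
        return DataReference(resource, None)
    return DataReference(resource, selector)


# Hypothetical example values, with and without a selector.
print(parse_data_reference("assays/measurement1/dataset/table.csv#col=2"))
print(parse_data_reference("assays/measurement1/dataset/table.csv"))
```

Keeping both parts in one string means a single cell can be copied around as is, while a consumer can still recover the two parts with one split.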

This design is heavily inspired by the fragment selectors found in URLs and has a twofold advantage over the solution proposed in #80 (comment), where the selector is moved into a separate column:

  1. In standard cases, a single column suffices, making the information more compact (and e.g. more easily copyable)
  2. This more closely resembles URIs, potentially being more intuitive for data annotation experts

To support non-standard cases and make the annotation more explicit, two qualifying columns were added, closely following the proposal made by @stain in ISA-tools/isa-specs#15 (comment). This is in line with Schema.org/CreativeWork, and I hope it increases compatibility with RO-Crate.

Data Category annotation

Additionally, for specifying the Input and Output of an annotation table, I removed all distinctions based on the content of the data resource (Raw Data File, Derived Data File and Image File). This is in line with many discussions on this topic, which concluded that the distinction is somewhat artificial. I also decided against Data File and Data Directory: this distinction, too, tries to add information but by design excludes cases that fall under neither category.

Any input would be welcome
@kappe-c @chgarth @muehlhaus @Brilator

@Brilator
Member

I understand this as a nice additional feature, not a must.

  • should pair well with the isa.dataset.xlsx / data dictionary discussed, without making it obsolete
  • makes the flow to data node more explicit

A little off topic, but I wonder: wouldn't it then be consistent to remove the "artificial" complexity from Source / Sample / Material nodes as well (plus adding a similar layer that allows annotating the type of sample, just like the format of data)?

@HLWeil
Member Author

HLWeil commented Jan 25, 2024

> I understand this as a nice additional feature, not a must.

I've heard this kind of comment a few times now. IMO in order to actually produce a machine actionable representation of a research cycle, this is definitely a MUST. If this is not given, associating data to the samples it was measured from will remain implicit.

With all the other points I agree.

@HLWeil
Member Author

HLWeil commented Jan 25, 2024

We will need some great tooling though to allow both programmers and wet lab researchers to create these selectors without much hassle.

@Brilator
Member

> produce a machine actionable representation of a research cycle, this is definitely a MUST

Totally agree. I just thought that's what the ISA extension with isa.dataset.xlsx is good for

@HLWeil
Member Author

HLWeil commented Jan 25, 2024

> Totally agree. I just thought that's what the ISA extension with isa.dataset.xlsx is good for

The selector will be part of both the annotation table (assay and study files) and the dataset table (dataset file). In the annotation table, its main purpose is to connect the data fragments to the samples, basically anchoring them in the process graph.
The dataset file, on the other hand, is then used to add further annotations about the fragments in the data files. It is less about the "from where" and "to where" and more about the "what".

So the two additions will work together but do not fulfill the same task.

@kappe-c

kappe-c commented Jan 26, 2024

Having talked about this with @HLWeil in person, I agree with this approach.
Commenting on the remarks about the separate dataset file, it was my understanding that several samples may "end up" in (better: contribute to), e.g., the same column in a tabular data file. That, I think, is another reason for the "orthogonal" dataset file: then every file fragment needs to be described only once (as one row in the dataset file, instead of a column or more in the assay file, that would potentially have the same value for several rows (=samples) – tedious and error-prone).
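
A rough sketch of that situation, with made-up sample names, paths, and selectors: the annotation table repeats the same fragment reference for every contributing sample, while the dataset file describes each fragment only once. The column names and the `samples_per_fragment` helper are assumptions for illustration, not part of the specification.

```python
from collections import defaultdict

# Annotation table (assay file): one row per sample, linking it to a data fragment.
# All values here are hypothetical.
annotation_rows = [
    {"sample": "Sample A", "data": "table.csv#col=2"},
    {"sample": "Sample B", "data": "table.csv#col=2"},
    {"sample": "Sample C", "data": "table.csv#col=3"},
]

# Dataset file: one row per fragment, describing what the fragment contains.
dataset_rows = {
    "table.csv#col=2": {"description": "intensity"},
    "table.csv#col=3": {"description": "retention time"},
}


def samples_per_fragment(rows):
    """Group sample names by the data fragment they contribute to."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["data"]].append(row["sample"])
    return dict(grouped)


# Each fragment description is looked up once, however many samples point to it.
for fragment, samples in samples_per_fragment(annotation_rows).items():
    print(fragment, samples, dataset_rows[fragment]["description"])
```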

@HLWeil
Member Author

HLWeil commented Feb 8, 2024

Thanks for your input @Brilator & @kappe-c!
Will merge now.

@HLWeil HLWeil merged commit 604a083 into v2.0.0 Feb 8, 2024
@Freymaurer Freymaurer mentioned this pull request Feb 21, 2024
@HLWeil HLWeil deleted the selector branch October 29, 2024 12:23
3 participants