Attached detached lite #390

Open: wants to merge 28 commits into base: main

@ptsefton (Contributor) commented Jan 10, 2025

This is another approach to clarifying Attached vs Detached that does not introduce much new terminology or conformsTo etc. Again, this is a first pass to see if this approach makes sense; it will need checking.

I moved a bit of stuff around in structure.md, but there are no major changes there except to cover how to deal with a Detached RO-Crate Package. Also, I did not try to deal with how you'd link one to a website.

In the section on Data Entities I further tidied up the logic around @ids and contentUrls -- I think this has made it clearer, and I don't think it will be hard to implement.

Thanks for your feedback @simleo about the complexity in my last try.

@ptsefton ptsefton requested review from elichad, stain and simleo and removed request for elichad January 10, 2025 04:49
docs/_specification/1.2-DRAFT/data-entities.md (outdated)
* an absolute URI
* a local reference beginning with `#`

For an _Attached RO-Crate Package_:
* The `@id` MUST be a relative path that resolves to a directory that is present in the _RO-Crate Root_.

Contributor

A Dataset in an attached crate can also be web-based, with an absolute URI as its @id: this is already allowed in RO-Crate 1.1, I don't think we should change that. The requirement that the directory be present is also absent in 1.1: adding it would basically force validators to perform a check on the file system for every Dataset in the crate, which could be quite expensive.

Contributor Author

OK - I don't think I looked at this in this edit -- what should it say?

Contributor

I think the whole paragraph "For an _Attached RO-Crate Package_ [...] in the _RO-Crate Root_." should be removed.

Contributor

Change to:

If the @id is a relative path, then it MUST resolve to a directory which must be present in the RO-Crate Root along with its parent directories.

Contributor

The folder may be empty, e.g. because of the contentUrl trick above.

Contributor

> The requirement that the directory be present is also absent in 1.1

@simleo I checked this on a call with @stain, and while the requirement isn't stated in this part of the spec, it is covered elsewhere.

In 1.1, data-entities.md has:

A Dataset (directory) Data Entity MUST have the following properties:

  • @type MUST be Dataset or an array where Dataset is one of the values.
  • @id MUST be either a URI Path relative to the RO Crate root, or an absolute URI. The id SHOULD end with /

But structure.md also has:

"Data Entities in the RO-Crate MUST either be payload files/directories present within the RO-Crate root directory or its subdirectories, or be Web-based Data Entities."

So in 1.2 this specific requirement is just clearer since the info is in one place.
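
For illustration only, the per-Dataset file system check discussed in this thread could look roughly like the sketch below. The helper names `is_relative_path_id` and `missing_directories` are hypothetical, the graph is a made-up example, and a temporary directory stands in for the RO-Crate Root; nothing here is taken from the spec or from an existing validator.

```python
import tempfile
from pathlib import Path
from urllib.parse import unquote, urlparse


def is_relative_path_id(entity_id: str) -> bool:
    """True for @ids that are neither absolute URIs nor local '#' references."""
    if entity_id.startswith("#"):
        return False
    return urlparse(entity_id).scheme == ""


def missing_directories(crate_root, graph):
    """Return the relative Dataset @ids that do not resolve to a directory."""
    root = Path(crate_root)
    missing = []
    for entity in graph:
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        entity_id = entity.get("@id", "")
        if "Dataset" not in types or not is_relative_path_id(entity_id):
            continue  # not a Dataset, or web-based / local reference
        # One file system lookup per relative Dataset: this is the cost raised above.
        if not (root / unquote(entity_id)).is_dir():
            missing.append(entity_id)
    return missing


# Usage with a throwaway directory standing in for the RO-Crate Root.
with tempfile.TemporaryDirectory() as crate_root:
    (Path(crate_root) / "data").mkdir()
    graph = [
        {"@id": "./", "@type": "Dataset"},
        {"@id": "data/", "@type": "Dataset"},
        {"@id": "absent/", "@type": ["Dataset", "Collection"]},
        {"@id": "https://example.com/remote/", "@type": "Dataset"},
        {"@id": "notes.txt", "@type": "File"},
    ]
    missing = missing_directories(crate_root, graph)

print(missing)
```

Web-based Datasets (absolute URIs) are skipped entirely, so the cost scales with the number of relative Dataset entities only.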

- References to files and directories in the RO-Crate Metadata Document are present in the RO-Crate or available online as [Web-based Data Entities](data-entities.html#web-based-data-entities).
2. A _Detached RO-Crate Package_:
- Is defined by a stand alone RO-Crate metadata document which may be stored in a file or distributed via an API.
- If stored in a file, known as a _Detached RO-Crate Metadata File_, the filename SHOULD be `${slug}-ro-crate-metadata.json` where the variable `$slug` is a human readable version of the dataset's ID or name, to signal that the document should be interpreted as part of an _Attached RO-Crate Data Package_.

Contributor

This looks like something that could break the algorithm for finding the metadata file and root data entity. I also have trouble understanding the use case, which doesn't seem to be mentioned elsewhere: can a detached crate be part of an attached crate? What would that imply?

Contributor Author

It should not affect finding the RO Crate Metadata description, as that is ALWAYS ro-crate-metadata.json in all cases -- it is essentially a magic string (this is already noted for the case where you get a crate over an API).

The intention of the proposed changes here is that when you are dealing with a Detached crate, an algorithm should never have to find it; it will be passed in directly, either as a file, a string or a URI to some endpoint. The point of recommending this slug is twofold: firstly, to distinguish files people might have in their downloads (at the moment you end up with a lot of ro-crate-metadata.json files), and secondly, to STOP detached crates from being accidentally treated as Attached RO-Crate Packages.

If this makes sense then I will look at making it clearer in the spec.
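
As a rough sketch of the naming convention described in this comment, the snippet below builds a `${slug}-ro-crate-metadata.json` filename and classifies a filename by whether it carries a slug prefix. The slug rule (lowercase, runs of non-alphanumerics collapsed to `-`) and the function names are assumptions made for the example; the draft text only says the slug is a human readable version of the dataset's ID or name.

```python
import re

METADATA_FILENAME = "ro-crate-metadata.json"


def slugify(name: str) -> str:
    """Hypothetical slug rule: lowercase, non-alphanumeric runs become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def detached_filename(name: str) -> str:
    """Build a filename of the form ${slug}-ro-crate-metadata.json."""
    slug = slugify(name)
    if not slug:
        raise ValueError("name yields an empty slug")
    return f"{slug}-{METADATA_FILENAME}"


def classify(filename: str) -> str:
    """Guess how a metadata file is meant to be interpreted from its name."""
    if filename == METADATA_FILENAME:
        return "attached"   # the exact name marks an RO-Crate Root
    if filename.endswith("-" + METADATA_FILENAME):
        return "detached"   # slug-prefixed: not to be treated as a crate root
    return "unknown"


print(detached_filename("My Survey Data 2025"))
print(classify("ro-crate-metadata.json"))
print(classify("my-survey-data-2025-ro-crate-metadata.json"))
```

Because the prefixed name never equals `ro-crate-metadata.json`, a tool that scans directories for crate roots would not pick up a downloaded detached file by accident.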

@simleo (Contributor), Jan 13, 2025

OK, I see. Then the paragraph should end with "... to signal that the document should be interpreted as part of a Detached RO-Crate Data Package"

Contributor Author

I reworded it

docs/_specification/1.2-DRAFT/structure.md (outdated)

At the basic level, an Attached RO-Crate is a collection of files and resources represented as a Schema.org [Dataset], that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc. The _RO-Crate Metadata Document_ describes the RO-Crate, and MUST be stored in the _RO-Crate Root_.
In a _Detached RO-Crate Package_ the [root data entity](root-data-entity) SHOULD have an `@id` which is a URL that resolves to the _RO-Crate Metadata Document_.

Contributor

Why would the @id of the root data entity resolve to the metadata document? They are two separate entities with different ids.

Contributor Author

If you have a detached crate then in many cases it will have some kind of online home at an API or a website. This is saying that the @id should point at that, so it should say something like:

In a Detached RO-Crate Package the root data entity SHOULD have an @id which is a URL that resolves to an online source for the RO-Crate Metadata Document, which may be on the web or available over an API.

Does that make sense to you, @simleo?

Contributor

I guess I'm heavily influenced by the pre-1.2 mindset. So in a detached crate the metadata file descriptor will always have the "magic" string ro-crate-metadata.json as @id, while the root data entity's @id will (possibly) point to a resource where the RO-Crate metadata can actually be retrieved, right? I think changing "SHOULD have an @id which is a URL" to "SHOULD have an @id which is an absolute URL" would help make things clearer.
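
The arrangement described in this comment can be sketched as follows: the metadata descriptor keeps the fixed `@id` `ro-crate-metadata.json`, and its `about` property points at a root data entity whose `@id` is an absolute URL. The URL `https://example.com/api/crates/42`, the entity name, and the helper functions are invented for the example, and only the `@graph` is shown (no `@context`).

```python
from urllib.parse import urlparse

DESCRIPTOR_ID = "ro-crate-metadata.json"  # the fixed "magic" string

# Hypothetical @graph of a detached crate; URL and name are made up.
graph = [
    {
        "@id": DESCRIPTOR_ID,
        "@type": "CreativeWork",
        "about": {"@id": "https://example.com/api/crates/42"},
    },
    {
        "@id": "https://example.com/api/crates/42",
        "@type": "Dataset",
        "name": "Example detached crate",
    },
]


def find_root(graph):
    """Follow the descriptor's `about` reference to the root data entity."""
    by_id = {entity["@id"]: entity for entity in graph}
    descriptor = by_id[DESCRIPTOR_ID]
    return by_id[descriptor["about"]["@id"]]


def has_absolute_id(entity) -> bool:
    """True when the @id has both a scheme and a network location."""
    parsed = urlparse(entity["@id"])
    return bool(parsed.scheme and parsed.netloc)


root = find_root(graph)
print(root["@id"], has_absolute_id(root))
```

The lookup never depends on a file being named `ro-crate-metadata.json`; it only relies on the descriptor entity carrying that `@id` inside the document.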

- References to files and directories in the RO-Crate Metadata Document are present in the RO-Crate or available online as [Web-based Data Entities](data-entities.html#web-based-data-entities).
2. A _Detached RO-Crate Package_:
- Is defined by a stand alone RO-Crate metadata document which may be stored in a file or distributed via an API.
- If stored in a file, known as a _Detached RO-Crate Metadata File_, the filename SHOULD be `${slug}-ro-crate-metadata.json` where the variable `$slug` is a human readable version of the dataset's ID or name, to signal that the document should be interpreted as part of an _Attached RO-Crate Data Package_.

Contributor

Suggested change (the first line is replaced by the second):
- If stored in a file, known as a _Detached RO-Crate Metadata File_, the filename SHOULD be `${slug}-ro-crate-metadata.json` where the variable `$slug` is a human readable version of the dataset's ID or name, to signal that the document should be interpreted as part of an _Attached RO-Crate Data Package_.
- If stored in a file, known as a _Detached RO-Crate Metadata File_, the filename SHOULD be `${slug}-ro-crate-metadata.json` where the variable `$slug` is a human readable version of the dataset's ID or name, to signal that the document should be interpreted as part of a _Detached RO-Crate Data Package_.

@stain (Contributor) left a comment

Went through with @elichad and we have indicated some changes. I didn't complete the review.

b. An Absolute URI indicating that the entity is a [Web-based Data Entity](#web-based-data-entity).

2. For a _Detached RO-Crate Package_ all [File] Data Entities are [Web-based Data Enties](#web-based-data-entity)
* If a `contentUrl`is present: `@id` MUST be a A valid relative URI reference and `contentURL` must be an absolute URI. The presence of the `contentUrl` property is an indication that the File content may be sourced from that URL and if the _Detached RO-Crate Package_ were to be converted to an _Attached RO-Crate Package_ the `@id` indicates the `filePath` to use for saving a local copy of the [File].

Contributor

I can't allow relative paths of data entities in a detached RO-Crate.

For instance, if http://example.com/crate?id=15 and http://example.com/crate?id=83 both declare a "@id": "file.txt", then these relative paths both resolve to http://example.com/file.txt when interpreted as JSON-LD. And it makes it crucial to be able to detect if something is a detached crate, to avoid fetching these deliberately broken URIs.
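
The collision described here can be reproduced with standard relative-reference resolution. The sketch below uses Python's `urllib.parse.urljoin` with the two crate URLs from the comment as base URIs; treating the URL the crate was retrieved from as the JSON-LD base is the assumption being illustrated.

```python
from urllib.parse import urljoin

# The two detached crates from the example, served from different query URLs.
crate_a = "http://example.com/crate?id=15"
crate_b = "http://example.com/crate?id=83"

# Each crate declares a data entity with the same relative @id.
relative_id = "file.txt"

# Resolution drops the query string and replaces the last path segment.
resolved_a = urljoin(crate_a, relative_id)
resolved_b = urljoin(crate_b, relative_id)

print(resolved_a)
print(resolved_b)
print(resolved_a == resolved_b)
```

Both entities end up with the same absolute identifier even though they belong to different crates.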

Contributor

#337 suggested a new property, localPath, holding a string that gives the desired path if one chooses to download the file. But in that case the @id would need rewriting afterwards.


Further constraints on the `@id` are dependent on whether the [File] entity is being considered as part of an _Attached RO-Crate Package_ or _Detached RO-Crate Package_.

If an `@id` is a relative URI then it is treated as a `filePath`, which is calculated by appending the `@id` to the `RO-Crate Root`.

Contributor

Avoid using code markup for `filePath`, as it looks like a JSON key; I think plain "file path" may work.

* The `@id` MUST be a relative path that resolves to a directory that is present in the _RO-Crate Root_.

For a _Detached RO-Crate Package_:
* If the `@id` is a _URI Path it MAY be used to create a directory and MAY resolve to a service which returns a list of files

Contributor

Similar issue as above, e.g. `@id: "folder/subfolder/"` from detached crate `http://example.com/api/get_crate?1239` would wrongly resolve to `http://example.com/api/folder/subfolder/`

@@ -67,20 +67,19 @@ property referencing the _Root Data Entity_'s `@id`.
}
```

{% include callout.html type="note" content="Even in [Detached RO-Crates](structure#detached-ro-crate) which do not have an _RO-Crate Metadata File_ present, the identifier `ro-crate-metadata.json` MUST be used." %}
{: .note}

Contributor

Change back to {% include callout... style


{% include callout.html type="tip" content="The `conformsTo` property MAY be an array, to additionally indicate
specializing [RO-Crate profiles](profiles)." %}
{% include callout.html type="tip" content="In RO-Crate 1,2 The `conformsTo` property MAY not have more than one value, to additionally indicate

Contributor

I may or may not agree with this (breaks Workflow RO-Crate profile 1.0) and I see this is because the profiles should now be on the root Dataset, but I don't think this change should be snuck into this PR.

Contributor

This should also be documented better, as a separate PR or issue, since technically it's a breaking change. I think we can still do it for 1.2, because it's the very same spot where you see whether it's 1.1 or 1.2.

@@ -128,14 +127,15 @@ be minimally valid.

## Direct properties of the Root Data Entity

The _Root Data Entity_ MUST have the following properties:
The _Root Data Entity_ of a _Valid RO-Crate Dataset_ MUST have the following properties:

Contributor

The term _Valid RO-Crate Dataset_ is new; what does it mean?


* `@type`: MUST be [Dataset] or an array that contains `Dataset`
* `@id`: SHOULD be the string `./` or an absolute URI (see [below](#root-data-entity-identifier))
* `name`: SHOULD identify the dataset to humans well enough to disambiguate it from other RO-Crates
* `description`: SHOULD further elaborate on the name to provide a summary of the context in which the dataset is important.
* `datePublished`: MUST be a single string value in [ISO 8601 date format][DateTime], SHOULD be specified to at least the precision of a day, and MAY be a timestamp down to the millisecond.
* `license`: SHOULD link to a _Contextual Entity_ or _Data Entity_ in the _RO-Crate Metadata Document_ with a name and description (see section on [licensing](contextual-entities#licensing-access-control-and-copyright)). MAY, if necessary, be a textual description of how the RO-Crate may be used.
* `conformsTo` with a value of {"@id": "https://w3id.org/ro/crate/1.2-DRAFT#ro-crate-dataset"} (may be an array with multiple values)

Contributor

This must be a remnant from the 2.0 modularisation idea in #388 and should be removed from this PR

RO-Crates that have been assigned a _persistent identifier_ (e.g. a DOI) SHOULD indicate this using [identifier] on the Root Data Entity using the approach set out in the [Science On Schema.org guides], that is, through a `PropertyValue`.

{% include callout.html type="note" content="RO-Crate 1.1 and earlier recommended `identifier` to be plain string URIs. Clients SHOULD be permissive of an RO-Crate `identifier` being a string (which MAY be a URI), or a `@id` reference, which SHOULD be represented as an `PropertyValue` entity which MUST have a human readable `value`, and SHOULD have a `url` if the identifier is Web-resolvable. A citable representation of this persistent identifier MAY be given as a `description` of the `PropertyValue`, but as there are more than 10,000 known [citation styles], no attempt should be made to parse this string." %}
{: note}

Contributor

Use callout block instead
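
Separately from the formatting point, the note under review asks clients to be permissive about `identifier`, accepting either a plain string or an `@id` reference to a `PropertyValue`. A minimal sketch of such a reader is shown below; the function name, the example identifiers, and the fallback to the bare `@id` when the reference is not described in the graph are all assumptions for illustration.

```python
def identifier_values(root, by_id):
    """Collect human-readable identifier strings from a Root Data Entity."""
    raw = root.get("identifier", [])
    if not isinstance(raw, list):
        raw = [raw]
    values = []
    for item in raw:
        if isinstance(item, str):
            # RO-Crate 1.1 style: a plain string, which may be a URI.
            values.append(item)
        elif isinstance(item, dict) and "@id" in item:
            target = by_id.get(item["@id"])
            if target is not None and "value" in target:
                # PropertyValue style: use its human-readable `value`.
                values.append(target["value"])
            else:
                # Assumed fallback: reference not described in the graph.
                values.append(item["@id"])
    return values


# Hypothetical entities; the identifiers are invented for the example.
property_value = {
    "@id": "#pid",
    "@type": "PropertyValue",
    "value": "example:123",
    "url": "https://example.org/id/123",
}
by_id = {"#pid": property_value}

modern_root = {"@id": "./", "@type": "Dataset", "identifier": {"@id": "#pid"}}
legacy_root = {"@id": "./", "@type": "Dataset", "identifier": "https://example.org/id/123"}

print(identifier_values(modern_root, by_id))
print(identifier_values(legacy_root, by_id))
```

The `description` of the `PropertyValue` is deliberately left untouched, in line with the note's advice not to parse citation strings.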


4 participants