This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Fix specifications navigation/formatting #46

Merged
merged 9 commits into from
Mar 29, 2024
177 changes: 72 additions & 105 deletions content/docs/specifications/data-package.md
@@ -38,25 +38,25 @@ The data included in the package can be provided as:
- Remote resources, referenced by URL
- "Inline" data (see below) which is included directly in the descriptor

### Illustrative Structure
## Structure

A minimal data package on disk would be a directory containing a single file:

```
```text
datapackage.json # (required) metadata and schemas for this data package
```

Without any data, such a package would be of limited use. A slightly less minimal version would be:

```
```text
datapackage.json
# a data file (CSV in this case)
data.csv
```

Additional files such as a README, scripts (for processing or analyzing the data) and other material may be provided. By convention scripts go in a scripts directory and thus, a more elaborate data package could look like this:

```
```text
datapackage.json # (required) metadata and schemas for this data package
README.md # (optional) README in markdown format

@@ -68,194 +68,173 @@ data/otherdata.csv
scripts/my-preparation-script.py
```

Several example data packages can be found in the [datasets organization on github][datasets], including:

- [World GDP][gdp]
- [ISO 3166-2 country codes][3166]

[datasets]: https://github.com/datasets
[gdp]: https://github.com/datasets/gdp
[3166]: https://github.com/datasets/country-codes
Several example data packages can be found in the [datasets organization on github](https://github.com/datasets), including:

## Specification
- [World GDP](https://github.com/datasets/gdp)
- [ISO 3166-2 country codes](https://github.com/datasets/country-codes)

### Descriptor
## Descriptor

The descriptor is the central file in a Data Package. It provides:

- General metadata such as the package's title, license, publisher etc
- A list of the data "resources" that make up the package including their location on disk or online and other relevant information (including, possibly, schema information about these data resources in a structured form)

A Data Package descriptor `MUST` be a valid JSON `object`. (JSON is defined in [RFC 4627][]). When available as a file it `MUST` be named `datapackage.json` and it `MUST` be placed in the top-level directory (relative to any other resources provided as part of the data package).

[RFC 4627]: http://www.ietf.org/rfc/rfc4627.txt
A Data Package descriptor `MUST` be a valid JSON `object`. (JSON is defined in [RFC 4627](http://www.ietf.org/rfc/rfc4627.txt)). When available as a file it `MUST` be named `datapackage.json` and it `MUST` be placed in the top-level directory (relative to any other resources provided as part of the data package).

The descriptor `MUST` contain a `resources` property describing the data resources.

All other properties are considered `metadata` properties. The descriptor `MAY` contain any number of other `metadata` properties. The following sections provide a description of required and optional metadata properties for a Data Package descriptor.

Adherence to the specification does not imply that additional, non-specified properties cannot be used: a descriptor `MAY` include any number of properties in addition to those described as required and optional properties. For example, if you were storing time series data and wanted to list the temporal coverage of the data in the Data Package you could add a property `temporal` (cf [Dublin Core][dc-temporal]):
Adherence to the specification does not imply that additional, non-specified properties cannot be used: a descriptor `MAY` include any number of properties in addition to those described as required and optional properties. For example, if you were storing time series data and wanted to list the temporal coverage of the data in the Data Package you could add a property `temporal` (cf [Dublin Core](http://dublincore.org/documents/usageguide/qualifiers.shtml#temporal)):

```javascript
```json
"temporal": {
"name": "19th Century",
"start": "1800-01-01",
"end": "1899-12-31"
}
```

This flexibility enables specific communities to extend Data Packages as appropriate for the data they manage. As an example, the [Tabular Data Package][tdp] specification extends Data Package to the case where all the data is tabular and stored in CSV.

[tdp]: /tabular-data-package/
This flexibility enables specific communities to extend Data Packages as appropriate for the data they manage. As an example, the [Tabular Data Package](https://specs.frictionlessdata.io/tabular-data-package/) specification extends Data Package to the case where all the data is tabular and stored in CSV.

Here is an illustrative example of a datapackage JSON file:

```javascript
```json
{
# general "metadata" like title, sources etc
"name" : "a-unique-human-readable-and-url-usable-identifier",
"title" : "A nice title",
"licenses" : [ ... ],
"sources" : [...],
# list of the data resources in this data package
"sources" : [ ... ],
"resources": [
{
... resource info described below ...
...
}
],
# optional
... additional information ...
]
}
```

### Resource Information
## Properties

Packaged data resources are described in the `resources` property of the package descriptor. This property `MUST` be an array of `objects`. Each object `MUST` follow the [Data Resource specification][dr].
A Data Package descriptor `MUST` have a `resources` property and `SHOULD` have `name`, `id`, `licenses`, and `profile` properties.

[dr]: /data-resource/

### Metadata

#### Required Properties
### `resources` [required]

The `resources` property is `REQUIRED`, with at least one resource.

#### Recommended Properties

In addition to the required properties, the following properties `SHOULD` be included in every package descriptor:
Packaged data resources are described in the `resources` property of the package descriptor. This property `MUST` be an array of `objects`. Each object `MUST` follow the [Data Resource](../data-resource/) specification.
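
The structural requirements on the descriptor and its `resources` property can be checked mechanically. The following is a minimal sketch, not a full validator: the function name `check_descriptor` and the wording of the messages are hypothetical, and it does not check each resource against the Data Resource specification.

```python
import json


def check_descriptor(descriptor):
    """Return a list of problems found in a parsed Data Package descriptor.

    Hypothetical helper: it only covers the rules stated above
    (JSON object, `resources` array of objects, at least one resource).
    """
    if not isinstance(descriptor, dict):
        return ["descriptor must be a JSON object"]
    problems = []
    resources = descriptor.get("resources")
    if not isinstance(resources, list):
        problems.append("`resources` must be an array")
    elif len(resources) == 0:
        problems.append("`resources` must contain at least one resource")
    elif not all(isinstance(item, dict) for item in resources):
        problems.append("every item in `resources` must be an object")
    return problems


# Example input; the resource content here is illustrative only.
descriptor = json.loads('{"name": "gdp", "resources": [{"path": "data.csv"}]}')
print(check_descriptor(descriptor))
```

An empty list means none of these structural rules were violated; it says nothing about the validity of the individual resources.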

##### `name`
### `name`

The name is a simple name or identifier to be used for this package in relation to any registry in which this package will be deposited.

- It `SHOULD` be human-readable and consist only of lowercase alphanumeric characters plus ".", "-" and "\_".
- It `SHOULD` be unique in relation to any registry in which this package will be deposited (and preferably globally unique).
- It `SHOULD` be invariant, meaning that it `SHOULD NOT` change when a data package is updated, unless the new package version `SHOULD` be considered a distinct package, e.g. due to significant changes in structure or interpretation. Version distinction `SHOULD` be left to the version property. As a corollary, the name also `SHOULD NOT` include an indication of time range covered.
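
The character rule in the first bullet lends itself to a simple pattern check. This is a sketch under the assumption that "alphanumeric" means ASCII letters and digits; `is_recommended_name` is a hypothetical helper, and because the rule is a `SHOULD`, a failing name is a warning rather than an error. Uniqueness and invariance cannot be checked this way.

```python
import re

# Lowercase ASCII letters, digits, ".", "-" and "_" (assumed reading of the rule).
NAME_PATTERN = re.compile(r"[a-z0-9._-]+")


def is_recommended_name(name):
    """Return True if `name` uses only the recommended characters."""
    return isinstance(name, str) and NAME_PATTERN.fullmatch(name) is not None


print(is_recommended_name("a-unique-human-readable-and-url-usable-identifier"))
print(is_recommended_name("World GDP"))
```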

##### `id`
### `id`

A property reserved for globally unique identifiers. Examples of identifiers that are unique include UUIDs and DOIs.

A common usage pattern for Data Packages is as a packaging format within the bounds of a system or platform. In these cases, a unique identifier for a package is desired for common data handling workflows, such as updating an existing package. While at the level of the specification, global uniqueness cannot be validated, consumers using the `id` property `MUST` ensure identifiers are globally unique.

Examples:

```javascript
```json
{
"id": "b03ec84-77fd-4270-813b-0c698943f7ce"
}
```

```javascript
```json
{
"id": "https://doi.org/10.1594/PANGAEA.726855"
}
```
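
One way a platform might satisfy the uniqueness requirement is to assign a random UUID when a package has no `id` yet. This is only a sketch of that approach: `ensure_id` is a hypothetical helper, and random UUIDs are unique with overwhelming probability rather than by guarantee.

```python
import uuid


def ensure_id(descriptor):
    """Return a copy of the descriptor that has an `id`, keeping any existing one."""
    if "id" in descriptor:
        return dict(descriptor)
    # uuid4() generates a random (version 4) UUID.
    return {**descriptor, "id": str(uuid.uuid4())}


package = ensure_id({"name": "gdp", "resources": [{"path": "data.csv"}]})
print(package["id"])
```

An existing identifier, such as a DOI, is left untouched so that updates to a package keep referring to the same `id`.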

##### `licenses`
### `licenses`

The license(s) under which the package is provided.

**This property is not legally binding and does not guarantee the package is licensed under the terms defined in this property.**
:::caution
This property is not legally binding and does not guarantee the package is licensed under the terms defined in this property.
:::

`licenses` `MUST` be an array. Each item in the array is a License. Each `MUST` be an `object`. The object `MUST` contain a `name` property and/or a `path` property. It `MAY` contain a `title` property.

Here is an example:

```javascript
```json
"licenses": [{
"name": "ODC-PDDL-1.0",
"path": "http://opendatacommons.org/licenses/pddl/",
"title": "Open Data Commons Public Domain Dedication and License v1.0"
}]
```

- `name`: The `name` `MUST` be an [Open Definition license ID][od-licenses]
- `path`: A [url-or-path][] string, that is a fully qualified HTTP address, or a relative POSIX path (see [the url-or-path definition in Data Resource for details][url-or-path]).
- `name`: The `name` `MUST` be an [Open Definition license ID](http://licenses.opendefinition.org/)
- `path`: A [url-or-path](../data-resource/#url-or-path) string, that is a fully qualified HTTP address, or a relative POSIX path.
- `title`: A human-readable title.
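
The shape rules for `licenses` can be sketched as a small check. `check_licenses` is a hypothetical helper; it verifies only the array/object structure and the presence of `name` and/or `path`, and does not look up `name` against the Open Definition license list.

```python
def check_licenses(licenses):
    """Return a list of problems with the value of a `licenses` property."""
    if not isinstance(licenses, list):
        return ["`licenses` must be an array"]
    problems = []
    for index, item in enumerate(licenses):
        if not isinstance(item, dict):
            problems.append(f"licenses[{index}] must be an object")
        elif "name" not in item and "path" not in item:
            problems.append(f"licenses[{index}] needs `name` and/or `path`")
    return problems


licenses = [{
    "name": "ODC-PDDL-1.0",
    "path": "http://opendatacommons.org/licenses/pddl/",
    "title": "Open Data Commons Public Domain Dedication and License v1.0",
}]
print(check_licenses(licenses))
```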

[od-licenses]: http://licenses.opendefinition.org/
[od-approved]: http://opendefinition.org/licenses/
[semver]: http://semver.org
[url-or-path]: /data-resource/#url-or-path

##### `profile`
### `profile`

A string identifying the [profile][] of this descriptor as per the [profiles][profile] specification.

[profile]: /profiles/
A string identifying the profile of this descriptor as per the [profiles](https://specs.frictionlessdata.io/profiles/) specification.

Examples:

```javascript
```json
{
"profile": "tabular-data-package"
}
```

```javascript
```json
{
"profile": "http://example.com/my-profiles-json-schema.json"
}
```

#### Optional Properties

The following are commonly used properties that the package descriptor `MAY` contain:

##### `title`
### `title`

A `string` providing a title or one-sentence description for this package.

##### `description`
### `description`

A description of the package. The description `MUST` be [markdown][] formatted -- this also allows for simple plain text as plain text is itself valid markdown. The first paragraph (up to the first double line break) `SHOULD` be usable as summary information for the package.
A description of the package. The description `MUST` be [markdown](http://commonmark.org/) formatted -- this also allows for simple plain text as plain text is itself valid markdown. The first paragraph (up to the first double line break) `SHOULD` be usable as summary information for the package.

##### `homepage`
### `homepage`

A URL for the home on the web that is related to this data package.

##### `version`
### `image`

A version string identifying the version of the package. It `SHOULD` conform to the [Semantic Versioning][semver] requirements and `SHOULD` follow the [Data Package Version](/recipes/#data-package-version) recipe.
An image to use for this data package. For example, when showing the package in a listing.

##### `sources`
The value of the image property `MUST` be a string pointing to the location of the image. The string `MUST` be a [url-or-path](../data-resource/#url-or-path), that is a fully qualified HTTP address, or a relative POSIX path.

The raw sources for this data package. It `MUST` be an array of Source objects. A Source object `MUST` have at least one property. A Source object is `RECOMMENDED` to have a `title` property and `MAY` have `path`, `email`, and `version` properties. Example:
### `version`

A version string identifying the version of the package. It `SHOULD` conform to the [Semantic Versioning](http://semver.org) requirements and `SHOULD` follow the [Data Package Version](../../recipes/data-package-version) recipe.

### `created`

The datetime on which this was created.

Note: semantics may vary between publishers -- for some this is the datetime the data was created, for others the datetime the package was created.

The datetime `MUST` conform to the string formats for datetime as described in [RFC3339](https://tools.ietf.org/html/rfc3339#section-5.6). Example:

```json
"sources": [{
"title": "World Bank and OECD",
"path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
}]
{
"created": "1985-04-12T23:20:50.52Z"
}
```
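
A `created` value can be checked against the RFC 3339 `date-time` shape with a pattern plus a calendar check. This is a sketch rather than a complete RFC 3339 parser: `is_rfc3339_datetime` is a hypothetical helper, and it does not range-check the numeric UTC offset.

```python
import re
from datetime import datetime

# date "T" time, optional fractional seconds, then "Z" or a numeric offset.
RFC3339_PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2})[Tt](\d{2}):(\d{2}):(\d{2})(\.\d+)?([Zz]|[+-]\d{2}:\d{2})",
    re.ASCII,
)


def is_rfc3339_datetime(value):
    """Return True if `value` looks like an RFC 3339 date-time string."""
    if not isinstance(value, str):
        return False
    match = RFC3339_PATTERN.fullmatch(value)
    if match is None:
        return False
    date_part, hour, minute, second = match.group(1, 2, 3, 4)
    try:
        # Rejects impossible calendar dates such as February 30.
        datetime.strptime(date_part, "%Y-%m-%d")
    except ValueError:
        return False
    # A seconds value of 60 is allowed for leap seconds.
    return int(hour) <= 23 and int(minute) <= 59 and int(second) <= 60


print(is_rfc3339_datetime("1985-04-12T23:20:50.52Z"))
```

A pattern is used here instead of `datetime.fromisoformat` because older Python versions reject the trailing `Z` and two-digit fractional seconds.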

- `title`: title of the source (e.g. document or organization name)
- `path`: A [url-or-path][] string, that is a fully qualified HTTP address, or a relative POSIX path (see [the url-or-path definition in Data Resource for details][url-or-path]).
- `email`: An email address
- `version`: A version of the source
### `keywords`

##### `contributors`
An Array of string keywords to assist users searching for the package in catalogs.

### `contributors`

The people or organizations who contributed to this Data Package. It `MUST` be an array. Each entry is a Contributor and `MUST` be an `object`. A Contributor `MUST` have at least one property. A Contributor is `RECOMMENDED` to have a `title` property and `MAY` contain `givenName`, `familyName`, `path`, `email`, `roles`, and `organization` properties. An example of the object structure is as follows:

@@ -276,7 +255,7 @@ The people or organizations who contributed to this Data Package. It `MUST` be a
- `roles`: an array of strings describing the roles of the contributor. A role is `RECOMMENDED` to follow an established vocabulary, such as [DataCite Metadata Schema's contributorRole](https://support.datacite.org/docs/datacite-metadata-schema-v44-recommended-and-optional-properties#7a-contributortype) or [CreDIT](https://credit.niso.org/). Useful roles to indicate are: `creator`, `contact`, `rightsHolder`, and `dataCurator`.
- `organization`: a string describing the organization this contributor is affiliated to.

Use of the "creator" role does not imply that the person was the original creator of the data in the data package - merely that they created and/or maintain the data package. It is common for data packages to "package" up data from elsewhere. The origin of the data can be indicated with the `sources` property - see above.
Use of the `creator` role does not imply that the person was the original creator of the data in the data package - merely that they created and/or maintain the data package. It is common for data packages to "package" up data from elsewhere. The origin of the data can be indicated with the `sources` property - see above.

References:

@@ -286,30 +265,18 @@ References:
If the `roles` property is not provided, a data consumer `MUST` fall back to the `role` property, which was part of `v1.0` of the specification. This property has the same semantics, but it is a string and can therefore specify only a single role.
:::
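
A consumer might implement this fallback along the following lines. The helper name `contributor_roles` is hypothetical, and returning an empty list when neither property is present is a design choice of this sketch, not something the specification states.

```python
def contributor_roles(contributor):
    """Return the contributor's roles as a list, honouring the v1.0 `role` fallback."""
    roles = contributor.get("roles")
    if roles is not None:
        return list(roles)
    # v1.0 descriptors carry a single role as a string.
    role = contributor.get("role")
    return [role] if role is not None else []


print(contributor_roles({"title": "Jane Doe", "roles": ["creator", "contact"]}))
print(contributor_roles({"title": "Jane Doe", "role": "creator"}))
```

Normalising both forms to a list lets the rest of a consumer treat v1.0 and newer descriptors the same way.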

##### `keywords`

An Array of string keywords to assist users searching for the package in catalogs.

##### `image`

An image to use for this data package. For example, when showing the package in a listing.

The value of the image property `MUST` be a string pointing to the location of the image. The string `MUST` be a [url-or-path][], that is a fully qualified HTTP address, or a relative POSIX path (see [the url-or-path definition in Data Resource for details][url-or-path]).

##### `created`
### `sources`

The datetime on which this was created.

Note: semantics may vary between publishers -- for some this is the datetime the data was created, for others the datetime the package was created.

The datetime `MUST` conform to the string formats for datetime as described in [RFC3339][]. Example:
The raw sources for this data package. It `MUST` be an array of Source objects. A Source object `MUST` have at least one property. A Source object is `RECOMMENDED` to have a `title` property and `MAY` have `path`, `email`, and `version` properties. Example:

```javascript
{
"created": "1985-04-12T23:20:50.52Z"
}
```json
"sources": [{
"title": "World Bank and OECD",
"path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
}]
```

[RFC3339]: https://tools.ietf.org/html/rfc3339#section-5.6
[dc-temporal]: http://dublincore.org/documents/usageguide/qualifiers.shtml#temporal
[markdown]: http://commonmark.org/
- `title`: title of the source (e.g. document or organization name)
- `path`: A [url-or-path](../data-resource/#url-or-path) string, that is a fully qualified HTTP address, or a relative POSIX path.
- `email`: An email address
- `version`: A version of the source