diff --git a/content/docs/specifications/data-package.md b/content/docs/specifications/data-package.md index 800b582d..417f4b11 100644 --- a/content/docs/specifications/data-package.md +++ b/content/docs/specifications/data-package.md @@ -38,17 +38,17 @@ The data included in the package can be provided as: - Remote resources, referenced by URL - "Inline" data (see below) which is included directly in the descriptor -### Illustrative Structure +## Structure A minimal data package on disk would be a directory containing a single file: -``` +```text datapackage.json # (required) metadata and schemas for this data package ``` Lacking a single external source of data would make this of limited use. A slightly less minimal version would be: -``` +```text datapackage.json # a data file (CSV in this case) data.csv @@ -56,7 +56,7 @@ data.csv Additional files such as a README, scripts (for processing or analyzing the data) and other material may be provided. By convention scripts go in a scripts directory and thus, a more elaborate data package could look like this: -``` +```text datapackage.json # (required) metadata and schemas for this data package README.md # (optional) README in markdown format @@ -68,35 +68,27 @@ data/otherdata.csv scripts/my-preparation-script.py ``` -Several example data packages can be found in the [datasets organization on github][datasets], including: - -- [World GDP][gdp] -- [ISO 3166-2 country codes][3166] - -[datasets]: https://github.com/datasets -[gdp]: https://github.com/datasets/gdp -[3166]: https://github.com/datasets/country-codes +Several example data packages can be found in the [datasets organization on github](https://github.com/datasets), including: -## Specification +- [World GDP](https://github.com/datasets/gdp) +- [ISO 3166-2 country codes](https://github.com/datasets/country-codes) -### Descriptor +## Descriptor The descriptor is the central file in a Data Package. 
It provides: - General metadata such as the package's title, license, publisher etc - A list of the data "resources" that make up the package including their location on disk or online and other relevant information (including, possibly, schema information about these data resources in a structured form) -A Data Package descriptor `MUST` be a valid JSON `object`. (JSON is defined in [RFC 4627][]). When available as a file it `MUST` be named `datapackage.json` and it `MUST` be placed in the top-level directory (relative to any other resources provided as part of the data package). - -[RFC 4627]: http://www.ietf.org/rfc/rfc4627.txt +A Data Package descriptor `MUST` be a valid JSON `object`. (JSON is defined in [RFC 4627](http://www.ietf.org/rfc/rfc4627.txt)). When available as a file it `MUST` be named `datapackage.json` and it `MUST` be placed in the top-level directory (relative to any other resources provided as part of the data package). The descriptor `MUST` contain a `resources` property describing the data resources. All other properties are considered `metadata` properties. The descriptor `MAY` contain any number of other `metadata` properties. The following sections provides a description of required and optional metadata properties for a Data Package descriptor. -Adherence to the specification does not imply that additional, non-specified properties cannot be used: a descriptor `MAY` include any number of properties in additional to those described as required and optional properties. For example, if you were storing time series data and wanted to list the temporal coverage of the data in the Data Package you could add a property `temporal` (cf [Dublin Core][dc-temporal]): +Adherence to the specification does not imply that additional, non-specified properties cannot be used: a descriptor `MAY` include any number of properties in addition to those described as required and optional properties.
For example, if you were storing time series data and wanted to list the temporal coverage of the data in the Data Package you could add a property `temporal` (cf [Dublin Core](http://dublincore.org/documents/usageguide/qualifiers.shtml#temporal)): -```javascript +```json "temporal": { "name": "19th Century", "start": "1800-01-01", @@ -104,47 +96,35 @@ Adherence to the specification does not imply that additional, non-specified pro } ``` -This flexibility enables specific communities to extend Data Packages as appropriate for the data they manage. As an example, the [Tabular Data Package][tdp] specification extends Data Package to the case where all the data is tabular and stored in CSV. - -[tdp]: /tabular-data-package/ +This flexibility enables specific communities to extend Data Packages as appropriate for the data they manage. As an example, the [Tabular Data Package](https://specs.frictionlessdata.io/tabular-data-package/) specification extends Data Package to the case where all the data is tabular and stored in CSV. Here is an illustrative example of a datapackage JSON file: -```javascript +```json { - # general "metadata" like title, sources etc "name" : "a-unique-human-readable-and-url-usable-identifier", "title" : "A nice title", "licenses" : [ ... ], - "sources" : [...], - # list of the data resources in this data package + "sources" : [ ... ], "resources": [ { - ... resource info described below ... + ... } - ], - # optional - ... additional information ... + ] } ``` -### Resource Information +## Properties -Packaged data resources are described in the `resources` property of the package descriptor. This property `MUST` be an array of `objects`. Each object `MUST` follow the [Data Resource specification][dr]. +A Data Package descriptor `MUST` have a `resources` property and `SHOULD` have `name`, `id`, `licenses`, and `profile` properties.
-[dr]: /data-resource/ - -### Metadata - -#### Required Properties +### `resources` [required] The `resources` property is `REQUIRED`, with at least one resource. -#### Recommended Properties - -In addition to the required properties, the following properties `SHOULD` be included in every package descriptor: +Packaged data resources are described in the `resources` property of the package descriptor. This property `MUST` be an array of `objects`. Each object `MUST` follow the [Data Resource](../data-resource/) specification. -##### `name` +### `name` The name is a simple name or identifier to be used for this package in relation to any registry in which this package will be deposited. @@ -152,7 +132,7 @@ The name is a simple name or identifier to be used for this package in relation - It `SHOULD` be unique in relation to any registry in which this package will be deposited (and preferably globally unique). - It `SHOULD` be invariant, meaning that it `SHOULD NOT` change when a data package is updated, unless the new package version `SHOULD` be considered a distinct package, e.g. due to significant changes in structure or interpretation. Version distinction `SHOULD` be left to the version property. As a corollary, the name also `SHOULD NOT` include an indication of time range covered. -##### `id` +### `id` A property reserved for globally unique identifiers. Examples of identifiers that are unique include UUIDs and DOIs. @@ -160,29 +140,31 @@ A common usage pattern for Data Packages is as a packaging format within the bou Examples: -```javascript +```json { "id": "b03ec84-77fd-4270-813b-0c698943f7ce" } ``` -```javascript +```json { "id": "https://doi.org/10.1594/PANGAEA.726855" } ``` -##### `licenses` +### `licenses` The license(s) under which the package is provided.
-**This property is not legally binding and does not guarantee the package is licensed under the terms defined in this property.** +:::caution +This property is not legally binding and does not guarantee the package is licensed under the terms defined in this property. +::: `licenses` `MUST` be an array. Each item in the array is a License. Each `MUST` be an `object`. The object `MUST` contain a `name` property and/or a `path` property. It `MAY` contain a `title` property. Here is an example: -```javascript +```json "licenses": [{ "name": "ODC-PDDL-1.0", "path": "http://opendatacommons.org/licenses/pddl/", @@ -190,72 +172,69 @@ Here is an example: }] ``` -- `name`: The `name` `MUST` be an [Open Definition license ID][od-licenses] -- `path`: A [url-or-path][] string, that is a fully qualified HTTP address, or a relative POSIX path (see [the url-or-path definition in Data Resource for details][url-or-path]). +- `name`: The `name` `MUST` be an [Open Definition license ID](http://licenses.opendefinition.org/) +- `path`: A [url-or-path](../data-resource/#url-or-path) string, that is a fully qualified HTTP address, or a relative POSIX path. - `title`: A human-readable title. -[od-licenses]: http://licenses.opendefinition.org/ -[od-approved]: http://opendefinition.org/licenses/ -[semver]: http://semver.org -[url-or-path]: /data-resource/#url-or-path - -##### `profile` +### `profile` -A string identifying the [profile][] of this descriptor as per the [profiles][profile] specification. - -[profile]: /profiles/ +A string identifying the profile of this descriptor as per the [profiles](https://specs.frictionlessdata.io/profiles/) specification. 
Examples: -```javascript +```json { "profile": "tabular-data-package" } ``` -```javascript +```json { "profile": "http://example.com/my-profiles-json-schema.json" } ``` -#### Optional Properties - -The following are commonly used properties that the package descriptor `MAY` contain: - -##### `title` +### `title` A `string` providing a title or one sentence description for this package -##### `description` +### `description` -A description of the package. The description `MUST` be [markdown][] formatted -- this also allows for simple plain text as plain text is itself valid markdown. The first paragraph (up to the first double line break) `SHOULD` be usable as summary information for the package. +A description of the package. The description `MUST` be [markdown](http://commonmark.org/) formatted -- this also allows for simple plain text as plain text is itself valid markdown. The first paragraph (up to the first double line break) `SHOULD` be usable as summary information for the package. -##### `homepage` +### `homepage` A URL for the home on the web that is related to this data package. -##### `version` +### `image` -A version string identifying the version of the package. It `SHOULD` conform to the [Semantic Versioning][semver] requirements and `SHOULD` follow the [Data Package Version](/recipes/#data-package-version) recipe. +An image to use for this data package. For example, when showing the package in a listing. -##### `sources` +The value of the image property `MUST` be a string pointing to the location of the image. The string `MUST` be a [url-or-path](../data-resource/#url-or-path), that is a fully qualified HTTP address, or a relative POSIX path. -The raw sources for this data package. It `MUST` be an array of Source objects. A Source object `MUST` have at least one property. A Source object is `RECOMMENDED` to have `title` property and `MAY` have `path`, `email`, and `version` properties. 
Example: +### `version` + +A version string identifying the version of the package. It `SHOULD` conform to the [Semantic Versioning](http://semver.org) requirements and `SHOULD` follow the [Data Package Version](../../recipes/data-package-version) recipe. + +### `created` + +The datetime on which this was created. + +Note: semantics may vary between publishers -- for some this is the datetime the data was created, for others the datetime the package was created. + +The datetime `MUST` conform to the string formats for datetime as described in [RFC3339](https://tools.ietf.org/html/rfc3339#section-5.6). Example: ```json -"sources": [{ - "title": "World Bank and OECD", - "path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD" -}] +{ + "created": "1985-04-12T23:20:50.52Z" +} ``` -- `title`: title of the source (e.g. document or organization name) -- `path`: A [url-or-path][] string, that is a fully qualified HTTP address, or a relative POSIX path (see [the url-or-path definition in Data Resource for details][url-or-path]). -- `email`: An email address -- `version`: A version of the source +### `keywords` -##### `contributors` +An Array of string keywords to assist users searching for the package in catalogs. + +### `contributors` The people or organizations who contributed to this Data Package. It `MUST` be an array. Each entry is a Contributor and `MUST` be an `object`. A Contributor `MUST` have at least one property. A Contributor is RECOMMENDED to have `title` property and MAY contain `givenName`, `familyName`, `path`, `email`, `roles`, and `organization` properties. An example of the object structure is as follows: @@ -276,7 +255,7 @@ The people or organizations who contributed to this Data Package. It `MUST` be a - `roles`: an array of strings describing the roles of the contributor. 
A role is `RECOMMENDED` to follow an established vocabulary, such as [DataCite Metadata Schema's contributorRole](https://support.datacite.org/docs/datacite-metadata-schema-v44-recommended-and-optional-properties#7a-contributortype) or [CreDIT](https://credit.niso.org/). Useful roles to indicate are: `creator`, `contact`, `rightsHolder`, and `dataCurator`. - `organization`: a string describing the organization this contributor is affiliated to. -Use of the "creator" role does not imply that that person was the original creator of the data in the data package - merely that they created and/or maintain the data package. It is common for data packages to "package" up data from elsewhere. The original origin of the data can be indicated with the `sources` property - see above. +Use of the `creator` role does not imply that that person was the original creator of the data in the data package - merely that they created and/or maintain the data package. It is common for data packages to "package" up data from elsewhere. The original origin of the data can be indicated with the `sources` property - see above. References: @@ -286,30 +265,18 @@ References: If the `roles` property is not provided a data consumer MUST fall back to using `role` property which was a part of the `v1.0` of the specification. This property has the same semantics but it is a string allowing to specify only a single role. ::: -##### `keywords` - -An Array of string keywords to assist users searching for the package in catalogs. - -##### `image` - -An image to use for this data package. For example, when showing the package in a listing. - -The value of the image property `MUST` be a string pointing to the location of the image. The string `MUST` be a [url-or-path][], that is a fully qualified HTTP address, or a relative POSIX path (see [the url-or-path definition in Data Resource for details][url-or-path]). - -##### `created` +### `sources` -The datetime on which this was created. 
- -Note: semantics may vary between publishers -- for some this is the datetime the data was created, for others the datetime the package was created. - -The datetime `MUST` conform to the string formats for datetime as described in [RFC3339][]. Example: +The raw sources for this data package. It `MUST` be an array of Source objects. A Source object `MUST` have at least one property. A Source object is `RECOMMENDED` to have `title` property and `MAY` have `path`, `email`, and `version` properties. Example: -```javascript -{ - "created": "1985-04-12T23:20:50.52Z" -} +```json +"sources": [{ + "title": "World Bank and OECD", + "path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD" +}] ``` -[RFC3339]: https://tools.ietf.org/html/rfc3339#section-5.6 -[dc-temporal]: http://dublincore.org/documents/usageguide/qualifiers.shtml#temporal -[markdown]: http://commonmark.org/ +- `title`: title of the source (e.g. document or organization name) +- `path`: A [url-or-path](../data-resource/#url-or-path) string, that is a fully qualified HTTP address, or a relative POSIX path. +- `email`: An email address +- `version`: A version of the source diff --git a/content/docs/specifications/data-resource.md b/content/docs/specifications/data-resource.md index e163b526..dc1457eb 100644 --- a/content/docs/specifications/data-resource.md +++ b/content/docs/specifications/data-resource.md @@ -15,25 +15,19 @@ sidebar: -A simple format to describe and package a single data resource such as a individual table or file. +A simple format to describe and package a single data resource such as an individual table or file. The essence of a Data Resource is a locator for the data it describes. A range of other properties can be declared to provide a richer set of metadata.
## Language The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `SHOULD NOT`, `RECOMMENDED`, `MAY`, and `OPTIONAL` in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt) -## Introduction - -The **Data Resource** format describes a data resource such as an individual file or table. -The essence of a Data Resource is a locator for the data it describes. -A range of other properties can be declared to provide a richer set of metadata. - -### Examples +## Example A minimal Data Resource looks as follows: With data accessible via the local filesystem. -```javascript +```json { "name": "resource-name", "path": "resource-path.csv" @@ -42,7 +36,7 @@ With data accessible via the local filesystem. With data accessible via http. -```javascript +```json { "name": "resource-name", "path": "http://example.com/resource-path.csv" @@ -51,20 +45,18 @@ With data accessible via http. A minimal Data Resource pointing to some inline data looks as follows. -```javascript +```json { "name": "resource-name", "data": { - "resource-name-data": [ - {"a": 1, "b": 2} - ] - }, + "resource-name-data": [{ "a": 1, "b": 2 }] + } } ``` A comprehensive Data Resource example with all required, recommended and optional properties looks as follows. -```javascript +```json { "name": "solar-system", "path": "http://example.com/solar-system.csv", @@ -74,178 +66,162 @@ A comprehensive Data Resource example with all required, recommended and optiona "mediatype": "text/csv", "encoding": "utf-8", "bytes": 1, - "hash": "", - "schema": "", - "sources": "", - "licenses": "" + "hash": ..., + "schema": ..., + "sources": [ ... ], + "licenses": [ ... ] } ``` -### Descriptor - -A Data Resource descriptor `MUST` be a valid JSON `object`. (JSON is defined in [RFC 4627][]). - -Key properties of the descriptor are described below. 
A descriptor `MAY` include any number of properties in additional to those described below as required and optional properties. - -[RFC 4627]: http://www.ietf.org/rfc/rfc4627.txt +## Descriptor -### Data Location +A Data Resource descriptor `MUST` be a valid JSON `object`. (JSON is defined in [RFC 4627](http://www.ietf.org/rfc/rfc4627.txt)). -A resource `MUST` contain a property describing the location of the -data associated to the resource. The location of resource data `MUST` be -specified by the presence of one (and only one) of these two properties: +## Properties -- `path`: for data in files located online or locally on disk. -- `data`: for data inline in the descriptor itself. - -#### `path` Data in Files +Standard properties of the descriptor are described below. A descriptor `MAY` include any number of properties in addition to those described below as required and optional properties. -`path` `MUST` be a string -- or an array of strings (see "Data in Multiple -Files"). Each string `MUST` be a "url-or-path" as defined in the next section. +### `name` [required] -##### URL or Path - -A "url-or-path" is a `string` with the following additional constraints: +A resource `MUST` contain a `name` property. The name is a simple name or identifier to be used for this resource. -- `MUST` either be a URL or a POSIX path -- [URLs][url] `MUST` be fully qualified. `MUST` be using either http or https scheme. (Absence of a scheme indicates `MUST` be a POSIX path) -- [POSIX paths][posix] (unix-style with `/` as separator) are supported for referencing local files, with the security restraint that they `MUST` be relative siblings or children of the descriptor. Absolute paths `/`, relative parent paths `../`, hidden folders starting from a dot `.hidden` `MUST` NOT be used. +- It `MUST` be unique amongst all resources in this data package. +- It `SHOULD` be human-readable and consist only of lowercase alphanumeric characters plus `.`, `-` and `_`.
+- It would be usual for the name to correspond to the file name (minus the extension) of the data file the resource describes. -[url]: https://en.wikipedia.org/wiki/Uniform_Resource_Locator -[posix]: https://en.wikipedia.org/wiki/Path_%28computing%29#POSIX_pathname_definition +### `path` or `data` [required] -Examples: +A resource `MUST` contain a property describing the location of the data associated to the resource. The location of resource data `MUST` be specified by the presence of one (and only one) of these two properties: -``` -# fully qualified url -"path": "http://ex.datapackages.org/big-csv/my-big.csv" +- `path`: for data in files located online or locally on disk. +- `data`: for data inline in the descriptor itself. -# relative path -# note: this will work both as a relative path on disk and on online -"path": "my-data-directory/my-csv.csv" -``` +#### Single File -:::warning -`/` (absolute path) and `../` (relative parent path) are forbidden to avoid security vulnerabilities when implementing data package software. These limitations on resource `path` ensure that resource paths only point to files within the data package directory and its subdirectories. This prevents data package software being exploited by a malicious user to gain unintended access to sensitive information. +If a resource has only a single file then `path` `MUST` be a string that is a "url-or-path" as defined in the [URL or Path](#url-or-path) section. -For example, suppose a data package hosting service stores packages on disk and allows access via an API. A malicious user uploads a data package with a resource path like `/etc/passwd`. The user then requests the data for that resource and the server naively opens `/etc/passwd` and returns that data to the caller. - -Prior to release 1.0.0-beta.18 (Nov 17 2016) there was a `url` property distinct from `path`.
In order to support backwards compatibility, implementors `MAY` want to automatically convert a `url` property to a `path` property and issue a warning. -::: - -#### Data in Multiple Files +#### Multiple Files Usually, a resource will have only a single file associated to it. However, sometimes it can be convenient to have a single resource whose data is split across multiple files -- perhaps the data is large and having it in one file would be inconvenient. -To support this use case the `path` property `MAY` be an array of strings rather -than a single string: +To support this use case the `path` property `MAY` be an array of strings rather than a single string: -``` -"path": [ "myfile1.csv", "myfile2.csv" ] +```json +{ + "path": ["myfile1.csv", "myfile2.csv"] +} ``` -It is NOT permitted to mix fully qualified URLs and relative paths in a `path` array: strings `MUST either all be relative paths or all URLs. +It is NOT permitted to mix fully qualified URLs and relative paths in a `path` array: strings `MUST` either all be relative paths or all URLs. -**NOTE:** All files in the array `MUST` be similar in terms of structure, format etc. Implementors `MUST` be able to concatenate together the files in the simplest way and treat the result as one large file. For tabular data there is the issue of header rows. See the [Tabular Data Package spec][tdp] for more on this. +:::note +All files in the array `MUST` be similar in terms of structure, format etc. Implementors `MUST` be able to concatenate together the files in the simplest way and treat the result as one large file. For tabular data there is the issue of header rows. See the [Tabular Data Package spec](https://specs.frictionlessdata.io/tabular-data-package/) for more on this. +::: -#### `data` Inline Data +#### Inline Data Resource data rather than being stored in external files can be shipped `inline` on a Resource using the `data` property. The value of the data property can be any type of data. 
However, restrictions of JSON require that the value be a string so for binary data you will need to encode (e.g. to Base64). Information on the type and encoding of the value of the data property SHOULD be provided by the format (or mediatype) property and the encoding property. -Specifically: the value of the data property `MUST` be: +The value of the data property `MUST` be either: -- EITHER: a **JSON** array or **Object**- the data is then assumed to be JSON data and SHOULD be processed as such -- OR: a **JSON** string - in this case the format or mediatype properties `MUST` be provided. +- **JSON array or object**: the data is then assumed to be JSON data and SHOULD be processed as such +- **JSON string**: in this case the format or mediatype properties `MUST` be provided. Thus, a consumer of resource object `MAY` assume if no format or mediatype property is provided that the data is JSON and attempt to process it as such. -**Examples 1 - inline JSON:** +For example, inline JSON: +```json +{ + "resources": [ { - ... - "resources": [ - { - "format": "json", - # some json data e.g. - "data": [ - { "a": 1, "b": 2 }, - { .... } - ] - } - ] + "format": "json", + "data": [{ "a": 1, "b": 2 }] } + ] +} +``` -**Example 2 - inline CSV:** +Or inline CSV: +```json +{ + "resources": [ { - ... - "resources": [ - { - "format": "csv", - "data": "A,B,C\n1,2,3\n4,5,6" - } - ] + "format": "csv", + "data": "A,B,C\n1,2,3\n4,5,6" } + ] +} +``` -### Metadata Properties - -#### Required Properties - -A descriptor `MUST` contain the following properties: - -#### `name` - -A resource `MUST` contain a `name` property. The name is a simple name or identifier to be used for this resource. - -- It `MUST` be unique amongst all resources in this data package. -- It `SHOULD` be human-readable and consist only of lowercase alphanumeric characters plus ".", "-" and "\_". 
-- It would be usual for the name to correspond to the file name (minus the extension) of the data file the resource describes. - -#### Recommended Properties - -#### `profile` +:::note[Backward Compatibility] +Prior to release 1.0.0-beta.18 (Nov 17 2016) there was a `url` property distinct from `path`. In order to support backwards compatibility, implementors `MAY` want to automatically convert a `url` property to a `path` property and issue a warning. +::: -A string identifying the [profile][profile] of this descriptor as per the [profiles][profile] specification. +### `profile` -[profile]: /profiles/ +A string identifying the profile of this descriptor as per the [profiles](https://specs.frictionlessdata.io/profiles/) specification. Examples: -```javascript +```json { "profile": "tabular-data-resource" } ``` -``` +```json { "profile": "http://example.com/my-profiles-json-schema.json" } ``` -#### Optional Properties +### `title` + +Title or label for the resource. + +### `description` + +Description of the resource. + +### `format` + +Would be expected to be the standard file extension for this type of resource. For example, `csv`, `xls`, `json` etc. + +### `mediatype` + +The mediatype/mimetype of the resource e.g. "text/csv", or "application/vnd.ms-excel". Mediatypes are maintained by the Internet Assigned Numbers Authority (IANA) in a [media type registry](https://www.iana.org/assignments/media-types/media-types.xhtml). + +### `encoding` + +The character encoding of resource's data file (only applicable for textual files). The value `SHOULD` be one of the "Preferred MIME Names" for [a character encoding registered with IANA](http://www.iana.org/assignments/character-sets/character-sets.xhtml). If no value for this property is specified then the encoding `SHOULD` be detected on the implementation level. It is `RECOMMENDED` to use UTF-8 (without BOM) as a default encoding for textual files. + +### `bytes` + +Size of the file in bytes.
+ +### `hash` + +The MD5 hash for this resource. Other algorithms can be indicated by prefixing the hash's value with the algorithm name in lower-case. For example: -A descriptor `MAY` contain any number of additional properties. Common properties include: +```json +{ + "hash": "sha1:8843d7f92416211de9ebb963ff4ce28125932878" +} +``` -- `title`: a title or label for the resource. -- `description`: a description of the resource. -- `format`: 'csv', 'xls', 'json' etc. Would be expected to be the standard file - extension for this type of resource. -- `mediatype`: the mediatype/mimetype of the resource e.g. "text/csv", or "application/vnd.ms-excel". Mediatypes are maintained by the Internet Assigned Numbers Authority (IANA) in a [media type registry](https://www.iana.org/assignments/media-types/media-types.xhtml). -- `encoding`: the character encoding of resource's data file (only applicable for textual files). The value `SHOULD` be one of the "Preferred MIME Names" for [a character encoding registered with IANA][iana]. If no value for this property is specified then the encoding `SHOULD` be detected on the implementation level. It is `RECOMMENDED` to use UTF-8 (without BOM) as a default encoding for textual files. -- `bytes`: size of the file in bytes. -- `hash`: the MD5 hash for this resource. Other algorithms can be indicated by prefixing - the hash's value with the algorithm name in lower-case. For example: +### `sources` - "hash": "sha1:8843d7f92416211de9ebb963ff4ce28125932878" +List of data sources as for [Data Package](../data-package/#sources). -- `sources`: as for [Data Package metadata][dp]. -- `licenses`: as for [Data Package metadata][dp]. If not specified the resource - inherits from the data package. +### `licenses` -### Resource Schemas +List of licenses as for [Data Package](../data-package/#licenses). If not specified the resource inherits from the data package. 
+ +### `schema` A Data Resource `MAY` have a `schema` property to describe the schema of the resource data. @@ -253,9 +229,32 @@ The value for the `schema` property on a `resource` MUST be an `object` represen If a `string` it must be a [url-or-path as defined above](#url-or-path), that is a fully qualified http URL or a relative POSIX path. The file at the location specified by this url-or-path string `MUST` be a JSON document containing the schema. -NOTE: the Data Package specification places no restrictions on the form of the schema Object. This flexibility enables specific communities to define schemas appropriate for the data they manage. As an example, the [Tabular Data Package][tdp] specification requires the schema to conform to [Table Schema][ts]. +NOTE: the Data Package specification places no restrictions on the form of the schema Object. This flexibility enables specific communities to define schemas appropriate for the data they manage. As an example, the [Tabular Data Package](https://specs.frictionlessdata.io/tabular-data-package/) specification requires the schema to conform to [Table Schema](../table-schema/). + +## URL or Path -[tdp]: /tabular-data-package/ -[ts]: /table-schema/ -[iana]: http://www.iana.org/assignments/character-sets/character-sets.xhtml -[dp]: /data-package/ +A `url-or-path` is a `string` with the following additional constraints: + +- `MUST` either be a URL or a POSIX path +- [URLs](https://en.wikipedia.org/wiki/Uniform_Resource_Locator) `MUST` be fully qualified. `MUST` be using either http or https scheme. (Absence of a scheme indicates `MUST` be a POSIX path) +- [POSIX paths](https://en.wikipedia.org/wiki/Path_%28computing%29#POSIX_pathname_definition) (unix-style with `/` as separator) are supported for referencing local files, with the security restraint that they `MUST` be relative siblings or children of the descriptor. 
Absolute paths `/`, relative parent paths `../`, hidden folders starting from a dot `.hidden` `MUST NOT` be used. + +Example of a fully qualified url: + +```json +{ + "path": "http://ex.datapackages.org/big-csv/my-big.csv" +} +``` + +Example of a relative path that will work both as a relative path on disk and online: + +```json +{ + "path": "my-data-directory/my-csv.csv" +} +``` + +:::caution[Security] +`/` (absolute path) and `../` (relative parent path) are forbidden to avoid security vulnerabilities when implementing data package software. These limitations on resource `path` ensure that resource paths only point to files within the data package directory and its subdirectories. This prevents data package software being exploited by a malicious user to gain unintended access to sensitive information. For example, suppose a data package hosting service stores packages on disk and allows access via an API. A malicious user uploads a data package with a resource path like `/etc/passwd`. The user then requests the data for that resource and the server naively opens `/etc/passwd` and returns that data to the caller. +::: diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md index 39864f21..5a146adb 100644 --- a/content/docs/specifications/table-schema.md +++ b/content/docs/specifications/table-schema.md @@ -25,9 +25,9 @@ The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `S Table Schema is a simple language- and implementation-agnostic way to declare a schema for tabular data. Table Schema is well suited for use cases around handling and validating tabular data in text formats such as CSV, but its utility extends well beyond this core usage, towards a range of applications where data benefits from a portable schema format. -### Concepts +## Concepts -#### Tabular data +### Tabular Data Tabular data consists of a set of rows. Each row has a set of fields (columns).
We usually expect that each row has the same set of fields and thus we can talk about _the_ fields for the table as a whole. @@ -35,25 +35,29 @@ In case of tables in spreadsheets or CSV files we often interpret the first row To illustrate, here's a classic spreadsheet table: - field field - | | - | | - V V +```text +field field + | | + | | + V V - A | B | C | D <--- Row (Header) - ------------------------------------ - valA | valB | valC | valD <--- Row - ... + A | B | C | D <--- Row (Header) + ------------------------------------ + valA | valB | valC | valD <--- Row + ... +``` In JSON, a table would be: - [ - { "A": value, "B": value, ... }, - { "A": value, "B": value, ... }, - ... - ] +```json +[ + { "A": value, "B": value, ... }, + { "A": value, "B": value, ... }, + ... +] +``` -#### Physical and logical representation +### Data Representation In order to talk about the representation and processing of tabular data from text-based sources, it is useful to introduce the concepts of the _physical_ and the _logical_ representation of data. @@ -73,12 +77,9 @@ The descriptor `MAY` have the additional properties set out below and `MAY` cont The following is an illustration of this structure: -```javascript +```json { - // fields is an ordered list of field descriptors - // one for each field (column) in the table "fields": [ - // a field-descriptor { "name": "name of field (e.g. column name)", "title": "A nicer human readable label or title for the field", @@ -88,26 +89,27 @@ The following is an illustration of this structure: "description": "A description for the field" ... }, - ... more field descriptors + ... ], - // (optional) specification of missing values "missingValues": [ ... ], - // (optional) specification of the primary key - "primaryKey": ... - // (optional) specification of the foreign keys - "foreignKeys": ... + "primaryKey": [ ... ] + "foreignKeys": [... 
] } ``` ## Properties -### `fields` +### Schema + +A Table Schema descriptor `MAY` contain these standard properties: -A Table Schema descriptor `MUST` contain a property `fields`. `fields` `MUST` be an array where each entry in the array is a field descriptor as defined below. +#### `fields` [required] + +A Table Schema descriptor `MUST` contain a property `fields`. `fields` `MUST` be an array where each entry in the array is a [field descriptor](#field) as defined below. The way Table Schema `fields` are mapped onto the data source fields is defined by the `fieldsMatch` property. By default, the strictest approach is applied, i.e. fields in the data source `MUST` completely match the elements in the `fields` array, both in number and order. Using the different options below, a data producer can relax requirements for the data source. -### `fieldsMatch` +#### `fieldsMatch` A Table Schema descriptor `MAY` contain a property `fieldsMatch` that `MUST` be a string with the following possible values and the `exact` value by default: @@ -117,16 +119,183 @@ A Table Schema descriptor `MAY` contain a property `fieldsMatch` that `MUST` be - **superset**: The data source `MUST` only have fields defined in the `fields` array, but `MAY` have fewer. Fields `MUST` be mapped by their names. - **partial**: The data source `MUST` have at least one field defined in the `fields` array. Fields `MUST` be mapped by their names. -## Field Properties +#### `missingValues` + +Many datasets arrive with missing data values, either because a value was not collected or because it never existed. Missing values may be indicated simply by the value being empty; in other cases a special value may have been used, e.g. `-`, `NaN`, `0`, `-9999` etc. + +`missingValues` dictates which string values `MUST` be treated as `null` values. This conversion to `null` is done before any other attempted type-specific string conversion.
The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to null will be done, on any value. + +`missingValues` `MUST` be an `array` where each entry is a `string`. + +**Why strings**: `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing values that are not of their type, for example a `number` field to have missing values indicated by `-`. + +Examples: + +```text +"missingValues": [""] +"missingValues": ["-"] +"missingValues": ["NaN", "-"] +``` + +#### `primaryKey` -A field descriptor `MUST` be a JSON `object` that describes a single field. The -descriptor provides additional human-readable documentation for a field, as -well as additional information that can be used to validate the field or create -a user interface for data entry. +A primary key is a field or set of fields that uniquely identifies each row in the table. Per SQL standards, the fields cannot be `null`, so their use in the primary key is equivalent to adding `required: true` to their [`constraints`](#constraints). + +The `primaryKey` entry in the schema `object` is optional. If present it specifies the primary key for this table. + +The `primaryKey`, if present, `MUST` be an array of strings with each string corresponding to one of the field `name` values in the `fields` array (denoting that the primary key is made up of those fields). It is acceptable to have an array with a single value (indicating just one field in the primary key). Strictly, the order of values in the array does not matter. However, it is `RECOMMENDED` that one follow the order of the fields in the `fields` array, as client applications `MAY` utilize the order of the primary key list (e.g. in concatenating values together).
+ +Here's an example: + +```json +"schema": { + "fields": [ + { + "name": "a" + }, + { + "name": "b" + }, + { + "name": "c" + }, + ... + ], + "primaryKey": ["a", "c"] +} +``` + +:::note[Backward Compatibility] +Data consumers `MUST` support the `primaryKey` property in the form of a single string, e.g. `primaryKey: a`, which was part of `v1.0` of the specification. +::: + +#### `uniqueKeys` + +A unique key is a field or a set of fields that are required to have unique logical values in each row in the table. It is directly modeled on the concept of a unique constraint in SQL. + +The `uniqueKeys` property, if present, `MUST` be a non-empty array. Each entry in the array `MUST` be a `uniqueKey`. A `uniqueKey` `MUST` be an array of strings with each string corresponding to one of the field `name` values in the `fields` array, denoting that the unique key is made up of those fields. It is acceptable to have an array with a single value, indicating just one field in the unique key. + +An example of using the `uniqueKeys` property: + +```json +"fields": [ + { + "name": "a" + }, + { + "name": "b" + }, + { + "name": "c" + } +], +"uniqueKeys": [ + ["a"], + ["a", "b"], + ["a", "c"] +] +``` + +In the case of the definition above, the data in the table has to be considered valid only if: + +- each row has a unique logical value in the field `a` +- each row has a unique set of logical values in the fields `a` and `b` +- each row has a unique set of logical values in the fields `a` and `c` + +**Handling `null` values** + +All field values that are considered `null` on the logical level `MUST` be excluded from the uniqueness check, as the `uniqueKeys` property is modeled on the concept of a unique constraint in SQL. + +**Relation to `constraints.unique`** + +In contrast with `field.constraints.unique`, `uniqueKeys` allows uniqueness to be defined over a combination of fields. Both properties `SHOULD` be assessed separately.
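The `uniqueKeys` rules above could be checked along these lines. This is a minimal sketch rather than a reference implementation: the function name `find_unique_key_violations` is hypothetical, rows are assumed to be dictionaries of already-cast (hashable) logical values, and it is an assumption here that a row is skipped for a given key as soon as any of that key's fields is `null`, mirroring common SQL behaviour.

```python
from typing import Any, Dict, List, Sequence, Tuple


def find_unique_key_violations(
    rows: Sequence[Dict[str, Any]],
    unique_keys: Sequence[Sequence[str]],
) -> List[Tuple[Tuple[str, ...], int]]:
    """Return (key, row_index) pairs for rows that repeat an earlier key value."""
    violations = []
    for key in unique_keys:
        seen = set()
        for index, row in enumerate(rows):
            values = tuple(row.get(name) for name in key)
            # Assumption: any null in the key excludes the row from the check.
            if any(value is None for value in values):
                continue
            if values in seen:
                violations.append((tuple(key), index))
            seen.add(values)
    return violations
```

Each key is evaluated independently, so a table can satisfy `["a", "b"]` while still violating `["a"]`.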
+ +#### `foreignKeys` + +A foreign key is a reference where values in a field (or fields) on the table ('resource' in data package terminology) described by this Table Schema connect to values in a field (or fields) on this or a separate table (resource). Foreign keys are directly modelled on the concept of foreign keys in SQL. + +The `foreignKeys` property, if present, `MUST` be an array. Each entry in the array `MUST` be a `foreignKey`. A `foreignKey` `MUST` be an `object` and `MUST` have the following properties: + +- `fields` - `fields` is an array of strings specifying the + field or fields on this resource that form the source part of the foreign + key. The structure of the array is as per `primaryKey` above. +- `reference` - `reference` `MUST` be an `object`. The `object` + - `MUST` have a property `fields` which is an array of strings of the same length as the outer `fields`, describing the field (or fields) referenced on the destination resource. The structure of the array is as per `primaryKey` above. + - `MAY` have a property `resource` which is the name of the resource within the current data package, i.e. the data package within which this Table Schema is located. For referencing another data resource the `resource` property `MUST` be provided. For self-referencing, i.e. references between fields in this Table Schema, the `resource` property `MUST` be omitted.
+ +Here's an example: + +```json +"resources": [ + { + "name": "state-codes", + "schema": { + "fields": [ + {"name": "code"} + ] + } + }, + { + "name": "population-by-state", + "schema": { + "fields": [ + {"name": "state-code"} + ], + "foreignKeys": [ + { + "fields": ["state-code"], + "reference": { + "resource": "state-codes", + "fields": ["code"] + } + } + ] + } + } +] +``` + +An example of a self-referencing foreign key: + +```json +"resources": [ + { + "name": "xxx", + "schema": { + "fields": [ + {"name": "parent"}, + {"name": "id"} + ], + "foreignKeys": [ + { + "fields": ["parent"], + "reference": { + "fields": ["id"] + } + } + ] + } + } +] +``` + +Foreign Keys create links between one Table Schema and another Table Schema, and implicitly between the data tables described by those Table Schemas. If the foreign key is referring to another Table Schema how is that other Table Schema discovered? The answer is that a Table Schema will usually be embedded inside some larger descriptor for a dataset, in particular as the schema for a resource in the resources array of a [Data Package](http://specs.frictionlessdata.io/data-package/). It is the use of Table Schema in this way that permits a meaningful use of a non-empty `resource` property on the foreign key. + +:::note[Backward Compatibility] +If the value of the `foreignKey.reference.resource` property is an empty string `""`, a data consumer `MUST` interpret it as an omitted property, because an empty string for self-referencing was part of `v1.0` of the specification. +::: + +:::note[Backward Compatibility] +Data consumers `MUST` support the `foreignKey.fields` and `foreignKey.reference.fields` properties in the form of a single string, e.g. `"fields": "a"`, which was part of `v1.0` of the specification. +::: + +### Field + +A field descriptor `MUST` be a JSON `object` that describes a single field.
The descriptor provides additional human-readable documentation for a field, as well as additional information that can be used to validate the field or create a user interface for data entry. Here is an illustration: -```javascript +```json { "name": "name of field (e.g. column name)", "title": "A nicer human readable label or title for the field", @@ -135,14 +304,14 @@ Here is an illustration: "example": "An example value for the field", "description": "A description for the field", "constraints": { - // a constraints-descriptor + ... } } ``` The field descriptor `object` `MAY` contain any number of other properties. Some specific properties are defined below. Of these, only the `name` property is `REQUIRED`. -### `name` +#### `name` [required] The field descriptor `MUST` contain a `name` property and it `MUST` be unique amongst other field names in this Table Schema. This property `SHOULD` correspond to the name of a column in the data file if it has a name. @@ -150,19 +319,35 @@ The field descriptor `MUST` contain a `name` property and it `MUST` be unique am If the `name` properties are not unique amongst a Table Schema a data consumer `MUST NOT` interpret it as an invalid descriptor as duplicate `name` properties were allowed in the `v1.0` of the specification. ::: -### `title` +#### `type` and `format` + +These properties are used to give the type of the field (string, number, etc.) - see below for more detail. If type is not provided a consumer `MUST` utilize the `any` type for the field instead of inferring it from the field's values. + +A field's `type` property is a string indicating the type of this field. + +A field's `format` property is a string, indicating a format for the field type. + +Both `type` and `format` are optional: in a field descriptor, the absence of a `type` property indicates that the field is of the type "any", and the absence of a `format` property indicates that the field's type `format` is "default". 
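The defaulting rule for `type` and `format` can be expressed in a couple of lines. A minimal sketch, assuming the field descriptor has already been parsed into a Python dictionary; the function name `resolve_type_and_format` is hypothetical.

```python
from typing import Any, Dict, Tuple


def resolve_type_and_format(field: Dict[str, Any]) -> Tuple[str, str]:
    """Return (type, format) for a field descriptor, applying the defaults."""
    # A missing `type` means "any"; a missing `format` means "default".
    # The type is never inferred from the field's values.
    return field.get("type", "any"), field.get("format", "default")
```

For instance, a descriptor containing only `name` resolves to `("any", "default")`.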
+ +Types are based on the [type set of json-schema](http://tools.ietf.org/html/draft-zyp-json-schema-03#section-5.1) with some additions and minor modifications (cf other type lists include those in [Elasticsearch types](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html)). + +#### `title` A human readable label or title for the field -### `description` +#### `description` A description for this field e.g. "The recipient of the funds" -### `example` +#### `example` An example value for the field -### `missingValues` +#### `constraints` + +See [Field Constraints](#field-constraints) + +#### `missingValues` A list of missing values for this field as per [Missing Values](#missing-values) definition. If this property is defined, it takes precedence over the schema-level property and completely replaces it for the field without combining the values. @@ -186,33 +371,45 @@ A data consumer `MUST`: - interpret `""` and `NA` as missing values for `column1` - interpret only `-` as a missing value for `column2` -### Types and Formats +#### `rdfType` -`type` and `format` properties are used to give the type of the field (string, number, etc.) - see below for more detail. If type is not provided a consumer `MUST` utilize the `any` type for the field instead of inferring it from the field's values. +A richer, "semantic", description of the "type" of data in a given column `MAY` be provided using a `rdfType` property on a field descriptor. -A field's `type` property is a string indicating the type of this field. +The value of the `rdfType` property `MUST` be the URI of a RDF Class, that is an instance or subclass of [RDF Schema Class object](https://www.w3.org/TR/rdf-schema/#ch_class). -A field's `format` property is a string, indicating a format for the field type. 
+Here is an example using the Schema.org RDF Class `http://schema.org/Country`: -Both `type` and `format` are optional: in a field descriptor, the absence of a -`type` property indicates that the field is of the type "any", and the -absence of a `format` property indicates that the field's type `format` is -"default". +```text +| Country | Year Date | Value | +| ------- | --------- | ----- | +| US | 2010 | ... | +``` + +The corresponding Table Schema is: + +```json +{ + "fields": [ + { + "name": "Country", + "type": "string", + "rdfType": "http://schema.org/Country" + } + ... + } +} +``` -Types are based on the [type set of -json-schema](http://tools.ietf.org/html/draft-zyp-json-schema-03#section-5.1) -with some additions and minor modifications (cf other type lists include -those in [Elasticsearch -types](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html)). +## Field Types The type list with associated formats and other related properties is as follows. -#### string +### `string` The field contains strings, that is, sequences of characters. -`format`: +Supported formats: - **default**: any valid string. - **email**: A valid email address. @@ -220,16 +417,11 @@ The field contains strings, that is, sequences of characters. - **binary**: A base64 encoded string representing binary data. - **uuid**: A string that is a uuid. -#### number +### `number` The field contains numbers of any kind including decimals. -The lexical formatting follows that of decimal in [XMLSchema][xsd-decimal]: a -non-empty finite-length sequence of decimal digits separated by a period as a -decimal indicator. An optional leading sign is allowed. If the sign is omitted, -"+" is assumed. Leading and trailing zeroes are optional. If the fractional -part is zero, the period and following zero(es) can be omitted. For example: -'-1.23', '12678967.543233', '+100000.00', '210'. 
+The lexical formatting follows that of decimal in [XMLSchema](https://www.w3.org/TR/xmlschema-2/#decimal): a non-empty finite-length sequence of decimal digits separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: '-1.23', '12678967.543233', '+100000.00', '210'. The following special string values are permitted (case need not be respected): @@ -239,22 +431,15 @@ The following special string values are permitted (case need not be respected): A number `MAY` also have a trailing: -- exponent: this `MUST` consist of an E followed by an optional + or - sign - followed by one or more decimal digits (0-9) +- exponent: this `MUST` consist of an E followed by an optional + or - sign followed by one or more decimal digits (0-9) This lexical formatting `MAY` be modified using these additional properties: -- **decimalChar**: A string whose value is used to represent a decimal point - within the number. The default value is ".". -- **groupChar**: A string whose value is used to group digits within the - number. This property does not have a default value. A common value is "," e.g. "100,000". +- **decimalChar**: A string whose value is used to represent a decimal point within the number. The default value is ".". +- **groupChar**: A string whose value is used to group digits within the number. This property does not have a default value. A common value is "," e.g. "100,000". - **bareNumber**: a boolean field with a default of `true`. If `true` the physical contents of this field `MUST` follow the formatting constraints already set out. If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). 
The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. Note that it is entirely up to implementors what, if anything, they do with stripped text. -`format`: no options (other than the default). - -[xsd-decimal]: https://www.w3.org/TR/xmlschema-2/#decimal - -#### integer +### `integer` The field contains integers - that is whole numbers. @@ -262,13 +447,10 @@ Integer values are indicated in the standard way for any valid integer. This lexical formatting `MAY` be modified using these additional properties: -- **groupChar**: A string whose value is used to group digits within the - integer. This property does not have a default value. A common value is "," e.g. "100,000". +- **groupChar**: A string whose value is used to group digits within the integer. This property does not have a default value. A common value is "," e.g. "100,000". - **bareNumber**: a boolean field with a default of `true`. If `true` the physical contents of this field `MUST` follow the formatting constraints already set out. If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. Note that it is entirely up to implementors what, if anything, they do with stripped text. -`format`: no options (other than the default). - -#### boolean +### `boolean` The field contains boolean (true/false) data. @@ -279,21 +461,15 @@ The boolean field can be customised with these additional properties: - **trueValues**: `[ "true", "True", "TRUE", "1" ]` - **falseValues**: `[ "false", "False", "FALSE", "0" ]` -`format`: no options (other than the default). 
- -#### object +### `object` The field contains a valid JSON object. -`format`: no options (other than the default). - -#### array +### `array` The field contains a valid JSON array. -`format`: no options (other than the default). - -#### list +### `list` The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. In the lexical representation, the field `MUST` contain a string with values separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. In comparison to the `array` type, the `list` type is directly modelled on the concept of SQL typed collections. @@ -304,91 +480,73 @@ The list field can be customised with these additional properties: - **delimiter**: specifies the character sequence which separates lexically represented list items. If not present, the default is `,` (comma). - **itemType**: specifies the list item type in terms of existent Table Schema types. If present, it `MUST` be one of `string`, `integer`, `boolean`, `number`, `datetime`, `date`, and `time`. If not present, the default is `string`. A data consumer `MUST` process list items as if they were individual values of the corresponding data type. Note that on the lexical level only default formats are supported, for example, for a list with `itemType` set to `date`, items have to be in default form for dates i.e. `yyyy-mm-dd`. -#### datetime +### `datetime` The field contains a date with a time. -`format`: +Supported formats: - **default**: The lexical representation `MUST` be in a form defined by [XML Schema](https://www.w3.org/TR/xmlschema-2/#dateTime) containing required date and time parts, followed by optional milliseconds and timezone parts, for example, `2024-01-26T15:00:00` or `2024-01-26T15:00:00.300-05:00`. -- **\<PATTERN\>**: values in this field can be parsed according to `<PATTERN>`. `<PATTERN>` `MUST` follow the syntax of [standard Python / C strptime][strptime].
Values in the this field `SHOULD` be parsable by Python / C standard `strptime` using `<PATTERN>`. Example for `"format": ""%d/%m/%Y %H:%M:%S"` which would correspond to a date with time like: `12/11/2018 09:15:32`. +- **\<PATTERN\>**: values in this field can be parsed according to `<PATTERN>`. `<PATTERN>` `MUST` follow the syntax of [standard Python / C strptime](https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior). Values in this field `SHOULD` be parsable by Python / C standard `strptime` using `<PATTERN>`. Example for `"format": "%d/%m/%Y %H:%M:%S"` which would correspond to a date with time like: `12/11/2018 09:15:32`. - **any**: Any parsable representation of the value. The implementing library can attempt to parse the datetime via a range of strategies. An example is `dateutil.parser.parse` from the `python-dateutil` library. It is `NOT RECOMMENDED` to use `any` format as it might cause interoperability issues. -#### date +### `date` The field contains a date without a time. -`format`: +Supported formats: - **default**: The lexical representation `MUST` be `yyyy-mm-dd` e.g. `2024-01-26` - **\<PATTERN\>**: The same as for `datetime` - **any**: The same as for `datetime` -#### time +### `time` The field contains a time without a date. -`format`: +Supported formats: - **default**: The lexical representation `MUST` be `hh:mm:ss` e.g. `15:00:00` - **\<PATTERN\>**: The same as for `datetime` - **any**: The same as for `datetime` -#### year - -A calendar year as per [XMLSchema `gYear`][xsd-gyear]. - -Usual lexical representation is `YYYY`. There are no format options. +### `year` -[xsd-gyear]: https://www.w3.org/TR/xmlschema-2/#gYear +A calendar year as per [XMLSchema `gYear`](https://www.w3.org/TR/xmlschema-2/#gYear). Usual lexical representation is `YYYY`. There are no format options. -#### yearmonth +### `yearmonth` -A specific month in a specific year as per [XMLSchema -`gYearMonth`][xsd-gyearmonth].
+A specific month in a specific year as per [XMLSchema `gYearMonth`](https://www.w3.org/TR/xmlschema-2/#gYearMonth). Usual lexical representation is: `YYYY-MM`. There are no format options. -Usual lexical representation is: `YYYY-MM`. There are no format options. - -[xsd-gyearmonth]: https://www.w3.org/TR/xmlschema-2/#gYearMonth - -#### duration +### `duration` A duration of time. -We follow the definition of [XML Schema duration datatype][xsd-duration] directly -and that definition is implicitly inlined here. - -To summarize: the lexical representation for duration is the [ISO 8601][iso8601-duration] -extended format PnYnMnDTnHnMnS, where nY represents the number of years, nM the -number of months, nD the number of days, 'T' is the date/time separator, nH the -number of hours, nM the number of minutes and nS the number of seconds. The -number of seconds can include decimal digits to arbitrary precision. Date and -time elements including their designator `MAY` be omitted if their value is zero, -and lower order elements `MAY` also be omitted for reduced precision. +We follow the definition of [XML Schema duration datatype](http://www.w3.org/TR/xmlschema-2/#duration) directly and that definition is implicitly inlined here. -`format`: no options (other than the default). +To summarize: the lexical representation for duration is the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) extended format PnYnMnDTnHnMnS, where nY represents the number of years, nM the number of months, nD the number of days, 'T' is the date/time separator, nH the number of hours, nM the number of minutes and nS the number of seconds. The number of seconds can include decimal digits to arbitrary precision. Date and time elements including their designator `MAY` be omitted if their value is zero, and lower order elements `MAY` also be omitted for reduced precision. -#### geopoint +### `geopoint` The field contains data describing a geographic point. 
-`format`: +Supported formats: - **default**: A string of the pattern "lon, lat", where each value is a number, and `lon` is the longitude and `lat` is the latitude (note the space is optional after the `,`). E.g. `"90.50, 45.50"`. - **array**: A JSON array, or a string parsable as a JSON array, of exactly two items, where each item is a number, and the first item is `lon` and the second item is `lat` e.g. `[90.50, 45.50]` - **object**: A JSON object with exactly two keys, `lat` and `lon` and each value is a number e.g. `{"lon": 90.50, "lat": 45.50}` -#### geojson +### `geojson` The field contains a JSON object according to GeoJSON or TopoJSON spec. -`format`: +Supported formats: - **default**: A geojson object as per the [GeoJSON spec](http://geojson.org/). - **topojson**: A topojson object as per the [TopoJSON spec](https://github.com/topojson/topojson-specification/blob/master/README.md) -#### any +### `any` The field contains values of a unspecified or mixed type. A data consumer `MUST NOT` perform any processing on this field's values and `MUST` interpret them as it is in the data source. This data type is directly modelled on the concept of the `any` type of strongly typed object-oriented languages like [TypeScript](https://www.typescriptlang.org/docs/handbook/2/everyday-types.html#any). @@ -433,426 +591,105 @@ While this JSON data file will have logical values as below: Note, that for the CSV data source the `id` field is interpreted as a string because CSV supports only one data type i.e. string, and for the JSON data source the `id` field is interpreted as an integer because JSON supports a numeric data type and the value was declared as an integer. Also, for the Table Schema above a `type` property for each field can be omitted as it is a default field type. -### Rich Types - -A richer, "semantic", description of the "type" of data in a given column `MAY` -be provided using a `rdfType` property on a field descriptor. 
- -The value of the `rdfType` property `MUST` be the URI of a RDF Class, that is an instance or subclass of [RDF Schema Class object][rdfs-class] - -Here is an example using the Schema.org RDF Class `http://schema.org/Country`: - -``` -| Country | Year Date | Value | -| ------- | --------- | ----- | -| US | 2010 | ... | -``` - -The corresponding Table Schema is: - -```javascript - { - fields: [ - { - "name": "Country", - "type": "string", - "rdfType": "http://schema.org/Country" - } - ... - } - } -``` - -[rdfs-class]: https://www.w3.org/TR/rdf-schema/#ch_class - -### Constraints +## Field Constraints -The `constraints` property on Table Schema Fields can be used by consumers to list constraints for validating field values. For example, validating the data in a [Tabular Data Resource][tdr] against its Table Schema; or as a means to validate data being collected or updated via a data entry interface. - -[tdr]: http://specs.frictionlessdata.io/tabular-data-resource/ +The `constraints` property on Table Schema Fields can be used by consumers to list constraints for validating field values. For example, validating the data in a [Tabular Data Resource](https://specs.frictionlessdata.io/tabular-data-package/) against its Table Schema; or as a means to validate data being collected or updated via a data entry interface. All constraints `MUST` be tested against the logical representation of data, and the physical representation of constraint values `MAY` be primitive types as possible in JSON, or represented as strings that are castable with the `type` and `format` rules of the field. -A constraints descriptor `MUST` be a JSON `object` and `MAY` contain one or more of the following -properties. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-| Property | Type | Applies to | Description |
-| --- | --- | --- | --- |
-| required | boolean | All | Indicates whether this field cannot be null. If required is false (the default), then null is allowed. See the section on missingValues for how, in the physical representation of the data, strings can represent null values. |
-| unique | boolean | All | If true, then all values for that field `MUST` be unique within the data file in which it is found. |
-| minLength | integer | collections (string, array, object) | An integer that specifies the minimum length of a value. |
-| maxLength | integer | collections (string, array, object) | An integer that specifies the maximum length of a value. |
-| minimum | integer, number, date, time, datetime, duration, year, yearmonth | integer, number, date, time, datetime, duration, year, yearmonth | Specifies a minimum value for a field. This is different to minLength which checks the number of items in the value. A minimum value constraint checks whether a field value is greater than or equal to the specified value. The range checking depends on the type of the field. E.g. an integer field may have a minimum value of 100; a date field might have a minimum date. If a minimum value constraint is specified then the field descriptor MUST contain a type key. |
-| maximum | integer, number, date, time, datetime, duration, year, yearmonth | integer, number, date, time, datetime, duration, year, yearmonth | As for minimum, but specifies a maximum value for a field. |
-| exclusiveMinimum | integer, number, date, time, datetime, duration, year, yearmonth | integer, number, date, time, datetime, duration, year, yearmonth | As for minimum, but for expressing exclusive range. |
-| exclusiveMaximum | integer, number, date, time, datetime, duration, year, yearmonth | integer, number, date, time, datetime, duration, year, yearmonth | As for maximum, but for expressing exclusive range. |
-| jsonSchema | object | array, object | A valid JSON Schema object to validate field values. If a field value conforms to the provided JSON Schema then this field value is valid. |
-| pattern | string | string | A regular expression that can be used to test field values. If the regular expression matches then the value is valid. The values of this field MUST conform to the standard XML Schema regular expression syntax. |
-| enum | array | All | The value of the field `MUST` exactly match a value in the enum array. |
- -**Implementors**: - -- Implementations `SHOULD` report an error if an attempt is made to evaluate a value against an unsupported constraint. -- A constraints descriptor `MAY` contain multiple constraints, in which case implementations `MUST` apply all the constraints when determining if a field value is valid. -- Constraints `MUST` be applied on the logical representation of field values and constraint values. - -## Other Properties +A constraints descriptor `MUST` be a JSON `object` and `MAY` contain one or more of the following properties: -In additional to field descriptors, there are the following "table level" properties. - -### Missing Values - -Many datasets arrive with missing data values, either because a value was not collected or it never existed. Missing values may be indicated simply by the value being empty in other cases a special value may have been used e.g. `-`, `NaN`, `0`, `-9999` etc. +### `required` -`missingValues` dictates which string values `MUST` be treated as `null` values. This conversion to `null` is done before any other attempted type-specific string conversion. -The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. -Providing the empty list `[]` means that no conversion to null will be done, on any value. +- **Type**: boolean +- **Fields**: all -`missingValues` `MUST` be an `array` where each entry is a `string`. +Indicates whether this field cannot be `null`. If required is `false` (the default), then `null` is allowed. See the section on `missingValues` for how, in the physical representation of the data, strings can represent `null` values. -**Why strings**: `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing value which are not of their type, for example a `number` field to have missing values indicated by `-`. 
+### `unique` -Examples: +- **Type**: boolean +- **Fields**: all -```javascript -"missingValues": [""] -"missingValues": ["-"] -"missingValues": ["NaN", "-"] -``` +If `true`, then all values for that field `MUST` be unique within the data file in which it is found. -### Primary Key +### `minLength` -A primary key is a field or set of fields that uniquely identifies each row in -the table. Per SQL standards, the fields cannot be `null`, so their use in the -primary key is equivalent to adding `required: true` to their -[`constraints`](#constraints). +- **Type**: integer +- **Fields**: collections (string, array, object) -The `primaryKey` entry in the schema `object` is optional. If present it specifies -the primary key for this table. +An integer that specifies the minimum length of a value. -The `primaryKey`, if present, `MUST` be an array of strings with each string corresponding to one of the field `name` values in the `fields` array (denoting that the primary key is made up of those fields). It is acceptable to have an array with a single value (indicating just one field in the primary key). Strictly, order of values in the array does not matter. However, it is `RECOMMENDED` that one follow the order the fields in the `fields` has as client applications `MAY` utilize the order of the primary key list (e.g. in concatenating values together). +### `maxLength` -Here's an example: +- **Type**: integer +- **Fields**: collections (string, array, object) -```json -"schema": { - "fields": [ - { - "name": "a" - }, - { - "name": "b" - }, - { - "name": "c" - }, - ... - ], - "primaryKey": ["a", "c"] -} -``` +An integer that specifies the maximum length of a value. -:::note[Backward Compatibility] -Data consumer MUST support the `primaryKey` property in a form of a single string e.g. `primaryKey: a` which was a part of the `v1.0` of the specification. 
-::: +### `minimum` -### Unique Keys +- **Type**: integer, number, date, time, datetime, duration, year, yearmonth +- **Fields**: integer, number, date, time, datetime, duration, year, yearmonth -A unique key is a field or a set of fields that are required to have unique logical values in each row in the table. It is directly modeled on the concept of unique constraint in SQL. +Specifies a minimum value for a field. This is different to `minLength` which checks the number of items in the value. A `minimum` value constraint checks whether a field value is greater than or equal to the specified value. The range checking depends on the `type` of the field. E.g. an integer field may have a minimum value of 100; a date field might have a minimum date. If a `minimum` value constraint is specified then the field descriptor `MUST` contain a `type` key. -The `uniqueKeys` property, if present, `MUST` be a non-empty array. Each entry in the array `MUST` be a `uniqueKey`. A `uniqueKey` `MUST` be an array of strings with each string corresponding to one of the field `name` values in the `fields` array, denoting that the unique key is made up of those fields. It is acceptable to have an array with a single value, indicating just one field in the unique key. +### `maximum` -An example of using the `uniqueKeys` property: +- **Type**: integer, number, date, time, datetime, duration, year, yearmonth +- **Fields**: integer, number, date, time, datetime, duration, year, yearmonth -```json -"fields": [ - { - "name": "a" - }, - { - "name": "b" - }, - { - "name": "c" - } -], -"uniqueKeys": [ - ["a"], - ["a", "b"], - ["a", "c"] -] -``` +As for `minimum`, but specifies a maximum value for a field. 
-In the case of the definition above, the data in the table has to be considered valid only if: +### `exclusiveMinimum` -- each row has a unique logical value in the field `a` -- each row has a unique set of logical values in the fields `a` and `b` -- each row has a unique set of logical values in the fields `a` and `c` +- **Type**: integer, number, date, time, datetime, duration, year, yearmonth +- **Fields**: integer, number, date, time, datetime, duration, year, yearmonth -#### Handling `null` values +As for `minimum`, but for expressing exclusive range. -All the field values that are on the logical level are considered to be `null` values `MUST` be excluded from the uniqueness check, as the `uniqueKeys` property is modeled on the concept of unique constraint in SQL. +### `exclusiveMaximum` -#### Relation to `constraints.unique` +- **Type**: integer, number, date, time, datetime, duration, year, yearmonth +- **Fields**: integer, number, date, time, datetime, duration, year, yearmonth -In contrast with `field.constraints.unique`, `uniqueKeys` allows to define uniqueness as a combination of fields. Both properties `SHOULD` be assessed separately. +As for `maximum`, but for expressing exclusive range. -### Foreign Keys +### `jsonSchema` -A foreign key is a reference where values in a field (or fields) on the -table ('resource' in data package terminology) described by this Table Schema -connect to values a field (or fields) on this or a separate table (resource). -They are directly modelled on the concept of foreign keys in SQL. +- **Type**: object +- **Fields**: array, object -The `foreignKeys` property, if present, `MUST` be an Array. Each entry in the -array `MUST` be a `foreignKey`. A `foreignKey` `MUST` be a `object` and `MUST` have the following properties: +A valid JSON Schema object to validate field values. If a field value conforms to the provided JSON Schema then this field value is valid. 
-- `fields` - `fields` is an array of strings specifying the - field or fields on this resource that form the source part of the foreign - key. The structure of the array is as per `primaryKey` above. -- `reference` - `reference` `MUST` be a `object`. The `object` - - `MUST` have a property `fields` which is an array of strings of the same length as the outer `fields`, describing the field (or fields) references on the destination resource. The structure of the array is as per `primaryKey` above. - - `MAY` have a property `resource` which is the name of the resource within the current data package, i.e. the data package within which this Table Schema is located. For referencing another data resource the `resource` property `MUST` be provided. For self-referencing, i.e. references between fields in this Table Schema, the `resource` property `MUST` be omitted. +### `pattern` -Here's an example: +- **Type**: string +- **Fields**: string -```json -"resources": [ - { - "name": "state-codes", - "schema": { - "fields": [ - {"name": "code"} - ] - } - }, - { - "name": "population-by-state", - "schema": { - "fields": [ - {"name": "state-code"} - ], - "foreignKeys": [ - { - "fields": ["state-code"], - "reference": { - "resource": "state-codes", - "fields": ["code"] - } - } - ] - } - } -] -``` +A regular expression that can be used to test field values. If the regular expression matches then the value is valid. The values of this field `MUST` conform to the standard [XML Schema regular expression syntax](http://www.w3.org/TR/xmlschema-2/#regexs). 
-An example of a self-referencing foreign key: +### `enum` -```json -"resources": [ - { - "name": "xxx", - "schema": { - "fields": [ - {"name": "parent"}, - {"name": "id"} - ], - "foreignKeys": [ - { - "fields": ["parent"], - "reference": { - "fields": ["id"] - } - } - ] - } - } -] -``` +- **Type**: array +- **Fields**: all -Foreign Keys create links between one Table Schema and another Table Schema, and implicitly between the data tables described by those Table Schemas. If the foreign key is referring to another Table Schema how is that other Table Schema discovered? The answer is that a Table Schema will usually be embedded inside some larger descriptor for a dataset, in particular as the schema for a resource in the resources array of a [Data Package](http://specs.frictionlessdata.io/data-package/). It is the use of Table Schema in this way that permits a meaningful use of a non-empty `resource` property on the foreign key. +The value of the field `MUST` exactly match one of the values in the `enum` array. -:::note[Backward Compatibility] -If the value of the `foreignKey.reference.resource` property is an empty string `""` a data consumer MUST interpret it as an omited property as an empty string for self-referencing was a part of the `v1.0` of the specification. -::: +:::note[Implementation Note] -:::note[Backward Compatibility] -Data consumer MUST support the `foreignKey.fields` and `foreignKey.reference.fields` properties in a form of a single string e.g. `"fields": "a"` which was a part of the `v1.0` of the specification. -::: +- Implementations `SHOULD` report an error if an attempt is made to evaluate a value against an unsupported constraint. +- A constraints descriptor `MAY` contain multiple constraints, in which case implementations `MUST` apply all the constraints when determining if a field value is valid. +- Constraints `MUST` be applied on the logical representation of field values and constraint values. 
+ ::: -## Appendix: Related Work +## Related Work Table Schema draws content and/or inspiration from, among others, the following specifications and implementations: -- [XML Schema][] -- [Google BigQuery][] -- [JSON Schema][] -- [DSPL][] -- [HTML5 Forms][] -- [Elasticsearch][] - -[xml schema]: http://www.w3.org/TR/xmlschema-2/#built-in-primitive-datatypes -[google bigquery]: https://developers.google.com/bigquery/docs/import#loading_json_files -[json schema]: http://json-schema.org -[dspl]: https://developers.google.com/public-data/docs/schema/dspl18 -[html5 forms]: http://www.whatwg.org/specs/web-apps/current-work/#attr-input-typ -[elasticsearch]: http://www.elasticsearch.org/guide/reference/mapping/ -[strptime]: https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior -[iso8601-duration]: https://en.wikipedia.org/wiki/ISO_8601#Durations -[xsd-duration]: http://www.w3.org/TR/xmlschema-2/#duration +- [XML Schema](http://www.w3.org/TR/xmlschema-2/#built-in-primitive-datatypes) +- [Google BigQuery](https://developers.google.com/bigquery/docs/import#loading_json_files) +- [JSON Schema](http://json-schema.org) +- [DSPL](https://developers.google.com/public-data/docs/schema/dspl18) +- [HTML5 Forms](http://www.whatwg.org/specs/web-apps/current-work/#attr-input-typ) +- [Elasticsearch](http://www.elasticsearch.org/guide/reference/mapping/)