diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index d8e8d6d2..566a4794 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -129,9 +129,18 @@ Many datasets arrive with missing data values, either because a value was not co
 
 `missingValues` dictates which string values `MUST` be treated as `null` values. This conversion to `null` is done before any other attempted type-specific string conversion. The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to null will be done, on any value.
 
-`missingValues` `MUST` be an `array` where each entry is a `string`.
+`missingValues` `MUST` be an `array` where each entry is a `string`, or an `array` where each entry is an `object`.
 
-**Why strings**: `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing value which are not of their type, for example a `number` field to have missing values indicated by `-`.
+If an `array` of `object`s is provided, each object `MUST` have a `value` and optional `label` property. The `value` property `MUST` be a `string` that matches the physical value of the field. The optional `label` property `MUST` be a `string` that provides a human-readable label for the missing value. For example:
+
+```json
+"missingValues": [
+  { "value": "", "label": "OMITTED" },
+  { "value": "-99", "label": "REFUSED" }
+]
+```
+
+**Why strings**: `missingValues` are specified as strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing value which are not of their type, for example a `number` field to have missing values indicated by `-`.
 
 Examples:
 
@@ -469,6 +478,59 @@ The boolean field can be customised with these additional properties:
 - **trueValues**: `[ "true", "True", "TRUE", "1" ]`
 - **falseValues**: `[ "false", "False", "FALSE", "0" ]`
 
+### `categorical`
+
+The field contains categorical data, defined as data with a finite set of possible values that represent levels of a categorical variable.
+
+The `categorical` field type `MUST` have the property `categories` that defines the set of possible levels of the field. The `categories` property `MUST` be an array of strings, or an array of objects.
+
+When the `categories` property is an array of strings, the strings `MUST` be unique and `MUST` match the physical values of the field. For example:
+
+```json
+{
+  "name": "fruit",
+  "type": "categorical",
+  "categories": ["apple", "orange", "banana"]
+}
+```
+
+When the `categories` property is an array of objects, each object `MUST` have a `value` and an optional `label` property. The `value` property `MUST` be a string that matches the physical value of the field when representing that level. The optional `label` property, when present, `MUST` be a string that provides a human-readable label for the level. For example, if the physical values `"0"`, `"1"`, and `"2"` were used as codes to represent the levels `apple`, `orange`, and `banana` in the previous example, the `categories` property would be defined as follows:
+
+```json
+{
+  "name": "fruit",
+  "type": "categorical",
+  "categories": [
+    { "value": "0", "label": "apple" },
+    { "value": "1", "label": "orange" },
+    { "value": "2", "label": "banana" }
+  ]
+}
+```
+
+The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean. When `ordered` is `true`, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels, in ascending order. For example:
+
+```json
+{
+  "name": "agreementLevel",
+  "type": "categorical",
+  "categories": [
+    { "value": "1", "label": "Strongly Disagree" },
+    { "value": "2" },
+    { "value": "3" },
+    { "value": "4" },
+    { "value": "5", "label": "Strongly Agree" }
+  ],
+  "ordered": true
+}
+```
+
+When the property `ordered` is `false` or not present, implementations `SHOULD` assume that the levels of the `categorical` do not have a natural order.
+
+Although the `categorical` field type restricts a field to a finite set of possible values, similar to an [`enum`](#enum) constraint, the `categorical` field type enables data producers to explicitly indicate to implementations that a field `SHOULD` be loaded as a categorical data type (when supported by the implementation). By contrast, `enum` constraints simply add validation rules to existing field types.
+
+When an `enum` constraint is defined on a `categorical` field, the values in the `enum` constraint `MUST` be a subset of the logical values representing the levels of the `categorical`. Logical values of categorical levels are indicated by their labels, if present, or by their physical value, if a label is not present.
+
 ### `object`
 
 The field contains a valid JSON object.
@@ -684,6 +746,10 @@ A regular expression that can be used to test field values. If the regular expre
 
 The value of the field `MUST` exactly match one of the values in the `enum` array.
 
+:::note[Backward Compatibility]
+Many `v1.0` implementations imported fields with `enum` constraints as categorical data types. Starting in `v2.0` this behavior is discouraged in favor of explicit use of the [`categorical`](#categorical) field type. In `v2.0`, an `enum` constraint `SHOULD` be interpreted by implementations as a validation rule on an existing field type, and `SHOULD NOT` change the imported data type of the field.
+:::
+
 :::note[Implementation Note]
 
 - Implementations `SHOULD` report an error if an attempt is made to evaluate a value against an unsupported constraint.
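
The rules this patch adds for `missingValues`, `categories`, and `enum` on a `categorical` field could be read by an implementation roughly as follows. This is a minimal Python sketch, not part of the specification: the function names `normalize_missing_values`, `logical_categories`, and `check_enum_subset` are hypothetical, and the sketch assumes the `enum` constraint lives under the field's `constraints` object.

```python
def normalize_missing_values(missing_values):
    """Map each physical missing value to its optional label (None if absent).

    Accepts a list of strings or a list of objects with a required
    "value" and an optional "label". For brevity this sketch does not
    reject arrays that mix the two forms.
    """
    result = {}
    for entry in missing_values:
        if isinstance(entry, str):
            result[entry] = None
        elif isinstance(entry, dict) and isinstance(entry.get("value"), str):
            result[entry["value"]] = entry.get("label")
        else:
            raise ValueError(f"invalid missingValues entry: {entry!r}")
    return result


def logical_categories(categories):
    """Return the logical value of each level, in declared order.

    The logical value is the label if present, otherwise the physical value.
    """
    levels = []
    for entry in categories:
        if isinstance(entry, str):
            levels.append(entry)
        else:
            levels.append(entry.get("label", entry["value"]))
    return levels


def check_enum_subset(field):
    """True if every enum value is a logical value of one of the levels."""
    enum = field.get("constraints", {}).get("enum", [])
    return set(enum) <= set(logical_categories(field["categories"]))
```

Under these assumptions, a field whose levels carry labels would need its `enum` constraint to list the labels rather than the physical codes, since the labels are the logical values.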