This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Add a categorical field type #48

Closed
wants to merge 12 commits into from
70 changes: 68 additions & 2 deletions content/docs/specifications/table-schema.md
@@ -125,9 +125,18 @@ Many datasets arrive with missing data values, either because a value was not co

`missingValues` dictates which string values `MUST` be treated as `null` values. This conversion to `null` is done before any other attempted type-specific string conversion. The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to null will be done, on any value.

`missingValues` `MUST` be an `array` where each entry is a `string`.
`missingValues` `MUST` be an `array` where each entry is a `string`, or an `array` where each entry is an `object`.

**Why strings**: `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing value which are not of their type, for example a `number` field to have missing values indicated by `-`.
If an `array` of `object`s is provided, each object `MUST` have a `value` property and `MAY` have a `label` property. The `value` property `MUST` be a `string` that matches the physical value of the field. The `label` property, when present, `MUST` be a `string` that provides a human-readable label for the missing value. For example:

```json
"missingValues": [
{ "value": "", "label": "OMITTED" },
{ "value": "-99", "label": "REFUSED" }
]
```

**Why strings**: `missingValues` are specified as strings rather than in the data type of the particular field. This allows for comparison prior to casting, and it allows fields to have missing values that are not of their type, for example a `number` field whose missing values are indicated by `-`.
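
The following Python sketch illustrates the rule that the conversion to `null` happens on the physical string before any type-specific casting. It is only an illustration: the helper names `normalize_missing_values` and `parse_integer_cell` are hypothetical and are not part of any Table Schema implementation.

```python
def normalize_missing_values(spec):
    """Map each missing-value string to its optional label.

    Accepts both forms of `missingValues`: a list of strings,
    or a list of objects with `value` and optional `label`.
    """
    labels = {}
    for entry in spec:
        if isinstance(entry, str):
            labels[entry] = None
        else:
            labels[entry["value"]] = entry.get("label")
    return labels


def parse_integer_cell(physical, labels):
    """Return None for a declared missing value, otherwise cast to int."""
    # The comparison uses the raw string, before any casting is attempted.
    if physical in labels:
        return None
    return int(physical)


labels = normalize_missing_values([
    {"value": "", "label": "OMITTED"},
    {"value": "-99", "label": "REFUSED"},
])
parsed = [parse_integer_cell(cell, labels) for cell in ["12", "", "-99"]]
```

Because the check runs on the string `"-99"`, that cell becomes `null` even though it would otherwise be a valid integer.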

Examples:

@@ -461,6 +470,59 @@ The boolean field can be customised with these additional properties:
- **trueValues**: `[ "true", "True", "TRUE", "1" ]`
- **falseValues**: `[ "false", "False", "FALSE", "0" ]`

### `categorical`

The field contains categorical data, defined as data with a finite set of possible values that represent levels of a categorical variable.

The `categorical` type facilitates interoperability with software packages that support categorical data types, including value labels or formats ([Stata](https://www.stata.com/manuals13/dlabel.pdf), [SAS](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p1upn25lbfo6mkn1wncu4dyh9q91.htm), and [SPSS](https://www.ibm.com/docs/en/spss-statistics/beta?topic=data-adding-value-labels)), categoricals ([Pandas](https://pandas.pydata.org/docs/user_guide/categorical.html) and [Polars](https://docs.pola.rs/user-guide/concepts/data-types/categoricals/)), enums ([DuckDB](https://duckdb.org/docs/sql/data_types/enum.html)), factors ([R](https://www.stat.berkeley.edu/~s133/factors.html)), and CategoricalVectors ([Julia](https://dataframes.juliadata.org/stable/man/categorical/)).

Like an [`enum`](#enum) constraint, the `categorical` field type restricts a field to a finite set of possible values. Unlike `enum`, it also lets data producers explicitly indicate to implementations that a field `SHOULD` be loaded as a categorical data type (when supported by the implementation). By contrast, `enum` constraints simply add validation rules to existing field types. When an `enum` constraint is defined on a `categorical` field, the values in the `enum` constraint `MUST` be a subset of the physical values representing the levels of the `categorical`.
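
As a rough illustration of the subset rule, the sketch below checks an `enum` constraint against the `categories` property described later in this section. The function name `enum_is_subset_of_categories` is hypothetical, and comparing values by their string form is an assumption made here for simplicity, not something the text prescribes.

```python
def enum_is_subset_of_categories(field):
    """Check that every `enum` value is one of the field's category values."""
    physical = set()
    for entry in field["categories"]:
        # `categories` entries are either plain strings or objects with `value`.
        value = entry if isinstance(entry, str) else entry["value"]
        physical.add(str(value))  # assumption: compare by string form
    enum = field.get("constraints", {}).get("enum", [])
    return all(str(value) in physical for value in enum)


valid_field = {
    "name": "fruit",
    "type": "categorical",
    "categories": ["apple", "orange", "banana"],
    "constraints": {"enum": ["apple", "banana"]},
}
invalid_field = dict(valid_field, constraints={"enum": ["apple", "kiwi"]})

valid_ok = enum_is_subset_of_categories(valid_field)
invalid_ok = enum_is_subset_of_categories(invalid_field)
```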

The `categorical` field type `MUST` have the property `categories` that defines the set of possible values of the field. The `categories` property `MUST` be an array of strings, or an array of objects.
Review comment: Why just an array of strings? Why not also an array of integers, since integers can be specified for `value` in the expanded syntax? More generally, why limit ourselves here at all?


When the `categories` property is an array of strings, the strings `MUST` be unique and `MUST` match the physical values of the field. For example:

```json
{
"name": "fruit",
"type": "categorical",
"categories": ["apple", "orange", "banana"]
}
```

When the `categories` property is an array of objects, each object `MUST` have a `value` property and `MAY` have a `label` property. The `value` property `MUST` be a string or number that matches the physical value of the field when representing that level. The `label` property, when present, `MUST` be a string that provides a human-readable label for the level. For example, if the codes `0`, `1`, and `2` were used to represent the levels `apple`, `orange`, and `banana` from the previous example, the `categories` property would be defined as follows:

```json
{
"name": "fruit",
"type": "categorical",
"categories": [
{ "value": 0, "label": "apple" },
{ "value": 1, "label": "orange" },
{ "value": 2, "label": "banana" }
]
}
```
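
One way an implementation might handle both forms of `categories` is to normalize them into a single list of levels before reading data. The Python sketch below shows that idea under stated assumptions: the names `normalize_categories` and `decode` are hypothetical, the uniqueness check is applied to the object form as well as the string form, and returning the label in place of the code is an illustrative choice rather than required behavior.

```python
def normalize_categories(categories):
    """Return a list of (value, label) pairs; label is None when absent."""
    levels = []
    seen = set()
    for entry in categories:
        if isinstance(entry, str):
            value, label = entry, None
        else:
            value, label = entry["value"], entry.get("label")
        key = str(value)  # assumption: physical values are compared as strings
        if key in seen:
            raise ValueError(f"duplicate category value: {value!r}")
        seen.add(key)
        levels.append((value, label))
    return levels


def decode(physical, levels):
    """Map a physical cell value to its label, or to the value if unlabeled."""
    for value, label in levels:
        if str(value) == physical:
            return label if label is not None else str(value)
    raise ValueError(f"{physical!r} is not a declared category")


coded = normalize_categories([
    {"value": 0, "label": "apple"},
    {"value": 1, "label": "orange"},
    {"value": 2, "label": "banana"},
])
plain = normalize_categories(["apple", "orange", "banana"])
decoded = [decode(cell, coded) for cell in ["1", "0", "2"]]
```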

The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean. When `ordered` is `true`, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels, in ascending order. In cases where the physical values are numeric and `ordered` is `true`, the order of the levels `SHOULD` match the numerical order of the values (e.g., 1, 2, 3, ...) to avoid ambiguity. For example:

```json
{
"name": "agreementLevel",
"type": "categorical",
"categories": [
{ "value": 1, "label": "Strongly Disagree" },
{ "value": 2 },
{ "value": 3 },
{ "value": 4 },
{ "value": 5, "label": "Strongly Agree" }
],
"ordered": true
}
```

When the property `ordered` is `false` or not present, implementations `SHOULD` assume that the levels of the `categorical` do not have a natural order.
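
The ordering semantics can be sketched in Python as follows: each level is ranked by its position in `categories`, and comparisons are only permitted when `ordered` is `true`. The helpers `level_rank` and `compare_levels` are hypothetical names, and refusing to compare unordered levels is one possible reading of "no natural order", not mandated behavior.

```python
def level_rank(categories):
    """Map each physical value (as a string) to its position in `categories`."""
    rank = {}
    for position, entry in enumerate(categories):
        value = entry if isinstance(entry, str) else entry["value"]
        rank[str(value)] = position
    return rank


def compare_levels(a, b, field):
    """Return -1, 0, or 1 for two physical values of an ordered categorical."""
    # A missing `ordered` property is treated the same as `false`.
    if not field.get("ordered", False):
        raise TypeError("levels of an unordered categorical cannot be compared")
    rank = level_rank(field["categories"])
    return (rank[a] > rank[b]) - (rank[a] < rank[b])


field = {
    "name": "agreementLevel",
    "type": "categorical",
    "categories": [
        {"value": 1, "label": "Strongly Disagree"},
        {"value": 2},
        {"value": 3},
        {"value": 4},
        {"value": 5, "label": "Strongly Agree"},
    ],
    "ordered": True,
}
rank = level_rank(field["categories"])
sorted_cells = sorted(["5", "1", "3"], key=lambda cell: rank[cell])
comparison = compare_levels("1", "5", field)
```

In a library with native support, this would roughly correspond to constructing an ordered categorical, for example `pandas.Categorical(values, categories=..., ordered=True)`.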

### `object`

The field contains a valid JSON object.
@@ -676,6 +738,10 @@ A regular expression that can be used to test field values. If the regular expre

The value of the field `MUST` exactly match one of the values in the `enum` array.

:::note[Backward Compatibility]
Many `v1.0` implementations imported fields with `enum` constraints as categorical data types. Starting in `v2.0` this behavior is discouraged in favor of explicit use of the [`categorical`](#categorical) field type. In `v2.0`, an `enum` constraint `SHOULD` be interpreted by implementations as a validation rule on an existing field type, and `SHOULD NOT` change the imported data type of the field.
:::

:::note[Implementation Note]

- Implementations `SHOULD` report an error if an attempt is made to evaluate a value against an unsupported constraint.