From b3dacea3a461635e5c99c8c8f42a9e7b7e62c589 Mon Sep 17 00:00:00 2001
From: Kyle Husmann
Date: Tue, 2 Apr 2024 16:02:22 -0700
Subject: [PATCH 01/10] First draft of spec for `categorical` field type

Resolves https://github.com/frictionlessdata/specs/issues/875
---
 content/docs/specifications/table-schema.md | 59 +++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index 8bcbc5b2..980a0990 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -461,6 +461,65 @@ The boolean field can be customised with these additional properties:
 - **trueValues**: `[ "true", "True", "TRUE", "1" ]`
 - **falseValues**: `[ "false", "False", "FALSE", "0" ]`
 
+### `categorical`
+
+The field contains categorical data, defined as data with a finite set of possible values that represent levels of a categorical variable.
+
+The `categorical` type facilitates interoperability with software packages that support categorical data types, including:
+
+- Value labels or formats ([Stata](https://www.stata.com/manuals13/dlabel.pdf), [SAS](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p1upn25lbfo6mkn1wncu4dyh9q91.htm), and [SPSS](https://www.ibm.com/docs/en/spss-statistics/beta?topic=data-adding-value-labels))
+- Categoricals ([Pandas](https://pandas.pydata.org/docs/user_guide/categorical.html), and [Polars](https://docs.pola.rs/user-guide/concepts/data-types/categoricals/))
+- [Enums (DuckDB)](https://duckdb.org/docs/sql/data_types/enum.html)
+- [Factors (R)](https://www.stat.berkeley.edu/~s133/factors.html)
+- [CategoricalVectors (Julia)](https://dataframes.juliadata.org/stable/man/categorical/)
+
+Although [`enum`](#enum) constraints can provide similar functionality for validation purposes, the `categorical` type is intended for use when data producers want to explicitly indicate to implementations that the field `SHOULD` be
loaded as a categorical data type when supported by the implementation.
+
+The `categorical` field type `MUST` have the property `categories` that defines the set of possible values of the field. The `categories` property `MUST` be an array of strings, or an array of objects.
+
+When the `categories` property is an array of strings, the strings `MUST` be unique and `MUST` match the physical values of the field. For example:
+
+```json
+{
+  "name": "fruit",
+  "type": "categorical",
+  "categories": ["apple", "orange", "banana"]
+}
+```
+
+When the `categories` property is an array of objects, each object `MUST` have a `value` and an optional `label` property. The `value` property `MUST` be a string or number that matches the physical value of the field when representing that level. The optional `label` property, when present, `MUST` be a string that provides a human-readable label for the level. For example, if the codes `0`, `1`, and `2` were used as codes to represent the levels `apple`, `orange`, and `banana` in the previous example, the `categories` property would be defined as follows:
+
+```json
+{
+  "name": "fruit",
+  "type": "categorical",
+  "categories": [
+    { "value": 0, "label": "apple" },
+    { "value": 1, "label": "orange" },
+    { "value": 2, "label": "banana" }
+  ]
+}
+```
+
+The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the categories have a natural order. When present, the `ordered` property `MUST` be a boolean. For example:
+
+```json
+{
+  "name": "agreementLevel",
+  "type": "categorical",
+  "categories": [
+    { "value": 1, "label": "Strongly Disagree" },
+    { "value": 2 },
+    { "value": 3 },
+    { "value": 4 },
+    { "value": 5, "label": "Strongly Agree" }
+  ],
+  "ordered": true
+}
+```
+
+When the property `ordered` is not specified, implementations `MUST` assume a default value of `false`.
+
 ### `object`
 
 The field contains a valid JSON object.
From d66cdb138d079eee87454d388fdbb2d570a92c2f Mon Sep 17 00:00:00 2001
From: Kyle Husmann
Date: Wed, 3 Apr 2024 09:19:16 -0700
Subject: [PATCH 02/10] Add support for labeled missingness

Resolves https://github.com/frictionlessdata/specs/issues/880.

When paired with the `categorical` type, this gives us full support for the value labels found in many statistical software packages. I have included it here rather than in a separate PR because these issues are intertwined and there's a synergy in addressing them simultaneously. That said, if you would rather see this in a separate PR, let me know and I can revert.
---
 content/docs/specifications/table-schema.md | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index b1769ca3..bfb5bea9 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -125,9 +125,18 @@ Many datasets arrive with missing data values, either because a value was not co
 
 `missingValues` dictates which string values `MUST` be treated as `null` values. This conversion to `null` is done before any other attempted type-specific string conversion. The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to null will be done, on any value.
 
-`missingValues` `MUST` be an `array` where each entry is a `string`.
+`missingValues` `MUST` be an `array` where each entry is a `string`, or an `array` where each entry is an `object`.
 
-**Why strings**: `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing value which are not of their type, for example a `number` field to have missing values indicated by `-`.
+If an `array` of `object`s is provided, each object `MUST` have a `value` and optional `label` property. The `value` property `MUST` be a `string` that matches the physical value of the field. The optional `label` property `MUST` be a `string` that provides a human-readable label for the missing value. For example:
+
+```json
+"missingValues": [
+  { "value": "", "label": "OMITTED" },
+  { "value": "-99", "label": "REFUSED" }
+]
+```
+
+**Why strings**: `missingValues` are specified as strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing values which are not of their type, for example a `number` field to have missing values indicated by `-`.
 
 Examples:
 
From 8a22065db769a6c7d44d41d926968d983cfa9e0e Mon Sep 17 00:00:00 2001
From: Kyle Husmann
Date: Wed, 3 Apr 2024 09:59:40 -0700
Subject: [PATCH 03/10] clarify relationship to enum constraints

---
 content/docs/specifications/table-schema.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index bfb5bea9..d7ed3ce6 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -482,7 +482,7 @@ The `categorical` type facilitates interoperability with software packages that
 - [Factors (R)](https://www.stat.berkeley.edu/~s133/factors.html)
 - [CategoricalVectors (Julia)](https://dataframes.juliadata.org/stable/man/categorical/)
 
-Although [`enum`](#enum) constraints can provide similar functionality for validation purposes, the `categorical` type is intended for use when data producers want to explicitly indicate to implementations that the field `SHOULD` be loaded as a categorical data type when supported by the implementation.
+Although the `categorical` field type restricts a field to a finite set of possible values, like an [`enum`](#enum) constraint, the `categorical` field type enables data producers to explicitly indicate to implementations that a field `SHOULD` be loaded as a categorical data type (when supported by the implementation). By contrast, `enum` constraints simply add validation rules to existing field types. When an `enum` constraint is defined on a `categorical` field, the values in the `enum` constraint `MUST` be a subset of the physical values representing the levels of the `categorical`.
 
 The `categorical` field type `MUST` have the property `categories` that defines the set of possible values of the field. The `categories` property `MUST` be an array of strings, or an array of objects.
 
From 10668937f9ce06ca62fe31eb68fc4411d90c4e18 Mon Sep 17 00:00:00 2001
From: Kyle Husmann
Date: Mon, 8 Apr 2024 11:42:22 -0700
Subject: [PATCH 04/10] add backward compatibility note for enum constraints

---
 content/docs/specifications/table-schema.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index d7ed3ce6..0f679d42 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -744,6 +744,10 @@ A regular expression that can be used to test field values. If the regular expre
 
 The value of the field `MUST` exactly match one of the values in the `enum` array.
 
+:::note[Backward Compatibility]
+Many `v1.0` implementations imported fields with `enum` constraints as categorical data types. Starting in `v2.0` this behavior is discouraged in favor of explicit use of the [`categorical`](#categorical) field type. In `v2.0`, an `enum` constraint `SHOULD` be interpreted by implementations as a validation rule on an existing field type, and `SHOULD NOT` change the imported data type of the field.
+:::
+
 :::note[Implementation Note]
 
 - Implementations `SHOULD` report an error if an attempt is made to evaluate a value against an unsupported constraint.
From 49acb5a91c3ae9844f4f59e5a9913357b3f2639e Mon Sep 17 00:00:00 2001
From: Kyle Husmann
Date: Mon, 8 Apr 2024 12:02:24 -0700
Subject: [PATCH 05/10] clarify behavior of ordering property to match the order of level definition in the categories field

---
 content/docs/specifications/table-schema.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index 0f679d42..ce0a6eb2 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -510,7 +510,7 @@ When the `categories` property is an array of objects, each object `MUST` have a
 }
 ```
 
-The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the categories have a natural order. When present, the `ordered` property `MUST` be a boolean. For example:
+The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean. When `ordered` is `true`, implementations `SHOULD` use the order of the levels as defined in the `categories` property as the natural order of the levels. For example:
 
 ```json
 {
@@ -527,7 +527,7 @@ The `categorical` field type `MAY` additionally have the property `ordered` that
 }
 ```
 
-When the property `ordered` is not specified, implementations `MUST` assume a default value of `false`.
+When the property `ordered` is `false` or not present, implementations `SHOULD` assume that the levels of the `categorical` do not have a natural order.
 
 ### `object`
 
From 48a2e69848253bc5f6dd316df2d45e1b2b4d875c Mon Sep 17 00:00:00 2001
From: Kyle Husmann
Date: Mon, 8 Apr 2024 12:53:01 -0700
Subject: [PATCH 06/10] collapse bulleted data type list into paragraph

---
 content/docs/specifications/table-schema.md | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index ce0a6eb2..222d7582 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -474,13 +474,7 @@ The boolean field can be customised with these additional properties:
 
 The field contains categorical data, defined as data with a finite set of possible values that represent levels of a categorical variable.
 
-The `categorical` type facilitates interoperability with software packages that support categorical data types, including:
-
-- Value labels or formats ([Stata](https://www.stata.com/manuals13/dlabel.pdf), [SAS](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p1upn25lbfo6mkn1wncu4dyh9q91.htm), and [SPSS](https://www.ibm.com/docs/en/spss-statistics/beta?topic=data-adding-value-labels))
-- Categoricals ([Pandas](https://pandas.pydata.org/docs/user_guide/categorical.html), and [Polars](https://docs.pola.rs/user-guide/concepts/data-types/categoricals/))
-- [Enums (DuckDB)](https://duckdb.org/docs/sql/data_types/enum.html)
-- [Factors (R)](https://www.stat.berkeley.edu/~s133/factors.html)
-- [CategoricalVectors (Julia)](https://dataframes.juliadata.org/stable/man/categorical/)
+The `categorical` type facilitates interoperability with software packages that support categorical data types, including: Value labels or formats ([Stata](https://www.stata.com/manuals13/dlabel.pdf), [SAS](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p1upn25lbfo6mkn1wncu4dyh9q91.htm), and [SPSS](https://www.ibm.com/docs/en/spss-statistics/beta?topic=data-adding-value-labels)), Categoricals
([Pandas](https://pandas.pydata.org/docs/user_guide/categorical.html), and [Polars](https://docs.pola.rs/user-guide/concepts/data-types/categoricals/)), Enums ([DuckDB](https://duckdb.org/docs/sql/data_types/enum.html)), Factors ([R](https://www.stat.berkeley.edu/~s133/factors.html)), and CategoricalVectors ([Julia](https://dataframes.juliadata.org/stable/man/categorical/)).
 
 Although the `categorical` field type restricts a field to a finite set of possible values, like an [`enum`](#enum) constraint, the `categorical` field type enables data producers to explicitly indicate to implementations that a field `SHOULD` be loaded as a categorical data type (when supported by the implementation). By contrast, `enum` constraints simply add validation rules to existing field types. When an `enum` constraint is defined on a `categorical` field, the values in the `enum` constraint `MUST` be a subset of the physical values representing the levels of the `categorical`.
 
From 16407389ce6657c830714cb3fa8e4d522f534889 Mon Sep 17 00:00:00 2001
From: Kyle Husmann
Date: Mon, 8 Apr 2024 18:43:18 -0700
Subject: [PATCH 07/10] clarify that the ordering of the levels in the prop is *ascending order*

---
 content/docs/specifications/table-schema.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index 222d7582..2c857e4c 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -504,7 +504,7 @@ When the `categories` property is an array of objects, each object `MUST` have a
 }
 ```
 
-The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean.
When `ordered` is `true`, implementations `SHOULD` use the order of the levels as defined in the `categories` property as the natural order of the levels. For example:
+The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean. When `ordered` is `true`, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels, in ascending order. For example:
 
 ```json
 {
From ea5a0e7c9b0714d3015f5a0753e2aa47ad881da0 Mon Sep 17 00:00:00 2001
From: Kyle Husmann
Date: Tue, 9 Apr 2024 10:02:32 -0700
Subject: [PATCH 08/10] clarify ordering of levels when physical values are numeric & ordered = true

---
 content/docs/specifications/table-schema.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index 2c857e4c..1cbe69e8 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -504,7 +504,7 @@ When the `categories` property is an array of objects, each object `MUST` have a
 }
 ```
 
-The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean.
When `ordered` is `true`, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels, in ascending order. For example:
+The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean. When `ordered` is `true`, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels, in ascending order. In cases where the physical values are numeric and `ordered` is `true`, the order of the levels `SHOULD` match the numerical order of the values (e.g., 1, 2, 3, ...) to avoid ambiguity. For example:
 
 ```json
 {
From 4aa0d58870a574bdd0387fae7ee508e03230f0cb Mon Sep 17 00:00:00 2001
From: Kyle Husmann
Date: Mon, 22 Apr 2024 18:40:02 -0700
Subject: [PATCH 09/10] convert to fully physical/lexical values

---
 content/docs/specifications/table-schema.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index 9c832751..c2f4afef 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -498,32 +498,32 @@ When the `categories` property is an array of strings, the strings `MUST` be uni
 }
 ```
 
-When the `categories` property is an array of objects, each object `MUST` have a `value` and an optional `label` property. The `value` property `MUST` be a string or number that matches the physical value of the field when representing that level. The optional `label` property, when present, `MUST` be a string that provides a human-readable label for the level. For example, if the codes `0`, `1`, and `2` were used as codes to represent the levels `apple`, `orange`, and `banana` in the previous example, the `categories` property would be defined as follows:
+When the `categories` property is an array of objects, each object `MUST` have a `value` and an optional `label` property. The `value` property `MUST` be a string that matches the physical value of the field when representing that level. The optional `label` property, when present, `MUST` be a string that provides a human-readable label for the level.
For example, if the codes `"0"`, `"1"`, and `"2"` were used as codes to represent the levels `apple`, `orange`, and `banana` in the previous example, the `categories` property would be defined as follows:
 
 ```json
 {
   "name": "fruit",
   "type": "categorical",
   "categories": [
-    { "value": 0, "label": "apple" },
-    { "value": 1, "label": "orange" },
-    { "value": 2, "label": "banana" }
+    { "value": "0", "label": "apple" },
+    { "value": "1", "label": "orange" },
+    { "value": "2", "label": "banana" }
   ]
 }
 ```
 
-The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean. When `ordered` is `true`, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels, in ascending order. In cases where the physical values are numeric and `ordered` is `true`, the order of the levels `SHOULD` match the numerical order of the values (e.g., 1, 2, 3, ...) to avoid ambiguity. For example:
+The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean. When `ordered` is `true`, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels, in ascending order.
For example:
 
 ```json
 {
   "name": "agreementLevel",
   "type": "categorical",
   "categories": [
-    { "value": 1, "label": "Strongly Disagree" },
-    { "value": 2 },
-    { "value": 3 },
-    { "value": 4 },
-    { "value": 5, "label": "Strongly Agree" }
+    { "value": "1", "label": "Strongly Disagree" },
+    { "value": "2" },
+    { "value": "3" },
+    { "value": "4" },
+    { "value": "5", "label": "Strongly Agree" }
   ],
   "ordered": true
 }
From e493e604783e82a2ed7acdd7b970ab169016a412 Mon Sep 17 00:00:00 2001
From: Kyle Husmann
Date: Mon, 22 Apr 2024 18:52:04 -0700
Subject: [PATCH 10/10] re-arrange note about enum to end; define representation of logical values

---
 content/docs/specifications/table-schema.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index c2f4afef..566a4794 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -482,11 +482,7 @@ The boolean field can be customised with these additional properties:
 
 The field contains categorical data, defined as data with a finite set of possible values that represent levels of a categorical variable.
 
-The `categorical` type facilitates interoperability with software packages that support categorical data types, including: Value labels or formats ([Stata](https://www.stata.com/manuals13/dlabel.pdf), [SAS](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p1upn25lbfo6mkn1wncu4dyh9q91.htm), and [SPSS](https://www.ibm.com/docs/en/spss-statistics/beta?topic=data-adding-value-labels)), Categoricals ([Pandas](https://pandas.pydata.org/docs/user_guide/categorical.html), and [Polars](https://docs.pola.rs/user-guide/concepts/data-types/categoricals/)), Enums ([DuckDB](https://duckdb.org/docs/sql/data_types/enum.html)), Factors ([R](https://www.stat.berkeley.edu/~s133/factors.html)), and CategoricalVectors ([Julia](https://dataframes.juliadata.org/stable/man/categorical/)).
-
-Although the `categorical` field type restricts a field to a finite set of possible values, like an [`enum`](#enum) constraint, the `categorical` field type enables data producers to explicitly indicate to implementations that a field `SHOULD` be loaded as a categorical data type (when supported by the implementation). By contrast, `enum` constraints simply add validation rules to existing field types. When an `enum` constraint is defined on a `categorical` field, the values in the `enum` constraint `MUST` be a subset of the physical values representing the levels of the `categorical`.
-
-The `categorical` field type `MUST` have the property `categories` that defines the set of possible values of the field. The `categories` property `MUST` be an array of strings, or an array of objects.
+The `categorical` field type `MUST` have the property `categories` that defines the set of possible levels of the field. The `categories` property `MUST` be an array of strings, or an array of objects.
 
 When the `categories` property is an array of strings, the strings `MUST` be unique and `MUST` match the physical values of the field.
For example:
 
@@ -498,7 +494,7 @@ When the `categories` property is an array of strings, the strings `MUST` be uni
 }
 ```
 
-When the `categories` property is an array of objects, each object `MUST` have a `value` and an optional `label` property. The `value` property `MUST` be a string that matches the physical value of the field when representing that level. The optional `label` property, when present, `MUST` be a string that provides a human-readable label for the level. For example, if the codes `"0"`, `"1"`, and `"2"` were used as codes to represent the levels `apple`, `orange`, and `banana` in the previous example, the `categories` property would be defined as follows:
+When the `categories` property is an array of objects, each object `MUST` have a `value` and an optional `label` property. The `value` property `MUST` be a string that matches the physical value of the field when representing that level. The optional `label` property, when present, `MUST` be a string that provides a human-readable label for the level. For example, if the physical values `"0"`, `"1"`, and `"2"` were used as codes to represent the levels `apple`, `orange`, and `banana` in the previous example, the `categories` property would be defined as follows:
 
 ```json
 {
@@ -531,6 +527,10 @@ The `categorical` field type `MAY` additionally have the property `ordered` that
 
 When the property `ordered` is `false` or not present, implementations `SHOULD` assume that the levels of the `categorical` do not have a natural order.
 
+Although the `categorical` field type restricts a field to a finite set of possible values, similar to an [`enum`](#enum) constraint, the `categorical` field type enables data producers to explicitly indicate to implementations that a field `SHOULD` be loaded as a categorical data type (when supported by the implementation). By contrast, `enum` constraints simply add validation rules to existing field types.
+
+When an `enum` constraint is defined on a `categorical` field, the values in the `enum` constraint `MUST` be a subset of the logical values representing the levels of the `categorical`. Logical values of categorical levels are indicated by their labels, if present, or by their physical value, if a label is not present.
+
 ### `object`
 
 The field contains a valid JSON object.
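
To illustrate the wording as it stands after PATCH 10/10, here is a minimal Python sketch of how an implementation might derive the logical values of a `categorical` field and check an `enum` constraint against them. It assumes the field descriptor has already been parsed into a plain dict and that `enum` sits under the field's `constraints` property. The helper names `normalize_categories`, `logical_values`, and `enum_is_consistent` are hypothetical; they are not defined by the specification or taken from any existing implementation. Rejecting mixed string/object arrays is this sketch's reading of "an array of strings, or an array of objects".

```python
from typing import Any


def normalize_categories(categories: list[Any]) -> list[dict[str, str]]:
    """Turn either allowed `categories` form into a list of level dicts.

    Each returned dict has a "value" key and, when given, a "label" key.
    """
    if all(isinstance(c, str) for c in categories):
        # Array-of-strings form: the strings MUST be unique.
        if len(set(categories)) != len(categories):
            raise ValueError("category strings must be unique")
        return [{"value": c} for c in categories]

    if all(isinstance(c, dict) for c in categories):
        levels = []
        for c in categories:
            value = c.get("value")
            if not isinstance(value, str):
                raise ValueError("each category object needs a string `value`")
            label = c.get("label")
            if label is not None and not isinstance(label, str):
                raise ValueError("`label`, when present, must be a string")
            level = {"value": value}
            if label is not None:
                level["label"] = label
            levels.append(level)
        return levels

    raise ValueError("`categories` must be an array of strings or an array of objects")


def logical_values(field: dict[str, Any]) -> list[str]:
    """Logical value of a level: its label if present, else its physical value."""
    levels = normalize_categories(field["categories"])
    return [level.get("label", level["value"]) for level in levels]


def enum_is_consistent(field: dict[str, Any]) -> bool:
    """True when the `enum` constraint (if any) is a subset of the logical values."""
    enum = field.get("constraints", {}).get("enum")
    if enum is None:
        return True
    return set(enum) <= set(logical_values(field))
```

With the `agreementLevel` example from the patch, the logical values come out as the two labels plus the unlabeled physical values `"2"`, `"3"`, and `"4"`. Under this reading, an `enum` listing `"5"` would not be consistent, because that level's logical value is its label, `Strongly Agree`.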