
Add a categorical field type #48

Closed

Conversation

@khusmann (Contributor) commented Apr 2, 2024

Overview

A first pass at defining a categorical field type, designed to facilitate interoperability with software packages that support categorical data types. Needs work still, but just wanted to keep the creative juices flowing on this issue!

Paging @pschumm and @peterdesmet...

@khusmann marked this pull request as draft April 2, 2024 23:29
@nichtich commented Apr 3, 2024

Looks good, but the reference to enum could be improved: does categories imply enum? If there are categories, these SHOULD also be used for validation, shouldn't they? Is it allowed to have categories with values a and b but enum with a different set of values (e.g. d and e)?

In addition I'd support a feature to reference categories defined elsewhere, but this is another issue and could be added later.

@roll (Member) left a comment

Amazing @khusmann! Looks great!

I added a few comments

@khusmann (Contributor, Author) commented Apr 3, 2024

@nichtich Great questions.

If there are categories, these SHOULD also be used for validation, shouldn't they?

Yes, definitely.

Is it allowed to have categories with values a and b but enum with a different set of values (e.g. d and e)?

This should not be allowed. Enum constraints on categoricals should only include values from the valid set of levels. So a categorical with categories a, b, and c could have an enum constraint with some combination of a, b, or c.

does categories imply enum?

No, and enum constraints don't imply categorical either: enum constraints are a validation rule, a categorical is a type. It should be fully possible to have a categorical with "categories": ["a", "b", "c"] with a constraint of "enum": ["a", "b"]. This would do two things:

  1. Only allow values "a" and "b" in that field
  2. When imported into R et al, be imported as a factor type with levels = a, b, c.

In practice these situations come up when you have a survey item with multiple options, but you know for some reason (like a design rule) that participants should only select some subset of the options.
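For example, a descriptor sketch of that situation might look like this (field name hypothetical):

{
  "name": "response",
  "type": "categorical",
  "categories": ["a", "b", "c"],
  "constraints": {
    "enum": ["a", "b"]
  }
}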

Would you suggest we put these clarifications with enum in the text of categorical or in the enum section?

In addition I'd support a feature to reference categories defined elsewhere, but this is another issue and could be added later.

Agreed!

Resolves frictionlessdata/datapackage#880. When paired with the `categorical` type, this gives us full support for the value labels found in many statistical software packages.

I have included it here rather than in a separate PR because these issues are intertwined and there's a synergy in addressing them simultaneously. That said, if you would rather see this in a separate PR, let me know and I can revert.
@khusmann (Contributor, Author) commented Apr 3, 2024

Based on @roll's comments above, I've added a commit that adds support for labeled missingness.

As I mention in the commit log, I think the issues of categorical labels and labeled missingness are intertwined and there's a synergy in addressing them simultaneously. That said, if you would rather see this in a separate PR, let me know and I can revert.

@roll (Member) commented Apr 4, 2024

@khusmann
Thanks! Shall we promote it to a non-draft pull request? TBH, being a draft doesn't make much of a difference, except that it doesn't trigger a preview deployment =)

@khusmann marked this pull request as ready for review April 4, 2024 14:34
@nichtich commented Apr 4, 2024

Is it allowed to have categories with values a and b but enum with a different set of values (e.g. d and e)?

This should not be allowed. Enum constraints on categoricals should only include values from the valid set of levels. So a categorical with categories a, b, and c could have an enum constraint with some combination of a, b, or c.

Then this should be documented. I'd assume that categories imply a default enum if no enum is specified. Otherwise we can have a perfectly valid record that has value x in a field of categories a, b, c.

@khusmann (Contributor, Author) commented Apr 4, 2024

@nichtich

Then this should be documented.

What do you think about the language in the latest revision? (Last sentence in particular)

Although the categorical field type restricts a field to a finite set of possible values, like an enum constraint, the categorical field type enables data producers to explicitly indicate to implementations that a field SHOULD be loaded as a categorical data type (when supported by the implementation). By contrast, enum constraints simply add validation rules to existing field types. When an enum constraint is defined on a categorical field, the values in the enum constraint MUST be a subset of the physical values representing the levels of the categorical.

I don't think categorical type implies a default enum constraint because I see type validation and constraint validation as separate -- A categorical with levels "a", "b", "c", should throw a type validation error when validating string value "x" in the same way that an integer field will throw a type validation error on a string value "x".

Similarly, if we try to put an "enum": ["1", "2", "x"] constraint on an integer field, this is not allowed, right?

@pschumm commented Apr 8, 2024

I don't think categorical type implies a default enum constraint because I see type validation and constraint validation as separate -- A categorical with levels "a", "b", "c", should throw a type validation error when validating string value "x" in the same way that an integer field will throw a type validation error on a string value "x".

I agree that the conceptual distinction between type and constraint dictates that it should be possible, at least in principle, to use an enum constraint to restrict to only a subset of categories. Personally, as a data producer, I prefer to keep the schema as general as possible and to put file-specific validation rules in an inquiry; for example, a skip pattern in a survey that effectively limits to a subset of categories. But it should be possible to use an enum constraint in this way, just like you can do with other types.

Thanks very much @khusmann for doing the work to translate this from a pattern into a spec. I think that with your modifications in response to the various comments above, it now looks to be in excellent shape and accommodates full use of categorical variables.

I would point out that there is one possible use of value labels or formats (only relevant for Stata, SAS or SPSS) that this does not accommodate; namely, the case where you want to label only a few values that are not missing values but you don't want to have to enumerate all possible values in the schema. For example, you might have a top-coded age variable where you want to label the value 90 with "90 or older" but you don't want to have to enumerate all of the integers between 1 and 90. This may or may not be something you want to treat as categorical in your analyses, and so I don't really see this as a limitation in the way you've defined a categorical type. But it does prevent you from packing all of the information you need to define your value labels or formats into the schema.

I think this is conceptually distinct from the new categorical type and also from the new, expanded missingValues, so I think we can leave it aside for now. IMO this PR is now good to go.

@ezwelty commented Apr 8, 2024

I wasn't feeling swayed by the proposal ("but we already have enum!") until I saw that the proposed categories could handle the common use case of labeled values. This seems worth the addition of a new type. I would suggest that categories objects should allow additional properties, as I can immediately see the utility for avoiding the need to define each value as plain text in field.description. For example:

[
  {
    "value": "apple",
    "description": "Any member of the Malus genus"
  },
  {
    "value": "citrus",
    "description": "Any member of the Citrus genus"
  }
]

Could be rendered as part of the field's description as:

  • apple: Any member of the Malus genus
  • citrus: Any member of the Citrus genus

The overlap between this type's categories and an enum constraint does raise questions. Regarding data validation, I'm not concerned: the categories would raise a type error and enum a constraint error. Implementations could add fancier metadata checks (e.g. enum is a subset of categories), but that shouldn't be a requirement, since illogical (if valid) schemas would necessarily be discovered during data validation.

I'm more concerned about conversion. Going forward, should implementations drop any behavior that assumed a categorical from the presence of enum, and instead only do so for a categorical type? I'm thinking about conversion between Table Schema and SQLAlchemy or the reading of Tabular Data Resources into Python or R. In PostgreSQL, for example, I suppose enum would no longer be represented as an Enum (as is currently the case) and instead encoded as a column-level CHECK, but categories, when possible, would now be encoded as an Enum type?

Note that not all systems support unordered categoricals (e.g. PostgreSQL Enum), so it is maybe worth mentioning that the lexical/physical order of the categories in the schema should be taken as their order in this case?

@peterdesmet (Member) commented

Thanks @khusmann for translating this into a PR 🎉

@pschumm commented Apr 8, 2024

Note that not all systems support unordered categoricals (e.g. PostgreSQL Enum), so it is maybe worth mentioning that the lexical/physical order of the categories in the schema should be taken as their order in this case?

This is a good point that I missed; I agree that the spec should be clear on this. Since lexical order is not always meaningful, I think physical order of the array should be used by default in creating the corresponding target data representation (e.g., in creating a categorical or value label). This does create an ambiguity however in a case like:

{
  "name": "fruit",
  "type": "categorical",
  "categories": [
    { "value": 2, "label": "banana" },
    { "value": 0, "label": "apple" },
    { "value": 1, "label": "orange" }
  ]
}

where in Pandas you would create pd.Categorical(['banana','apple','orange']) but when creating a value label (e.g., Stata or SPSS) or format (e.g., SAS) you would do so based on the integer values, which in this case do not appear in lexical order. I think we could perhaps handle this by saying something like "The physical order, not the lexical order, of the categories SHOULD be used by implementations when creating a target representation of the data (e.g., a categorical in Pandas or factor variable in R) unless the data's original values are maintained (e.g., when creating a value label in Stata or SPSS, or a format in SAS)." Thoughts?
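A sketch of that suggested behavior in Pandas (not implementation code; the data values are hypothetical):

import pandas as pd

# Build the target representation following the physical order of the
# `categories` array -- banana, apple, orange -- rather than lexical order.
categories = [
    {"value": 2, "label": "banana"},
    {"value": 0, "label": "apple"},
    {"value": 1, "label": "orange"},
]
labels = [c["label"] for c in categories]
dtype = pd.CategoricalDtype(categories=labels)   # physical order preserved
codes_to_labels = {c["value"]: c["label"] for c in categories}
raw = [0, 2, 1, 0]                               # hypothetical data values
column = pd.Series([codes_to_labels[v] for v in raw], dtype=dtype)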

@pschumm commented Apr 8, 2024

I would suggest that categories objects should allow additional properties, as I can immediately see the utility for avoiding the need to define each value as plain text in field.description.

Just to be clear, are you proposing to permit arbitrary additional properties, or only predefined properties like description whose name is already used at the field level? And if the former, would we suggest that people place them in a special namespace (e.g., custom) as with the addition of other 3rd party properties?

(Parenthetical comment: Your mention of description makes me feel that the use of title (at the field level) versus label (at the category level) is a bit inconsistent, though if anything, I prefer label to title in both places (e.g., we typically use a field's title as its variable label in Stata or SAS). Oh well—foolish consistency and all that, methinks...)

@khusmann (Contributor, Author) commented Apr 8, 2024

@pschumm

I would point out that there is one possible use of value labels or formats (only relevant for Stata, SAS or SPSS) that this does not accommodate; namely, the case where you want to label only a few values that are not missing values but you don't want to have to enumerate all possible values in the schema.
I think this is conceptually distinct from the new categorical type and also from the new, expanded missingValues, so I think we can leave it aside for now. IMO this PR is now good to go.

Good point. And I agree that this functionality is conceptually distinct; we can address it in a future pattern.

@ezwelty

I would suggest that categories objects should allow additional properties, as I can immediately see the utility for avoiding the need to define each value as plain text in field.description

Precisely! I think this will open up a lot of patterns / extensions for defining category-level metadata that can be used to render documentation / code books / data dictionaries.

This isn't something that we need to explicitly allow in the spec though, right? (Because all frictionless descriptors can be extended by user defined properties and future patterns / extensions?)

I'm more concerned about conversion…

I think the old behavior in many implementations (that wasn't defined in the spec, as @peterdesmet pointed out) of converting enum constraints into categorical types was a conflation of field type parsing and constraint validation. Specifying a constraint on a field type should not change the type of the field; it should only constrain the possible values of that type. One of the benefits of having an explicit categorical type is that it creates an explicit separation between the roles of field type validation and constraint validation.

If the categorical type is included in V2 of the spec, it gives us a clear differentiation point for implementation behavior on this front though. Implementations can keep the old enum-constraint-type-conversion behavior for V1 specs, but moving forward these should now be specified as categorical field types (when authors want the field to be imported as such).

I've added this as a backward compatibility note in the enum section as suggested by @peterdesmet

Note that not all systems support unordered categoricals (e.g. PostgreSQL Enum), so it is maybe worth mentioning that the lexical/physical order of the categories in the schema should be taken as their order in this case?

Good catch. I've just updated the spec to explicitly address this:

The categorical field type MAY additionally have the property ordered that indicates whether the levels of the categorical have a natural order. When present, the ordered property MUST be a boolean. When ordered is true, implementations SHOULD use the order of the levels as defined in the categories property as the natural order of the levels.
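For example, an ordered scale would read something like this (a sketch; names hypothetical):

{
  "name": "agreement",
  "type": "categorical",
  "categories": ["Disagree", "Neutral", "Agree"],
  "ordered": true
}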

@pschumm

This does create an ambiguity however in a case like:

For the case you provide, I don't think there's any issue because it's an unordered categorical and so it should not matter if the implementation internally represents it as pd.Categorical(['banana','apple','orange']), pd.Categorical(['apple', 'orange', 'banana']), etc.

For ordered categoricals I think it presents more of an issue for SAS / SPSS / Stata implementations because they order their ordinal vars by way of the codes. If a producer created a coded ordinal variable where for whatever reason they wanted the lexical ordering of the codes to be different from the order they actually intended for the levels, then this would not be immediately import-able into SAS / SPSS / Stata without further transformation.

I think this is enough of a weird / exceptional case we can leave this up to implementations to resolve. For example, if I was trying to import data into SAS / SPSS / Stata and this very special case came up, I'd want it to trigger some user intervention to resolve the conflict between the ordering of the codes and the ordering of the levels. (The user needs to choose between preserving the ordering, or preserving the codes).

For the spec definition, I think we can just leave it with the simple blanket "SHOULD use the order of the levels as defined in the categories property" and leave it to implementations to decide how to best interactively resolve these edge cases when converting between conflicting formats. What do you think?

@khusmann (Contributor, Author) commented Apr 9, 2024

Note -- just realized the direction of ordering of levels was not explicitly defined; just added a commit to clarify that it is ascending order.

@pschumm commented Apr 9, 2024

For the spec definition, I think we can just leave it with the simple blanket "SHOULD use the order of the levels as defined in the categories property" and leave it to implementations to decide how to best interactively resolve these edge cases when converting between conflicting formats. What do you think?

I agree with everything you said, but I still think it makes sense to include guidance for data creators in the case where the values are integers (i.e., where in Stata, SAS or SPSS the variable would presumably be represented by a labeled/formatted integer). Something like "In cases where the physical values are integers and ordered is true, the categories SHOULD be listed in numerical order of the values (e.g., 1, 2, 3, ...) to avoid ambiguity."

@khusmann (Contributor, Author) commented Apr 9, 2024

@pschumm

Ah, yeah, that's not bad at all! Added your line in the latest commit with minor modifications:

In cases where the physical values are numeric and ordered is true, the order of the levels SHOULD match the numerical order of the values (e.g., 1, 2, 3, ...) to avoid ambiguity.

Changed integer to numeric because I figured it'd be weird to require ordering in this case for integers but not for other numeric types.

@ezwelty commented Apr 9, 2024

@khusmann I notice that the spec allows a categorical of mixed type (?). This introduces some problems. First, every category value needs to be checked to determine the data type or whether it is of mixed type. In the latter case, strange things could happen: systems that cannot support this need to default to something (presumably string), and systems that can need to check each value to determine the type (e.g. is "0" in this CSV 0 or "0"?), which could potentially be ambiguous, e.g. categories: [0, 0.0, "0"]. Do we want to allow this, or add a valueType/itemType or similar?

After thinking the whole idea over, I do feel some hesitation. The Table Schema spec is intended mainly for data publishing: here is the data and here is the metadata you'll need to correctly interpret it. It largely avoids the question of how the data should be represented once loaded (e.g. no distinction between different integer, decimal, or floating point formats), but this proposal pushes the spec in that direction: "here are some strings, but please load them as a (un)ordered categorical".

@khusmann (Contributor, Author) commented Apr 9, 2024

@ezwelty

I notice that the spec allows a categorical of mixed type (?).

As @peterdesmet pointed out, I think this ambiguity will be resolved if/when #49 is accepted, because we can replace "string or number" with "native value".

But I think I agree with your point here regarding the larger implementation issue native values represent across the spec (not limited to categoricals): frictionlessdata/datapackage#49 (comment)

If #49 is not accepted, I think the value property in categories should always be type string, for the same reason that missingValues are always type string.

But when/if #49 is accepted, and missingValues allows native values, then I think the value property in categoricals should also be native values.

So I don't see this as an issue / limitation of the current proposal; it will be resolved once #49 is resolved.

The Table Schema spec is intended mainly for data publishing: here is the data and here is the metadata you'll need to correctly interpret it

I agree! But this is the very reason we need categorical types -- there's a massive amount of research data published as coded categoricals (e.g. 1: MALE, 2: FEMALE). So without labels you'd just get a stream of meaningless integers instead of level labels. You need the label metadata in order to correctly interpret it.
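A coded field like that might be described as follows (a sketch; field name hypothetical):

{
  "name": "sex",
  "type": "categorical",
  "categories": [
    { "value": "1", "label": "MALE" },
    { "value": "2", "label": "FEMALE" }
  ]
}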

Similarly, in order to perform the correct kinds of statistical analyses (or visualizations) when making use of the data, you need to know if the categorical was ordered or unordered, and have the exhaustive list of what all the possible levels were (that may or may not be all represented in the data).

To me, the argument for inclusion of categorical as its own field type is that they are distinct logical data types in the same way dates, times, durations, etc. are distinct logical data types in the spec: Yes, they can be physically stored as numbers or strings, but have a distinct abstract set of properties that make them logically different from numbers and strings. And these properties have implications for both the validation and usage of the type (and display of the type in UI widgets). So I'd argue "here are some strings, but please load them as a (un)ordered categorical" is no different from "here are some strings, but please load them as a date".

(Also, the spec does have an integer field type! We just don't distinguish between floating point and decimal types at present.)

@khusmann (Contributor, Author) commented Apr 9, 2024

If #49 is not accepted, I think the value property in categories should always be type string, for the same reason that missingValues are always type string.

But when/if #49 is accepted, and missingValues allows native values, then I think the value property in categoricals should also be native values.

Actually, I'm realizing the former case would conflict with the recent addition of requiring levels with numeric values to be listed in order, so that would need to be revisited in that event.

In any case, whatever direction we take for missingValues, trueValues, and falseValues (strings vs native values) can simply be dropped in here.

@pschumm commented Apr 9, 2024

At the risk of being a bit duplicative (apologies in advance):

After thinking the whole idea over, I do feel some hesitation. The Table Schema spec is intended mainly for data publishing: here is the data and here is the metadata you'll need to correctly interpret it. It largely avoids the question of how the data should be represented once loaded (e.g. no distinction between different integer, decimal, or floating point formats)...

I totally agree; in fact, the original rationale behind a previous version of this concept stated that explicitly. The additional information provided is intended solely as metadata indicating how to interpret the field:

  1. It is categorical, and here is the full set of possible categories.
  2. In cases where the data are stored in encoded form, here is how to interpret each value (i.e., via its label).
  3. Whether there is a natural ordering to the categories or not.

Note that (1) is not the same as the information provided by an enum constraint as @khusmann notes above; the former is a statement about the inherent type, while the latter is an assertion about the values in this particular resource. It is understood that software will vary widely in its ability to utilize these metadata, and this is entirely an implementation detail.
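A descriptor sketch carrying all three pieces of metadata (values hypothetical):

{
  "name": "agreement",
  "type": "categorical",
  "categories": [
    { "value": "1", "label": "Strongly Disagree" },
    { "value": "2", "label": "Disagree" },
    { "value": "3", "label": "Neutral" },
    { "value": "4", "label": "Agree" },
    { "value": "5", "label": "Strongly Agree" }
  ],
  "ordered": true
}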

@khusmann I notice that the spec allows a categorical of mixed type (?). This introduces some problems. First, every category value needs to be checked to determine the data type or whether it is of mixed type. In the latter case, strange things could happen: systems that cannot support this need to default to something (presumably string), and for systems that can, they need to check each value to determine the type (e.g. is "0" in this CSV 0 or "0"?), which could potentially be ambiguous, e.g. categories: [0, 0.0, "0"]. Do we want to allow this, or add an valueType/itemType or similar?

This is a fair point, though to some extent I agree with @khusmann that this is at least partially addressed by #49. AFAIK only Pandas permits you to define a mixed type categorical, presumably treating all of the values as dtype object, so other software would have to come up with a sensible way of handling this in cases where it encounters a mixed type categorical. But I'd rather not exclude this possibility and would hate to complicate things further with an additional property if we can avoid it.

@pschumm commented Apr 9, 2024

Actually, I'm realizing the former case would conflict with the recent addition of requiring levels with numeric values to be listed in order, so that would need to be revisited in that event.

I believe your latest change suggests this but does not require it (which is consistent with what I had suggested).

I don't think we need to get hung up on this, and am glad to retract it if folks prefer. I believe it is only relevant for Stata, SAS and SPSS, where a categorical with all numeric values would be cast to a numeric variable. I agree that the order of the array should, in cases where ordered is true, solely determine the order of the categories. But that would not work in the case of Stata, SAS or SPSS if the variable was cast to numeric, leaving the data consumer to wonder whether (1) this was simply an error or oversight in the way the schema was generated (possibly following manual editing), and the numeric order of the cast values is correct; (2) the field definition is exactly as intended and cannot be represented straightforwardly in Stata, SAS or SPSS without recoding the values; or (3) there is a more serious error and the field is not fully interpretable.

In sum, since a major reason for introducing categorical support is to facilitate use of Frictionless by people working with these software packages, I think this suggestion (not requirement but only suggestion) would be helpful, but I also agree with a recent comment (can't recall where) that the specs shouldn't really be referring to specific software packages. So I'm glad to defer to group opinion on this.

@ezwelty mentioned this pull request Apr 10, 2024
@ezwelty commented Apr 10, 2024

Thank you @khusmann and @pschumm for your responses. I'm feeling better about this now :)

Sorry for missing #49, I agree that this partly addresses the issues I brought up. I'm working down the list from the latest weekly update, and hadn't yet made it that far. I still see the potential for ambiguity, which I brought up in that issue (frictionlessdata/datapackage#49 (comment)), and which would point towards an optional arrayItem or similar for types not natively supported by JSON.

@khusmann (Contributor, Author) commented Apr 10, 2024

I still see the potential for ambiguity, which I brought up in that issue (#49 (comment)), and which would point towards an optional arrayItem or similar for types not natively supported by JSON.

I very much agree with you there! All is well and good when the data backing the field is always strings from delimited textual data, but when native types are adopted (as it seems they will be), the underlying native type driving categories becomes ambiguous, in the same way the underlying type driving missingValues, trueValues, and falseValues becomes ambiguous (as does the source native type's role in the validation of the field).

With native types, I think we get two options -- define arrayItem or similar props as you say, that explicitly declare the source type being imported, or have native type -> JSON coercion rules (as suggested by @pschumm).

Or, here's another potential option -- we could limit the native types that are allowed to represent the categorical, when categorical data types are not supported in the native format. It looks like this is already being done on other types in the native type proposal, so this fits in well. We would restrict it to: 1) native categorical types, 2) native string representations of categoricals, or 3) native numeric representations of categoricals.

This way if someone tried to define a categorical field type using a native date type (as you showed in an example), this would not be allowed.

If I follow the example for number:

Native Representation

If supported, categorical values MUST be natively represented by a data format. If not supported, values MUST be represented as native strings or numbers following the rules below.

Thoughts?

@peterdesmet (Member) left a comment

@khusmann I've made two suggestions to already allow categories to point to a URL or Path with the object (cf. schema)

@khusmann (Contributor, Author) commented

@peterdesmet

@khusmann I've made two suggestions to already allow categories to point to a URL or Path with the object (cf. schema)

Thanks for these suggestions! I'm a little hesitant to merge in this round because I think we should let the discussion simmer in frictionlessdata/datapackage#888 a bit more to better define our approach for external definitions.

As written, referencing a JSON in categories would require a different JSON file for each category list, right? I think that might get unwieldy for a lot of category lists -- in a mid-size project I'm currently working on, that would mean ~30 JSON files for categorical scales I reuse across the project, each only 5-10 lines long.

I'd rather have a system closer to what @nichtich suggested here: frictionlessdata/datapackage#875 (comment) , where the categoryTypes property could point to a single JSON with ALL of the category definitions. It could also work for a list of paths to JSON files with named category lists that could be merged, similar to what we've talked about for merging external schema definitions.
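Purely as an illustration of that idea (hypothetical syntax, not part of this proposal):

{
  "categoryTypes": "category-definitions.json",
  "fields": [
    { "name": "fruit", "type": "categorical", "categories": "fruits" },
    { "name": "vegetable", "type": "categorical", "categories": "vegetables" }
  ]
}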

Would you be ok with keeping the current proposal as it stands, and then we can add external category lists in a future iteration after we've had more discussion about the pros/cons of different approaches?

@peterdesmet (Member) commented

Would you be ok with keeping the current proposal as it stands, and then we can add external category lists in a future iteration

Sure, makes sense!

@roll (Member) commented Apr 18, 2024

Hi @khusmann!

Is it not a draft anymore? 😃

@khusmann changed the title [Draft] Add a categorical field type → Add a categorical field type Apr 18, 2024
@khusmann (Contributor, Author) commented

@roll

Is it not a draft anymore? 😃

I would say it's looking pretty stable now... just removed [Draft] from the title, if that's what you were looking for? Let me know if you need anything else to keep this moving forward...

@roll (Member) commented Apr 19, 2024

I would say it's looking pretty stable now... just removed [Draft] from the title, if that's what you were looking for? Let me know if you need anything else to keep this moving forward...

Thanks! Yea, it was a draft, although quite ready, I think. We need a quorum now 😃


Although the `categorical` field type restricts a field to a finite set of possible values, like an [`enum`](#enum) constraint, the `categorical` field type enables data producers to explicitly indicate to implementations that a field `SHOULD` be loaded as a categorical data type (when supported by the implementation). By contrast, `enum` constraints simply add validation rules to existing field types. When an `enum` constraint is defined on a `categorical` field, the values in the `enum` constraint `MUST` be a subset of the physical values representing the levels of the `categorical`.

The `categorical` field type `MUST` have the property `categories` that defines the set of possible values of the field. The `categories` property `MUST` be an array of strings, or an array of objects.
A review comment:

Why just an array of strings, why not also an array of integers, since you can specify integers for value in the expanded syntax? More generally, why limit ourselves here at all?

@khusmann (Contributor, Author) commented Apr 23, 2024

I think this PR is getting pulled in two different directions because of #49, so I'm splitting this PR into two separate ones. I've edited this one to be all-in on lexical/physical values, and will push a native value version shortly.

The edits required to support native values were a little more involved than I thought, because as @pschumm mentioned, different native formats have different levels of support for level types, labels, and ordering. So the categories prop becomes optional, because it may already be described by the native format.

This is the beauty, I think, of the lexical (physical) values approach – we don't need to think about all the possible native types we could be converting from and how they interact with this type, because all we're looking at are textual representations of the values (as if coming from a CSV) without any other type information from the native format (like representation type, levels, order) attached.

I also addressed @ezwelty's excellent point about the enum constraints from the data representation thread. I think we actually had this wrong in our earlier draft: constraints operate on logical values, not physical values. This ambiguity is actually a little easier to solve than the native value issue, because we can define what the logical representation of a categorical level should be when used in the spec. Here, I've defined it as:

Logical values of categorical levels are indicated by their labels, if present, or by their physical value, if a label is not present.

Because physical values are by definition strings, and the labels are strings, the values in the enum constraint will always be strings.

Other changes:

  • I took out the reference to all of the implementations of categorical types because it started feeling distracting from the type definition; as @ezwelty mentioned earlier it focuses too much on "how the data should be represented once loaded".

  • I removed the "SHOULD be listed in numerical order" language now that everything is a string (to keep things simple). We could re-add this with something like, "If the physical values representing the levels are convertible into numeric values, then…"

@ezwelty left a comment

A few things stood out to me as being unintuitive or unclear. Thanks for all your hard work and for putting up with my pickiness.


When the `categories` property is an array of objects, each object `MUST` have a `value` and an optional `label` property. The `value` property `MUST` be a string that matches the physical value of the field when representing that level. The optional `label` property, when present, `MUST` be a string that provides a human-readable label for the level. For example, if the physical values `"0"`, `"1"`, and `"2"` were used as codes to represent the levels `apple`, `orange`, and `banana` in the previous example, the `categories` property would be defined as follows:
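{
  "name": "fruit",
  "type": "categorical",
  "categories": [
    { "value": "0", "label": "apple" },
    { "value": "1", "label": "orange" },
    { "value": "2", "label": "banana" }
  ]
}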
A review comment:

So just to make sure I understand the use of physical here, do you mean that the categorical type can only be used on a field whose data is stored as string, or would this also be valid on e.g. values stored as numbers in JSON, which would be cast to strings before comparison?

@khusmann (Contributor, Author) replied:

My original intent for physical here is the same as how it's used in missingValues or trueValues in v1. They are string values that are used to match on the source value before it is cast to logical, as described in the "why strings" section in missingValues.

So in my reading of the v1 spec, "missingValues": ["-99"] will match on both a CSV "-99" as well as a JSON numeric -99. Similarly "trueValues": ["1"] would match on a CSV "1", JSON number 1 or SQLite integer 1. This allows the same schema definition to have similar behavior across formats. So here a "value": "0" will also match on a field with JSON number 0, or SQLite integer 0 (or a SPSS integer 0 @pschumm).
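A minimal sketch of that matching rule as I read it (a hypothetical helper, not the frictionless implementation):

def is_missing(raw_value, missing_values):
    # Compare the string form of the source value against the schema's
    # physical values, before any cast to a logical type.
    return str(raw_value) in missing_values

assert is_missing("-99", ["-99"])  # CSV string
assert is_missing(-99, ["-99"])    # JSON number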

...but yeah, it's not ideal, for all the reasons we've been talking about in #49. It works great for delimited text, but with typed/binary formats it starts getting hazy. (It doesn't help that v1 later says physical values can include type info... but that is contradicted by the fact that missingValues (and trueValues) are all string)

I keep finding myself thinking that for well-defined support of typed/binary formats we'd need a format-dependent "native value" validation layer (like TableDialect, but just asserts native value types) before values get to TableSchema for final cast to "logical value" along with logical validation. Or something like that. Our challenge is that we're trying to do both steps in one layer with TableSchema, which forces us to make these uncomfortable trade-offs & coercions.

Back to categorical here though -- are you favoring we take out the reference to string? Or are you thinking more in an itemType property direction?

A reply:

I think your intention is good, but it isn't actually what v1 had in mind (or at least how it has been interpreted in the core implementation), because it was thinking that trueValues and missingValues would be for CSV only, since JSON has native null and true and therefore does not require such string → logical conversions. To me, the key is this line from the constraints section:

Say we have the following:

{
  "profile": "tabular-data-resource",
  "name": "resource",
  "data": [
    {
      "nullable": 1,
      "truthy": 1
    },
    {
      "nullable": "1",
      "truthy": "1"
    },
    {
      "nullable": null,
      "truthy": true
    }
  ],
  "schema": {
    "fields": [
      {
        "name": "nullable",
        "type": "any"
      },
      {
        "name": "truthy",
        "type": "boolean",
        "trueValues": ["1"]
      }
    ],
    "missingValues": ["1"]
  }
}

frictionless reads this data as follows:

import frictionless

resource = frictionless.Resource('resource.json')
resource.to_pandas()
#   nullable  truthy
# 0        1   False
# 1     None   False
# 2     None    True

Note that only "1" was cast to None, not 1. (I have no idea why both "1" and 1 are loaded as False (@roll ?).

So I think what you are suggesting here is actually something "new": category values are strings, and should be matched to field values after casting these field values to string.

@khusmann (Contributor, Author) replied:

(or at least how it has been interpreted in the core implementation)

Agreed, I can definitely see that now after the discussion in the data representation PR. I've been treating the bits in the implementation that handle native values in these ways as somewhat undefined extensions because the core spec was not clear on these fronts. But I guess enough people have been making use of this interpretation that it's the de facto standard at this point.

it was thinking that trueValues and missingValues would be for CSV only,

Right – there are many parts of the spec that are CSV-specific, which I think are going to create more ambiguity the more native formats we try to support with the same schema model.

since JSON has native null and true and therefore does not require such string → logical conversions.

missingValues is still necessary in JSON, when data sets use missing codes. For example:

 "data": [
    {
      "field1": 1,
      "field2": 1
    },
    {
      "field1": 2,
      "field2": 3
    },
    {
      "field1": -99,
      "field2": "OMITTED"
    },
    {
      "field1": -98,
      "field2": "REFUSED"
    }
  ],

Similarly, trueValues is necessary for formats like SQLite, which doesn't have native boolean (unless we allow booleans to implicitly cast from integer types).

So I think what you are suggesting here is actually something "new": category values are strings, and should be matched to field values after casting these field values to string.

Yeah, the way I've been thinking about the spec definitely diverges from the current implementation, but not just re: categorical values – I've been thinking that all physical values should be matched as string in the spec, to avoid ambiguity with logical types that can be equally stored as native numbers or strings (which include dates and time intervals as well as categoricals). But I realize this has its own set of problems too (it replaces type coercion ambiguity with type serialization ambiguity) – so perhaps we should just run with what we're already doing in the python implementation, as the Data Representation PR suggests.

---

Back to categorical values though: I would say categorical values are neither strings nor numbers, in the same way Dates are neither strings nor numbers – they're a discrete logical type. I think the logical type of categorical should always be represented as string in the schema (i.e. in constraints) so they do not get mixed in with numeric values, but some categoricals should have the option in implementations to be loaded as their numeric codes (instead of labels) if desired. (e.g. r.to_pandas(categorical_codes = True))

What if we include a valueType property (as you mentioned near the beginning) to make the conversion more explicit?

{
  "name": "fruit",
  "type": "categorical",
  "valueType": "integer",
  "categories": [
    { "value": 0, "label": "apple" },
    { "value": 1, "label": "orange" },
    { "value": 2, "label": "banana" }
  ]
}

This would give us a categorical with logical values apple, orange, and banana, but implementations would have the option to load the coded values 0, 1, 2 instead of logical values if desired. Similarly,

{
  "name": "agreement_scale",
  "type": "categorical",
  "valueType": "integer",
  "categories": [0, 1, 2]
}

This would give us a categorical with logical values "0", "1", "2", but alternatively loadable as integer codes if desired. (When no labels are given, codes are converted into string labels to form the logical type.)

This side-steps the whole native values issue because our "valueType" is now explicit and no longer tied to the native format. To summarize: categorical values have labels and codes. Labels are the logical values of the categorical and are always represented via string. Codes provide an alternative representation of the logical values and are either string or integer (specified via valueType).

Thoughts?

A reply:

Could the example above not be simplified to:

{
  "name": "fruit",
  "type": "integer",
  "categories": [
    { "value": 0, "label": "apple" },
    { "value": 1, "label": "orange" },
    { "value": 2, "label": "banana" }
  ]
}

The existence of the categories field then implies that the field can/should be interpreted as a categorical variable?

@khusmann (Contributor, Author) replied May 20, 2024:

Could the example above not be simplified to:

Sure, this points us back in a similar direction as enumLabels, as described here. The key difference here is that we infer categorical based on the existence of the categories (formerly enumLabels) prop rather than looking in constraints as we did with enumLabels. I think this is an improvement on enumLabels, because it decouples type validation and constraint validation.

So for a categorical without codes we'd have something like:

{
  "name": "fruit",
  "type": "string",
  "categories": [ "apple", "orange", "banana" ]
}

Logical values of the categorical would then necessarily be their primitive type, so constraints would be specified as follows:

{
  "name": "fruit",
  "type": "string",
  "categories": [ "apple", "orange", "banana" ],
  "constraints": {
    "enum": ["apple"]
  }
}
{
  "name": "fruit",
  "type": "integer",
  "categories": [
    { "value": 0, "label": "apple" },
    { "value": 1, "label": "orange" },
    { "value": 2, "label": "banana" }
  ],
  "constraints": {
    "enum": [0]
  }
}

(We would then also have a boolean property something like categoriesOrdered to indicate ordering)

If we decide that the logical values of categoricals in frictionless are indeed the values of their primitive type (rather than represented by their labels), then perhaps this is the approach we should take.

I'm liking this idea -- @ezwelty @pschumm what do you think?

A reply:

@khusmann I've been hesitant about defining labels as the logical values of a categorical because they are optional and, when present, not equal to what is actually stored in the source file. So I think this is a good approach, since it takes care of the typing issues, and implementations can, as before (perhaps with control from the user), load fields with categories into categorical data types.


The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean. When `ordered` is `true`, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels, in ascending order. For example:
A review comment:

I don't understand what you mean by ascending order. If the order is defined by how they are physically ordered in categories, then no sorting is needed, so ascending/descending does not play a role.

@khusmann (Contributor, Author) replied:

Ah, ascending order here relates to the idea that the order of the list is low to high, not high to low. Should we say something like,

When ordered is true, implementations SHOULD interpret the order of the levels as defined in the categories property as the natural ordering of the levels, where the first level represents the "lowest" level

Or something like that?

@ezwelty replied May 2, 2024:

That also seems confusing, because it might be interpreted as referring to low vs. high logical values, rather than their position.

implementations SHOULD interpret the order of the levels as the order in which they are defined [alternate: listed] in the categories property.

To me, that is clear. The categories [c, a, b] are ordered as given – [c, a, b] – and not e.g. [a, b, c] as might be understood based on the terms "ascending order" or that the first level is the "lowest" level.

p.s. I don't think the word "level" is needed, and would be better replaced by "category" throughout. So we have:

implementations SHOULD interpret the order of the categories as the order in which they are defined [alternate: listed] in the categories property.


Although the `categorical` field type restricts a field to a finite set of possible values, similar to an [`enum`](#enum) constraint, the `categorical` field type enables data producers to explicitly indicate to implementations that a field `SHOULD` be loaded as a categorical data type (when supported by the implementation). By contrast, `enum` constraints simply add validation rules to existing field types.

When an `enum` constraint is defined on a `categorical` field, the values in the `enum` constraint `MUST` be a subset of the logical values representing the levels of the `categorical`. Logical values of categorical levels are indicated by their labels, if present, or by their physical value, if a label is not present.
A review comment:

This definition of the logical value seems worth highlighting, as it may not be intuitive. For the example above, it would result in ["Strongly Disagree", "2", "3", "4", "Strongly Agree"]. Frankly, partial use of labels seems so unusual given this definition that I would stick to just two examples: one with only values (fruit), one with a label for each value and an ordered: true property (survey responses), and then maybe use the freed-up space to clarify that even though the data file contains the values "1"..., enum would be ["Strongly Disagree", ...].
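For reference, the partially labeled case being discussed would look something like this (a reconstructed sketch, not the example from the file):

{
  "name": "agreement",
  "type": "categorical",
  "categories": [
    { "value": "1", "label": "Strongly Disagree" },
    { "value": "2" },
    { "value": "3" },
    { "value": "4" },
    { "value": "5", "label": "Strongly Agree" }
  ]
}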

@khusmann (Contributor, Author) replied:

I think the challenge here is that a logical categorical is defined by its abstract levels, not the values that represent the levels. So in the wild you'll see

  1. Categoricals with levels represented by meaningless values that correspond to labels (1: MALE, 2: FEMALE)
  2. Categoricals with levels represented by meaningful values ("MALE", "FEMALE")
  3. Categoricals with levels represented by meaningful values that all correspond to labels (A 1-3 agreement scale where 1: "Disagree", 2: "Neutral", 3: "Agree")
  4. Categoricals with levels represented by meaningful values where some correspond to labels (like the example you highlighted).

For 1-3, I think it's clear that the logical values should be represented by the labels, or the values when the labels aren't available.

4 is sort of a weird case, as you say. But still very common. I prefer logical values ["Strongly Disagree", "2", "3", "4", "Strongly Agree"] to ["1", "2", "3", "4", "5"] for consistency with the above rule, and because it helps distinguish the logical values from their underlying codes. It's the result I'd want if I imported this into R or Pandas, for example, where I don't have logical types that simultaneously store both code and label representations.

But yeah, I agree it's awkward to jump to such an unusual example... Maybe do what you're suggesting but then make some small mention of what to do when it is partially labeled?

@pschumm what are your thoughts?

@khusmann (Contributor, Author) commented

A few things stood out to me as being unintuitive or unclear. Thanks for all your hard work and for putting up with my pickiness.

Your comments are quite appreciated! Did you get a chance to look at the native values version (#62)? I'm curious how you see these comparing. (Particularly on the definition of logical values -- do you think the approach of enums "match the full level definition" might be better here as well for the incomplete labeled cases?)

@djvanderlaan commented

I had the following as a remark on some lines of code of the pull request, but because it is more general and also concerns missing values and frictionlessdata/datapackage#62, I put it as a general comment:

The fact that a field contains a categorical variable does not mean that everybody will always want to work with the labels. Especially during data preparation and processing, most analysts that I know actually prefer to work with the original codes. Labels are often long text fields, so it is easier to make typing errors (e.g. fruit == "Apple"). Labels will also change more frequently, while codes are more stable. And (string) codes might contain hierarchical information. Therefore, I personally prefer the native values version (frictionlessdata/datapackage#62), as that makes it easier/more natural to switch between the different representations.

Something similar goes for missing values. During processing, data preparation, and analysis (e.g. missing data analysis), it can be important to distinguish between "" = "empty field", "98" = "not applicable", and "99" = "did not answer". The current spec seems to require that these values are always converted to missing values:

missingValues dictates which string values MUST be treated as null values. This conversion to null is done before any other attempted type-specific string conversion.

That "" should be treated a missing can be specified in the csv specification. The other values can be handled by using logical values for missingValues in the field schema.

@khusmann (Contributor, Author) commented

The fact that a field contains a categorical variable does not mean that everybody will always want to work with the labels. Especially during data preparation and processing, most analysts that I know actually prefer to work with the original codes.

Right – it should be up to the implementation to choose whether values or labels are loaded by default. Arguments can be made on either side regarding the use of values vs. labels in processing pipelines: for example, although codes can be easier to type, using labels helps to prevent using the wrong code for the wrong field. When everything is all numeric, there's no indication that a particular code belongs to a particular field.

The question at hand, I think, is how we want to define / reference logical categorical values within a frictionless schema. I've leaned towards labels, because as @pschumm has pointed out in the past, codes are often arbitrary and software specific. But I can go either way on this.

Personally, I like to give my categorical levels short labels, and then store the potentially long field text (and other extended info) as an additional metadata. This way they're easy to type and there's no danger of getting numeric values mixed between fields.

The current spec seems to require that these values are always converted to missing values:

Ah, good point! We should change that so that it's clear missing values can be loaded as logical values when this is supported by the implementation. How about something like this:

missingValues dictates which values MUST be treated as missing values. Depending on implementation support for representing interlaced logical values and missing values, implementations MAY offer different ways of handling missingness when loading a field, including but not limited to: converting all missing values to null, loading missing values inline with a field's logical values, or loading the missing values for a field in a separate, additional column.
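For instance, the "separate, additional column" option might look like this in Pandas (a sketch with hypothetical data and column names):

import pandas as pd

raw = [1, 2, -99, -98]  # hypothetical source values
missing_values = {-99: "OMITTED", -98: "REFUSED"}

logical = [None if v in missing_values else v for v in raw]
reasons = [missing_values.get(v) for v in raw]
df = pd.DataFrame({"field1": logical, "field1_missing": reasons})
#    field1 field1_missing
# 0     1.0           None
# 1     2.0           None
# 2     NaN        OMITTED
# 3     NaN        REFUSED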

@djvanderlaan commented

Personally, I like to give my categorical levels short labels, and then store the potentially long field text (and other extended info) as an additional metadata. This way they're easy to type and there's no danger of getting numeric values mixed between fields.

I agree. Unfortunately we are not always able to choose the labels we get with a data set. I also see labels that are more like descriptions than actual labels.

The question at hand, I think, is how we want to define / reference logical categorical values within a frictionless schema. I've leaned towards labels, because as @pschumm has pointed out in the past, codes are often arbitrary and software specific. But I can go either way on this.

My preference would be to use the values as they are present in the data set. Although codes can sometimes be arbitrary, for a given data set they are not. I see the categories more as a layer on top that a user or tool may or may not want to use.

Ah, good point! We should change that so that it's clear missing values can be loaded as logical values when this is supported by the implementation. How about something like this:

(With this point we deviate a bit from the discussion on categorical types; perhaps this should be in another issue). I personally understand "interlaced logical values and missing values" but I am not sure this is clear for everyone. Perhaps that part is not really needed:

missingValues dictates which values SHOULD be treated as missing values. Depending on implementation support for representing missing values, implementations MAY offer different ways of handling missingness when loading a field, including but not limited to: converting all missing values to null, loading missing values inline with a field's logical values, or loading the missing values for a field in a separate, additional column.

I also changed the MUST to SHOULD, as "loading missing values inline with a field's logical values" is not really treating them as missing values.

Successfully merging this pull request may close these issues: Support for labeled missingness; Promote the Enum Labels and Ordering pattern to the Table Schema spec?