Looks good, but the reference to […] In addition, I'd support a feature to reference categories defined elsewhere, but this is another issue and could be added later.
Amazing @khusmann! Looks great!
I added a few comments
@nichtich Great questions.
Yes, definitely.
This should not be allowed. Enum constraints on categoricals should only include values from the valid set of levels. So a categorical with categories a, b, and c could have an enum constraint with some combination of a, b, or c.
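To make that rule concrete, here is a minimal sketch (my illustration, not from the spec or any implementation) of checking that an `enum` constraint only uses declared category values:

```
# Hypothetical check: an enum constraint on a categorical may only use
# values drawn from the declared set of categories.
categories = ["a", "b", "c"]

def enum_is_valid(enum_values, categories):
    # Every value in the constraint must be one of the declared categories.
    return set(enum_values) <= set(categories)

print(enum_is_valid(["a", "c"], categories))  # True  -- valid subset
print(enum_is_valid(["a", "d"], categories))  # False -- "d" is not a declared category
```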
No, and enum constraints don't imply categorical either: enum constraints are a validation rule, a categorical is a type. It should be fully possible to have a categorical with […]
In practice these situations come up when you have a survey item with multiple options, but you know for some reason (like a design rule) that participants should only select some subset of the options. Would you suggest we put these clarifications with […]?
Agreed!
Resolves frictionlessdata/datapackage#880. When paired with the `categorical` type, this gives us full support for the value labels found in many statistical software packages. I have included it here rather than a separate PR because these issues are intertwined and there's a synergy in addressing them simultaneously. That said, if you would rather see this in a separate PR, let me know and I can revert.
Based on @roll's comments above, I've added a commit that adds support for labeled missingness. As I mention in the commit log, I think the issues of categorical labels and labeled missingness are intertwined and there's a synergy in addressing them simultaneously. That said, if you would rather see this in a separate PR, let me know and I can revert.
@khusmann
Then this should be documented. I'd assume that […]
What do you think about the language in the latest revision? (Last sentence in particular)
I don't think the categorical type implies a default […] Similarly, if we try to put an […]
I agree that the conceptual distinction between type and constraint dictates that it should be possible, at least in principle, to use an […]

Thanks very much @khusmann for doing the work to translate this from a pattern into a spec. I think that with your modifications in response to the various comments above, it now looks to be in excellent shape and accommodates full use of categorical variables.

I would point out that there is one possible use of value labels or formats (only relevant for Stata, SAS or SPSS) that this does not accommodate; namely, the case where you want to label only a few values that are not missing values but you don't want to have to enumerate all possible values in the schema. For example, you might have a top-coded age variable where you want to label the value 90 with "90 or older" but you don't want to have to enumerate all of the integers between 1 and 90. This may or may not be something you want to treat as categorical in your analyses, and so I don't really see this as a limitation in the way you've defined a […]

I think this is conceptually distinct from the new […]
I wasn't feeling swayed by the proposal ("but we already have […]").

```
[{
  "value": "apple",
  "description": "Any member of the Malus genus"
},
{
  "value": "citrus",
  "description": "Any member of the Citrus genus"
}]
```

Could be rendered as part of the field's description as: […]
The overlap between this type's […]

I'm more concerned about conversion. Going forward, should implementations drop any behavior that assumed a categorical from the presence of […]?

Note that not all systems support unordered categoricals (e.g. PostgreSQL Enum), so it is maybe worth mentioning that the lexical/physical order of the categories in the schema should be taken as their order in this case?
Thanks @khusmann for translating this into a PR 🎉
This is a good point that I missed; I agree that the spec should be clear on this. Since lexical order is not always meaningful, I think the physical order of the array should be used by default in creating the corresponding target data representation (e.g., in creating a categorical or value label). This does create an ambiguity, however, in a case like:

```
{
  "name": "fruit",
  "type": "categorical",
  "categories": [
    { "value": 2, "label": "banana" },
    { "value": 0, "label": "apple" },
    { "value": 1, "label": "orange" }
  ]
}
```

where in Pandas you would create […]
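To make the ambiguity concrete, here is a small pandas sketch (my illustration, with assumed data) of the two interpretations an implementation could choose between:

```
import pandas as pd

raw = pd.Series([0, 2, 1, 0])  # physical values as stored in the data

# Interpretation 1: category order taken from the schema's array order
# (banana, apple, orange).
by_schema_order = pd.Categorical(
    raw.map({2: "banana", 0: "apple", 1: "orange"}),
    categories=["banana", "apple", "orange"],
    ordered=True,
)

# Interpretation 2: category order taken from the numeric codes
# (apple, orange, banana).
by_code_order = pd.Categorical(
    raw.map({0: "apple", 1: "orange", 2: "banana"}),
    categories=["apple", "orange", "banana"],
    ordered=True,
)

print(list(by_schema_order.categories))  # ['banana', 'apple', 'orange']
print(list(by_code_order.categories))    # ['apple', 'orange', 'banana']
```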
Just to be clear, are you proposing to permit arbitrary additional properties, or only predefined properties like […]?

(Parenthetical comment: Your mention of […])
…inition in the categories field
Good point. And I agree that this functionality is conceptually distinct; we can address it in a future pattern.
Precisely! I think this will open up a lot of patterns / extensions for defining category-level metadata that can be used to render documentation / code books / data dictionaries. This isn't something that we need to explicitly allow in the spec though, right? (Because all frictionless descriptors can be extended by user defined properties and future patterns / extensions?)
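For example, a hypothetical documentation generator (my sketch, not part of the spec or any implementation) could pick up a user-defined `description` property on each category when rendering a codebook entry:

```
# Hypothetical codebook rendering from category-level metadata.
def render_codebook_entry(field):
    lines = [f"{field['name']} ({field['type']})"]
    for cat in field.get("categories", []):
        entry = f"  {cat.get('label', cat['value'])}"
        if cat.get("description"):
            entry += f": {cat['description']}"
        lines.append(entry)
    return "\n".join(lines)

field = {
    "name": "fruit",
    "type": "categorical",
    "categories": [
        {"value": "apple", "description": "Any member of the Malus genus"},
        {"value": "citrus", "description": "Any member of the Citrus genus"},
    ],
}
print(render_codebook_entry(field))
# fruit (categorical)
#   apple: Any member of the Malus genus
#   citrus: Any member of the Citrus genus
```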
I think the old behavior in many implementations (that wasn't defined in the spec, as @peterdesmet pointed out) of converting […]

If the […]

I've added this as a backward compatibility note in the […]
Good catch. I've just updated the spec to explicitly address this:
For the case you provide, I don't think there's any issue, because it's an unordered categorical and so it should not matter if the implementation internally represents it as […]

For ordered categoricals I think it presents more of an issue for SAS / SPSS / Stata implementations, because they order their ordinal vars by way of the codes. If a producer created a coded ordinal variable where, for whatever reason, they wanted the lexical ordering of the codes to be different from the order they actually intended for the levels, then this would not be immediately import-able into SAS / SPSS / Stata without further transformation.

I think this is enough of a weird / exceptional case that we can leave it up to implementations to resolve. For example, if I were trying to import data into SAS / SPSS / Stata and this very special case came up, I'd want it to trigger some user intervention to resolve the conflict between the ordering of the codes and the ordering of the levels. (The user needs to choose between preserving the ordering, or preserving the codes.) For the spec definition, I think we can just leave it with the simple blanket […]
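A rough sketch (a hypothetical helper, not from any implementation) of how an implementation might detect that conflict and hand it off to the user:

```
# The declared category order conflicts with the code order when the codes,
# read in schema order, are not already sorted.
def codes_conflict_with_order(categories):
    codes = [c["value"] for c in categories]
    return codes != sorted(codes)

print(codes_conflict_with_order([
    {"value": 2, "label": "banana"},
    {"value": 0, "label": "apple"},
    {"value": 1, "label": "orange"},
]))  # True -> prompt the user: preserve the ordering, or preserve the codes?
```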
Note -- just realized the direction of ordering of levels was not explicitly defined; just added a commit to clarify that it is ascending order.
I agree with everything you said, but I still think it makes sense to include guidance for data creators in the case where the […]
Ah, yeah, that's not bad at all! Added your line in the latest commit with minor modifications:
Changed […]
@khusmann I notice that the spec allows a categorical of mixed type (?). This introduces some problems. First, every category value needs to be checked to determine the data type or whether it is of mixed type. In the latter case, strange things could happen: systems that cannot support this need to default to something (presumably string), and for systems that can, they need to check each value to determine the type (e.g. is "0" in this CSV […]).

After thinking the whole idea over, I do feel some hesitation. The Table Schema spec is intended mainly for data publishing: here is the data and here is the metadata you'll need to correctly interpret it. It largely avoids the question of how the data should be represented once loaded (e.g. no distinction between different integer, decimal, or floating point formats), but this proposal pushes the spec in that direction: "here are some strings, but please load them as an (un)ordered categorical".
As @peterdesmet pointed out, I think this ambiguity will be resolved if/when #49 is accepted, because we can replace "string or number" with "native value". But I think I agree with your point here regarding the larger implementation issue native values represent across the spec (not limited to categoricals): frictionlessdata/datapackage#49 (comment)

If #49 is not accepted, I think the […] But when/if #49 is accepted, and […]

So I don't see this as an issue / limitation of the current proposal; it will be resolved once #49 is resolved.
I agree! But this is the very reason we need categorical types -- there's a massive amount of research data published as coded categoricals (e.g. 1: MALE, 2: FEMALE). So without labels you'd just get a stream of meaningless integers instead of level labels. You need the label metadata in order to correctly interpret it. Similarly, in order to perform the correct kinds of statistical analyses (or visualizations) when making use of the data, you need to know if the categorical was ordered or unordered, and have the exhaustive list of what all the possible levels were (which may or may not all be represented in the data). To me, the argument for inclusion of […]

(Also, the spec does have an integer field type! We just don't distinguish between floating point and decimal types at present.)
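As a rough illustration (mine, with assumed data) of why the label metadata matters, this is essentially the decoding step that the `categories` property makes possible:

```
# Without the categories metadata, a consumer only sees opaque codes.
categories = [
    {"value": "1", "label": "MALE"},
    {"value": "2", "label": "FEMALE"},
]
labels = {c["value"]: c["label"] for c in categories}

raw_column = ["1", "2", "2", "1"]          # what the published CSV actually contains
decoded = [labels[v] for v in raw_column]  # ['MALE', 'FEMALE', 'FEMALE', 'MALE']
print(decoded)
```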
Actually, I'm realizing the former case would conflict with the recent addition of requiring levels with numeric values to be listed in order, so that would need to be revisited in that event. In any case, whatever direction we take for […]
At the risk of being a bit duplicative (apologies in advance):
I totally agree; in fact, the original rationale behind a previous version of this concept stated that explicitly. The additional information provided is intended solely as metadata indicating how to interpret the field: […]

Note that (1) is not the same as the information provided by an […]
This is a fair point, though to some extent I agree with @khusmann that this is at least partially addressed by #49. AFAIK only Pandas permits you to define a mixed type categorical, presumably treating all of the values as dtype […]
I believe your latest change suggests this but does not require it (which is consistent with what I had suggested). I don't think we need to get hung up on this, and am glad to retract it if folks prefer. I believe it is only relevant for Stata, SAS and SPSS, where a categorical with all numeric values would be cast to a numeric variable.

I agree that the order of the array should, in cases where […]

In sum, since a major reason for introducing categorical support is to facilitate use of Frictionless by people working with these software packages, I think this suggestion (not a requirement but only a suggestion) would be helpful, but I also agree with a recent comment (can't recall where) that the specs shouldn't really be referring to specific software packages. So I'm glad to defer to group opinion on this.
Thank you @khusmann and @pschumm for your responses. I'm feeling better about this now :) Sorry for missing #49; I agree that this partly addresses the issues I brought up. I'm working down the list from the latest weekly update, and hadn't yet made it that far. I still see the potential for ambiguity, which I brought up in that issue (frictionlessdata/datapackage#49 (comment)), and which would point towards an optional […]
I very much agree with you there! All is well and good when the data backing the field is always […]

With native types, I think we get two options -- define […]

Or, here's another potential option -- we could limit the native types that are allowed to represent the categorical, when categorical data types are not supported in the native format. It looks like this is already being done on other types in the native type proposal, so this fits in well. We would restrict it to 1) native categorical types, 2) native string representations of categoricals, and 3) native numeric representations of categoricals. This way, if someone tried to define a categorical field type using a native date type (as you showed in an example), this would not be allowed. If I follow the example for […]
Thoughts?
@khusmann I've made two suggestions to already allow `categories` to point to URL or Path with the object (cf. `schema`).
Thanks for these suggestions! I'm a little hesitant to merge in this round because I think we should let the discussion simmer in frictionlessdata/datapackage#888 a bit more to better define our approach for external definitions. As written, referencing a JSON in […]

I'd rather have a system closer to what @nichtich suggested here: frictionlessdata/datapackage#875 (comment), where the […]

Would you be ok with keeping the current proposal as it stands, and then we can add external category lists in a future iteration after we've had more discussion about the pros/cons of different approaches?
Sure, makes sense!
Hi @khusmann! Is it not a draft anymore? 😃
I would say it's looking pretty stable now... just removed [Draft] from the title, if that's what you were looking for? Let me know if you need anything else to keep this moving forward...
Thanks! Yea, it was a draft, although quite ready, I think. We need a quorum now 😃
Although the `categorical` field type restricts a field to a finite set of possible values, like an [`enum`](#enum) constraint, the `categorical` field type enables data producers to explicitly indicate to implementations that a field `SHOULD` be loaded as a categorical data type (when supported by the implementation). By contrast, `enum` constraints simply add validation rules to existing field types. When an `enum` constraint is defined on a `categorical` field, the values in the `enum` constraint `MUST` be a subset of the physical values representing the levels of the `categorical`.

The `categorical` field type `MUST` have the property `categories` that defines the set of possible values of the field. The `categories` property `MUST` be an array of strings, or an array of objects.
Why just an array of strings, why not also an array of integers, since you can specify integers for `value` in the expanded syntax? More generally, why limit ourselves here at all?
I think this PR is getting pulled in two different directions because of #49, so I'm splitting this PR into two separate ones. I've edited this one to be all-in on lexical/physical values, and will push a native value version shortly. The edits required to support native values were a little more involved than I thought, because as @pschumm mentioned, different native formats have different levels of support for level types, labels, and ordering. So the […]

This is the beauty, I think, of the lexical (physical) values approach – we don't need to think about all the possible native types we could be converting from and how they interact with this type, because all we're looking at are textual representations of the values (as if coming from a CSV) without any other type information from the native format (like representation type, levels, order) attached.

I also addressed @ezwelty's excellent point about the enum constraints from the data representation thread. I think we actually had this wrong in our earlier draft: constraints operate on logical values, not physical values. This ambiguity is actually a little easier to solve than the native value issue, because we can define what the logical representation of a categorical level should be when used in the spec. Here, I've defined it as:

> Logical values of categorical levels are indicated by their labels, if present, or by their physical value, if a label is not present.

Because physical values are by definition strings, and the labels are strings, the values in the enum constraint will always be strings.

Other changes: […]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things stood out to me as being unintuitive or unclear. Thanks for all your hard work and for putting up with my pickiness.
When the `categories` property is an array of objects, each object `MUST` have a `value` and an optional `label` property. The `value` property `MUST` be a string that matches the physical value of the field when representing that level. The optional `label` property, when present, `MUST` be a string that provides a human-readable label for the level. For example, if the physical values `"0"`, `"1"`, and `"2"` were used as codes to represent the levels `apple`, `orange`, and `banana` in the previous example, the `categories` property would be defined as follows:
So just to make sure I understand the use of `physical` here, do you mean that the `categorical` type can only be used on a field whose data is stored as string, or would this also be valid on e.g. values stored as numbers in JSON, which would be cast to strings before comparison?
My original intent of `physical` here is in the same way it's used in `missingValues` or `trueValues` in v1. They are `string` values that are used to match on the source value before it is cast to logical, as described in the "why strings" section in missingValues.

So in my reading of the v1 spec, `"missingValues": ["-99"]` will match on both a CSV `"-99"` as well as a JSON numeric `-99`. Similarly, `"trueValues": ["1"]` would match on a CSV `"1"`, JSON number `1`, or SQLite integer `1`. This allows the same schema definition to have similar behavior across formats. So here a `"value": "0"` will also match on a field with JSON number `0`, or SQLite integer `0` (or an SPSS integer `0`, @pschumm).
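In other words (a sketch of this reading of the spec, not code from any implementation), the match happens against the string form of the source value:

```
# Physical-value matching: compare on the string form of the source value,
# so a CSV "-99" and a JSON number -99 are treated the same.
def is_missing(raw_value, missing_values=("-99",)):
    return str(raw_value) in missing_values

print(is_missing("-99"))  # True  (CSV string)
print(is_missing(-99))    # True  (JSON number)
print(is_missing(-98))    # False
```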
...but yeah, it's not ideal, for all the reasons we've been talking about in #49. It works great for delimited text, but with typed/binary formats it starts getting hazy. (It doesn't help that v1 later says physical values can include type info... but that is contradicted by the fact that `missingValues` (and `trueValues`) are all `string`.)
I keep finding myself thinking that for well-defined support of typed/binary formats we'd need a format-dependent "native value" validation layer (like TableDialect, but just asserts native value types) before values get to TableSchema for final cast to "logical value" along with logical validation. Or something like that. Our challenge is that we're trying to do both steps in one layer with TableSchema, which forces us to make these uncomfortable trade-offs & coercions.
Back to `categorical` here though -- are you favoring we take out the reference to `string`? Or are you thinking more in an `itemType` property direction?
I think your intention is good, but isn't actually what v1 had in mind (or at least how it has been interpreted in the core implementation), because it was thinking that `trueValues` and `missingValues` would be for CSV only, since JSON has native `null` and `true` and therefore does not require such string → logical conversions. To me, the key is this line from the constraints section:
Say we have the following:
{
"profile": "tabular-data-resource",
"name": "resource",
"data": [
{
"nullable": 1,
"truthy": 1
},
{
"nullable": "1",
"truthy": "1"
},
{
"nullable": null,
"truthy": true
}
],
"schema": {
"fields": [
{
"name": "nullable",
"type": "any"
},
{
"name": "truthy",
"type": "boolean",
"trueValues": ["1"]
}
],
"missingValues": ["1"]
}
}
`frictionless` reads this data as follows:

```
import frictionless

resource = frictionless.Resource('resource.json')
resource.to_pandas()
#   nullable  truthy
# 0        1   False
# 1     None   False
# 2     None    True
```
Note that only `"1"` was cast to `None`, not `1`. (I have no idea why both `"1"` and `1` are loaded as `False` — @roll?)
So I think what you are suggesting here is actually something "new": category values are strings, and should be matched to field values after casting these field values to string.
> (or at least how it has been interpreted in the core implementation)
Agreed, I can definitely see that now after the discussion in the data representation PR. I've been treating the bits in the implementation that handle native values in these ways as somewhat undefined extensions because the core spec was not clear on these fronts. But I guess enough people have been making use of this interpretation that it's the de facto standard at this point.
> it was thinking that `trueValues` and `missingValues` would be for CSV only,
Right – there's many parts of the spec that are CSV-specific that I think are going to create more ambiguity the more native formats we try to support with the same schema model.
> since JSON has native `null` and `true` and therefore does not require such string → logical conversions.
`missingValues` is still necessary in JSON, when data sets use missing codes. For example:
"data": [
{
"field1": 1,
"field2": 1
},
{
"field1": 2,
"field2": 3
},
{
"field1": -99,
"field2": "OMITTED"
},
{
"field1": -98,
"field2": "REFUSED"
}
],
Similarly, `trueValues` is necessary for formats like SQLite, which doesn't have a native boolean (unless we allow booleans to implicitly cast from integer types).
> So I think what you are suggesting here is actually something "new": category values are strings, and should be matched to field values after casting these field values to string.
Yeah, the way I've been thinking about the spec definitely diverges from the current implementation, but not just re: categorical values – I've been thinking that all physical values should be matched as string in the spec, to avoid ambiguity with logical types that can be equally stored as native numbers or strings (which include dates and time intervals as well as categoricals). But I realize this has its own set of problems too (it replaces type coercion ambiguity with type serialization ambiguity) – so perhaps we should just run with what we're already doing in the python implementation, as the Data Representation PR suggests.
---
Back to categorical values though: I would say categorical values are neither strings nor numbers, in the same way Dates are neither strings nor numbers – they're a discrete logical type. I think the logical type of categorical should always be represented as `string` in the schema (i.e. in `constraints`) so they do not get mixed in with numeric values, but some categoricals should have the option in implementations to be loaded as their numeric codes (instead of labels) if desired (e.g. `r.to_pandas(categorical_codes = True)`).
What if we include a `valueType` property (as you mentioned near the beginning) to make the conversion more explicit?
{
"name": "fruit",
"type": "categorical",
"valueType": "integer",
"categories": [
{ "value": 0, "label": "apple" },
{ "value": 1, "label": "orange" },
{ "value": 2, "label": "banana" }
]
}
This would give us a categorical with logical values `apple`, `orange`, and `banana`, but implementations would have the option to load the coded values `0`, `1`, `2` instead of logical values if desired. Similarly,
{
"name": "agreement_scale",
"type": "categorical",
"valueType": "integer",
"categories": [0, 1, 2]
}
would give us a categorical with logical values `"0"`, `"1"`, `"2"`, but alternatively loadable as integer codes if desired. (When no labels are given, codes are converted into string labels to form the logical type.)
This side-steps the whole native values issue because our "valueType" is now explicit and no longer tied to the native format. To summarize: categorical values have labels and codes. Labels are the logical values of the categorical and are always represented via `string`. Codes provide an alternative representation of the logical values and are either `string` or `integer` (specified via `valueType`).
Thoughts?
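To show what I mean by loading codes instead of labels (a hypothetical loader sketch, not an API of any implementation), labels would be the default logical values and codes an opt-in alternative:

```
field = {
    "name": "fruit",
    "type": "categorical",
    "valueType": "integer",
    "categories": [
        {"value": 0, "label": "apple"},
        {"value": 1, "label": "orange"},
        {"value": 2, "label": "banana"},
    ],
}

def load_column(raw, field, use_codes=False):
    # Match on the string form of the physical value; return labels by default,
    # or the coded values when use_codes is requested.
    mapping = {
        str(c["value"]): (c["value"] if use_codes else c.get("label", str(c["value"])))
        for c in field["categories"]
    }
    return [mapping[str(v)] for v in raw]

print(load_column(["0", "2", "1"], field))                  # ['apple', 'banana', 'orange']
print(load_column(["0", "2", "1"], field, use_codes=True))  # [0, 2, 1]
```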
Could the example above not be simplified to:
{
"name": "fruit",
"type": "integer",
"categories": [
{ "value": 0, "label": "apple" },
{ "value": 1, "label": "orange" },
{ "value": 2, "label": "banana" }
]
}
The existence of the `categories` field then implies that the field can/should be interpreted as a categorical variable?
> Could the example above not be simplified to:
Sure, this points us back in a similar direction as `enumLabels`, as described here. The key difference here is that we infer categorical based on the existence of the `categories` (formerly `enumLabels`) prop rather than looking in constraints as we did with `enumLabels`. I think this is an improvement on `enumLabels`, because it decouples type validation and constraint validation.
So for a categorical without codes we'd have something like:
{
"name": "fruit",
"type": "string",
"categories": [ "apple", "orange", "banana" ]
}
Logical values of the categorical would then necessarily be their primitive type, so constraints would be specified as follows:
{
"name": "fruit",
"type": "string",
"categories": [ "apple", "orange", "banana" ],
"constraints": {
"enum": ["apple"]
}
}
{
"name": "fruit",
"type": "integer",
"categories": [
{ "value": 0, "label": "apple" },
{ "value": 1, "label": "orange" },
{ "value": 2, "label": "banana" }
],
"constraints": {
"enum": [0]
}
}
(We would then also have a boolean property, something like `categoriesOrdered`, to indicate ordering.)
If we decide that the logical values of categoricals in frictionless are indeed the values of their primitive type (rather than represented by their labels), then perhaps this is the approach we should take.
I'm liking this idea -- @ezwelty @pschumm what do you think?
@khusmann I've been hesitant about defining labels as the logical values of a categorical because they are optional and, when present, not equal to what is actually stored in the source file. So I think this is a good approach, since it takes care of the typing issues and implementations can, as before (perhaps with control from the user), load fields with `categories` into categorical data types.
The `categorical` field type `MAY` additionally have the property `ordered` that indicates whether the levels of the `categorical` have a natural order. When present, the `ordered` property `MUST` be a boolean. When `ordered` is `true`, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels, in ascending order. For example:
I don't understand what you mean by *ascending order*. If the order is defined by how they are physically ordered in `categories`, then no sorting is needed, so ascending/descending does not play a role.
Ah, ascending order here is relating to the idea that the order of the list is low to high, not high to low. Should we say something like,

> When `ordered` is `true`, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels, where the first level represents the "lowest" level

Or something like that?
That also seems confusing, because it might be interpreted as referring to low vs. high logical values, rather than their position.

> implementations `SHOULD` interpret the order of the levels as the order in which they are defined [alternate: listed] in the `categories` property.

To me, that is clear. The categories [c, a, b] are ordered as given – [c, a, b] – and not e.g. [a, b, c] as might be understood based on the terms "ascending order" or that the first level is the "lowest" level.
p.s. I don't think the word "level" is needed, and would be better replaced by "category" throughout. So we have:

> implementations `SHOULD` interpret the order of the categories as the order in which they are defined [alternate: listed] in the `categories` property.
Although the `categorical` field type restricts a field to a finite set of possible values, similar to an [`enum`](#enum) constraint, the `categorical` field type enables data producers to explicitly indicate to implementations that a field `SHOULD` be loaded as a categorical data type (when supported by the implementation). By contrast, `enum` constraints simply add validation rules to existing field types.

When an `enum` constraint is defined on a `categorical` field, the values in the `enum` constraint `MUST` be a subset of the logical values representing the levels of the `categorical`. Logical values of categorical levels are indicated by their labels, if present, or by their physical value, if a label is not present.
This definition of the *logical* value seems worth highlighting, as it may not be intuitive. For the example above, it would result in `["Strongly Disagree", "2", "3", "4", "Strongly Agree"]`. Frankly, partial use of labels seems so unusual given this definition that I would stick to just two examples: one with only values (fruit), one with a label for each value and an `ordered: true` property (survey responses), and then maybe use the freed-up space to clarify that even though the data file contains the values `"1"`..., `enum` would be `["Strongly Disagree", ...]`.
I think the challenge here is that a logical categorical is defined by its abstract levels, not the values that represent the levels. So in the wild you'll see:

1. Categoricals with levels represented by meaningless values that correspond to labels (1: MALE, 2: FEMALE)
2. Categoricals with levels represented by meaningful values ("MALE", "FEMALE")
3. Categoricals with levels represented by meaningful values that all correspond to labels (a 1-3 agreement scale where 1: "Disagree", 2: "Neutral", 3: "Agree")
4. Categoricals with levels represented by meaningful values where some correspond to labels (like the example you highlighted)
For 1-3, I think it's clear that the logical values should be represented by the labels, or the values when the labels aren't available.
4 is sort of a weird case, as you say. But still very common. I prefer logical values `["Strongly Disagree", "2", "3", "4", "Strongly Agree"]` to `["1", "2", "3", "4", "5"]` for consistency with the above rule, and because it helps distinguish the logical values from their underlying codes. It's the result I'd want if I imported this into R or Pandas, for example, where I don't have logical types that simultaneously store both code and label representations.

But yeah, I agree it's awkward to jump to such an unusual example... Maybe do what you're suggesting, but then make some small mention of what to do when it is partially labeled?
@pschumm what are your thoughts?
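For reference, here is the rule we're discussing reduced to a small sketch (mine, following the definition in the proposed spec text):

```
# "Label if present, otherwise physical value" as the logical value of each level.
def logical_values(categories):
    return [c.get("label", c["value"]) for c in categories]

print(logical_values([
    {"value": "1", "label": "Strongly Disagree"},
    {"value": "2"},
    {"value": "3"},
    {"value": "4"},
    {"value": "5", "label": "Strongly Agree"},
]))  # ['Strongly Disagree', '2', '3', '4', 'Strongly Agree']
```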
Your comments are quite appreciated! Did you get a chance to look at the native values version (#62)? I'm curious how you see these comparing. (Particularly on the definition of logical values -- do you think the approach of […]?)
I had the following as a remark at some lines of code of the pull request, but I think, because it is more general and also concerns missing values and (frictionlessdata/datapackage#62), I put it as a general comment:

The fact that a field contains a categorical variable does not mean that everybody will always want to work with the labels. Especially during data preparation and processing, most analysts that I know actually prefer to work with the original codes. Labels are often long text fields. Therefore, it is easier to make typing errors (e.g. […]).

Something similar goes for missing values. During processing and data preparation and analysis (e.g. missing data analysis), it can be important to distinguish between […]
That […]
Right – it should be up to the implementation to choose whether values or labels are loaded by default. Arguments can be made on either side regarding the use of values vs labels in processing pipelines: for example, although codes can be easier to type, using labels helps to prevent using the wrong code for the wrong field. When everything is all numeric, there's no indication that a particular code belongs to a particular field.

The question at hand, I think, is how we want to define / reference logical categorical values within a frictionless schema. I've leaned towards labels, because as @pschumm has pointed out in the past, codes are often arbitrary and software specific. But I can go either way on this.

Personally, I like to give my categorical levels short labels, and then store the potentially long field text (and other extended info) as additional metadata. This way they're easy to type and there's no danger of getting numeric values mixed between fields.
Ah, good point! We should change that so that it's clear missing values can be loaded as logical values when this is supported by the implementation. How about something like this:
I agree. Unfortunately we are not always able to choose the labels we get with a data set. I also see labels that are more like descriptions than actual labels.
My preference would be to use the values as they are present in the data set. Although codes can sometimes be arbitrary, for a given data set they are not. I see the categories more as a layer on top that a user or tool may or may not want to use.
(With this point we deviate a bit from the discussion on categorical types; perhaps this should be in another issue). I personally understand "interlaced logical values and missing values" but I am not sure this is clear for everyone. Perhaps that part is not really needed:
I also changed the […]
Overview
A first pass at defining a categorical field type, designed to facilitate interoperability with software packages that support categorical data types. Needs work still, but just wanted to keep the creative juices flowing on this issue!
Paging @pschumm and @peterdesmet...