I think the comparison to type systems is not too relevant - these specs were developed against real world use cases of data wrangling where strings are the fundamental representation. Also, for example, databases in general don't have an "any" type. |
I am strongly in favor of making the types explicit and precise, including cases where the type can be anything. I'm less sure about what the default type should be.
I'm similarly unsure about guessing types. Personally, I love the convenience of it and implement similar things myself in some applications. But I can also think of plenty of examples where visidata, pandas, etc. see an uncommon missing value indicator and then treat a whole numeric field as strings. Perhaps guess by default, with an ability to disable guessing? If this is a feature, it might also be good to create a repo or file somewhere where the rules for guessing can be made explicit, to help normalize behavior across implementations (and save time for the developers). Last thought: since the |
Since
"fields": [
  {
    "name": "A field with an undefined type"
  },
  {
    "name": "A field with type `any`",
    "type": "any"
  }
]
Do we even need that difference? Should implementations treat it differently? frictionless-r currently does: |
in the current spec, these fields have the same type:
|
The way I tend to think about "any" vs. "undefined" is:
So the former could be used to indicate where there is uncertainty relating to the type of a field, and the latter where we know that multiple types are allowed? But I think this is not the only interpretation and, as you suggest, it might not be a necessary distinction? -- This discussion also makes me think of composite types (e.g. |
@peterdesmet about implementations treating this differently. I personally think it is a big no no and it was a real goal when we did the first implementations not to do type guessing or automatic / silent type coercion, as we were highly interested in being able to do strict validation, which is somewhat at odds with type guessing and type coercion. |
I think the I think the default behavior of a field with an unspecified type should be to return the physical value, without any parsing. But as @peterdesmet pointed out, this will always be |
My point is that current wording:
literally means that for the resource like this
an implementation SHOULD normalize it as:
and raise 3 type errors during validation (because strings will be expected but numbers are received). |
But if we have it like this
with no validation errors. |
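A rough sketch of the difference being described, assuming a typed source (such as JSON) that delivers numbers for a field with no declared type. The function name, the field name, and the three sample values are illustrative assumptions, not taken from the spec or from any implementation:

```python
def count_type_errors(values, default_type):
    """Count values that do not match the field's resolved type.

    Only the two types relevant to this discussion are modelled:
    `string` accepts str instances, `any` accepts everything.
    """
    field = {"name": "id"}  # hypothetical field with no declared type
    field_type = field.get("type", default_type)
    if field_type == "any":
        return 0
    if field_type != "string":
        raise ValueError(f"type not modelled in this sketch: {field_type}")
    return sum(1 for value in values if not isinstance(value, str))


# Hypothetical values coming from an already-typed source.
values = [1, 2, 3]

errors_with_string_default = count_type_errors(values, "string")
errors_with_any_default = count_type_errors(values, "any")
print(errors_with_string_default, errors_with_any_default)
```

With a `string` default every number is flagged, whereas with an `any` default the values pass through untouched.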
Maybe there is a problem with the In
For example, here is a use case:
Read as:
PS. PSS. |
@roll I got you. imho I think
if this is the source data in the context of tabular data processing then perhaps we have a different problem? the scope for deterministic data interoperability routines, if the default type is That might be fine, but tbh is precisely the type of thing I can't stand about, say, pandas - I always saw the value proposition here as good and consistent validation enabled by an implementation independent schema - but, that might be out of touch with how frictionless is being used. |
I think sometimes it might actually be a desired behaviour. With the huge growth in PyArrow's popularity, one can use Table Schema only for metadata enrichment, like:
Basically meaning "don't touch my data, it's already interoperable, just give me a tool for meta descriptions". I think it's a beauty of the Data Package concept that it brings at least some additional value for almost any data publishing scenario |
ok @roll you convinced me. |
@roll can you summarize the conclusion of this discussion? 😅 What is the default type? How should it be interpreted by implementations? |
@peterdesmet What would you say if I declare the goal as:
Or in @khusmann terms:
|
@roll @peterdesmet I have the sense that if we get the correct terminology around physical/logical sorted (e.g. @akariv here frictionlessdata/datapackage#864 (comment) ) then it also helps with describing this and making it clear for implementors. There is also some ambiguity to resolve as to whether "unspecified" is the same as "any", and that is possibly why I still have a slight preference for calling the default type "unknown". |
I like @pwalsh's suggestion of using "unknown" as a default type. I prefer this over my earlier suggestion to import as "string" by default, because it gives us the ability to distinguish between "fields defined with string values", and "fields that are string values because no field That said, type "any" is already in the spec... so I'm becoming convinced that we could use it as we would an "unknown" type. I think "unknown" is a better name for the reasons @pwalsh gives... but am OK going with "any" for legacy reasons. (I think it would be confusing to have both). |
Agreed! |
Hi, I have updated the PR to this state:
WDYT? |
The problem with TypeScript's Per the TypeScript announcement:
Perhaps it's a pedantic distinction, but I think the behavior we're discussing for frictionless |
@khusmann |
Co-authored-by: Peter Desmet <[email protected]>
I vote yes but I think the documentation needs to be revised to explain what
|
Co-authored-by: Pieter Huybrechts <[email protected]>
Hi @peterdesmet @PietrH I think this is really important for all typed data sources like JSON or Parquet as well, since it clarifies that types should not be guessed if not provided. As we're making a decision in principle here, we will improve the wording in frictionlessdata/datapackage#864. If there is general support, it would be great if we can ship it with the draft release on April 1. PS. |
Co-authored-by: Peter Desmet <[email protected]>
@roll the example clarifies a lot. If |
Co-authored-by: Peter Desmet <[email protected]>
Thanks! ACCEPTED by WG (6/9) |
Rationale
I think due to historical reasons (there was no `any` type at the beginning) a field type defaulted to `string` although basically in any type system e.g. TypeScript or Python's typing if a type is not provided it defaults to `any` (literally "not provided"). This change is in between an update and a bug fix in my opinion
Note
This change is semantically breaking for very rare use cases although it's structurally not-breaking for implementations
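To illustrate the note, here is a minimal sketch of how the default changes the resolved type only for fields that omit `type`. The helper `resolve_type` and both descriptors are hypothetical and do not come from any Frictionless implementation:

```python
LEGACY_DEFAULT = "string"  # default before this change, per the rationale
PROPOSED_DEFAULT = "any"   # default after this change


def resolve_type(field: dict, default: str) -> str:
    """Return the field's declared type, or the given default if absent."""
    return field.get("type", default)


# Hypothetical descriptors, used only for illustration.
untyped = {"name": "id"}
typed = {"name": "title", "type": "string"}

# Fields without a declared type are where the semantics change.
untyped_before = resolve_type(untyped, LEGACY_DEFAULT)
untyped_after = resolve_type(untyped, PROPOSED_DEFAULT)

# Explicitly typed fields resolve the same way under both defaults.
typed_before = resolve_type(typed, LEGACY_DEFAULT)
typed_after = resolve_type(typed, PROPOSED_DEFAULT)

print(untyped_before, untyped_after, typed_before, typed_after)
```

The descriptor shape stays the same in both cases, which is why the change is structurally compatible while still altering behaviour for untyped fields.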