This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Make default field type to be any #13

Merged
merged 12 commits into from
Mar 28, 2024

Conversation

roll
Member

@roll roll commented Jan 6, 2024


Rationale

I think that for historical reasons (there was no any type at the beginning) the field type defaulted to string, although in basically any type system, e.g. TypeScript or Python's typing, a type that is not provided defaults to any (literally "not provided"). In my opinion, this change sits between an update and a bug fix.

Note

This change is semantically breaking for very rare use cases, although it's structurally non-breaking for implementations.

@roll roll changed the title Fixed default field type Make default field type to be any Jan 6, 2024
@pwalsh
Member

pwalsh commented Jan 8, 2024

I think the comparison to type systems is not too relevant - these specs were developed against real world use cases of data wrangling where strings are the fundamental representation.

Also, for example, databases in general don't have an "any" type.

@khughitt

I am strongly in favor of making the types explicit and precise, including cases where the type can be anything.

I'm less sure about what the default type should be..

  1. in principle, I would prefer not to make assumptions about field types
  2. in practice, however, I would imagine strings are the most likely default field type, although it likely varies by field

I'm similarly unsure about guessing types.

Personally, I love the convenience of it and implement similar things myself in some applications.

But, I can also think of plenty of examples where visidata, pandas, etc. see an uncommon missing value indicator and then treat a whole numeric field as strings.

Perhaps guess by default, with an ability to disable guessing?

If this is a feature, it might also be good to create a repo or file somewhere where the rules for guessing can be made explicit to help normalize behavior across implementations (and save time for the developers).
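To make the concern concrete, here is a minimal sketch of what an explicit guessing rule with an off switch could look like. The function name `guess_type`, its parameters, and the returned type names are hypothetical and chosen for illustration only; this is not the behavior of any existing implementation.

```python
def guess_type(cells, missing_values=("",), guess=True):
    """Guess 'integer', 'number', or 'string' for a column of raw text cells.

    Hypothetical helper: the rule set and the guess=False fallback are
    assumptions made for this sketch, not part of any specification.
    """
    if not guess:
        # Guessing disabled: make no assumption about the type.
        return "any"
    # Ignore cells that are declared as missing-value markers.
    present = [c for c in cells if c not in missing_values]
    for name, cast in (("integer", int), ("number", float)):
        try:
            for cell in present:
                cast(cell)
        except ValueError:
            # At least one cell does not fit; try the next, looser type.
            continue
        return name
    return "string"


clean = guess_type(["1", "2", "3"])
polluted = guess_type(["1", "2", "n/a"])
declared = guess_type(["1", "2", "n/a"], missing_values=("", "n/a"))
```

With these rules, a single undeclared marker such as `n/a` is enough to turn an otherwise numeric column into a string column, which is the failure mode described above; declaring the marker restores the numeric guess.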

Last thought: since the any type also sort of implies something about the variable (i.e. that it is known to hold values of multiple types, or, that the developer doesn't know wth type TypeScript is expecting, and wishes their IDE would stop yelling at them..), perhaps a safer default might be something like unspecified or unknown?

@peterdesmet
Member

Since type is an optional property in Table schema (and it should be kept that way), I'm unsure what the difference is between the following fields.

"fields": [
  {
    "name": "A field with an undefined type"
  },
  {
    "name": "A field with type `any`",
    "type": "any"
  }
]

Do we even need that difference? Should implementations treat it differently? frictionless-r currently does: any is converted to string, while the type is guessed if not provided.

@pwalsh
Member

pwalsh commented Jan 10, 2024

@peterdesmet

in the current spec, these fields have the same type:

"fields": [
  {
    "name": "A field with an **undeclared** type"
  },
  {
    "name": "A field with type `string`",
    "type": "string"
  }
]

@khughitt

The way I tend to think about "any" vs. "undefined" is:

  • "undefined" = no constraints and/or type unknown
  • "any" = no constraints by design

So the former could be used to indicate where there is uncertainty relating to the type of a field, and the latter where we know that multiple types are allowed?

But I think this is not the only interpretation and as you suggest, it might not be a necessary distinction?

--

This discussion also makes me think of composite types (e.g. foo: str|int), which could also be useful to consider at some point..
(Just mentioning here to put on people's radar)

@pwalsh
Member

pwalsh commented Jan 10, 2024

@peterdesmet about implementations treating this differently. I personally think it is a big no no and it was a real goal when we did the first implementations not to do type guessing or automatic / silent type coercion, as we were highly interested in being able to do strict validation, which is somewhat at odds with type guessing and type coercion.

@khusmann
Contributor

khusmann commented Jan 10, 2024

I think the any type is only useful if we're using it to describe a data source where the underlying physical values can have dynamic types (e.g. an sqlite column).

I think the default behavior of a field with an unspecified type should be to return the physical value, without any parsing. But as @peterdesmet pointed out, this will always be string in the context of a table schema, because a table schema implies a textual data source (#864). For this reason I'd go so far as to recommend we avoid / discourage / soft-deprecate the any type in table schemas (provided we're agreeing that "table schema" implies textual data)... any just doesn't make sense for textual data sources.

@roll
Member Author

roll commented Jan 25, 2024

My point is that current wording:

type and format properties are used to give the type of the field (string, number etc) - see below for more detail. If type is not provided a consumer SHOULD assume a type of “string”.

literally means that for the resource like this

data:
 - [value]
 - [1]
 - [2]
 - [3]
schema:
  fields:
    - name: value

an implementation SHOULD normalize it as:

value
null
null
null

and raise 3 type errors during validation (because strings will be expected but numbers are received).

@roll
Member Author

roll commented Jan 25, 2024

But if we have it like this If type is not provided a consumer SHOULD assume a type of "any" the output will be like this:

value
1
2
3

with no validation errors.
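The difference between the two defaults can be sketched in a few lines. This is an illustration under stated assumptions, not how frictionless-py or any other implementation works: `read_cell` and `read_column` are hypothetical helpers, and only the `string` and `any` types are modelled.

```python
from __future__ import annotations

from typing import Any


def read_cell(value: Any, field_type: str) -> tuple[Any, str | None]:
    """Return (normalized value, error message or None) for one cell."""
    if field_type == "any":
        # No processing and no type validation: pass the value through.
        return value, None
    if field_type == "string":
        if isinstance(value, str):
            return value, None
        # A non-string native value fails validation and is nulled.
        return None, f"type-error: expected string, got {type(value).__name__}"
    raise ValueError(f"unsupported type in this sketch: {field_type}")


def read_column(values: list[Any], field: dict, default_type: str):
    """Apply the declared type, or default_type when none is declared."""
    field_type = field.get("type", default_type)
    cells = [read_cell(v, field_type) for v in values]
    normalized = [value for value, _ in cells]
    errors = [error for _, error in cells if error is not None]
    return normalized, errors


field = {"name": "value"}  # no "type" declared, as in the resource above
data = [1, 2, 3]

as_string = read_column(data, field, default_type="string")
as_any = read_column(data, field, default_type="any")
```

Under the `string` default the three numbers come back as nulls with three type errors; under the `any` default they come back unchanged with no errors.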

@roll
Member Author

roll commented Jan 25, 2024

Maybe there is a problem with the any type definition and thus a different perception of what it is?

In frictionless-py we consider any type to be similar to programming language type systems (any == allow any type). So if a field has an any type it means that (basically no-op type):

  • no processing for this field
  • no type validation for this field

For example, here is a use case:

data:
 - [value]
 - [1]
 - ['some']
 - [true]
schema:
  fields:
    - name: value
      type: any

Read as:

value
1 # number
'some' # string
true # boolean

PS.
And yes, similarly to frictionless-r, in frictionless-py we don't follow the current SHOULD regarding defaulting to strings. At the same time, frictionless-py doesn't guess the types of individual fields; it defaults them to any (no-op).

PPS.
Pandas also has a type like this -- an object type. The type that allows any logical values. Polars has type unknown https://docs.pola.rs/py-polars/html/reference/api/polars.Unknown.html#polars.Unknown

@pwalsh
Member

pwalsh commented Jan 25, 2024

@roll I got you. imho I think unknown is better than any but .....

value
1 # number
'some' # string
true # boolean

if this is the source data in the context of tabular data processing then perhaps we have a different problem? the scope for deterministic data interoperability routines, if the default type is any or unknown, or anything except string, is greatly reduced - at some point, if this is the incoming data, and all validation is bypassed (by default), then, what actually happens as data moves around becomes completely implementation dependent; can't round trip data, etc.

That might be fine, but tbh is precisely the type of thing I can't stand about, say, pandas - I always saw the value proposition here as good and consistent validation enabled by an implementation independent schema - but, that might be out of touch with how frictionless is being used.

@roll
Member Author

roll commented Jan 26, 2024

if this is the incoming data, and all validation is bypassed (by default), then, what actually happens as data moves around becomes completely implementation dependent; can't round trip data, etc.

I think sometimes it might actually be the desired behaviour. With the huge growth in PyArrow's popularity, one can use Table Schema only for metadata enrichment, like:

path: table.parquet
schema:
  fields:
    - name: field1
      description: ...
      dct:some: ...

Basically meaning "don't touch my data, it's already interoperable, just give me a tool for meta descriptions". I think the beauty of the Data Package concept is that it brings at least some additional value to almost any data publishing scenario.

@pwalsh
Member

pwalsh commented Jan 26, 2024

ok @roll you convinced me.

@peterdesmet
Member

@roll can you summarize the conclusion of this discussion? 😅 What is the default type? How should it be interpreted by implementations?

@roll
Member Author

roll commented Jan 26, 2024

@peterdesmet
I'm still trying to wrap my head around the current and desired wording, as well as the actual meaning of the implementations.

What would you say if I declare the goal as:

On the reading and validation operations, a data consumer should not make any assumptions about a data type if it is not provided by a data publisher

Or in @khusmann terms:

I think the default behavior of a field with an unspecified type should be to return the physical (native format layer) value, without any parsing.

@pwalsh
Member

pwalsh commented Jan 26, 2024

@roll @peterdesmet I have the sense that if we get the terminology around physical/logical sorted correctly (e.g. @akariv here frictionlessdata/datapackage#864 (comment)), then it also helps with describing this and making it clear for implementors.

There is also some ambiguity to solve as to whether "unspecified" is the same as "any", and that is possibly why I have a slight preference still for calling the default type "unknown".

@khusmann
Contributor

I like @pwalsh's suggestion of using "unknown" as a default type. I prefer this over my earlier suggestion to import as "string" by default, because it gives us the ability to distinguish between "fields defined with string values" and "fields that are string values because no field type was given".

That said, type "any" is already in the spec... so I'm becoming convinced that we could use it as we would an "unknown" type. I think "unknown" is a better name for the reasons @pwalsh gives... but am OK going with "any" for legacy reasons. (I think it would be confusing to have both).

@peterdesmet
Member

I think it would be confusing to have both

Agreed!


cloudflare-workers-and-pages bot commented Feb 20, 2024

Deploying datapackage with Cloudflare Pages

Latest commit: 4cd9eb1
Status: ✅  Deploy successful!
Preview URL: https://db532f45.datapackage.pages.dev
Branch Preview URL: https://827-fix-default-field-type.datapackage.pages.dev


@roll
Member Author

roll commented Feb 20, 2024

Hi, I have updated the PR to this state:

type and format properties are used to give the type of the field (string, number etc) - see below for more detail. If type is not provided a consumer MUST utilize the any type for the field instead of inferring it from the field's values.

any: The field contains values of an unspecified or mixed type. A data consumer MUST NOT perform any processing on this field's values and MUST interpret them as they are in the data source. This data type is directly modelled on the concept of the any type of strongly typed object-oriented languages like TypeScript.

WDYT?

@khusmann
Contributor

This data type is directly modelled on the concept of the any type of strongly typed object-oriented languages like TypeScript.

The problem with TypeScript's any type is it's not type safe; you can assign it to anything. By contrast, TypeScript's unknown requires a type assertion or narrowing in order to use.

Per the TypeScript announcement:

[The unknown type] is useful for APIs that want to signal "this can be any value, so you must perform some type of checking before you use it". This forces users to safely introspect returned values.

Perhaps it's a pedantic distinction, but I think the behavior we're discussing for frictionless any type is closer to TypeScript's unknown type rather than its any type. We want to communicate "each physical value cell in this field is a string containing a representation of who-knows-what; so it's up to you to parse that string to figure out what logical type it's representing before you use it".
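The same distinction can be sketched in Python, where static type checkers such as mypy treat `typing.Any` as an opt-out of checking and `object` as a value that must be narrowed before use. This is only an analogy to the TypeScript types discussed above; the function names are hypothetical and chosen for illustration.

```python
from typing import Any


def shout_unchecked(value: Any) -> str:
    # Any switches checking off: a type checker accepts this attribute
    # access even though it fails at runtime for non-string values.
    return value.upper()


def shout_checked(value: object) -> str:
    # object must be narrowed before use, much like TypeScript's unknown.
    if isinstance(value, str):
        return value.upper()
    # Anything that is not a string is handled explicitly.
    return repr(value)
```

Here `shout_checked(1)` returns a safe fallback, while `shout_unchecked(1)` raises `AttributeError` at runtime despite passing static checks.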

@roll
Member Author

roll commented Feb 21, 2024

@khusmann
That is a good point, but I think there is no literal mapping between Table Schema's any in this edition and programming languages anyway. By adding this sentence, I just attempted to say something like "it's conceptually like TS's any, Python's typing.Any or Scala's AnyType" for a reader familiar with this concept. Maybe it's better just to omit this reference.

@ezwelty

ezwelty commented Feb 26, 2024

I vote yes but I think the documentation needs to be revised to explain what any means for data as CSV vs JSON, lest we have endless confusion.

  • JSON
    • 0: 0 (integer)
    • '0': '0' (string)
    • ...
  • CSV (depending on the dialect)
    • 0: '0' (string)
    • "0": '0' (string)
    • ...

@roll roll added the candidate label Mar 14, 2024
@roll
Member Author

roll commented Mar 14, 2024

Hi @ezwelty @PietrH
I have added an example

@roll
Member Author

roll commented Mar 28, 2024

Hi @peterdesmet @PietrH
As it seems Paul changed his opinion here (not reflected in the voting yet) we can get a quorum on this one. WDYT?

I think this is really important for all typed data sources like JSON or Parquet, and it's also important that it clarifies that types should not be guessed if not provided. As we're making a decision in principle here, we will improve the wording in frictionlessdata/datapackage#864

If there is general support, it will be great if we can ship it with the draft release on April 1.

PS.
Also, @khusmann @pschumm I'm not sure if you had a chance to review it after the approach change

@peterdesmet
Member

@roll the example clarifies a lot. If type: any, interpret the field as supported by the data format, e.g. 1 data type for csv (string), 6 data types for json (string, number, boolean, null, object, array). Upvoted 👍 .

@roll
Member Author

roll commented Mar 28, 2024

Thanks!

ACCEPTED by WG (6/9)

@roll roll merged commit bfa67f1 into main Mar 28, 2024
2 checks passed
@roll roll deleted the 827/fix-default-field-type branch March 28, 2024 09:36
Development

Successfully merging this pull request may close these issues.

Default field type: "string" -> "any"
7 participants