From 086c22f69a7b7453817edd42a1aab36cf1d679aa Mon Sep 17 00:00:00 2001 From: Kyle Husmann Date: Wed, 5 Jun 2024 17:25:58 -0700 Subject: [PATCH] add dialect.type property --- content/docs/specifications/table-dialect.md | 392 ++++++++++--------- 1 file changed, 214 insertions(+), 178 deletions(-) diff --git a/content/docs/specifications/table-dialect.md b/content/docs/specifications/table-dialect.md index 3c3d2b24..9129ee71 100644 --- a/content/docs/specifications/table-dialect.md +++ b/content/docs/specifications/table-dialect.md @@ -23,7 +23,7 @@ The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `S ## Introduction -Table Dialect defines set of properties that can be used by data producers and data consumers to ensure data interoperability in various [Tabular Data](../glossary/#tabular-data) formats such as CSV, JSON, or Excel. The main goal of this specification is to define a common language for defining tabular data dialects. It is not expected that all the properties are supported by all the Data Package implementations. An implementation `MUST` choose the most suitable strategy for communicating to the users if some relevant feature is not supported. +Table Dialect defines a set of properties that can be used by data producers and data consumers to ensure data interoperability in various [Tabular Data](../glossary/#tabular-data) formats such as CSV, JSON, or Excel. The main goal of this specification is to define a common language for defining tabular data dialects. It is not expected that all the dialect types and properties are supported by all the Data Package implementations. An implementation `MUST` choose the most suitable strategy for communicating to the users if some relevant feature is not supported. Table Dialect is useful for programmes which might have to deal with multiple dialects of tabular files, but which can rely on being told out-of-band which dialect will be used in a given input stream.
This reduces the need for heuristic inference of dialects, and simplifies the implementation of readers, which must juggle dialect inference, schema inference, unseekable input streams, character encoding issues, and the lazy reading of very large input streams. @@ -33,24 +33,19 @@ Table Dialect supersedes [CSV Dialect](https://specs.frictionlessdata.io/csv-dia ## Descriptor -Table Dialect descriptor `MUST` be a descriptor as per [Descriptor](../glossary/#descriptor) definition. A list of standard properties that can be included into a descriptor is defined in the [Properties](#properties) section. +Table Dialect descriptor `MUST` be a descriptor as per [Descriptor](../glossary/#descriptor) definition. The descriptor `MAY` include a `type` property to indicate which of the optional dialect properties `SHOULD` be considered when reading the target data format. When a `type` property is not provided, it `SHOULD` be assumed that the descriptor is a `delimited` dialect type. A list of standard dialect types and their properties is defined in the [Dialect Types](#dialect-types) section. A list of additional dialect properties that apply to multiple dialect types is defined in the [Additional Dialect Properties](#additional-dialect-properties) section. An example of a Table Dialect descriptor: ```json { + "type": "delimited", "header": false, "delimiter": ";", "quoteChar": "'" } ``` -## Properties - -Table dialect defines individual properties that regulates data producing and consuming for different groups of targeted formats, as well, as general rules that applies for multiple data format groups. Note, that property grouping is only illustrative, if a property is suitable for a format group that is not mentioned in this specification it still can be used as far as the property definition and semantics are respected. - -A property `MUST` be ignored if it is no applicable for an arbitrary data format. For example, SQL databases do not have a concept of a header row.
- For the sake of simplicity, most of examples are written in the CSV data format. For example, this data file without providing any Table Dialect properties: ```csv @@ -66,164 +61,17 @@ id,name {id: 2, name: "orange"} ``` -### General - -General properties are format-agnostic. Usually, there are useful for defining dialects for delimiter-based and spreadsheet-based formats like CSV or Excel. - -#### `$schema` - -A root level Table Dialect descriptor `MAY` have a `$schema` property that `MUST` be a profile as per [Profile](../glossary/#profile) definition that `MUST` include all the metadata constraints required by this specification. - -The default value is `https://datapackage.org/profiles/1.0/tabledialect.json` and the recommended value is `https://datapackage.org/profiles/2.0/tabledialect.json`. - -#### `header` - -A Table Dialect descriptor `MAY` have the `header` property that `MUST` be boolean with default value `true`. This property indicates whether the file includes a header row. If `true` the first row in the file `MUST` be interpreted as a header row, not data. - -For example, this data file: - -```csv -1,apple -2,orange -``` - -With this dialect definition: - -```json -{ - "header": false -} -``` - -`SHOULD` output this data: - -```javascript -{field1: 1, field2: "apple"} -{field1: 2, field2: "orange"} -``` - -Where `field1` and `field2` names are implementation-specific and used here only for illustrative purpose. - -#### `headerRows` - -A Table Dialect descriptor `MAY` have the `headerRows` property that `MUST` be an array of positive integers starting from 1 with default value `[1]`. This property specifies the row numbers for the header. It is `RECOMMENDED` to be used for multiline-header files. - -For example, this data file: - -```csv -fruit -id,name -1,apple -2,orange -``` - -With this dialect definition: +## Dialect Types -```json -{ - "headerRows": [1, 2] -} -``` +The supported dialect types and their properties are defined as follows. 
-`SHOULD` output this data: +### `delimited` -```javascript -{"fruit id": 1, "fruit name": "apple"} -{"fruit id": 2, "fruit name": "orange"} -``` - -#### `headerJoin` - -A Table Dialect descriptor `MAY` have the `headerJoin` property that `MUST` be a string with default value `" "`. This property specifies how multiline-header files have to join the resulting header rows. - -For example, this data file: - -```csv -fruit -id,name -1,apple -2,orange -``` - -With this dialect definition: - -```json -{ - "headerRows": [1, 2], - "headerJoin": "-" -} -``` - -`SHOULD` output this data: - -```javascript -{"fruit-id": 1, "fruit-name": "apple"} -{"fruit-id": 2, "fruit-name": "orange"} -``` - -#### `commentRows` - -A Table Dialect descriptor `MAY` have the `commentRows` property that `MUST` be an array of positive integers starting from 1; undefined by default. This property specifies what rows have to be omitted from the data. - -For example, this data file: - -```csv -id,name -#fruits -1,apple -2,orange -``` - -With this dialect definition: - -```json -{ - "commentRows": [2] -} -``` - -`SHOULD` output this data: - -```javascript -{id: 1, name: "apple"} -{id: 2, name: "orange"} -``` - -#### `commentChar` - -A Table Dialect descriptor `MAY` have the `commentChar` property that `MUST` be a string of one or more characters; undefined by default. This property specifies what rows have to be omitted from the data based on the row's first characters. - -For example, this data file: - -```csv -id,name -#fruits -1,apple -2,orange -``` - -With this dialect definition: - -```json -{ - "commentChar": "#" -} -``` - -`SHOULD` output this data: - -```javascript -{id: 1, name: "apple"} -{id: 2, name: "orange"} -``` - -### Delimited - -Delimited formats is a group of textual formats such as CSV and TSV. +`delimited` dialect types apply to delimited textual formats such as CSV and TSV. 
#### `delimiter` -A Table Dialect descriptor `MAY` have the `delimiter` property that `MUST` be a string; with default value `,` (comma). This property specifies the character sequence which separates fields in the data file. +A `delimited` Table Dialect descriptor `MAY` have the `delimiter` property that `MUST` be a string; with default value `,` (comma). This property specifies the character sequence which separates fields in the data file. For example, this data file: @@ -237,6 +85,7 @@ With this dialect definition: ```json { + "type": "delimited", "delimiter": "|" } ``` @@ -250,7 +99,7 @@ With this dialect definition: #### `lineTerminator` -A Table Dialect descriptor `MAY` have the `lineTerminator` property that `MUST` be a string; with default value `\r\n`. This property specifies the character sequence which terminates rows. +A `delimited` Table Dialect descriptor `MAY` have the `lineTerminator` property that `MUST` be a string; with default value `\r\n`. This property specifies the character sequence which terminates rows. For example, this data file: @@ -262,6 +111,7 @@ With this dialect definition: ```json { + "type": "delimited", "lineTerminator": ";" } ``` @@ -275,7 +125,7 @@ With this dialect definition: #### `quoteChar` -A Table Dialect descriptor `MAY` have the `quoteChar` property that `MUST` be a string of one character length with default value `"` (double quote). This property specifies a character to use for quoting in case the `delimiter` needs to be used inside a data cell. +A `delimited` Table Dialect descriptor `MAY` have the `quoteChar` property that `MUST` be a string of one character length with default value `"` (double quote). This property specifies a character to use for quoting in case the `delimiter` needs to be used inside a data cell. 
For example, this data file: @@ -289,6 +139,7 @@ With this dialect definition: ```json { + "type": "delimited", "quoteChar": "'" } ``` @@ -302,7 +153,7 @@ With this dialect definition: #### `doubleQuote` -A Table Dialect descriptor `MAY` have the `doubleQuote` property that `MUST` be boolean with default value `true`. This property controls the handling of `quoteChar` inside data cells. If true, two consecutive quotes are interpreted as one. +A `delimited` Table Dialect descriptor `MAY` have the `doubleQuote` property that `MUST` be boolean with default value `true`. This property controls the handling of `quoteChar` inside data cells. If true, two consecutive quotes are interpreted as one. For example, this data file: @@ -316,6 +167,7 @@ With this dialect definition: ```json { + "type": "delimited", "doubleQuote": true } ``` @@ -329,7 +181,7 @@ With this dialect definition: #### `escapeChar` -A Table Dialect descriptor `MAY` have the `escapeChar` property that `MUST` be a string of one character length; undefined by default. This property specifies a one-character string to use for escaping, for example, `\`, mutually exclusive with `quoteChar`. +A `delimited` Table Dialect descriptor `MAY` have the `escapeChar` property that `MUST` be a string of one character length; undefined by default. This property specifies a one-character string to use for escaping, for example, `\`, mutually exclusive with `quoteChar`. For example, this data file: @@ -343,6 +195,7 @@ With this dialect definition: ```json { + "type": "delimited", "escapeChar": "|" } ``` @@ -356,7 +209,7 @@ With this dialect definition: #### `nullSequence` -A Table Dialect descriptor `MAY` have the `nullSequence` property that `MUST` be a string; undefined by default. This property specifies specifies the null sequence, for example, `\N`. +A `delimited` Table Dialect descriptor `MAY` have the `nullSequence` property that `MUST` be a string; undefined by default. 
This property specifies the null sequence, for example, `\N`. For example, this data file: @@ -370,6 +223,7 @@ With this dialect definition: ```json { + "type": "delimited", "nullSequence": "NA" } ``` @@ -383,7 +237,7 @@ With this dialect definition: #### `skipInitialSpace` -A Table Dialect descriptor `MAY` have the `skipInitialSpace` property that `MUST` be boolean with default value `false`. This property specifies how to interpret whitespace which immediately follows a delimiter; if `false`, it means that whitespace immediately after a delimiter is treated as part of the following field. +A `delimited` Table Dialect descriptor `MAY` have the `skipInitialSpace` property that `MUST` be boolean with default value `false`. This property specifies how to interpret whitespace which immediately follows a delimiter; if `false`, it means that whitespace immediately after a delimiter is treated as part of the following field. For example, this data file: @@ -397,6 +251,7 @@ With this dialect definition: ```json { + "type": "delimited", "skipInitialSpace": true } ``` @@ -408,13 +263,13 @@ With this dialect definition: {id: 2, name: "orange"} ``` -### Structured +### `structured` -Structured formats is a group of structured or semi-structured formats such as JSON and YAML. +`structured` dialect types apply to structured or semi-structured formats such as JSON and YAML. #### `property` -A Table Dialect descriptor `MAY` have the `property` property that `MUST` be a string; undefined by default. This property specifies where a data array is located in the data structure. +A `structured` Table Dialect descriptor `MAY` have the `property` property that `MUST` be a string; undefined by default. This property specifies where a data array is located in the data structure.
For example, this data file: @@ -431,6 +286,7 @@ With this dialect definition: ```json { + "type": "structured", "property": "rows" } ``` @@ -444,7 +300,7 @@ With this dialect definition: #### `itemType` -A Table Dialect descriptor `MAY` have the `itemType` property that `MUST` be a string with value `array` or `object`; undefined by default. This property specifies whether the data `property` contains an array of arrays or an array of objects. +A `structured` Table Dialect descriptor `MAY` have the `itemType` property that `MUST` be a string with value `array` or `object`; undefined by default. This property specifies whether the data `property` contains an array of arrays or an array of objects. For example, this data file: @@ -460,6 +316,7 @@ With this dialect definition: ```json { + "type": "structured", "itemType": "array" } ``` @@ -473,7 +330,7 @@ With this dialect definition: #### `itemKeys` -A Table Dialect descriptor `MAY` have the `itemKeys` property that `MUST` be array of strings; undefined by default. This property specifies the way of extracting rows from data arrays with `itemType` is `object`. +A `structured` Table Dialect descriptor `MAY` have the `itemKeys` property that `MUST` be an array of strings; undefined by default. This property specifies the way of extracting rows from data arrays where `itemType` is `object`. For example, this data file: @@ -488,6 +345,7 @@ With this dialect definition: ```json { + "type": "structured", "itemKeys": ["id", "name"] } ``` @@ -499,13 +357,13 @@ With this dialect definition: {id: 2, name: "orange"} ``` -### Spreadsheet +### `spreadsheet` -Spreadsheet formats is a group of sheet-based formats such as Excel or ODS. +`spreadsheet` dialect types apply to sheet-based data formats such as Excel or ODS. #### `sheetNumber` -A Table Dialect descriptor `MAY` have the `sheetNumber` property that `MUST` be an integer with default value `1`. This property specifies a sheet number of a table in the spreadsheet file.
+A `spreadsheet` Table Dialect descriptor `MAY` have the `sheetNumber` property that `MUST` be an integer with default value `1`. This property specifies a sheet number of a table in the spreadsheet file. For example, this data file: @@ -518,6 +376,7 @@ With this dialect definition: ```json { + "type": "spreadsheet", "sheetNumber": 2 } ``` @@ -526,7 +385,7 @@ With this dialect definition: #### `sheetName` -A Table Dialect descriptor `MAY` have the `sheetName` property that `MUST` be a string; undefined by default. This property specifies a sheet name of a table in the spreadsheet file. +A `spreadsheet` Table Dialect descriptor `MAY` have the `sheetName` property that `MUST` be a string; undefined by default. This property specifies a sheet name of a table in the spreadsheet file. For example, this data file: @@ -539,19 +398,20 @@ With this dialect definition: ```json { + "type": "spreadsheet", "sheetName": "Sheet 2" } ``` `SHOULD` output the data from the second sheet. -### Database +### `database` -Database formats is a group of formats accessing data from databases like SQLite. +`database` dialect types apply to formats accessing data from databases like SQLite. #### `table` -A Table Dialect descriptor `MAY` have the `table` property that `MUST` be a string; undefined by default. This property specifies a name of the table in the database. +A `database` Table Dialect descriptor `MAY` have the `table` property that `MUST` be a string; undefined by default. This property specifies a name of the table in the database. For example, the database with the tables below: @@ -564,18 +424,194 @@ With this dialect definition: ```json { + "type": "database", "table": "table2" } ``` `SHOULD` output the data from the second table. +## Additional Dialect Properties + +The following dialect properties apply to multiple dialect types. 
+ +### `$schema` + +- **Property Type**: object +- **Dialect Types**: all + +A root level Table Dialect descriptor `MAY` have a `$schema` property that `MUST` be a profile as per [Profile](../glossary/#profile) definition that `MUST` include all the metadata constraints required by this specification. + +The default value is `https://datapackage.org/profiles/1.0/tabledialect.json` and the recommended value is `https://datapackage.org/profiles/2.0/tabledialect.json`. + +### `header` + +- **Property Type**: boolean +- **Dialect Types**: delimited, spreadsheet + +A Table Dialect descriptor `MAY` have the `header` property that `MUST` be boolean with default value `true`. This property indicates whether the file includes a header row. If `true` the first row in the file `MUST` be interpreted as a header row, not data. + +For example, this data file: + +```csv +1,apple +2,orange +``` + +With this dialect definition: + +```json +{ + "type": "delimited", + "header": false +} +``` + +`SHOULD` output this data: + +```javascript +{field1: 1, field2: "apple"} +{field1: 2, field2: "orange"} +``` + +Where `field1` and `field2` names are implementation-specific and used here only for illustrative purpose. + +### `headerRows` + +- **Property Type**: array of positive integers +- **Dialect Types**: delimited, spreadsheet + +A Table Dialect descriptor `MAY` have the `headerRows` property that `MUST` be an array of positive integers starting from 1 with default value `[1]`. This property specifies the row numbers for the header. It is `RECOMMENDED` to be used for multiline-header files. 
+ +For example, this data file: + +```csv +fruit +id,name +1,apple +2,orange +``` + +With this dialect definition: + +```json +{ + "type": "delimited", + "headerRows": [1, 2] +} +``` + +`SHOULD` output this data: + +```javascript +{"fruit id": 1, "fruit name": "apple"} +{"fruit id": 2, "fruit name": "orange"} +``` + +### `headerJoin` + +- **Property Type**: string +- **Dialect Types**: delimited, spreadsheet + +A Table Dialect descriptor `MAY` have the `headerJoin` property that `MUST` be a string with default value `" "`. This property specifies how multiline-header files have to join the resulting header rows. + +For example, this data file: + +```csv +fruit +id,name +1,apple +2,orange +``` + +With this dialect definition: + +```json +{ + "type": "delimited", + "headerRows": [1, 2], + "headerJoin": "-" +} +``` + +`SHOULD` output this data: + +```javascript +{"fruit-id": 1, "fruit-name": "apple"} +{"fruit-id": 2, "fruit-name": "orange"} +``` + +### `commentRows` + +- **Property Type**: array of positive integers +- **Dialect Types**: delimited, spreadsheet + +A Table Dialect descriptor `MAY` have the `commentRows` property that `MUST` be an array of positive integers starting from 1; undefined by default. This property specifies what rows have to be omitted from the data. + +For example, this data file: + +```csv +id,name +#fruits +1,apple +2,orange +``` + +With this dialect definition: + +```json +{ + "type": "delimited", + "commentRows": [2] +} +``` + +`SHOULD` output this data: + +```javascript +{id: 1, name: "apple"} +{id: 2, name: "orange"} +``` + +### `commentChar` + +- **Property Type**: string +- **Dialect Types**: delimited, spreadsheet + +A Table Dialect descriptor `MAY` have the `commentChar` property that `MUST` be a string of one or more characters; undefined by default. This property specifies what rows have to be omitted from the data based on the row's first characters. 
+ +For example, this data file: + +```csv +id,name +#fruits +1,apple +2,orange +``` + +With this dialect definition: + +```json +{ + "type": "delimited", + "commentChar": "#" +} +``` + +`SHOULD` output this data: + +```javascript +{id: 1, name: "apple"} +{id: 2, name: "orange"} +``` + ## Example An example of a well-defined Table Dialect descriptor for a CSV format: ```json { + "type": "delimited", "header": false, "commentChar": "#" "delimiter": ";",