This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Table dialect reorganized (#76)
* Don't group properties

* List properties per format

* Split introduction, so it is clearer what is expected of implementations

* Move example up

* Mention Table Schema

* Fix typo

* Fix link
peterdesmet authored Jun 24, 2024
1 parent 67f1d9b commit cd7a867
Showing 1 changed file with 96 additions and 57 deletions.
153 changes: 96 additions & 57 deletions content/docs/specifications/table-dialect.md
@@ -23,7 +23,9 @@ The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `S

## Introduction

Table Dialect defines a set of properties that can be used by data producers and data consumers to ensure data interoperability in various [Tabular Data](../glossary/#tabular-data) formats such as CSV, JSON, or Excel. The main goal of this specification is to define a common language for defining tabular data dialects. It is not expected that all the properties are supported by all the Data Package implementations. An implementation `MUST` choose the most suitable strategy for communicating to the users if some relevant feature is not supported.
Table Dialect defines a set of properties that can be used by data producers and data consumers to ensure data interoperability in various [Tabular Data](../glossary/#tabular-data) formats such as CSV, JSON, or Excel. The main goal of this specification is to define a common language for defining tabular data dialects.

It is not expected that all the properties are supported by all the Data Package implementations. An implementation `MUST` choose the most suitable strategy for communicating to the users if some relevant feature is not supported.

Table Dialect is useful for programmes which might have to deal with multiple dialects of tabular files, but which can rely on being told out-of-band which dialect will be used in a given input stream. This reduces the need for heuristic inference of dialects, and simplifies the implementation of readers, which must juggle dialect inference, schema inference, unseekable input streams, character encoding issues, and the lazy reading of very large input streams.

@@ -45,9 +47,9 @@ An example of a Table Dialect descriptor:
}
```

## Properties
## Tabular Data Formats

Table Dialect defines individual properties that regulate data producing and consuming for different groups of targeted formats, as well as general rules that apply to multiple data format groups. Note that property grouping is only illustrative: if a property is suitable for a format group that is not mentioned in this specification, it can still be used as long as the property definition and semantics are respected.
Table Dialect can be used for different data formats, such as delimited text files, semi-structured formats and spreadsheets. Some [properties](#properties) are generic and can be used for multiple formats, while others are specific to one format.

A property `MUST` be ignored if it is not applicable to a given data format. For example, SQL databases do not have a concept of a header row.

@@ -66,17 +68,86 @@ id,name
{id: 2, name: "orange"}
```

### General
### Delimited

Delimited formats are a group of textual formats such as CSV and TSV. Their characteristics can be expressed with the following properties:

- [$schema](#schema): `https://datapackage.org/profiles/1.0/tabledialect.json` by default
- [header](#header): `true` by default
- [headerRows](#headerRows): `[1]` by default
- [headerJoin](#headerJoin): ` ` by default
- [commentRows](#commentRows): undefined by default
- [commentChar](#commentChar): undefined by default
- [delimiter](#delimiter): `,` by default
- [lineTerminator](#lineTerminator): `\r\n` by default
- [quoteChar](#quoteChar): `"` by default
- [doubleQuote](#doubleQuote): `true` by default
- [escapeChar](#escapeChar): undefined by default
- [nullSequence](#nullSequence): undefined by default
- [skipInitialSpace](#skipInitialSpace): `false` by default
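
The defaults listed above can be applied mechanically when a descriptor omits a property. The following is a minimal Python sketch of that idea, not part of the specification: `DELIMITED_DEFAULTS` and `resolve_dialect` are hypothetical names, `headerRows` is taken as `[1]` following the property's own definition, and properties that are undefined by default are simply left out of the mapping.

```python
# Defaults for delimited formats, as listed in this section. Properties that
# are "undefined by default" (commentRows, commentChar, escapeChar,
# nullSequence) are intentionally absent.
DELIMITED_DEFAULTS = {
    "header": True,
    "headerRows": [1],  # assumption: array form, per the property definition
    "headerJoin": " ",
    "delimiter": ",",
    "lineTerminator": "\r\n",
    "quoteChar": '"',
    "doubleQuote": True,
    "skipInitialSpace": False,
}


def resolve_dialect(descriptor):
    """Return a new dict: the descriptor with unset properties filled in."""
    return {**DELIMITED_DEFAULTS, **descriptor}
```

For instance, `resolve_dialect({"delimiter": ";"})` keeps the semicolon delimiter and fills in the remaining defaults, while `escapeChar` stays absent.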

An example of a well-defined Table Dialect descriptor for a CSV format:

```json
{
  "header": false,
  "commentChar": "#",
  "delimiter": ";",
  "doubleQuote": true,
  "lineTerminator": "\r\n",
  "quoteChar": "\"",
  "skipInitialSpace": true
}
```
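
To illustrate how a reader might consume such a descriptor, here is a hedged Python sketch that maps a few properties onto the standard `csv` module. `read_delimited` is a hypothetical helper, not an API defined by this specification. It makes several simplifying assumptions: `delimiter` is a single character (the `csv` module requires that), `lineTerminator` is not applied because `csv.reader` recognises `\r` and `\n` on its own, and comment lines are filtered before parsing.

```python
import csv
import io


def read_delimited(text, dialect):
    """Return rows (lists of strings) parsed from `text` per `dialect`."""
    comment_char = dialect.get("commentChar")
    # newline="" keeps the original line endings so the csv module sees them.
    stream = io.StringIO(text, newline="")
    # Simplification: comment lines are dropped before parsing, so a quoted
    # multi-line cell whose continuation starts with commentChar would break.
    lines = (
        line for line in stream
        if not (comment_char and line.startswith(comment_char))
    )
    reader = csv.reader(
        lines,
        delimiter=dialect.get("delimiter", ","),
        quotechar=dialect.get("quoteChar", '"'),
        doublequote=dialect.get("doubleQuote", True),
        escapechar=dialect.get("escapeChar"),
        skipinitialspace=dialect.get("skipInitialSpace", False),
    )
    return list(reader)
```

Because `header` is `false` in the descriptor above, every returned row would be treated as data; naming the fields is left to the caller.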

### Structured

Structured formats are a group of structured or semi-structured formats such as JSON and YAML. Their characteristics can be expressed with the following properties:

- [$schema](#schema): `https://datapackage.org/profiles/1.0/tabledialect.json` by default
- [header](#header): `true` by default TODO
- [headerRows](#headerRows): `[1]` by default TODO
- [headerJoin](#headerJoin): ` ` by default TODO
- [commentRows](#commentRows): undefined by default TODO
- [commentChar](#commentChar): undefined by default TODO
- [property](#property): undefined by default
- [itemType](#itemType): undefined by default
- [itemKeys](#itemKeys): undefined by default

### Spreadsheet

Spreadsheet formats are a group of sheet-based formats such as Excel or ODS. Their characteristics can be expressed with the following properties:

- [$schema](#schema): `https://datapackage.org/profiles/1.0/tabledialect.json` by default
- [header](#header): `true` by default TODO
- [headerRows](#headerRows): `[1]` by default TODO
- [headerJoin](#headerJoin): ` ` by default TODO
- [commentRows](#commentRows): undefined by default TODO
- [commentChar](#commentChar): undefined by default TODO
- [sheetNumber](#sheetNumber): `1` by default
- [sheetName](#sheetName): undefined by default

### Database

Database formats are a group of formats accessing data from databases like SQLite. Their characteristics can be expressed with the following properties:

General properties are format-agnostic. Usually, they are useful for defining dialects for delimiter-based and spreadsheet-based formats like CSV or Excel.
- [$schema](#schema): `https://datapackage.org/profiles/1.0/tabledialect.json` by default
- [header](#header): `true` by default TODO
- [headerRows](#headerRows): `[1]` by default TODO
- [headerJoin](#headerJoin): ` ` by default TODO
- [commentRows](#commentRows): undefined by default TODO
- [commentChar](#commentChar): undefined by default TODO
- [table](#table): undefined by default

#### `$schema`
## Properties

### `$schema`

A root level Table Dialect descriptor `MAY` have a `$schema` property that `MUST` be a profile as per [Profile](../glossary/#profile) definition that `MUST` include all the metadata constraints required by this specification.

The default value is `https://datapackage.org/profiles/1.0/tabledialect.json` and the recommended value is `https://datapackage.org/profiles/2.0/tabledialect.json`.

#### `header`
### `header`

A Table Dialect descriptor `MAY` have the `header` property that `MUST` be boolean with default value `true`. This property indicates whether the file includes a header row. If `true` the first row in the file `MUST` be interpreted as a header row, not data.

@@ -104,7 +175,7 @@ With this dialect definition:

Where `field1` and `field2` names are implementation-specific and used here only for illustrative purpose.
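
A minimal Python sketch of the `header` behaviour, assuming the rows have already been parsed into lists. `split_header` is a hypothetical helper, and the `fieldN` naming is only one possible implementation choice.

```python
def split_header(rows, header=True):
    """Separate field names from data rows according to `header`."""
    rows = list(rows)
    if header:
        # The first row is interpreted as the header, not as data.
        names = rows[0] if rows else []
        return names, rows[1:]
    # No header row: synthesize names. The spec leaves the actual names to
    # the implementation; "fieldN" is used here purely for illustration.
    width = len(rows[0]) if rows else 0
    return [f"field{i}" for i in range(1, width + 1)], rows
```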

#### `headerRows`
### `headerRows`

A Table Dialect descriptor `MAY` have the `headerRows` property that `MUST` be an array of positive integers starting from 1 with default value `[1]`. This property specifies the row numbers for the header. It is `RECOMMENDED` to be used for multiline-header files.

@@ -132,7 +203,7 @@ With this dialect definition:
{"fruit id": 2, "fruit name": "orange"}
```

#### `headerJoin`
### `headerJoin`

A Table Dialect descriptor `MAY` have the `headerJoin` property that `MUST` be a string with default value `" "`. This property specifies how multiline-header files have to join the resulting header rows.

@@ -161,7 +232,7 @@ With this dialect definition:
{"fruit-id": 2, "fruit-name": "orange"}
```
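
The interaction of `headerRows` and `headerJoin` can be sketched in a few lines of Python. `build_header` is a hypothetical helper. It assumes rows are already parsed into lists of strings and that header cells are joined column by column, which is consistent with the `fruit id` and `fruit-id` outputs shown above.

```python
def build_header(rows, header_rows=(1,), header_join=" "):
    """Join the cells of the 1-based `header_rows`, column by column."""
    picked = [rows[number - 1] for number in header_rows]
    return [header_join.join(cells) for cells in zip(*picked)]


def data_rows(rows, header_rows=(1,)):
    """Return the rows that are not part of the header."""
    skip = set(header_rows)
    return [row for number, row in enumerate(rows, start=1) if number not in skip]
```

With the defaults, a single header row is returned unchanged, since joining one cell per column leaves each cell as it is.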

#### `commentRows`
### `commentRows`

A Table Dialect descriptor `MAY` have the `commentRows` property that `MUST` be an array of positive integers starting from 1; undefined by default. This property specifies which rows have to be omitted from the data.

@@ -189,7 +260,7 @@ With this dialect definition:
{id: 2, name: "orange"}
```

#### `commentChar`
### `commentChar`

A Table Dialect descriptor `MAY` have the `commentChar` property that `MUST` be a string of one or more characters; undefined by default. This property specifies which rows have to be omitted from the data based on the row's first characters.

@@ -217,11 +288,7 @@ With this dialect definition:
{id: 2, name: "orange"}
```
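
Both comment properties amount to a filter over the physical rows of the file. The Python sketch below shows one way to express that. `drop_comments` is a hypothetical helper, and it assumes row numbers in `commentRows` count physical lines starting from 1.

```python
def drop_comments(lines, comment_rows=None, comment_char=None):
    """Yield the lines not marked as comments by either property."""
    skip = set(comment_rows or ())
    for number, line in enumerate(lines, start=1):
        if number in skip:
            continue  # omitted by position (commentRows)
        if comment_char and line.startswith(comment_char):
            continue  # omitted by prefix (commentChar)
        yield line
```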

### Delimited

Delimited formats is a group of textual formats such as CSV and TSV.

#### `delimiter`
### `delimiter`

A Table Dialect descriptor `MAY` have the `delimiter` property that `MUST` be a string; with default value `,` (comma). This property specifies the character sequence which separates fields in the data file.

@@ -248,7 +315,7 @@ With this dialect definition:
{id: 2, name: "orange"}
```

#### `lineTerminator`
### `lineTerminator`

A Table Dialect descriptor `MAY` have the `lineTerminator` property that `MUST` be a string; with default value `\r\n`. This property specifies the character sequence which terminates rows.

@@ -273,7 +340,7 @@ With this dialect definition:
{id: 2, name: "orange"}
```

#### `quoteChar`
### `quoteChar`

A Table Dialect descriptor `MAY` have the `quoteChar` property that `MUST` be a string of one character length with default value `"` (double quote). This property specifies a character to use for quoting in case the `delimiter` needs to be used inside a data cell.

@@ -300,7 +367,7 @@ With this dialect definition:
{id: 2, name: "orange,fruits"}
```

#### `doubleQuote`
### `doubleQuote`

A Table Dialect descriptor `MAY` have the `doubleQuote` property that `MUST` be boolean with default value `true`. This property controls the handling of `quoteChar` inside data cells. If true, two consecutive quotes are interpreted as one.

@@ -327,7 +394,7 @@ With this dialect definition:
{id: 2, name: 'orange"fruits'}
```

#### `escapeChar`
### `escapeChar`

A Table Dialect descriptor `MAY` have the `escapeChar` property that `MUST` be a string of one character length; undefined by default. This property specifies a one-character string to use for escaping, for example, `\`. It is mutually exclusive with `quoteChar`.

@@ -354,7 +421,7 @@ With this dialect definition:
{id: 2, name: "orange,fruits"}
```

#### `nullSequence`
### `nullSequence`

A Table Dialect descriptor `MAY` have the `nullSequence` property that `MUST` be a string; undefined by default. This property specifies the null sequence, for example, `\N`.

@@ -381,7 +448,7 @@ With this dialect definition:
{id: 2, name: null}
```
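
As a rough Python illustration of `nullSequence`, the hypothetical helper below replaces cells that exactly match the sequence with `None`. It assumes the comparison is made on the parsed cell value, which the specification text shown here does not spell out.

```python
def apply_null_sequence(row, null_sequence=None):
    """Return a copy of `row` with cells equal to `null_sequence` set to None."""
    if null_sequence is None:
        # Property undefined: no cell is interpreted as null.
        return list(row)
    return [None if cell == null_sequence else cell for cell in row]
```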

#### `skipInitialSpace`
### `skipInitialSpace`

A Table Dialect descriptor `MAY` have the `skipInitialSpace` property that `MUST` be boolean with default value `false`. This property specifies how to interpret whitespace which immediately follows a delimiter; if `false`, it means that whitespace immediately after a delimiter is treated as part of the following field.

@@ -408,11 +475,7 @@ With this dialect definition:
{id: 2, name: "orange"}
```

### Structured

Structured formats is a group of structured or semi-structured formats such as JSON and YAML.

#### `property`
### `property`

A Table Dialect descriptor `MAY` have the `property` property that `MUST` be a string; undefined by default. This property specifies where a data array is located in the data structure.

@@ -442,7 +505,7 @@ With this dialect definition:
{id: 2, name: "orange"}
```

#### `itemType`
### `itemType`

A Table Dialect descriptor `MAY` have the `itemType` property that `MUST` be a string with value `array` or `object`; undefined by default. This property specifies whether the data `property` contains an array of arrays or an array of objects.

@@ -471,7 +534,7 @@ With this dialect definition:
{id: 2, name: "orange"}
```

#### `itemKeys`
### `itemKeys`

A Table Dialect descriptor `MAY` have the `itemKeys` property that `MUST` be an array of strings; undefined by default. This property specifies which keys are used to extract rows from data arrays when `itemType` is `object`.

@@ -499,11 +562,7 @@ With this dialect definition:
{id: 2, name: "orange"}
```
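
The three structured-format properties can be combined as in the following Python sketch for JSON input. `read_structured` is a hypothetical helper built on several assumptions that go beyond the text above: `property` names a single top-level key, an undefined `itemType` is inferred from the first item, and for `array` items the first inner array is treated as the header row.

```python
import json


def read_structured(text, dialect):
    """Return a list of row dicts extracted from JSON `text` per `dialect`."""
    data = json.loads(text)
    if "property" in dialect:
        # Assumption: `property` is a single top-level key, not a path.
        data = data[dialect["property"]]
    if not data:
        return []
    item_type = dialect.get("itemType")
    if item_type is None:
        # Assumption: infer the item type when the property is undefined.
        item_type = "array" if isinstance(data[0], list) else "object"
    if item_type == "array":
        # Assumption: the first inner array holds the field names.
        header, *body = data
        return [dict(zip(header, row)) for row in body]
    keys = dialect.get("itemKeys")
    if keys is None:
        return [dict(item) for item in data]
    # Keep only the listed keys, in the listed order.
    return [{key: item.get(key) for key in keys} for item in data]
```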

### Spreadsheet

Spreadsheet formats is a group of sheet-based formats such as Excel or ODS.

#### `sheetNumber`
### `sheetNumber`

A Table Dialect descriptor `MAY` have the `sheetNumber` property that `MUST` be an integer with default value `1`. This property specifies a sheet number of a table in the spreadsheet file.

@@ -524,7 +583,7 @@ With this dialect definition:

`SHOULD` output the data from the second sheet.

#### `sheetName`
### `sheetName`

A Table Dialect descriptor `MAY` have the `sheetName` property that `MUST` be a string; undefined by default. This property specifies a sheet name of a table in the spreadsheet file.

@@ -545,11 +604,7 @@ With this dialect definition:

`SHOULD` output the data from the second sheet.

### Database

Database formats is a group of formats accessing data from databases like SQLite.

#### `table`
### `table`

A Table Dialect descriptor `MAY` have the `table` property that `MUST` be a string; undefined by default. This property specifies a name of the table in the database.

@@ -570,25 +625,9 @@ With this dialect definition:

`SHOULD` output the data from the second table.
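
For SQLite, the `table` property could be honoured roughly as in this Python sketch using the standard `sqlite3` module. `read_table` is a hypothetical helper. The table name is quoted as an identifier because it cannot be passed as a bound SQL parameter.

```python
import sqlite3


def read_table(connection, dialect):
    """Return all rows of the table named by dialect["table"] as dicts."""
    name = dialect["table"]
    # Quote the identifier, doubling any embedded double quotes.
    quoted = '"' + name.replace('"', '""') + '"'
    cursor = connection.execute(f"SELECT * FROM {quoted}")
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]
```

A caller would open the connection itself, for example with `sqlite3.connect(path)`, and pass the dialect descriptor unchanged.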

## Example

An example of a well-defined Table Dialect descriptor for a CSV format:

```json
{
  "header": false,
  "commentChar": "#",
  "delimiter": ";",
  "doubleQuote": true,
  "lineTerminator": "\r\n",
  "quoteChar": "\"",
  "skipInitialSpace": true
}
```

## Excluded

Table Dialect has nothing to do with the names, contents or types of the headers or data within the CSV file, only how it is formatted. However, CSV Dialect does allow the presence or absence of a header to be specified, similarly to [RFC4180](http://www.ietf.org/rfc/rfc4180.txt).
Table Dialect has nothing to do with the names, contents or types of the headers or data within the CSV file (see [Table Schema](../table-schema/) instead), only how it is formatted. However, Table Dialect does allow the presence or absence of a header to be specified, similarly to [RFC4180](http://www.ietf.org/rfc/rfc4180.txt).

Table Dialect is also orthogonal to the character encoding used in the CSV file. Note that it is possible for files in CSV format to contain data in more than one encoding.
