AirbyteRecords are required to conform to the Airbyte type system. This means that all sources must produce schemas and records within these types, and all destinations must handle records that conform to this type system.
Because Airbyte's interfaces are JSON-based, this type system is realized using JSON schemas. To work around some limitations of JSON schema, schemas may declare an additional airbyte_type annotation, which disambiguates certain types that JSON schema does not explicitly differentiate between. See the specific types section for details.
This type system does not (generally) constrain values. Sources may declare streams using additional features of JSON schema (such as length constraints for strings, e.g. maxLength), but those constraints will be ignored by all other Airbyte components. The exception is numeric types: integer and number fields must be representable within 64-bit primitives.
This table summarizes the available types. See the Specific Types section for explanation of optional parameters.
Airbyte type | JSON Schema | Examples |
---|---|---|
String | {"type": "string"} | "foo bar" |
Date | {"type": "string", "format": "date"} | "2021-01-23" |
Datetime with timezone | {"type": "string", "format": "date-time", "airbyte_type": "timestamp_with_timezone"} | "2022-11-22T01:23:45+05:00" |
Datetime without timezone | {"type": "string", "format": "date-time", "airbyte_type": "timestamp_without_timezone"} | "2022-11-22T01:23:45" |
Integer | {"type": "integer"} | 42 |
Big integer (unrepresentable as a 64-bit two's complement int) | {"type": "string", "airbyte_type": "big_integer"} | "123141241234124123141241234124123141241234124123141241234124123141241234124" |
Number | {"type": "number"} | 1234.56 |
Big number (unrepresentable as a 64-bit IEEE 754 float) | {"type": "string", "airbyte_type": "big_number"} | "1,000,000,...,000.1234" with 500 0's |
Array | {"type": "array"}; optionally items and additionalItems | [1, 2, 3] |
Object | {"type": "object"}; optionally properties and additionalProperties | {"foo": "bar"} |
Untyped (i.e. any value is valid) | {} | |
Union | {"anyOf": [...]} or {"oneOf": [...]} | |
As a reminder, sources expose a discover command, which returns a list of AirbyteStreams, and a read method, which emits a series of AirbyteRecordMessages. The type system determines what a valid json_schema is for an AirbyteStream, which in turn dictates what messages read is allowed to emit.
For example, a source could produce this AirbyteStream (remember that the json_schema must declare "type": "object" at the top level):
{
  "name": "users",
  "json_schema": {
    "type": "object",
    "properties": {
      "username": {
        "type": "string"
      },
      "age": {
        "type": "integer"
      },
      "appointments": {
        "type": "array",
        "items": {
          "type": "string",
          "format": "date-time",
          "airbyte_type": "timestamp_with_timezone"
        }
      }
    }
  }
}
Along with this AirbyteRecordMessage (observe that the data field conforms to the json_schema from the stream):
{
  "stream": "users",
  "data": {
    "username": "someone42",
    "age": 84,
    "appointments": ["2021-11-22T01:23:45+00:00", "2022-01-22T14:00:00+00:00"]
  },
  "emitted_at": 1623861660
}
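To make "conforms to the json_schema" concrete, here is a minimal sketch of a conformance check. It is not Airbyte's validator: the function names conforms and _matches are hypothetical, only the "type", "items", and "properties" keywords are handled, and annotations such as format and airbyte_type are ignored.

```python
def conforms(value, schema):
    """Return True if `value` matches the small JSON schema subset handled here."""
    declared = schema.get("type")
    if declared is None:
        return True  # no "type" keyword (e.g. the untyped schema {}): accept anything
    allowed = declared if isinstance(declared, list) else [declared]
    return any(_matches(value, type_name, schema) for type_name in allowed)


def _matches(value, type_name, schema):
    """Check `value` against a single JSON schema type name."""
    if type_name == "null":
        return value is None
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "array":
        if not isinstance(value, list):
            return False
        items = schema.get("items", {})
        if isinstance(items, list):
            # Tuple-typed array: this sketch checks only the declared positions.
            return all(conforms(v, s) for v, s in zip(value, items))
        return all(conforms(v, items) for v in value)
    if type_name == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        return all(conforms(v, props[k]) for k, v in value.items() if k in props)
    return False


# The "users" stream schema and record data from the example above.
users_schema = {
    "type": "object",
    "properties": {
        "username": {"type": "string"},
        "age": {"type": "integer"},
        "appointments": {"type": "array", "items": {"type": "string"}},
    },
}
users_record = {
    "username": "someone42",
    "age": 84,
    "appointments": ["2021-11-22T01:23:45+00:00", "2022-01-22T14:00:00+00:00"],
}
record_is_valid = conforms(users_record, users_schema)
```

A production connector would more likely rely on a full JSON schema validation library; the sketch only illustrates how the schema and the record relate.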
The top-level object must conform to the type system. This means that all of the fields must also conform to the type system.
Many sources cannot guarantee that all fields are present on all records. As such, they may replace the type entry in the schema with ["null", "the_real_type"]. For example, this schema is the correct way for a source to declare that the age field may be missing from some records:
{
  "type": "object",
  "properties": {
    "username": {
      "type": "string"
    },
    "age": {
      "type": ["null", "integer"]
    }
  }
}
This would then be a valid record:
{"username": "someone42"}
Nullable fields are actually the more common case, but this document omits them in other examples for the sake of clarity.
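As an illustration, a source that builds its schemas programmatically could apply the ["null", "the_real_type"] rewrite with a small helper. This is a sketch under assumptions: make_nullable is a hypothetical name, not part of any Airbyte library, and schemas without a "type" keyword (untyped or union schemas) are returned unchanged.

```python
def make_nullable(field_schema):
    """Return a copy of `field_schema` whose "type" also admits null."""
    declared = field_schema.get("type")
    if declared is None:
        # Untyped or union schema: there is no "type" entry to rewrite.
        return dict(field_schema)
    types = declared if isinstance(declared, list) else [declared]
    if "null" in types:
        return dict(field_schema)  # already nullable
    # Keep every other keyword (format, airbyte_type, ...) untouched.
    return {**field_schema, "type": ["null", *types]}


nullable_age = make_nullable({"type": "integer"})
nullable_timestamp = make_nullable(
    {"type": "string", "format": "date-time", "airbyte_type": "timestamp_with_timezone"}
)
```

The helper returns a new dictionary rather than mutating its argument, so a shared field definition can be reused in both nullable and non-nullable form.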
As an escape hatch, destinations which cannot handle a certain type should just fall back to treating those values as strings. For example, let's say a source discovers a stream with this schema:
{
  "type": "object",
  "properties": {
    "appointments": {
      "type": "array",
      "items": {
        "type": "string",
        "airbyte_type": "timestamp_with_timezone"
      }
    }
  }
}
Along with records that look like this:
{"appointments": ["2021-11-22T01:23:45+00:00", "2022-01-22T14:00:00+00:00"]}
The user then connects this source to a destination that cannot handle array fields. The destination connector should simply JSON-serialize the array back to a string when pushing data into the end platform. In other words, the destination connector should behave as though the source declared this schema:
{
  "type": "object",
  "properties": {
    "appointments": {
      "type": "string"
    }
  }
}
And emitted this record:
{"appointments": "[\"2021-11-22T01:23:45+00:00\", \"2022-01-22T14:00:00+00:00\"]"}
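The escape hatch amounts to JSON-serializing the unsupported value. A minimal sketch using only Python's standard json module follows; the helper name stringify_unsupported and the idea of passing in a set of unsupported field names are assumptions made for illustration, not an Airbyte API.

```python
import json


def stringify_unsupported(record, unsupported_fields):
    """Return a copy of `record` with the named fields JSON-serialized to strings."""
    return {
        key: json.dumps(value) if key in unsupported_fields else value
        for key, value in record.items()
    }


record = {"appointments": ["2021-11-22T01:23:45+00:00", "2022-01-22T14:00:00+00:00"]}
# Pretend the destination cannot store arrays, so "appointments" becomes a string.
flattened = stringify_unsupported(record, {"appointments"})
```

Because the value is serialized rather than discarded, a downstream consumer can recover the original array with json.loads.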
Airbyte has three temporal types: date, timestamp_with_timezone, and timestamp_without_timezone. These are represented as strings with a specific format (either date or date-time).
However, JSON schema does not have a built-in way to indicate whether a field includes timezone information. For example, given the schema
{
  "type": "object",
  "properties": {
    "created_at": {
      "type": "string",
      "format": "date-time",
      "airbyte_type": "timestamp_with_timezone"
    }
  }
}
Both {"created_at": "2021-11-22T01:23:45+00:00"} and {"created_at": "2021-11-22T01:23:45"} are valid records. The airbyte_type annotation resolves this ambiguity; sources producing date-time fields must set the airbyte_type to either timestamp_with_timezone or timestamp_without_timezone.
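One way a source author might decide which annotation to declare is to inspect sample values for a UTC offset. The sketch below uses Python's standard datetime.fromisoformat; the function name infer_timestamp_airbyte_type is hypothetical, and inferring the type from a single sample value is an assumption made for illustration (a real source would normally know this from the upstream column type).

```python
from datetime import datetime


def infer_timestamp_airbyte_type(value):
    """Classify an ISO 8601 date-time string by whether it carries a UTC offset."""
    # fromisoformat() accepts a trailing "Z" only from Python 3.11 on,
    # so normalize it to an explicit offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is not None:
        return "timestamp_with_timezone"
    return "timestamp_without_timezone"


with_offset = infer_timestamp_airbyte_type("2021-11-22T01:23:45+00:00")
without_offset = infer_timestamp_airbyte_type("2021-11-22T01:23:45")
```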
64-bit integers and floating-point numbers (AKA long and double) cannot represent every number in existence. The big_integer and big_number types indicate that the source may produce numbers outside the ranges of longs and doubles.
Note that these are declared as "type": "string". This is intended to make parsing safer by preventing accidental overflow and loss of precision.
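The sketch below shows the 64-bit boundary and why string encoding is safe to parse. The names integer_field_schema, parse_big_integer, and parse_big_number are hypothetical, and choosing a schema from sample values is an assumption for illustration. Python's int is arbitrary-precision and decimal.Decimal avoids binary floating-point rounding, so neither parser overflows or silently loses digits at these sizes.

```python
from decimal import Decimal

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def integer_field_schema(sample_values):
    """Pick a schema for an integer field based on the values it must hold."""
    if all(INT64_MIN <= v <= INT64_MAX for v in sample_values):
        return {"type": "integer"}
    # At least one value does not fit in a 64-bit two's complement integer.
    return {"type": "string", "airbyte_type": "big_integer"}


def parse_big_integer(text):
    """Parse a big_integer value; Python ints have no fixed width."""
    return int(text)


def parse_big_number(text):
    """Parse a big_number value as a decimal instead of a 64-bit float."""
    return Decimal(text)


small = integer_field_schema([42, -7])
large = integer_field_schema([42, 2**63])
huge = parse_big_integer("123141241234124123141241234124123141241234124")
exact = parse_big_number("0.1") + parse_big_number("0.2")
```

Destinations written in languages with fixed-width numerics would use their own arbitrary-precision types (for example, a big-decimal class) in place of int and Decimal.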
Arrays contain 0 or more items, which must have a defined type. These types should also conform to the type system. Arrays may require that all of their elements be the same type ("items": {whatever type...}), or they may require specific types for the first N entries ("items": [{first type...}, {second type...}, ... , {Nth type...}], AKA tuple-type).
Tuple-typed arrays can configure the type of any additional elements using the additionalItems field; by default, any type is allowed. They may also pass a boolean to enable or disable additional elements, with "additionalItems": true being equivalent to "additionalItems": {} and "additionalItems": false meaning that only the tuple-defined items are allowed.
Destinations may have a difficult time supporting tuple-typed arrays without very specific handling, and as such are permitted to somewhat loosen their requirements. For example, many Avro-based destinations simply declare an array of a union of all allowed types, rather than requiring the correct type in each position of the array.
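The loosening described above can be sketched as a schema rewrite: collect the distinct per-position schemas and declare a plain array whose items are a union of them. The function name loosen_tuple_array is hypothetical, and this is not how any particular Avro-based destination is implemented. When additionalItems is absent or true, extra elements may be of any type, so the sketch falls back to untyped items in that case.

```python
def loosen_tuple_array(schema):
    """Rewrite a tuple-typed array schema as an array of a union of its item types."""
    items = schema.get("items")
    if not isinstance(items, list):
        return schema  # not tuple-typed; nothing to loosen

    extra = schema.get("additionalItems", True)
    if extra is True or extra == {}:
        # Additional elements may be anything, so no narrower union is correct.
        return {"type": "array", "items": {}}

    candidates = list(items)
    if isinstance(extra, dict):
        candidates.append(extra)  # additional elements have their own schema

    unique = []
    for candidate in candidates:
        if candidate not in unique:  # schemas are dicts, so dedupe by equality
            unique.append(candidate)
    return {"type": "array", "items": {"anyOf": unique}}


tuple_schema = {
    "type": "array",
    "items": [{"type": "string"}, {"type": "integer"}, {"type": "string"}],
    "additionalItems": False,
}
loosened = loosen_tuple_array(tuple_schema)
```

The loosened schema accepts every array the original accepts, but no longer enforces which type appears at which position.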
As with arrays, objects may declare properties, each of which should have a type which conforms to the type system. Objects may additionally accept additionalProperties, as true (any type is acceptable), a specific type (all additional properties must be of that type), or false (no additional properties are allowed).
In some cases, sources may want to use multiple types for the same field. For example, a user might have a property which holds either an object, or a string explanation of why that data is missing. This is supported with JSON schema's oneOf and anyOf keywords.
In some unusual cases, a property may not have type information associated with it. This is represented by the empty schema {}. As many destinations do not allow untyped data, this will frequently trigger the string-typed escape hatch.