
Initial alpha release for version 1.0.0

Pre-release
@nmcdonnell-kx nmcdonnell-kx released this 25 Feb 13:49

Apache Arrow's in-memory columnar format is a standardized, language-agnostic specification for representing structured, table-like datasets in memory. This format has a rich datatype system (including nested datatypes) designed to support the needs of analytic database systems, dataframe libraries, and more.

The arrowkdb integration enables kdb+ users to read and write Arrow tables created from kdb+ data using:

  • Parquet file format
  • Arrow IPC record batch file format
  • Arrow IPC record batch stream format
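
As an illustration, the table-oriented API can move a simple kdb+ table through each of these formats. This is a sketch based on the arrowkdb function names; the file paths are placeholders and `::` stands in for an empty options argument:

```q
// load the arrowkdb interface
\l q/arrowkdb.q

// a simple kdb+ table of longs and floats
table:([] col1:1 2 3j; col2:4.4 5.5 6.6f)

// Parquet file format
.arrowkdb.pq.writeParquetFromTable["table.parquet"; table; ::]
read_pq:.arrowkdb.pq.readParquetToTable["table.parquet"; ::]

// Arrow IPC record batch file format
.arrowkdb.ipc.writeArrowFromTable["table.arrow"; table; ::]
read_ipc:.arrowkdb.ipc.readArrowToTable["table.arrow"; ::]

// Arrow IPC record batch stream format (serialized to a kdb+ byte vector)
serialized:.arrowkdb.ipc.serializeArrowFromTable[table; ::]
read_stream:.arrowkdb.ipc.parseArrowToTable[serialized; ::]
```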

Currently Arrow supports over 35 datatypes, including concrete, parameterized and nested datatypes. Each Arrow datatype is mapped to a kdb+ type and arrowkdb can seamlessly convert between the two representations.
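For example, the datatype constructor functions cover all three kinds of datatype. This is a sketch using constructors from the arrowkdb datatype API; the exact printed output is not shown:

```q
// concrete datatype
i64:.arrowkdb.dt.int64[]

// parameterized datatype: fixed-width binary of 4 bytes
fsb:.arrowkdb.dt.fixed_size_binary[4i]

// nested datatype: a list of int64 values
lst:.arrowkdb.dt.list[.arrowkdb.dt.int64[]]

// display a datatype's Arrow representation
.arrowkdb.dt.printDatatype[lst]
```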

Separate APIs are provided so that an Arrow table can be created either from a kdb+ table (using an inferred schema) or from a constructed Arrow schema together with the table's list of array data.

  • Inferred schemas. If you are less familiar with Arrow or do not wish to use the more complex or nested Arrow datatypes, arrowkdb can infer the schema from a kdb+ table where each column in the table is mapped to a field in the schema.
  • Constructed schemas. Although inferred schemas are easy to use, they support only a subset of the Arrow datatypes and are considerably less flexible. Where more complex schemas are required, they should be constructed manually using the datatype/field/schema constructor functions that arrowkdb exposes, similar to the C++ Arrow library and PyArrow.
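
A constructed schema can be sketched as follows, building fields from datatypes, a schema from the fields, and then writing the schema together with its array data. Function names are taken from the arrowkdb API; the file path is a placeholder and `::` stands in for an empty options argument:

```q
// build fields from datatypes, then a schema from the fields
f1:.arrowkdb.fd.field[`col1; .arrowkdb.dt.int64[]]
f2:.arrowkdb.fd.field[`col2; .arrowkdb.dt.float64[]]
schema:.arrowkdb.sc.schema[(f1;f2)]

// array data: one kdb+ list per field, in schema order
array_data:(1 2 3j; 4.4 5.5 6.6f)

// write the schema and array data to a Parquet file
.arrowkdb.pq.writeParquet["constructed.parquet"; schema; array_data; ::]

// inspect the constructed schema
.arrowkdb.sc.printSchema[schema]
```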