first pass at I/O using parquet files
schlegelp committed Oct 17, 2023
1 parent 3b1ccb7 commit 6fc38c9
Showing 4 changed files with 513 additions and 10 deletions.
3 changes: 3 additions & 0 deletions docs/source/api.rst
@@ -551,6 +551,9 @@ Functions to import/export neurons.
   navis.write_json
   navis.write_precomputed
   navis.read_precomputed
+  navis.read_parquet
+  navis.write_parquet
+  navis.scan_parquet


.. _api_utility:
14 changes: 9 additions & 5 deletions navis/io/__init__.py
@@ -11,20 +11,24 @@
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.

-from .json_io import write_json, read_json
+from .json_io import read_json, write_json
from .swc_io import read_swc, write_swc
from .nrrd_io import read_nrrd, write_nrrd
-from .precomputed_io import write_precomputed, read_precomputed
+from .precomputed_io import read_precomputed, write_precomputed
from .hdf_io import read_h5, write_h5, inspect_h5
from .rda_io import read_rda
from .nmx_io import read_nmx
from .mesh_io import read_mesh, write_mesh
from .tiff_io import read_tiff
+from .pq_io import read_parquet, write_parquet, scan_parquet

-__all__ = ['write_json', 'read_json',
+__all__ = ['read_json', 'write_json',
           'read_swc', 'write_swc',
           'read_nrrd', 'write_nrrd',
           'read_h5', 'write_h5', 'inspect_h5',
-           'write_precomputed', 'read_precomputed',
+           'read_precomputed', 'write_precomputed',
           'read_tiff',
-           'read_rda', 'read_nmx', 'read_mesh', 'write_mesh']
+           'read_rda',
+           'read_nmx',
+           'read_mesh', 'write_mesh',
+           'read_parquet', 'write_parquet', 'scan_parquet']
25 changes: 20 additions & 5 deletions navis/io/pq_io.md
@@ -2,7 +2,7 @@

Current formats for storing neuroanatomical data typically focus on one neuron per file. Unsurprisingly, this doesn't scale well to tens of thousands of neurons: reading and writing a large number of small files quickly becomes painfully slow.

-Here, we propose a file format that stores an arbitrary number of neurons in a single Parquet file. What is Parquet? Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It has two important properties for what we are trying to do:
+Here, we propose a file format that stores an arbitrary number of neurons in a single Parquet file. What is Parquet, you ask? Why, Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It has two important properties for what we are trying to do:

1. Because it is column-oriented, we can quickly search for a given neuron without having to load the entire file.
2. It allows storage of arbitrary meta data, which we can use to store neuron properties (see the sketch below).
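As a rough sketch of what these two properties buy us (assuming `pyarrow` is installed and a hypothetical `neurons.parquet` has been written to this spec):

```python
import pyarrow.parquet as pq

# Property 1: the columnar layout lets us pull a single neuron's rows
# out of a file holding thousands of neurons without loading everything.
table = pq.read_table("neurons.parquet", filters=[("neuron", "=", 12345)])

# Property 2: arbitrary key/value metadata travels with the file schema
# and can be inspected without touching the node data at all.
meta = pq.read_schema("neurons.parquet").metadata  # dict of bytes -> bytes
```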
@@ -51,7 +51,9 @@ The table for two dotprops with IDs `12345` and `67890` would look like this:
15970 36362 23044 -0.459681 -0.524251 0.716836 67890
```

-The node table must contain the following columns: `x`, `y`, `z`, `vec_x`, `vec_y`, `vec_z` and `neuron`. Additional columns are allowed but may be ignored by the reader.
+The node table must contain the following columns: `x`, `y`, `z`, and `neuron`.
+Additional columns such as `vec_x`, `vec_y`, `vec_z` or `alpha` are allowed but
+may be ignored by the reader.
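For illustration, a writer could assemble such a node table with `pandas` (a hedged sketch: the second row and the file name are made up, and a Parquet engine such as `pyarrow` is assumed to be installed):

```python
import pandas as pd

# Minimal node table for two dotprops; the vec_* columns are optional extras.
nodes = pd.DataFrame({
    "x": [15970, 16210],
    "y": [36362, 36401],
    "z": [23044, 23112],
    "vec_x": [-0.459681, 0.112],  # optional
    "vec_y": [-0.524251, 0.879],  # optional
    "vec_z": [0.716836, -0.464],  # optional
    "neuron": [67890, 12345],
})
nodes.to_parquet("neurons.parquet", index=False)
```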

### Meta data

@@ -61,19 +63,32 @@ Meta data can be stored in Parquet files as `{key: value}` dictionary where both
This means that floats/integers need to be converted to bytes or strings.

To keep track of which neuron has which property, the meta data is encoded in
-the dictionary as `{ID_PROPERTY: VALUE}`. For example, if our two neurons in the
+the dictionary as `{ID:PROPERTY: VALUE}`. For example, if our two neurons in the
examples above had names, they would be encoded as:

```
{"12345_name": "Humpty", "67890_name": "Dumpty"}
{"12345:name": "Humpty", "67890:name": "Dumpty"}
```

The datatype of the `ID` (i.e. whether ID is `12345` or `"12345"`) can be inferred
from the node table itself. In our example, the names (Humpty and Dumpty) are
quite obviously supposed to be strings. This may be less obvious for other
(byte-encoded) properties or values. It is up to the reader to decide how to
parse them. In the future, we could add additional meta data to determine data
-types e.g. via `{"_dtype_name": "str", "_dtype_id": "int"}`.
+types, e.g. via `{"_dtype:name": "str", "_dtype:id": "int"}`.
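A sketch of how a writer might attach and recover such properties via `pyarrow`'s schema metadata (reusing the hypothetical `nodes` table from the sketch above; the keys follow the `ID:PROPERTY` scheme described here):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(nodes)

# Parquet key/value metadata is bytes -> bytes, so everything gets encoded.
props = {b"12345:name": b"Humpty", b"67890:name": b"Dumpty"}
table = table.replace_schema_metadata({**(table.schema.metadata or {}), **props})
pq.write_table(table, "neurons.parquet")

# On the way back in, the reader decides how to decode the values.
meta = pq.read_schema("neurons.parquet").metadata
assert meta[b"12345:name"].decode() == "Humpty"
```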

+### Synapses
+
+Synapses and other similar data typically associated with a neuron must be
+stored in separate Parquet files.
+
+We propose bundling them in a simple zip archive with the following layout:
+
+```bash
+skeletons.parquet.zip
+├── skeletons.parquet <- contains the actual skeletons
+└── synapses.parquet <- contains the synapse data
+```
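Producing such an archive could be as simple as Python's standard-library `zipfile` (a sketch assuming both Parquet files already exist on disk):

```python
import zipfile

# Pack the skeletons and their synapse tables into a single archive.
with zipfile.ZipFile("skeletons.parquet.zip", "w") as zf:
    zf.write("skeletons.parquet")
    zf.write("synapses.parquet")
```

The zip wrapper keeps the two tables together on disk while still letting a reader extract just one member.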

## Benchmarks
