## Storing Neurons in Parquet Files

Current formats for storing neuroanatomical data typically hold one neuron per file. Unsurprisingly, this doesn't scale well to tens of thousands of neurons: reading and writing large numbers of small files quickly becomes painfully slow.

Here, we propose a file format that stores an arbitrary number of neurons in a single Parquet file. What is Parquet? Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It has two properties that are important for our purposes:

1. Because it is column-oriented, we can quickly extract a given neuron without having to load the entire file (see the sketch below).
2. It allows storage of arbitrary metadata, which we can use to store per-neuron properties.

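To illustrate the first point, here is a minimal sketch of such a filtered read using `pyarrow` (the file name `neurons.parquet` and the ID `12345` are placeholders; the format itself is described below):

```python
import pyarrow.parquet as pq

# Only rows matching the filter are materialized -- the rest of the
# file is never loaded into memory.
table = pq.read_table("neurons.parquet", filters=[("neuron", "=", 12345)])
df = table.to_pandas()
```
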
### Skeletons

For skeletons, the file contains the SWC node table plus an additional `neuron` column holding the neuron ID.

Let's assume we want to store two neurons with IDs `12345` and `67890`, respectively. The table would look like this:

```
node_id parent_id radius x y z neuron
1 -1 10.000000 15784 37250 28062 12345
2 1 18.284300 15764 37230 28082 12345
3 2 34.721401 15744 37190 28122 12345
4 3 34.721401 15744 37150 28202 12345
5 4 34.721401 15704 37130 28242 12345
... ... ... ... ... ... ...
4843 9 62.111000 15450 35582 23284 67890
4844 7 46.568501 15830 36182 23124 67890
4845 6 287.321014 15450 35862 23244 67890
4846 5 151.244995 15530 35622 22844 67890
4847 3 30.000000 15970 36362 23044 67890
```

The node table must contain the following columns: `node_id`, `parent_id`, `x`, `y`, `z` and `neuron`. Additional columns (such as `radius`) are allowed but may be ignored by the reader.

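As a rough sketch of how such a file could be written with `pandas` (assuming the `pyarrow` engine is installed; `neurons.parquet` is a placeholder name and the values are abbreviated from the example above):

```python
import pandas as pd

# Two toy skeletons; the `neuron` column tells them apart
nodes = pd.concat([
    pd.DataFrame({"node_id": [1, 2], "parent_id": [-1, 1],
                  "radius": [10.0, 18.2843],
                  "x": [15784, 15764], "y": [37250, 37230],
                  "z": [28062, 28082], "neuron": 12345}),
    pd.DataFrame({"node_id": [1, 2], "parent_id": [-1, 1],
                  "radius": [62.111, 46.5685],
                  "x": [15450, 15830], "y": [35582, 36182],
                  "z": [23284, 23124], "neuron": 67890}),
], ignore_index=True)

# All neurons end up in a single Parquet file
nodes.to_parquet("neurons.parquet", index=False)
```
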
### Dotprops

Dotprops are point clouds with associated vectors indicating directionality.

The table for two dotprops with IDs `12345` and `67890` would look like this:

```
x y z vec_x vec_y vec_z neuron
15784 37250 28062 -0.300205 -0.393649 0.868860 12345
15764 37230 28082 -0.108453 -0.211375 0.971369 12345
15744 37190 28122 -0.043569 -0.455931 0.888948 12345
15744 37150 28202 -0.307032 -0.533594 0.788041 12345
15704 37130 28242 -0.143014 -0.318347 0.937124 12345
... ... ... ... ... ... ...
15450 35582 23284 -0.199709 -0.936613 0.287877 67890
15830 36182 23124 -0.154800 -0.952146 -0.263541 67890
15450 35862 23244 -0.283306 0.042772 0.958075 67890
15530 35622 22844 -0.020729 0.722610 0.690945 67890
15970 36362 23044 -0.459681 -0.524251 0.716836 67890
```

The table must contain the following columns: `x`, `y`, `z`, `vec_x`, `vec_y`, `vec_z` and `neuron`. Additional columns are allowed but may be ignored by the reader.

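A reader can split such a table back into individual point clouds by grouping on the `neuron` column. A minimal sketch (assuming `dotprops.parquet` is a file in the above format; the output structure is just illustrative):

```python
import pandas as pd

table = pd.read_parquet("dotprops.parquet")

# Rebuild one point cloud + unit vectors per neuron ID
dotprops = {
    neuron: {"points": grp[["x", "y", "z"]].to_numpy(),
             "vect": grp[["vec_x", "vec_y", "vec_z"]].to_numpy()}
    for neuron, grp in table.groupby("neuron")
}
```
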
### Metadata

Metadata can be stored in a Parquet file as a `{key: value}` dictionary where both `key` and `value` have to be either strings or bytes (strings are converted to bytes automatically on write).

This means that floats/integers need to be converted to bytes or strings first.

To keep track of which neuron has which property, the metadata is encoded in
the dictionary as `{ID_PROPERTY: VALUE}`. For example, if the two neurons in the
examples above had names, they would be encoded as:

```
{"12345_name": "Humpty", "67890_name": "Dumpty"}
```

The datatype of the `ID` (i.e. whether the ID is `12345` or `"12345"`) can be inferred
from the node table itself. In our example, the names (Humpty and Dumpty) are
quite obviously meant to be strings, but this may be less obvious for other
(byte-encoded) properties or values. It is up to the reader to decide how to parse
them. In the future, we could add dedicated metadata entries to declare data
types, e.g. `{"_dtype_name": "str", "_dtype_id": "int"}`.

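A sketch of how writing and reading such metadata could look with `pyarrow` (the `replace_schema_metadata`/`read_schema` round trip is standard `pyarrow`; the key parsing is just one possible implementation of the `{ID_PROPERTY: VALUE}` convention):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

nodes = pd.DataFrame({"node_id": [1, 1], "parent_id": [-1, -1],
                      "x": [15784, 15450], "y": [37250, 35582],
                      "z": [28062, 23284], "neuron": [12345, 67890]})
table = pa.Table.from_pandas(nodes)

# Attach per-neuron properties as {ID_PROPERTY: VALUE} pairs; both keys
# and values end up as bytes in the Parquet schema metadata
meta = {"12345_name": "Humpty", "67890_name": "Dumpty"}
table = table.replace_schema_metadata(
    {**(table.schema.metadata or {}),
     **{k.encode(): v.encode() for k, v in meta.items()}}
)
pq.write_table(table, "neurons.parquet")

# The metadata can be read back without loading the node table
raw = pq.read_schema("neurons.parquet").metadata
props = {}
for key, value in raw.items():
    neuron_id, sep, prop = key.decode().partition("_")
    if not sep:  # skip keys that don't follow the ID_PROPERTY scheme
        continue
    props.setdefault(neuron_id, {})[prop] = value.decode()
# props == {"12345": {"name": "Humpty"}, "67890": {"name": "Dumpty"}}
```
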
## Benchmarks

Test case: 1,000 skeletons on a 2018 MacBook Pro.


| Writing      | Timing   |
|--------------|----------|
| SWC files    | 2:37 min |
| Zip archive  | 2:55 min |
| Parquet file | 0:25 min |


| Format       | Size on disk |
|--------------|--------------|
| SWC files    | 200.7 MB     |
| Zip archive  | 55.6 MB      |
| Parquet file | 35.6 MB      |

| Reading                     | Timing   |
|-----------------------------|----------|
| SWC files (single thread)   | 0:42 min |
| SWC files (multi-thread)    | 0:24 min |
| Zip archive (single thread) | 1:01 min |
| Zip archive (multi-thread)  | 0:28 min |
| Parquet file                | 0:35 min |

In these preliminary tests, Parquet is clearly ahead in terms of writing speed
and size on disk. For reading, it beats the single-threaded SWC and Zip readers
but is slightly slower than the multi-threaded ones.
