first pass at I/O using parquet files
schlegelp committed Oct 17, 2023
1 parent 3b1ccb7 commit 6fc38c9
Showing 4 changed files with 513 additions and 10 deletions.
3 changes: 3 additions & 0 deletions docs/source/api.rst
@@ -551,6 +551,9 @@ Functions to import/export neurons.
   navis.write_json
   navis.write_precomputed
   navis.read_precomputed
+  navis.read_parquet
+  navis.write_parquet
+  navis.scan_parquet


.. _api_utility:
14 changes: 9 additions & 5 deletions navis/io/__init__.py
@@ -11,20 +11,24 @@
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.

-from .json_io import write_json, read_json
+from .json_io import read_json, write_json
from .swc_io import read_swc, write_swc
from .nrrd_io import read_nrrd, write_nrrd
-from .precomputed_io import write_precomputed, read_precomputed
+from .precomputed_io import read_precomputed, write_precomputed
from .hdf_io import read_h5, write_h5, inspect_h5
from .rda_io import read_rda
from .nmx_io import read_nmx
from .mesh_io import read_mesh, write_mesh
from .tiff_io import read_tiff
+from .pq_io import read_parquet, write_parquet, scan_parquet

-__all__ = ['write_json', 'read_json',
+__all__ = ['read_json', 'write_json',
           'read_swc', 'write_swc',
           'read_nrrd', 'write_nrrd',
           'read_h5', 'write_h5', 'inspect_h5',
-           'write_precomputed', 'read_precomputed',
+           'read_precomputed', 'write_precomputed',
           'read_tiff',
-           'read_rda', 'read_nmx', 'read_mesh', 'write_mesh']
+           'read_rda',
+           'read_nmx',
+           'read_mesh', 'write_mesh',
+           'read_parquet', 'write_parquet', 'scan_parquet']
25 changes: 20 additions & 5 deletions navis/io/pq_io.md
@@ -2,7 +2,7 @@

Current formats for storing neuroanatomical data typically focus on one neuron per file. Unsurprisingly, this doesn't scale well to tens of thousands of neurons: reading and writing a large number of small files quickly becomes painfully slow.

-Here, we propose a file format that stores an arbitrary number of neurons in a single Parquet file. What is Parquet? Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It has two important properties for what we are trying to do:
+Here, we propose a file format that stores an arbitrary number of neurons in a single Parquet file. What is Parquet, you ask? Why, Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It has two important properties for what we are trying to do:

1. Because it is column-oriented, we can quickly search for a given neuron without having to load the entire file.
2. It allows storage of arbitrary meta data, which we can use to store neuron properties (see the sketch below).
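As a rough sketch of what these two properties buy us (assuming `pyarrow` is installed and a hypothetical `neurons.parquet` has been written to this spec):

```python
import pyarrow.parquet as pq

# Property 1: the columnar layout lets us pull a single neuron's rows
# out of a file holding thousands of neurons without loading everything.
table = pq.read_table("neurons.parquet", filters=[("neuron", "=", 12345)])

# Property 2: arbitrary key/value metadata travels with the file schema
# and can be inspected without touching the node data at all.
meta = pq.read_schema("neurons.parquet").metadata  # dict of bytes -> bytes
```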
@@ -51,7 +51,9 @@ The table for two dotprops with IDs `12345` and `67890` would look like this:
15970 36362 23044 -0.459681 -0.524251 0.716836 67890
```

-The node table must contain the following columns: `x`, `y`, `z`, `vec_x`, `vec_y`, `vec_z` and `neuron`. Additional columns are allowed but may be ignored by the reader.
+The node table must contain the following columns: `x`, `y`, `z`, and `neuron`.
+Additional columns such as `vec_x`, `vec_y`, `vec_z` or `alpha` are allowed but
+may be ignored by the reader.
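For illustration, a writer could assemble such a node table with `pandas` (a hedged sketch: the second row and the file name are made up, and a Parquet engine such as `pyarrow` is assumed to be installed):

```python
import pandas as pd

# Minimal node table for two dotprops; the vec_* columns are optional extras.
nodes = pd.DataFrame({
    "x": [15970, 16210],
    "y": [36362, 36401],
    "z": [23044, 23112],
    "vec_x": [-0.459681, 0.112],  # optional
    "vec_y": [-0.524251, 0.879],  # optional
    "vec_z": [0.716836, -0.464],  # optional
    "neuron": [67890, 12345],
})
nodes.to_parquet("neurons.parquet", index=False)
```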

### Meta data

@@ -61,19 +63,32 @@ Meta data can be stored in Parquet files as `{key: value}` dictionary where both
This means that floats/integers need to be converted to bytes or strings.

To keep track of which neuron has which property, the meta data is encoded in
-the dictionary as `{ID_PROPERTY: VALUE}`. For example, if our two neurons in the
+the dictionary as `{ID:PROPERTY: VALUE}`. For example, if our two neurons in the
examples above had names, they would be encoded as:

```
{"12345_name": "Humpty", "67890_name": "Dumpty"}
{"12345:name": "Humpty", "67890:name": "Dumpty"}
```

The datatype of the `ID` (i.e. whether ID is `12345` or `"12345"`) can be inferred
from the node table itself. In our example, the names (Humpty and Dumpty) are
quite obviously supposed to be strings. This may be less obvious for other
(byte-encoded) properties or values. It is up to the reader to decide how to
parse them. In the future, we could add additional meta data to determine data
-types e.g. via `{"_dtype_name": "str", "_dtype_id": "int"}`.
+types, e.g. via `{"_dtype:name": "str", "_dtype:id": "int"}`.
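A sketch of how a writer might attach and recover such properties via `pyarrow`'s schema metadata (reusing the hypothetical `nodes` table from the sketch above; the keys follow the `ID:PROPERTY` scheme described here):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(nodes)

# Parquet key/value metadata is bytes -> bytes, so everything gets encoded.
props = {b"12345:name": b"Humpty", b"67890:name": b"Dumpty"}
table = table.replace_schema_metadata({**(table.schema.metadata or {}), **props})
pq.write_table(table, "neurons.parquet")

# On the way back in, the reader decides how to decode the values.
meta = pq.read_schema("neurons.parquet").metadata
assert meta[b"12345:name"].decode() == "Humpty"
```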

+### Synapses
+
+Synapses and other similar data typically associated with a neuron must be
+stored in separate Parquet files.
+
+We propose bundling them in a simple zip archive with the following layout:
+
+```bash
+skeletons.parquet.zip
+├── skeletons.parquet <- contains the actual skeletons
+└── synapses.parquet <- contains the synapse data
+```
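Producing such an archive could be as simple as Python's standard-library `zipfile` (a sketch assuming both Parquet files already exist on disk):

```python
import zipfile

# Pack the skeletons and their synapse tables into a single archive.
with zipfile.ZipFile("skeletons.parquet.zip", "w") as zf:
    zf.write("skeletons.parquet")
    zf.write("synapses.parquet")
```

The zip wrapper keeps the two tables together on disk while still letting a reader extract just one member.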

## Benchmarks
