Added notebook for big datasets
freemansw1 committed Feb 16, 2024
1 parent ed4c4bf commit 68f2978
Showing 3 changed files with 1,188 additions and 3 deletions.
37 changes: 34 additions & 3 deletions doc/big_datasets.rst
@@ -1,11 +1,42 @@
Handling Large Datasets
-------------------------------------

Often, one desires to use *tobac* to identify and track features in large datasets ("big data"). This documentation strives to suggest various methods for doing so efficiently. Current versions of *tobac* do not support out-of-core (e.g., :code:`dask`) computation, meaning that these strategies may need to be employed for both computational and memory reasons.

.. _Split Feature Detection:

===========================================
Split Feature Detection and Run in Parallel
===========================================
Current versions of threshold feature detection (see :doc:`feature_detection_overview`) are time independent, meaning that one can easily parallelize feature detection across all times (although not across space). *tobac* provides the :py:meth:`tobac.utils.combine_feature_dataframes` function to combine a list of dataframes produced by a parallelization method (such as :code:`jug`, :code:`multiprocessing.pool`, or :code:`dask.bag`) into a single combined dataframe suitable for tracking.

Below is a snippet from a larger notebook demonstrating how to run feature detection in parallel (see :doc:`big_datasets_examples/notebooks/parallel_processing_tobac`).

.. code-block:: python

    # Build a dask.bag of single-timestep slices of the input data
    b = dask.bag.from_sequence([combined_ds["data"][x : x + 1] for x in range(len(combined_ds["time"]))], npartitions=1)
    # Detect features on each timestep in parallel, then combine the results
    out_feature_dfs = dask.bag.map(
        lambda x: tobac.feature_detection_multithreshold(x.to_iris(), 4000, **parameters_features), b
    ).compute()
    combined_dataframes = tobac.utils.general.combine_feature_dataframes(out_feature_dfs)

.. _Split Segmentation:

======================================
Split Segmentation and Run in Parallel
======================================
Recall that the segmentation mask (see :doc:`segmentation_output`) is the same size as the input grid, which results in large output files when handling large input datasets. The following strategies can help reduce the output size and make segmentation masks more useful for analysis.

The first strategy is to run segmentation only on features *after tracking and quality control*. While this will not directly impact performance, waiting to run segmentation on the final set of features (after discarding, e.g., non-tracked cells) can make analysis of the output segmentation dataset easier.
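
For example, one might subset the feature dataframe to tracked cells before running segmentation. Below is a minimal sketch: the names :code:`tracks`, :code:`field`, :code:`dxy`, and :code:`parameters_segmentation` are illustrative, and the :code:`cell == -1` marker for untracked features assumes recent *tobac* defaults.

.. code-block:: python

    # Keep only features that were successfully linked into a track;
    # with recent tobac defaults, unlinked features carry cell == -1.
    tracked_features = tracks[tracks["cell"] != -1]

    # Segment only on the reduced feature set, producing a smaller,
    # easier-to-analyze segmentation mask.
    mask, tracked_features = tobac.segmentation_2D(
        tracked_features, field, dxy, **parameters_segmentation
    )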

To speed up segmentation, one can process multiple segmentation times independently in parallel, similar to feature detection. Unlike feature detection, however, there is currently no built-in *tobac* method to combine multiple segmentation times into a single file. One can do this with standard NetCDF tools such as :code:`ncrcat` or with xarray utilities such as :code:`xr.concat`; alternatively, you can leave the segmentation mask output as separate files and open them later with a multi-file reader such as :code:`xr.open_mfdataset`. Both combining options are sketched below.
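
In the sketch below, the per-time mask file names (:code:`segmentation_mask_*.nc`) are hypothetical.

.. code-block:: python

    import glob
    import xarray as xr

    # Hypothetical per-time mask files written by independent workers
    mask_files = sorted(glob.glob("segmentation_mask_*.nc"))

    # Option 1: eagerly concatenate along time and write a single file
    combined = xr.concat([xr.open_dataset(f) for f in mask_files], dim="time")
    combined.to_netcdf("segmentation_mask_all.nc")

    # Option 2: leave the files separate and open them lazily as one dataset
    masks = xr.open_mfdataset("segmentation_mask_*.nc", combine="by_coords")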


.. _Tracking Hanging:

=====================================
Tracking Hangs with Too Many Features
=====================================

When tracking on a large dataset, :code:`tobac.tracking.linking_trackpy` can hang with the default parameters. This happens because the underlying tracking library, :code:`trackpy`, searches too large an area for each feature's match in the next timestep. This can be solved *without impact to scientific output* by lowering the :code:`subnetwork_size` parameter of :code:`tobac.tracking.linking_trackpy`.
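
For example (a minimal sketch; :code:`features`, :code:`field`, :code:`dt`, :code:`dxy`, and :code:`parameters_linking` are assumed from earlier steps, and the :code:`subnetwork_size` value is illustrative):

.. code-block:: python

    # Cap the size of the subnetworks trackpy is allowed to solve; smaller
    # values avoid hangs when many features are close together.
    track = tobac.linking_trackpy(
        features, field, dt=dt, dxy=dxy,
        subnetwork_size=100,  # illustrative value; tune for your dataset
        **parameters_linking,
    )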

8 changes: 8 additions & 0 deletions doc/big_datasets_examples/index.rst
@@ -0,0 +1,8 @@
##############################
Big Data Processing Examples
##############################

.. toctree::
:maxdepth: 2

notebooks/parallel_processing_tobac
