Extend TraJan to provide tooling to do "nan insertion" when timeseries have significant holes? #98

jerabaul29 · 2024-08-02T14:42:08Z

Would we be interested to provide some functionality / tooling to make it easy to do "nan insertion" when timeseries have holes? This is already built into the spectra plotting that I recently pushed, and could be relevant to plenty of other cases (i.e. trajectories, statistics, etc). The idea is to avoid ugly linear interpolation in plots when there are some data holes.

See for example the difference between:

nan insertion:
no nan insertion (default as of now, ugly lines when instruments loose contact because buried in snow or whatever else reason)
difference highlighted by putting the no nan in black and other in color

I end up using stuff like this all the time, so I have prepared my own function. I would be happy to i) add it to some trajan tooling, ii) use it to refactor how I do the nan insertion in the spectra plotting, iii) start working / thinking about how this could be integrated into the other potting functionalities (a switch to use it, with argument forwarding for the hole duration threshold?)

# %%

import numpy as np

from typing import List
import numpy.typing as npt

# %%


def insert_NaNs_in_time_holes(
    common_time_index: npt.NDArray[np.datetime64],
    hole_threshold: np.timedelta64,
    list_data_arrays: List[npt.NDArray],
    data_arrays_time_index: int = 0):
    """Insert NaNs where holes are detected in common_time_index. This is useful, for example,
    to fill holes with NaNs before plotting to avoid ugly lines on the plots.

    Args:
        common_time_index: the time index, should be common to all arrays in list_data_arrays
        hole_threshold: the threshold over which we consider that there
            is a hole. The NaNs will be filled at start_hole + hole_threshold/4 and
            end_hole - hole_threshold/4 , for each hole
        list_data_arrays: a list of the data arrays using the common_time_index
        data_arrays_time_index: the index used for time in the arrays within list_data_arrays

    Returns:
        (new_common_time_index, list_new_data_arrays): new_time_index is a new time index (i.e.
            a separate copy) with the start and ends of NaN periods inserted,
                                                       list_new_data_arrays is a new list of
            data arrays (i.e. separate copies) with the NaN hole delimiters inserted.
    """

    # common_time_index should be datetime64 and increasing
    assert np.issubdtype(common_time_index.dtype, np.datetime64)
    assert np.all(np.diff(common_time_index)) >= 0

    hole_threshold = hole_threshold.astype('timedelta64[ns]')

    # each list_data_arrays array should have a time dimension equal to common_time_index
    for crrt_data_array in list_data_arrays:
        assert np.shape(crrt_data_array)[data_arrays_time_index] == np.shape(common_time_index)[0]

    time_deltas = common_time_index[1:] - common_time_index[:-1]
    time_jump_indexes = np.where(time_deltas > hole_threshold)[0]

    time_hole_starts = common_time_index[time_jump_indexes] + hole_threshold / 4
    time_hole_ends = common_time_index[time_jump_indexes+1] - hole_threshold / 4
    # intertwindle the 2
    time_hole_limits = np.dstack((time_hole_starts, time_hole_ends)).flatten()

    time_jump_indexes = np.repeat(time_jump_indexes, 2)

    new_common_time_index = np.insert(common_time_index, time_jump_indexes+1, time_hole_limits)

    list_new_data_arrays = []
    for crrt_data_array in list_data_arrays:
        to_insert = np.transpose(np.nan*crrt_data_array[0])
        new_crrt_data_array = np.insert(crrt_data_array, time_jump_indexes+1, to_insert, data_arrays_time_index)
        list_new_data_arrays.append(new_crrt_data_array)

    return (new_common_time_index, list_new_data_arrays)

# %%

# example / test
# TODO: if moving to dedicated functionality, make this into the associated tests

common_time_index = np.array(
    [
        np.datetime64("2024-01-01T00:00:00"),
        np.datetime64("2024-01-01T01:00:00"),
        np.datetime64("2024-01-01T06:00:00"),
        np.datetime64("2024-01-01T07:00:00"),
        np.datetime64("2024-01-01T08:00:00"),
        np.datetime64("2024-01-01T09:00:00"),
        np.datetime64("2024-01-01T12:00:00"),
        np.datetime64("2024-01-01T13:00:00"),
        np.datetime64("2024-01-01T13:30:00"),
    ]
)

expected_new_common_time_index = np.array(
    [
        np.datetime64("2024-01-01T00:00:00"),
        np.datetime64("2024-01-01T01:00:00"),
        np.datetime64("2024-01-01T01:30:00"),
        np.datetime64("2024-01-01T05:30:00"),
        np.datetime64("2024-01-01T06:00:00"),
        np.datetime64("2024-01-01T07:00:00"),
        np.datetime64("2024-01-01T08:00:00"),
        np.datetime64("2024-01-01T09:00:00"),
        np.datetime64("2024-01-01T09:30:00"),
        np.datetime64("2024-01-01T11:30:00"),
        np.datetime64("2024-01-01T12:00:00"),
        np.datetime64("2024-01-01T13:00:00"),
        np.datetime64("2024-01-01T13:30:00"),
    ]
)

hole_threshold = np.timedelta64(2, "h")

list_data_arrays = [
    np.array([1, 2, 5, 6, 7, 8, 11, 12, 13]).astype('float'),
    np.array([10, 20, 50, 60, 70, 80, 110, 120, 130]).astype('float'),
    np.array([[1,0], [2,0], [5,0], [6,0], [7,0], [8,0], [11,0], [12,0], [13,0],]).astype('float'),
]

expected_list_new_data_arrays = [
    np.array([1, 2, np.nan, np.nan, 5, 6, 7, 8, np.nan, np.nan, 11, 12, 13]).astype('float'),
    np.array([10, 20, np.nan, np.nan, 50, 60, 70, 80, np.nan, np.nan, 110, 120, 130]).astype('float'),
    np.array([[1,0], [2,0], [np.nan,np.nan], [np.nan,np.nan], [5,0], [6,0], [7,0], [8,0], [np.nan,np.nan], [np.nan,np.nan], [11,0], [12,0], [13,0],]).astype('float'),
]

data_arrays_time_index = 0

(new_common_time_index, list_new_data_arrays) = insert_NaNs_in_time_holes(common_time_index, hole_threshold, list_data_arrays)


def compare_np_arrays_with_nan(a, b, epsilon=1.0e-12):
    """Given 2 arrays of floats, check for corresponding indexes that either both have values close to
    each other at epsilon prediction, or both have nans."""
    res = np.logical_or(
        np.abs(a - b) < epsilon,
        np.logical_and(np.isnan(a), np.isnan(b))
    ).all()
    return res


assert np.all(new_common_time_index == expected_new_common_time_index)
assert compare_np_arrays_with_nan(list_new_data_arrays[0], expected_list_new_data_arrays[0])
assert compare_np_arrays_with_nan(list_new_data_arrays[1], expected_list_new_data_arrays[1])
assert compare_np_arrays_with_nan(list_new_data_arrays[2], expected_list_new_data_arrays[2])

# %%

gauteh · 2024-08-02T14:46:01Z

Maybe this one is relevant: https://opendrift.github.io/trajan/autoapi/trajan/traj2d/index.html#trajan.traj2d.Traj2d.insert_nan_where I'm not sure if it is implemented for 1d, or maybe could even be general for both.

jerabaul29 · 2024-08-02T15:02:40Z

Thanks for the pointer, I will look into it :) .

A question is if we like better to add this nan hole filling on the dataset and modify it, or if we just want some small function to edit the data "on the fly" when we need to plot it, without changing the underlying dataset (now I just call this function above in my code to modify the arrays to plot for example, so I am in the second case and I do not do modifications of the whole dataset).

Another question is how to handle cases when there are more time arrays (ie time for position / wave / temperature etc that differ from each others, common with buoys). If we modify the dataset we may have to think about the different groups of related variables and dimensions etc. If we just apply a bit of data modification before plotting we do not need to think about as many aspects, as we can just call the data modifying function on the fly on the exact data that we are preparing to plot :) .

gauteh · 2024-08-02T15:09:35Z

I think that a very important principle in trajan is to stay close to xarray, numpy and matplotlib so that it is easy for the user to reason out what happens. The way things are done in xarray is usually to chain various limitations and selections on the dataset, then plot. These changes do not change the original dataset, but creates a copy or a view. The same with plotting, we should set up the future like xarray does and let regular matplotlib save.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend TraJan to provide tooling to do "nan insertion" when timeseries have significant holes? #98

Extend TraJan to provide tooling to do "nan insertion" when timeseries have significant holes? #98

jerabaul29 commented Aug 2, 2024

gauteh commented Aug 2, 2024

jerabaul29 commented Aug 2, 2024

gauteh commented Aug 2, 2024

Extend TraJan to provide tooling to do "nan insertion" when timeseries have significant holes? #98

Extend TraJan to provide tooling to do "nan insertion" when timeseries have significant holes? #98

Comments

jerabaul29 commented Aug 2, 2024

gauteh commented Aug 2, 2024

jerabaul29 commented Aug 2, 2024

gauteh commented Aug 2, 2024