Refactor/traintestsplit: Adding train test split customisability to the data loading #22

Merged: 6 commits, Jan 31, 2024
5 changes: 5 additions & 0 deletions .gitignore
@@ -164,3 +164,8 @@ cython_debug/

# tracks the version
_version.py

# rogue directories from example notebooks running in local space
checkpoints/
loss_figures/
loss_logs/
63 changes: 63 additions & 0 deletions docs/customising_training.rst
@@ -11,6 +11,7 @@ We will cover the following topics:
* Number of epochs
* Checkpoint suffix modification
* Number of workers in PyTorch DataLoader
* Train/test and cross-validation splitting yourself

Early stopping
--------------
@@ -248,3 +249,65 @@ You can change the number of workers in the PyTorch DataLoader using the ``num_workers``
fusion_model=example_model,
)



-----

Train/test and cross-validation splitting yourself
---------------------------------------------------

By default, fusilli randomly splits your data into train/test sets or cross-validation folds, based on the test size or number of folds you specify in the :func:`~.fusilli.data.prepare_fusion_data` function.

You can remove this randomness and specify the data indices for the train and test sets, or for each cross-validation fold, yourself by passing optional arguments to :func:`~.fusilli.data.prepare_fusion_data`.


For train/test splitting, the argument ``test_indices`` should be a list of indices for the test set. For example, to make the test set the first six data points in the overall dataset:

.. code-block:: python

    from fusilli.data import prepare_fusion_data
    from fusilli.train import train_and_save_models

    test_indices = [0, 1, 2, 3, 4, 5]

    datamodule = prepare_fusion_data(
        prediction_task="binary",
        fusion_model=example_model,
        data_paths=data_paths,
        output_paths=output_path,
        test_indices=test_indices,
    )
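
If you would rather derive the test indices from the data than hard-code them, any deterministic rule that produces a list of integer positions will do. A minimal sketch, assuming your tabular data is a CSV with one row per data point (the file name ``tabular_data.csv`` is illustrative):

.. code-block:: python

    import pandas as pd

    # Illustrative file name: substitute your own tabular data file
    df = pd.read_csv("tabular_data.csv")

    # Deterministic rule: hold out every fifth data point for testing
    test_indices = [i for i in range(len(df)) if i % 5 == 0]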

For specifying your own cross-validation folds, the argument ``own_kfold_indices`` should be a list of ``(train_indices, test_indices)`` tuples, one tuple per fold.

If you want non-random cross-validation folds, you can specify the folds explicitly, shown here for 3 folds of a 12-point dataset:

.. code-block:: python

    own_kfold_indices = [
        ([4, 5, 6, 7, 8, 9, 10, 11], [0, 1, 2, 3]),   # first fold
        ([0, 1, 2, 3, 8, 9, 10, 11], [4, 5, 6, 7]),   # second fold
        ([0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11]),   # third fold
    ]
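
However you construct the folds, manually as above or automatically as below, it is worth checking that the test indices partition your dataset, i.e. that every data point appears in exactly one test set. A quick sanity check for the 12-point example above:

.. code-block:: python

    # Gather every test index across the folds and check each appears exactly once
    all_test_indices = sorted(i for _, test_idx in own_kfold_indices for i in test_idx)
    assert all_test_indices == list(range(12)), "Folds do not partition the dataset"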

Alternatively, generate the folds automatically with scikit-learn's `KFold <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html>`_ outside of the fusilli functions, like so:

.. code-block:: python

    from sklearn.model_selection import KFold

    num_folds = 5

    # ``dataset`` is assumed to be your full dataset; KFold only needs its length
    own_kfold_indices = [
        (train_index, test_index)
        for train_index, test_index in KFold(n_splits=num_folds).split(range(len(dataset)))
    ]

    datamodule = prepare_fusion_data(
        kfold=True,
        prediction_task="binary",
        fusion_model=example_model,
        data_paths=data_paths,
        output_paths=output_path,
        own_kfold_indices=own_kfold_indices,
        num_folds=num_folds,
    )
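
The resulting datamodule is then used exactly as in the random-split case. A minimal sketch, assuming ``example_model`` is the fusion model used throughout this page:

.. code-block:: python

    # Train one model per fold and save checkpoints as usual
    trained_models = train_and_save_models(
        data_module=datamodule,
        fusion_model=example_model,
    )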
