
Add early stop module #301

Open

wants to merge 22 commits into main
Conversation

@sanaAyrml (Collaborator) commented Dec 5, 2024

PR Type

[Feature]

Short Description

Clickup Ticket(s): https://app.clickup.com/t/8688wzkuk, https://app.clickup.com/t/860qxm622

Integrated an early stopping module as a plug-in for all clients. After a specified number of training steps, the module computes the evaluation loss. If the loss improves on previous evaluations, it saves a snapshot of the model's key attributes, so these attributes can be restored when the stopping criterion is met.
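A rough, self-contained sketch of the intended flow (class and method names, and the in-memory snapshot, are illustrative assumptions, not necessarily the API this PR introduces):

# Minimal sketch of the early-stopping loop; names are illustrative only.
class EarlyStopper:
    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.count_down = patience
        self.best_score: float | None = None
        self.snapshot: dict | None = None

    def should_stop(self, val_loss: float, state: dict) -> bool:
        if self.best_score is None or val_loss < self.best_score:
            self.best_score = val_loss
            self.snapshot = dict(state)  # save the best attributes
            self.count_down = self.patience
            return False
        self.count_down -= 1
        return self.count_down <= 0

stopper = EarlyStopper(patience=2)
for val_loss in [0.9, 0.8, 0.85, 0.84, 0.86]:  # toy evaluation losses
    if stopper.should_stop(val_loss, state={"val_loss": val_loss}):
        break
print(stopper.snapshot)  # {'val_loss': 0.8} -- restore from the best point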

Tests Added

Added a series of tests for the snapshot modules to ensure snapshots are saved and loaded as intended.

@sanaAyrml sanaAyrml marked this pull request as draft January 2, 2025 19:09
@@ -11,7 +11,7 @@
 from flwr.common.typing import Config, NDArrays, Scalar
 from torch.nn.modules.loss import _Loss
 from torch.optim import Optimizer
-from torch.optim.lr_scheduler import _LRScheduler
+from torch.optim.lr_scheduler import LRScheduler
@sanaAyrml (Collaborator, Author):

This change was necessary due to pre-commit errors. I think _LRScheduler has been deprecated in newer versions of PyTorch in favor of LRScheduler.

@sanaAyrml sanaAyrml changed the title Sa early stop Add early stop module Jan 9, 2025
@sanaAyrml sanaAyrml marked this pull request as ready for review January 9, 2025 10:00
@emersodb emersodb (Collaborator) left a comment:

There are a few things we definitely want to make sure we think about carefully. Specifically, I just want to make sure we aren't going to have a lot of additional memory overhead with the way we're doing snapshotting.

@@ -159,6 +159,9 @@ def get_parameters(self, config: Config) -> NDArrays:
             # Need all parameters even if normally exchanging partial
             return FullParameterExchanger().push_parameters(self.model, config=config)
         else:
+            if hasattr(self, "early_stopper") and self.early_stopper.patience == 0:
@emersodb (Collaborator):

Maybe we can encapsulate this in a class method with an informative name?
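For instance (a sketch; the method name is hypothetical, not from this PR):

def _should_return_early_stopped_model(self) -> bool:
    """True when an attached early stopper wants the restored best model sent."""
    return hasattr(self, "early_stopper") and self.early_stopper.patience == 0

The branch above then reads as a single, self-documenting call.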

@emersodb (Collaborator):

This may be a naive question, but is there a reason we're doing a hasattr check for the early stopper rather than making it an optional attribute that defaults to None in the __init__ function? That is,

self.early_stopper: EarlyStopper | None = None

If we do that, then here we can just check that it isn't None, and the set_early_stopper method can default to just logging the "not activated" message and leaving it None. This would eliminate the need for the try-except below as well.
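A sketch of this alternative (set_early_stopper exists in the PR; the placeholder EarlyStopper and the other names are assumptions):

from __future__ import annotations

from logging import INFO

from flwr.common.logger import log  # assumption: the log helper the clients already use

class EarlyStopper:  # stand-in for the PR's EarlyStopper
    def should_stop(self, val_loss: float) -> bool: ...

class BasicClient:
    def __init__(self) -> None:
        # Early stopping is off unless a subclass attaches a stopper.
        self.early_stopper: EarlyStopper | None = None

    def set_early_stopper(self) -> None:
        # Default no-op: log the "not activated" message and leave it None.
        log(INFO, "Early stopping not implemented for this client.")

    def _maybe_stop_early(self, val_loss: float) -> bool:
        # No hasattr or try/except needed: just a None check.
        return self.early_stopper is not None and self.early_stopper.should_stop(val_loss)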

        except NotImplementedError:
            log(
                INFO,
                """Early stopping not implemented for this client.
@emersodb (Collaborator):

Super minor, but we've been avoiding """ logging strings since they preserve whitespace. Instead, we could rely on implicit string concatenation:

log(
    INFO,
    "Early stopping not implemented for this client. "
    "Override set_early_stopper to activate early stopping.",
)

class EarlyStopper:
    def __init__(
        self,
        client: BasicClient,
@emersodb (Collaborator):

We talked about looking into this, so forgive me if we already did and I forgot the conclusion, but have we made sure that this stores only a reference to the client object rather than a copy? Just want to make sure we're not suddenly doubling the memory footprint of each client by doing this.
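For reference, plain attribute assignment in Python binds a reference rather than copying, which is easy to verify (a toy illustration, not code from the PR):

class Holder:
    def __init__(self, obj: object) -> None:
        self.obj = obj  # binds a reference; nothing is copied

big = {"weights": list(range(1_000_000))}
holder = Holder(big)
assert holder.obj is big  # same object in memory, no duplication

So the constructor itself should be fine; the risk would be any deepcopy performed later on the referenced attributes.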

        snapshot_dir: Path | None = None,
    ) -> None:
        """
        Early stopping class is an plugin for the client that allows to stop local training based on the validation
@emersodb (Collaborator):

"an plugin" -> "a plugin" 🙂

        if self.best_score is None or val_loss < self.best_score:
            self.best_score = val_loss

            self.count_down = self.patience
@emersodb (Collaborator):

Say we have patience = 0 and we do our first check. The best_score will be None, so we'll get in here and self.count_down = self.patience = 0. The next time we get in here, count_down will get decremented to -1, so we still won't stop. I know it works, but it feels a bit weird and confusing to read. It feels like there is a cleaner way to do "infinite" patience while also making sure count_down never goes negative. Perhaps we could make patience optional?
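One possible shape for this, treating patience=None as "never stop" (a sketch, not the PR's implementation):

class EarlyStopper:
    def __init__(self, patience: int | None = None) -> None:
        # None means infinite patience: early stopping never triggers.
        self.patience = patience
        self.count_down = patience
        self.best_score: float | None = None

    def should_stop(self, val_loss: float) -> bool:
        if self.best_score is None or val_loss < self.best_score:
            self.best_score = val_loss
            self.count_down = self.patience  # reset on improvement
            return False
        if self.patience is None:
            return False  # infinite patience
        self.count_down = max(self.count_down - 1, 0)  # never goes negative
        return self.count_down == 0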

T = TypeVar("T")


class Snapshotter(ABC, Generic[T]):
@emersodb (Collaborator):

Nice use of the Generic!
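For readers unfamiliar with the pattern, a concrete subclass pins T down so callers get precise types back (the abstract method and subclass names here are hypothetical, not the PR's):

from abc import ABC, abstractmethod
from typing import Generic, TypeVar

import torch.nn as nn

T = TypeVar("T")

class Snapshotter(ABC, Generic[T]):
    @abstractmethod
    def wrap_attribute(self, name: str) -> dict[str, T]: ...

class ModelSnapshotter(Snapshotter[nn.Module]):
    def __init__(self, client: object) -> None:
        self.client = client

    def wrap_attribute(self, name: str) -> dict[str, nn.Module]:
        # T = nn.Module here, so type checkers know the dict values are modules.
        return {name: getattr(self.client, name)}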

        Returns:
            dict[str, T]: Wrapped attribute as a dictionary.
        """
        attribute = copy.deepcopy(getattr(self.client, name))
@emersodb (Collaborator):

Is there a reason we're deep copying here? I think this will force a duplicate of the attribute (for example, the model) to be created, and I'm a little worried that will double our memory footprint. If the goal is to keep things in memory (i.e., not checkpoint to a file), I think maybe we should force checkpointing to a file to avoid the memory overhead.
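To make the trade-off concrete (a sketch; torch.save stands in for whatever checkpointer is used here):

import copy

import torch

model = torch.nn.Linear(4096, 4096)

# In-memory snapshot: deepcopy materializes a second full set of tensors,
# roughly doubling the model's RAM footprint while it is held.
snapshot = copy.deepcopy(model.state_dict())

# On-disk snapshot: nothing extra stays resident once the call returns.
torch.save(model.state_dict(), "/tmp/snapshot.pt")
restored = torch.load("/tmp/snapshot.pt")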

        self.snapshot_ckpt = self.checkpointer.load_checkpoint(f"temp_{self.client.client_name}.pt")

        for attr in attrs:
            snapshotter_function, expected_type = self.default_snapshot_attrs[attr]
@emersodb (Collaborator):

Just a suggestion, but I think we can call this just snapshotter rather than snapshotter_function

@@ -0,0 +1,244 @@
+import copy
@emersodb (Collaborator):

Overall, I think these snapshotters look good. I just want to make sure we're not duplicating a bunch of objects in memory. If we're not careful, these objects will make the clients much heavier than they would be without early stopping, which we want to avoid.
