diff --git a/doc/source/tutorial-quickstart-scikitlearn.rst b/doc/source/tutorial-quickstart-scikitlearn.rst index 56bdf18cad1..6aea6b3d2d4 100644 --- a/doc/source/tutorial-quickstart-scikitlearn.rst +++ b/doc/source/tutorial-quickstart-scikitlearn.rst @@ -3,312 +3,334 @@ Quickstart scikit-learn ======================= -.. meta:: - :description: Check out this Federated Learning quickstart tutorial for using Flower with scikit-learn to train a linear regression model. +In this federated learning tutorial we will learn how to train a logistic regression model on +MNIST using Flower and scikit-learn. It is recommended to create a virtual environment +and run everything within a :doc:`virtualenv `. -In this tutorial, we will learn how to train a ``Logistic Regression`` model on MNIST -using Flower and scikit-learn. +Let's use ``flwr new`` to create a complete Flower+scikit-learn project. It will +generate all the files needed to run, by default with the Flower Simulation Engine, a +federation of 10 nodes using |fedavg|_. The dataset will be partitioned using +|flowerdatasets|_'s |iidpartitioner|_. -It is recommended to create a virtual environment and run everything within this -:doc:`virtualenv `. +Now that we have a rough idea of what this example is about, let's get started. First, +install Flower in your new environment: -Our example consists of one *server* and two *clients* all having the same model. +.. code-block:: shell -*Clients* are responsible for generating individual model parameter updates for the -model based on their local datasets. These updates are then sent to the *server* which -will aggregate them to produce an updated global model. Finally, the *server* sends this -improved version of the model back to each *client*. A complete cycle of parameters -updates is called a *round*. + # In a new Python environment + $ pip install flwr -Now that we have a rough idea of what is going on, let's get started. We first need to -install Flower. 
You can do this by running: +Then, run the command below. You will be prompted to select one of the available +templates (choose ``sklearn``), give a name to your project, and type in your developer +name: .. code-block:: shell - $ pip install flwr + $ flwr new -Since we want to use scikit-learn, let's go ahead and install it: +After running it you'll notice a new directory with your project name has been created. +It should have the following structure: .. code-block:: shell - $ pip install scikit-learn + + ├── + │ ├── __init__.py + │ ├── client_app.py # Defines your ClientApp + │ ├── server_app.py # Defines your ServerApp + │ └── task.py # Defines your model, training and data loading + ├── pyproject.toml # Project metadata like dependencies and configs + └── README.md -Or simply install all dependencies using Poetry: +If you haven't yet installed the project and its dependencies, you can do so by: .. code-block:: shell - $ poetry install + # From the directory where your pyproject.toml is + $ pip install -e . -Flower Client -------------- +To run the project, do: + +.. code-block:: shell + + # Run with default arguments + $ flwr run . + +With default arguments you will see an output like this one: + +.. code-block:: shell + + Loading project configuration... 
+ Success + INFO : Starting Flower ServerApp, config: num_rounds=3, no round_timeout + INFO : + INFO : [INIT] + INFO : Requesting initial parameters from one random client + INFO : Received initial parameters from one random client + INFO : Starting evaluation of initial global parameters + INFO : Evaluation returned no results (`None`) + INFO : + INFO : [ROUND 1] + INFO : configure_fit: strategy sampled 10 clients (out of 10) + INFO : aggregate_fit: received 10 results and 0 failures + WARNING : No fit_metrics_aggregation_fn provided + INFO : configure_evaluate: strategy sampled 10 clients (out of 10) + INFO : aggregate_evaluate: received 10 results and 0 failures + WARNING : No evaluate_metrics_aggregation_fn provided + INFO : + INFO : [ROUND 2] + INFO : configure_fit: strategy sampled 10 clients (out of 10) + INFO : aggregate_fit: received 10 results and 0 failures + INFO : configure_evaluate: strategy sampled 10 clients (out of 10) + INFO : aggregate_evaluate: received 10 results and 0 failures + INFO : + INFO : [ROUND 3] + INFO : configure_fit: strategy sampled 10 clients (out of 10) + INFO : aggregate_fit: received 10 results and 0 failures + INFO : configure_evaluate: strategy sampled 10 clients (out of 10) + INFO : aggregate_evaluate: received 10 results and 0 failures + INFO : + INFO : [SUMMARY] + INFO : Run finished 3 round(s) in 19.41s + INFO : History (loss, distributed): + INFO : round 1: 1.3447584261018466 + INFO : round 2: 0.9680018613482815 + INFO : round 3: 0.7667920399137523 + INFO : + +You can also override the parameters defined in the ``[tool.flwr.app.config]`` section +in ``pyproject.toml`` like this: + +.. code-block:: shell + + # Override some arguments + $ flwr run . --run-config "num-server-rounds=5 local-epochs=2" -Now that we have all our dependencies installed, let's run a simple distributed training -with two clients and one server. 
However, before setting up the client and server, we -will define all functionalities that we need for our federated learning setup within -``utils.py``. The ``utils.py`` contains different functions defining all the machine -learning basics: +What follows is an explanation of each component in the project you just created: +the dataset partitioning, the model, the ``ClientApp``, and the ``ServerApp``. -- ``get_model_parameters()`` - - Returns the parameters of a ``sklearn`` LogisticRegression model -- ``set_model_params()`` - - Sets the parameters of a ``sklearn`` LogisticRegression model -- ``set_initial_params()`` - - Initializes the model parameters that the Flower server will ask for +The Data +-------- -Please check out ``utils.py`` `here -`_ for -more details. The pre-defined functions are used in the ``client.py`` and imported. The -``client.py`` also requires to import several packages such as Flower and scikit-learn: +This tutorial uses |flowerdatasets|_ to easily download and partition the `MNIST +`_ dataset. In this example you'll make +use of the |iidpartitioner|_ to generate ``num_partitions`` partitions. You can choose +|otherpartitioners|_ available in Flower Datasets. Each ``ClientApp`` will call the +``load_data()`` function to obtain the training and test arrays for its own data partition. .. 
code-block:: python - import argparse - import warnings + partitioner = IidPartitioner(num_partitions=num_partitions) + fds = FederatedDataset( + dataset="mnist", + partitioners={"train": partitioner}, + ) - from sklearn.linear_model import LogisticRegression - from sklearn.metrics import log_loss + dataset = fds.load_partition(partition_id, "train").with_format("numpy") - import flwr as fl - import utils - from flwr_datasets import FederatedDataset + X, y = dataset["image"].reshape((len(dataset), -1)), dataset["label"] -Prior to local training, we need to load the MNIST dataset, a popular image -classification dataset of handwritten digits for machine learning, and partition the -dataset for FL. This can be conveniently achieved using `Flower Datasets -`_. The ``FederatedDataset.load_partition()`` method -loads the partitioned training set for each partition ID defined in the -``--partition-id`` argument. + # Split the on edge data: 80% train, 20% test + X_train, X_test = X[: int(0.8 * len(X))], X[int(0.8 * len(X)) :] + y_train, y_test = y[: int(0.8 * len(y))], y[int(0.8 * len(y)) :] + +The Model +--------- + +We define the |logisticregression|_ model from scikit-learn in the ``get_model()`` +function: .. 
code-block:: python - if __name__ == "__main__": - N_CLIENTS = 10 + def get_model(penalty: str, local_epochs: int): - parser = argparse.ArgumentParser(description="Flower") - parser.add_argument( - "--partition-id", - type=int, - choices=range(0, N_CLIENTS), - required=True, - help="Specifies the artificial data partition", + return LogisticRegression( + penalty=penalty, + max_iter=local_epochs, + warm_start=True, ) - args = parser.parse_args() - partition_id = args.partition_id - - fds = FederatedDataset(dataset="mnist", partitioners={"train": N_CLIENTS}) - dataset = fds.load_partition(partition_id, "train").with_format("numpy") - X, y = dataset["image"].reshape((len(dataset), -1)), dataset["label"] +To perform the training and evaluation, we will make use of the ``.fit()`` and +``.score()`` methods available in the ``LogisticRegression`` class. - X_train, X_test = X[: int(0.8 * len(X))], X[int(0.8 * len(X)) :] - y_train, y_test = y[: int(0.8 * len(y))], y[int(0.8 * len(y)) :] +The ClientApp +------------- -Next, the logistic regression model is defined and initialized with -``utils.set_initial_params()``. +The main changes we have to make to use scikit-learn with Flower will be found in the +``get_model_params()``, ``set_model_params()``, and ``set_initial_params()`` functions. +In ``get_model_params()``, the coefficients and intercept of the logistic regression +model are extracted and represented as a list of NumPy arrays. +``set_model_params()`` does the opposite: given a list of NumPy arrays, it applies +them to an existing ``LogisticRegression`` model. Finally, in ``set_initial_params()``, +we initialize the model parameters based on the MNIST dataset, which has 10 classes +(corresponding to the 10 digits) and 784 features (corresponding to the size of the +MNIST image array, which is 28 × 28). Doing this is fairly easy in scikit-learn. .. 
code-block:: python - model = LogisticRegression( - penalty="l2", - max_iter=1, # local epoch - warm_start=True, # prevent refreshing weights when fitting - ) + def get_model_params(model): + if model.fit_intercept: + params = [ + model.coef_, + model.intercept_, + ] + else: + params = [model.coef_] + return params - utils.set_initial_params(model) -The Flower server interacts with clients through an interface called ``Client``. When -the server selects a particular client for training, it sends training instructions over -the network. The client receives those instructions and calls one of the ``Client`` -methods to run your code (i.e., to fit the logistic regression we defined earlier). + def set_model_params(model, params): + model.coef_ = params[0] + if model.fit_intercept: + model.intercept_ = params[1] + return model -Flower provides a convenience class called ``NumPyClient`` which makes it easier to -implement the ``Client`` interface when your workload uses scikit-learn. Implementing -``NumPyClient`` usually means defining the following methods (``set_parameters`` is -optional though): -1. ``get_parameters`` - - return the model weight as a list of NumPy ndarrays -2. ``set_parameters`` (optional) - - update the local model weights with the parameters received from the server - - is directly imported with ``utils.set_model_params()`` -3. ``fit`` - - set the local model weights - - train the local model - - return the updated local model weights -4. ``evaluate`` - - test the local model + def set_initial_params(model): + n_classes = 10 # MNIST has 10 classes + n_features = 784 # Number of features in dataset + model.classes_ = np.array([i for i in range(10)]) -The methods can be implemented in the following way: + model.coef_ = np.zeros((n_classes, n_features)) + if model.fit_intercept: + model.intercept_ = np.zeros((n_classes,)) + +The rest of the functionality is directly inspired by the centralized case: .. 
code-block:: python - class MnistClient(fl.client.NumPyClient): - def get_parameters(self, config): # type: ignore - return utils.get_model_parameters(model) + class FlowerClient(NumPyClient): + def __init__(self, model, X_train, X_test, y_train, y_test): + self.model = model + self.X_train = X_train + self.X_test = X_test + self.y_train = y_train + self.y_test = y_test + + def fit(self, parameters, config): + set_model_params(self.model, parameters) - def fit(self, parameters, config): # type: ignore - utils.set_model_params(model, parameters) + # Ignore convergence failure due to low local epochs with warnings.catch_warnings(): warnings.simplefilter("ignore") - model.fit(X_train, y_train) - print(f"Training finished for round {config['server_round']}") - return utils.get_model_parameters(model), len(X_train), {} + self.model.fit(self.X_train, self.y_train) + + return get_model_params(self.model), len(self.X_train), {} - def evaluate(self, parameters, config): # type: ignore - utils.set_model_params(model, parameters) - loss = log_loss(y_test, model.predict_proba(X_test)) - accuracy = model.score(X_test, y_test) - return loss, len(X_test), {"accuracy": accuracy} + def evaluate(self, parameters, config): + set_model_params(self.model, parameters) + loss = log_loss(self.y_test, self.model.predict_proba(self.X_test)) + accuracy = self.model.score(self.X_test, self.y_test) + return loss, len(self.X_test), {"accuracy": accuracy} -We can now create an instance of our class ``MnistClient`` and add one line to actually -run this client: +Finally, we can construct a ``ClientApp`` using the ``FlowerClient`` defined above by +means of a ``client_fn()`` callback. Note that the ``context`` enables you to get access +to hyperparameters defined in your ``pyproject.toml`` to configure the run. In this +tutorial we access the ``local-epochs`` setting to control the number of epochs a +``ClientApp`` will perform when running the ``fit()`` method. 
You could define +additional hyperparameters in ``pyproject.toml`` and access them here. .. code-block:: python - fl.client.start_client("0.0.0.0:8080", client=MnistClient().to_client()) + def client_fn(context: Context): + # Load data and model + partition_id = context.node_config["partition-id"] + num_partitions = context.node_config["num-partitions"] + X_train, X_test, y_train, y_test = load_data(partition_id, num_partitions) + penalty = context.run_config["penalty"] + local_epochs = context.run_config["local-epochs"] + model = get_model(penalty, local_epochs) -That's it for the client. We only have to implement ``Client`` or ``NumPyClient`` and -call ``fl.client.start_client()``. If you implement a client of type ``NumPyClient`` -you'll need to first call its ``to_client()`` method. The string ``"0.0.0.0:8080"`` -tells the client which server to connect to. In our case we can run the server and the -client on the same machine, therefore we use ``"0.0.0.0:8080"``. If we run a truly -federated workload with the server and clients running on different machines, all that -needs to change is the ``server_address`` we pass to the client. + # Setting initial parameters, akin to model.compile for keras models + set_initial_params(model) + + # Return Client instance + return FlowerClient(model, X_train, X_test, y_train, y_test).to_client() -Flower Server -------------- -The following Flower server is a little bit more advanced and returns an evaluation -function for the server-side evaluation. First, we import again all required libraries -such as Flower and scikit-learn. + # Flower ClientApp + app = ClientApp(client_fn) -``server.py``, import Flower and start the server: +The ServerApp +------------- + +To construct a ``ServerApp`` we define a ``server_fn()`` callback with an identical +signature to that of ``client_fn()`` but the return type is |serverappcomponents|_ as +opposed to a |client|_. In this example we use the ``FedAvg`` strategy. 
To it we pass a +zero-initialized model that will serve as the global model to be federated. Note that +the values of ``num-server-rounds``, ``penalty``, and ``local-epochs`` are read from the +run config. You can find the default values defined in the ``pyproject.toml``. .. code-block:: python - import flwr as fl - import utils - from flwr.common import NDArrays, Scalar - from sklearn.metrics import log_loss - from sklearn.linear_model import LogisticRegression - from typing import Dict + def server_fn(context: Context): + # Read from config + num_rounds = context.run_config["num-server-rounds"] - from flwr_datasets import FederatedDataset + # Create LogisticRegression Model + penalty = context.run_config["penalty"] + local_epochs = context.run_config["local-epochs"] + model = get_model(penalty, local_epochs) -The number of federated learning rounds is set in ``fit_round()`` and the evaluation is -defined in ``get_evaluate_fn()``. The evaluation function is called after each federated -learning round and gives you information about loss and accuracy. Note that we also make -use of Flower Datasets here to load the test split of the MNIST dataset for server-side -evaluation. + # Setting initial parameters, akin to model.compile for keras models + set_initial_params(model) -.. 
code-block:: python + initial_parameters = ndarrays_to_parameters(get_model_params(model)) - def fit_round(server_round: int) -> Dict: - """Send round number to client.""" - return {"server_round": server_round} + # Define strategy + strategy = FedAvg( + fraction_fit=1.0, + fraction_evaluate=1.0, + min_available_clients=2, + initial_parameters=initial_parameters, + ) + config = ServerConfig(num_rounds=num_rounds) + return ServerAppComponents(strategy=strategy, config=config) - def get_evaluate_fn(model: LogisticRegression): - """Return an evaluation function for server-side evaluation.""" - fds = FederatedDataset(dataset="mnist", partitioners={"train": 10}) - dataset = fds.load_split("test").with_format("numpy") - X_test, y_test = dataset["image"].reshape((len(dataset), -1)), dataset["label"] + # Create ServerApp + app = ServerApp(server_fn=server_fn) - def evaluate( - server_round: int, parameters: NDArrays, config: Dict[str, Scalar] - ) -> Optional[Tuple[float, Dict[str, Scalar]]]: - utils.set_model_params(model, parameters) - loss = log_loss(y_test, model.predict_proba(X_test)) - accuracy = model.score(X_test, y_test) - return loss, {"accuracy": accuracy} +Congratulations! You've successfully built and run your first federated learning system +in scikit-learn. - return evaluate +.. note:: -The ``main`` contains the server-side parameter initialization -``utils.set_initial_params()`` as well as the aggregation strategy -``fl.server.strategy:FedAvg()``. The strategy is the default one, federated averaging -(or FedAvg), with two clients and evaluation after each federated learning round. The -server can be started with the command -``fl.server.start_server(server_address="0.0.0.0:8080", strategy=strategy, -config=fl.server.ServerConfig(num_rounds=3))``. + Check the source code of the extended version of this tutorial in + |quickstart_sklearn_link|_ in the Flower GitHub repository. -.. code-block:: python +.. 
|client| replace:: ``Client`` - # Start Flower server for three rounds of federated learning - if __name__ == "__main__": - model = LogisticRegression() - utils.set_initial_params(model) - strategy = fl.server.strategy.FedAvg( - min_available_clients=2, - evaluate_fn=get_evaluate_fn(model), - on_fit_config_fn=fit_round, - ) - fl.server.start_server( - server_address="0.0.0.0:8080", - strategy=strategy, - config=fl.server.ServerConfig(num_rounds=3), - ) +.. |fedavg| replace:: ``FedAvg`` -Train the model, federated! ---------------------------- +.. |flowerdatasets| replace:: Flower Datasets -With both client and server ready, we can now run everything and see federated learning -in action. Federated learning systems usually have a server and multiple clients. We, -therefore, have to start the server first: +.. |iidpartitioner| replace:: ``IidPartitioner`` -.. code-block:: shell +.. |logisticregression| replace:: ``LogisticRegression`` - $ python3 server.py +.. |otherpartitioners| replace:: other partitioners -Once the server is running we can start the clients in different terminals. Open a new -terminal and start the first client: +.. |serverappcomponents| replace:: ``ServerAppComponents`` -.. code-block:: shell +.. |quickstart_sklearn_link| replace:: ``examples/sklearn-logreg-mnist`` - $ python3 client.py +.. _client: ref-api/flwr.client.Client.html#client -Open another terminal and start the second client: +.. _fedavg: ref-api/flwr.server.strategy.FedAvg.html#flwr.server.strategy.FedAvg -.. code-block:: shell +.. _flowerdatasets: https://flower.ai/docs/datasets/ - $ python3 client.py +.. _iidpartitioner: https://flower.ai/docs/datasets/ref-api/flwr_datasets.partitioner.IidPartitioner.html#flwr_datasets.partitioner.IidPartitioner -Each client will have its own dataset. You should now see how the training does in the -very first terminal (the one that started the server): +.. 
_logisticregression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html -.. code-block:: shell +.. _otherpartitioners: https://flower.ai/docs/datasets/ref-api/flwr_datasets.partitioner.html + +.. _quickstart_sklearn_link: https://github.com/adap/flower/tree/main/examples/sklearn-logreg-mnist + +.. _serverappcomponents: ref-api/flwr.server.ServerAppComponents.html#serverappcomponents - INFO flower 2022-01-13 13:43:14,859 | app.py:73 | Flower server running (insecure, 3 rounds) - INFO flower 2022-01-13 13:43:14,859 | server.py:118 | Getting initial parameters - INFO flower 2022-01-13 13:43:17,903 | server.py:306 | Received initial parameters from one random client - INFO flower 2022-01-13 13:43:17,903 | server.py:120 | Evaluating initial parameters - INFO flower 2022-01-13 13:43:17,992 | server.py:123 | initial parameters (loss, other metrics): 2.3025850929940455, {'accuracy': 0.098} - INFO flower 2022-01-13 13:43:17,992 | server.py:133 | FL starting - DEBUG flower 2022-01-13 13:43:19,814 | server.py:251 | fit_round: strategy sampled 2 clients (out of 2) - DEBUG flower 2022-01-13 13:43:20,046 | server.py:260 | fit_round received 2 results and 0 failures - INFO flower 2022-01-13 13:43:20,220 | server.py:148 | fit progress: (1, 1.3365667871792377, {'accuracy': 0.6605}, 2.227397900000142) - INFO flower 2022-01-13 13:43:20,220 | server.py:199 | evaluate_round: no clients selected, cancel - DEBUG flower 2022-01-13 13:43:20,220 | server.py:251 | fit_round: strategy sampled 2 clients (out of 2) - DEBUG flower 2022-01-13 13:43:20,456 | server.py:260 | fit_round received 2 results and 0 failures - INFO flower 2022-01-13 13:43:20,603 | server.py:148 | fit progress: (2, 0.721620492535375, {'accuracy': 0.7796}, 2.6108531999998377) - INFO flower 2022-01-13 13:43:20,603 | server.py:199 | evaluate_round: no clients selected, cancel - DEBUG flower 2022-01-13 13:43:20,603 | server.py:251 | fit_round: strategy sampled 2 clients (out of 2) - 
DEBUG flower 2022-01-13 13:43:20,837 | server.py:260 | fit_round received 2 results and 0 failures - INFO flower 2022-01-13 13:43:20,967 | server.py:148 | fit progress: (3, 0.5843629244915138, {'accuracy': 0.8217}, 2.9750180000010005) - INFO flower 2022-01-13 13:43:20,968 | server.py:199 | evaluate_round: no clients selected, cancel - INFO flower 2022-01-13 13:43:20,968 | server.py:172 | FL finished in 2.975252800000817 - INFO flower 2022-01-13 13:43:20,968 | app.py:109 | app_fit: losses_distributed [] - INFO flower 2022-01-13 13:43:20,968 | app.py:110 | app_fit: metrics_distributed {} - INFO flower 2022-01-13 13:43:20,968 | app.py:111 | app_fit: losses_centralized [(0, 2.3025850929940455), (1, 1.3365667871792377), (2, 0.721620492535375), (3, 0.5843629244915138)] - INFO flower 2022-01-13 13:43:20,968 | app.py:112 | app_fit: metrics_centralized {'accuracy': [(0, 0.098), (1, 0.6605), (2, 0.7796), (3, 0.8217)]} - DEBUG flower 2022-01-13 13:43:20,968 | server.py:201 | evaluate_round: strategy sampled 2 clients (out of 2) - DEBUG flower 2022-01-13 13:43:21,232 | server.py:210 | evaluate_round received 2 results and 0 failures - INFO flower 2022-01-13 13:43:21,232 | app.py:121 | app_evaluate: federated loss: 0.5843629240989685 - INFO flower 2022-01-13 13:43:21,232 | app.py:122 | app_evaluate: results [('ipv4:127.0.0.1:53980', EvaluateRes(loss=0.5843629240989685, num_examples=10000, accuracy=0.0, metrics={'accuracy': 0.8217})), ('ipv4:127.0.0.1:53982', EvaluateRes(loss=0.5843629240989685, num_examples=10000, accuracy=0.0, metrics={'accuracy': 0.8217}))] - INFO flower 2022-01-13 13:43:21,232 | app.py:127 | app_evaluate: failures [] - -Congratulations! You've successfully built and run your first federated learning system. -The full `source code -`_ for this -example can be found in ``examples/sklearn-logreg-mnist``. +.. meta:: + :description: Check out this Federated Learning quickstart tutorial for using Flower with scikit-learn to train a logistic regression model.
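
As a rough, dependency-free illustration of the aggregation step behind the ``FedAvg``
strategy used in ``server_fn()``, the sketch below averages client parameters weighted
by the number of examples each client trained on. It is not Flower's implementation:
``weighted_average`` and the two example clients are hypothetical names introduced here,
and plain Python lists stand in for the NumPy arrays returned by ``get_model_params()``.

.. code-block:: python

    from typing import List, Tuple

    # Assumption for this sketch: a "layer" is a flat list of floats and a
    # model is a list of layers (e.g. coefficients, then intercept).
    Layer = List[float]
    Weights = List[Layer]


    def weighted_average(results: List[Tuple[Weights, int]]) -> Weights:
        """Average client weights, weighting each client by its example count."""
        total_examples = sum(num_examples for _, num_examples in results)
        num_layers = len(results[0][0])
        averaged: Weights = []
        for layer_idx in range(num_layers):
            layer_len = len(results[0][0][layer_idx])
            averaged.append(
                [
                    sum(w[layer_idx][i] * n for w, n in results) / total_examples
                    for i in range(layer_len)
                ]
            )
        return averaged


    # Two hypothetical clients, each reporting (weights, num_examples),
    # mirroring the first two values returned by FlowerClient.fit()
    client_a = ([[1.0, 2.0], [0.0]], 10)
    client_b = ([[3.0, 6.0], [4.0]], 30)
    global_weights = weighted_average([client_a, client_b])

Here the second client holds three times as many examples as the first, so its
parameters count three times as much in the averaged result.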