diff --git a/Instructions/Exercises/10-Deep-learning.md b/Instructions/Exercises/10-Deep-learning.md index 6d75d8e..613149c 100644 --- a/Instructions/Exercises/10-Deep-learning.md +++ b/Instructions/Exercises/10-Deep-learning.md @@ -108,7 +108,6 @@ Network](https://lternet.edu/). - Normalize the numeric data to a similar scale - Split the data into two datasets: one for training, and another for testing. - ```python from pyspark.sql.types import * from pyspark.sql.functions import * @@ -172,22 +171,22 @@ PyTorch makes use of *data loaders* to load training and validation data in batc Add a cell and run the following code to prepare data loaders: - ```python + ```python # Create a dataset and loader for the training data and labels train_x = torch.Tensor(x_train).float() train_y = torch.Tensor(y_train).long() train_ds = td.TensorDataset(train_x,train_y) train_loader = td.DataLoader(train_ds, batch_size=20, shuffle=False, num_workers=1) - + # Create a dataset and loader for the test data and labels test_x = torch.Tensor(x_test).float() test_y = torch.Tensor(y_test).long() test_ds = td.TensorDataset(test_x,test_y) test_loader = td.DataLoader(test_ds, batch_size=20, - shuffle=False, num_workers=1) + shuffle=False, num_workers=1) print('Ready to load data') - ``` + ``` ## Define a neural network @@ -333,7 +332,7 @@ Now you can use the **train** and **test** functions to train a neural network m - In each *epoch*, the full set of training data is passed forward through the network. There are five features for each observation, and five corresponding nodes in the input layer - so the features for each observation are passed as a vector of five values to that layer. However, for efficiency, the feature vectors are grouped into batches; so actually a matrix of multiple feature vectors is fed in each time. - The matrix of feature values is processed by a function that performs a weighted sum using initialized weights and bias values. The result of this function is then processed by the activation function for the input layer to constrain the values passed to the nodes in the next layer. - The weighted sum and activation functions are repeated in each layer. Note that the functions operate on vectors and matrices rather than individual scalar values. In other words, the forward pass is essentially a series of nested linear algebra functions. This is the reason data scientists prefer to use computers with graphical processing units (GPUs), since these are optimized for matrix and vector calculations. - - In the final layer of the network, the output vectors contain a calculated value for each possible class (in this case, classes 0, 1, and 2). This vector is processed by a *loss function* that determines how far they are from the expected values based on the actual classes - so for example, suppose the output for a Gentoo penguin (class 1) observation is \[0.3, 0.4, 0.3\]. The correct prediction would be \[0.0, 1.0, 0.0\], so the variance between the predicted and actual values (how far away each predicted value is from what it should be) is \[0.3, 0.6, 0.3\]. This variance is aggregated for each batch and maintained as a running aggregate to calculate the overall level of error (*loss*) incurred by the training data for the epoch. + - In the final layer of the network, the output vectors contain a calculated value for each possible class (in this case, classes 0, 1, and 2). This vector is processed by a *loss function* that determines how far they are from the expected values based on the actual classes - so for example, suppose the output for a Gentoo penguin (class 1) observation is \[0.3, 0.4, 0.3\]. The correct prediction would be \[0.0, 1.0, 0.0\], so the variance between the predicted and actual values (how far away each predicted value is from what it should be) is \[0.3, 0.6, 0.3\]. This variance is aggregated for each batch and maintained as a running aggregate to calculate the overall level of error (*loss*) incurred by the training data for the epoch. - At the end of each epoch, the validation data is passed through the network, and its loss and accuracy (proportion of correct predictions based on the highest probability value in the output vector) are also calculated. It's useful to do this because it enables us to compare the performance of the model after each epoch using data on which it was not trained, helping us determine if it will generalize well for new data or if it's *overfitted* to the training data. - After all the data has been passed forward through the network, the output of the loss function for the *training* data (but not the *validation* data) is passed to the optimizer. The precise details of how the optimizer processes the loss vary depending on the specific optimization algorithm being used; but fundamentally you can think of the entire network, from the input layer to the loss function as being one big nested (*composite*) function. The optimizer applies some differential calculus to calculate *partial derivatives* for the function with respect to each weight and bias value that was used in the network. It's possible to do this efficiently for a nested function due to something called the *chain rule*, which enables you to determine the derivative of a composite function from the derivatives of its inner function and outer functions. You don't really need to worry about the details of the math here (the optimizer does it for you), but the end result is that the partial derivatives tell us about the slope (or *gradient*) of the loss function with respect to each weight and bias value - in other words, we can determine whether to increase or decrease the weight and bias values in order to minimize the loss. - Having determined in which direction to adjust the weights and biases, the optimizer uses the *learning rate* to determine by how much to adjust them; and then works backwards through the network in a process called *backpropagation* to assign new values to the weights and biases in each layer.