Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update markdown instructions for Penguin classification exercise (1) #66

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

surbhigoel77
Copy link
Member

@surbhigoel77 surbhigoel77 commented Jul 1, 2024

Amendments to Penguin classifiication exercise to aid readability.

  • Propagate solution notebook changes to exercise notebook.
  • Propagate all changes to colab. This is done by rebasing or merging into the colab branch.

@surbhigoel77 surbhigoel77 self-assigned this Jul 1, 2024
@surbhigoel77 surbhigoel77 marked this pull request as draft July 1, 2024 09:22
" - Print ``data`` and recognise that ``load_penguins`` has returned the dataframe.\n",
"- Analyse which features it might make sense to use in order to classify the species of the penguins.\n",
" - You can print the column names using ``pd.DataFrame.keys()``\n",
" - You can also obtain useful statical information on the dataset using ``pd.DataFrame.Series.describe()``"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

statistical

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 'Consider' is probably fine here, but either is fine.

@@ -108,23 +108,25 @@
"source": [
"### Task 2: creating a ``torch.utils.data.Dataset``\n",
"\n",
"All PyTorch dataset objects are subclasses of the ``torch.utils.data.Dataset`` class. To make a custom dataset, create a class which inherits from the ``Dataset`` class, implement some methods (the Python magic (or dunder) methods ``__len__`` and ``__getitem__``) and supply some data.\n",
"To be able to use Pytorch functionalities, we need to make the dataset compatible with Pytorch. We do it using PyTorch's Dataset class called ``torch.utils.data.Dataset``. \n",
Copy link
Member

@ma595 ma595 Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyTorch

Copy link
Member

@ma595 ma595 Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should keep the bit on torch.utils.data.Dataset. I really like the new comment however. Perhaps this should be the first sentence:

"To be able to use Pytorch functionalities, we need to make the dataset compatible with Pytorch. We do it using PyTorch's Dataset class called ``torch.utils.data.Dataset``. \n",

Followed by:
"All PyTorch dataset objects are subclasses of the torch.utils.data.Dataset class. To make a custom dataset, create a class which inherits from the Dataset class, implement some methods (the Python magic (or dunder) methods __len__ and __getitem__) and supply some data.\n",

Sorry, I realise this is fine. It's just been restructured.

Copy link
Member

@ma595 ma595 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments.

Copy link
Member

@ma595 ma595 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are helpful to understand the material. I've made a few comments here and there.

  • Your git profile configuration is a bit messed up. Can you please fix this.
  • You've probably already been done this, but when commiting the solutions, restart the kernel, clear the outputs and run everything through once more.
  • Make sure to restart the kernel and clear the outputs for the exercises before committing.
  • I think adding some steps (as python comments) in task 9, task 10 and task 11 would make it easier to complete. See how I've done this in PR Simpler PenguinDataset #71 for the first set of tasks. Maybe we should open another PR for these changes?

Some other things worth mentioning either in the session itself or perhaps in a later PR:

  • What are the implications of the batch size (i.e. too large and model is prone to overfitting / too small and we get inefficient use of hardware)
  • Task 6: We could mention feature scaling / normalisation using batch norm.
  • Optimiser (could include the lr here).
  • What model.train() and model.eval() are doing. Just noticed there's a mention of model.train() and model.eval() already.
  • Whether the train & valid routines lead to good code or not. (repetition - often this is packaged in a training class / alternatively use pytorch lightning.

@@ -108,23 +108,25 @@
"source": [
"### Task 2: creating a ``torch.utils.data.Dataset``\n",
"\n",
"All PyTorch dataset objects are subclasses of the ``torch.utils.data.Dataset`` class. To make a custom dataset, create a class which inherits from the ``Dataset`` class, implement some methods (the Python magic (or dunder) methods ``__len__`` and ``__getitem__``) and supply some data.\n",
"To be able to use Pytorch functionalities, we need to make the dataset compatible with Pytorch. We do it using PyTorch's Dataset class called ``torch.utils.data.Dataset``. \n",
Copy link
Member

@ma595 ma595 Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should keep the bit on torch.utils.data.Dataset. I really like the new comment however. Perhaps this should be the first sentence:

"To be able to use Pytorch functionalities, we need to make the dataset compatible with Pytorch. We do it using PyTorch's Dataset class called ``torch.utils.data.Dataset``. \n",

Followed by:
"All PyTorch dataset objects are subclasses of the torch.utils.data.Dataset class. To make a custom dataset, create a class which inherits from the Dataset class, implement some methods (the Python magic (or dunder) methods __len__ and __getitem__) and supply some data.\n",

Sorry, I realise this is fine. It's just been restructured.

" - We allow the model to learn directly from the training set—i.e. we fit the function to these data.\n",
" - During training, we monitor the model's performance on the validation set in order to check how it's doing on unseen data. Normally, people use the validation performance to determine when to stop the training process.\n",
"- Note: Here we create a training and validation set.\n",
" - We allow the model to learn from the training setn i.e. we fit the function to these data.\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setn

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has this been fixed?

" - ``train``— A boolean variable determining if the model returns the training or validation split (``True`` for training).\n",
" - ``x_tfms``— A ``Compose`` object with functions which will convert the raw input to a tensor. This argument is _optional_.\n",
" - ``x_tfms``— A ``Compose`` object with functions which will convert the raw input to a tensor. This argument is _optional_. Remember Pytorch deals with tensors only.\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about: Recall that PyTorch deals with torch.Tensorss only

" - The number of items we supply at once is called the batch size.\n",
" - The ``DataLoader`` can also randomly shuffle the data each epoch (when training).\n",
" - It allows us to load different mini-batches in parallel, which can be very useful for larger datasets and images that can't all fit in memory at once.\n",
" - The number of observations we pass at once is called the batch size.\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer 'items' and 'supply' here

@@ -323,19 +359,19 @@
"### Task 5: Creating ``DataLoaders``—and why\n",
"\n",
"- Once we have created a ``Dataset`` object, we wrap it in a ``DataLoader``.\n",
" - The ``DataLoader`` object allows us to put our inputs and targets in mini-batches, which makes for more efficient training.\n",
" - The ``DataLoader`` utility allows us to load and put inputs and targets in mini-batches, which makes iterating over the dataset during training more efficient.\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer the old comment. It's slightly more succinct.

@@ -495,14 +534,18 @@
"source": [
"### Task 8: Selecting an optimiser\n",
"\n",
"Optimiser updates the model's weights and biases on the basis of the gradient of the loss function w.r.t each weight and bias. The aim of weight updation is to reduce the loss. The weight updation goes on until the loss is minimised as much as possible. \n",
Copy link
Member

@ma595 ma595 Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The optimiser updates the model's weights...
  • I wouldn't use the word 'updation', use 'the weight update' or weight update process/procedure.
  • again, it's a little vague. "The weight update procedure proceeds until the loss is minimised subject to some stopping criterion?"

"While we talked about stochastic gradient descent in the slides, most people use the so-called [Adam optimiser](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html).\n",
"\n",
"You can think of it as a more complex and improved implementation of SGD."
"You can think of it as a slightly more complex and improved implementation of SGD."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer the old comment.

" - This activation function is good for classifcation when the result is one of ``A or B or C``.\n",
" - It's bad if you even want to assign two classification to one images—say a photo of a dog _and_ a cat.\n",
" - It turns the raw outputs, or logits, into \"psuedo probabilities\", and we take our prediction to be the most probable class.\n",
"\n",
"\n",
"- The purpose of training loop is that for each epoch, all the mini-batches are sequentially passed through the model to make predictions, compute the loss, and update the model parameters and repeat it until we receive a satisfactory performance.\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of the training loop

and update the model parameters. This is repeated until we receive a satisfactory performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is useful but it needs to be reworded.

" - This activation function is good for classifcation when the result is one of ``A or B or C``.\n",
" - It's bad if you even want to assign two classification to one images—say a photo of a dog _and_ a cat.\n",
" - It turns the raw outputs, or logits, into \"psuedo probabilities\", and we take our prediction to be the most probable class.\n",
"\n",
"\n",
"- The purpose of training loop is that for each epoch, all the mini-batches are sequentially passed through the model to make predictions, compute the loss, and update the model parameters and repeat it until we receive a satisfactory performance.\n",
"- The validation loop works similarly, but the purpose is to validate the model training's performance without any weight updates. Hence, we are not calculating any gradient during validation. This is made sure using the ``no_grad`` decorator with validation loop.\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @no_grad decorator disables gradient tracking in the function. We could point out that it is optional.

We also don't need gradients in the batch_level_accuracy() function.

" - Print ``data`` and recognise that ``load_penguins`` has returned the dataframe.\n",
"- Analyse which features it might make sense to use in order to classify the species of the penguins.\n",
" - You can print the column names using ``pd.DataFrame.keys()``\n",
" - You can also obtain useful statical information on the dataset using ``pd.DataFrame.Series.describe()``"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 'Consider' is probably fine here, but either is fine.

@ma595
Copy link
Member

ma595 commented Jul 5, 2024

This will also need to be rebased on #71

@ma595 ma595 changed the title Updated the markdown for Task1 in Exercise1 penguin Update markdown instructionss for Penguin classification exercise (1) Jul 5, 2024
@ma595 ma595 changed the title Update markdown instructionss for Penguin classification exercise (1) Update markdown instructions for Penguin classification exercise (1) Jul 5, 2024
@surbhigoel77
Copy link
Member Author

Thanks for the comments @ma595. I have made the changes as highlighted by you.

Copy link
Member

@ma595 ma595 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed a few other changes during the first review, and resolved some other requests as they were really good additions.

"\n",
"\n",
"Note: we are going to use batch normalisation layers in our network, which don't work if the batch size is one. This can happen on the last batch, if we don't choose a batch size that evenly divides the number of items in the data set. To avoid this, we can set the ``drop_last`` argument to ``True``. The last batch, which will be of size ``len(data_set) % batch_size`` gets dropped, and the data are reshuffled. This is only relevant during the training process - validation will use population statistics."
"Note: we are going to use batch normalisation in our network, which doesn't work if the batch size is one. This can happen on the last batch, if we don't choose a batch size that evenly divides the number of items in the data set. To avoid this, we can set the ``drop_last`` argument to ``True``. The last batch, which will be of size ``len(data_set) % batch_size`` gets dropped, and the data are reshuffled. This is only relevant during the training process - validation will use population statistics."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for changing frombatch normalisation layers to batch normalisation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just to lay more focus on batch normalisation as a concept. Can revert the change if that looks better

@@ -520,17 +558,20 @@
"\n",
"- Before we jump in and write these loops, we must first choose an activation function to apply to the model's outputs.\n",
" - Here we are going to use the softmax activation function: see [the PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html).\n",
" - For those of you who've studied physics, you may be remininded of the partition function in thermodynamics.\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite like this comment and would like it to be kept.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected

@surbhigoel77 surbhigoel77 requested a review from ma595 July 10, 2024 19:44
@surbhigoel77 surbhigoel77 marked this pull request as ready for review July 10, 2024 19:44
@jatkinson1000
Copy link
Member

Just doing some cleaning on this repo.
Is this stale or did it get superseded by another PR - I see there are some conflicts with the latest main.

Is there an interest in picking up and finishing this or should it be closed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants