Tutorial 1: Optimization techniques#
Week 1, Day 5: Optimization
By Neuromatch Academy
Content creators: Jose Gallego-Posada, Ioannis Mitliagkas
Content reviewers: Piyush Chauhan, Vladimir Haltakov, Siwei Bai, Kelson Shilling-Scrivo
Content editors: Charles J Edelson, Gagana B, Spiros Chavlis
Production editors: Arush Tagade, R. Krishnakumaran, Gagana B, Spiros Chavlis
Tutorial Objectives#
Objectives:
Necessity and importance of optimization
Introduction to commonly used optimization techniques
Optimization in non-convex loss landscapes
‘Adaptive’ hyperparameter tuning
Ethical concerns
Setup#
Install and import feedback gadget#
# @title Install and import feedback gadget
!pip3 install vibecheck datatops --quiet
from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
return DatatopsContentReviewContainer(
"", # No text prompt
notebook_section,
{
"url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
"name": "neuromatch_dl",
"user_key": "f379rz8y",
},
).render()
feedback_prefix = "W1D5_T1"
# Imports
import copy
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import time
import torch
import torchvision
import torchvision.datasets as datasets
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
from tqdm.auto import tqdm
Figure settings#
# @title Figure settings
import logging
logging.getLogger('matplotlib.font_manager').disabled = True
import ipywidgets as widgets # interactive display
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle")
plt.rc('axes', unicode_minus=False)
Helper functions#
# @title Helper functions
def print_params(model):
"""
Lists the name and current value of the model's
named parameters
Args:
model: an nn.Module inherited model
Represents the ML/DL model
Returns:
Nothing
"""
for name, param in model.named_parameters():
if param.requires_grad:
print(name, param.data)
Set random seed#
Executing set_seed(seed=seed) sets the random seed
# @title Set random seed
# @markdown Executing `set_seed(seed=seed)` sets the random seed
# For DL it's critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here: https://pytorch.org/docs/stable/notes/randomness.html
# Call the `set_seed` function in the exercises to ensure reproducibility.
import random
import torch
def set_seed(seed=None, seed_torch=True):
"""
Handles variability by controlling sources of randomness
through set seed values
Args:
seed: Integer
Set the seed value to given integer.
If no seed is provided, set the seed to a random integer in [0, 2**32)
seed_torch: Bool
Seeds the random number generator for all devices to
offer some guarantees on reproducibility
Returns:
Nothing
"""
if seed is None:
seed = np.random.choice(2 ** 32)
random.seed(seed)
np.random.seed(seed)
if seed_torch:
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
print(f'Random seed {seed} has been set.')
# In case that `DataLoader` is used
def seed_worker(worker_id):
"""
DataLoader will reseed workers following randomness in
multi-process data loading algorithm.
Args:
worker_id: integer
ID of subprocess to seed. 0 means that
the data will be loaded in the main process
Refer: https://pytorch.org/docs/stable/data.html#data-loading-randomness for more details
Returns:
Nothing
"""
worker_seed = torch.initial_seed() % 2**32
np.random.seed(worker_seed)
random.seed(worker_seed)
Set device (GPU or CPU). Execute set_device()#
# @title Set device (GPU or CPU). Execute `set_device()`
# Especially useful if torch modules are used.
# Informs the user whether the notebook uses GPU or CPU.
def set_device():
"""
Set the device. CUDA if available, CPU otherwise
Args:
None
Returns:
Nothing
"""
device = "cuda" if torch.cuda.is_available() else "cpu"
if device != "cuda":
print("WARNING: For this notebook to perform best, "
"if possible, in the menu under `Runtime` -> "
"`Change runtime type.` select `GPU` ")
else:
print("GPU is enabled in this notebook.")
return device
SEED = 2021
set_seed(seed=SEED)
DEVICE = set_device()
Random seed 2021 has been set.
WARNING: For this notebook to perform best, if possible, in the menu under `Runtime` -> `Change runtime type.` select `GPU`
Section 1. Introduction#
Time estimate: ~15 mins
Video 1: Introduction#
Submit your feedback#
# @title Submit your feedback
content_review(f"{feedback_prefix}_Introduction_Video")
Discuss: Unexpected consequences#
Can you think of examples from your own experience/life where poorly chosen incentives or objectives have led to unexpected consequences?
Submit your feedback#
# @title Submit your feedback
content_review(f"{feedback_prefix}_Unexpected_consequences_Discussion")
Section 2: Case study: successfully training an MLP for image classification#
Time estimate: ~40 mins
Many of the core ideas (and tricks) in modern optimization for deep learning can be illustrated in the simple setting of training an MLP to solve an image classification task. In this tutorial we will guide you through the key challenges that arise when optimizing high-dimensional, non-convex\(^\dagger\) problems. We will use these challenges to motivate and explain some commonly used solutions.
Disclaimer: Some of the functions you will code in this tutorial are already implemented in PyTorch and many other libraries. For pedagogical reasons, we decided to bring these simple coding tasks into the spotlight and place a relatively higher emphasis on your understanding of the algorithms, rather than on the use of a specific library.
In ‘day-to-day’ research projects you will likely rely on community-vetted, optimized libraries rather than the ‘manual implementations’ you will write today. In Section 8 you will have a chance to ‘put it all together’ and use the full power of PyTorch to tune the parameters of an MLP to classify handwritten digits.
\(^\dagger\): A convex function has no local minima other than its global minima - a nice property for optimization, since a gradient-based method cannot get stuck in a ‘valley’ that is not the deepest one (e.g., \(f(x)=x^2 + 2x + 1\)). A non-convex function is ‘wavy’: it can have ‘valleys’ (local minima) that are not as deep as the overall deepest ‘valley’ (the global minimum). Optimization algorithms can therefore get stuck in a local minimum, and it can be hard to tell when this happens (e.g., \(f(x) = x^4 + x^3 - 2x^2 - 2x\)). See also Section 5 for more details.
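As a minimal, illustrative sketch (not part of the tutorial's exercises), the code below runs plain gradient descent on the two example functions from the footnote. On the convex function, both starting points end up at the global minimum \(x = -1\); on the non-convex one, the end point depends on the initialization.
# Illustrative sketch: gradient descent on the convex and non-convex examples above.
# On the convex function every start point reaches the global minimum; on the
# non-convex one the result depends on where we start.
import torch

def run_gd(fn, x0, lr=0.01, steps=500):
  """Plain gradient descent on a scalar function, starting at x0."""
  x = torch.tensor(float(x0), requires_grad=True)
  for _ in range(steps):
    loss = fn(x)
    loss.backward()
    with torch.no_grad():
      x -= lr * x.grad
    x.grad.zero_()
  return x.item()

convex = lambda x: x**2 + 2*x + 1
non_convex = lambda x: x**4 + x**3 - 2*x**2 - 2*x

print(run_gd(convex, x0=3.0), run_gd(convex, x0=-3.0))          # both approach x = -1
print(run_gd(non_convex, x0=2.0), run_gd(non_convex, x0=-2.0))  # two different minima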
Video 2: Case Study - MLP Classification#
Submit your feedback#
# @title Submit your feedback
content_review(f"{feedback_prefix}_Case_study_MLP_classification_Video")
Section 2.1: Data#
We will use the MNIST dataset of handwritten digits. We load the data via the PyTorch datasets module, as you learned in W1D1.
Note: Although we could download the MNIST dataset directly from datasets using the optional argument download=True, we are going to download it from the NMA directory on OSF to ensure network reliability.
Download MNIST dataset#
# @title Download MNIST dataset
import tarfile, requests, os
fname = 'MNIST.tar.gz'
name = 'MNIST'
url = 'https://osf.io/y2fj6/download'
if not os.path.exists(name):
print('\nDownloading MNIST dataset...')
r = requests.get(url, allow_redirects=True)
with open(fname, 'wb') as fh:
fh.write(r.content)
print('\nDownloading MNIST completed.')
if not os.path.exists(name):
with tarfile.open(fname) as tar:
tar.extractall()
os.remove(fname)
else:
print('MNIST dataset has been downloaded.')
Downloading MNIST dataset...
Downloading MNIST completed.
def load_mnist_data(change_tensors=False, download=False):
"""
Load training and test examples for the MNIST handwritten digits dataset
where every image is 28 x 28 pixels with 1 channel (greyscale)
Args:
change_tensors: Bool
If True, apply normalization directly to the dataset tensors;
otherwise attach a Normalize transform to the datasets
download: Bool
Whether to download the dataset if it is not already present locally
Returns:
train_set:
train_data: Tensor
training input tensor of size (train_size x 784)
train_target: Tensor
training 0-9 integer label tensor of size (train_size)
test_set:
test_data: Tensor
test input tensor of size (test_size x 784)
test_target: Tensor
test 0-9 integer label tensor of size (test_size)
"""
# Load train and test sets
train_set = datasets.MNIST(root='.', train=True, download=download,
transform=torchvision.transforms.ToTensor())
test_set = datasets.MNIST(root='.', train=False, download=download,
transform=torchvision.transforms.ToTensor())
# Original data is in range [0, 255]. We normalize the data wrt its mean and std_dev.
# Note that we only used *training set* information to compute mean and std
mean = train_set.data.float().mean()
std = train_set.data.float().std()
if change_tensors:
# Apply normalization directly to the tensors containing the dataset
train_set.data = (train_set.data.float() - mean) / std
test_set.data = (test_set.data.float() - mean) / std
else:
tform = torchvision.transforms.Compose([torchvision.transforms.ToTensor(),
torchvision.transforms.Normalize(mean=[mean / 255.], std=[std / 255.])
])
train_set = datasets.MNIST(root='.', train=True, download=download,
transform=tform)
test_set = datasets.MNIST(root='.', train=False, download=download,
transform=tform)
return train_set, test_set
train_set, test_set = load_mnist_data(change_tensors=True)
As we are just getting started, we will concentrate on a small subset of only 500 examples out of the 60,000 data points contained in the whole training set.
# Sample a random subset of 500 distinct indices (without replacement)
subset_index = np.random.choice(len(train_set.data), 500, replace=False)
# We will use these symbols to represent the training data and labels, to stay
# as close to the mathematical expressions as possible.
X, y = train_set.data[subset_index, :], train_set.targets[subset_index]
Run the following cell to visualize the content of three examples in our training set. Note how the preprocessing we applied to the data changes the range of pixel values after normalization.
Run me!#
# @title Run me!
# Exploratory data analysis and visualisation
num_figures = 3
fig, axs = plt.subplots(1, num_figures, figsize=(5 * num_figures, 5))
for sample_id, ax in enumerate(axs):
# Plot the pixel values for each image
ax.matshow(X[sample_id, :], cmap='gray_r')
# 'Write' the pixel value in the corresponding location
for (i, j), z in np.ndenumerate(X[sample_id, :]):
text = '{:.1f}'.format(z)
ax.text(j, i, text, ha='center',
va='center', fontsize=6, c='steelblue')
ax.set_title('Label: ' + str(y[sample_id].item()))
ax.axis('off')
plt.show()
Section 2.2: Model#
As you will see next week, there are specific model architectures that are better suited to image-like data, such as Convolutional Neural Networks (CNNs). For simplicity, in this tutorial we will focus exclusively on Multi-Layer Perceptron (MLP) models as they allow us to highlight many important optimization challenges shared with more advanced neural network designs.
class MLP(nn.Module):
"""
This class implements MLPs in PyTorch with an arbitrary number of hidden
layers of potentially different sizes. Since we concentrate on classification
tasks in this tutorial, we have a log_softmax layer at prediction time.
"""
def __init__(self, in_dim=784, out_dim=10, hidden_dims=[], use_bias=True):
"""
Constructs a MultiLayerPerceptron
Args:
in_dim: Integer
dimensionality of input data (784)
out_dim: Integer
number of classes (10)
hidden_dims: List
containing the dimensions of the hidden layers,
empty list corresponds to a linear model (in_dim, out_dim)
use_bias: Bool
whether to include bias terms in the linear layers
Returns:
Nothing
"""
super(MLP, self).__init__()
self.in_dim = in_dim
self.out_dim = out_dim
# If we have no hidden layer, just initialize a linear model (e.g. in logistic regression)
if len(hidden_dims) == 0:
layers = [nn.Linear(in_dim, out_dim, bias=use_bias)]
else:
# 'Actual' MLP with dimensions in_dim - num_hidden_layers*[hidden_dim] - out_dim
layers = [nn.Linear(in_dim, hidden_dims[0], bias=use_bias), nn.ReLU()]
# Loop until before the last layer
for i, hidden_dim in enumerate(hidden_dims[:-1]):
layers += [nn.Linear(hidden_dim, hidden_dims[i + 1], bias=use_bias),
nn.ReLU()]
# Add final layer to the number of classes
layers += [nn.Linear(hidden_dims[-1], out_dim, bias=use_bias)]
self.main = nn.Sequential(*layers)
def forward(self, x):
"""
Defines the network structure and flow from input to output
Args:
x: Tensor
Image to be processed by the network
Returns:
output: Tensor
log-probabilities over the out_dim classes, of shape (batch_size, out_dim)
"""
# Flatten each image into a 'vector'
transformed_x = x.view(-1, self.in_dim)
hidden_output = self.main(transformed_x)
output = F.log_softmax(hidden_output, dim=1)
return output
Linear models constitute a very special kind of MLP: they are equivalent to an MLP with zero hidden layers. This is simply an affine transformation, in other words a ‘linear’ map \(W x\) with an ‘offset’ \(b\), followed by a softmax function.
Here \(x \in \mathbb{R}^{784}\), \(W \in \mathbb{R}^{10 \times 784}\) and \(b \in \mathbb{R}^{10}\). Notice that the dimensions of the weight matrix are \(10 \times 784\): the input tensors are flattened images, i.e., \(28 \times 28 = 784\)-dimensional tensors, and the output layer consists of \(10\) nodes. Also note that PyTorch’s nn.Linear applies the affine map to the rows of the input rather than the columns: the \(i\)-th row of the output is the image of the \(i\)-th row of the input under \(W\), plus the bias term \(b\); the short sketch below makes this concrete. For more on affine maps, see https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html#affine-maps
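As a quick, illustrative check of this row convention (the variable names here are just for demonstration), a batch of flattened images of shape (batch_size, 784) passed through nn.Linear(784, 10) yields one 10-dimensional output per row:
# Illustrative sketch: nn.Linear applies W x + b to each *row* of its input.
import torch
import torch.nn as nn

lin = nn.Linear(in_features=784, out_features=10, bias=True)
batch = torch.randn(5, 784)   # 5 flattened 28x28 'images'
out = lin(batch)              # same as batch @ lin.weight.T + lin.bias
print(out.shape)              # torch.Size([5, 10])
print(torch.allclose(out, batch @ lin.weight.T + lin.bias))  # True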
# Empty hidden_dims means we take a model with zero hidden layers.
model = MLP(in_dim=784, out_dim=10, hidden_dims=[])
# We print the model structure with 784 inputs and 10 outputs
print(model)
MLP(
(main): Sequential(
(0): Linear(in_features=784, out_features=10, bias=True)
)
)
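As a quick sanity check (illustrative, not part of the exercises), the parameter count of this linear model should be \(10 \times 784\) weights plus \(10\) biases:
# Illustrative check: 784*10 weights + 10 biases = 7850 trainable parameters.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 7850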
Section 2.3: Loss#
While we care about the accuracy of the model, the ‘discrete’ nature of the 0-1 loss makes it challenging to optimize. In order to learn good parameters for this model, we will use the cross entropy loss (negative log-likelihood), which you saw in the last lecture, as a surrogate objective to be minimized.
This particular choice of model and optimization objective leads to a convex optimization problem with respect to the parameters \(W\) and \(b\).
loss_fn = F.nll_loss
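Since our model's forward pass already applies log_softmax, pairing it with F.nll_loss amounts to the usual cross-entropy loss on the raw scores. Here is a quick illustrative check (fake scores and labels, not part of the exercises):
# Illustrative check: nll_loss on log-probabilities equals cross_entropy on raw logits.
logits = torch.randn(4, 10)               # fake scores for 4 examples, 10 classes
targets = torch.tensor([0, 3, 7, 9])      # fake integer labels
log_probs = F.log_softmax(logits, dim=1)  # what our MLP's forward() returns
print(torch.allclose(F.nll_loss(log_probs, targets),
                     F.cross_entropy(logits, targets)))  # True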
Section 2.4: Interpretability#
In the last lecture, you saw that inspecting the weights of a model can provide insights on what ‘concepts’ the model has learned. Here we show the weights of a partially trained model. The weights corresponding to each class ‘learn’ to fire when an input of the class is detected.
#@markdown Run _this cell_ to train the model. If you are curious about how the training
#@markdown takes place, double-click this cell to find out. At the end of this tutorial
#@markdown you will have the opportunity to train a more complex model on your own.
cell_verbose = False
partial_trained_model = MLP(in_dim=784, out_dim=10, hidden_dims=[])
if cell_verbose:
print('Init loss', loss_fn(partial_trained_model(X), y).item())  # Should be around np.log(10), the loss of uniform predictions over 10 classes
# Create an Adam optimizer, which uses adaptive gradient scaling and momentum (more about this in Section 7)
optimizer = optim.Adam(partial_trained_model.parameters(), lr=7e-4)
for _ in range(200):
loss = loss_fn(partial_trained_model(X), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if cell_verbose:
print('End loss', loss_fn(partial_trained_model(X), y).item()) # This should be less than 1e-2
# Show class filters of a trained model
W = partial_trained_model.main[0].weight.data.numpy()
fig, axs = plt.subplots(1, 10, figsize=(15, 4))
for class_id in range(10):
axs[class_id].imshow(W[class_id, :].reshape(28, 28), cmap='gray_r')
axs[class_id].axis('off')
axs[class_id].set_title('Class ' + str(class_id) )
plt.show()
Section 3: High dimensional search#
Time estimate: ~25 mins
We now have a model with its corresponding trainable parameters, as well as an objective to optimize. Where do we go next? How do we find a ‘good’ configuration of parameters?
One idea is to pick a random direction and move along it only if the objective decreases. However, this is inefficient in high dimensions, as the small experiment below illustrates; you will see how gradient descent (with a suitable step size) can instead guarantee consistent improvement in terms of the objective function.
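Here is a small, illustrative experiment (not part of the exercises) showing one reason random search struggles: in high dimensions a random unit direction is almost orthogonal to the gradient, so most of a random step does nothing to the loss, whereas the negative gradient is always a descent direction.
# Illustrative sketch: random directions become nearly orthogonal to the
# gradient as the dimension grows, so random search makes little progress per step.
import numpy as np

rng = np.random.default_rng(0)
for dim in [2, 10, 100, 1000, 10000]:
  grad = rng.standard_normal(dim)
  grad /= np.linalg.norm(grad)                          # unit-norm 'gradient'
  dirs = rng.standard_normal((1000, dim))
  dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # 1000 random unit directions
  align = np.abs(dirs @ grad)                           # |cosine| with the gradient
  print(f"dim={dim:6d}  mean |cos(random direction, gradient)| = {align.mean():.3f}")
# The mean alignment decays roughly like 1/sqrt(dim).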
Video 3: Optimization of an Objective Function#
Submit your feedback#
# @title Submit your feedback
content_review(f"{feedback_prefix}_Optimization_of_an_Objective_Function_Video")
Coding Exercise 3: Implement gradient descent#
In this exercise you will use PyTorch's automatic differentiation capabilities to compute the gradient of the loss with respect to the parameters of the model. You will then use these gradients to implement the update performed by the gradient descent method; a generic sketch of this update pattern is shown below for reference.
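For reference, here is the generic pattern (an illustrative sketch only, not the exercise solution; it assumes model, loss_fn, X, and y as defined above): compute the loss, call backward() to populate each parameter's .grad, update the parameters in place inside torch.no_grad(), and clear the gradients before the next step.
# Illustrative sketch of one gradient-descent step (not the exercise solution).
# Assumes `model`, `loss_fn`, `X`, `y` as defined earlier in this notebook.
lr = 1e-2                          # illustrative step size
loss = loss_fn(model(X), y)        # forward pass
loss.backward()                    # populates p.grad for every parameter
with torch.no_grad():              # parameter updates must not be tracked by autograd
  for p in model.parameters():
    p -= lr * p.grad               # move against the gradient
    p.grad.zero_()                 # reset accumulated gradients for the next step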
def zero_grad(params):
<