Open In Colab   Open in Kaggle

Tutorial 2: Regularization techniques part 2#

Week 2, Day 1: Regularization

By Neuromatch Academy

Content creators: Ravi Teja Konkimalla, Mohitrajhu Lingan Kumaraian, Kevin Machado Gamboa, Kelson Shilling-Scrivo, Lyle Ungar

Content reviewers: Piyush Chauhan, Siwei Bai, Kelson Shilling-Scrivo

Content editors: Roberto Guidotti, Spiros Chavlis

Production editors: Saeed Salehi, Gagana B, Spiros Chavlis


Tutorial Objectives#

  1. Regularization as shrinkage of overparameterized models: L1 and L2

  2. Regularization by Dropout

  3. Regularization by Data Augmentation

  4. Perils of Hyper-Parameter Tuning

  5. Rethinking generalization


Setup#

Note that some of the code for today can take up to an hour to run. We have therefore “hidden” that code and shown the resulting outputs.

Install dependencies#

WARNING: There may be errors and/or warnings reported during the installation. However, they should be ignored.

Hide code cell source
# @title Install dependencies

# @markdown **WARNING**: There may be *errors* and/or *warnings* reported during the installation. However, they should be ignored.

!pip install imageio --quiet
!pip install imageio-ffmpeg --quiet

Install and import feedback gadget#

Hide code cell source
# @title Install and import feedback gadget

!pip3 install vibecheck datatops --quiet

from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
            "name": "neuromatch_dl",
            "user_key": "f379rz8y",
        },
    ).render()


feedback_prefix = "W2D1_T2"
# Imports
import copy
import torch
import random
import pathlib

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchvision import transforms
from torchvision.datasets import ImageFolder

from tqdm.auto import tqdm
from IPython.display import HTML, display

Figure Settings#

Hide code cell source
# @title Figure Settings
import logging
logging.getLogger('matplotlib.font_manager').disabled = True

import ipywidgets as widgets
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle")

Loading Animal Faces Data#

Hide code cell source
# @title Loading Animal Faces Data
import requests, os
from zipfile import ZipFile

print("Start downloading and unzipping `AnimalFaces` dataset...")
name = 'afhq'
fname = f"{name}.zip"
url = f"https://osf.io/kgfvj/download"

if not os.path.exists(fname):
  r = requests.get(url, allow_redirects=True)
  with open(fname, 'wb') as fh:
    fh.write(r.content)

  if os.path.exists(fname):
    with ZipFile(fname, 'r') as zfile:
      zfile.extractall(f".")
      os.remove(fname)

print("Download completed.")
Start downloading and unzipping `AnimalFaces` dataset...
Download completed.

Loading Animal Faces Randomized data#

Hide code cell source
# @title Loading Animal Faces Randomized data

print("Start downloading and unzipping `Randomized AnimalFaces` dataset...")

names = ['afhq_random_32x32', 'afhq_10_32x32']
urls = ["https://osf.io/9sj7p/download",
        "https://osf.io/wvgkq/download"]


for i, name in enumerate(names):
  url = urls[i]
  fname = f"{name}.zip"

  if not os.path.exists(fname):
    r = requests.get(url, allow_redirects=True)
    with open(fname, 'wb') as fh:
      fh.write(r.content)

    if os.path.exists(fname):
      with ZipFile(fname, 'r') as zfile:
        zfile.extractall(f".")
        os.remove(fname)

print("Download completed.")
Start downloading and unzipping `Randomized AnimalFaces` dataset...
Download completed.

Plotting functions#

Hide code cell source
# @title Plotting functions

def imshow(img):
  """
  Display unnormalized image

  Args:
    img: np.ndarray
      Datapoint to visualize

  Returns:
    Nothing
  """
  img = img / 2 + 0.5  # Unnormalize
  npimg = img.numpy()
  plt.imshow(np.transpose(npimg, (1, 2, 0)))
  plt.axis(False)
  plt.show()


def plot_weights(norm, labels, ws, title='Weight Size Measurement'):
  """
  Plot of weight size measurement [norm value vs layer]

  Args:
    norm: float
      Norm values
    labels: list
      Targets
    ws: list
      Weights
    title: string
      Title of plot

  Returns:
    Nothing
  """
  plt.figure(figsize=[8, 6])
  plt.title(title)
  plt.ylabel('Frobenius Norm Value')
  plt.xlabel('Model Layers')
  plt.bar(labels, ws)
  plt.axhline(y=norm,
              linewidth=1,
              color='r',
              ls='--',
              label='Total Model F-Norm')
  plt.legend()
  plt.show()


def visualize_data(dataloader):
  """
  Helper function to visualize data

  Args:
    dataloader: torch.tensor
      Dataloader to visualize

  Returns:
    Nothing
  """
  for idx, (data,label) in enumerate(dataloader):
    plt.figure(idx)
    # Choose the datapoint you would like to visualize
    index = 22

    # Choose that datapoint using index and permute the dimensions
    # and bring the pixel values between [0,1]
    data = data[index].permute(1, 2, 0) * \
           torch.tensor([0.5, 0.5, 0.5]) + \
           torch.tensor([0.5, 0.5, 0.5])

    # Convert the torch tensor into numpy
    data = data.numpy()

    plt.imshow(data)
    plt.axis(False)
    image_class = classes[label[index].item()]
    print(f'The image belongs to : {image_class}')

  plt.show()

Helper functions#

Hide code cell source
# @title Helper functions

class AnimalNet(nn.Module):
  """
  Network Class - Animal Faces with following structure:
  nn.Linear(3 * 32 * 32, 128) # Fully connected layer 1
  nn.Linear(128, 32) # Fully connected layer 2
  nn.Linear(32, 3) # Fully connected layer 3
  """

  def __init__(self):
    """
    Initialize parameters of AnimalNet

    Args:
      None

    Returns:
      Nothing
    """
    super(AnimalNet, self).__init__()
    self.fc1 = nn.Linear(3 * 32 * 32, 128)
    self.fc2 = nn.Linear(128, 32)
    self.fc3 = nn.Linear(32, 3)

  def forward(self, x):
    """
    Forward Pass of AnimalNet

    Args:
      x: torch.tensor
        Input features

    Returns:
      output: torch.tensor
        Outputs/Predictions
    """
    x = x.view(x.shape[0], -1)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    output = F.log_softmax(x, dim=1)
    return output


class Net(nn.Module):
  """
  Network Class - 2D with following structure
  nn.Linear(1, 300) + leaky_relu(self.fc1(x)) # First fully connected layer
  nn.Linear(300, 500) + leaky_relu(self.fc2(x)) # Second fully connected layer
  nn.Linear(500, 1) # Final fully connected layer
  """

  def __init__(self):
    """
    Initialize parameters of Net

    Args:
      None

    Returns:
      Nothing
    """
    super(Net, self).__init__()

    self.fc1 = nn.Linear(1, 300)
    self.fc2 = nn.Linear(300, 500)
    self.fc3 = nn.Linear(500, 1)

  def forward(self, x):
    """
    Forward pass of Net

    Args:
      x: torch.tensor
        Input features

    Returns:
      x: torch.tensor
        Output/Predictions
    """
    x = F.leaky_relu(self.fc1(x))
    x = F.leaky_relu(self.fc2(x))
    output = self.fc3(x)
    return output


class BigAnimalNet(nn.Module):
  """
  Network Class - Animal Faces with following structure:
  nn.Linear(3*32*32, 124) + leaky_relu(self.fc1(x)) # First fully connected layer
  nn.Linear(124, 64) + leaky_relu(self.fc2(x)) # Second fully connected layer
  nn.Linear(64, 3) # Final fully connected layer
  """

  def __init__(self):
    """
    Initialize parameters for BigAnimalNet

    Args:
      None

    Returns:
      Nothing
    """
    super(BigAnimalNet, self).__init__()
    self.fc1 = nn.Linear(3*32*32, 124)
    self.fc2 = nn.Linear(124, 64)
    self.fc3 = nn.Linear(64, 3)

  def forward(self, x):
    """
    Forward pass of BigAnimalNet

    Args:
      x: torch.tensor
        Input features

    Returns:
      x: torch.tensor
        Output/Predictions
    """
    x = x.view(x.shape[0],-1)
    x = F.leaky_relu(self.fc1(x))
    x = F.leaky_relu(self.fc2(x))
    x = self.fc3(x)
    output = F.log_softmax(x, dim=1)
    return output


def train(args, model, train_loader, optimizer, epoch,
          reg_function1=None, reg_function2=None, criterion=F.nll_loss):
  """
  Trains the current input model using the data
  from Train_loader and Updates parameters for a single pass

  Args:
    args: dictionary
      Dictionary with epochs: 200, lr: 5e-3, momentum: 0.9, device: DEVICE
    model: nn.module
      Neural network instance
    train_loader: torch.loader
      Input dataset
    optimizer: function
      Optimizer
    reg_function1: function
      Regularisation function [default: None]
    reg_function2: function
      Regularisation function [default: None]
    criterion: function
      Specifies loss function [default: nll_loss]

  Returns:
    model: nn.module
      Neural network instance post training
  """
  device = args['device']
  model.train()
  for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    # L1 regularization
    if reg_function2 is None and reg_function1 is not None:
      loss = criterion(output, target) + args['lambda1']*reg_function1(model)
    # L2 regularization
    elif reg_function1 is None and reg_function2 is not None:
      loss = criterion(output, target) + args['lambda2']*reg_function2(model)
    # No regularization
    elif reg_function1 is None and reg_function2 is None:
      loss = criterion(output, target)
    # Both L1 and L2 regularizations
    else:
      loss = criterion(output, target) + args['lambda1']*reg_function1(model) + args['lambda2']*reg_function2(model)
    loss.backward()
    optimizer.step()

  return model


def test(model, test_loader, loader='Test', criterion=F.nll_loss,
         device='cpu'):
  """
  Tests the current model

  Args:
    model: nn.module
      Neural network instance
    device: string
      GPU/CUDA if available, CPU otherwise
    test_loader: torch.loader
      Test dataset
    criterion: function
      Specifies loss function [default: nll_loss]

  Returns:
    acc: float
      Test accuracy (percentage of correctly classified samples)
  """
  model.eval()
  test_loss = 0
  correct = 0
  with torch.no_grad():
    for data, target in test_loader:
      data, target = data.to(device), target.to(device)
      output = model(data)
      test_loss += criterion(output, target, reduction='sum').item()  # sum up batch loss
      pred = output.argmax(dim=1, keepdim=True)  # Get the index of the max log-probability
      correct += pred.eq(target.view_as(pred)).sum().item()

  test_loss /= len(test_loader.dataset)
  return 100. * correct / len(test_loader.dataset)


def main(args, model, train_loader, val_loader, test_data,
         reg_function1=None, reg_function2=None, criterion=F.nll_loss):
  """
  Trains the model with train_loader and
  tests the learned model using val_loader

  Args:
    args: dictionary
      Dictionary with epochs: 200, lr: 5e-3, momentum: 0.9, device: DEVICE
    model: nn.module
      Neural network instance
    train_loader: torch.loader
      Train dataset
    val_loader: torch.loader
      Validation set
    reg_function1: function
      Regularisation function [default: None]
    reg_function2: function
      Regularisation function [default: None]

  Returns:
    val_acc_list: list
      Log of validation accuracy
    train_acc_list: list
      Log of training accuracy
    param_norm_list: list
      Log of frobenius norm
    trained_model: nn.module
      Trained model/model post training
  """
  device = args['device']

  model = model.to(device)
  optimizer = optim.SGD(model.parameters(), lr=args['lr'], momentum=args['momentum'])

  val_acc_list, train_acc_list,param_norm_list = [], [], []
  for epoch in tqdm(range(args['epochs'])):
    trained_model = train(args, model, train_loader, optimizer, epoch,
                          reg_function1=reg_function1,
                          reg_function2=reg_function2)
    train_acc = test(trained_model, train_loader, loader='Train', device=device)
    val_acc = test(trained_model, val_loader, loader='Val', device=device)
    param_norm = calculate_frobenius_norm(trained_model)
    train_acc_list.append(train_acc)
    val_acc_list.append(val_acc)
    param_norm_list.append(param_norm)

  return val_acc_list, train_acc_list, param_norm_list, model


def calculate_frobenius_norm(model):
    """
    Function to calculate frobenius norm

    Args:
      model: nn.module
        Neural network instance

    Returns:
      norm: float
        Frobenius norm
    """
    norm = 0.0
    # Sum the square of all parameters
    for name,param in model.named_parameters():
        norm += torch.norm(param).data**2
    # Return a square root of the sum of squares of all the parameters
    return norm**0.5


def early_stopping_main(args, model, train_loader, val_loader, test_data):
  """
  Function to simulate early stopping

  Args:
    args: dictionary
      Dictionary with epochs: 200, lr: 5e-3, momentum: 0.9, device: DEVICE
    model: nn.module
      Neural network instance
    train_loader: torch.loader
      Train dataset
    val_loader: torch.loader
      Validation set

  Returns:
    val_acc_list: list
      Val accuracy log until early stop point
    train_acc_list: list
      Training accuracy log until early stop point
    best_model: nn.module
      Model performing best with early stopping
    best_epoch: int
      Epoch at which early stopping occurs
  """
  device = args['device']

  model = model.to(device)
  optimizer = optim.SGD(model.parameters(), lr=args['lr'], momentum=args['momentum'])

  best_acc  = 0.0
  best_epoch = 0

  # Number of successive epochs that you want to wait before stopping training process
  patience = 20

  # Keeps track of the number of epochs during which val_acc was less than best_acc
  wait = 0

  val_acc_list, train_acc_list = [], []
  for epoch in tqdm(range(args['epochs'])):
    trained_model = train(args, model, train_loader, optimizer, epoch)
    train_acc = test(trained_model, train_loader, loader='Train', device=device)
    val_acc = test(trained_model, val_loader, loader='Val', device=device)
    if (val_acc > best_acc):
      best_acc = val_acc
      best_epoch = epoch
      best_model = copy.deepcopy(trained_model)
      wait = 0
    else:
      wait += 1
    if (wait > patience):
      print(f'Early stopped on epoch: {epoch}')
      break
    train_acc_list.append(train_acc)
    val_acc_list.append(val_acc)

  return val_acc_list, train_acc_list, best_model, best_epoch

Set random seed#

Executing set_seed(seed=seed) you are setting the seed

Hide code cell source
# @title Set random seed
# @markdown Executing `set_seed(seed=seed)` you are setting the seed

# For DL its critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here: https://pytorch.org/docs/stable/notes/randomness.html

# Call `set_seed` function in the exercises to ensure reproducibility.
import random
import torch

def set_seed(seed=None, seed_torch=True):
  """
  Function that controls randomness. NumPy and random modules must be imported.

  Args:
    seed : Integer
      A non-negative integer that defines the random state. Default is `None`.
    seed_torch : Boolean
      If `True` sets the random seed for pytorch tensors, so pytorch module
      must be imported. Default is `True`.

  Returns:
    Nothing.
  """
  if seed is None:
    seed = np.random.choice(2 ** 32)
  random.seed(seed)
  np.random.seed(seed)
  if seed_torch:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

  print(f'Random seed {seed} has been set.')


# In case that `DataLoader` is used
def seed_worker(worker_id):
  """
  DataLoader will reseed workers following randomness in
  multi-process data loading algorithm.

  Args:
    worker_id: integer
      ID of subprocess to seed. 0 means that
      the data will be loaded in the main process
      Refer: https://pytorch.org/docs/stable/data.html#data-loading-randomness for more details

  Returns:
    Nothing
  """
  worker_seed = torch.initial_seed() % 2**32
  np.random.seed(worker_seed)
  random.seed(worker_seed)

Set device (GPU or CPU). Execute set_device()#

Hide code cell source
# @title Set device (GPU or CPU). Execute `set_device()`
# especially if torch modules used.

# Inform the user if the notebook uses GPU or CPU.

def set_device():
  """
  Set the device. CUDA if available, CPU otherwise

  Args:
    None

  Returns:
    Nothing
  """
  device = "cuda" if torch.cuda.is_available() else "cpu"
  if device != "cuda":
    print("WARNING: For this notebook to perform best, "
        "if possible, in the menu under `Runtime` -> "
        "`Change runtime type.`  select `GPU` ")
  else:
    print("GPU is enabled in this notebook.")

  return device
SEED = 2021
set_seed(seed=SEED)
DEVICE = set_device()
Random seed 2021 has been set.
WARNING: For this notebook to perform best, if possible, in the menu under `Runtime` -> `Change runtime type.`  select `GPU` 

Dataloaders for the Dataset#

Hide code cell source
# @title Dataloaders for the Dataset
## Dataloaders for the Dataset
batch_size = 128
classes = ('cat', 'dog', 'wild')

train_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ])
data_path = pathlib.Path('.')/'afhq' # Using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)


####################################################
g_seed = torch.Generator()
g_seed.manual_seed(SEED)


## Dataloaders for the  Original Dataset
img_train_data, img_val_data,_ = torch.utils.data.random_split(img_dataset,
                                                               [100, 100, 14430])

# Creating train_loader and Val_loader
train_loader = torch.utils.data.DataLoader(img_train_data,
                                           batch_size=batch_size,
                                           worker_init_fn=seed_worker,
                                           num_workers=2,
                                           generator=g_seed)
val_loader = torch.utils.data.DataLoader(img_val_data,
                                         batch_size=1000,
                                         num_workers=2,
                                         worker_init_fn=seed_worker,
                                         generator=g_seed)

# Creating test dataset
test_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ])
img_test_dataset = ImageFolder(data_path/'val', transform=test_transform)


####################################################

## Dataloaders for the  Random Dataset

# Splitting randomized data into training and validation data
data_path = pathlib.Path('.')/'afhq_random_32x32/afhq_random' # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)
random_img_train_data, random_img_val_data,_ = torch.utils.data.random_split(img_dataset, [100,100,14430])

# Randomized train and validation dataloader
rand_train_loader = torch.utils.data.DataLoader(random_img_train_data,
                                                batch_size=batch_size,
                                                num_workers=2,
                                                worker_init_fn=seed_worker,
                                                generator=g_seed)
rand_val_loader = torch.utils.data.DataLoader(random_img_val_data,
                                              batch_size=1000,
                                              num_workers=2,
                                              worker_init_fn=seed_worker,
                                              generator=g_seed)

####################################################

## Dataloaders for the Partially Random Dataset

# Splitting data between training and validation dataset for partially randomized data
data_path = pathlib.Path('.')/'afhq_10_32x32/afhq_10' # using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)
partially_random_train_data, partially_random_val_data, _ = torch.utils.data.random_split(img_dataset, [100,100,14430])

# Training and Validation loader for partially randomized data
partial_rand_train_loader = torch.utils.data.DataLoader(partially_random_train_data,
                                                        batch_size=batch_size,
                                                        num_workers=2,
                                                        worker_init_fn=seed_worker,
                                                        generator=g_seed)
partial_rand_val_loader = torch.utils.data.DataLoader(partially_random_val_data,
                                                      batch_size=1000,
                                                      num_workers=2,
                                                      worker_init_fn=seed_worker,
                                                      generator=g_seed)

Section 1: L1 and L2 Regularization#

Time estimate: ~30 mins

Video 1: L1 and L2 regularization#

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_L1_and_L2_regularization_Video")

Some of you might have already come across L1 and L2 regularization in other courses. L1 and L2 are the most common types of regularization; they modify the general cost function by adding an extra term known as the regularization term.


(58)#\[\begin{equation} \text{Cost function} = Loss(\text{e.g., binary cross entropy}) + \text{Regularization term} \end{equation}\]

This regularization term makes the parameters smaller, giving simpler models that will overfit less.

Discuss with your teammates whether the above assumption is good or bad.

Section 1.1: Unregularized Model#

Dataloaders for Regularization#

Hide code cell source
# @markdown #### Dataloaders for Regularization
data_path = pathlib.Path('.')/'afhq' # Using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)

# Splitting dataset
reg_train_data, reg_val_data,_ = torch.utils.data.random_split(img_dataset,
                                                               [30, 100, 14500])
g_seed = torch.Generator()
g_seed.manual_seed(SEED)

# Creating train_loader and Val_loader
reg_train_loader = torch.utils.data.DataLoader(reg_train_data,
                                               batch_size=batch_size,
                                               worker_init_fn=seed_worker,
                                               num_workers=2,
                                               generator=g_seed)
reg_val_loader = torch.utils.data.DataLoader(reg_val_data,
                                             batch_size=1000,
                                             worker_init_fn=seed_worker,
                                             num_workers=2,
                                             generator=g_seed)

Now let’s train a model without regularization and keep it aside as our benchmark for this section.

# Set the arguments
args = {
    'epochs': 150,
    'lr': 5e-3,
    'momentum': 0.99,
    'device': DEVICE,
}

# Initialize the model
set_seed(seed=SEED)
model = AnimalNet()

# Train the model
val_acc_unreg, train_acc_unreg, param_norm_unreg, _ = main(args,
                                                           model,
                                                           reg_train_loader,
                                                           reg_val_loader,
                                                           img_test_dataset)

# Train and Test accuracy plot
plt.figure()
plt.plot(val_acc_unreg, label='Val Accuracy', c='red', ls='dashed')
plt.plot(train_acc_unreg, label='Train Accuracy', c='red', ls='solid')
plt.axhline(y=max(val_acc_unreg), c='green', ls='dashed')
plt.title('Unregularized Model')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()
print(f"Maximum Validation Accuracy reached: {max(val_acc_unreg)}")
Random seed 2021 has been set.
../../../_images/d4618b73ea68f74dc2eabce47ef4d38a95dce08ea47545d6a94f4006209abc71.png
Maximum Validation Accuracy reached: 51.0

Section 1.2: L1 Regularization#

L1 Regularization (or LASSO\(^{\ddagger}\)) uses a penalty which is the sum of the absolute value of all the weights in the Deep Learning architecture, resulting in the following loss function (\(L\) is the usual Cross-Entropy loss):

(59)#\[\begin{equation} L_R = L + \lambda \sum \left| w^{(r)}_{ij} \right| \end{equation}\]

where \(r\) denotes the layer, and \(ij\) the specific weight in that layer.

At a high level, L1 Regularization is similar to L2 Regularization since it leads to smaller weights (you will see the analogy in the next subsection). It results in the following weight update equation when using Stochastic Gradient Descent:

(60)#\[\begin{equation} w^{(r)}_{ij}←w^{(r)}_{ij} − \eta \cdot \lambda \cdot \text{sgn}\left(w^{(r)}_{ij}\right)−\eta \frac{\partial L}{\partial w_{ij}^{(r)}} \end{equation}\]

where \(\text{sgn}(\cdot)\) is the sign function, such that

(61)#\[\begin{equation} \text{sgn}(w) = \left\{ \begin{array}{ll} +1 & \mbox{if } w > 0 \\ -1 & \mbox{if } w < 0 \\ 0 & \mbox{if } w = 0 \end{array} \right. \end{equation}\]

\(^{\ddagger}\)LASSO: Least Absolute Shrinkage and Selection Operator
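To make the update rule concrete, here is a minimal sketch of Eq. (60) applied to a toy weight tensor; `w`, `grad_L`, `eta`, and `lam` below are illustrative stand-ins, not quantities defined elsewhere in this tutorial.

# Minimal sketch of the L1 update rule (Eq. 60); all values are illustrative
w = torch.tensor([0.5, -1.2, 0.0])       # Toy weights
grad_L = torch.tensor([0.1, -0.3, 0.2])  # Stand-in for dL/dw from backpropagation
eta, lam = 1e-2, 1e-3                    # Learning rate and L1 strength

# w <- w - eta*lambda*sgn(w) - eta*dL/dw  (note sgn(0) = 0, so exact zeros stay zero)
w = w - eta * lam * torch.sign(w) - eta * grad_L
print(w)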

Coding Exercise 1.1: L1 Regularization#

Write a function that calculates the L1 norm of all the tensors of a PyTorch model.

def l1_reg(model):
  """
  This function calculates the l1 norm of the all the tensors in the model

  Args:
    model: nn.module
      Neural network instance

  Returns:
    l1: float
      L1 norm of the all the tensors in the model
  """
  l1 = 0.0
  ####################################################################
  # Fill in all missing code below (...),
  # then remove or comment the line below to test your function
  raise NotImplementedError("Complete the l1_reg function")
  ####################################################################
  for param in model.parameters():
    l1 += ...

  return l1


set_seed(seed=SEED)
## uncomment to test
# net = nn.Linear(20, 20)
# print(f"L1 norm of the model: {l1_reg(net)}")
Random seed 2021 has been set.
Random seed 2021 has been set.
L1 norm of the model: 48.445133209228516

Click for solution

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_L1_regularization_Exercise")

Now, let’s train a classifier that uses L1 regularization. Tune the hyperparameter lambda1 such that the validation accuracy is higher than that of the unregularized model.

# Set the arguments
args1 = {
    'test_batch_size': 1000,
    'epochs': 150,
    'lr': 5e-3,
    'momentum': 0.99,
    'device': DEVICE,
    'lambda1': 0.001  # <<<<<<<< Tune the hyperparameter lambda1
}

# Initialize the model
set_seed(seed=SEED)
model = AnimalNet()

# Train the model
val_acc_l1reg, train_acc_l1reg, param_norm_l1reg, _ = main(args1,
                                                           model,
                                                           reg_train_loader,
                                                           reg_val_loader,
                                                           img_test_dataset,
                                                           reg_function1=l1_reg)

# Train and Test accuracy plot
plt.figure()
plt.plot(val_acc_l1reg, label='Val Accuracy L1 Regularized',
         c='red', ls='dashed')
plt.plot(train_acc_l1reg, label='Train Accuracy L1 regularized',
         c='red', ls='solid')
plt.axhline(y=max(val_acc_l1reg), c='green', ls='dashed')
plt.title('L1 regularized model')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()
print(f"Maximum Validation Accuracy Reached: {max(val_acc_l1reg)}")

What value of the lambda1 hyperparameter worked for L1 Regularization?

Note: the \(\lambda\) in the equations is written as lambda1 in the code for clarity.

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Tune_lambda1_Exercise")

Section 1.3: L2 / Ridge Regularization#

L2 Regularization (or Ridge), also referred to as “Weight Decay”, is widely used. It works by adding a quadratic penalty term to the Cross-Entropy Loss Function \(L\), which results in a new Loss Function \(L_R\) given by:

(62)#\[\begin{equation} L_R = L + \lambda \sum \left( w^{(r)}_{ij} \right)^2 \end{equation}\]

where, again, \(r\) superscript denotes the layer, and \(ij\) the specific weight in that layer.

To get further insight into L2 Regularization, we investigate its effect on the Gradient Descent based update equations for the weight and bias parameters. Taking the derivative on both sides of the above equation, we obtain

(63)#\[\begin{equation} \frac{\partial L_R}{\partial w^{(r)}_{ij}}=\frac{\partial L}{\partial w^{(r)}_{ij}} + 2\lambda w^{(r)}_{ij} \end{equation}\]

Thus the weight update rule becomes:

(64)#\[\begin{equation} w^{(r)}_{ij}←w^{(r)}_{ij}−η\frac{\partial L}{\partial w^{(r)}_{ij}}−2 \eta \lambda w^{(r)}_{ij}=(1−2 \eta \lambda)w^{(r)}_{ij} − \eta \frac{\partial L}{\partial w^{(r)}_{ij}} \end{equation}\]

where \(\eta\) is the learning rate.
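In practice, you rarely add the L2 penalty to the loss by hand: PyTorch’s SGD optimizer provides a built-in weight_decay argument that applies this update directly. Below is a minimal sketch with illustrative values; note that weight_decay adds weight_decay times w to the gradient, so it corresponds to \(2\lambda\) in the notation above.

# Equivalent L2 regularization via the optimizer; values are illustrative
model_wd = AnimalNet()
optimizer_wd = optim.SGD(model_wd.parameters(), lr=5e-3, momentum=0.99,
                         weight_decay=2 * 1e-3)  # weight_decay corresponds to 2*lambda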

Coding Exercise 1.2: L2 Regularization#

Write a function that calculates the L2 norm of all the tensors of a PyTorch model. (What did we call this before?)

def l2_reg(model):
  """
  This function calculates the l2 norm of the all the tensors in the model

  Args:
    model: nn.module
      Neural network instance

  Returns:
    l2: float
      L2 norm of the all the tensors in the model
  """

  l2 = 0.0
  ####################################################################
  # Fill in all missing code below (...),
  # then remove or comment the line below to test your function
  raise NotImplementedError("Complete the l2_reg function")
  ####################################################################
  for param in model.parameters():
    l2 += ...

  return l2


set_seed(SEED)
## uncomment to test
# net = nn.Linear(20, 20)
# print(f"L2 norm of the model: {l2_reg(net)}")
Random seed 2021 has been set.
Random seed 2021 has been set.
L2 norm of the model: 7.328375816345215

Click for solution

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_L2_Ridge_Regularization_Exercise")

Now we’ll train a classifier that uses L2 regularization. Tune the hyperparameter lambda2 such that the validation accuracy is higher than that of the unregularized model.

# Set the arguments
args2 = {
    'test_batch_size': 1000,
    'epochs': 150,
    'lr': 5e-3,
    'momentum': 0.99,
    'device': DEVICE,
    'lambda2': 0.001  # <<<<<<<< Tune the hyperparameter lambda2
}

# Initialize the model
set_seed(seed=SEED)
model = AnimalNet()

# Train the model
val_acc_l2reg, train_acc_l2reg, param_norm_l2reg, model = main(args2,
                                                               model,
                                                               train_loader,
                                                               val_loader,
                                                               img_test_dataset,
                                                               reg_function2=l2_reg)

## Train and Test accuracy plot
plt.figure()
plt.plot(val_acc_l2reg, label='Val Accuracy L2 regularized',
         c='red', ls='dashed')
plt.plot(train_acc_l2reg, label='Train Accuracy L2 regularized',
         c='red', ls='solid')
plt.axhline(y=max(val_acc_l2reg), c='green', ls='dashed')
plt.title('L2 Regularized Model')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()
print(f"Maximum Validation Accuracy reached: {max(val_acc_l2reg)}")

What value of the lambda2 hyperparameter worked for L2 Regularization?

Note: the \(\lambda\) in the equations is written as lambda2 in the code for clarity.

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Tune_lambda2_Exercise")

Now, let’s run a model with both L1 and L2 regularization terms.

Hide code cell source
# @markdown Visualize all of them together (Run Me!)

# @markdown `lambda1=0.001` and `lambda2=0.001`

args3 = {
    'test_batch_size': 1000,
    'epochs': 150,
    'lr': 5e-3,
    'momentum': 0.99,
    'device': DEVICE,
    'lambda1': 0.001,
    'lambda2': 0.001
}

# Initialize the model
set_seed(seed=SEED)
model = AnimalNet()
val_acc_l1l2reg, train_acc_l1l2reg, param_norm_l1l2reg, _ = main(args3,
                                                                 model,
                                                                 train_loader,
                                                                 val_loader,
                                                                 img_test_dataset,
                                                                 reg_function1=l1_reg,
                                                                 reg_function2=l2_reg)

plt.figure()

plt.plot(val_acc_l2reg, c='red', ls='dashed')
plt.plot(train_acc_l2reg,
         label=f"L2 regularized, $\lambda_2$={args2['lambda2']}",
         c='red', ls='solid')
plt.axhline(y=max(val_acc_l2reg), c='red', ls='dashed')

plt.plot(val_acc_l1reg, c='green', ls = 'dashed')
plt.plot(train_acc_l1reg,
         label=f"L1 regularized, $\lambda_1$={args1['lambda1']}",
         c='green', ls='solid')
plt.axhline(y=max(val_acc_l1reg), c='green', ls='dashed')

plt.plot(val_acc_unreg, c='blue', ls = 'dashed')
plt.plot(train_acc_unreg,
         label='Unregularized', c='blue', ls='solid')
plt.axhline(y=max(val_acc_unreg), c='blue', ls='dashed')

plt.plot(val_acc_l1l2reg, c='orange', ls='dashed')
plt.plot(train_acc_l1l2reg,
         label=f"L1+L2 regularized, $\lambda_1$={args3['lambda1']}, $\lambda_2$={args3['lambda2']}",
         c='orange', ls='solid')
plt.axhline(y=max(val_acc_l1l2reg), c='orange', ls = 'dashed')

plt.xlabel('Epoch')
plt.ylabel('Accuracy (%)')
plt.legend()
plt.show()

Now, let’s visualize what these different regularizations do to the model’s parameters. We observe the effect by computing the size (technically, the Frobenius norm).

x =  param_norm_unreg[0]
print(x)
tensor(7.3810)

Visualize Norm of the Models (Train Me!)#

Hide code cell source
# @markdown #### Visualize Norm of the Models (Train Me!)
plt.figure()
plt.plot([i.cpu().numpy() for i in param_norm_unreg],
         label='Unregularized', c='blue')
plt.plot([i.cpu().numpy() for i in param_norm_l1reg],
         label='L1 Regularized', c='green')
plt.plot([i.cpu().numpy() for i in param_norm_l2reg],
         label='L2 Regularized', c='red')
plt.plot([i.cpu().numpy() for i in param_norm_l1l2reg],
         label='L1+L2 Regularized', c='orange')
plt.xlabel('Epoch')
plt.ylabel('Parameter Norms')
plt.legend()
plt.show()

In the above plots, you should have seen that the validation accuracies fluctuate even after the model achieves 100% train accuracy. Thus, the model is still trying to learn something. Why would this be the case?


Section 2: Dropout#

Time estimate: ~25 mins

Video 2: Dropout#

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Dropout_Video")

With Dropout, we literally drop out (zero out) some neurons during training. On each training iteration, standard dropout zeroes out a randomly chosen fraction (usually 50%) of the nodes in a layer before the subsequent layer is computed. Randomly selecting a different subset to drop on every iteration introduces noise into the process and reduces overfitting.


Dropout
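Before applying dropout inside a network, here is a minimal sketch of what a single nn.Dropout layer does (the input tensor below is just an illustrative example): during training it zeroes entries at random and rescales the survivors by \(1/(1-p)\); in eval mode it is the identity.

set_seed(seed=SEED)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()    # Training mode: roughly half the entries are zeroed,
print(drop(x))  # and the survivors are scaled by 1/(1-p) = 2.0

drop.eval()     # Eval mode: dropout is the identity
print(drop(x))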

Now let’s revisit the toy dataset we generated above to visualize how Dropout stabilizes training on a noisy dataset. We will slightly modify the architecture we used above (the Net class) to add dropout layers.

class NetDropout(nn.Module):
  """
  Network Class - 2D with the following structure:
  nn.Linear(1, 300) + leaky_relu(self.dropout1(self.fc1(x))) # First fully connected layer with 0.4 dropout
  nn.Linear(300, 500) + leaky_relu(self.dropout2(self.fc2(x))) # Second fully connected layer with 0.2 dropout
  nn.Linear(500, 1) # Final fully connected layer
  """

  def __init__(self):
    """
    Initialize parameters of NetDropout

    Args:
      None

    Returns:
      Nothing
    """
    super(NetDropout, self).__init__()

    self.fc1 = nn.Linear(1, 300)
    self.fc2 = nn.Linear(300, 500)
    self.fc3 = nn.Linear(500, 1)
    # We add two dropout layers
    self.dropout1 = nn.Dropout(0.4)
    self.dropout2 = nn.Dropout(0.2)

  def forward(self, x):
    """
    Forward pass of NetDropout

    Args:
      x: torch.tensor
        Input features

    Returns:
      output: torch.tensor
        Output/Predictions
    """
    x = F.leaky_relu(self.dropout1(self.fc1(x)))
    x = F.leaky_relu(self.dropout2(self.fc2(x)))
    output = self.fc3(x)
    return output

Run to train the default network#

Hide code cell source
# @markdown #### Run to train the default network
set_seed(seed=SEED)

# Creating train data
X = torch.rand((10, 1))
X.sort(dim = 0)
Y = 2*X + 2*torch.empty((X.shape[0], 1)).normal_(mean=0, std=1)  # adding small error in the data

X = X.unsqueeze_(1)
Y = Y.unsqueeze_(1)

# Creating test dataset
X_test = torch.linspace(0, 1, 40)
X_test = X_test.reshape((40, 1, 1))

# Train the network on toy dataset
model = Net()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
max_epochs = 10000
iters = 0

running_predictions = np.empty((40, (int)(max_epochs/500 + 1)))

train_loss = []
test_loss = []
model_norm = []

for epoch in tqdm(range(max_epochs)):

  # Training
  model_norm.append(calculate_frobenius_norm(model))
  model.train()
  optimizer.zero_grad()
  predictions = model(X)
  loss = criterion(predictions,Y)
  loss.backward()
  optimizer.step()

  train_loss.append(loss.data)
  model.eval()
  Y_test = model(X_test)
  loss = criterion(Y_test, 2*X_test)
  test_loss.append(loss.data)

  if (epoch % 500 == 0 or epoch == max_epochs - 1):
    running_predictions[:, iters] = Y_test[:, 0, 0].detach().numpy()
    iters += 1
Random seed 2021 has been set.
# Train the network on toy dataset

# Initialize the model
set_seed(seed=SEED)
model = NetDropout()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
max_epochs = 10000
iters = 0

running_predictions_dp = np.empty((40, (int)(max_epochs / 500)))

train_loss_dp = []
test_loss_dp = []
model_norm_dp = []

for epoch in tqdm(range(max_epochs)):

  # Training
  model_norm_dp.append(calculate_frobenius_norm(model))
  model.train()
  optimizer.zero_grad()
  predictions = model(X)
  loss = criterion(predictions, Y)
  loss.backward()
  optimizer.step()

  train_loss_dp.append(loss.data)
  model.eval()
  Y_test = model(X_test)
  loss = criterion(Y_test, 2*X_test)
  test_loss_dp.append(loss.data)

  if (epoch % 500 == 0 or epoch == max_epochs):
    running_predictions_dp[:, iters] = Y_test[:, 0, 0].detach().numpy()
    iters += 1
Random seed 2021 has been set.

Now that we have finished the training, let’s see how the model has evolved over the training process.

Animation! (Run Me!)

Hide code cell source
# @markdown Animation! (Run Me!)
set_seed(seed=SEED)

fig = plt.figure(figsize=(8, 6))
ax = plt.axes()

def frame(i):
  ax.clear()
  ax.scatter(X[:, 0, :].numpy(), Y[:, 0, :].numpy())
  plot = ax.plot(X_test[:, 0, :].detach().numpy(),
                 running_predictions_dp[:, i])
  title = f"Epoch: {i*500}"
  plt.title(title)
  ax.set_xlabel("X axis")
  ax.set_ylabel("Y axis")
  return plot


anim = animation.FuncAnimation(fig, frame, frames=range(20),
                               blit=False, repeat=False,
                               repeat_delay=10000)
html_anim = HTML(anim.to_html5_video());
plt.close()
display(html_anim)
Random seed 2021 has been set.
../../../_images/ce809a8a834928d947122c77642e349f47c4064db12f65a74e9fc63786f0788b.png

Plot the train and test losses with epoch

Hide code cell source
# @markdown Plot the train and test losses with epoch

plt.figure()
plt.plot(test_loss_dp, label='Test loss dropout', c='blue', ls='dashed')
plt.plot(test_loss, label='Test loss', c='red', ls='dashed')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.title('Dropout vs Without dropout')
plt.legend()
plt.show()
../../../_images/3232f7809d268182a0f8059c16cab3579404a66700c0f4726140c7f23b69cb60.png

Plot the train and test losses with epoch

Hide code cell source
# @markdown Plot the train and test losses with epoch

plt.figure()
plt.plot(train_loss_dp, label='Train loss dropout', c='blue', ls='dashed')
plt.plot(train_loss, label='Train loss', c='red', ls='dashed')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.title('Dropout vs Without dropout')
plt.legend()
plt.show()
../../../_images/69329031c1c6d34bdfb5a4319bd6d43d74340eece276cc96a632ca200308fdfa.png

Plot model weights with epoch

Hide code cell source
# @markdown Plot model weights with epoch
plt.figure()
plt.plot(model_norm_dp, label='Dropout')
plt.plot(model_norm, label='No dropout')
plt.ylabel('Norm of the model')
plt.xlabel('Epochs')
plt.legend()
plt.title('Size of the model vs Epochs')
plt.show()
../../../_images/4d06762532ee31e95c8b45f2b0c427d051941f65d2c5c21f6d9ec7dc2e46fe7f.png

Think! 2.1: Dropout#

Do you think this (with dropout) performed better than the initial model (without dropout)?

Click for solution

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Dropout_Discussion")

Section 2.1: Dropout Implementation Caveats#

  • Dropout is used only during training; the complete network is used during testing. It is therefore vital to call the model.eval() method before testing the model (see the sketch after this list).

  • Dropout reduces the capacity of the model during training, and hence, as a general practice, wider networks are used with dropout. If you are using a dropout probability of 0.5 in a layer, you might want to double the number of hidden neurons in that layer.
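As a minimal illustration of the first caveat (using the NetDropout class defined above and an arbitrary input): if the model is left in training mode, dropout stays active and repeated forward passes on the same input give different outputs; after calling model.eval(), the output is deterministic.

set_seed(seed=SEED)
net = NetDropout()
x = torch.ones(4, 1)

net.train()                  # Dropout active: repeated passes give different outputs
print(net(x).flatten())
print(net(x).flatten())

net.eval()                   # Dropout disabled: the output is deterministic
print(net(x).flatten())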

Now, let’s see how dropout fares on the “Animal Faces” dataset. We first modify the existing model to include dropout and then train it.

class AnimalNetDropout(nn.Module):
  """
  Network Class - Animal Faces with following structure
  nn.Linear(3*32*32, 248) + leaky_relu(self.dropout1(self.fc1(x))) # First fully connected layer with 0.5 dropout
  nn.Linear(248, 210) + leaky_relu(self.dropout2(self.fc2(x))) # Second fully connected layer with 0.3 dropout
  nn.Linear(210, 3) # Final fully connected layer
  """

  def __init__(self):
    """
    Initialize parameters of AnimalNetDropout

    Args:
      None

    Returns:
      Nothing
    """
    super(AnimalNetDropout, self).__init__()
    self.fc1 = nn.Linear(3*32*32, 248)
    self.fc2 = nn.Linear(248, 210)
    self.fc3 = nn.Linear(210, 3)
    self.dropout1 = nn.Dropout(p=0.5)
    self.dropout2 = nn.Dropout(p=0.3)

  def forward(self, x):
    """
    Forward pass of AnimalNetDropout

    Args:
      x: torch.tensor
        Input features

    Returns:
      x: torch.tensor
        Output/Predictions
    """
    x = x.view(x.shape[0], -1)
    x = F.leaky_relu(self.dropout1(self.fc1(x)))
    x = F.leaky_relu(self.dropout2(self.fc2(x)))
    x = self.fc3(x)
    output = F.log_softmax(x, dim=1)
    return output
# Set the arguments
args = {
    'test_batch_size': 1000,
    'epochs': 200,
    'lr': 5e-3,
    'batch_size': 32,
    'momentum': 0.9,
    'device': DEVICE,
    'log_interval': 100
}

# Initialize the model
set_seed(seed=SEED)
model = AnimalNetDropout()

# Train the model with Dropout
val_acc_dropout, train_acc_dropout, _, model_dp = main(args,
                                                       model,
                                                       train_loader,
                                                       val_loader,
                                                       img_test_dataset)

# Initialize the BigAnimalNet model
set_seed(seed=SEED)
model = BigAnimalNet()

# Train the model
val_acc_big, train_acc_big, _, model_big = main(args,
                                                model,
                                                train_loader,
                                                val_loader,
                                                img_test_dataset)


# Train and Test accuracy plot
plt.figure()
plt.plot(val_acc_big, label='Val - Big', c='blue', ls='dashed')
plt.plot(train_acc_big, label='Train - Big', c='blue', ls='solid')
plt.plot(val_acc_dropout, label='Val - DP', c='magenta', ls='dashed')
plt.plot(train_acc_dropout, label='Train - DP', c='magenta', ls='solid')
plt.title('Dropout')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()
Random seed 2021 has been set.
Random seed 2021 has been set.
../../../_images/0a6682ebad41e5e8f5c5fca370ac5c11f4a8577ed486b3ebe9372c284ae4fa5b.png

Think! 2.2: Dropout caveats#

When do you think dropout can perform badly, and do you think its placement within a model matters?

Click for solution

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Dropout_Caveats_Discussion")

Section 3: Data Augmentation#

Time estimate: ~15 mins

Video 3: Data Augmentation#

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Data_Augmentation_Video")

Data augmentation is often used to increase the number of training samples. Now we will explore the effects of data augmentation on regularization. Here, regularization is achieved by adding noise to the training data after every epoch.

PyTorch’s torchvision module provides several built-in data augmentation techniques, which we can use on image datasets. Some of the techniques we use most frequently are listed below (a small sketch follows the list):

  • Random Crop

  • Random Rotate

  • Vertical Flip

  • Horizontal Flip
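Below is a minimal sketch (with illustrative parameter values) combining two transforms from the list that are not used in the code that follows: RandomRotation and RandomCrop. Because the random transforms are re-applied every time an image is loaded, the network effectively never sees exactly the same training image twice.

# Illustrative augmentation pipeline; the rotation range, crop size, and padding
# are example values and should be adapted to the dataset's image size
example_transforms = transforms.Compose([
     transforms.RandomRotation(degrees=15),  # Rotate by up to +/- 15 degrees
     transforms.RandomCrop(32, padding=4),   # Pad, then crop back to 32x32
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ])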

Data Loader without Data Augmentation#

Hide code cell source
# @markdown ####  Data Loader without Data Augmentation

# For reproducibility
g_seed = torch.Generator()
g_seed.manual_seed(SEED)


train_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ])
data_path = pathlib.Path('.')/'afhq' # Using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)

# Splitting dataset
img_train_data, img_val_data,_ = torch.utils.data.random_split(img_dataset, [250,100,14280])

# Creating train_loader and Val_loader
train_loader = torch.utils.data.DataLoader(img_train_data,
                                           batch_size=batch_size,
                                           num_workers=2,
                                           worker_init_fn=seed_worker,
                                           generator=g_seed)
val_loader = torch.utils.data.DataLoader(img_val_data,
                                         batch_size=1000,
                                         num_workers=2,
                                         worker_init_fn=seed_worker,
                                         generator=g_seed)

Define a DataLoader using torchvision.transforms, which randomly augments the data for us. For more info, see here.

# Data Augmentation using transforms
new_transforms = transforms.Compose([
                                     transforms.RandomHorizontalFlip(p=0.1),
                                     transforms.RandomVerticalFlip(p=0.1),
                                     transforms.ToTensor(),
                                     transforms.Normalize((0.5, 0.5, 0.5),
                                                          (0.5, 0.5, 0.5))
                                     ])

data_path = pathlib.Path('.')/'afhq'  # Using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=new_transforms)
# Splitting dataset
new_train_data, _,_ = torch.utils.data.random_split(img_dataset,
                                                    [250, 100, 14280])

# For reproducibility
g_seed = torch.Generator()
g_seed.manual_seed(SEED)

# Creating train_loader and Val_loader
new_train_loader = torch.utils.data.DataLoader(new_train_data,
                                               batch_size=batch_size,
                                               worker_init_fn=seed_worker,
                                               generator=g_seed)
# Set the arguments
args = {
    'epochs': 250,
    'lr': 1e-3,
    'momentum': 0.99,
    'device': DEVICE,
}

# Initialize the model
set_seed(seed=SEED)
model_aug = AnimalNet()

# Train the model
val_acc_dataaug, train_acc_dataaug, param_norm_dataaug, _ = main(args,
                                                                 model_aug,
                                                                 new_train_loader,
                                                                 val_loader,
                                                                 img_test_dataset)
# Initialize the model
set_seed(seed=SEED)
model_pure = AnimalNet()

val_acc_pure, train_acc_pure, param_norm_pure, _, = main(args,
                                                         model_pure,
                                                         train_loader,
                                                         val_loader,
                                                         img_test_dataset)


# Train and Test accuracy plot
plt.figure()
plt.plot(val_acc_pure, label='Val Accuracy Pure',
         c='red', ls='dashed')
plt.plot(train_acc_pure, label='Train Accuracy Pure',
         c='red', ls='solid')
plt.plot(val_acc_dataaug, label='Val Accuracy data augment',
         c='blue', ls='dashed')
plt.plot(train_acc_dataaug, label='Train Accuracy data augment',
         c='blue', ls='solid')
plt.axhline(y=max(val_acc_pure), c='red', ls='dashed')
plt.axhline(y=max(val_acc_dataaug), c='blue', ls='dashed')
plt.title('Data Augmentation')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()
Random seed 2021 has been set.
Random seed 2021 has been set.
../../../_images/5ac17e614df30bfc3dc109425f93ca209f5cf5d1bec11749f6d93fe4e0a58702.png
# Plot together: without and with augmentation
plt.figure()
plt.plot([i.cpu().numpy().item() for i in param_norm_pure],
         c='red', label='Without Augmentation')
plt.plot([i.cpu().numpy().item() for i in param_norm_dataaug],
         c='blue', label='With Augmentation')
plt.title('Norm of parameters as a function of training epoch')
plt.xlabel('Epoch')
plt.ylabel('Norm of model parameters')
plt.legend()
plt.show()
../../../_images/745a0aae9506cfd93546421123ba72a96a88018dcf96863978294011e13cb216.png

Think! 3.1: Data Augmentation#

Can you think of more ways of augmenting the training data? (Think of other problems beyond object recognition.)

Click for solution

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Data_Augmentation_Discussuion")

Think! 3.2: Overparameterized vs. Small NN#

Why is it better to regularize an overparameterized ANN than to start with a smaller one? Think about the regularization methods you know. Each group should have a 10 min discussion.

Click for solution

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Overparameterized_vs_Small_NN_Discussuion")

Section 4: Stochastic Gradient Descent#

Time estimate: ~20 mins

Video 4: SGD#

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_SGD_Video")

Section 4.1: Learning Rate#

In this section, we will see how the learning rate can act as a regularizer while training a neural network. In summary:

  • Smaller learning rates regularize less and slowly converge to deep minima.

  • Larger learning rates regularize more by skipping over narrow local minima and converging to broader, flatter minima, which often generalize better.

But beware, a very large learning rate may result in overshooting or finding a bad local minimum.

In the block below, we will train the AnimalNet model with different learning rates and see how that affects the regularization.

Generating Data Loaders#

Hide code cell source
# @markdown #### Generating Data Loaders

# For reproducibility
g_seed = torch.Generator()
g_seed.manual_seed(SEED)

batch_size = 128
train_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ])

data_path = pathlib.Path('.')/'afhq' # Using pathlib to be compatible with all OS's
img_dataset = ImageFolder(data_path/'train', transform=train_transform)
img_train_data, img_val_data, = torch.utils.data.random_split(img_dataset, [11700,2930])

full_train_loader = torch.utils.data.DataLoader(img_train_data,
                                                batch_size=batch_size,
                                                num_workers=2,
                                                worker_init_fn=seed_worker,
                                                generator=g_seed)
full_val_loader = torch.utils.data.DataLoader(img_val_data,
                                              batch_size=1000,
                                              num_workers=2,
                                              worker_init_fn=seed_worker,
                                              generator=g_seed)

test_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ])
img_test_dataset = ImageFolder(data_path/'val', transform=test_transform)
# With dataloaders: img_test_loader = DataLoader(img_test_dataset, batch_size=batch_size,shuffle=False, num_workers=1)
classes = ('cat', 'dog', 'wild')
# Set the arguments
args = {
    'test_batch_size': 1000,
    'epochs': 20,
    'batch_size': 32,
    'momentum': 0.99,
    'device': DEVICE
}

learning_rates = [5e-4, 1e-3, 5e-3]
acc_dict = {}

for i, lr in enumerate(learning_rates):
  # Initialize the model
  set_seed(seed=SEED)
  model = AnimalNet()
  # Learning rate
  args['lr'] = lr
  # Train the model
  val_acc, train_acc, param_norm, _ = main(args,
                                           model,
                                           train_loader,
                                           val_loader,
                                           img_test_dataset)
  # Store the outputs
  acc_dict[f'val_{i}'] = val_acc
  acc_dict[f'train_{i}'] = train_acc
  acc_dict[f'param_norm_{i}'] = param_norm
Random seed 2021 has been set.
Random seed 2021 has been set.
Random seed 2021 has been set.

Plot Train and Validation accuracy (Run me)

Hide code cell source
# @markdown Plot Train and Validation accuracy (Run me)
plt.figure()
for i, lr in enumerate(learning_rates):
  plt.plot(acc_dict[f'val_{i}'], linestyle='dashed',
          label=f'lr={lr:0.1e} - validation')
  plt.plot(acc_dict[f'train_{i}'], label=f'{lr:0.1e} - train')

  print(f"Maximum Test Accuracy obtained with lr={lr:0.1e}: {max(acc_dict[f'val_{i}'])}")

plt.title('Optimal Learning Rate')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.show()
Maximum Test Accuracy obtained with lr=5.0e-04: 36.0
Maximum Test Accuracy obtained with lr=1.0e-03: 42.0
Maximum Test Accuracy obtained with lr=5.0e-03: 49.0
../../../_images/edbb326e4d895b36597d04d6913d706ca3cae2c9175e06a75be36f59d95b04bd.png

Plot parametric norms (Run me)

Hide code cell source
# @markdown Plot parametric norms (Run me)
plt.figure()
for i, lr in enumerate(learning_rates):
  plt.plot([i.cpu().numpy().item() for i in acc_dict[f'param_norm_{i}']],
           label=f'lr={lr:0.2e}')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Parameter norms')
plt.show()
../../../_images/29d3cec3db9e2678ba442967bdc5840875261472a3c2367e642b7249b7ac662d.png

In the model above, we observe something different from what we expected. Why do you think this is happening?


Section 5: Hyperparameter Tuning#

Time estimate: ~5 mins

Video 5: Hyperparameter tuning#

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Hyperparameter_tuning_Video")

Hyperparameter tuning is often tricky and time-consuming, and it is a vital part of training any Deep Learning model to give good generalization. There are a few techniques that we can use to guide us during the search.

  • Grid Search: Try all possible combinations of hyperparameters

  • Random Search: Randomly try different combinations of hyperparameters

  • Coordinate-wise Gradient Descent: Start at one set of hyperparameters and try changing them one at a time; accept any change that reduces your validation error

  • Bayesian Optimization / Auto ML: Start from a set of hyperparameters that have worked well on a similar problem, and then do some sort of local exploration (e.g., gradient descent) from there.

There are many choices, like what range to explore over, which parameter to optimize first, etc. Some hyperparameters don’t matter much (people use a dropout of either 0.5 or 0.2, but not much else). Others can matter a lot more (e.g., size and depth of the neural net). The key is to see what worked on similar problems.
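As a concrete (and deliberately tiny) example, here is a minimal grid-search sketch over the two regularization strengths used earlier in this tutorial. It reuses main, AnimalNet, l1_reg, l2_reg, and the Section 1 dataloaders defined above; the candidate values and the short epoch budget are illustrative choices, not tuned recommendations.

# Minimal grid search over (lambda1, lambda2); candidate values and epochs are illustrative
best_val, best_combo = 0.0, None
for lambda1 in [1e-4, 1e-3, 1e-2]:
  for lambda2 in [1e-4, 1e-3, 1e-2]:
    grid_args = {'epochs': 20, 'lr': 5e-3, 'momentum': 0.99, 'device': DEVICE,
                 'lambda1': lambda1, 'lambda2': lambda2}
    set_seed(seed=SEED)
    val_acc, _, _, _ = main(grid_args, AnimalNet(),
                            reg_train_loader, reg_val_loader, img_test_dataset,
                            reg_function1=l1_reg, reg_function2=l2_reg)
    if max(val_acc) > best_val:
      best_val, best_combo = max(val_acc), (lambda1, lambda2)

print(f"Best validation accuracy: {best_val} with (lambda1, lambda2) = {best_combo}")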

One can automate the process of tuning the network architecture using so-called Neural Architecture Search (NAS). NAS designs new architectures from a few building blocks (linear layers, convolutional layers, etc.) and optimizes the design based on performance using a wide range of techniques such as grid search, reinforcement learning, gradient descent, evolutionary algorithms, etc. This obviously requires very high computing power. Read this article to learn more about NAS.

Think! 5: Overview of regularization techniques#

Which regularization technique from today do you think had the most significant effect on the network? Why do you think so? Can you apply all of the regularization methods to the same network?

Click for solution

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Overview_of_regularization_techniques_Discussion")

Summary#

Congratulations! You have finished the first week of NMA-DL!

In this tutorial, you learned more regularization techniques, i.e., L1 and L2 regularization, Dropout, and Data Augmentation. Finally, you have seen that the learning rate of SGD can act as a regularizer. An interesting paper can be found here.

Continue to the Bonus material on Adversarial Attacks if you have time left!


Daily survey#

Don’t forget to complete your reflections and content check in the daily survey! Please be patient after logging in as there is a small delay before you will be redirected to the survey.

button link to survey


Bonus: Adversarial Attacks#

Time estimate: ~15 mins

Video 6: Adversarial Attacks#

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Adversarial_Attacks_Bonus_Video")

Designing perturbations to the input data to trick a machine learning model is called an “adversarial attack”. These attacks are an inevitable consequence of learning in high-dimensional spaces with complex decision boundaries. Depending on the application, such attacks can be very dangerous.


https://raw.githubusercontent.com/NeuromatchAcademy/course-content-dl/main/tutorials/static/AdversarialAttacks_w1d5t2.png
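One classic way to construct such a perturbation (not covered above, shown here only as an illustration) is the Fast Gradient Sign Method (FGSM, Goodfellow et al., 2015): nudge the input in the direction that increases the loss. The sketch below assumes a trained classifier like the AnimalNet models above and an input batch x with labels y, and it omits details such as clamping the perturbed input back to a valid range.

def fgsm_attack(model, x, y, epsilon=0.05):
  """Return an adversarially perturbed copy of the input batch x (illustrative sketch)."""
  x_adv = x.clone().detach().requires_grad_(True)
  loss = F.nll_loss(model(x_adv), y)  # The models above output log-probabilities
  loss.backward()
  # Step in the direction that *increases* the loss
  return (x_adv + epsilon * x_adv.grad.sign()).detach()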

Hence, we need to build models which can defend against such attacks. One possible way to do it is by regularizing the networks, which smooths the decision boundaries. A few ways of building models robust to such attacks are:

  • Defensive Distillation: Models trained via distillation are less prone to such attacks because they are trained on soft labels, which introduces an element of randomness into the training process.

  • Feature Squeezing: Identifies adversarial inputs to a deployed (online) classifier by comparing the model’s predictions before and after squeezing (e.g., reducing the color depth of) the input.

  • SGD: You can also pick weights that minimize what the adversary is trying to maximize, using SGD (a minimal sketch follows this list).
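As a minimal sketch of the last idea, assuming the illustrative fgsm_attack helper above and a model, optimizer, and train_loader like those used earlier in this tutorial, one adversarial-training pass could look roughly like this:

# Adversarial training sketch: minimize the loss on adversarially perturbed inputs
for data, target in train_loader:
  data, target = data.to(DEVICE), target.to(DEVICE)
  data_adv = fgsm_attack(model, data, target, epsilon=0.05)  # Worst-case inputs
  optimizer.zero_grad()
  loss = F.nll_loss(model(data_adv), target)  # Minimize what the adversary maximizes
  loss.backward()
  optimizer.step()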


Read more about adversarial attacks here.