Tutorial 1: Learn how to use modern convnets#

Week 2, Day 3: Modern Convnets

By Neuromatch Academy

Content creators: Laura Pede, Richard Vogg, Marissa Weis, Timo Lüddecke, Alexander Ecker

Content reviewers: Arush Tagade, Polina Turishcheva, Yu-Fang Yang, Bettina Hein, Melvin Selim Atay, Kelson Shilling-Scrivo

Content editors: Gagana B, Roberto Guidotti, Spiros Chavlis

Production editors: Anoop Kulkarni, Roberto Guidotti, Cary Murray, Gagana B, Spiros Chavlis

Tutorial notebook is based on an initial version by Ben Heil

Tutorial Objectives#

In this tutorial we are going to learn more about Convnets. More specifically, we will:

  1. Learn about modern CNNs and Transfer Learning.

  2. Understand how architectures incorporate ideas we have about the world.

  3. Understand the operating principles underlying the basic building blocks of modern CNNs.

  4. Understand the concept of transfer learning and learn to recognize opportunities for applying it.

  5. (Bonus) Understand the speed vs. accuracy trade-off.


Install dependencies#

# @title Install dependencies
!pip install Pillow --quiet

Install and import feedback gadget#

# @title Install and import feedback gadget

!pip3 install vibecheck datatops --quiet

from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "",
            "name": "neuromatch_dl",
            "user_key": "f379rz8y",
        },
    ).render()

feedback_prefix = "W2D3_T1"
# Import libraries
import os
import time
import tqdm
import torch
import IPython
import torchvision

import numpy as np
import matplotlib.pyplot as plt

import torch.nn as nn
import torch.nn.functional as F

from torchvision import transforms
from torchvision.models import AlexNet
from torchvision.utils import make_grid
from torchvision.datasets import ImageFolder

from PIL import Image
from io import BytesIO

Figure settings#

# @title Figure settings
import logging
logging.getLogger('matplotlib.font_manager').disabled = True

import ipywidgets as widgets  # Interactive display
%config InlineBackend.figure_format = 'retina'

Set random seed#

Executing set_seed(seed=seed) you are setting the seed

# @title Set random seed

# @markdown Executing `set_seed(seed=seed)` you are setting the seed

# For DL it's critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here:

# Call `set_seed` function in the exercises to ensure reproducibility.
import random
import torch

def set_seed(seed=None, seed_torch=True):
  """
  Function that controls randomness. NumPy and random modules must be imported.

  Args:
    seed : Integer
      A non-negative integer that defines the random state. Default is `None`.
    seed_torch : Boolean
      If `True` sets the random seed for pytorch tensors, so pytorch module
      must be imported. Default is `True`.

  Returns:
    Nothing.
  """
  if seed is None:
    seed = np.random.choice(2 ** 32)
  # Seed Python's and NumPy's random number generators
  random.seed(seed)
  np.random.seed(seed)
  if seed_torch:
    # Seed PyTorch on CPU and on all GPUs, and make cuDNN deterministic
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

  print(f'Random seed {seed} has been set.')

# In case that `DataLoader` is used
def seed_worker(worker_id):
  """
  DataLoader will reseed workers following randomness in
  multi-process data loading algorithm.

  Args:
    worker_id: integer
      ID of subprocess to seed. 0 means that
      the data will be loaded in the main process.
      Refer to the PyTorch `DataLoader` documentation for more details.

  Returns:
    Nothing
  """
  worker_seed = torch.initial_seed() % 2**32
  np.random.seed(worker_seed)
  random.seed(worker_seed)

Set device (GPU or CPU). Execute set_device()#

# @title Set device (GPU or CPU). Execute `set_device()`
# especially if torch modules used.

# Inform the user if the notebook uses GPU or CPU.

def set_device():
  """
  Set the device. CUDA if available, CPU otherwise

  Args:
    None

  Returns:
    device: string
      "cuda" if a GPU is available, "cpu" otherwise
  """
  device = "cuda" if torch.cuda.is_available() else "cpu"
  if device != "cuda":
    print("WARNING: For this notebook to perform best, "
        "if possible, in the menu under `Runtime` -> "
        "`Change runtime type.`  select `GPU` ")
  else:
    print("GPU is enabled in this notebook.")

  return device
SEED = 2021
set_seed(seed=SEED)
DEVICE = set_device()
Random seed 2021 has been set.
WARNING: For this notebook to perform best, if possible, in the menu under `Runtime` -> `Change runtime type.`  select `GPU` 

Section 1: Modern CNNs and Transfer Learning#

Time estimate: ~25mins

Video 1: Modern CNNs and Transfer Learning#

Submit your feedback

# @title Submit your feedback

Images are high dimensional. That is to say that image_length * image_width * image_channels is a big number, and multiplying that big number by a normal sized fully-connected layer leads to a ton of parameters to learn. Yesterday, we learned about convolutional neural networks, one way of working around high dimensionality in images and other domains.

The widget below (i.e., Interactive Demo 1) calculates the parameters required for a single convolutional or fully connected layer that operates on an image of a certain height and width.

Recall that the number of parameters of a convolutional layer \(l\) is calculated as:

\[\begin{equation} \text{num_of_params}_l = \left[ \left( H \times W \times K_{l-1} \right) + 1 \right] \times K_l \end{equation}\]

where \(H\) denotes the height of the filter, \(W\) the width of the filter, \(K_{l-1}\) the number of input channels (i.e., the number of filters in the previous layer), and \(K_l\) the number of filters in the \(l\)-th layer. The added \(1\) accounts for the bias term of each filter.

A fully connected layer, by contrast, contains:

\[\begin{equation} \text{num_of_params}_l = \left[ \left( N_{l-1} \times N_l \right) + 1 \times N_l \right] \end{equation}\]

where \(N_l\) denotes the number of nodes in the \(l\)-th layer.
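As a quick sanity check of the two formulas, the following sketch evaluates them in plain Python for the layer sizes used in this tutorial (a 128x128 RGB input, 3x3 filters, and 256 filters or nodes). The helper names `conv_params` and `fc_params` are illustrative and not part of the notebook.

```python
def conv_params(filter_h, filter_w, in_channels, out_channels):
  """Parameters of one conv layer: (H * W * K_{l-1} + 1) * K_l."""
  return (filter_h * filter_w * in_channels + 1) * out_channels


def fc_params(in_nodes, out_nodes):
  """Parameters of one fully connected layer: N_{l-1} * N_l + N_l."""
  return in_nodes * out_nodes + out_nodes


image_width, image_channels = 128, 3
n_conv = conv_params(3, 3, image_channels, 256)
n_fc = fc_params(image_channels * image_width ** 2, 256)

print(f"Conv layer: {n_conv} parameters")
print(f"Fully connected layer: {n_fc} parameters")
```

The fully connected count is dominated by the product of the flattened image size and the number of nodes, whereas the convolutional count does not depend on the image size at all.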

Adjust the sliders to gain an intuition for how different model and data characteristics affect the number of parameters your model needs to fit.

Note: these classes are designed to show parameter scaling in the first layer of a network. To be useful in practice, they would need more layers, an activation function, etc.

class FullyConnectedNet(nn.Module):
  """
  Fully connected network with the following structure:
  nn.Linear(self.input_size, 256)
  """

  def __init__(self):
    """
    Initialize parameters of FullyConnectedNet

    Args:
      None

    Returns:
      Nothing
    """
    super(FullyConnectedNet, self).__init__()

    image_width = 128
    image_channels = 3
    self.input_size = image_channels * image_width ** 2

    self.fc1 = nn.Linear(self.input_size, 256)

  def forward(self, x):
    """
    Forward pass of FullyConnectedNet

    Args:
      x: torch.tensor
        Input data

    Returns:
      x: torch.tensor
        Output from FullyConnectedNet
    """
    # Flatten each image into a vector before the linear layer
    x = x.view(-1, self.input_size)
    return self.fc1(x)

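class ConvNet(nn.Module):
  """
  Convolutional Neural Network
  """

  def __init__(self):
    """
    Initialize parameters of ConvNet

    Args:
      None

    Returns:
      Nothing
    """
    super(ConvNet, self).__init__()

    # 256 filters matches the parameter count reported below (7168);
    # padding=1 keeps the output the same height and width as the input
    self.conv1 = nn.Conv2d(in_channels=3,
                            out_channels=256,
                            kernel_size=(3, 3),
                            padding=1)

  def forward(self, x):
    """
    Forward pass of ConvNet

    Args:
      x: torch.tensor
        Input data

    Returns:
      x: torch.tensor
        Output after passing x through Conv2d layer
    """
    return self.conv1(x)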

Coding Exercise 1: Calculate number of parameters in FCNN vs ConvNet#

Write a function that calculates the number of parameters of a given network. Apply the function to the above defined fully-connected network and convolutional network and compare the parameter counts.

Hint: torch.numel

def get_parameter_count(network):
  """
  Calculate the number of parameters used by the fully connected/convolutional network.
  Hint: Casting the result of network.parameters() to a list may make it
        easier to work with

  Args:
    network: nn.module
      Network to calculate the parameters of fully connected/convolutional network

  Returns:
    param_count: int
      The number of parameters in the network
  """
  # Fill in all missing code below (...),
  # then remove or comment the line below to test your function
  raise NotImplementedError("Convolution math")
  # Get the network's parameters
  parameters = ...

  param_count = 0
  # Loop over all layers
  for layer in parameters:
    param_count += ...

  return param_count

# Initialize networks
fccnet = FullyConnectedNet()
convnet = ConvNet()
## Apply the above defined function to both networks by uncommenting the following lines
# print(f"FCCN parameter count: {get_parameter_count(fccnet)}")
# print(f"ConvNet parameter count: {get_parameter_count(convnet)}")
FCCN parameter count: 12583168
ConvNet parameter count: 7168

Click for solution


Interactive Demo 1: Check your results#

The widget below calculates the number of parameters in a FCNN and CNN with the same architecture as our models above. Our models had an input image that was 128x128, and used 256 filters (or 256 nodes in the FCNN case). Check that the calculations you made above are correct.

Note how few parameters the convolutional networks take, especially as you increase the input image size.
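To make the scaling concrete, here is a small sketch that tabulates both counts for a few image widths, assuming square RGB inputs, 3x3 filters, and 256 filters or nodes as above. The function name `first_layer_params` is hypothetical.

```python
def first_layer_params(image_width, units=256, image_channels=3, filter_width=3):
  """Return (cnn, fcnn) parameter counts for a single first layer, biases included."""
  cnn = (filter_width ** 2 * image_channels + 1) * units
  fcnn = (image_width ** 2 * image_channels + 1) * units
  return cnn, fcnn


for width in (32, 64, 128, 256):
  cnn, fcnn = first_layer_params(width)
  print(f"{width:>3}x{width:<3} CNN: {cnn:>6,}  FCNN: {fcnn:>11,}")
```

The CNN column stays constant because the filter weights are shared across image positions, while the FCNN column grows with the square of the image width.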

Parameter Calculator#

Run this cell to enable the widget!

# @title Parameter Calculator
# @markdown Run this cell to enable the widget!

def calculate_parameters(filter_count, image_width,
                         fcnn_nodes):
  """
  Implement how parameters
  scale as a function of image size
  between convnets and FCNN

  Args:
    filter_count: int
      Number of filters
    image_width: int
      Width of image
    fcnn_nodes: int
      Number of FCNN nodes

  Returns:
    None
  """

  filter_width = 3
  image_channels = 3

  # Assuming a square, RGB image
  image_area = image_width ** 2
  image_volume = image_area * image_channels

  # If we're using padding=same, the output of a
  # convnet will be the same shape
  # as the original image, but with more features
  fcnn_parameters = image_volume * fcnn_nodes
  cnn_parameters = image_channels * filter_count * filter_width ** 2

  # Add bias
  fcnn_parameters += fcnn_nodes
  cnn_parameters += filter_count

  print(f"CNN parameters: {cnn_parameters}")
  print(f"Fully Connected parameters: {fcnn_parameters}")

  return None

_ = widgets.interact(calculate_parameters,
                     filter_count=(16, 512, 16),
                     image_width=(16, 512, 16),
                     fcnn_nodes=(16, 512, 16))


Section 2: The History of Convnets#

Time estimate: ~15mins

Convolutional neural networks have been around for a long time. The first CNN model was published in 1980, and was based on ideas in neuroscience that predated it by decades. Why is it then that AlexNet, a CNN model published in 2012, is generally considered to mark the start of the deep learning revolution?

Watch the video below to get a better idea of the role that hardware and the internet have played in progressing deep learning.

Video 2: History of convnets#


Think! 2: Challenges of improving CNNs#

As we shall see today, the story of deep learning and CNNs has been one of scaling networks: making them bigger and deeper.

Based on what you know so far from previous days, what challenges might researchers have faced when trying to scale up CNNs and applying them to different visual recognition tasks? Do you already have some ideas how these challenges might have been addressed?

Discuss this with your group for ~10 minutes.

(Hint: labeled data, compute and memory are all finite)

Click for solution


Section 3: Big and Deep Convnets#

Time estimate: ~18mins

Video 3: AlexNet & VGG#


Section 3.1: Introduction to AlexNet#

AlexNet arguably marked the start of the current age of deep learning. It incorporates a number of the defining characteristics of successful DL today: deep networks, GPU-powered parallelization, and building blocks encoding task-specific priors. In this section you’ll have the opportunity to play with AlexNet and see the world through its eyes.

Import Alexnet#

This cell gives you the alexnet model as well as the input_image and input_batch variables used below

# @title Import Alexnet
# @markdown This cell gives you the `alexnet` model as well as the `input_image` and `input_batch` variables used below
import requests, urllib.request

# Original link:
state_dict = torch.hub.load_state_dict_from_url("")

alexnet = AlexNet()
alexnet.load_state_dict(state_dict)

url, filename = ("", "dog.jpg")
urllib.request.urlretrieve(url, filename)

input_image = Image.open(filename)
# Resize, crop and tensor conversion were lost from the source; the steps
# below are the standard ImageNet preprocessing and are assumed here
preprocess = transforms.Compose([
                                 transforms.Resize(256),
                                 transforms.CenterCrop(224),
                                 transforms.ToTensor(),
                                 transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                      std=[0.229, 0.224, 0.225]),
                                 ])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0)  # Create a mini-batch as expected by the model

# Move the input and model to GPU for speed if available
if torch.cuda.is_available():
  input_batch = input_batch.cuda()
  alexnet.cuda()

Section 3.2: What does AlexNet learn?#

This code visualizes the filters learned by AlexNet's first convolutional layer, the layer that operates directly on the input image. What do these filters remind you of?

with torch.no_grad():
  params = list(alexnet.parameters())
  fig, axs = plt.subplots(8, 8, figsize=(8, 8))
  # params[0] holds the weights of the first convolutional layer
  for filter_index in range(params[0].shape[0]):
    row_index = filter_index // 8
    col_index = filter_index % 8

    # Reorder one filter from (C, H, W) to (H, W, C) for plotting
    filter_image = params[0][filter_index].permute(1, 2, 0).cpu()
    # Rescale the weights into the [0, 1] range expected by imshow
    scale = filter_image.abs().max()
    scaled_image = filter_image / (2 * scale) + 0.5
    axs[row_index, col_index].imshow(scaled_image.numpy())
    axs[row_index, col_index].axis('off')
  plt.show()

Think! 3.2.1: Filter Similarity#

What do these filters remind you of?

Click for solution


Interactive Demo 3.2: What does AlexNet see?#

One way of visualizing CNNs is to look at the output of individual filters for a given image. Below is a widget that lets you examine the outputs of various filters used in AlexNet.
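As a minimal illustration of what "the output of a filter" means, the sketch below slides a hand-made 3x3 vertical-edge filter over a tiny grayscale image and applies a ReLU, using plain Python lists rather than AlexNet's learned filters. The image, the filter, and the helper `conv2d_relu` are all made up for illustration.

```python
def conv2d_relu(image, kernel):
  """Valid cross-correlation of a 2D image with a square 2D kernel, followed by ReLU."""
  k = len(kernel)
  out_h = len(image) - k + 1
  out_w = len(image[0]) - k + 1
  output = []
  for r in range(out_h):
    row = []
    for c in range(out_w):
      total = sum(image[r + i][c + j] * kernel[i][j]
                  for i in range(k) for j in range(k))
      row.append(max(0, total))  # ReLU keeps only positive responses
    output.append(row)
  return output


# Dark left half, bright right half: one vertical edge in the middle
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
# Responds to dark-to-bright transitions going from left to right
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = conv2d_relu(image, vertical_edge)
for row in feature_map:
  print(row)
```

The feature map is nonzero only where the filter overlaps the edge, which is the kind of pattern to look for in the widget output.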

Run this cell to enable the widget

# @markdown Run this cell to enable the widget

def alexnet_intermediate_output(net, image):
    """
    Function to extract AlexNet's intermediate output

    Args:
      net: nn.module
        AlexNet instance
      image: torch.tensor
        Input features

    Returns:
      ReLU output on processing features
    """
    return F.relu(net.features[0](image))

def browse_images(input_batch, input_image):
  """
  Helper function to browse images

  Args:
    input_batch: torch.tensor
      Input batch
    input_image: PIL.Image
      Input image

  Returns:
    Nothing
  """
  intermediate_output = alexnet_intermediate_output(alexnet, input_batch)
  n = intermediate_output.shape[1]

  def view_image(i):
    """
    Function to view incoming image frame

    Args:
      i: int
        Index of the filter to display

    Returns:
      Nothing
    """
    with torch.no_grad():
      channel = intermediate_output[0, i, :].squeeze()
      # The plotting calls were lost from the source; the three imshow
      # calls below are a reconstruction based on the panel titles
      weights = alexnet.features[0].weight[i].permute(1, 2, 0).cpu()
      weights = weights / (2 * weights.abs().max()) + 0.5
      fig, ax = plt.subplots(1, 3, figsize=(12, 6))
      ax[0].imshow(input_image)
      ax[1].imshow(weights.numpy())
      ax[1].set_xlim([-22, 33])
      ax[2].imshow(channel.cpu().numpy())
      ax[0].set_title('Input image')
      ax[1].set_title(f"Filter {i}")
      ax[2].set_title(f"Filter {i} on input image")
      [axi.set_axis_off() for axi in ax.ravel()]
      plt.show()

  widgets.interact(view_image, i=(0, n-1))

browse_images(input_batch, input_image)


Think! 3.2.2 Filter Purpose#

What do these filters appear to be doing? Note that different filters play different roles so there are several good answers.

Click for solution


Further Reading#

If the question “what are neural network filters looking for” is at all interesting to you, or if you like geometric art, you’ll enjoy this post creating images that maximize output of various CNN neurons. There is also a good article showing what the space of images looks like as models train here.

Section 4: Convnets After AlexNet#

Time estimate: ~25mins

Video 4: Residual Networks (ResNets)#


In this section we’ll be working with a state of the art CNN model called ResNet. ResNet has two particularly interesting features. First, it uses skip connections to avoid the vanishing gradient problem. Second, each block (collection of layers) in a ResNet can be treated as learning a residual function.

Mathematically, a neural network can be thought of as a series of operations that maps an input (like an image of a dog) to an output (like the label “dog”). In math-speak a mapping from an input to an output is called a function. Neural networks are a flexible way of expressing that function.

If you were to subtract out the true function mapping images to class labels from the function learned by a network, you’d be left with the residual error or “residual function”. ResNets try to learn the original function, then the residual function, then the residual of the residual, and so on, using their residual blocks and adding them to the output of the preceding layers.
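The skip-connection idea can be sketched in a few lines of plain Python: a block computes some transformation F(x) and adds its input back, so the output is x + F(x). The toy functions below are hypothetical stand-ins for the convolutional layers inside a real block.

```python
def residual_block(x, residual_fn):
  """Return x + F(x), elementwise, where F is the block's learned function."""
  fx = residual_fn(x)
  return [xi + fi for xi, fi in zip(x, fx)]


def zero_fn(x):
  """A block that has learned nothing yet: F(x) = 0."""
  return [0.0 for _ in x]


def small_correction(x):
  """A block that nudges each value by 10 percent: F(x) = 0.1 * x."""
  return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_fn))           # the input passes through unchanged
print(residual_block(x, small_correction))  # the input plus a small correction
```

Because the identity path is always available, the signal (and, during training, the gradient) can pass through the addition even when F contributes little, which is one way to see why skip connections ease the training of very deep networks.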

In this section we’ll run several images through a pre-trained ResNet and see what happens.

Download imagenette#

# @title Download imagenette
import requests, tarfile, os

fname = 'imagenette2-320'
url = ''

if not os.path.exists(fname):
  print("Data is being downloaded...")
  r = requests.get(url, stream=True)
  # Save the archive to disk
  with open(fname+'tgz', 'wb') as fd:
    fd.write(r.content)

  # Unpack the archive into the current directory
  with tarfile.open(fname+'tgz', "r") as ft:
    ft.extractall()

  print("The download has been completed.")
else:
  print("Data has already been downloaded.")
Data is being downloaded...
The download has been completed.

Set Up Textual ImageNet labels#

Map Imagenette Labels to Imagenet Labels#

# @title Map Imagenette Labels to Imagenet Labels
dir_to_imagenet_index = {
    'n03888257': 701,
    'n03425413': 571,
    'n03394916': 566,
    'n03000684': 491,
    'n02102040': 217,
    'n03445777': 574,
    'n03417042': 569,
    'n03028079': 497,
    'n02979186': 482,
    'n01440764': 0,
    }

dir_index_to_imagenet_label = {}
ordered_dirs = sorted(list(dir_to_imagenet_index.keys()))

for dir_index, dir_name in enumerate(ordered_dirs):
  dir_index_to_imagenet_label[dir_index] = dir_to_imagenet_index[dir_name]

Prepare Imagenette Data#

# @title Prepare Imagenette Data
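val_transform = transforms.Compose((transforms.Resize((256, 256)),
                                     transforms.ToTensor()))

imagenette_val = ImageFolder('imagenette2-320/val', transform=val_transform)

train_transform = transforms.Compose((transforms.Resize((256, 256)),
                                      transforms.ToTensor()))

imagenette_train = ImageFolder('imagenette2-320/train',
                               transform=train_transform)
random_indices = random.sample(range(len(imagenette_train)), 400)
imagenette_train_subset = torch.utils.data.Subset(imagenette_train,
                                                  random_indices)

# Subset to only one tenth of the data for faster runtime
random_indices = random.sample(range(len(imagenette_val)), int(len(imagenette_val) * .1))
imagenette_val = torch.utils.data.Subset(imagenette_val, random_indices)

# To preserve reproducibility
g_seed = torch.Generator()
g_seed.manual_seed(SEED)

# The loader arguments were lost from the source; batch_size=16 and
# num_workers=2 are assumptions
imagenette_train_loader = torch.utils.data.DataLoader(imagenette_train_subset,
                                                      batch_size=16,
                                                      shuffle=True,
                                                      num_workers=2,
                                                      worker_init_fn=seed_worker,
                                                      generator=g_seed)

imagenette_val_loader = torch.utils.data.DataLoader(imagenette_val,
                                                    batch_size=16,
                                                    shuffle=False,
                                                    num_workers=2,
                                                    worker_init_fn=seed_worker,
                                                    generator=g_seed)

dataiter = iter(imagenette_val_loader)
images, labels = next(dataiter)

# Show images
plt.figure(figsize=(8, 8))
plt.imshow(make_grid(images, nrow=4).permute(1, 2, 0))
plt.axis('off')
plt.show()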

eval_imagenette function#

# @title eval_imagenette function
def eval_imagenette(resnet, data_loader, dataset_length, device):
  with torch.no_grad():
    loss_sum = 0
    total_1_correct = 0
    total_5_correct = 0
    total = dataset_length
    for batch in tqdm.tqdm(data_loader):
      images, labels = batch

      # Map the imagenette labels onto the network's output
      for i, label in enumerate(labels):
          labels[i] = dir_index_to_imagenet_label[label.item()]

      images = images.to(device)
      labels = labels.to(device)
      output = resnet(images)

      # Calculate top-5 accuracy
      # Implementation from
      batch_size = labels.size(0)

      _, predictions = output.topk(5, 1, True, True)
      predictions = predictions.t()

      top_k_correct = predictions.eq(labels.view(1, -1).expand_as(predictions))
      top_k_correct = top_k_correct.sum()

      predictions = torch.argmax(output, dim=1)
      top_1_correct = torch.sum(predictions == labels)
      total_1_correct += top_1_correct
      total_5_correct += top_k_correct

    top_1_acc = total_1_correct / total
    top_5_acc = total_5_correct / total

    return top_1_acc, top_5_acc

Imagenette Train Loop#

# @title Imagenette Train Loop

def imagenette_train_loop(model, optimizer, train_loader,
                          loss_fn, device):
  """
  Training loop for Imagenette

  Args:
    model: nn.module
      Untrained model
    optimizer: function
      Optimizer
    train_loader: torch.loader
      Training loader
    loss_fn: function
      Criterion
    device: string
      If available, GPU/CUDA. CPU otherwise

  Returns:
    model: nn.module
      Trained model
  """
  for epoch in tqdm.tqdm(range(5)):
    # Put the model in training mode
    model.train()
    # Train on a batch of images
    for imagenette_batch in train_loader:
      images, labels = imagenette_batch

      # Convert labels from imagenette indices to imagenet labels
      for i, label in enumerate(labels):
        labels[i] = dir_index_to_imagenet_label[label.item()]

      images = images.to(device)
      labels = labels.to(device)
      output = model(images)
      # Standard update: clear old gradients, backpropagate, take a step
      optimizer.zero_grad()
      loss = loss_fn(output, labels)
      loss.backward()
      optimizer.step()

  return model

This cell creates a ResNet model pretrained on ImageNet, a 1000 class image prediction dataset. The model is then trained to make predictions on Imagenette, a small subset of ImageNet classes that is useful for demonstrations and prototyping.

# Original network
top_1_accuracies = []
top_5_accuracies = []

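# Instantiate a pretrained resnet model
set_seed(seed=SEED)
resnet = torchvision.models.resnet18(weights='ResNet18_Weights.DEFAULT').to(DEVICE)
resnet_opt = torch.optim.Adam(resnet.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Train on the Imagenette subset, then evaluate on the validation subset
imagenette_train_loop(resnet, resnet_opt, imagenette_train_loader,
                      loss_fn, device=DEVICE)

top_1_acc, top_5_acc = eval_imagenette(resnet,
                                       imagenette_val_loader,
                                       len(imagenette_val),
                                       device=DEVICE)
top_1_accuracies.append(top_1_acc.item())
top_5_accuracies.append(top_5_acc.item())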
Random seed 2021 has been set.

Coding Exercise 4.1: Use the ResNet model#

Complete the function below that runs a batch of images through the trained ResNet and returns the Top 5 class predictions and their probabilities. Note that the ResNet model returns unnormalized logits\(^\dagger\). To obtain probabilities, you need to normalize the logits using softmax.

\(^\dagger\) \( \text{logit}(p) = \sigma^{-1}(p) = \text{log} \left( \frac{p}{1-p} \right), \, \text{for} \, p \in (0,1)\), where \(\sigma(\cdot)\) is the sigmoid function, i.e., \(\sigma(z) = 1/(1+e^{-z})\). For more information see here.
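The normalization step can be illustrated without PyTorch. The sketch below applies a numerically stable softmax to a made-up logit vector and picks the k most probable classes; the logits and the helper names are invented for illustration and are not the ResNet outputs.

```python
import math


def softmax(logits):
  """Convert unnormalized logits into probabilities that sum to 1."""
  shift = max(logits)  # subtracting the max avoids overflow in exp
  exps = [math.exp(z - shift) for z in logits]
  total = sum(exps)
  return [e / total for e in exps]


def top_k(probs, k):
  """Return (index, probability) pairs for the k largest probabilities."""
  ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
  return ranked[:k]


logits = [2.0, -1.0, 0.5, 3.0, 0.0, 1.0]
probs = softmax(logits)
for index, prob in top_k(probs, k=3):
  print(f"class {index}: {prob:.3f}")
```

Softmax preserves the ordering of the logits, so the top-k indices can be taken from either the logits or the probabilities; only the probabilities are interpretable as confidences.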

def predict_top5(images, device, seed):
  """
  Function to predict top 5 classes

  Args:
    images: torch.tensor
      Image data with dimensionality B x C x H x W
      (batch size x number of channels x height x width)
    device: STRING
      `cuda` if GPU is available, else `cpu`.
    seed: int
      Random seed

  Returns:
    top5_names: list
      List of top 5 class names (B, 5)
    top5_probs: np.ndarray
      Top 5 class probabilities, indexed as top5_probs[:, b] for image b
  """
  # Fill in all missing code below (...),
  # then remove or comment the line below to test your function
  raise NotImplementedError("Predict top 5")

  B = images.size(0)
  with torch.no_grad():
    # Run images through model
    images = ...
    output = ...
    # The model output is unnormalized. To get probabilities, run a softmax on it.
    probs = ...
    # Fetch output from GPU and convert to numpy array
    probs = ...

  # Get top 5 predictions
  _, top5_idcs = output.topk(5, 1, True, True)
  top5_idcs = top5_idcs.t().cpu().numpy()
  top5_probs = probs[torch.arange(B), top5_idcs]

  # Convert indices to class names
  top5_names = []
  for b in range(B):
    temp = [dict_map[key].split(',')[0] for key in top5_idcs[:, b]]
    top5_names.append(temp)

  return top5_names, top5_probs

# Get batch of images
dataiter = iter(imagenette_val_loader)
images, labels = next(dataiter)

## Uncomment to test your function and retrieve top 5 predictions
# top5_names, top5_probs = predict_top5(images, DEVICE, SEED)
# print(top5_names[1])

You will see something like this:

Random seed 2021 has been set.
['gas pump', 'chain saw', 'jinrikisha', 'rifle', 'turnstile']

Click for solution

# Visualize probabilities of top 5 predictions
fig, ax = plt.subplots(5, 2, figsize=(10, 20))

for i in range(5):
  ax[i, 0].imshow(np.moveaxis(images[i].numpy(), 0, -1))
  ax[i, 0].axis('off')

  ax[i, 1].bar(np.arange(5), top5_probs[:, i])
  ax[i, 1].set_xticks(np.arange(5))
  ax[i, 1].set_xticklabels(top5_names[i], rotation=30)


Out-of-distribution examples#

The code below runs two out-of-distribution examples through the trained ResNet. Look at the predictions and discuss why the model might fail to make accurate predictions on these images.

loc = ''

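fname1 = 'bonsai-svg-5.png'
response = requests.get(loc + fname1)
# Decode the downloaded bytes and resize to the network's input size
image = Image.open(BytesIO(response.content)).resize((256, 256))
data = torch.from_numpy(np.asarray(image)[:, :, :3]) / 255.

fname2 = 'Pokémon_Pikachu_art.png'
response = requests.get(loc + fname2)
image = Image.open(BytesIO(response.content)).resize((256, 256))
data2 = torch.from_numpy(np.asarray(image)[:, :, :3]) / 255.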

images = torch.stack([data, data2]).permute(0, 3, 1, 2)
# Retrieve top 5 predictions
top5_names, top5_probs = predict_top5(images, DEVICE, SEED)
# Visualize probabilities of top 5 predictions
fig, ax = plt.subplots(2, 2, figsize=(10, 10))

for i in range(2):
  ax[i, 0].imshow(np.moveaxis(images[i].numpy(), 0, -1))
  ax[i, 0].axis('off')

  ax[i, 1].bar(np.arange(5), top5_probs[:, i])
  ax[i, 1].set_xticks(np.arange(5))
  ax[i, 1].set_xticklabels(top5_names[i], rotation=30)


Section 5: Inception + ResNeXt#

Time estimate: ~27mins

Video 5: Improving efficiency: Inception and ResNeXt#


ResNet vs ResNeXt#

Figure: a ResNet block compared with a ResNeXt block (Xie et al., 2016).

Interactive Demo 5: ResNet vs. ResNeXt#

The widgets below calculate the number of parameters in a ResNet (top) and the parameters in a ResNeXt (bottom). We assume that the number of input and output channels (or feature maps) is the same (labeled “Channels in+out” in the widget). We refer to the number of channels after the first and the second layer of one block of either ResNet or ResNeXt as “bottleneck channels”.

The sliders are currently in the position that is displayed in the figure above. The goal of the following tasks is to investigate the difference in expressiveness and numbers of parameters in ResNet and ResNeXt.
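As a worked example of the two formulas the widget implements, the sketch below assumes the block configuration shown in the figure of Xie et al.: 256 channels in and out, a 64-channel bottleneck for ResNet, and 32 paths of 4 channels each for ResNeXt. These values are assumptions taken from that paper's figure rather than from the code in this notebook, and biases are ignored as in the widget.

```python
def resnet_block_params(d_in, bottleneck):
  """1x1 reduce + 3x3 conv + 1x1 expand, biases ignored."""
  d_out = d_in
  return d_in * bottleneck + 3 * 3 * bottleneck * bottleneck + bottleneck * d_out


def resnext_block_params(d_in, path_channels, num_paths):
  """The same three layers, repeated independently in each path."""
  d_out = d_in
  d = path_channels
  return (d_in * d + 3 * 3 * d * d + d * d_out) * num_paths


resnet_count = resnet_block_params(d_in=256, bottleneck=64)
resnext_count = resnext_block_params(d_in=256, path_channels=4, num_paths=32)
print(f"ResNet block:  {resnet_count} parameters")
print(f"ResNeXt block: {resnext_count} parameters")
```

Under these assumptions the two blocks have nearly the same parameter count, even though the ResNeXt block carries 32 x 4 = 128 bottleneck channels in total.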

Parameter Calculator#

Run this cell to enable the widget

# @title Parameter Calculator
# @markdown Run this cell to enable the widget
from IPython.display import display as dis

def calculate_parameters_resnet(d_in, resnet_channels):
    """
    ResNet math: Implement how parameters scale

    Args:
      d_in: int
        Input dimensionality
      resnet_channels: int
        Number of channels in ResNet

    Returns:
      None
    """
    d_out = d_in
    resnet_parameters = d_in*resnet_channels + 3*3*resnet_channels*resnet_channels + resnet_channels*d_out

    print('ResNet parameters: {}'.format(resnet_parameters))
    return None

def calculate_parameters_resnext(d_in, resnext_channels,
                                 num_paths):
    """
    ResNext math: Implement how parameters scale

    Args:
      d_in: int
        Input dimensionality
      resnext_channels: int
        Number of channels in each path of ResNext
      num_paths: int
        Number of pathways in ResNext

    Returns:
      None
    """
    d_out = d_in
    d = resnext_channels

    resnext_parameters = (d_in*d + 3*3*d*d + d*d_out)*num_paths

    print('ResNeXt parameters: {}'.format(resnext_parameters))
    return None

labels = ['ResNet', 'ResNeXt']
descriptions_resnet = ['Channels in+out', 'Bottleneck channels']
descriptions_resnext = ['Channels in+out', 'Bottleneck channels',
                        'Number of paths (cardinality)']
lbox_resnet = widgets.VBox([widgets.Label(description) for description in descriptions_resnet])
lbox_resnext = widgets.VBox([widgets.Label(description) for description in descriptions_resnext])

# The `value` and `base` arguments were lost from the source. base=2 and the
# starting values below are assumptions chosen to match the figure.
d_in = widgets.FloatLogSlider(
    value=256,
    base=2,
    min=1, # Min exponent of base
    max=10, # Max exponent of base
    step=1, # Exponent step
)
resnet_channels = widgets.FloatLogSlider(
    value=64,
    base=2,
    min=5, # Min exponent of base
    max=10, # Max exponent of base
    step=1, # Exponent step
)
resnext_channels = widgets.FloatLogSlider(
    value=4,
    base=2,
    min=1, # Min exponent of base
    max=10, # Max exponent of base
    step=1, # Exponent step
)
num_paths = widgets.FloatLogSlider(
    value=32,
    base=2,
    min=0, # Min exponent of base
    max=7, # Max exponent of base
    step=1, # Exponent step
)
rbox_resnet = widgets.VBox([d_in, resnet_channels])
rbox_resnext = widgets.VBox([d_in, resnext_channels, num_paths])
ui_resnet = widgets.HBox([lbox_resnet, rbox_resnet])
ui_resnet_labeled = widgets.VBox(
    [widgets.HTML(value="<b>" + labels[0] + "</b>"), ui_resnet],
    layout=widgets.Layout(border='1px solid black'))
ui_resnext = widgets.HBox([lbox_resnext, rbox_resnext])
ui_resnext_labeled = widgets.VBox(
    [widgets.HTML(value="<b>" + labels[1] + "</b>"), ui_resnext],
    layout=widgets.Layout(border='1px solid black'))
ui = widgets.VBox([ui_resnet_labeled, ui_resnext_labeled])

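out_resnet = widgets.interactive_output(calculate_parameters_resnet,
                                        {'d_in': d_in,
                                         'resnet_channels': resnet_channels})

out_resnext = widgets.interactive_output(calculate_parameters_resnext,
                                         {'d_in': d_in,
                                          'resnext_channels': resnext_channels,
                                          'num_paths': num_paths})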

d1 = dis(ui, out_resnet, out_resnext)


Think! 5: ResNet vs. ResNeXt#

In the figure above, both networks, i.e., ResNet and ResNeXt, have a similar number of parameters.

  1. How many channels are there in the bottleneck of the two networks, respectively?

  2. How are these channels connected to each other from the first to the second layer in the blocks of the two networks, respectively?

  3. What does it mean for the expressiveness of the two models relative to each other?

Click for solution


Now we want to look at the number of parameters.

  • How does the difference in number of parameters change if we fix the number of channels in the bottleneck of both ResNet and ResNeXt to be 64, but vary the number of paths in ResNeXt? (8 paths with 8 channels each would be one such example)

  • Which number of paths results in the biggest parameter savings?

Click for solution

Section 6: Depthwise separable convolutions#

Time estimate: ~23mins

Video 6: Improving efficiency: MobileNet#


Section 6.1: Depthwise separable convolutions#

Another way to reduce the computational cost of large models is the use of depthwise separable convolutions (introduced here). Depthwise separable convolutions are the key component making MobileNets efficient.

Coding Exercise 6.1: Calculation of parameters#

Fill in the calculation of the parameters of regular convolution and depthwise separable convolution in the function below. Above you can see the example given in the video for you to check if your calculation is correct.

def convolution_math(in_channels, filter_size, out_channels):
  """
  Convolution math: Implement how parameters scale as a function of feature maps
  and filter size in convolution vs depthwise separable convolution.

  Args:
    in_channels : int
      Number of input channels
    filter_size : int
      Size of the filter
    out_channels : int
      Number of output channels

  Returns:
    None
  """
  # Fill in all missing code below (...),
  # then remove or comment the line below to test your function
  raise NotImplementedError("Convolution math")
  # Calculate the number of parameters for regular convolution
  conv_parameters = ...
  # Calculate the number of parameters for depthwise separable convolution
  depthwise_conv_parameters = ...

  print(f"Depthwise separable: {depthwise_conv_parameters} parameters")
  print(f"Regular convolution: {conv_parameters} parameters")

  return None

## Uncomment to test your function
# convolution_math(in_channels=4, filter_size=3, out_channels=2)
Depthwise separable: 44 parameters
Regular convolution: 72 parameters
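The two printed counts can be checked by hand. A regular convolution learns one filter_size x filter_size kernel for every pair of input and output channels, whereas a depthwise separable convolution learns one spatial kernel per input channel followed by a 1x1 pointwise convolution that mixes channels. The sketch below is one way to write this down, ignoring biases; the function names are hypothetical.

```python
def regular_conv_params(in_channels, filter_size, out_channels):
  """One k x k kernel per (input channel, output channel) pair."""
  return in_channels * filter_size ** 2 * out_channels


def depthwise_separable_params(in_channels, filter_size, out_channels):
  """Depthwise k x k kernels (one per input channel) plus a 1x1 pointwise conv."""
  depthwise = in_channels * filter_size ** 2
  pointwise = in_channels * out_channels
  return depthwise + pointwise


print(regular_conv_params(4, 3, 2))         # 72
print(depthwise_separable_params(4, 3, 2))  # 44
```

With 4 input channels, a 3x3 filter, and 2 output channels, this reproduces the 72 and 44 parameters shown above.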

Click for solution


Think! 6.1: How do parameter savings depend on the number of input feature maps, 4 vs. 64?#

Click for solution


Section 7: Transfer Learning#

Time estimate: ~24mins

Video 7: Transfer Learning#


The most common way large image models are trained in practice is via transfer learning. One first pretrains a network on a large classification dataset like ImageNet, then uses the weights of this network as initialization for training (“fine-tuning”) that network on your task of choice.

While training a network twice sounds like a strange thing to do, the model ends up training faster on the target dataset and often outperforms training “from scratch”. There are also other benefits such as robustness to noise that are the subject of active research.

In this section we will demonstrate transfer learning by taking a model trained on ImageNet and teaching it to classify Pokemon.

Section 7.1: Download and prepare the data#

Download Data#

# @title Download Data
import zipfile, io
import requests

# Original link:
url = ''

fname = 'small_pokemon_dataset'

# Download and extract only if the dataset folder is not already present
if not os.path.exists(fname):
  print("Data is being downloaded...")
  r = requests.get(url, stream=True)
  z = zipfile.ZipFile(io.BytesIO(r.content))
  z.extractall()
  print("The download has been completed.")
else:
  print("Data has already been downloaded.")
Data is being downloaded...
The download has been completed.
# List the different Pokemon

Determine number of classes#

# @title Determine number of classes
num_classes = 0
for folders in os.listdir('small_pokemon_dataset/'):
  num_classes += 1
print(f"{num_classes} types of Pokemon")
9 types of Pokemon
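
Note that `os.listdir` also returns stray files (for example a hidden `.DS_Store`), which would inflate the count. A slightly more defensive sketch counts only subdirectories. The helper name `count_class_folders` is illustrative, and it assumes one folder per class, which is the layout `ImageFolder` expects.

```python
from pathlib import Path


def count_class_folders(root):
  """Count immediate subdirectories of `root` (hypothetical helper)."""
  return sum(1 for entry in Path(root).iterdir() if entry.is_dir())


# Usage, assuming the dataset has been extracted to this folder:
# num_classes = count_class_folders('small_pokemon_dataset')
```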

Display Example Images#

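# @title Display Example Images
import random

train_transform = transforms.Compose((transforms.Resize((256, 256)),
                                      transforms.ToTensor()))

pokemon_dataset = ImageFolder('small_pokemon_dataset',
                              transform=train_transform)

image_count = len(pokemon_dataset)
train_indices = []
test_indices = []
for i in range(image_count):
  # Put ten percent of the images in the test set
  if random.random() < .1:
    test_indices.append(i)
  else:
    train_indices.append(i)

pokemon_test_set = torch.utils.data.Subset(pokemon_dataset, test_indices)
pokemon_train_set = torch.utils.data.Subset(pokemon_dataset, train_indices)

# NOTE: the batch size was lost from the source; 16 is an assumed value
pokemon_train_loader = torch.utils.data.DataLoader(pokemon_train_set,
                                                   batch_size=16,
                                                   shuffle=True)
pokemon_test_loader = torch.utils.data.DataLoader(pokemon_test_set,
                                                  batch_size=16)

dataiter = iter(pokemon_train_loader)
images, labels = next(dataiter)

# Show images
plt.imshow(make_grid(images, nrow=4).permute(1, 2, 0))
plt.axis('off')
plt.show()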
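
The split above depends on the state of Python's `random` module, so the test set changes from run to run unless a seed is set. As an alternative sketch, `torch.utils.data.random_split` with a seeded generator gives a reproducible split. The `TensorDataset` here is a stand-in for `pokemon_dataset`, and the seed value is arbitrary.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset: 50 fake "images" with integer labels (not the Pokemon data)
toy_dataset = TensorDataset(torch.zeros(50, 3, 8, 8),
                            torch.arange(50) % 9)

# Hold out ten percent of the samples for testing
n_test = len(toy_dataset) // 10
n_train = len(toy_dataset) - n_test

# A seeded generator makes the split identical on every run
generator = torch.Generator().manual_seed(0)
toy_train_set, toy_test_set = random_split(toy_dataset, [n_train, n_test],
                                           generator=generator)

print(len(toy_train_set), len(toy_test_set))
```

Unlike the per-image coin flip used in the tutorial, this variant holds out a fixed number of samples, which makes accuracy values easier to compare across runs.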

Section 7.2: Fine-tuning a ResNet#

It is common in computer vision to take a large model trained on a large dataset (often ImageNet), replace the classification layer and fine-tune the entire network to perform a different task.

Here we’ll be using a pre-trained ResNet model to classify types of Pokemon.

resnet = torchvision.models.resnet18(weights='ResNet18_Weights.DEFAULT')
num_ftrs = resnet.fc.in_features
# Reset final fully connected layer, number of classes = types of Pokemon = 9
resnet.fc = nn.Linear(num_ftrs, num_classes)
# Move the model to the training device (DEVICE is set earlier in the notebook)
resnet.to(DEVICE)
optimizer = torch.optim.Adam(resnet.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

Finetune ResNet#

# @title Finetune ResNet
pretrained_accs = []
for epoch in tqdm.tqdm(range(10)):
  # Train loop
  for batch in pokemon_train_loader:
    images, labels = batch
    images = images.to(DEVICE)
    labels = labels.to(DEVICE)

    # Reset gradients, run the forward pass, backpropagate, update weights
    optimizer.zero_grad()
    output = resnet(images)
    loss = loss_fn(output, labels)
    loss.backward()
    optimizer.step()

  # Eval loop
  with torch.no_grad():
    loss_sum = 0
    total_correct = 0
    total = len(pokemon_test_set)
    for batch in pokemon_test_loader:
      images, labels = batch
      images = images.to(DEVICE)
      labels = labels.to(DEVICE)
      output = resnet(images)
      loss = loss_fn(output, labels)
      loss_sum += loss.item()

      predictions = torch.argmax(output, dim=1)

      num_correct = torch.sum(predictions == labels)
      total_correct += num_correct

    # Record test accuracy for this epoch
    pretrained_accs.append(total_correct.cpu() / total)

# Plot accuracy
plt.plot(pretrained_accs)
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Pokemon prediction accuracy')
plt.show()

Section 7.3: Train only classification layer#

Another possible way to make use of transfer learning is to take a pre-trained model and replace the last layer, the classification layer (sometimes also called the “linear readout”). Instead of fine-tuning the whole model as before, we train only the classification layer.

resnet = torchvision.models.resnet18(weights='ResNet18_Weights.DEFAULT')
for param in resnet.parameters():
  param.requires_grad = False
num_ftrs = resnet.fc.in_features
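# Replace the final fully connected layer; only this new layer is trainable
resnet.fc = nn.Linear(num_ftrs, num_classes)
# Move the model to the training device (DEVICE is set earlier in the notebook)
resnet.to(DEVICE)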
optimizer = torch.optim.Adam(resnet.fc.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
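
A quick sanity check for this kind of setup is to count which parameters still receive gradients. The sketch below uses a tiny stand-in network rather than the ResNet, so it runs without downloading weights. `TinyNet` and `count_parameters` are illustrative names, not part of the tutorial.

```python
import torch.nn as nn


class TinyNet(nn.Module):
  """Stand-in for a pretrained network: feature extractor plus `fc` head."""

  def __init__(self):
    super().__init__()
    self.features = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3),
                                  nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1),
                                  nn.Flatten())
    self.fc = nn.Linear(8, 1000)

  def forward(self, x):
    return self.fc(self.features(x))


def count_parameters(model, trainable_only=False):
  """Count parameters, optionally only those with requires_grad=True."""
  return sum(p.numel() for p in model.parameters()
             if p.requires_grad or not trainable_only)


net = TinyNet()

# Freeze everything, then attach a fresh 9-way readout.
# Newly created layers have requires_grad=True by default.
for param in net.parameters():
  param.requires_grad = False
net.fc = nn.Linear(net.fc.in_features, 9)

trainable = count_parameters(net, trainable_only=True)
total = count_parameters(net)
print(f"{trainable} of {total} parameters are trainable")
```

The same `count_parameters` call can be applied to `resnet` after the cell above. The trainable count should then equal the size of the new `fc` layer only.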

Finetune readout of ResNet#

# @title Finetune readout of ResNet
linreadout_accs = []
for epoch in tqdm.tqdm(range(10)):
  # Train loop
  for batch in pokemon_train_loader:
    images, labels = batch
    images = images.to(DEVICE)
    labels = labels.to(DEVICE)

    # Reset gradients, run the forward pass, backpropagate, update weights
    optimizer.zero_grad()
    output = resnet(images)
    loss = loss_fn(output, labels)
    loss.backward()
    optimizer.step()

  # Eval loop
  with torch.no_grad():
    loss_sum = 0
    total_correct = 0
    total = len(pokemon_test_set)
    for batch in pokemon_test_loader:
      images, labels = batch
      images = images.to(DEVICE)
      labels = labels.to(DEVICE)
      output = resnet(images)
      loss = loss_fn(output, labels)
      loss_sum += loss.item()

      predictions = torch.argmax(output, dim=1)

      num_correct = torch.sum(predictions == labels)
      total_correct += num_correct

    # Record test accuracy for this epoch
    linreadout_accs.append(total_correct.cpu() / total)

# Plot accuracy
plt.plot(linreadout_accs)
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Pokemon prediction accuracy')
plt.show()

Section 7.4: Training ResNet from scratch#

As a baseline for comparison, we will also train the ResNet “from scratch”: we initialize the weights randomly and train the entire network exclusively on the Pokemon dataset.

resnet = torchvision.models.resnet18(weights=None)
num_ftrs = resnet.fc.in_features
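# Replace the final fully connected layer to match the number of Pokemon classes
resnet.fc = nn.Linear(num_ftrs, num_classes)
# Move the model to the training device (DEVICE is set earlier in the notebook)
resnet.to(DEVICE)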
optimizer = torch.optim.Adam(resnet.parameters(), lr=1e-4)

loss_fn = nn.CrossEntropyLoss()

Train ResNet from scratch#

# @title Train ResNet from scratch
scratch_accs = []
for epoch in tqdm.tqdm(range(10)):
  # Train loop
  for batch in pokemon_train_loader:
    images, labels = batch
    images = images.to(DEVICE)
    labels = labels.to(DEVICE)

    # Reset gradients, run the forward pass, backpropagate, update weights
    optimizer.zero_grad()
    output = resnet(images)
    loss = loss_fn(output, labels)
    loss.backward()
    optimizer.step()

  # Eval loop
  with torch.no_grad():
    loss_sum = 0
    total_correct = 0
    total = len(pokemon_test_set)
    for batch in pokemon_test_loader:
      images, labels = batch
      images = images.to(DEVICE)
      labels = labels.to(DEVICE)
      output = resnet(images)
      loss = loss_fn(output, labels)
      loss_sum += loss.item()

      predictions = torch.argmax(output, dim=1)

      num_correct = torch.sum(predictions == labels)
      total_correct += num_correct

    # Record test accuracy for this epoch
    scratch_accs.append(total_correct.cpu() / total)

# Plot accuracy
plt.plot(scratch_accs)
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Pokemon prediction accuracy')
plt.show()


Section 7.5: Head to Head Comparison#

Starting from a randomly initialized network works less well, especially for small datasets. Note that the model trained from scratch converges more slowly and less evenly.

Plot Accuracies#

# @title Plot Accuracies
plt.plot(pretrained_accs, label='Pretrained: fine-tuning')
plt.plot(linreadout_accs, label='Pretrained: linear readout')
plt.plot(scratch_accs, label='Trained from scratch')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Pokemon prediction accuracy')
plt.legend()
plt.show()

Exercise 7.5.1: Pretrained ResNet vs. ResNet trained from scratch#

First, we compare the Pretrained ResNet with the ResNet trained from scratch. Why might pretrained models outperform models trained from scratch? In what cases would you expect them to be worse?

Click for solution


Exercise 7.5.2: Training only the classification layer#

Second, take a look at the different transfer learning methods - fine-tuning the whole network and training only the classification layer. Why might fine-tuning the whole network outperform training only the classification layer? What are the benefits of training only the classification layer? In what cases would you expect a similar performance of both methods?

Click for solution


Further Reading#

Supervised pretraining as you’ve seen here is useful, but there are several other ways of using outside data to improve your models. The ones that are particularly popular right now are self-supervised techniques like contrastive learning.

There is also a recent paper that seeks to quantify the relationship between model size, pretraining dataset size, training dataset size, and performance.


Summary#

In this tutorial, you have learned about modern convnets (CNNs), their architectures, and their operating principles. You are now also familiar with the notion of transfer learning and have learned when to apply it. If you have time left, the bonus section below covers the speed vs. accuracy trade-off. In the next tutorial, we will apply modern convnets to a facial recognition task.

Video 8: Summary and Outlook#


Daily survey#

Don’t forget to complete your reflections and content check in the daily survey! Please be patient after logging in as there is a small delay before you will be redirected to the survey.


Bonus: Speed-Accuracy Trade-Off / Different Backbones#

Time estimate: ~21mins

Video 9: Speed-accuracy trade-off#


As models grew larger and the number of connections increased, so did the computational cost. In modern image processing there is a trade-off between model performance and computational cost: models can reach extremely high performance on many problems, but achieving state-of-the-art results requires huge amounts of compute.

Bonus Coding Exercise: Compare accuracy and training speed of different models#

The goal is to load three pretrained models and fine-tune them. models is a dictionary where the keys are the names of the models and the values are the corresponding model objects. Currently the names are ResNet18, AlexNet and VGG-19. For a start, load these models from torchvision.models and make sure they are pretrained.

If you want to try other models, just change the dictionary. To compare more than three models, add them to the dictionary and add their learning rates to the list below.

Imagenette Train Loop: train_loop(model, optimizer, train_loader, loss_fn, device)#

# @title Imagenette Train Loop: `train_loop(model, optimizer, train_loader, loss_fn, device)`
def train_loop(model, optimizer, train_loader,
               loss_fn, device):
  """
  Imagenette Train Loop

  Args:
    model: nn.module
    optimizer: function
    train_loader: torch.loader
      Training dataset
    loss_fn: function
    device: string
      GPU/CUDA if available. CPU otherwise.

  Returns:
    Average Training time
  """
  times = []
  model.to(device)
  for epoch in tqdm.tqdm(range(5)):
    t_start = time.time()

    # Train on a batch of images
    for imagenette_batch in train_loader:
      images, labels = imagenette_batch

      # Convert labels from imagenette indices to imagenet labels
      for i, label in enumerate(labels):
        labels[i] = dir_index_to_imagenet_label[label.item()]

      images = images.to(device)
      labels = labels.to(device)

      optimizer.zero_grad()
      output = model(images)
      loss = loss_fn(output, labels)
      loss.backward()
      optimizer.step()

      # Wait for pending GPU work so the timing below is meaningful
      if torch.cuda.is_available():
        torch.cuda.synchronize()

      # Elapsed time since the start of the current epoch
      times += [time.time() - t_start]

  return np.mean(times)

Run the models: run_models(models, lr_rates)#

# @title Run the models: `run_models(models, lr_rates)`
def run_models(models, lr_rates):
  """
  Run the models

  Args:
    models: dict
    lr_rates: list
      Learning rates

  Returns:
    times: list
      Running time for models
    top_1_accuracies: list
      Top 1 accuracy per model
  """
  times, top_1_accuracies = [], []

  for (name, model), lr in zip(models.items(), lr_rates):

    print(name, lr)
    model.aux_logits = False  # Important only for googlenet

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    model_time = train_loop(model, optimizer, imagenette_train_loader, loss_fn,
                            device=DEVICE)
    times.append(model_time)

    top_1_acc, _ = eval_imagenette(model, imagenette_val_loader,
                                  len(imagenette_val), device=DEVICE)
    # float() handles both plain numbers and single-element tensors
    top_1_accuracies.append(float(top_1_acc))

  return times, top_1_accuracies

Plot accuracies vs. training speed#

# @title Plot accuracies vs. training speed
def get_parameter_count(model):
  """
  Get parameter count per model

  Args:
    model: nn.module

  Returns:
    Parameter count for model
  """
  return sum([torch.numel(p) for p in model.parameters()])


def plot_acc_speed(times, accs, models):
  """
  Plots Accuracy vs Speed

  Args:
    times: list
      Log of running times
    accs: list
      Log of accuracies
    models: dict
      Log of models

  Returns:
    Nothing
  """
  ti = [t*1000 for t in times]
  for i, model in enumerate(list(models.keys())):
    # Marker size scales with the number of parameters (in millions)
    scale = get_parameter_count(models[model])*1e-6
    plt.scatter(ti[i], accs[i], s=scale, label=model)
  plt.xlabel('Speed [ms]')
  plt.ylabel('Accuracy')
  plt.title('Accuracy vs. Speed')
  plt.legend()
  plt.show()


def create_models(weights):
  """
  Creates models

  Args:
    weights: list of strings
      Names of the pretrained weights to load, one per model.

  Returns:
    models: dict
      Log of models
    lr_rates: list
      Log of learning rates
  """
  # Fill in all missing code below (...),
  # then remove or comment the line below to test your function
  raise NotImplementedError("create pretrained models")
  # Load three pretrained models from torchvision.models
  # [these are just examples, other models are possible as well]
  model1 = ...
  model2 = ...
  model3 = ...

  models = {'...': model1, '...': model2, '...': model3}
  lr_rates = [1e-4, 1e-4, 1e-4]

  return models, lr_rates

weight_list = ['ResNet18_Weights.DEFAULT', 'AlexNet_Weights.DEFAULT', 'VGG19_Weights.DEFAULT']
## Uncomment below to test your function
# models, lr_rates = create_models(weights=weight_list)
# times, top_1_accuracies = run_models(models, lr_rates)
# plot_acc_speed(times, top_1_accuracies, models)

Click for solution

Example output:

Solution hint


Bonus Exercise 1: Finding the best model#

Look at the plot above. It shows the training speed vs. the accuracy of the models you chose. The training speed is measured as the mean time the training takes per epoch. The size of the marker visualizes the number of parameters of the model.

Which model seems to be the best for this task and why? Explain your conclusion based on speed, accuracy and number of parameters.
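
One way to make this comparison systematic is to discard every model that is beaten on both speed and accuracy by another one. The remaining models form the trade-off frontier. The sketch below is a minimal illustration: the function name `pareto_front` and the numbers are hypothetical placeholders, not measurements from this tutorial.

```python
def pareto_front(results):
  """
  Return names of models that are not dominated by any other model.

  Args:
    results: dict
      Maps model name to (time_per_epoch, accuracy)

  Returns:
    List of model names on the speed-accuracy frontier
  """
  front = []
  for name, (t, acc) in results.items():
    # A model is dominated if another one is at least as fast and at least
    # as accurate, and strictly better in one of the two
    dominated = any(
        (t2 <= t and acc2 >= acc) and (t2 < t or acc2 > acc)
        for other, (t2, acc2) in results.items() if other != name)
    if not dominated:
      front.append(name)
  return front


# Hypothetical placeholder values: (seconds per epoch, top-1 accuracy)
example_results = {'model_a': (10.0, 0.90),
                   'model_b': (5.0, 0.80),
                   'model_c': (12.0, 0.85)}
print(pareto_front(example_results))
```

With your own results, the dictionary could be built from the `times` and `top_1_accuracies` lists returned by `run_models`. Choosing among the models left on the frontier still depends on how much accuracy you are willing to trade for speed or for a smaller parameter count.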

Click for solution


Bonus Exercise 2: Speed and accuracy correlation#

How does the speed correlate with the accuracy? Are faster models also more accurate?

Click for solution
