# Tutorial 2: Out-of-distribution (OOD) Learning


**Week 3, Day 4: Continual Learning**

**By Neuromatch Academy**

**Content creators:** Avishree Khare, Het Shah, Joshua Vogelstein

**Content reviewers:** Arush Tagade, Jeremy Forest, Kelson Shilling-Scrivo

**Content editors:** Gagana B, Anoop Kulkarni, Spiros Chavlis

**Production editors:** Deepak Raya, Spiros Chavlis

**Post-Production team:** Gagana B, Spiros Chavlis

**Our 2021 Sponsors, including Presenting Sponsor Facebook Reality Labs**

# Tutorial Objectives

Deep Learning has seen tremendous growth in recent years thanks to more data, more compute, and *very* deep neural networks. Although these networks perform extremely well on the task they were trained for, they often fail to generalize to new tasks.

In this tutorial, we will explore Out-of-distribution (OOD) Learning and the several OOD paradigms that have been gaining popularity in recent years. We will understand what OOD really means and how it differs from everything else that we’ve looked at so far. We’ll also take a look at Transfer Learning, Multi-task Learning and Meta-Learning, which aim to facilitate OOD Learning in different ways. Here is the list of topics that we will cover in this tutorial:

- Introduction to OOD Learning
- Transfer Learning
- Multi-Task Learning
- Meta-Learning

# Setup

## Install dependencies

```
# @title Install dependencies
from IPython.display import clear_output
!pip install Pillow --quiet
!pip install pandas --quiet
!pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html --quiet
!pip install git+https://github.com/NeuromatchAcademy/evaltools --quiet
from evaltools.airtable import AirtableForm
# Generate airtable form
atform = AirtableForm('appn7VdPRseSoMXEG','W3D4_T2','https://portal.neuromatchacademy.org/api/redirect/to/1d7fcd5d-f1e9-4ac5-ae58-b0ade54a4f87')
clear_output()
```

```
# Imports
import os
import copy
import time
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.optim import lr_scheduler
from torch.utils.data import Dataset, DataLoader, random_split
from torchvision.datasets import CelebA, Omniglot
from torchvision import datasets, models, transforms
```

## Figure settings

```
# @title Figure settings
import ipywidgets as widgets # Interactive display
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle")
```

## Plotting functions

```
# @title Plotting functions
def visualize_siamese_sample(sample):
    """
    Helper function to visualize siamese samples
    Args:
        sample: tuple
            Tuple containing two images and corresponding label
    Returns:
        Nothing
    """
    to_PIL = transforms.ToPILImage()
    fig, axs = plt.subplots(1, 2, figsize=(8, 10))
    x1, x2, y = sample
    label = y.item()
    similarity = "Same character" if label == 1.0 else "Different characters"
    print(f"-------------------- Label: {label} ({similarity}) -------------------")
    axs[0].imshow(to_PIL(x1), cmap='gray')
    axs[1].imshow(to_PIL(x2), cmap='gray')
    for i in range(2):
        axs[i].xaxis.set_ticks([])
        axs[i].yaxis.set_ticks([])
    plt.tight_layout()
    plt.show()


def visualize_mtfl_sample(sample):
    """
    Helper function to visualize MTFL samples
    Args:
        sample: tuple
            Tuple containing image and corresponding label
    Returns:
        Nothing
    """
    to_PIL = transforms.ToPILImage()
    img, labels = sample
    fig, ax = plt.subplots()
    ax.xaxis.set_ticks([])
    ax.yaxis.set_ticks([])
    print(f"Labels: {labels.tolist()}")
    ax.imshow(to_PIL(img))
    plt.show()


def visualize_one_shot_sample(sample):
    """
    Helper function to visualize one-shot sample
    Args:
        sample: tuple
            Tuple containing query image, support image, support labels and similarity
    Returns:
        Nothing
    """
    to_PIL = transforms.ToPILImage()
    query_img, support_imgs, support_labels, similarity = sample
    n_rows = 1
    n_cols = len(support_imgs) + 1
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(n_cols*4, 10))
    axs[0].imshow(to_PIL(query_img), cmap='gray')
    axs[0].set_title('Query Image')
    axs[0].xaxis.set_ticks([])
    axs[0].yaxis.set_ticks([])
    for i, (s_img, s_label) in enumerate(zip(support_imgs, support_labels)):
        axs[i+1].imshow(to_PIL(s_img), cmap='gray')
        label = s_label.item()
        similarity = "Same character" if label == 1.0 else "Different characters"
        axs[i+1].set_title(f"Support Image {i+1} \n "
                           f"Label: {s_label.item()} ({similarity})")
        axs[i+1].xaxis.set_ticks([])
        axs[i+1].yaxis.set_ticks([])
    plt.tight_layout()
    plt.show()
```

## Dataset Definitions and Helper Functions

`pandas` and `PIL` libraries should be installed.

```
# @title Dataset Definitions and Helper Functions
# @markdown `pandas` and `PIL` libraries should be installed.
import random  # used below to sample classes and instances
import pandas as pd
from PIL import Image


class SiameseOmniglotDataset(Dataset):
    """
    Dataset of image pairs drawn from Omniglot
    """

    def __init__(self, num_samples=10000,
                 data_transforms=transforms.ToTensor(), download=True):
        """
        Initialize Siamese Omniglot Dataset Parameters
        Args:
            num_samples: int
                Number of samples
            data_transforms: callable
                Specification of data transforms
            download: boolean
                If true, download Omniglot Dataset
        Returns:
            Nothing
        """
        self.dataset = Omniglot(root='./data/',
                                background=True,
                                download=download,
                                transform=data_transforms)
        self.num_samples = num_samples
        self.num_classes = len(self.dataset._characters)
        self.classes = range(self.num_classes)
        self.instances_per_class = dict()
        self.get_instances_per_class()

    def __getitem__(self, idx):
        """
        Helper function to get instances from same/different classes
        Args:
            idx: int
                If index is even, get instances from same class
                Else, get instances from different classes
        Returns:
            Instances from same class/different classes depending on idx
        """
        if idx % 2 == 0:
            return self.get_instances_from_same_class()
        return self.get_instances_from_diff_classes()

    def get_instances_from_same_class(self):
        """
        Helper function to get instances from same class
        Args:
            None
        Returns:
            ch1: torch.tensor
                Sample from class
            ch2: torch.tensor
                Another sample from same class
            torch.tensor([1.0])
        """
        c = random.randint(0, self.num_classes-1)
        [ch_1, ch_2] = random.sample(self.instances_per_class[c], 2)
        return ch_1, ch_2, torch.tensor([1.0])

    def get_instances_from_diff_classes(self):
        """
        Helper function to get instances from different classes
        Args:
            None
        Returns:
            ch1: torch.tensor
                Sample from class 1
            ch2: torch.tensor
                Sample from class 2
            torch.tensor([0.0])
        """
        [c1, c2] = random.sample(self.classes, 2)
        [ch_1] = random.sample(self.instances_per_class[c1], 1)
        [ch_2] = random.sample(self.instances_per_class[c2], 1)
        return ch_1, ch_2, torch.tensor([0.0])

    def get_random_instance(self):
        """
        Helper function to get random instance
        Args:
            None
        Returns:
            character_1: torch.tensor
                First randomly chosen sample
            character_2: torch.tensor
                Second randomly chosen sample
            similarity: torch.tensor
                1.0 if both samples have the same label, 0.0 otherwise
        """
        ids = np.random.randint(0, len(self.dataset), 2)
        character_1 = self.dataset[ids[0]][0]
        character_2 = self.dataset[ids[1]][0]
        label_1 = self.dataset[ids[0]][1]
        label_2 = self.dataset[ids[1]][1]
        similarity = torch.tensor([label_1 == label_2], dtype=torch.float)
        return character_1, character_2, similarity

    def __len__(self):
        """
        Helper function returning number of samples
        Args:
            None
        Returns:
            num_samples: int
                Number of samples
        """
        return self.num_samples

    def get_instances_per_class(self):
        """
        Gets instances per class
        Args:
            None
        Returns:
            Nothing
        """
        for (image, label) in self.dataset:
            if label in self.instances_per_class:
                self.instances_per_class[label].append(image)
            else:
                self.instances_per_class[label] = [image]
class NWayOneShotOmniglotDataset(Dataset):
    """
    Datasets for N-way 1-shot
    classification on the Omniglot Dataset
    """

    def __init__(self, data_transforms=transforms.ToTensor(), n_ways=5,
                 download=True):
        """
        Initialize N-way one-shot Omniglot Dataset Parameters
        Args:
            data_transforms: callable
                Specification of data transforms
            n_ways: int
                Number of support classes per query
            download: boolean
                If true, download Omniglot Dataset
        Returns:
            Nothing
        """
        self.dataset = Omniglot(root='./data/',
                                background=False,
                                download=download,
                                transform=data_transforms)
        self.size = len(self.dataset)
        self.num_classes = len(self.dataset._characters)
        self.classes = range(self.num_classes)
        self.n_ways = n_ways
        self.instances_per_class = dict()
        self.get_instances_per_class()

    def __getitem__(self, idx):
        """
        Helper function to build one N-way one-shot episode
        Args:
            idx: int
                Get query image and class corresponding to specified idx
        Returns:
            query_img: iterable
                Query image from dataset (extracted from corresponding idx)
            support_imgs: list
                n_ways images: one from query_class, (n_ways - 1) from other classes
            support_labels: list
                1.0 for the support image from query_class, 0.0 for the others
            similarity: torch.tensor
                Index of the support image that shares the query class
        """
        query_img, query_class = self.dataset[idx]
        # Find (n_ways - 1) distinct characters different from query_class
        support_imgs = []
        support_labels = []
        support_classes = random.sample([c for c in self.classes if c != query_class], self.n_ways-1)
        # Find 1 support image from query_class
        support_classes.append(query_class)
        random.shuffle(support_classes)
        for c in support_classes:
            [img] = random.sample(self.instances_per_class[c], 1)
            support_imgs.append(img)
            support_labels.append(torch.tensor([c == query_class], dtype=torch.float))
        _, similarity = torch.max(torch.tensor(support_labels), axis=0)
        return query_img, support_imgs, support_labels, similarity

    def __len__(self):
        """
        Helper function returning number of samples
        Args:
            None
        Returns:
            Number of samples
        """
        return self.size

    def get_instances_per_class(self):
        """
        Helper function to get instances per class
        Args:
            None
        Returns:
            Nothing
        """
        for (image, label) in self.dataset:
            if label in self.instances_per_class:
                self.instances_per_class[label].append(image)
            else:
                self.instances_per_class[label] = [image]
class MTFLDataset(Dataset):
    """
    MTFL Dataset, read from an annotation file on disk
    """

    def __init__(self, data_file,
                 num_samples=10000,
                 data_transforms=transforms.ToTensor(), ):
        """
        Initialize parameters of MTFL data
        Args:
            data_file: string
                Specifies path of requisite annotation file (space-separated)
            num_samples: int
                Number of samples in the dataset
            data_transforms: callable
                Specification of data transforms
        Returns:
            Nothing
        """
        self.df = pd.read_csv(data_file, sep=' ', header=None,
                              skipinitialspace=True, nrows=num_samples)
        # Image paths in the file use backslashes; convert them to forward slashes
        self.df.iloc[:, 0] = self.df.iloc[:, 0].apply(lambda s: s.replace('\\', '/'))
        self.transform = data_transforms

    def __getitem__(self, idx):
        """
        Helper function to get the image and labels at a given index
        Args:
            idx: int
                Specifies index for corresponding image and label
        Returns:
            img: torch.tensor
                Transformed image from corresponding idx
            torch.tensor (long) of corresponding labels
        """
        item = self.df.iloc[idx]
        img_name = item[0]
        labels = (item[11:] - 1)  # 1-indexed to 0-indexed
        img = Image.open(img_name)
        img = self.transform(img)
        return img, torch.from_numpy(np.array(labels, dtype=np.float32)).long()

    def __len__(self):
        """
        Helper function returning number of samples
        Args:
            None
        Returns:
            Number of samples
        """
        return len(self.df)
"""
Load datasets
"""


def get_train_val_datasets(background_dataset_size=10000,
                           val_split=0.2, download=True):
    """
    Helper function to get training and validation sets
    Args:
        val_split: float
            Specifies fraction of data reserved for validation
        download: boolean
            If true, download dataset
        background_dataset_size: int
            Number of datapoints in the dataset
    Returns:
        train_dataset: iterable [map style]
            Training dataset
        val_dataset: iterable [map style]
            Validation dataset
    Refer https://pytorch.org/docs/stable/data.html#map-style-datasets for more details
    """
    dataset_size = background_dataset_size
    # Use the `val_split` argument instead of overwriting it with a fixed value
    train_size = int(dataset_size * (1 - val_split))
    val_size = dataset_size - train_size
    background_dataset = SiameseOmniglotDataset(num_samples=dataset_size,
                                                download=download)
    train_dataset, val_dataset = random_split(background_dataset, [train_size, val_size])
    return train_dataset, val_dataset


def get_test_dataset(n_ways=5, download=True):
    """
    Get Test dataset
    Args:
        n_ways: int
            Number of support classes (ways) per query
        download: boolean
            If true, download dataset
    Returns:
        Test data from NWayOneShotOmniglotDataset instance
    """
    return NWayOneShotOmniglotDataset(n_ways=n_ways, download=download)


def get_train_val_datasets_mtfl(dirname, dataset_size=10000, val_split=0.2):
    """
    Get training and validation datasets
    Args:
        dirname: string
            Specifies directory path
        dataset_size: int
            Size of the dataset
        val_split: float
            Specifies fraction of data reserved for validation
    Returns:
        train_dataset: iterable [map style]
            Training dataset
        val_dataset: iterable [map style]
            Validation dataset
    """
    data_transforms = transforms.Compose([transforms.Resize((256, 256)),
                                          transforms.ToTensor()])
    train_size = int(dataset_size * (1 - val_split))
    val_size = dataset_size - train_size
    background_dataset = MTFLDataset(data_file=f'{dirname}/MTFL/training.txt',
                                     num_samples=dataset_size,
                                     data_transforms=data_transforms)
    train_dataset, val_dataset = random_split(background_dataset, [train_size, val_size])
    return train_dataset, val_dataset
```

## Training/Evaluation Helper Functions

```
# @title Training/Evaluation Helper Functions
# @markdown `train_multi_task(model, trainloader, valloader, criterion, optimizer, epochs=10, device='cpu')`
def train_multi_task(model, trainloader, valloader, criterion, optimizer,
                     epochs=10, device='cpu'):
    """
    Helper function to train multitask network
    Args:
        model: instance of Multi_task_model class
            Describes model
        epochs: int
            Number of epochs
        device: string
            GPU/CUDA if available, CPU otherwise
        trainloader: torch.loader
            Training dataset
        valloader: torch.loader
            Validation dataset
        criterion: torch.nn type
            Criterion specifies loss function
        optimizer: torch.optim type
            Optimizer used to update the parameters
    Returns:
        model: nn.Module
            Model loaded with the weights of the best validation epoch
    """
    best_model_weights = copy.deepcopy(model.state_dict())
    best_overall_acc = 0.0
    # Normalize by the true number of samples rather than by the size of
    # the last batch, which may be smaller than the others
    n_train = len(trainloader.dataset)
    n_val = len(valloader.dataset)
    for ep in range(epochs):
        print("")
        print("======== Epoch {:} / {:} ========".format(ep + 1, epochs))
        running_corrects = np.zeros(4)
        running_loss = 0.0
        model.train()
        model.to(device)
        for inps, labels in trainloader:
            inps = inps.to(device)
            labels = labels.to(device)
            outs = model(inps)
            # Sum the losses of the four tasks into a single scalar
            overall_loss = 0.0
            for i in range(4):
                overall_loss += criterion(outs[i].to(device), labels[:, i].to(device))
                _, preds = torch.max(outs[i], 1)
                running_corrects[i] += torch.sum(preds == labels[:, i].data).item()
            running_loss += overall_loss.item()
            optimizer.zero_grad()
            overall_loss.backward()
            optimizer.step()
        train_loss = running_loss / n_train
        train_accs = running_corrects / n_train
        train_overall_acc = np.mean(train_accs)
        running_corrects_val = np.zeros(4)
        model.eval()
        with torch.no_grad():
            for inps, labels in valloader:
                inps = inps.to(device)
                labels = labels.to(device)
                outs = model(inps)
                for i in range(4):
                    _, preds = torch.max(outs[i], 1)
                    running_corrects_val[i] += torch.sum(preds == labels[:, i].data).item()
        val_accs = running_corrects_val / n_val
        val_overall_acc = np.mean(val_accs)
        # Keep a copy of the weights with the best average validation accuracy
        if val_overall_acc > best_overall_acc:
            best_overall_acc = val_overall_acc
            best_model_weights = copy.deepcopy(model.state_dict())
        print(f"Training => Avg. Task Accuracy: {train_overall_acc}")
        print(f"Validation => Avg. Task Accuracy: {val_overall_acc}")
    model.load_state_dict(best_model_weights)
    return model
# @markdown `train_siamese_network(model, criterion, optimizer, train_loader, device='cpu', print_freq=100)`
def train_siamese_network(model, criterion, optimizer, train_loader,
                          device="cpu", print_freq=100):
    """
    Helper function to train siamese network
    Args:
        model: instance of Siamese class
            Describes model
        criterion: torch.nn type
            Criterion specifies loss function
        optimizer: torch.optim type
            Optimizer used to update the parameters
        device: string
            GPU/CUDA if available, CPU otherwise
        train_loader: torch.loader
            Training dataset
        print_freq: int
            Frequency of printing training progress
    Returns:
        Nothing
    """
    model.to(device)
    running_loss = 0.0
    correct = 0.0
    total = 0.0
    for batch_idx, data in enumerate(train_loader):
        x1 = data[0].to(device)
        x2 = data[1].to(device)
        y = data[2].to(device)
        optimizer.zero_grad()  # Set parameter gradients to zero
        y_pred = model(x1, x2)
        # print(y_pred.shape, y.shape)
        loss = criterion(y_pred.view(-1), y.view(-1))
        loss.backward()  # Compute the gradients (`grad`) of all the parameters
        optimizer.step()  # Update the parameter values using the computed gradients
        running_loss += loss.item()
        y_out = torch.round(torch.sigmoid(y_pred))  # To do: do necessary modifs from sigmoid to regular
        correct += y_out.eq(y.to(device)).sum()
        total += y_out.shape[0]
        # if (batch_idx % print_freq) == (print_freq - 1):
        #   print(f"Batch: {batch_idx} Running loss: {running_loss / print_freq} Train accuracy: {acc}")
        #   running_loss = 0.0
    print(f"Training => Average Loss: {running_loss/len(train_loader)} | Accuracy: {correct/total}")
# @markdown `evaluate_siamese_network(model, criterion, val_loader, device='cpu')`
def evaluate_siamese_network(model, criterion,
                             val_loader, device='cpu'):
    """
    Helper function to evaluate siamese network
    Args:
        model: instance of Siamese class
            Describes model
        criterion: torch.nn type
            Criterion specifies loss function
        device: string
            GPU/CUDA if available, CPU otherwise
        val_loader: torch.loader
            Validation dataset
    Returns:
        Nothing
    """
    model.eval()
    model.to(device)
    correct = 0.0
    total = 0.0
    val_loss = 0.0
    with torch.no_grad():
        for (x1, x2, y) in val_loader:
            y_pred = model(x1.to(device), x2.to(device))
            loss = criterion(y_pred.view(-1), y.to(device).view(-1))
            val_loss += loss.item()
            y_out = torch.round(torch.sigmoid(y_pred))  # To do: do necessary modifs from sigmoid to regular
            correct += y_out.eq(y.to(device)).sum()
            total += y_out.shape[0]
    val_acc = correct/total
    val_loss = val_loss/len(val_loader)
    print(f"Validation => Average Loss: {val_loss} | Accuracy: {val_acc}")
def evaluate_one_shot(model, test_loader, device='cpu'):
    """
    Evaluation of One-Shot model
    Args:
        model: instance of One-Shot Siamese class
            Describes model
        device: string
            GPU/CUDA if available, CPU otherwise
        test_loader: torch.loader
            Test dataset
    Returns:
        Nothing
    """
    model.eval()
    model.to(device)
    correct = 0.0
    total = 0.0
    with torch.no_grad():
        for (query_img, support_imgs, support_labels, similarity) in test_loader:
            x_query = query_img.to(device)
            y_pred_fc = [model(x_query, x_support.to(device)) for x_support in support_imgs]
            y_pred = [torch.sigmoid(pred) for pred in y_pred_fc]
            _, y_out = torch.max(torch.cat(y_pred, dim=1), 1)
            correct += y_out.eq(similarity.to(device)).sum()
            total += y_out.shape[0]
    print(f'Testing (One-shot) => Accuracy: {correct/total}')
```

## Set random seed

Executing `set_seed(seed=seed)` sets the seed.

```
# @title Set random seed
# @markdown Executing `set_seed(seed=seed)` sets the seed
# For DL its critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here: https://pytorch.org/docs/stable/notes/randomness.html
# Call `set_seed` function in the exercises to ensure reproducibility.
import random
import torch


def set_seed(seed=None, seed_torch=True):
    """
    Function that controls randomness. NumPy and random modules must be imported.
    Args:
        seed : Integer
            A non-negative integer that defines the random state. Default is `None`.
        seed_torch : Boolean
            If `True` sets the random seed for pytorch tensors, so pytorch module
            must be imported. Default is `True`.
    Returns:
        Nothing.
    """
    if seed is None:
        seed = np.random.choice(2 ** 32)
    random.seed(seed)
    np.random.seed(seed)
    if seed_torch:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.cuda.manual_seed(seed)
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
    print(f'Random seed {seed} has been set.')


# In case that `DataLoader` is used
def seed_worker(worker_id):
    """
    DataLoader will reseed workers following randomness in
    multi-process data loading algorithm.
    Args:
        worker_id: integer
            ID of subprocess to seed. 0 means that
            the data will be loaded in the main process
            Refer: https://pytorch.org/docs/stable/data.html#data-loading-randomness for more details
    Returns:
        Nothing
    """
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
```
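The effect of seeding can be checked in isolation. The snippet below is a minimal sketch that uses only Python's built-in `random` module; the helper `draw` is a hypothetical name introduced for illustration and is not part of the tutorial's helpers. `set_seed` applies the same idea to NumPy and PyTorch as well.

```python
import random


def draw(seed, n=3):
    """Seed Python's generator, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = draw(2021)
second = draw(2021)  # same seed, so the same sequence is produced again

print(first == second)
```

Re-seeding with the same value replays the same sequence, which is why calling `set_seed` before an exercise makes results comparable across runs.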

## Set device (GPU or CPU). Execute `set_device()`

```
# @title Set device (GPU or CPU). Execute `set_device()`
# especially if torch modules used.
# Inform the user if the notebook uses GPU or CPU.
def set_device():
    """
    Set the device. CUDA if available, CPU otherwise
    Args:
        None
    Returns:
        device: string
            "cuda" if a GPU is available, "cpu" otherwise
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device != "cuda":
        print("WARNING: For this notebook to perform best, "
              "if possible, in the menu under `Runtime` -> "
              "`Change runtime type.` select `GPU` ")
    else:
        print("GPU is enabled in this notebook.")
    return device
```

```
SEED = 2021
set_seed(seed=SEED)
DEVICE = set_device()
```

```
Random seed 2021 has been set.
WARNING: For this notebook to perform best, if possible, in the menu under `Runtime` -> `Change runtime type.` select `GPU`
```

# Section 1: Introduction to Out-of-Distribution (OOD) Learning

*Time estimate: ~5mins*

In this section, we’ll take a brief look at what OOD Learning is and how it can be useful in solving various problems with traditional Deep Learning.
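To make the idea of a distribution shift concrete, here is a toy sketch that is not part of the tutorial's own material. It assumes two 1-D Gaussian classes with made-up means, fits a simple threshold classifier on them, and then evaluates it on data whose means have moved. All names and numbers are illustrative assumptions.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def sample(mean, n):
    """Draw n points from a 1-D Gaussian with the given mean and unit variance."""
    return [random.gauss(mean, 1.0) for _ in range(n)]


def accuracy(threshold, class0, class1):
    """Fraction of points that fall on the correct side of the threshold."""
    correct = sum(x < threshold for x in class0) + sum(x >= threshold for x in class1)
    return correct / (len(class0) + len(class1))


# "Training" data: class 0 centred at 0, class 1 centred at 4 (hypothetical values)
train0, train1 = sample(0.0, 1000), sample(4.0, 1000)

# Simplest possible classifier: a threshold halfway between the two class means
threshold = (sum(train0) / len(train0) + sum(train1) / len(train1)) / 2

# In-distribution test set: same class means as the training data
in_dist_acc = accuracy(threshold, sample(0.0, 1000), sample(4.0, 1000))

# Out-of-distribution test set: both classes shifted by +3
ood_acc = accuracy(threshold, sample(3.0, 1000), sample(7.0, 1000))

print(f"In-distribution accuracy: {in_dist_acc:.2f}")
print(f"Out-of-distribution accuracy: {ood_acc:.2f}")
```

The classifier itself is unchanged between the two evaluations; only the test distribution moves, and that alone is enough to degrade accuracy.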

## Video 1: The Future of Learning

## Think! 1.1: What are the problems that *you* want to solve?

Think of a problem that you would want to have solved using Deep Learning. Don’t restrict yourself to the problems you have looked at so far!

Take 2 mins to think in silence and jot down a brief description of the problem here.

```
# @markdown
Problem_that_you_want_to_solve = '' # @param {type:"string"}
```

Now discuss the following as a group:

- Can these problems be solved using the techniques that you have looked at so far? If not, why?
- Is there an OOD element attached to any of the problems, which makes them harder to solve?
- Can you think of ways you could solve these problems? (Don’t worry if you struggle, we’ll hopefully find some suggestions by the end of this tutorial.)

# Section 2: Transfer and Multi-Task Learning

*Time estimate: ~35mins*

In this section we will learn about transfer learning and multi-task learning. Transfer learning is a machine learning method in which a model trained on one task (or dataset) is used as a good initialization for training on a different task (or dataset). It is a very common practice in Deep Learning and is not limited to any particular subdomain: it is widely used in Computer Vision, Natural Language Processing, Reinforcement Learning, and other domains.

Multi-task learning aims to learn multiple tasks simultaneously using knowledge shared across these tasks. We can use transfer learning to initialize the shared parts of the network.

We aim to learn these concepts via a simple problem: facial attribute classification on the MTFL dataset.
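The "good initialization" idea can be illustrated without any deep learning library. The sketch below is a toy analogue, not the tutorial's method: it assumes a one-parameter model, two hypothetical related tasks whose optimal weights are 3.0 and 3.5, and plain gradient descent. It counts how many steps the second task needs from scratch and from the first task's solution.

```python
def steps_to_fit(target, w_init, lr=0.1, tol=1e-3, max_steps=10_000):
    """Gradient descent on the loss (w - target)**2; returns (steps used, final w)."""
    w = w_init
    for step in range(max_steps):
        if abs(w - target) < tol:
            return step, w
        grad = 2.0 * (w - target)  # derivative of (w - target)**2 with respect to w
        w -= lr * grad
    return max_steps, w


# Hypothetical "source" task A: its solution plays the role of pretrained weights
_, w_pretrained = steps_to_fit(target=3.0, w_init=0.0)

# Related "target" task B, trained from scratch and from task A's solution
steps_scratch, _ = steps_to_fit(target=3.5, w_init=0.0)
steps_transfer, _ = steps_to_fit(target=3.5, w_init=w_pretrained)

print(f"From scratch: {steps_scratch} steps, with transfer: {steps_transfer} steps")
```

The benefit in this toy setting comes entirely from the two tasks having nearby solutions; if the target optimum were far from 3.0, the warm start would help much less.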

## Video 2: Transfer and Multi-Tasking Learning

## Section 2.1: Getting the data

We will be using the MTFL dataset for demonstrating the concepts of Transfer and Multi-Task Learning.

### Get the current working directory in `CWD` variable.

```
# @title Get the current working directory in `CWD` variable.
CWD = os.getcwd()
print(f'Current dir: {CWD}')
```

```
Current dir: /home/runner/work/course-content-dl/course-content-dl/tutorials/W3D4_ContinualLearning/student
```

### Download and unzip the dataset

```
# @title Download and unzip the dataset
import requests, zipfile
# Originally from 'http://mmlab.ie.cuhk.edu.hk/projects/TCDCN/data/MTFL.zip'
os.chdir(CWD)
name = 'MTFL'
fname = f"{name}.zip"
url = "https://osf.io/u5emj/download"
if not os.path.exists(name):
    print("Start downloading and unzipping `MTFL` dataset...")
    r = requests.get(url, allow_redirects=True)
    with open(fname, 'wb') as fd:
        fd.write(r.content)
    with zipfile.ZipFile(fname, 'r') as zip_ref:
        zip_ref.extractall(name)
    print("Download completed.")
else:
    print('Data has been already downloaded.')
print('Change working directory!')
os.chdir(name)
print(f'Current dir: {os.getcwd()}')
```

```
Start downloading and unzipping `MTFL` dataset...
```

```
Download completed.
Change working directory!
Current dir: /home/runner/work/course-content-dl/course-content-dl/tutorials/W3D4_ContinualLearning/student/MTFL
```

Let’s load the dataset into dataloaders, which will help us in training the model!

You can check the implementation of `get_train_val_datasets_mtfl()` in the hidden cell `Dataset Definitions and Helper Functions`.

```
"""
Create dataloaders for the train and validation datasets
"""
train_dataset, val_dataset = get_train_val_datasets_mtfl(CWD)
# Change this for a different batch size
batch_size = 16
g_seed = torch.Generator()
g_seed.manual_seed(SEED)
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    num_workers=2,
    worker_init_fn=seed_worker,
    generator=g_seed
)
val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    num_workers=2,
    worker_init_fn=seed_worker,
    generator=g_seed
)
```
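As a quick sanity check on the sizes involved, the arithmetic behind the split can be reproduced by hand. This sketch assumes the default arguments of `get_train_val_datasets_mtfl` (`dataset_size=10000`, `val_split=0.2`), the batch size of 16 used above, and a `DataLoader` left at its default of keeping the last partial batch.

```python
import math

dataset_size = 10000  # default of get_train_val_datasets_mtfl
val_split = 0.2       # default validation fraction
batch_size = 16       # value used in the cell above

train_size = int(dataset_size * (1 - val_split))
val_size = dataset_size - train_size

# With drop_last=False (the default), a loader yields ceil(n / batch_size) batches
train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)

print(train_size, val_size, train_batches, val_batches)
```

Both splits divide evenly by 16 here, so every batch is full; with other batch sizes the last batch may be smaller.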

Let’s visualize a sample from the dataset. The input is an image, and each image has multiple labels. The labels are:

- gender: 0 = male, 1 = female
- smiling: 0 = yes, 1 = no
- wearing glasses: 0 = yes, 1 = no
- head pose: 0 = 0°, 1 = +30°, 2 = -30°, 3 = +60°, 4 = -60°
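The label description above can be turned into a small lookup that makes printed label vectors easier to read. This is a sketch only: `LABEL_NAMES` and `decode_mtfl_labels` are hypothetical names introduced here, and the mapping is copied from the description above rather than from the dataset's own documentation.

```python
# Mapping copied from the label description in the text
LABEL_NAMES = {
    "gender": ["male", "female"],
    "smiling": ["yes", "no"],
    "wearing glasses": ["yes", "no"],
    "head pose": ["0°", "+30°", "-30°", "+60°", "-60°"],
}


def decode_mtfl_labels(labels):
    """Turn a list of four integer labels into a {task: meaning} dictionary."""
    return {task: names[value]
            for (task, names), value in zip(LABEL_NAMES.items(), labels)}


decoded = decode_mtfl_labels([0, 0, 0, 2])
print(decoded)
```

For a label tensor returned by the dataset, `labels.tolist()` gives the plain list of integers that this helper expects.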

```
sample = train_dataset[0]
visualize_mtfl_sample(sample)
```

```
Labels: [0, 0, 0, 2]
```

Can you interpret the above image and the labels corresponding to the image? Let’s have a look at some more instances from the dataset.

## Section 2.2: Defining the model

A typical multi-task model consists of shared layers, where knowledge is shared across tasks, followed by task-specific layers that, as the name suggests, are specific to each task.

For the shared layers we will use a pre-trained backbone network. You can try out various models here; we provide an example that uses a ResNet-18 backbone. We have to remove its last fully connected layer, because that layer is specific to the original classification task. Further, each task gets its own fully connected layers. You can play around with more layers on your own!

### Coding Exercise 2.2: Creating a Multi-Task model

Complete the custom model class `Multi_task_model`. In the `forward()` function, add your solution to collect the outputs for all the tasks, using the fully connected layers of the corresponding task.

The torchvision documentation lists all the pre-trained models provided by PyTorch.

```
class Multi_task_model(nn.Module):
    """
    Defines Multi-task model
    """

    def __init__(self, pretrained=True, num_tasks=4, load_file=None,
                 num_labels_per_task=[2, 2, 2, 5]):
        """
        Initialize parameters of the multi-task model
        Args:
            pretrained: boolean
                If true, load pretrained model
            num_tasks: int
                Number of tasks
            load_file: string
                If specified, load requisite file [default: None]
            num_labels_per_task: list
                Specifies number of labels per task
        Returns:
            Nothing
        """
        super(Multi_task_model, self).__init__()
        self.backbone = models.resnet18(pretrained=pretrained)  # You can play around with different pre-trained models
        if load_file:
            self.backbone.load_state_dict(torch.load(load_file))
        self.backbone = torch.nn.Sequential(*(list(self.backbone.children())[:-1]))  # Remove the last fully connected layer
        if pretrained:
            # Freeze the backbone so that only the task heads are trained
            for param in self.backbone.parameters():
                param.requires_grad = False
        self.fcs = []
        self.num_tasks = num_tasks
        for i in range(self.num_tasks):
            self.fcs.append(nn.Sequential(
                nn.Linear(512, 128),
                nn.ReLU(),
                nn.Dropout(0.4),
                nn.Linear(128, num_labels_per_task[i]),
                ################################
                # Add more layers if you want! #
                ################################
                # Note: nn.CrossEntropyLoss applies log-softmax internally and
                # expects raw scores, so this Softmax is optional with that loss
                nn.Softmax(dim=1),
            ))
        self.fcs = nn.ModuleList(self.fcs)

    def forward(self, x):
        """
        Forward pass of multi-task model
        Args:
            x: torch.tensor
                Input Data
        Returns:
            outs: list
                Fully connected layer outputs for each task
        """
        x = self.backbone(x)
        x = torch.flatten(x, 1)
        outs = []
        #################################################
        # Collect outputs for each task
        # Fill in missing code below (...),
        # then remove or comment the line below to test your implementation
        raise NotImplementedError("Complete the model!")
        #################################################
        for i in range(...):
            outs.append(...)
        return outs


# Add event to airtable
atform.add_event('Coding Exercise 2.2: Creating a Multi-Task model')
```

To avoid failures caused by many simultaneous downloads, we download the pretrained ResNet weights to a local file.

**Note:** If `pretrained=False` and `load_file` exists, we load the pretrained model from a file. If you want to use a different model, set `load_file` to `None`.

#### Download `resnet18` pretrained

```
# @title Download `resnet18` pretrained
url = "https://osf.io/2kd98/download"
fname = "resnet18-f37072fd.pth"
r = requests.get(url, allow_redirects=True)
with open(fname, 'wb') as fd:
    fd.write(r.content)
```

```
"""
Initialize two models with and without pre-trained weights
"""
model_with_pre_trained_backbone = Multi_task_model(pretrained=False,
                                                   load_file=fname).to(DEVICE)
model_without_pre_trained_backbone = Multi_task_model(pretrained=False).to(DEVICE)
```

Let’s define the loss function and the optimizer. Since each task is a classification task, we will use the cross-entropy loss. For the optimizer, we will use stochastic gradient descent (SGD) with momentum.

```
criterion = nn.CrossEntropyLoss()
# Observe that all parameters are being optimized
optimizer_without_pre_trained_backbone = optim.SGD(
    model_without_pre_trained_backbone.parameters(),
    lr=0.001,
    momentum=0.9
)
optimizer_pre_trained_backbone = optim.SGD(
    model_with_pre_trained_backbone.parameters(),
    lr=0.001,
    momentum=0.9
)
```
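The training helper `train_multi_task` applies the criterion once per task and adds the results into a single scalar before calling `backward()`. The following dependency-free sketch reproduces that arithmetic for one example. The logits are made-up numbers, and `cross_entropy` is a hand-written stand-in for what `nn.CrossEntropyLoss` computes on a single example, not the library implementation.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Hypothetical raw outputs of the four task heads (2, 2, 2 and 5 classes)
logits_per_task = [[2.0, 0.0], [0.5, 0.5], [1.0, -1.0], [0.0, 0.0, 3.0, 0.0, 0.0]]
labels = [0, 0, 0, 2]  # one label per task

task_losses = [cross_entropy(z, y) for z, y in zip(logits_per_task, labels)]
overall_loss = sum(task_losses)  # a single scalar combining all tasks

print([round(loss, 3) for loss in task_losses], round(overall_loss, 3))
```

Summing gives every task equal weight. Weighting the terms differently is a common variation, but it is not what the helper above does.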

## Section 2.3: Training model without pretrained backbone

```
model_without_pre_trained_backbone = train_multi_task(
    model_without_pre_trained_backbone,
    train_loader,
    val_loader,
    criterion,
    optimizer_without_pre_trained_backbone,
    epochs=5, device=DEVICE
)
```

## Section 2.4: Training model with pretrained backbone

```
model_pre_trained_backbone = train_multi_task(
    model_with_pre_trained_backbone,
    train_loader,
    val_loader,
    criterion,
    optimizer_pre_trained_backbone,
    epochs=5, device=DEVICE
)
```

## Section 2.5: Section Summary

We can summarize this section as follows:

- Transfer learning gives the model a better initialization and thus makes learning faster. It can also give better performance than the same model trained from scratch. Below is a table which contains validation accuracies after 5 epochs. Note that your accuracies can be slightly different when you run the training loops. They will also depend on the number of epochs and on the model you have implemented.
- Multi-task learning helps us learn multiple tasks together by sharing knowledge across layers. Task-specific layers then help the model learn task-specific features.

# Section 3: Interpretability

## Video 3: Interpretability

Let’s find the most important feature for cancer prediction and see if it belongs to the set of two most important features for the same problem!

## Section 3.1: Defining the problem setup

Imagine that you want to apply all the knowledge that you have gained from this course to a real health problem, say predicting cancer. You develop an algorithm for predicting cancer using certain input features from patient data.

This is a composite model that takes in three binary variables: does the patient have a family history of cancer (1=yes, 0=no), does the patient smoke (1=yes, 0=no), and is the patient young (age < 40) (1=yes, 0=no).

The algorithm does not favour either of the outcomes, so \(P(\text{cancer}) = P(\text{no cancer}) = 0.5\).

You wish to find the most important feature according to the model and you have the following list of conditional probabilities.
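If the conditional probabilities are given as likelihoods of a feature under each outcome, Bayes' rule with the equal priors stated above converts them into the probability of cancer given the feature. The sketch below shows that arithmetic. The likelihood values are made up for illustration and are not the tutorial's values, and `posterior_cancer` is a hypothetical helper name.

```python
def posterior_cancer(p_feature_given_cancer, p_feature_given_no_cancer, prior_cancer=0.5):
    """Bayes' rule: P(cancer | feature) from the two likelihoods and the prior."""
    joint_cancer = p_feature_given_cancer * prior_cancer
    joint_no_cancer = p_feature_given_no_cancer * (1.0 - prior_cancer)
    return joint_cancer / (joint_cancer + joint_no_cancer)


# Made-up likelihoods for illustration only; these are NOT the tutorial's values
p_smoker_given_cancer = 0.6
p_smoker_given_no_cancer = 0.2

p_cancer_given_smoker = posterior_cancer(p_smoker_given_cancer, p_smoker_given_no_cancer)
print(f"P(cancer | smoker) = {p_cancer_given_smoker:.2f}")
```

With equal priors the prior terms cancel, so the posterior depends only on the ratio of the two likelihoods; a feature that is equally likely under both outcomes leaves the probability at 0.5.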

## Section 3.2: Defining a feature importance metric¶

Let’s define a metric that can help us quantify how important a feature (or a given set of features) is!

### Think! 3.2.1: Think of a suitable feature importance metric!¶

Take a moment to think about feature importance and a simple metric for quantifying it! Remember that the feature importance should depend on how useful the feature is for predicting the outcome *alone*.

#### Student Response¶

```
# @title Student Response
from ipywidgets import widgets
text=widgets.Textarea(
value='Type your answer here and click on `Submit!`',
placeholder='Type something',
description='',
disabled=False
)
button = widgets.Button(description="Submit!")
display(text,button)
def on_button_clicked(b):
    atform.add_answer('q1', text.value)
    print("Submission successful!")
button.on_click(on_button_clicked)
```

Now discuss the following with the group:

Each member of the group should briefly talk about the feature metric they came up with and why they think it is suitable for the given problem setup.

Identify the most important characteristics of a good feature importance metric. (Different members of the group will have used different criteria to select their metrics, so find out which criteria helped the most!) Try to limit yourself to the two most important characteristics.

Note down the most important characteristic here!

#### Student Response¶

```
# @title Student Response
from ipywidgets import widgets
text=widgets.Textarea(
value='Type the most important characteristic and click on `Submit!`',
placeholder='Type something',
description='',
disabled=False
)
button = widgets.Button(description="Submit!")
display(text,button)
def on_button_clicked(b):
    atform.add_answer('q2', text.value)
    print("Submission successful!")
button.on_click(on_button_clicked)
```

Also note down the second most important characteristic here.

#### Student Response¶

```
# @title Student Response
from ipywidgets import widgets
text=widgets.Textarea(
value='Type the second most important characteristic and click on `Submit!`',
placeholder='Type something',
description='',
disabled=False
)
button = widgets.Button(description="Submit!")
display(text,button)
def on_button_clicked(b):
    atform.add_answer('q3', text.value)
    print("Submission successful!")
button.on_click(on_button_clicked)
```

Thank you for coming up with some guidelines for what makes a feature importance metric good!

The metrics you defined are probably great, but to stay consistent throughout the rest of this tutorial we define a simple one here.

Let’s assume that we wish to find the K best features. We know that a set of K features would be most important if they can predict the outcome with high confidence and better than any other set of the same size! We could hence try to predict the outcome using only the given set of features and record the accuracy. If this accuracy is higher than that using any other set of the same size, then we can say that the given set has the most important K features!

In essence, we would want to find the accuracies considering all feature sets of size K and then pick the one with the highest accuracy.
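The search described above can be sketched in a few lines. This is only an illustration under stated assumptions: `best_feature_set` and the `toy_accuracy` table are hypothetical names and numbers introduced here, and the scoring function stands in for whatever accuracy estimate you actually use.

```python
from itertools import combinations


def best_feature_set(features, k, accuracy_fn):
    """Return the size-k feature subset with the highest accuracy.

    accuracy_fn is any callable mapping a tuple of feature names to an
    accuracy in [0, 1]; here it is a stand-in for a real evaluation.
    """
    candidates = combinations(features, k)  # every subset of size k
    return max(candidates, key=accuracy_fn)


# Hypothetical accuracies for illustration only (not the tutorial's values).
toy_accuracy = {
    ("a",): 0.70, ("b",): 0.65, ("c",): 0.60,
    ("a", "b"): 0.72, ("a", "c"): 0.80, ("b", "c"): 0.74,
}

best_single = best_feature_set(["a", "b", "c"], 1, toy_accuracy.get)
best_pair = best_feature_set(["a", "b", "c"], 2, toy_accuracy.get)
print(best_single, best_pair)
```

The search scores every subset of size K, so its cost grows combinatorially with the number of features; it is practical only for small problems like the three-feature example in this section.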

If we wish to define a loss instead (the lower the better), we could simply extend the above notion as follows:

### Think! 3.2.2: Is our feature importance metric good?¶

Now that we have provided a new feature importance metric, think of the following:

What do you think of this new feature importance metric, and how is it different from the one that you came up with?

Does it satisfy the criteria that you came up with?

If not, why? Does the metric need to be modified or the criteria?

#### Student Response¶

```
# @title Student Response
from ipywidgets import widgets
text=widgets.Textarea(
value='Type your answer here and click on `Submit!`',
placeholder='Type something',
description='',
disabled=False
)
button = widgets.Button(description="Submit!")
display(text,button)
def on_button_clicked(b):
    atform.add_answer('q4', text.value)
    print("Submission successful!")
button.on_click(on_button_clicked)
```

## Section 3.3: Finding the most important feature¶

Considering the feature importance metric that we defined above, let’s try to find the most important and the two most important features!

The accuracy can be calculated using the probabilities specified above. Here are the accuracies for different feature sets!

**Note:** We have not detailed the calculation of these accuracies here. If you are curious, see *Bonus 3* for a detailed explanation.

### Think! 3.3.1: Finding important features¶

Given the table above, what is the single most important feature?

#### Student Response¶

```
# @title Student Response
from ipywidgets import widgets
text=widgets.Textarea(
value='Type your answer here and click on `Submit!`',
placeholder='Type something',
description='',
disabled=False
)
button = widgets.Button(description="Submit!")
display(text,button)
def on_button_clicked(b):
    atform.add_answer('q5', text.value)
    print("Submission successful!")
button.on_click(on_button_clicked)
```

Can you also find the two most important features?

#### Student Response¶

```
# @title Student Response
from ipywidgets import widgets
text=widgets.Textarea(
value='Type your answer here and click on `Submit!`',
placeholder='Type something',
description='',
disabled=False
)
button = widgets.Button(description="Submit!")
display(text,button)
def on_button_clicked(b):
    atform.add_answer('q6', text.value)
    print("Submission successful!")
button.on_click(on_button_clicked)
```

The most important feature is not one of the two most important features! Can you find out why this is the case?

## Section 3.4: Section Summary¶

In this section, we discussed the following:

What is Interpretability and why is it important?

How can input features be used to understand the model’s predictions (aka feature importances)?

# Summary¶

Recall the problem that you set out to solve at the beginning of this tutorial.

*Run me to print the problem that you wanted to solve!*

```
# @markdown *Run me to print the problem that you wanted to solve!*
print(Problem_that_you_want_to_solve)
```

Take a moment to think about the following:

Did this tutorial help you find some ideas to solve this problem?

Can you now come up with a solution to this problem by using all the tools that you learnt from this course?

We sincerely hope that your answers to these two questions are yes and YES! Recall all that you learnt from this course and how this knowledge and practical experience can help with a problem that *you* want to solve. Go back and experiment with your solutions and see if you can solve this problem that matters to you!

**Congratulations! You have finished the NMA-DL course!**

## Airtable Submission Link¶

```
# @title Airtable Submission Link
from IPython import display as IPydisplay
IPydisplay.HTML(
f"""
<div>
<a href= "{atform.url()}" target="_blank">
<img src="https://github.com/NeuromatchAcademy/course-content-dl/blob/main/tutorials/static/SurveyButton.png?raw=1"
alt="button link end of day Survey" style="width:410px"></a>
</div>""" )
```

# Bonus 1: Introduction to Meta Learning¶

*Time estimate: ~20mins*

In this section, we will *literally* learn to learn by exploring the concept of Meta Learning. Meta Learning attempts to improve the generalization capabilities of neural networks by teaching them *how to learn new tasks fast*. We aim to introduce you to the following topics:

Meta-Learning and its applications

Few-shot classification (the most common application of Meta-Learning in supervised classification settings)

One-shot Learning with Convolutional Siamese Networks

## Video 4: Meta Learning¶

Meta-Learning is most commonly observed as Few-shot Learning in Supervised learning settings.

Few-shot Learning aims to answer the following question: How can a neural network learn a task well with *very little* data?

In the context of supervised classification, K-shot learning refers to learning with only K examples of each class. An extension of this would be N-way K-shot learning, which attempts to train a network with only K examples of each of the N classes. Let’s take an example of a 5-way 1-shot problem:
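To make the N-way K-shot setup concrete, here is a minimal sketch of how one such episode could be sampled. The function `sample_episode`, its `n_query` parameter and the toy data are assumptions made for illustration; they are not part of the tutorial's helper code.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query=1, seed=None):
    """Sample one N-way K-shot episode.

    data_by_class maps a class label to a list of examples.
    Returns (support, query): lists of (example, label) pairs.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(data_by_class), n_way)  # pick N classes
    support, query = [], []
    for label in classes:
        # Draw K support and n_query query examples without overlap.
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Hypothetical toy data: 8 classes with 4 placeholder "images" each.
toy_data = {f"class_{c}": [f"img_{c}_{i}" for i in range(4)] for c in range(8)}
support, query = sample_episode(toy_data, n_way=5, k_shot=1, seed=0)
print(len(support), len(query))
```

In this 5-way 1-shot episode the support set holds a single labelled example for each of the five classes, and each query example must be assigned to one of those five classes.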

Now that we know what Few-shot learning is, let’s take a look at a One-shot classification problem using Siamese Networks!

## Bonus Section 1.1: Introduction to Omniglot¶

The Omniglot dataset is a standardized benchmark for evaluating the performance of Few-shot Learning algorithms. It contains 1623 different handwritten characters from 50 different alphabets. The alphabets range from well-established international languages like Latin and Korean to lesser known local dialects. Fictitious character sets such as Aurek-Besh and Klingon are also included. Here are a few examples from the Omniglot dataset.

## Bonus Section 1.2: Convolutional Siamese Networks¶

The simplest way to evaluate whether an image belongs to a class is to compare it with other images of the same class. Convolutional Siamese networks help us compare two images and quantify how similar or different they are. This is known as verification.

Verification of two images is done as follows: both images are passed through the same convolutional network to generate *feature vectors* representing them. These feature vectors are then compared for similarity.

How can we tell whether two vectors are similar? By using a distance metric! **L1 distance** is one such metric: it takes the absolute value of the difference between corresponding components of the two vectors. More specifically, given vectors `v1` and `v2`, the component-wise L1 distance would be as follows:

```
l1_distance = abs(v1 - v2)
```

After this, the problem boils down to a binary classification problem where two images either belong to the same class or not!
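The sketch below illustrates that last step in plain Python: the component-wise distance is fed to a single linear unit followed by a sigmoid. The function name `verification_probability`, the weights and the bias are hypothetical values chosen for illustration, not parameters learned by the network in this tutorial.

```python
import math


def verification_probability(v1, v2, weights, bias):
    """Probability that two feature vectors come from the same class."""
    l1_distance = [abs(a - b) for a, b in zip(v1, v2)]  # component-wise
    logit = sum(w * d for w, d in zip(weights, l1_distance)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid turns the logit into a probability


# Hypothetical numbers: negative weights make large distances lower the score.
weights, bias = [-2.0, -2.0, -2.0], 1.0
same = verification_probability([0.2, 0.5, 0.9], [0.2, 0.4, 0.9], weights, bias)
diff = verification_probability([0.2, 0.5, 0.9], [0.9, 0.1, 0.1], weights, bias)
print(round(same, 3), round(diff, 3))
```

Nearby feature vectors produce a probability above 0.5 ("same class") and distant ones a probability below 0.5, which is exactly the binary decision the verification task asks for.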

### Bonus Coding Exercise 1.2: Creating a Convolutional Siamese Network¶

Let’s create a Convolutional Siamese Network! The structure of the network is created in the `ConvSiameseNet` class. You have to compute the L1 distance (`l1_distance`) between the feature vectors (`x1_fv` and `x2_fv`). You can use `torch.abs()` to compute the absolute value of a tensor.

```
# Define the Siamese Network
class ConvSiameseNet(nn.Module):
    """
    Convolutional Siamese Network from "Siamese Neural Networks for One-shot Image Recognition"
    Paper can be found at http://www.cs.toronto.edu/~rsalakhu/papers/oneshot1.pdf
    Structure of the network is as follows:
    nn.Conv2d(1, 64, 10) + pool(F.relu(self.conv1(x))) # First Convolutional + Pooling Block
    nn.Conv2d(64, 128, 7) + pool(F.relu(self.conv2(x))) # Second Convolutional + Pooling Block
    nn.Conv2d(128, 128, 4) + pool(F.relu(self.conv3(x))) # Third Convolutional + Pooling Block
    nn.Conv2d(128, 256, 4) + F.relu(self.conv4(x)) # Fourth Convolutional Layer
    nn.MaxPool2d(2, 2) # Pooling Block
    nn.Linear(256*6*6, 4096) # First Fully Connected Layer
    nn.Linear(4096, 1) # Second Fully Connected Layer
    """

    def __init__(self):
        """
        Initialize convolutional Siamese network parameters

        Args:
          None

        Returns:
          Nothing
        """
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, 10)
        self.conv2 = nn.Conv2d(64, 128, 7)
        self.conv3 = nn.Conv2d(128, 128, 4)
        self.conv4 = nn.Conv2d(128, 256, 4)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(256*6*6, 4096)
        self.fc2 = nn.Linear(4096, 1)

    def model(self, x):
        """
        Defines model structure and flow

        Args:
          x: torch.tensor
            Batch of input images

        Returns:
          x: torch.tensor
            Output of first fully connected layer
        """
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = F.relu(self.conv4(x))
        x = torch.flatten(x, 1)
        x = torch.sigmoid(self.fc1(x))
        return x

    def forward(self, x1, x2):
        """
        Calculates L1 distance between model pass of sample 1 and sample 2

        Args:
          x1: torch.tensor
            Sample 1
          x2: torch.tensor
            Sample 2

        Returns:
          Output from final fully connected layer on receiving L1 distance as input
        """
        x1_fv = self.model(x1)
        x2_fv = self.model(x2)
        ############################################################################
        ## TODO: Calculate the component-wise l1_distance between x1_fv and x2_fv ##
        # Fill out function and remove
        raise NotImplementedError("Student exercise: Calculate l1_distance")
        ############################################################################
        # Calculate L1 distance (as l1_distance) between x1_fv and x2_fv
        l1_distance = ...
        return self.fc2(l1_distance)
```

## Bonus Section 1.3: Verification using Siamese Networks¶

Let’s train the siamese network we created on verification of images. We’ll take pairs of images from the Omniglot dataset and train the network to identify if the pairs belong to the same class or not.

Let’s first get the Omniglot dataset with pairs of images. A pair is labelled “1” if the images are of the same character, and “0” otherwise.

### Download Omniglot dataset with pairs¶

```
# @title Download Omniglot dataset with pairs
import zipfile, os, requests
# original location: https://github.com/brendenlake/omniglot/tree/master/python
os.chdir(CWD)
print(f'Change dir: {os.getcwd()}')
dirname = 'data/omniglot-py/'
if not os.path.exists(dirname):
    os.makedirs(dirname)
fname = 'images_background.zip'
url = "https://osf.io/6hq9u/download"
if not os.path.exists(dirname + 'images_background'):
    print('Downlading the dataset...')
    r = requests.get(url, allow_redirects=True)
    with open(dirname + fname, 'wb') as fd:
        fd.write(r.content)
    with zipfile.ZipFile(dirname + fname, 'r') as zip_ref:
        zip_ref.extractall(dirname)
    print('Dataset is downloaded.')
else:
    print('Dataset has already been downloaded.')
```

```
Change dir: /home/runner/work/course-content-dl/course-content-dl/tutorials/W3D4_ContinualLearning/student
Downlading the dataset...
```

```
Dataset is downloaded.
```

```
"""
Load train and validation datasets for training the Siamese Network
"""
train_dataset, val_dataset = get_train_val_datasets(background_dataset_size=10000, val_split=0.2, download=False)
```

```
"""
Visualize a sample from the training dataset
"""
# Change this to visualize another sample from the dataset
sample_idx = 1
sample = train_dataset[sample_idx]
visualize_siamese_sample(sample)
```

```
-------------------- Label: 1.0 (Same character) -------------------
```

```
"""
Create dataloaders for the train and validation datasets
"""
# Change this for a different batch size
batch_size = 16
train_loader = DataLoader(
dataset=train_dataset,
batch_size=batch_size,
num_workers=2,
worker_init_fn=seed_worker,
generator=g_seed
)
val_loader = DataLoader(
dataset=val_dataset,
batch_size=batch_size,
num_workers=2,
worker_init_fn=seed_worker,
generator=g_seed
)
```

Now that we have the necessary dataloaders, let’s train a siamese network!

```
"""
Train the network on Omniglot
"""
siamese_net = ConvSiameseNet()
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(siamese_net.parameters(), lr=0.0001)
# Change this to train for any number of epochs
num_epochs = 10
for epoch in range(num_epochs):
    print(f"\n Epoch {epoch + 1} / {num_epochs} ================================")
    train_siamese_network(siamese_net, criterion, optimizer, train_loader, DEVICE)
    evaluate_siamese_network(siamese_net, criterion, val_loader, DEVICE)
```

## Bonus Section 1.4: One-shot Classification with Siamese Networks¶

Now that we have a siamese network trained on verification, let’s try to extend it to one-shot classification. How can we do this?

Let’s first try to understand what the current network offers: Given two images, find out if they belong to the same class or not.

What does N-way one-shot classification ask for? Given one image from each of the N classes (the support set), find out which class the query image belongs to.

Take a moment to think about how the verification network can be extended to work in the one-shot setting.

Given a query image, compare it with different images in the support set using our siamese network. Pick the class that is most similar to the query image!
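The idea can be sketched as follows. This is a simplified illustration, not the `evaluate_one_shot` helper used later: `one_shot_classify` and `toy_similarity` are hypothetical names, and the toy similarity function merely stands in for the score produced by a trained verification network.

```python
def one_shot_classify(query, support_set, similarity):
    """Return the label of the support image most similar to the query.

    support_set is a list of (image, label) pairs, one per class.
    similarity(a, b) returns a score that is higher for more similar inputs,
    e.g. the logit produced by a trained verification network.
    """
    best_image, best_label = max(
        support_set, key=lambda pair: similarity(query, pair[0])
    )
    return best_label


# Hypothetical stand-in for the network: negative summed absolute difference.
def toy_similarity(a, b):
    return -sum(abs(x - y) for x, y in zip(a, b))


support_set = [((0.0, 0.0), "A"), ((1.0, 1.0), "B"), ((5.0, 5.0), "C")]
predicted = one_shot_classify((0.9, 1.2), support_set, toy_similarity)
print(predicted)
```

Because the classifier only ranks verification scores, the network never has to be retrained when new classes appear; one support image per new class is enough.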

Let’s get the test dataset where one sample corresponds to a query image and N support images!

### Download the test dataset¶

```
# @title Download the test dataset
import zipfile, os, requests
# original location: https://github.com/brendenlake/omniglot/tree/master/python
dirname = 'data/omniglot-py/'
if not os.path.exists(dirname):
    os.makedirs(dirname)
fname = 'images_evaluation.zip'
url = "https://osf.io/uq4gw/download"
if not os.path.exists(dirname + 'images_evaluation'):
    print('Downlading the dataset...')
    r = requests.get(url, allow_redirects=True)
    with open(dirname + fname, 'wb') as fd:
        fd.write(r.content)
    with zipfile.ZipFile(dirname + fname, 'r') as zip_ref:
        zip_ref.extractall(dirname)
    print('Dataset is downloaded.')
else:
    print('Dataset has already been downloaded.')
```

```
Downlading the dataset...
```

```
Dataset is downloaded.
```

```
"""
Load the test dataset for N-way One-shot classification
"""
# Change this to change the number of classes in the support set
n_ways = 5
test_dataset = get_test_dataset(n_ways=n_ways, download=True)
```

```
Files already downloaded and verified
```

```
"""
Visualize a sample from the test dataset
"""
# Change this to visualize another sample from the dataset
sample_idx = 1
sample = test_dataset[sample_idx]
visualize_one_shot_sample(sample)
```

We are in the endgame now. Let’s look at how our siamese network performs on N-way One-shot classification!

```
"""
Evaluate our Siamese Network on N-way One-shot classification
"""
batch_size = 16
test_loader = DataLoader(
dataset=test_dataset,
batch_size=batch_size,
num_workers=2,
worker_init_fn=seed_worker,
generator=g_seed
)
evaluate_one_shot(siamese_net, test_loader, DEVICE)
```

Congratulations! We successfully solved our first Meta-Learning problem. Feel free to run through this section again with different parameters and compare the differences!

## Bonus Section 1.5: Section Summary¶

In this section, we learnt the following (try to answer these questions to ensure that you understood the content fairly well):

What is Meta-Learning and how is it used to solve various problems with Deep Learning?

What is Few-shot Learning? We looked at what N-way K-shot classification is, using the Omniglot dataset and Convolutional Siamese Networks.

# Bonus 2: Continual and Life-Long Learning¶

## Video 5: Continual and Life-Long Learning¶

## Bonus Think! 2: Why are Continual and Life-Long Learning tough?¶

Take a moment to think about the following:

You have had an extensive tutorial on Continual Learning. Can you quickly recall the different problems with Continual Learning and how these problems were solved?

What do you think makes Life-Long Learning tough to implement? How is this learning paradigm different from those that you have seen before?

# Bonus 3: Feature Importance Tutorial¶

If you wish to understand how the accuracies are calculated in Section 3.3, continue reading!

Recall the problem setup:

There exist binary-valued random variables \(X^{(1)}, X^{(2)}, X^{(3)}, Y \in \{0,1\}\) such that \(X^{(1)}, X^{(2)}\), and \(X^{(3)}\) are conditionally independent (given \(Y\)).

Let \(P(Y=1) = 1/2\). Then the joint distribution of \(X^{(1)}, X^{(2)}, X^{(3)}, Y\) is specified by the conditional probabilities \(P(X^{(i)}=1 | Y=0)\) and \(P(X^{(i)}=1 | Y=1)\), for \(i=1,2,3\):

## Define a feature importance metric¶

Here we’ll use 0-1 loss to define feature importance. We can imagine using \(X^{(1)}, X^{(2)}\), and \(X^{(3)}\) to predict \(Y\). \(Y\) can only be 0 or 1, so each prediction must be 0 or 1 and is either right or wrong (i.e., a perfect fit for 0-1 loss).

## Calculate the individual feature importance (0-1 loss) for \(X^{(1)}\), \(X^{(2)}\), and \(X^{(3)}\)¶

**Hint:** You can accomplish this by finding the probability that X and Y match. You’re given \(P(X^{(1)}=1|Y=0)\), but you need \(P(X^{(1)}=0|Y=0)\) . Recall that \(P(Y)=1/2\) and compute an equally weighted average of the two accuracies you can compute given the two possible values of \(X^{(1)}\) .

**Solution:** If \(P(X^{(1)} = 1|Y = 0)=0.10\), then \(P(X^{(1)} = 0|Y = 0)=0.90\).

There’s a symmetry we can take advantage of here: we know that 90% of the time \(Y=1\) we have \(X^{(1)}=1\), which (because \(P(Y)=1/2\) and \(P(X^{(1)}=1|Y=0)=0.10\)) implies that 90% of the time \(X^{(1)}=1\) we have \(Y=1\), which is a more natural way of thinking about prediction.

This means we predict correctly 90% of the time when \(Y=0\), and we predict correctly 90% of the time when \(Y=1\). Since \(P(Y)=1/2\), we have 0-1 loss \(L = 1-(0.5 \times 0.9 + 0.5 \times 0.9)=0.1\).
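The arithmetic above can be checked with a short sketch. The function name `loss_predict_y_equals_x` is introduced here for illustration; it assumes the simple rule "predict \(\hat{Y} = X\)" used in the solution, not a general classifier.

```python
def loss_predict_y_equals_x(p_x1_given_y0, p_x1_given_y1, p_y1=0.5):
    """0-1 loss of the rule "predict Y = X" for one binary feature X."""
    acc_y0 = 1.0 - p_x1_given_y0  # correct when Y=0 and X=0
    acc_y1 = p_x1_given_y1        # correct when Y=1 and X=1
    # Weight the two conditional accuracies by the class priors.
    accuracy = (1.0 - p_y1) * acc_y0 + p_y1 * acc_y1
    return 1.0 - accuracy


# Values quoted in the solution: P(X1=1|Y=0) = 0.10 and P(X1=1|Y=1) = 0.90.
loss_x1 = loss_predict_y_equals_x(0.10, 0.90)
print(round(loss_x1, 4))
```

The same function can be applied to the other two features by plugging in their conditional probabilities from the problem setup.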

In total, we have:

**Check:** If you did not observe that feature \(X^{(1)}\) is the most important feature by virtue of having the smallest 0-1 loss (0.1), check your calculations before proceeding.

### Calculate the importance of feature pairs (0-1 loss) for \(\{X^{(1)}, X^{(2)}\}\), \(\{X^{(1)}, X^{(3)}\}\), and \(\{X^{(2)}, X^{(3)}\}\)¶

### Let’s start with \(\{X^{(1)}, X^{(2)}\}\)¶

Working out these calculations is a bit tedious, so we’ll provide various tables and descriptions of their origins.

First, we compute conditional PMFs \(P\left( X^{(1)}, X^{(2)}∣Y \right)\) by multiplying the provided conditionally independent probabilities together.

**Conditional PMFs:**

We can then obtain a marginal PMF for \(\{X^{(1)}, X^{(2)}\}\) by element-wise averaging these two 2x2 tables because \(P(Y)=1/2\).

**Marginal PMF:**

We now use the conditional PMFs to evaluate the conditional probability expressions we care about: \(P\left( Y∣X^{(1)}, X^{(2)} \right)\).

**Conditional Probabilities: \(P \left( Y=\{0, 1\}|X^{(1)},X^{(2)} \right)\)**

Using \(P(Y=0|X^{(1)},X^{(2)}) = \frac{P(X^{(1)},X^{(2)} | Y=0)}{P(X^{(1)},X^{(2)}|Y=0) + P(X^{(1)},X^{(2)}|Y=1)}\) (the equal priors \(P(Y)=1/2\) cancel), we obtain the following matrix:

Using \(P(Y=1|X^{(1)},X^{(2)}) = \frac{P(X^{(1)}, X^{(2)} | Y=1)}{P(X^{(1)},X^{(2)}|Y=0) + P(X^{(1)},X^{(2)}|Y=1)}\), we obtain the following matrix:

We now implement a 0-1 classifier that predicts \(\hat{Y}=0\) or \(\hat{Y}=1\) based on the element-wise maximum in these two tables. For example, when \(X^{(1)}=0\) and \(X^{(2)}=0\), there’s a 0.977 probability that \(Y=0\) and a 0.023 probability that \(Y=1\). We would of course predict \(\hat{Y}=0\), and we expect to be correct 97.7% of the time.

We now know the expected probability of success for our classifier for each \(\{X^{(1)}, X^{(2)}\}\) combination.
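The chain of tables described so far can be reproduced with the sketch below. One caveat: only \(P(X^{(1)}=1|Y)\) is quoted in the text of this section. The values used for \(X^{(2)}\) are back-solved from the marginal PMF that appears in the solution cell below, so treat them as an assumption and check them against the table in the problem setup.

```python
# P(X=1 | Y=y) for y = 0, 1. X1 is quoted in the text; the X2 values are
# back-solved from the marginal PMF in the solution cell (an assumption).
p_x1 = {0: 0.10, 1: 0.90}
p_x2 = {0: 0.05, 1: 0.80}


def bernoulli(p_one):
    """Return [P(X=0), P(X=1)] for a binary variable."""
    return [1.0 - p_one, p_one]


def conditional_pmf(y):
    """2x2 table P(X1=i, X2=j | Y=y) under conditional independence."""
    row, col = bernoulli(p_x1[y]), bernoulli(p_x2[y])
    return [[row[i] * col[j] for j in (0, 1)] for i in (0, 1)]


pmf_y0, pmf_y1 = conditional_pmf(0), conditional_pmf(1)
# P(Y=0) = P(Y=1) = 1/2, so the marginal is the element-wise average.
marginal = [[0.5 * (pmf_y0[i][j] + pmf_y1[i][j]) for j in (0, 1)]
            for i in (0, 1)]
# Posterior P(Y=0 | X1=i, X2=j); the equal priors cancel in Bayes' rule.
posterior_y0 = [[pmf_y0[i][j] / (pmf_y0[i][j] + pmf_y1[i][j]) for j in (0, 1)]
                for i in (0, 1)]
# Probability of being right when predicting the more likely label.
max_cond_prob = [[max(p, 1.0 - p) for p in row] for row in posterior_y0]

print([[round(v, 4) for v in row] for row in marginal])
print([[round(v, 4) for v in row] for row in max_cond_prob])
```

Rows index \(X^{(1)}\) and columns index \(X^{(2)}\), matching the layout of the arrays in the solution cell.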

**Maximum Conditional Probability:**

Only one task remains for you to complete: compute the expected 0-1 loss for this classifier.

**Hint:** We know how likely we are to be correct given some \(\{X^{(1)}, X^{(2)}\}\) combination. We also know how likely those combinations are to appear.

**Solution:**

```
marginal = np.array([[0.4375, 0.0625],
[0.1375, 0.3625]])
max_cond_prob = np.array([[0.977, 0.64],
[0.6545, 0.993]])
# We need the sum-product of these two arrays.
# (1) Element-wise multiplication to produce one array.
# (2) Sum the elements of the resulting array
accuracy = np.sum(np.multiply(marginal, max_cond_prob))
print(accuracy)
```

```
0.9173937499999999
```

```
# 0-1 loss is 1 - accuracy.
print(1 - accuracy)
```

```
0.0826062500000001
```

**Check:** You should have gotten 0-1 loss for \(\{X^{(1)}, X^{(2)}\}\) to be \(0.0826\).

### Let’s move on to \(\{X^{(2)}, X^{(3)}\}\)¶

Our claim was that the most important single feature, already found to be \(X^{(1)}\), is not in the most important pair of features, so we need to show that the 0-1 loss for \(\{X^{(2)}, X^{(3)}\}\) is less than 0.0826 (and, by an analogous calculation, less than the loss for \(\{X^{(1)}, X^{(3)}\}\)).

First, we compute conditional PMFs \(P \left( X^{(2)}, X^{(3)}∣Y \right)\) by multiplying conditionally independent probabilities together.

**Conditional PMFs:**

We can then obtain a marginal PMF for \(\{X^{(2)}, X^{(3)}\}\) by element-wise averaging these two 2x2 tables because \(P(Y)=1/2\).

**Marginal PMF:**

We now use the conditional PMFs to evaluate the conditional probability expressions we care about: \(P \left( Y∣X^{(2)}, X^{(3)} \right)\).

**Conditional Probabilities: \(P \left( Y=\{0, 1\}|X^{(2)}, X^{(3)} \right)\)**

Using \(P(Y=0|X^{(2)},X^{(3)}) = \frac{P(X^{(2)},X^{(3)} | Y=0)}{P(X^{(2)}, X^{(3)}|Y=0) + P(X^{(2)}, X^{(3)}|Y=1)}\) (the equal priors \(P(Y)=1/2\) cancel), we obtain the following matrix:

Using \(P(Y=1|X^{(2)},X^{(3)}) = \frac{P(X^{(2)},X^{(3)} | Y=1)}{P(X^{(2)}, X^{(3)}|Y=0) + P(X^{(2)}, X^{(3)}|Y=1)}\), we obtain the following matrix:

We now implement a 0-1 classifier that predicts \(\hat{Y}=0\) or \(\hat{Y}=1\) based on the element-wise maximum in these two tables. For example, when \(X^{(2)}=0\) and \(X^{(3)}=0\), there’s a 0.942 probability that \(Y=0\) and a 0.058 probability that \(Y=1\). We would of course predict \(\hat{Y}=0\), and we expect to be correct 94.2% of the time.

We now know the expected probability of success for our classifier for each \(\{X^{(2)}, X^{(3)}\}\) combination.

**Maximum Conditional Probability:**

Only one task remains for you to complete: compute the expected 0-1 loss for this classifier.

**Hint:** We know how likely we are to be correct given some \(\{X^{(2)}, X^{(3)}\}\) combination. We also know how likely those combinations are to appear.

**Solution:**

```
marginal = np.array([[0.49925, 0.07575],
[0.14075, 0.28425]])
max_cond_prob = np.array([[0.941912869, 0.937293729],
[0.824156306,0.999120493]])
# We need the sum-product of these two arrays.
# (1) Element-wise multiplication to produce one array.
# (2) Sum the elements of the resulting array
accuracy = np.sum(np.multiply(marginal, max_cond_prob))
print(accuracy)
```

```
0.9412500000247499
```

```
# 0-1 loss is 1 - accuracy.
print(1 - accuracy)
```

```
0.0587499999752501
```

**Check:** You should have gotten 0-1 loss for \(\{X^{(2)}, X^{(3)}\}\) to be \(0.05875\).
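As a cross-check, the whole procedure can be condensed into one function that works for any feature subset. This is a sketch under assumptions: the name `zero_one_loss` is introduced here, and while \(P(X^{(1)}=1|Y)\) is quoted in the text, the values for \(X^{(2)}\) and \(X^{(3)}\) are back-solved from the marginal tables in the solution cells above, so verify them against the problem setup before relying on the output.

```python
from itertools import combinations, product

# P(X_i = 1 | Y = y) as (y=0, y=1). X1 is quoted in the text; X2 and X3 are
# back-solved from the marginal tables above and should be read as assumptions.
cond = {1: (0.10, 0.90), 2: (0.05, 0.80), 3: (0.01, 0.71)}


def zero_one_loss(subset, p_y1=0.5):
    """Expected 0-1 loss of the best classifier that sees only `subset`."""
    loss = 0.0
    for values in product((0, 1), repeat=len(subset)):
        joint = [1.0 - p_y1, p_y1]  # becomes P(values, Y=y) for y = 0, 1
        for feature, v in zip(subset, values):
            for y in (0, 1):
                p_one = cond[feature][y]
                joint[y] *= p_one if v == 1 else 1.0 - p_one
        # Predicting the likelier label errs with the smaller joint mass.
        loss += min(joint)
    return loss


for k in (1, 2):
    for subset in combinations(cond, k):
        print(subset, round(zero_one_loss(subset), 5))
```

With these inputs the pair \(\{X^{(1)}, X^{(2)}\}\) comes out at 0.0825 rather than 0.0826; the small gap comes from the rounded entries in the maximum-conditional-probability table used in the solution cell.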

## Conclusion¶

We demonstrated that our intuition can be misleading: a feature that ranks highest on its own does not necessarily belong in the best set once larger sets of features are considered.

Something to think about: What does this tell you about stepwise feature selection approaches?