# Tutorial 2: Deep Learning Thinking 1: Cost Functions¶

Week 2, Day 2: Convnets and DL Thinking

Content creators: Konrad Kording, Lyle Ungar, Ashish Sahoo

Content reviewers: Kelson Shilling-Scrivo

Content editors: Kelson Shilling-Scrivo

Production editors: Gagana B, Spiros Chavlis

# Tutorial Objectives¶

In this tutorial, you will practice thinking like a deep learning practitioner and determine how to design cost functions for different scenarios.

By the end of this tutorial, you will be better able to:

• Appreciate the importance of cost function engineering

• Translate domain knowledge into cost functions

# Setup¶

## Install dependencies¶

# @title Install dependencies

from evaltools.airtable import AirtableForm


# Section 1: Intro to Deep Learning Thinking¶

## Video 1: Intro to DL Thinking¶

This tutorial is a bit different from others - there will be no coding exercises! Instead, you will watch a series of vignettes about various scenarios where you want to use a neural network. This tutorial focuses on cost functions; a later tutorial in the course will be similar but will focus on designing architectures.

Each section below will start with a vignette where either Lyle or Konrad is trying to figure out how to set up a neural network for a specific problem. Try to think of questions you want to ask them as you watch, then pay attention to what questions Lyle and Konrad are asking. Were they what you would have asked? How do their questions help quickly clarify the situation?

You will work together as a group to try to come up with cost functions for each example, with hints available along the way. This may be difficult - deep learning in the real world often is! So try your best but don’t get discouraged if you don’t reach the solution - you’ll learn a lot from the process of trying to.

You have already seen cost functions (sometimes also called objective functions or loss functions) for deep neural networks - you need one to perform gradient descent and train a neural network. It turns out that the cost function you choose to minimize is incredibly important - it is how you define success for your network, after all, so you want to define success in a good way! And cost functions are not one size fits all - you need to choose them carefully according to what you want your neural network to do, as you will see in the following scenarios.

# Section 2: Cost function for neurons¶

## Video 3: Spiking Neuron Predictions Set-up¶

Konrad, a neuroscientist, wants to predict what neurons in someone’s motor cortex are doing while they are riding a motorcycle.

Upon discussion with Lyle, it emerges that we have data on 12 parameters of motorcycle riding, including acceleration, angle, braking, and degrees of leaning. These inputs are fairly smooth over time: the angle of the motorcycle typically does not change much in 100 ms, for example.

We also have recorded data on the timing of spikes of $$N$$ neurons in motor cortex. The underlying firing rate is smooth but every millisecond spikes are random and independent. This means we can assume that the number of spikes in a short interval can be modeled using a Poisson distribution with an underlying firing rate for that interval $$\lambda$$.

For neuron $$i$$, the probability of seeing $$k_{i}$$ spikes in some interval given an underlying firing rate $$\lambda_{i}$$ is:

$$f(k_{i}; \lambda_{i}) = \Pr(X=k_{i}) = \frac{\lambda_{i}^{k_{i}}e^{-\lambda_{i}}}{k_{i}!} \tag{68}$$

So this Poisson distribution may be relevant if we want a good model for the spiking of neurons.

## Think! 1: Designing a cost function to predict neural activities¶

Given everything you know, how would you design a cost function for a neural network that Konrad is training to predict neural activity given the motorcycle riding parameters? Remember that we are predicting the activity of all $$N$$ neurons, not just one. Try to write out an equation!

Please discuss as a group. If you get stuck, you can uncover the hints below one at a time. Please spend some time discussing before uncovering the next hint, though! You are being real deep learning scientists now, and the answers won’t be easy.

### Student Response¶

# @title Student Response
from IPython.display import display
from ipywidgets import widgets

# Free-text box for the group's answer
text = widgets.Textarea(
    value='Type your answer here and click on Submit!',
    placeholder='Type something',
    description='',
    disabled=False
)

button = widgets.Button(description="Submit!")

display(text, button)

def on_button_clicked(b):
    # Confirm the click; the text itself is not sent anywhere here
    print("Submission successful!")

button.on_click(on_button_clicked)


You get time-stamps for the spikes. You will want to do binning into 50 ms bins. You get $$k_{i, t}$$ for every neuron $$i$$ and time bin $$t$$, the spike count for that neuron in that time bin. What will the neural network predict?

For each bin, you can use your neural network model to predict an estimate of $$\lambda_{i,t}$$, the expected number of spikes for neuron $$i$$ in time bin $$t$$. The network should get as input the relevant aspects of the motorcycle riding at the relevant times (and potentially at previous times).

You need an equation relating $$\lambda_{i,t}$$ (the model prediction) with $$k_{i, t}$$ (your data) where changing $$\lambda_{i,t}$$ to minimize or maximize the number resulting from this equation results in better predictions. What do we already know about the relationship between $$\lambda_{i,t}$$ and $$k_{i, t}$$ that helps us here?

Once you have that, how do you extend to incorporate all neurons and time bins?

We can treat the bins independently as the spikes are random and independent every millisecond.

First, we will convert our spike timing data to the number of spikes per time bin for time bins of size 50 ms. This gives us $$k_{i,t}$$ for every neuron $$i$$ and time bin $$t$$.
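As a minimal sketch of this binning step, assuming the spike times arrive as one array of millisecond time-stamps per neuron (the function name `bin_spikes`, its arguments, and the toy data are hypothetical, not part of the tutorial):

```python
import numpy as np

def bin_spikes(spike_times_ms, duration_ms, bin_ms=50):
    """Count spikes per time bin for each neuron.

    spike_times_ms: list with one 1-D array of spike times (in ms) per neuron.
    Returns an integer array k of shape (N, T), where k[i, t] is the spike
    count of neuron i in time bin t.
    """
    # Bin edges at 0, 50, 100, ... covering the whole recording
    edges = np.arange(0, duration_ms + bin_ms, bin_ms)
    return np.stack([np.histogram(times, bins=edges)[0] for times in spike_times_ms])

# Hypothetical toy data: 2 neurons recorded for 200 ms
spike_times_ms = [np.array([3.0, 12.5, 49.0, 120.0]),
                  np.array([60.0, 61.0, 199.0])]
k = bin_spikes(spike_times_ms, duration_ms=200)
print(k)
# [[3 0 1 0]
#  [0 2 0 1]]
```

Each row of `k` is one neuron and each column is one 50 ms bin, matching the indices of $$k_{i,t}$$.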

We are assuming a Poisson distribution for our spiking. That means that we get the probability of seeing spike count $$k_{i, t}$$ given underlying firing rate $$\lambda_{i, t}$$ using this equation:

$$f(k_{i,t}; \lambda_{i,t}) = \Pr(X=k_{i,t}) = \frac{\lambda_{i,t}^{k_{i,t}}e^{-\lambda_{i,t}}}{k_{i,t}!} \tag{69}$$

That seems a pretty good thing to optimize to make our predictions as good as possible! We want a high probability of seeing the actual spike count we recorded given the neural network prediction of the underlying firing rate.

We will make this negative later so that we have a quantity to minimize rather than maximize, which lets us use all our usual minimization tricks. First though, let’s scale up to include all our neurons and time bins.

We can treat each time bin as independent because, while the underlying probability of firing changes slowly, every millisecond spiking is random and independent. From probability, we know that we can compute the probability of a set of independent events (all the spike counts) by multiplying the probabilities of each event. So the probability of seeing all of our data given the neural network predictions is the product of the probabilities of $$k_{i,t}$$ over all $$N$$ neurons and all $$T$$ time bins:

\begin{align} \Pr(\text{all data}) &= \prod_{i=1}^{N}\prod_{t=1}^{T} \Pr(X=k_{i,t}) \\ &= \prod_{i=1}^{N}\prod_{t=1}^{T} \frac{\lambda_{i,t}^{k_{i,t}}e^{-\lambda_{i,t}}}{k_{i,t}!} \tag{70} \end{align}

This is also known as our likelihood!

We usually use the log likelihood instead of the likelihood when minimizing or maximizing, for numerical reasons. We can convert the above equation to log likelihood:

\begin{align} \text{log likelihood} &= \sum_{i=1}^{N}\sum_{t=1}^{T} \log \Pr(X=k_{i,t}) \\ &= \sum_{i=1}^{N}\sum_{t=1}^{T} \left[ k_{i,t} \log(\lambda_{i,t}) - \lambda_{i,t} - \log(k_{i,t}!) \right] \tag{71} \end{align}
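The numerical motivation for working with logs can be seen in a small experiment. The sketch below assumes a hypothetical recording of 100 neurons and 1000 time bins with a constant rate of 2 spikes per bin; these sizes and the rate are illustrative choices, not values from the tutorial.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 100 neurons, 1000 time bins, rate of 2 spikes per bin
N, T = 100, 1000
lam = np.full((N, T), 2.0)
k = rng.poisson(lam)

# log(k!) computed as lgamma(k + 1), which avoids huge factorials
log_k_factorial = np.vectorize(math.lgamma)(k + 1.0)

# Log probability of each observed count under the Poisson model
log_p = k * np.log(lam) - lam - log_k_factorial

likelihood = np.prod(np.exp(log_p))  # product of 100,000 probabilities
log_likelihood = np.sum(log_p)       # sum of 100,000 log probabilities

print(likelihood)      # underflows to 0.0 in float64
print(log_likelihood)  # finite, large negative number
```

The product of many numbers smaller than 1 underflows to zero, so it carries no information for optimization, while the sum of logs stays well within floating-point range.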

And last but not least, we want to make it negative so we can minimize instead of maximize:

$$\text{negative log likelihood} = \sum_{i=1}^{N}\sum_{t=1}^{T} \left[ -k_{i,t} \log(\lambda_{i,t}) + \lambda_{i,t} + \log(k_{i,t}!) \right] \tag{72}$$
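A minimal NumPy sketch of this negative log likelihood may help make it concrete. The function name `poisson_nll` and the toy counts and rates are hypothetical; in practice the rates would come from the neural network and the computation would live in an autograd framework.

```python
import math
import numpy as np

def poisson_nll(k, lam):
    """Negative log likelihood of spike counts k under Poisson rates lam.

    k and lam are arrays of shape (N, T): neurons by time bins.
    Rates must be strictly positive.
    """
    k = np.asarray(k, dtype=float)
    lam = np.asarray(lam, dtype=float)
    log_k_factorial = np.vectorize(math.lgamma)(k + 1.0)  # log(k!)
    return float(np.sum(-k * np.log(lam) + lam + log_k_factorial))

# Hypothetical spike counts for 2 neurons and 4 time bins
k = np.array([[3, 0, 1, 0],
              [0, 2, 0, 1]])

# Predicted rates that roughly track the counts vs. rates that ignore them
good = np.array([[2.5, 0.2, 1.0, 0.3],
                 [0.1, 2.0, 0.2, 1.0]])
bad = np.full(k.shape, 5.0)

print(poisson_nll(k, good))  # smaller cost
print(poisson_nll(k, bad))   # larger cost
```

Note that the $$\text{log}(k_{i,t}!)$$ term does not depend on the prediction $$\lambda_{i,t}$$, so it shifts the cost by a constant and does not affect the gradients.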

### Video 4: Spiking Neurons Wrap-up¶

Check out the papers mentioned in the above video:

## (Bonus) Think!: Non-Poisson neurons¶

If you have time, discuss the following: the spiking distributions don’t seem quite Poisson. Find a good replacement for your cost function.

# Section 3: How can an ANN know its uncertainty¶

## Video 6: ANN Uncertainty Set-up¶

Lyle wants to build an artificial neural network that has a measure of its own uncertainty about its predictions. He wants the neural network to give a prediction/estimate and an uncertainty, or standard deviation, measurement on it.

Let’s say Lyle wants to estimate the location of an atom in a chemical molecule based on various inputs. He wants to have the estimate of the location and an estimate of the variance. We don’t train neural networks on one data point at a time though - he wants a cost function that takes in N data points (input and atom location pairings).

We think we may be able to use a Gaussian distribution to help Lyle here:

$$g(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} \right) \tag{73}$$

## Think! 2: Designing a cost function so we measure uncertainty¶

Given everything you know, how would you design a cost function for a neural network that Lyle is training so that he can get the estimate and the uncertainty of the estimate? Try to write out an equation!

Please discuss as a group. If you get stuck, you can uncover the hints below one at a time. Please spend some time discussing before uncovering the next hint, though! You are being real deep learning scientists now, and the answers won’t be easy.

### Student Response¶

# @title Student Response
from IPython.display import display
from ipywidgets import widgets

# Free-text box for the group's answer
text = widgets.Textarea(
    value='Type your answer here and click on Submit!',
    placeholder='Type something',
    description='',
    disabled=False
)

button = widgets.Button(description="Submit!")

display(text, button)

def on_button_clicked(b):
    # Confirm the click; the text itself is not sent anywhere here
    print("Submission successful!")

button.on_click(on_button_clicked)


Look at the Gaussian equation. Which term is the true location? Which is the estimate of the location? Which is the uncertainty?

What do you want the neural network to predict for one data point (recorded location) given the inputs?

What did you learn from working through Section 2 that you can use here?

In Section 2, you learned that you want to go from probabilities to negative log likelihoods to form cost functions.

For a given set of inputs, we want the neural network to predict the location of the atom and the uncertainty of that estimate. Standard deviation is a great measure of uncertainty so we can predict the mean and standard deviation of the location (instead of just the mean as is more common).

So how do we design a cost function that involves the mean and standard deviation? We can assume a Gaussian distribution over the location. The neural network can predict the mean of that Gaussian (that’s the estimate of the location) and the standard deviation of that Gaussian (that’s the uncertainty measure) for a given set of inputs.

Now that we’ve got that figured out, we can take a very similar approach to what we did in Section 2 with spiking neurons. For a given data point $$i$$, the neural network predicts the mean ($$\mu_i$$) and standard deviation ($$\sigma_i$$) of the location given the inputs. We can then compute the probability of seeing the actual recorded location ($$x_i$$) given these predictions:

$$g(x_i) = \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left( -\frac{1}{2}\frac{(x_i-\mu_i)^2}{\sigma_i^2} \right) \tag{74}$$

The location of the atom is independent in each data point so we can get the overall likelihood by multiplying the probabilities for the individual data points.

$$\text{likelihood} = \prod_{i=1}^N\frac{1}{\sigma_i\sqrt{2\pi}} \exp\left( -\frac{1}{2}\frac{(x_i-\mu_i)^2}{\sigma_i^2} \right) \tag{75}$$

And, as before, we want to take the log of this for numerical reasons and convert to negative log likelihood:

$$\text{negative log likelihood} = -\sum_{i=1}^N \log \left( \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left( -\frac{1}{2}\frac{(x_i-\mu_i)^2}{\sigma_i^2} \right) \right) = \sum_{i=1}^N \left[ \log\left(\sigma_i\sqrt{2\pi}\right) + \frac{(x_i-\mu_i)^2}{2\sigma_i^2} \right] \tag{76}$$

Changing the parameters of the neural network so it predicts $$\mu_i$$ and $$\sigma_i$$ that minimize this equation will give us (hopefully fairly accurate) predictions of the location and the network uncertainty about the location!
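To see how this cost trades off accuracy against stated uncertainty, here is a minimal NumPy sketch of the negative log of the Gaussian likelihood, written as a sum of $$\text{log}(\sigma_i\sqrt{2\pi})$$ and the scaled squared error. The function name `gaussian_nll` and the toy values are hypothetical, and a one-dimensional location is assumed for simplicity.

```python
import numpy as np

def gaussian_nll(x, mu, sigma):
    """Negative log likelihood of observations x under N(mu_i, sigma_i^2).

    x, mu, sigma are 1-D arrays of length N; sigma must be strictly positive.
    """
    x, mu, sigma = (np.asarray(a, dtype=float) for a in (x, mu, sigma))
    return float(np.sum(np.log(sigma * np.sqrt(2 * np.pi))
                        + 0.5 * ((x - mu) / sigma) ** 2))

# Hypothetical recorded locations and predicted means
x = np.array([1.0, 2.0, 3.0])
mu = np.array([1.1, 1.9, 3.5])

overconfident = gaussian_nll(x, mu, sigma=np.full(3, 0.01))
calibrated = gaussian_nll(x, mu, sigma=np.full(3, 0.3))
underconfident = gaussian_nll(x, mu, sigma=np.full(3, 10.0))

print(overconfident, calibrated, underconfident)
```

With the same predicted means, the cost is lowest when the predicted standard deviation is on the scale of the actual errors: a tiny $$\sigma_i$$ is punished heavily by the squared-error term, and a huge $$\sigma_i$$ is punished by the log term.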

### Video 7: ANN Uncertainty Wrap-up¶

Check out the papers mentioned in the above video:

## (Bonus) Think!: Negative standard deviations¶

If the standard deviation is negative, the negative log-likelihood will fail as you’d take the log of a negative number. What should we do to ensure we don’t run into this while training our neural network?

# Section 4: Embedding faces¶

## Video 9: Embedding Faces Set-up¶

Konrad needs help recognizing faces. He wants to build a network that embeds photos of faces so that photos of the same person are nearby in the embedding space and photos of different people are far in the embedding space. We can’t just use pixel space because the pixels will be very different between a photo of someone straight on vs. from their side!

We will use a neural network to go from the pixels of each image to an embedding space. Let’s say you have a convolutional neural network with $$m$$ units in the last layer. If you feed a face photo $$i$$ through the CNN, the activities of the units in the last layer form an $$m$$-dimensional vector $$\bar{y}_i$$ - this is an embedding of that face photo in $$m$$-dimensional space.

We think we might be able to incorporate Euclidean distance to help us here. The Euclidean distance between two vectors is:

$$d(\bar{y}_i, \bar{y}_j) = \sqrt{\sum_{c=1}^m(\bar{y}_{i_c} - \bar{y}_{j_c})^2} \tag{77}$$

Note: there is a minor indexing error in the video, where it says $$i$$ instead of $$j$$.
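As a small illustration of the Euclidean distance between two embeddings, here is a sketch with hypothetical 3-dimensional vectors standing in for the last-layer activities (the function name and values are made up for the example):

```python
import numpy as np

def euclidean_distance(y_i, y_j):
    """Euclidean distance between two m-dimensional embedding vectors."""
    y_i = np.asarray(y_i, dtype=float)
    y_j = np.asarray(y_j, dtype=float)
    return float(np.sqrt(np.sum((y_i - y_j) ** 2)))

# Hypothetical embeddings with m = 3
y_i = np.array([0.0, 3.0, 1.0])
y_j = np.array([4.0, 0.0, 1.0])

print(euclidean_distance(y_i, y_j))  # 5.0
```

The same value can be obtained with `np.linalg.norm(y_i - y_j)`.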

## Think! 3: Designing a cost function for face embedding¶

Given everything you know, how would you design a cost function for a neural network that Konrad is training so that he can get a helpful embedding of faces? Try to write out an equation!

Please discuss as a group. If you get stuck, you can uncover the hints below one at a time. Please spend some time discussing before uncovering the next hint, though! You are being real deep learning scientists now, and the answers won’t be easy.

### Student Response¶

# @title Student Response
from IPython.display import display
from ipywidgets import widgets

# Free-text box for the group's answer
text = widgets.Textarea(
    value='Type your answer here and click on Submit!',
    placeholder='Type something',
    description='',
    disabled=False
)

button = widgets.Button(description="Submit!")

display(text, button)

def on_button_clicked(b):
    # Confirm the click; the text itself is not sent anywhere here
    print("Submission successful!")

button.on_click(on_button_clicked)


How do we want to deal with the same faces? Can we just build a cost function based on similar faces? What would happen?

You need to also include different faces. How do you want to deal with different faces?

Similar faces should have low Euclidean distance between their embeddings. Different faces should have high Euclidean distance between their embeddings. Can we phrase this with 3 faces?

We want the same faces to have similar embeddings. Let’s say we have one photo of Lyle $$a$$ and another photo of Lyle $$p$$. We want the embeddings of those photos to be very similar: we want the Euclidean distance between $$\bar{y}_a$$ and $$\bar{y}_p$$ (the activities of the last layer of the CNN when photos $$a$$ and $$p$$ are fed through) to be small.

So one possible cost function is:

$$\text{Cost function} = d(\bar{y}_a, \bar{y}_p) \tag{78}$$

Imagine, though, that we just feed in pairs of photos of the same face and minimize that. There would be no motivation to ever have different embeddings, since we would only be minimizing the distance between embeddings. If the CNN was smart, it would just have the same embedding for every single photo - then the cost function would equal 0!
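This collapse is easy to reproduce. The sketch below uses a hypothetical degenerate "network" that ignores its input and random arrays as stand-ins for photos; it is only meant to show that a constant embedding achieves zero cost on same-face pairs.

```python
import numpy as np

def constant_embedding(image, m=4):
    """Hypothetical degenerate 'network': returns the same vector for any image."""
    return np.zeros(m)

rng = np.random.default_rng(0)

# Random 8x8 arrays standing in for pairs of photos of the same person
pairs = [(rng.random((8, 8)), rng.random((8, 8))) for _ in range(5)]

# Cost: summed Euclidean distance between the embeddings in each pair
cost = sum(np.linalg.norm(constant_embedding(a) - constant_embedding(p))
           for a, p in pairs)

print(cost)  # 0.0
```

The cost is perfectly minimized even though the embedding carries no information about who is in the photo.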

This is clearly not what we want. We want to motivate the CNN to have similar embeddings only when the faces are the same. This means we need to also train it to maximize distance when the faces are different.

We could choose another two photos of different people and maximize that distance, but then there’s no relation to the embeddings we’ve already established for the two photos of Lyle. Instead, we will add one more photo to the mix: a photo of Konrad $$n$$. We want this photo to be far from our original photos of Lyle, $$a$$ and $$p$$. So we want the distance between $$a$$ and $$p$$ to be small and the distance between, for example, $$a$$ and $$n$$ to be large:

$$\text{Cost function} = d(\bar{y}_a, \bar{y}_p) - d(\bar{y}_a, \bar{y}_n) \tag{79}$$

We could compare $$n$$ to both $$a$$ and $$p$$:

$$\text{Cost function} = d(\bar{y}_a, \bar{y}_p) - d(\bar{y}_a, \bar{y}_n) - d(\bar{y}_p, \bar{y}_n) \tag{80}$$

But then the cost function is a bit unbalanced: there are two dissimilarity terms, and they might dominate (so achieving the similarity is less important). So let’s go with just including one dissimilarity term.

This is an established cost function - triplet loss! We chose the subscripts $$a$$, $$p$$, and $$n$$ for a reason: we have an anchor image, a positive image (the same person’s face as the anchor) and a negative image (a different person’s face from the anchor). We can then sum over $$N$$ data points where each data point is a set of three images:

$$\text{Cost function} = \sum_{i=1}^N \left[ d(\bar{y}_{a, i}, \bar{y}_{p, i}) - d(\bar{y}_{a, i}, \bar{y}_{n, i}) \right] \tag{81}$$

There’s one little addition in triplet loss. Instead of just using the above cost function, researchers add a constant $$\alpha$$ and then make the cost function 0 if it becomes negative. Why do you think they do this?

$$\text{Cost function} = \max \left( \sum_{i=1}^N \left[ d(\bar{y}_{a, i}, \bar{y}_{p, i}) - d(\bar{y}_{a, i}, \bar{y}_{n, i}) + \alpha \right], 0 \right) \tag{82}$$
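A minimal NumPy sketch of a triplet loss follows. Note one assumption: it clips each triplet's term at zero before summing, which is how triplet loss is commonly written, whereas the equation above applies the max to the whole sum. The function name, the margin value `alpha=0.2`, and the toy embeddings are hypothetical.

```python
import numpy as np

def triplet_loss(y_a, y_p, y_n, alpha=0.2):
    """Triplet loss with the hinge applied to each triplet separately.

    y_a, y_p, y_n: arrays of shape (N, m) holding the anchor, positive and
    negative embeddings of N triplets. alpha is the margin (assumed value).
    """
    d_ap = np.linalg.norm(y_a - y_p, axis=1)  # anchor-positive distances
    d_an = np.linalg.norm(y_a - y_n, axis=1)  # anchor-negative distances
    return float(np.sum(np.maximum(d_ap - d_an + alpha, 0.0)))

# Hypothetical 2-D embeddings for two triplets
y_a = np.array([[0.0, 0.0], [0.0, 0.0]])
y_p = np.array([[0.1, 0.0], [1.0, 0.0]])
y_n = np.array([[2.0, 0.0], [0.5, 0.0]])

loss = triplet_loss(y_a, y_p, y_n)
print(loss)  # approximately 0.7
```

The first triplet is already well separated (the negative is much farther from the anchor than the positive), so it contributes nothing; only the second triplet, where the negative is closer than the positive, adds to the cost.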

### Video 10: Embedding Faces Wrap-up¶

Check out the papers mentioned in the above video:

# Summary¶

Today we have seen a range of different cost functions, so let’s dwell a bit on what to take away from these exercises. The cost functions we saw were:

• Log Poisson likelihood for neurons

• Uncertainty as a modeled entity

• Face embeddings

What we saw in all these cases is that these cost functions emerge from insights into the problem domain. We saw how one needs to, in a way, pull these insights out of the domain experts. And how, at the same time, the cost functions come from computational insights. Coming up with the proper cost functions requires listening to what domain experts say and probing the things they may mean but not say.