Open In Colab   Open in Kaggle

Tutorial 2: Modern RNNs and their variants

Week 2, Day 3: Modern RNNs

By Neuromatch Academy

Content creators: Bhargav Srinivasa Desikan, Anis Zahedifard, James Evans

Content reviewers: Lily Cheng, Melvin Selim Atay, Ezekiel Williams, Kelson Shilling-Scrivo

Content editors: Gagana B, Spiros Chavlis

Production editors: Roberto Guidotti, Spiros Chavlis

Post-Production team: Gagana B, Spiros Chavlis

Our 2021 Sponsors, including Presenting Sponsor Facebook Reality Labs


Tutorial objectives

In this tutorial you will learn about:

  1. Modern Recurrent Neural Networks and their use

  2. Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU) and the memory cell

  3. Sequence to Sequence and Encoder-Decoder Networks

  4. Models of Attention for text classification

Tutorial slides

These are the slides for the videos in this tutorial. If you want to locally download the slides, click here.


Setup

We will use the IMDB dataset, which consists of a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. We will use torchtext to download the dataset and prepare it for training, validation and testing. Our goal is to build a model that performs binary classification between positive and negative movie reviews.

We use fix_length argument to pad sentences of length less than sentence_length or truncate sentences of length greater than sentence_length.
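
Conceptually, fix_length just pads short token sequences and truncates long ones. The snippet below is a minimal pure-Python sketch of that behaviour (illustrative only; torchtext handles this internally, and the `<pad>` token name is an assumption):

# Minimal sketch of what `fix_length` does conceptually (illustrative only;
# torchtext applies this padding/truncation internally).
def pad_or_truncate(tokens, sentence_length=50, pad_token='<pad>'):
  """Return exactly `sentence_length` tokens."""
  if len(tokens) >= sentence_length:
    return tokens[:sentence_length]  # Truncate long reviews
  return tokens + [pad_token] * (sentence_length - len(tokens))  # Pad short ones

print(pad_or_truncate(['a', 'great', 'movie'], sentence_length=5))
# ['a', 'great', 'movie', '<pad>', '<pad>']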

Install dependencies

There may be errors and/or warnings reported during the installation. However, they are to be ignored.

# @title Install dependencies

# @markdown There may be *errors* and/or *warnings* reported during the installation. However, they are to be ignored.
# Note: the imports below use the `torchtext.legacy` API, which was removed
# in newer torchtext releases, so we pin an older version here.
!pip install "torchtext<0.12" --quiet
!pip install unidecode --quiet
!pip install d2l --quiet
!pip install nltk --quiet
!pip install matplotlib==3.1.1 --quiet

!pip install git+https://github.com/NeuromatchAcademy/evaltools --quiet
from evaltools.airtable import AirtableForm

atform = AirtableForm('appn7VdPRseSoMXEG','W2D3_T2','https://portal.neuromatchacademy.org/api/redirect/to/3412a777-eb0e-4312-9254-eec266f0bee4')

# Imports
import math
import time
import nltk
import random
import collections

import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.nn import functional as F

from torchtext.legacy import data, datasets

from d2l import torch as d2l

Figure Settings

# @title Figure Settings
import ipywidgets as widgets
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle")

Download the dataset

# @title Download the dataset
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('brown')
nltk.download('webtext')

Helper functions

# @title Helper functions

def plot_train_val(x, train, val, train_label,
                   val_label, title, y_label,
                   color):
  """
  Plots training/validation performance per epoch

  Args:
    x: np.ndarray
      Input data
    train: list
      Training data performance
    val: list
      Validation data performance
    train_label: string
      Train Label [specifies training criterion]
    color: string
      Specifies color of plot
    val_label: string
      Validation Label [specifies validation criterion]
    title: string
      Specifies title of plot
    y_label: string
      Specifies performance criterion

  Returns:
    Nothing
  """
  plt.plot(x, train, label=train_label, color=color)
  plt.plot(x, val, label=val_label, color=color, linestyle='--')
  plt.legend(loc='lower right')
  plt.xlabel('epoch')
  plt.ylabel(y_label)
  plt.title(title)

def count_parameters(model):
  """
  Helper function to count parameters

  Args:
    model: nn.module
      NeuralNet instance

  Returns:
    parameters: int
      Number of parameters in model
  """
  parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
  return parameters

def init_weights(m):
  """
  Helper function to initialize weights

  Args:
    m: nn.module
      Type of layer

  Returns:
    Nothing
  """
  if type(m) in (nn.Linear, nn.Conv1d):
    nn.init.xavier_uniform_(m.weight)

def load_dataset(sentence_length=50, batch_size=32, seed=2021):
  """
  Dataset Loader

  Args:
    sentence_length: int
      Length of sentence
    seed: int
      Set seed for reproducibility
    batch_size: int
      Batch size

  Returns:
    TEXT: Field instance
      Text
    vocab_size: int
      Specifies size of TEXT
    train_iter: BucketIterator
      Training iterator
    valid_iter: BucketIterator
      Validation iterator
    test_iter: BucketIterator
      Test iterator
  """
  TEXT = data.Field(sequential=True,
                    tokenize=nltk.word_tokenize,
                    lower=True,
                    include_lengths=True,
                    batch_first=True,
                    fix_length=sentence_length)
  LABEL = data.LabelField(dtype=torch.float)

  train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

  # If no specific vector embeddings are specified,
  # Torchtext initializes random vector embeddings
  # which would get updated during training through backpropagation.
  TEXT.build_vocab(train_data)
  LABEL.build_vocab(train_data)

  train_data, valid_data = train_data.split(split_ratio=0.7,
                                            random_state=random.seed(seed))
  train_iter, valid_iter, test_iter = data.BucketIterator.splits((train_data, valid_data, test_data),
                                                                  batch_size=batch_size, sort_key=lambda x: len(x.text),
                                                                  repeat=False, shuffle=True)
  vocab_size = len(TEXT.vocab)

  print(f"Data loading is completed. Sentence length: {sentence_length}, "
        f"Batch size: {batch_size}, and seed: {seed}")

  return TEXT, vocab_size, train_iter, valid_iter, test_iter


def text_from_dict(arr, dictionary):
  """
  Helper function to extract text from dictionary

  Args:
    dictionary: dict
      Dictionary of words and corresponding indices
    arr: list
      Sequence of words

  Returns:
    text: list
      Log of keys from dictionary
  """
  text = []
  for element in arr:
    text.append(dictionary[element])
  return text

def view_data(TEXT, train_iter):
  """
  Helper function to view data

  Args:
    TEXT: Field instance
      Text
    train_iter: BucketIterator
      Training iterator

  Returns:
    Nothing
  """
  for idx, batch in enumerate(train_iter):
    text = batch.text[0]
    target = batch.label

    for itr in range(25, 30):
      print('Review: ', ' '.join(text_from_dict(text[itr], TEXT.vocab.itos)))
      print('Label: ', int(target[itr].item()), '\n')

    print('[0: Negative Review, 1: Positive Review]')
    if idx==0:
      break


def train(model, device, train_iter, valid_iter,
          epochs, learning_rate):
  """
  Training function

  Args:
    model: nn.module
      NeuralNet instance
    device: string
      GPU if available, CPU otherwise
    epochs: int
      Number of epochs to train model for
    learning_rate: float
      Learning rate
    train_iter: BucketIterator
      Training iterator
    valid_iter: BucketIterator
      Validation iterator

  Returns:
    train_loss: list
      Log of training loss
    validation_loss: list
      Log of validation loss
    train_acc: list
      Log of training accuracy
    validation_acc: list
      Log of validation accuracy
  """
  criterion = nn.CrossEntropyLoss()
  optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

  train_loss, validation_loss = [], []
  train_acc, validation_acc = [], []

  for epoch in range(epochs):
    # Train
    model.train()
    running_loss = 0.
    correct, total = 0, 0
    steps = 0

    for idx, batch in enumerate(train_iter):
      text = batch.text[0]
      target = batch.label
      target = torch.autograd.Variable(target).long()
      text, target = text.to(device), target.to(device)

      optimizer.zero_grad()
      output = model(text)

      loss = criterion(output, target)
      loss.backward()
      optimizer.step()
      steps += 1
      running_loss += loss.item()

      # Get accuracy
      _, predicted = torch.max(output, 1)
      total += target.size(0)
      correct += (predicted == target).sum().item()

    train_loss.append(running_loss/len(train_iter))
    train_acc.append(correct/total)

    print(f'Epoch: {epoch + 1}, '
          f'Training Loss: {running_loss/len(train_iter):.4f}, '
          f'Training Accuracy: {100*correct/total: .2f}%')

    # Evaluate on validation data
    model.eval()
    running_loss = 0.
    correct, total = 0, 0

    with torch.no_grad():
      for idx, batch in enumerate(valid_iter):
        text = batch.text[0]
        target = batch.label
        target = torch.autograd.Variable(target).long()
        text, target = text.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(text)

        loss = criterion(output, target)
        running_loss += loss.item()

        # get accuracy
        _, predicted = torch.max(output, 1)
        total += target.size(0)
        correct += (predicted == target).sum().item()

    validation_loss.append(running_loss/len(valid_iter))
    validation_acc.append(correct/total)

    print (f'Validation Loss: {running_loss/len(valid_iter):.4f}, '
           f'Validation Accuracy: {100*correct/total: .2f}%')

  return train_loss, train_acc, validation_loss, validation_acc


def test(model, device, test_iter):
  """
  Testing function

  Args:
    model: nn.module
      NeuralNet instance
    device: string
      GPU if available,
    test_iter: BucketIterator
      Test iterator

  Returns:
    acc: float
      Test Accuracy
  """
  model.eval()
  correct = 0
  total = 0
  with torch.no_grad():
    for idx, batch in enumerate(test_iter):
      text = batch.text[0]
      target = batch.label
      target = torch.autograd.Variable(target).long()
      text, target = text.to(device), target.to(device)

      outputs = model(text)
      _, predicted = torch.max(outputs, 1)
      total += target.size(0)
      correct += (predicted == target).sum().item()

    acc = 100 * correct / total
    return acc

Set random seed

Executing set_seed(seed=seed) sets the seed

# @title Set random seed

# @markdown Executing `set_seed(seed=seed)` sets the seed

# For DL it's critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here: https://pytorch.org/docs/stable/notes/randomness.html

# Call `set_seed` function in the exercises to ensure reproducibility.
import random

def set_seed(seed=None, seed_torch=True):
  """
  Function that controls randomness.
  NumPy and random modules must be imported.

  Args:
    seed : Integer
      A non-negative integer that defines the random state. Default is `None`.
    seed_torch : Boolean
      If `True` sets the random seed for pytorch tensors, so pytorch module
      must be imported. Default is `True`.

  Returns:
    Nothing.
  """
  if seed is None:
    seed = np.random.choice(2 ** 32)
  random.seed(seed)
  np.random.seed(seed)
  if seed_torch:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

  print(f'Random seed {seed} has been set.')

# In case that `DataLoader` is used
def seed_worker(worker_id):
  """
  DataLoader will reseed workers following randomness in
  multi-process data loading algorithm.

  Args:
    worker_id: integer
      ID of subprocess to seed. 0 means that
      the data will be loaded in the main process
      Refer: https://pytorch.org/docs/stable/data.html#data-loading-randomness for more details

  Returns:
    Nothing
  """
  worker_seed = torch.initial_seed() % 2**32
  np.random.seed(worker_seed)
  random.seed(worker_seed)

Set device (GPU or CPU). Execute set_device()

# @title Set device (GPU or CPU). Execute `set_device()`
# Inform the user if the notebook uses GPU or CPU.

def set_device():
  """
  Set the device. CUDA if available, CPU otherwise

  Args:
    None

  Returns:
    Nothing
  """
  device = "cuda" if torch.cuda.is_available() else "cpu"
  if device != "cuda":
    print("WARNING: For this notebook to perform best, "
        "if possible, in the menu under `Runtime` -> "
        "`Change runtime type.`  select `GPU` ")
  else:
    print("GPU is enabled in this notebook.")

  return device
DEVICE = set_device()
SEED = 2021
set_seed(seed=SEED)
WARNING: For this notebook to perform best, if possible, in the menu under `Runtime` -> `Change runtime type.`  select `GPU` 
Random seed 2021 has been set.

Section 1: Recurrent Neural Networks (RNNs)

Time estimate: ~27mins

Video 1: Recurrent Neural Networks

Recurrent neural networks, or RNNs, are a family of neural networks for processing sequential data. Just as a convolutional network is specialized for processing a grid of values \(X\) such as an image, a recurrent neural network is specialized for processing a sequence of values. RNNs prove useful in many scenarios where other deep learning models are not effective.

  • Not all problems can be converted into one with fixed length inputs and outputs.

  • The deep learning models we have seen so far pick samples randomly. This might not be the best strategy for a task of understanding meaning from a piece of text. Words in a text occur in a sequence and therefore cannot be permuted randomly to get the meaning.

The following provides more detail than the video (but can be skipped for now). For more, see the sources: the Deep Learning book and d2l.ai.

When the recurrent network is trained to perform a task that requires predicting the future from the past, the network typically learns to use a hidden state at time step \(t\), \(H_t\) as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to \(t\). This summary is in general necessarily lossy, since it maps an arbitrary length sequence \((X_t, X_{t-1}, X_{t-2}, \dots, X_{2}, X_{1})\) to a fixed length vector \(H_t\).

We can represent the unfolded recurrence after \(t\) steps with a function \(G_t\):

\[\begin{align} H_t &= G_t(X_t, X_{t-1}, X_{t-2}, \dots, X_{2}, X_{1}) \\ &= f(H_{t-1}, X_{t}; \theta) \end{align}\]

where \(\theta\) denotes the model parameters, i.e., weights and biases.

Source blog.floydhub.com

The function \(G_t\) takes the whole past sequence \((X_t, X_{t-1}, X_{t-2}, \dots , X_{2}, X_{1})\) as input and produces the current state, but the unfolded recurrent structure allows us to factorize \(G_t\) into repeated application of a function \(f\). The unfolding process thus introduces two major advantages:

  • Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of transition from one state to another state, rather than specified in terms of a variable-length history of states.

  • It is possible to use the same transition function \(f\) with the same parameters at every time step.

We will now formally write down the equations of a recurrent unit.

Assume that we have a minibatch of inputs \(X_t \in R^{n \times d}\) at time step \(t\). In other words, for a minibatch of \(n\) sequence examples, each row of \(X_t\) corresponds to one example at time step \(t\) from the sequence. Next, we denote by \(H_t \in R^{n \times h}\) the hidden variable of time step \(t\). Unlike the MLP, here we save the hidden variable \(H_{t-1}\) from the previous time step and introduce a new weight parameter \(W_{hh} \in R^{h \times h}\) to describe how to use the hidden variable of the previous time step in the current time step. Specifically, the calculation of the hidden variable of the current time step is determined by the input of the current time step together with the hidden variable of the previous time step:

\[\begin{equation} H_t = \phi(X_t W_{xh} + H_{t-1}W_{hh} + b_h) \end{equation}\]

For time step \(t\), the output of the output layer is similar to the computation in the MLP:

\[\begin{equation} O_t = H_t W_{hq} + b_q \end{equation}\]

Parameters of the RNN include the weights \(W_{xh} \in R^{d \times h}, W_{hh} \in R^{h \times h}\), and the bias \(b_h \in R^{1 \times h}\) of the hidden layer, together with the weights \(W_{hq} \in R^{h \times q}\) and the bias \(b_q \in R^{1 \times q}\) of the output layer. It is worth mentioning that the RNN uses the same model parameters at every time step. Therefore, the parameterization cost of an RNN does not grow as the number of time steps increases.

Source d2l.ai
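
As a shape check, here is a minimal sketch of a single recurrent step written directly from the two equations above (the sizes \(n\), \(d\), \(h\), and \(q\) are illustrative assumptions):

# Minimal sketch: one recurrent step H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h)
# followed by the output O_t = H_t W_hq + b_q. Sizes are illustrative.
import torch

n, d, h, q = 4, 3, 5, 2  # Batch size, input dim, hidden dim, output dim
X_t, H_prev = torch.randn(n, d), torch.zeros(n, h)
W_xh, W_hh, b_h = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_hq, b_q = torch.randn(h, q), torch.zeros(q)

H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)  # Hidden state update
O_t = H_t @ W_hq + b_q  # Output layer, as in an MLP
print(H_t.shape, O_t.shape)  # torch.Size([4, 5]) torch.Size([4, 2])

Note that the same weights are reused at every time step; only \(X_t\) and \(H_{t-1}\) change.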

Section 1.1: Load and View of the dataset

Let us first load the dataset using the helper function load_dataset, which takes three arguments: the sentence_length, batch_size, and seed. The default values are 50, 32, and 2021, respectively. Execute the cell below to load the data.

Dataset Loading with default params

# @markdown Dataset Loading with default params
TEXT, vocab_size, train_iter, valid_iter, test_iter = load_dataset(seed=SEED)

Now, let’s view the data!

Visualize dataset

# @markdown Visualize dataset
view_data(TEXT, train_iter)

Coding Exercise 1.1: Vanilla RNN

Now it’s your turn to write a Vanilla RNN using PyTorch.

  • Once again we will use nn.Embedding. You are given the vocab_size which is the size of the dictionary of embeddings, and the embed_size which is the size of each embedding vector.

  • Add 2 RNN layers. This would mean stacking two RNNs together to form a stacked RNN, with the second RNN taking in outputs of the first RNN and computing the final results.

  • Determine the size of inputs and outputs to the fully-connected layer.

class VanillaRNN(nn.Module):
  """
  Vanilla RNN with following structure:
  Embedding of size vocab_size * embed_size # Embedding Layer
  RNN of size embed_size * hidden_size * self.n_layers # RNN Layer
  Linear of size self.n_layers*hidden_size * output_size # Fully connected layer
  """

  def __init__(self, layers, output_size, hidden_size, vocab_size, embed_size,
               device):
    """
    Initialize parameters of VanillaRNN

    Args:
      layers: int
        Number of layers
      output_size: int
        Size of final fully connected layer
      hidden_size: int
        Size of hidden layer
      vocab_size: int
        Size of vocabulary
      device: string
        GPU if available, CPU otherwise
      embed_size: int
        Size of embedding

    Returns:
      Nothing
    """
    super(VanillaRNN, self).__init__()
    self.n_layers= layers
    self.hidden_size = hidden_size
    self.device = device
    ####################################################################
    # Fill in missing code below (...),
    # then remove or comment the line below to test your function
    raise NotImplementedError("Define the Vanilla RNN components")
    ####################################################################
    # Define the embedding
    self.embeddings = ...
    # Define the RNN layer
    self.rnn = ...
    # Define the fully connected layer
    self.fc = ...

  def forward(self, inputs):
    """
    Forward pass of VanillaRNN

    Args:
      inputs: torch.tensor
        Input features

    Returns:
      logits: torch.tensor
        Output of final fully connected layer
    """
    input = self.embeddings(inputs)
    input = input.permute(1, 0, 2)
    h_0 = torch.zeros(self.n_layers, input.size()[1], self.hidden_size).to(self.device)
    output, h_n = self.rnn(input, h_0)
    h_n = h_n.permute(1, 0, 2)
    # Reshape the data and create a copy of the tensor such that the
    # order of its elements in memory is the same as if it had been created
    # from scratch with the same data. Without contiguous it may raise an error
    # RuntimeError: input is not contiguous;
    # Note that this is necessary as permute may return a non-contiguous tensor
    h_n = h_n.contiguous().reshape(h_n.size()[0], h_n.size()[1]*h_n.size()[2])
    logits = self.fc(h_n)

    return logits


# Add event to airtable
atform.add_event('Coding Exercise 1.1: Vanilla RNN')

## Uncomment to test VanillaRNN class
# sampleRNN = VanillaRNN(2, 10, 50, 1000, 300, DEVICE)
# print(sampleRNN)

Click for solution

VanillaRNN(
  (embeddings): Embedding(1000, 300)
  (rnn): RNN(300, 50, num_layers=2)
  (fc): Linear(in_features=100, out_features=10, bias=True)
)

Section 1.2: Train and test the network

# Model hyperparamters
learning_rate = 0.0002
layers = 2
output_size = 2
hidden_size = 50  # 100
embedding_length = 100
epochs = 10


# Initialize model, training and testing
set_seed(SEED)
vanilla_rnn_model = VanillaRNN(layers, output_size, hidden_size, vocab_size,
                               embedding_length, DEVICE)
vanilla_rnn_model.to(DEVICE)
vanilla_rnn_start_time = time.time()
vanilla_train_loss, vanilla_train_acc, vanilla_validation_loss, vanilla_validation_acc = train(vanilla_rnn_model,
                                                                                               DEVICE,
                                                                                               train_iter,
                                                                                               valid_iter,
                                                                                               epochs,
                                                                                               learning_rate)
print("--- Time taken to train = %s seconds ---" % (time.time() - vanilla_rnn_start_time))
test_accuracy = test(vanilla_rnn_model, DEVICE, test_iter)
print(f'Test Accuracy: {test_accuracy} with len=50\n')

# Number of model parameters
print(f'Number of parameters = {count_parameters(vanilla_rnn_model)}')

Now, let’s plot the accuracies!

# Plot accuracy curves
plt.figure()
plt.subplot(211)
plot_train_val(np.arange(0, epochs), vanilla_train_acc, vanilla_validation_acc,
               'train accuracy', 'val accuracy',
               'Vanilla RNN on IMDB text classification', 'accuracy',
               color='C0')
plt.legend(loc='upper left')
plt.subplot(212)
plot_train_val(np.arange(0, epochs), vanilla_train_loss,
               vanilla_validation_loss,
               'train loss', 'val loss',
               'Vanilla RNN on IMDB text classification',
               'loss [a.u.]',
               color='C0')
plt.legend(loc='upper left')
plt.show()

Change the input length

Now let's increase the sentence_length to see how the RNN performs when longer reviews are allowed.

Load dataset with sentence_length=200

# @markdown Load dataset with `sentence_length=200`
TEXT_long, vocab_size_long, train_iter_long, valid_iter_long, test_iter_long = load_dataset(sentence_length=200)

Re-run the network

# Model hyperparamters
learning_rate = 0.0002
layers = 2
output_size = 2
hidden_size = 50  # 100
embedding_length = 100
epochs = 10

# Initialize model, training, testing
set_seed(SEED)
vanilla_rnn_model_long = VanillaRNN(layers, output_size, hidden_size,
                                    vocab_size_long, embedding_length, DEVICE)
vanilla_rnn_model_long.to(DEVICE)
vanilla_rnn_start_time_long = time.time()
vanilla_train_loss_long, vanilla_train_acc_long, vanilla_validation_loss_long, vanilla_validation_acc_long = train(vanilla_rnn_model_long,
                                                                                                                   DEVICE,
                                                                                                                   train_iter_long,
                                                                                                                   valid_iter_long,
                                                                                                                   epochs,
                                                                                                                   learning_rate)
print("--- Time taken to train = %s seconds ---" % (time.time() - vanilla_rnn_start_time_long))
test_accuracy = test(vanilla_rnn_model_long, DEVICE, test_iter_long)
print(f'Test Accuracy: {test_accuracy} with len=200\n')

# Number of parameters
print(f'\nNumber of parameters = {count_parameters(vanilla_rnn_model_long)}')
# Compare accuracies of model trained on different sentence lengths
plot_train_val(np.arange(0, epochs), vanilla_train_acc,
               vanilla_validation_acc,
               'train accuracy, len=50', 'val accuracy, len=50',
               '', 'accuracy',
               color='C0')
plot_train_val(np.arange(0, epochs), vanilla_train_acc_long,
               vanilla_validation_acc_long,
               'train accuracy, len=200', 'val accuracy, len=200',
               'Training and Validation Accuracy for Sentence Lengths 50 and 200',
               'accuracy',
               color='C1')
plt.legend(loc='upper left')
plt.show()

Section 1.3: Architectures

Video 2: Bidirectional RNNs

RNN models are mostly used in the fields of natural language processing and speech recognition. Below are the main types of RNN architectures. Depending on which inputs and outputs we use, an RNN can be applied to a variety of tasks. The text classification problem we solved was an instance of the many-to-one architecture. Write down the applications of the other architectures.

Source blog.floydhub.com

Section 1.4: Vanishing and Exploding Gradients

For an RNN to learn via backprop through time on a loss calculated at time \(T\), \(\mathcal{L}_T\), with respect to an input \(t\) time steps in the past, the RNN weights must be updated based on how they contributed to the hidden state at this past time step. This contribution is learned through the term \(\frac{\partial h_{-t}}{\partial W}\), in the gradient of the loss, \(\frac{\partial\mathcal{L}_T}{\partial W}\).

However, because one has to backpropagate error through \(t-1\) hidden states, \(\frac{\partial h_{-t}}{\partial W}\) is multiplied by \(\prod_{i=0}^{t-1} \frac{\partial{h_i}}{\partial{h_{i-1}}}\) in the expression for \(\frac{\partial\mathcal{L}_T}{\partial W}\), which are summarized mathematically:

\[\begin{equation} \frac{\partial{\mathcal{L}_T}}{\partial{W}} \propto \frac{\partial h_t }{ \partial W} + \sum_{k=0}^{t-1} \left( \prod_{i=k+1}^{t} \frac{\partial{h_i}}{\partial{h_{i-1}}} \right) \frac{\partial{h_k}}{\partial{W}} \end{equation}\]

The product term leads to two common problems during the backpropagation of time-series data:

  • Vanishing gradients, if \( \left| \left| \frac{\partial{h_i}}{\partial{h_{i-1}}} \right| \right|_2 < 1\)

  • Exploding gradients, if \( \left| \left| \frac{\partial{h_i}}{\partial{h_{i-1}}} \right| \right|_2 > 1\)

Given a sufficiently long sequence, the gradients get multiplied by the weight matrix at every time step. If the weight matrix contains very small values, the norm of the gradients becomes exponentially smaller, the so-called vanishing gradient problem. On the other hand, if the weight matrix contains very large values, the gradients grow exponentially, leading to the exploding gradient problem, where the weights diverge at the update step.

An example that has the vanishing gradient problem:

The input is the sequence of characters from a C program, and the system must tell whether the program is syntactically correct. A syntactically correct program has balanced braces and parentheses, so the network needs to remember how many are currently open and whether they have all been closed. The network has to store such information in its hidden state, like a counter. However, because of vanishing gradients, it will fail to preserve this information over a long program.
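
The effect of the repeated product can be seen with a toy calculation: if each factor \(\frac{\partial{h_i}}{\partial{h_{i-1}}}\) has norm slightly below or above one, the accumulated gradient shrinks or grows exponentially with the number of time steps. A minimal sketch with scalar factors standing in for the Jacobians:

# Minimal sketch: repeated multiplication by per-step factors with norm
# slightly below/above 1 makes the backpropagated gradient vanish/explode.
for factor, name in [(0.9, 'vanishing'), (1.1, 'exploding')]:
  grad = 1.0
  for t in range(100):  # Backpropagating through 100 time steps
    grad *= factor  # Stands in for one factor dh_i/dh_{i-1}
  print(f"{name}: gradient scale after 100 steps ~ {grad:.2e}")
# vanishing: ~2.66e-05, exploding: ~1.38e+04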


Section 2: LSTM, GRU and Memory Cell

Time estimate: ~28mins

Video 3: LSTM, GRU & The Memory Cells

Section 2.1: Architecture

The core idea behind an LSTM is the cell state \(C_t\) that runs along all the LSTM units in a layer, and gets updated along the way. These updates are possible through “gates”. Gates are made out of a sigmoid neural net layer and a pointwise multiplication operation.

Each LSTM unit performs the following distinct steps using the input \(X_t\), the previous cell state \(C_{t-1}\) and the previous hidden state \(H_{t-1}\):

  • Forget Gate: Should I throw away information from this cell?

    \[\begin{equation} F_t = \sigma (W_f \cdot [H_{t-1}, X_t] + b_f) \end{equation}\]

  • Input Gate:

    • Should I add new values to this cell?

      \[\begin{equation} I_t = \sigma (W_i \cdot [H_{t-1}, X_t] + b_i) \end{equation}\]

    • What new candidate values should I store?

      \[\begin{equation} \tilde{C}_t = \tanh (W_C \cdot [H_{t-1}, X_t] + b_C) \end{equation}\]

  • Update cell state: Forget things from the past and add new things from the candidates

    \[\begin{equation} C_t = (F_t \cdot C_{t-1}) + (I_t \cdot \tilde{C}_t) \end{equation}\]

  • Output Gate:

    • What information should I output?

      \[\begin{equation} O_t = \sigma (W_o \cdot [H_{t-1}, X_t] + b_o) \end{equation}\]

    • How much of the cell state should I store in the hidden state?

      \[\begin{equation} H_t = O_t \cdot \tanh(C_t) \end{equation}\]
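
Before using PyTorch's built-in nn.LSTM in the exercise below, it can help to see the gate equations written out directly. Here is a minimal sketch of a single LSTM step (the sizes are illustrative assumptions; nn.LSTM fuses these operations efficiently):

# Minimal sketch of one LSTM step, written directly from the gate equations
# above. Sizes are illustrative; in practice we use nn.LSTM.
import torch

n, d, h = 4, 3, 5  # Batch size, input dim, hidden dim
X_t = torch.randn(n, d)
H_prev, C_prev = torch.zeros(n, h), torch.zeros(n, h)
concat = torch.cat([H_prev, X_t], dim=1)  # [H_{t-1}, X_t]

W_f, W_i, W_C, W_o = (torch.randn(h + d, h) for _ in range(4))
b_f = b_i = b_C = b_o = torch.zeros(h)

F_t = torch.sigmoid(concat @ W_f + b_f)  # Forget gate
I_t = torch.sigmoid(concat @ W_i + b_i)  # Input gate
C_tilde = torch.tanh(concat @ W_C + b_C)  # Candidate cell state
C_t = F_t * C_prev + I_t * C_tilde  # Updated cell state
O_t = torch.sigmoid(concat @ W_o + b_o)  # Output gate
H_t = O_t * torch.tanh(C_t)  # New hidden state
print(H_t.shape, C_t.shape)  # torch.Size([4, 5]) torch.Size([4, 5])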

The architecture can be summarized by the diagram below:

Source d2l.ai

Coding Exercise 2.1: Implementing LSTM

It is now your turn to build an LSTM network in PyTorch. Feel free to refer to the documentation here.

  • Once again we will use nn.Embedding. You are given the vocab_size and the embed_size.

  • Add the LSTM layers.

  • Define a dropout layer of 0.5.

  • Determine the size of inputs and outputs to the fully-connected layer.

  • Pay special attention to the shapes of your inputs and outputs as you write the forward function.

class LSTM(nn.Module):
  """
  LSTM (Long Short Term Memory) with following structure
  Embedding layer of size vocab_size * embed_size
  Dropout layer with dropout_probability of 0.5
  LSTM layer of size embed_size * hidden_size * num_layers
  Fully connected layer of n_layers*hidden_size * output_size
  """

  def __init__(self, layers, output_size, hidden_size, vocab_size, embed_size,
               device):
    """
    Initialize parameters of LSTM

    Args:
      layers: int
        Number of layers
      output_size: int
        Size of final fully connected layer
      hidden_size: int
        Size of hidden layer
      vocab_size: int
        Size of vocabulary
      device: string
        GPU if available, CPU otherwise
      embed_size: int
        Size of embedding

    Returns:
      Nothing
    """
    super(LSTM, self).__init__()
    self.n_layers = layers
    self.output_size = output_size
    self.hidden_size = hidden_size
    self.device = device
    ####################################################################
    # Fill in missing code below (...),
    # then remove or comment the line below to test your function
    raise NotImplementedError("LSTM Init")
    ####################################################################
    # Define the word embeddings
    self.word_embeddings = ...
    # Define the dropout layer
    self.dropout = ...
    # Define the lstm layer
    self.lstm = ...
    # Define the fully-connected layer
    self.fc = ...


  def forward(self, input_sentences):
    """
    Forward pass of LSTM
    Hint: Make sure the shapes of your tensors match the requirement

    Args:
      input_sentences: torch.tensor
        Input Sentences

    Returns:
      logits: torch.tensor
        Output of final fully connected layer
    """
    ####################################################################
    # Fill in missing code below (...),
    # then remove or comment the line below to test your function
    raise NotImplementedError("LSTM Forward")
    ####################################################################
    # Embeddings
    # `input` shape: (`num_steps`, `batch_size`, `embed_size`)
    input = ...

    hidden = (torch.randn(self.n_layers, input.shape[1],
                          self.hidden_size).to(self.device),
              torch.randn(self.n_layers, input.shape[1],
                          self.hidden_size).to(self.device))
    # Dropout for regularization
    input = self.dropout(input)
    # LSTM
    output, hidden = ...

    h_n = hidden[0].permute(1, 0, 2)
    h_n = h_n.contiguous().view(h_n.shape[0], -1)

    logits = self.fc(h_n)

    return logits


# Add event to airtable
atform.add_event('Coding Exercise 2.1: Implementing LSTM')

## Uncomment to run
# sampleLSTM = LSTM(3, 10, 100, 1000, 300, DEVICE)
# print(sampleLSTM)

Click for solution

LSTM(
  (word_embeddings): Embedding(1000, 300)
  (dropout): Dropout(p=0.5, inplace=False)
  (lstm): LSTM(300, 100, num_layers=3)
  (fc): Linear(in_features=300, out_features=10, bias=True)
)
# Hyperparameters
learning_rate = 0.0003
layers = 2
output_size = 2
hidden_size = 16
embedding_length = 100
epochs = 10

# Model, training, testing
set_seed(SEED)
lstm_model = LSTM(layers, output_size, hidden_size, vocab_size,
                  embedding_length, DEVICE)
lstm_model.to(DEVICE)
lstm_train_loss, lstm_train_acc, lstm_validation_loss, lstm_validation_acc = train(lstm_model,
                                                                                   DEVICE,
                                                                                   train_iter,
                                                                                   valid_iter,
                                                                                   epochs,
                                                                                   learning_rate)
test_accuracy = test(lstm_model, DEVICE, test_iter)
print(f'\n\nTest Accuracy: {test_accuracy} of the LSTM model\n')

# Plotting accuracy curve
plt.figure()
plt.subplot(211)
plot_train_val(np.arange(0, epochs), lstm_train_acc, lstm_validation_acc,
               'train accuracy',
               'val accuracy',
               'LSTM on IMDB text classification',
               'accuracy',
               color='C0')
plt.legend(loc='upper left')
plt.subplot(212)
plot_train_val(np.arange(0, epochs), lstm_train_loss, lstm_validation_loss,
               'train loss',
               'val loss',
               '',
               'loss',
               color='C0')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

Section 2.2: Gated Recurrent Units (GRU)

The GRU architecture looks very similar to the LSTM and is often used as an alternative to the traditional LSTM. It also contains some variations that reduce its complexity. For example, it combines the forget and input gates into a single "update gate", and it has a "hidden state" but no separate "cell state". In the next section we will use GRUs as the recurrent unit in our models, but you can always swap in an LSTM later (make sure that you take care of the input and output dimensions in that case). Here is a description of the parts of the GRU:

  • Reset Gate: How much of the previous hidden state should I remember?

    \[\begin{equation} R_t = \sigma (W_r \cdot [H_{t-1}, X_t]) \end{equation}\]

  • Update Gate:

    • How much of the new state is different from the old state?

      \[\begin{equation} Z_t = \sigma (W_z \cdot [H_{t-1}, X_t]) \end{equation}\]

    • What new candidate values should I store?

      \[\begin{equation} \tilde{H}_t = \tanh (W \cdot [R_t \cdot H_{t-1}, X_t]) \end{equation}\]

  • Update hidden state: Deciding how much of the old hidden state to keep and discard

    \[\begin{equation} H_t = ((1-Z_t) \cdot H_{t-1}) + (Z_t \cdot \tilde{H}_t) \end{equation}\]
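
As with the LSTM above, here is a minimal sketch of a single GRU step written directly from these equations (illustrative sizes; in practice we use nn.GRU):

# Minimal sketch of one GRU step, written from the equations above.
# Sizes are illustrative; in practice we use nn.GRU.
import torch

n, d, h = 4, 3, 5  # Batch size, input dim, hidden dim
X_t, H_prev = torch.randn(n, d), torch.zeros(n, h)
W_r, W_z, W = (torch.randn(h + d, h) for _ in range(3))

R_t = torch.sigmoid(torch.cat([H_prev, X_t], 1) @ W_r)  # Reset gate
Z_t = torch.sigmoid(torch.cat([H_prev, X_t], 1) @ W_z)  # Update gate
H_tilde = torch.tanh(torch.cat([R_t * H_prev, X_t], 1) @ W)  # Candidate state
H_t = (1 - Z_t) * H_prev + Z_t * H_tilde  # New hidden state
print(H_t.shape)  # torch.Size([4, 5])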

Here is what the architecture looks like:

Source d2l.ai

Coding Exercise 2.2: BiLSTM

Let’s apply the knowledge to write a bi-LSTM using PyTorch.

  • Use an Embedding layer

  • Dropout of 0.5

  • Add 2 LSTM layers

  • Linear layer

class biLSTM(nn.Module):
  """
  Bidirectional LSTM with following structure
  Embedding layer of size vocab_size * embed_size
  Dropout layer with dropout_probability of 0.5
  biLSTM layer of size embed_size * hidden_size * num_layers
  Fully connected layer of n_layers*hidden_size * output_size
  """

  def __init__(self, output_size, hidden_size, vocab_size, embed_size,
               device):
    """
    Initialize parameters of biLSTM

    Args:
      output_size: int
        Size of final fully connected layer
      hidden_size: int
        Size of hidden layer
      vocab_size: int
        Size of vocabulary
      device: string
        GPU if available, CPU otherwise
      embed_size: int
        Size of embedding

    Returns:
      Nothing
    """
    super(biLSTM, self).__init__()
    self.output_size = output_size
    self.hidden_size = hidden_size
    self.device = device
    ####################################################################
    # Fill in missing code below (...)
    raise NotImplementedError("biLSTM")
    ####################################################################
    # Define the word embeddings
    self.word_embeddings = ...
    # Define the dropout layer
    self.dropout = ...
    # Define the bilstm layer
    self.bilstm = ...
    # Define the fully-connected layer; 4 = 2*2: 2 for stacking and 2 for bidirectionality
    self.fc = ...

  def forward(self, input_sentences):
    """
    Forward pass of biLSTM

    Args:
      input_sentences: torch.tensor
        Input Sentences

    Returns:
      logits: torch.tensor
        Output of final fully connected layer
    """
    input = self.word_embeddings(input_sentences).permute(1, 0, 2)
    hidden = (torch.randn(4, input.shape[1], self.hidden_size).to(self.device),
              torch.randn(4, input.shape[1], self.hidden_size).to(self.device))
    input = self.dropout(input)

    output, hidden = self.bilstm(input, hidden)

    h_n = hidden[0].permute(1, 0, 2)
    h_n = h_n.contiguous().view(h_n.shape[0], -1)
    logits = self.fc(h_n)

    return logits


# Add event to airtable
atform.add_event('Coding Exercise 2.2: BiLSTM')

## Uncomment to run
# sampleBiLSTM = biLSTM(10, 100, 1000, 300, DEVICE)
# print(sampleBiLSTM)

Click for solution

biLSTM(
  (word_embeddings): Embedding(1000, 300)
  (dropout): Dropout(p=0.5, inplace=False)
  (bilstm): LSTM(300, 100, num_layers=2, bidirectional=True)
  (fc): Linear(in_features=400, out_features=10, bias=True)
)
# Hyperparameters
learning_rate = 0.0003
output_size = 2
hidden_size = 16
embedding_length = 100
epochs = 10

# Model, training, testing
set_seed(SEED)
bilstm_model = biLSTM(output_size, hidden_size, vocab_size,
                      embedding_length, DEVICE)
bilstm_model.to(DEVICE)
bilstm_train_loss, bilstm_train_acc, bilstm_validation_loss, bilstm_validation_acc = train(bilstm_model,
                                                                                           DEVICE,
                                                                                           train_iter,
                                                                                           valid_iter,
                                                                                           epochs,
                                                                                           learning_rate)
test_accuracy = test(bilstm_model, DEVICE, test_iter)
print(f'Test Accuracy: {test_accuracy} of the biLSTM model\n')

# Plotting accuracy curve
plt.figure()
plt.subplot(211)
plot_train_val(np.arange(0, epochs), bilstm_train_acc, bilstm_validation_acc,
               'train accuracy',
               'val accuracy',
               'biLSTM on IMDB text classification',
               'accuracy',
               color='C1')
plt.legend(loc='upper left')
plt.subplot(212)
plot_train_val(np.arange(0, epochs), bilstm_train_loss, bilstm_validation_loss,
               'train loss',
               'val loss',
               '',
               'loss',
               color='C1')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
# Compare accuracies of LSTM and biLSTM
plot_train_val(np.arange(0, epochs), lstm_train_acc,
               lstm_validation_acc,
               'train accuracy LSTM', 'val accuracy LSTM',
               '', 'accuracy',
               color='C0')
plot_train_val(np.arange(0, epochs), bilstm_train_acc,
               bilstm_validation_acc,
               'train accuracy biLSTM', 'val accuracy biLSTM',
               'Training and Validation Accuracy for LSTM and biLSTM models',
               'accuracy',
               color='C1')
plt.legend(loc='upper left')
plt.show()

Section 3: Sequence to Sequence (Seq2Seq) & Encoder-Decoder Networks

Time estimate: ~15mins

Video 4: Seq2Seq & Encoder-Decoder Nets

Sources: d2l.ai on encoders; d2l.ai on seq2seq; Jalammar’s blog

Sequence-to-sequence models take in a sequence of items (words, characters, etc.) as input and produce another sequence of items as output. The simplest seq2seq models are composed of an encoder and a decoder, connected by a context ("state" in the figure). The encoder and decoder usually consist of the recurrent units we've seen before (RNNs, GRUs or LSTMs). A high-level schematic of the architecture is as follows:

Source d2l.ai

The encoder's recurrent unit processes the input one item at a time. Once the entire sequence is processed, the final hidden state vector produced is known as the context vector. The size of the context vector is defined while setting up the model, and is equal to the number of hidden units in the encoder RNN. The encoder then passes the context to the decoder. The decoder's recurrent unit uses the context to produce the items of the output sequence one by one.

One of the most popular applications of seq2seq models is “machine translation”: the task of taking in a sentence in one language (the source) and producing its translation in another language (the target); with words in both languages being the sequence units. This is a supervised learning task, and requires the dataset to have “parallel sentences”; i.e., each sentence in the source language must be labelled with its translation in the target language.

Here is an intuitive visualization for understanding seq2seq models for machine translation from English to French.

Since the vocabulary of an entire language is very large, training such models to give meaningful performance requires significant time and resources. In this section, we will train a seq2seq model to perform machine translation from English to Pig-Latin. We will modify the task to perform character-level machine translation, so that the vocabulary stays small.
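
To get a feel for what the parallel data looks like, here is a minimal sketch of one common Pig-Latin rule; the preprocessing used later in the tutorial may follow different conventions:

# Minimal sketch of one common Pig-Latin rule, just to illustrate what
# a (source, target) pair in the parallel dataset could look like.
# The tutorial's actual preprocessing may differ.
def to_pig_latin(word):
  """Move leading consonants to the end and append 'ay' ('way' if the word starts with a vowel)."""
  vowels = 'aeiou'
  if word[0] in vowels:
    return word + 'way'
  for i, ch in enumerate(word):
    if ch in vowels:
      return word[i:] + word[:i] + 'ay'
  return word + 'ay'  # No vowels at all

print([(w, to_pig_latin(w)) for w in ['hello', 'world', 'apple']])
# [('hello', 'ellohay'), ('world', 'orldway'), ('apple', 'appleway')]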

Coding Exercise 3: Encoder

Let us consider a sequence example (batch_size=1). Suppose that the input sequence is \(x_1, \ldots, x_T\), such that \(x_t\) is the \(t^{\mathrm{th}}\) token in the input text sequence. At time step \(t\), the RNN transforms the input feature vector \(\mathbf{x}_t\) for \(x_t\) and the hidden state \(\mathbf{h} _{t-1}\) from the previous time step into the current hidden state \(\mathbf{h}_t\).

We can use a function \(f\) to express the transformation of the RNN’s recurrent layer:

\[\begin{equation} \mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}) \end{equation}\]

In general, the encoder transforms the hidden states at all the time steps into the context variable through a customized function \(q\):

\[\begin{equation} \mathbf{c} = q(\mathbf{h}_1, \ldots, \mathbf{h}_T) \end{equation}\]

For example, when choosing \(q(\mathbf{h}_1, \ldots, \mathbf{h}_T) = \mathbf{h}_T\) the context variable is just the hidden state \(\mathbf{h}_T\) of the input sequence at the final time step.

So far we have used a unidirectional RNN to design the encoder, where a hidden state only depends on the input subsequence at and before the time step of the hidden state. We can also construct encoders using bidirectional RNNs. In this case, a hidden state depends on the subsequence before and after the time step (including the input at the current time step), which encodes the information of the entire sequence.

Now let us implement the RNN encoder. Note that we use an embedding layer to obtain the feature vector for each token in the input sequence. The weight of an embedding layer is a matrix whose number of rows equals the size of the input vocabulary (vocab_size) and whose number of columns equals the feature vector's dimension (embed_size). For any input token index \(i\), the embedding layer fetches the \(i^{\mathrm{th}}\) row (starting from 0) of the weight matrix to return its feature vector. Here we choose a multilayer GRU to implement the encoder.

The returned variables of recurrent layers have been completely explained at this link. Let us still use a concrete example to illustrate the above encoder implementation. Below we instantiate a two-layer GRU encoder whose number of hidden units is 16. Given a minibatch of sequence inputs \(X\) (batch_size=4, number_of_time_steps=7), the hidden states of the last layer at all the time steps (output returned by the encoder’s recurrent layers) are a tensor of shape (number of time steps, batch size, number of hidden units).

class Seq2SeqEncoder(d2l.Encoder):
  """
  RNN encoder for sequence to sequence learning.
  RNN has the following structure:
  Embedding layer with size vocab_size * embed_size
  RNN layer with size embed_size * num_hiddens * num_layers + dropout
  """

  def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                dropout=0, **kwargs):
    """
    Initialize parameters of Seq2SeqEncoder

    Args:
      num_layers: int
        Number of layers in GRU/RNN
      num_hiddens: int
        Size of hidden layer
      vocab_size: int
        Size of vocabulary
      embed_size: int
        Size of embedding
      dropout: int
        Dropout [default: 0]

    Returns:
      Nothing
    """
    super(Seq2SeqEncoder, self).__init__(**kwargs)
    ####################################################################
    # Fill in missing code below (...),
    # then remove or comment the line below to test your function
    raise NotImplementedError("Encoder Unit")
    ####################################################################
    # Embedding layer
    self.embedding = ...
    # Here you're going to implement a GRU as the RNN unit
    self.rnn = ...

  def forward(self, X, *args):
    """
    Forward pass of Seq2SeqEncoder

    Args:
      X: torch.tensor
        Input features

    Returns:
      output: torch.tensor
        Output with shape (`num_steps`, `batch_size`, `num_hiddens`)
      state: torch.tensor
        State with shape (`num_layers`, `batch_size`, `num_hiddens`)
    """
    # The output `X` shape: (`batch_size`, `num_steps`, `embed_size`)
    X = self.embedding(X)
    # In RNN models, the first axis corresponds to time steps
    X = X.permute(1, 0, 2)
    ####################################################################
    # Fill in missing code below (...),
    # then remove or comment the line below to test your function
    raise NotImplementedError("Forward pass")
    ####################################################################
    # When the initial state is not given, it defaults to zeros; the output should be the RNN applied to X
    output, state = ...
    # `output` shape: (`num_steps`, `batch_size`, `num_hiddens`)
    # `state` shape: (`num_layers`, `batch_size`, `num_hiddens`)

    return output, state


# Add event to airtable
atform.add_event('Coding Exercise 3: Encoder')

X = torch.zeros((4, 7), dtype=torch.long)
## uncomment the lines below.
# encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16, num_layers=2)
# encoder.eval()
# output, state = encoder(X)
# print(output.shape)
# print(state.shape)

Click for solution

torch.Size([7, 4, 16])
torch.Size([2, 4, 16])

Section 3.1: Decoder

As we just mentioned, the context variable \(\mathbf{c}\) of the encoder’s output encodes the entire input sequence \(x_1, \ldots, x_T\). Given the output sequence \(y_1, y_2, \ldots, y_{T'}\) from the training dataset, for each time step \(t'\) (the symbol differs from the time step \(t\) of input sequences or encoders), the probability of the decoder output \(y_{t'}\) is conditional on the previous output subsequence \(y_1, \ldots, y_{t'-1}\) and the context variable \(\mathbf{c}\), i.e., \(P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})\).

To model this conditional probability on sequences, we can use another RNN as the decoder. At any time step \(t^\prime\) on the output sequence, the RNN takes the output \(y_{t^\prime-1}\) from the previous time step and the context variable \(\mathbf{c}\) as its input, then transforms them and the previous hidden state \(\mathbf{s}_{t^\prime-1}\) into the hidden state \(\mathbf{s}_{t^\prime}\) at the current time step.

As a result, we can use a function \(g\) to express the transformation of the decoder’s hidden layer:

\[\begin{equation} \mathbf{s}_{t^\prime} = g(y_{t^\prime-1}, \mathbf{c}, \mathbf{s}_{t^\prime-1}) \end{equation}\]

After obtaining the hidden state of the decoder, we can use an output layer and the softmax operation to compute the conditional probability distribution \(P(y_{t^\prime} \mid y_1, \ldots, y_{t^\prime-1}, \mathbf{c})\) for the output at time step \(t^\prime\).

Following the seq2seq architecture shown above, when implementing the decoder as follows, we directly use the hidden state at the final time step of the encoder to initialize the hidden state of the decoder.

This requires that the RNN encoder and the RNN decoder have the same number of layers and hidden units. To further incorporate the encoded input sequence information, the context variable is concatenated with the decoder input at all the time steps. To predict the probability distribution of the output token, a fully-connected layer is used to transform the hidden state at the final layer of the RNN decoder.

class Seq2SeqDecoder(d2l.Decoder):
  """
  RNN decoder for sequence to sequence learning.
  Seq2SeqDecoder has the following structure:
  nn.Embedding(vocab_size, embed_size) # Embedding Layer
  nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers, dropout=dropout) # RNN Layer
  nn.Linear(num_hiddens, vocab_size) # Fully connected layer
  """

  def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                dropout=0, **kwargs):
    """
    Initialize parameters of Seq2SeqDecoder

    Args:
      num_layers: int
        Number of layers in GRU/RNN
      num_hiddens: int
        Size of hidden layer
      vocab_size: int
        Size of vocabulary
      embed_size: int
        Size of embedding
      dropout: int
        Dropout [default: 0]

    Returns:
      Nothing
    """
    super(Seq2SeqDecoder, self).__init__(**kwargs)
    self.embedding = nn.Embedding(vocab_size, embed_size)
    self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers,
                      dropout=dropout)
    self.dense = nn.Linear(num_hiddens, vocab_size)

  def init_state(self, enc_outputs, *args):
    """
    Initialise Seq2SeqDecoder state

    Args:
      enc_outputs: Seq2SeqEncoder instance
        Output of the Seq2SeqEncoder

    Returns:
      Init state of Seq2SeqDecoder as enc_outputs
    """
    return enc_outputs[1]

  def forward(self, X, state):
    """
    Forward pass of Seq2SeqDecoder

    Args:
      X: torch.tensor
        Input features
      state: Seq2SeqEncoder instance
        Output of the Seq2SeqEncoder

    Returns:
      output: torch.tensor
        Output with shape (`batch_size`, `num_steps`, `vocab_size`)
      state: torch.tensor
        State with shape (`num_layers`, `batch_size`, `num_hiddens`)
    """
    # The output `X` shape: (`num_steps`, `batch_size`, `embed_size`)
    X = self.embedding(X).permute(1, 0, 2)
    # Broadcast `context` so it has the same `num_steps` as `X`
    context = state[-1].repeat(X.shape[0], 1, 1)
    X_and_context = torch.cat((X, context), 2)
    output, state = self.rnn(X_and_context, state)
    output = self.dense(output).permute(1, 0, 2)
    # `output` shape: (`batch_size`, `num_steps`, `vocab_size`)
    # `state` shape: (`num_layers`, `batch_size`, `num_hiddens`)
    return output, state

To illustrate the implemented decoder, below we instantiate it with the same hyperparameters from the aforementioned encoder. As we can see, the output shape of the decoder becomes (batch size, number of time steps, vocabulary size), where the last dimension of the tensor stores the predicted token distribution.

decoder = Seq2SeqDecoder(vocab_size=10, embed_size=8, num_hiddens=16,
                         num_layers=2)
state = decoder.init_state(encoder(X))
output, state = decoder(X, state)
output.shape, len(state), state[0].shape

Section 3.2: Loss Function

At each time step, the decoder predicts a probability distribution for the output tokens. Similar to language modeling, we can apply softmax to obtain the distribution and calculate the cross-entropy loss for optimization. Recall that the special padding tokens are appended to the end of sequences so sequences of varying lengths can be efficiently loaded in minibatches of the same shape. However, prediction of padding tokens should be excluded from loss calculations.

To this end, we can use the following sequence_mask function to mask irrelevant entries with zeros, so that any irrelevant prediction multiplied by zero later contributes nothing. For example, if the valid lengths of two sequences (excluding the padding tokens) are one and two, respectively, then all entries after the first entry of the first sequence and after the first two entries of the second sequence are cleared to zeros.

def sequence_mask(X, valid_len, value=0):
  """
  Mask irrelevant entries in sequences.

  Args:
    X: torch.tensor
      Unmasked sequence as input
    valid_len: torch.tensor
      Valid Length
    value: int
      Mask value

  Returns:
    X: torch.tensor
      Output post masking
  """
  maxlen = X.size(1)
  mask = torch.arange((maxlen), dtype=torch.float32,
                      device=X.device)[None, :] < valid_len[:, None]
  X[~mask] = value
  return X


X = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(sequence_mask(X, torch.tensor([1, 2])))
X = torch.ones(2, 3, 4)
print(sequence_mask(X, torch.tensor([1, 2]), value=-1))
tensor([[1, 0, 0],
        [4, 5, 0]])
tensor([[[ 1.,  1.,  1.,  1.],
         [-1., -1., -1., -1.],
         [-1., -1., -1., -1.]],

        [[ 1.,  1.,  1.,  1.],
         [ 1.,  1.,  1.,  1.],
         [-1., -1., -1., -1.]]])

Now we can extend the softmax cross-entropy loss to allow the masking of irrelevant predictions. Initially, masks for all the predicted tokens are set to one. Once the valid length is given, the mask corresponding to any padding token will be cleared to zero. In the end, the loss for all the tokens will be multiplied by the mask to filter out irrelevant predictions of padding tokens in the loss.

class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
  """
  The softmax cross-entropy loss with masks.
  """

  def forward(self, pred, label, valid_len):
    """
    Forward pass of MaskedSoftmaxCELoss

    Args:
      pred: torch.tensor
        Predictions of shape: (`batch_size`, `num_steps`, `vocab_size`)
      label: torch.tensor
        Label of shape: (`batch_size`, `num_steps`)
      valid_len: torch.tensor
        Valid Length of shape (`batch_size`,)

    Returns:
      weighted_loss: torch.tensor
        Per-sequence weighted loss of shape (`batch_size`,)
    """
    weights = torch.ones_like(label)
    weights = sequence_mask(weights, valid_len)
    self.reduction = 'none'
    unweighted_loss = super(MaskedSoftmaxCELoss,
                            self).forward(pred.permute(0, 2, 1), label)
    weighted_loss = (unweighted_loss * weights).mean(dim=1)

    return weighted_loss


loss = MaskedSoftmaxCELoss()
loss(torch.ones(3, 4, 10),
     torch.ones((3, 4), dtype=torch.long),
     torch.tensor([4, 2, 0]))
tensor([2.3026, 1.1513, 0.0000])

In the following training loop, we concatenate the special beginning-of-sequence token and the original output sequence excluding the final token as the input to the decoder. This is called teacher forcing because the original output sequence (token labels) is fed into the decoder. Alternatively, we could also feed the predicted token from the previous time step as the current input to the decoder.
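
As a minimal sketch of this input shift (using a toy target batch of token ids and a hypothetical <bos> index of 1, not the tutorial's actual vocabulary):

import torch

Y = torch.tensor([[3, 5, 7, 2],
                  [4, 6, 2, 0]])                         # (batch_size, num_steps) of token ids
bos = torch.full((Y.shape[0], 1), 1, dtype=torch.long)   # hypothetical <bos> index = 1
dec_input = torch.cat([bos, Y[:, :-1]], dim=1)           # prepend <bos>, drop the final token
print(dec_input)                                         # tensor([[1, 3, 5, 7], [1, 4, 6, 2]])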

Training

#@title Training
def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
  """
  Train a model for sequence to sequence learning.

  Args:
    net: nn.Module
      Encoder-decoder model
    data_iter: iterable
      Training iterator yielding (X, X_valid_len, Y, Y_valid_len) batches
    lr: float
      Learning rate
    num_epochs: int
      Number of training epochs
    tgt_vocab: dict
      Target vocabulary
    device: string
      GPU/CUDA if available, CPU otherwise

  Returns:
    Nothing
  """

  def xavier_init_weights(m):
    """
    Function to initialise weights

    Args:
      m: nn.module
        Type of layer

    Returns:
      Nothing
    """
    if type(m) == nn.Linear:
      nn.init.xavier_uniform_(m.weight)
    if type(m) == nn.GRU:
      for param in m._flat_weights_names:
        if "weight" in param:
          nn.init.xavier_uniform_(m._parameters[param])


  net.apply(xavier_init_weights)
  net.to(device)
  optimizer = torch.optim.Adam(net.parameters(), lr=lr)
  loss = MaskedSoftmaxCELoss()
  net.train()
  animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                          xlim=[10, num_epochs])
  for epoch in range(num_epochs):
    timer = d2l.Timer()
    metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
    for batch in data_iter:
      optimizer.zero_grad()
      X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
      bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
                          device=device).reshape(-1, 1)
      dec_input = torch.cat([bos, Y[:, :-1]], 1)  # Teacher forcing
      Y_hat, _ = net(X, dec_input, X_valid_len)
      l = loss(Y_hat, Y, Y_valid_len)
      l.sum().backward()  # Make the loss scalar for `backward`
      d2l.grad_clipping(net, 1)
      num_tokens = Y_valid_len.sum()
      optimizer.step()
      with torch.no_grad():
        metric.add(l.sum(), num_tokens)
    if (epoch + 1) % 10 == 0:
      animator.add(epoch + 1, (metric[0] / metric[1],))
  print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
        f'tokens/sec on {str(device)}')

Now we can create and train an RNN encoder-decoder model for sequence to sequence learning on the machine translation dataset.

embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1
batch_size, num_steps = 64, 10
lr, num_epochs = 0.005, 300

train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)
encoder = Seq2SeqEncoder(len(src_vocab), embed_size, num_hiddens, num_layers,
                         dropout)
decoder = Seq2SeqDecoder(len(tgt_vocab), embed_size, num_hiddens, num_layers,
                         dropout)
net = d2l.EncoderDecoder(encoder, decoder)
train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, DEVICE)

To predict the output sequence token by token, at each decoder time step the predicted token from the previous time step is fed into the decoder as an input.

Similar to training, at the initial time step the beginning-of-sequence (“<bos>”) token is fed into the decoder. This prediction process is illustrated in the seq2seq figure below. When the end-of-sequence (“<eos>”) token is predicted, the prediction of the output sequence is complete.

Source: d2l.ai

Prediction

# @title Prediction
def predict_seq2seq(net, src_sentence, src_vocab,
                    tgt_vocab, num_steps,
                    device, save_attention_weights=False):
  """
  Predict for sequence to sequence.

  Args:
    net: nn.module
      Instance of model
    src_sentence: string
      Source Sentence
    src_vocab: dict
      Source vocabulary
    tgt_vocab: dict
      Target vocabulary
    num_steps: int
      Number of steps
    save_attention_weights: boolean
      If true, save attention weights
    device: string
      If available, GPU/CUDA. CPU otherwise.

  Returns:
    The predicted sequence (detokenized with the target vocabulary)
    and the sequence of saved attention weights
  """
  # Set `net` to eval mode for inference
  net.eval()
  src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
      src_vocab['<eos>']]
  enc_valid_len = torch.tensor([len(src_tokens)], device=device)
  src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])

  # Add the batch axis
  enc_X = torch.unsqueeze(
      torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)
  enc_outputs = net.encoder(enc_X, enc_valid_len)
  dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)

  # Add the batch axis
  dec_X = torch.unsqueeze(
      torch.tensor([tgt_vocab['<bos>']], dtype=torch.long, device=device),
      dim=0)
  output_seq, attention_weight_seq = [], []
  for _ in range(num_steps):
    Y, dec_state = net.decoder(dec_X, dec_state)

    # We use the token with the highest prediction likelihood as the input
    # of the decoder at the next time step
    dec_X = Y.argmax(dim=2)
    pred = dec_X.squeeze(dim=0).type(torch.int32).item()

    # Save attention weights (to be covered later)
    if save_attention_weights:
      attention_weight_seq.append(net.decoder.attention_weights)

    # Once the end-of-sequence token is predicted, the generation of the
    # output sequence is complete
    if pred == tgt_vocab['<eos>']:
      break
    output_seq.append(pred)
  return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq

We can evaluate a predicted sequence by comparing it with the label sequence (the ground truth). BLEU (Bilingual Evaluation Understudy), though originally proposed for evaluating machine translation results in Papineni et al., 2002, has been extensively used in measuring the quality of output sequences for different applications.

In principle, for any \(n\)-gram in the predicted sequence, BLEU evaluates whether this \(n\)-gram appears in the label sequence.

Denote by \(p_n\) the precision of \(n\)-grams, which is the ratio of the number of matched \(n\)-grams in the predicted and label sequences to the number of \(n\)-grams in the predicted sequence. To explain, given a label sequence \(A\), \(B\), \(C\), \(D\), \(E\), \(F\), and a predicted sequence \(A\), \(B\), \(B\), \(C\), \(D\), we have \(p_1 = 4/5\), \(p_2 = 3/4\), \(p_3 = 1/3\), and \(p_4 = 0\).

Besides, let \(\mathrm{len}_{\text{label}}\) and \(\mathrm{len}_{\text{pred}}\) be the numbers of tokens in the label sequence and the predicted sequence, respectively.

Then, BLEU is defined as

(92)\[\begin{equation} \exp\left(\min\left(0, 1 - \frac{\mathrm{len}_{\text{label}}}{\mathrm{len}_{\text{pred}}}\right)\right) \prod_{n=1}^k p_n^{1/2^n}, \end{equation}\]

where \(k\) is the length of the longest \(n\)-gram used for matching.

Based on the definition of BLEU in the above equation, whenever the predicted sequence is the same as the label sequence, BLEU is 1.

Moreover, since matching longer \(n\)-grams is more difficult, BLEU assigns a greater weight to a longer \(n\)-gram precision. Specifically, when \(p_n\) is fixed, \(p_n^{1/2^n}\) increases as \(n\) grows (the original paper uses \(p_n^{1/n}\)).

Furthermore, since predicting shorter sequences tends to obtain a higher \(p_n\) value, the coefficient before the multiplication term in the above equation penalizes shorter predicted sequences.

For example, when \(k=2\), given the label sequence \(A\), \(B\), \(C\), \(D\), \(E\), \(F\) and the predicted sequence \(A\), \(B\), although \(p_1 = p_2 = 1\), the penalty factor \(\exp(1-6/2) \approx 0.14\) lowers the BLEU.
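
The two observations above can be checked numerically; the snippet below is a quick sketch (the value 0.5 used for \(p_n\) is just an assumed example).

import math

# With p_n fixed at an assumed 0.5, the factor p_n ** (1 / 2 ** n) grows with n,
# so longer n-gram precisions receive a larger weight.
print([round(0.5 ** (1 / 2 ** n), 3) for n in range(1, 5)])  # [0.707, 0.841, 0.917, 0.958]

# Brevity penalty for the example above (len_label = 6, len_pred = 2)
print(round(math.exp(min(0, 1 - 6 / 2)), 3))  # 0.135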

We implement the BLEU measure as follows.

Evaluation of Predicted Sequences

#@title Evaluation of Predicted Sequences
def bleu(pred_seq, label_seq, k):
  """
  Compute the BLEU Score

  Args:
    pred_seq: string
      Predicted Sequence
    label_seq: string
      Ground truth
    k: int
      Longest n-gram length used for matching

  Returns:
    score: float
      BLEU score
      A score between 0 and 1 indicating how similar
      the predicted and reference sequences are.
  """
  pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
  len_pred, len_label = len(pred_tokens), len(label_tokens)
  score = math.exp(min(0, 1 - len_label / len_pred))
  for n in range(1, k + 1):
    num_matches, label_subs = 0, collections.defaultdict(int)
    for i in range(len_label - n + 1):
      label_subs[''.join(label_tokens[i:i + n])] += 1
    for i in range(len_pred - n + 1):
      if label_subs[''.join(pred_tokens[i:i + n])] > 0:
        num_matches += 1
        label_subs[''.join(pred_tokens[i:i + n])] -= 1
    score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
  return score
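
As a quick sanity check (not part of the original implementation), we can run bleu on the worked example from the text: with label A, B, C, D, E, F and prediction A, B, B, C, D, the score should be \(\exp(1 - 6/5)\,(4/5)^{1/2}\,(3/4)^{1/4} \approx 0.68\), while the short prediction A, B is heavily penalized.

print(round(bleu('A B B C D', 'A B C D E F', k=2), 3))  # approximately 0.681
print(round(bleu('A B', 'A B C D E F', k=2), 3))        # approximately 0.135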

In the end, we use the trained RNN encoder-decoder to translate a few English sentences into French and compute the BLEU of the results.

engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
for eng, fra in zip(engs, fras):
  translation, attention_weight_seq = predict_seq2seq(net,
                                                      eng,
                                                      src_vocab,
                                                      tgt_vocab,
                                                      num_steps,
                                                      DEVICE)
  print(f'{eng} => {translation}, bleu {bleu(translation, fra, k=2):.3f}')

Section 4: Ethical aspects

Time estimate: ~7mins

Video 5: Ethics of Representation and Generation


Summary

During this day, we have learned about modern RNNs and their variants. Now let's look at some ethical aspects of representation and generation, and then we will close the tutorials with an overview.

Video 6: Beyond Sequence


Bonus: Attention

Video 7: Attention mechanisms

Previously, we designed an encoder-decoder architecture based on two RNNs for sequence to sequence learning. Specifically, the RNN encoder transforms a variable-length sequence into a fixed-shape context variable, and the RNN decoder then generates the output (target) sequence token by token based on the generated tokens and the context variable. However, even though not all the input (source) tokens are useful for decoding a given token, the same context variable, which encodes the entire input sequence, is still used at every decoding step. This makes it challenging for the model to deal with long sentences.

In Bahdanau et al., 2014, the authors proposed a technique called attention. When predicting a token, if not all the input tokens are relevant, the model aligns (or attends) only to parts of the input sequence that are relevant to the current prediction.

In contrast to the plain seq2seq model, the encoder now passes much more data to the decoder: instead of passing only the last hidden state of the encoding stage, it passes all of its hidden states to the decoder.

To focus on the parts of the input that are relevant to the decoder, we look at the set of encoder hidden states the decoder received. Each encoder hidden state is most strongly associated with a certain word in the input sentence. We assign each hidden state a score, pass the scores through a softmax, and multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores.

Reference Links:

Media 1: Sequence to Sequence model with Attention

# @markdown Media 1: Sequence to Sequence model with Attention

url = "https://jalammar.github.io/images/seq2seq_7.mp4"
from IPython.display import HTML
HTML(f"""<video src={url} width=750 controls/>""")

Media 2: Mapping input to output

# @markdown Media 2: Mapping input to output

url = "https://jalammar.github.io/images/seq2seq_9.mp4"
from IPython.display import HTML
HTML(f"""<video src={url} width=750 controls/>""")

Queries, Keys, and Values

To calculate attention we make use of queries, keys, and values. But what are queries, keys, and values? They are transformations of the input vectors.

In an attention mechanism the context vector is computed as a weighted sum of values, where the weight assigned to each value is computed through an attention score. The score is usually the dot product between the query and key. The scores then go through the softmax function to yield a set of weights whose sum equals 1.

The query is from the decoder hidden state whereas the key and value are from the encoder hidden state.

Take a minute and look at this article; it has a detailed graphical explanation of how attention scores are calculated.
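
To make the query/key/value recipe concrete, here is a minimal, hedged sketch of dot-product attention in PyTorch (the shapes and names below are illustrative and are not the implementation used in the exercise that follows):

import torch
import torch.nn.functional as F

# One query (e.g., a decoder hidden state) attends over `num_steps` keys/values
# (e.g., the encoder hidden states).
batch_size, num_steps, hidden = 2, 5, 8
query = torch.randn(batch_size, 1, hidden)          # (batch, 1, hidden)
keys = torch.randn(batch_size, num_steps, hidden)   # (batch, num_steps, hidden)
values = keys                                       # here keys and values coincide

scores = torch.bmm(query, keys.transpose(1, 2))     # dot products: (batch, 1, num_steps)
weights = F.softmax(scores, dim=-1)                 # weights sum to 1 over the steps
context = torch.bmm(weights, values)                # weighted sum of values: (batch, 1, hidden)
print(weights.shape, context.shape)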

Bonus Coding Exercise: Attention for Text Classification

Until now, we have looked at attention aimed at seq2seq networks. Let's try implementing attention for the IMDB sentiment analysis dataset used above. Previously, using the LSTM, the classification depended entirely on the last hidden state. In this exercise, we will compute attention scores between the last hidden state and the LSTM output at each time step. The final attention vector will be the weighted average of the outputs over time steps, with the weights being the attention scores. Lastly, we will concatenate the attention vector and the last hidden state to get the final output.

For simplicity’s sake, let’s implement attention over an LSTM with 1 layer.

Code reference

class AttentionModel(torch.nn.Module):
  """
  Attention Model with following structure:
  nn.Embedding(vocab_size, embedding_length) + nn.Parameter(weights, requires_grad=False) # Embedding Layer
  nn.LSTM(embedding_length, hidden_size) # LSTM layer
  nn.Linear(2*hidden_size, output_size) # First Fully Connected layer
  """

  def __init__(self, batch_size, output_size, hidden_size, vocab_size,
               embedding_length, weights, device):
    """
    Initialize parameters of AttentionModel

    Args:
      batch_size: int
        Batch size
      output_size: int
        Size of output layer
      hidden_size: int
        Size of hidden layer
      vocab_size: int
        Vocabulary size
      weights: torch.tensor
        Pretrained word-embedding weights
      device: string
        GPU/CUDA if available. CPU otherwise.
      embedding_length: int
        Length of the embedding

    Returns:
      Nothing
    """
    super(AttentionModel, self).__init__()
    self.hidden_size = hidden_size
    self.word_embeddings = nn.Embedding(vocab_size, embedding_length)
    self.word_embeddings.weights = nn.Parameter(weights, requires_grad=False)
    self.lstm = nn.LSTM(embedding_length, hidden_size)
    self.fc1 = nn.Linear(2*hidden_size, output_size)
    self.device = device
    self.num_seq = sentence_length

  def attention_net(self, lstm_output, final_state):
    """
    Returns hidden states based on AttentionNet

    Args:
      lstm_output : torch.tensor
        LSTM Output of shape: (num_seq, batch_size, hidden_size)
      final_state : torch.tensor
        Final State of shape: (1, batch_size, hidden_size)

    Returns:
      new_hidden_state: torch.tensor
        Weighted LSTM output
    """
    ####################################################
    # Implement the AttentionNet
    # Fill in missing code below (...)
    raise NotImplementedError("Implement the attention net")
    ####################################################
    # Permute the output to get the shape (batch_size, num_seq, hidden_size)
    # Get the attention weights
    # Use torch.bmm to compute the attention weights between each output and last hidden state
    # Pay attention to the tensor shapes, you may have to use squeeze and unsqueeze functions
    # Softmax the attention weights
    # Get the new hidden state, use torch.bmm to get the weighted lstm output
    # Pay attention to the tensor shapes, you may have to use squeeze and unsqueeze functions
    lstm_output = ...
    hidden = ...
    attn_weights = ...  # Expected shape: (batch_size, num_seq)
    soft_attn_weights = ...
    new_hidden_state = ...

    return new_hidden_state

  def forward(self, input_sentences):
    """
    Forward pass of NeuralNet

    Args:
      input_sentences: string
        Input Sentences

    Returns:
      logits: torch.tensor
        Output of the final fully connected layer
    """
    input = self.word_embeddings(input_sentences)
    input = input.permute(1, 0, 2)

    h_0 = torch.zeros(1, input.shape[1], self.hidden_size).to(self.device)
    c_0 = torch.zeros(1, input.shape[1], self.hidden_size).to(self.device)

    output, (final_hidden_state, final_cell_state) = self.lstm(input, (h_0, c_0))
    attn_output = self.attention_net(output, final_hidden_state)
    final_output = torch.cat((attn_output, final_hidden_state[0]), 1)
    logits = self.fc1(final_output)

    return logits


# Uncomment to check AttentionModel class
# attention_model = AttentionModel(32, 2, 16, 20, 200, TEXT.vocab.vectors, DEVICE)
# print(attention_model)
class AttentionModel(torch.nn.Module):
  """
  Attention Model with following structure:
  nn.Embedding(vocab_size, embedding_length) + nn.Parameter(weights, requires_grad=False) # Embedding Layer
  nn.LSTM(embedding_length, hidden_size) # LSTM layer
  nn.Linear(2*hidden_size, output_size) # First Fully Connected layer
  """

  def __init__(self, batch_size, output_size, hidden_size, vocab_size, embedding_length, sentence_length, weights, device):
    """
    Initialize parameters of AttentionModel

    Args:
      batch_size: int
        Batch size
      output_size: int
        Size of output layer
      hidden_size: int
        Size of hidden layer
      vocab_size: int
        Vocabulary size
      weights: torch.tensor
        Pretrained word-embedding weights
      device: string
        GPU/CUDA if available. CPU otherwise.
      embedding_length: int
        Length of the embedding
      sentence_length: int
        Number of time steps (padded sentence length)

    Returns:
      Nothing
    """
    super(AttentionModel, self).__init__()
    self.hidden_size = hidden_size
    self.word_embeddings = nn.Embedding(vocab_size, embedding_length)
    self.word_embeddings.weights = nn.Parameter(weights, requires_grad=False)
    self.lstm = nn.LSTM(embedding_length, hidden_size)
    self.fc1 = nn.Linear(2*hidden_size, output_size)
    self.device = device
    self.num_seq = sentence_length

  def attention_net(self, lstm_output, final_state, batch_size=32):
    """
    Returns hidden states based on AttentionNet

    Args:
      lstm_output : torch.tensor
        LSTM Output of shape: (num_seq, batch_size, hidden_size)
      final_state : torch.tensor
        Final State of shape: (1, batch_size, hidden_size)
      batch_size: int
        Batch size (default: 32)

    Returns:
      new_hidden_state: torch.tensor
        Weighted LSTM output
    """
    # Permute the output to get the shape (batch_size, num_seq, hidden_size)
    # Get the attention weights
    # Use torch.bmm to compute the attention weights between each output and last hidden state
    # Pay attention to the tensor shapes, you may have to use squeeze and unsqueeze functions
    # Softmax the attention weights
    # Get the new hidden state, use torch.bmm to get the weighted lstm output
    # Pay attention to the tensor shapes, you may have to use squeeze and unsqueeze functions
    lstm_output = lstm_output.permute(1, 0, 2)
    hidden = final_state.squeeze(0).unsqueeze(2)
    attn_weights = torch.matmul(lstm_output, hidden)
    attn_weights = torch.reshape(attn_weights, (batch_size, self.num_seq))  # Expected shape: (batch_size, num_seq)
    soft_attn_weights = F.softmax(attn_weights, 1)
    new_hidden_state = torch.bmm(lstm_output.transpose(1, 2), soft_attn_weights.unsqueeze(2)).squeeze(2)
    return new_hidden_state

  def forward(self, input_sentences):
    """
    Forward pass of NeuralNet

    Args:
      input_sentences: string
        Input Sentences

    Returns:
      logits: torch.tensor
        Output of the final fully connected layer
    """
    input = self.word_embeddings(input_sentences)
    input = input.permute(1, 0, 2)
    h_0 = torch.zeros(1, input.shape[1], self.hidden_size).to(self.device)
    c_0 = torch.zeros(1, input.shape[1], self.hidden_size).to(self.device)
    output, (final_hidden_state, final_cell_state) = self.lstm(input, (h_0, c_0))
    attn_output = self.attention_net(output, final_hidden_state, input.shape[1])
    final_output = torch.cat((attn_output, final_hidden_state[0]), 1)
    logits = self.fc1(final_output)
    return logits

# Check the AttentionModel class
attention_model = AttentionModel(32, 2, 16, 20, 200, 50, TEXT.vocab.vectors, DEVICE)
print(attention_model)
AttentionModel(
  (word_embeddings): Embedding(20, 200)
  (lstm): LSTM(200, 16)
  (fc1): Linear(in_features=32, out_features=2, bias=True)
)

Reload dataset using the default params since variables have been overwritten

# @markdown Reload dataset using the default params since variables have been overwritten
TEXT, vocab_size, train_iter, valid_iter, test_iter = load_dataset(seed=SEED)
learning_rate = 0.0001
batch_size = 32  # Initially was 16
output_size = 2
hidden_size = 16
embedding_length = 300
epochs = 10  # Initially was 12
sentence_length = 50

word_embeddings = TEXT.vocab.vectors
vocab_size = len(TEXT.vocab)

attention_model = AttentionModel(batch_size,
                                 output_size,
                                 hidden_size,
                                 vocab_size,
                                 embedding_length, sentence_length,
                                 word_embeddings,
                                 DEVICE)
attention_model.to(DEVICE)
attention_start_time = time.time()
set_seed(SEED)
attention_train_loss, attention_train_acc, attention_validation_loss, attention_validation_acc = train(attention_model,
                                                                                                       DEVICE,
                                                                                                       train_iter,
                                                                                                       valid_iter,
                                                                                                       epochs,
                                                                                                       learning_rate)
print("--- Time taken to train = %s seconds ---" % (time.time() - attention_start_time))
test_accuracy = test(attention_model, DEVICE, test_iter)
print(f'\n\nTest Accuracy: {test_accuracy}%')
plt.figure()
plt.subplot(211)
plot_train_val(np.arange(0, epochs),
               attention_train_acc,
               attention_validation_acc,
               'train accuracy',
               'val accuracy',
               'attention on IMDB text classification',
               'accuracy',
               color='C0')
plt.legend(loc='upper left')
plt.subplot(212)
plot_train_val(np.arange(0, epochs),
               attention_train_loss,
               attention_validation_loss,
               'train loss',
               'val loss',
               '',
               'loss',
               color='C1')
plt.tight_layout()
plt.legend(loc='upper left')
plt.show()