Tutorial 2: Natural Language Processing and LLMs

Week 3, Day 1: Time Series and Natural Language Processing

By Neuromatch Academy

Content creators: Lyle Ungar, Jordan Matelsky, Konrad Kording, Shaonan Wang

Content reviewers: Shaonan Wang, Weizhe Yuan

Content editors: Konrad Kording, Shaonan Wang

Production editors: Konrad Kording, Spiros Chavlis

Our 2021 Sponsors, including Presenting Sponsor Facebook Reality Labs

Tutorial Objectives

This tutorial provides a comprehensive overview of modern natural language processing. It introduces two influential natural language processing (NLP) architectures, BERT and GPT, along with a detailed exploration of the underlying NLP pipeline. Participants will learn about the core concepts, functionalities, and applications of these architectures, as well as gain insights into prompt engineering and the current and future developments of GPT.

Tutorial slides

These are the slides for the videos in all tutorials today

If you want to download the slides:


Install dependencies

WARNING: There may be errors and/or warnings reported during the installation. These can be ignored.

!pip3 install gensim==4.3.1 --quiet
!pip3 install pytorch_lightning --quiet
!pip3 install typing_extensions --quiet

!pip install accelerate --quiet
!pip3 install datasets --quiet
!pip3 install transformers==4.28.0 --quiet
!pip3 install evaluate --quiet
!pip3 install vibecheck datatops --quiet

from vibecheck import DatatopsContentReviewContainer

def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "",
            "name": "public_testbed",
            "user_key": "3zg0t05r",
        },
    ).render()

# Imports
import random
from typing import Iterable, List
from tqdm.notebook import tqdm
from typing import Dict
import pytorch_lightning as pl

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from tokenizers import Tokenizer, Regex, models, normalizers, pre_tokenizers, trainers, processors

Set random seed

Executing set_seed(seed=seed) sets the random seed.

# For DL, it's critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here:

# Call the `set_seed` function in the exercises to ensure reproducibility.
import random
import numpy as np

def set_seed(seed=None):
  if seed is None:
    seed = int(np.random.choice(2 ** 32))
  # Seed every random number generator used in this notebook
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  print(f'Random seed {seed} has been set.')

set_seed(seed=2023)  # change 2023 with any number you like
Random seed 2023 has been set.

Set device (GPU or CPU). Execute set_device()

# Inform the user if the notebook uses GPU or CPU.

def set_device():
  """
  Set the device. CUDA if available, CPU otherwise
  """
  device = "cuda" if torch.cuda.is_available() else "cpu"
  if device != "cuda":
    print("WARNING: For this notebook to perform best, "
        "if possible, in the menu under `Runtime` -> "
        "`Change runtime type.`  select `GPU` ")
  else:
    print("GPU is enabled in this notebook.")

  return device

DEVICE = set_device()
SEED = 2021
set_seed(seed=SEED)
GPU is enabled in this notebook.
Random seed 2021 has been set.

Section 1: NLP architectures

From RNN/LSTM to Transformers.

Video 1: Intro to NLPs and LLMs

A core principle of much of Natural Language Processing is embedding words as vectors. In the relevant vector space, words with similar meanings are close to one another.
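To make "close to one another" concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three-dimensional vectors and the word choices are made up for illustration; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Toy, hand-written "embeddings" (hypothetical values, not learned from data).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these toy values, "cat" scores as more similar to "dog" than to "car", which is the kind of geometry a trained embedding is expected to show.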

In classical transformer systems, a core principle is encoding and decoding. We can take an input sequence and encode it as a vector (that implicitly codes what we just read). And we can then take this vector and decode it, e.g. as a new sentence. So a sequence-to-sequence (e.g. sentence translation) system may read a sentence (made out of words that are embedded in a relevant space) and encode it as an overall vector. It then takes the resulting encoding of the sentence and decodes it into a translated sentence.

In modern transformer systems, such as GPT, all of the words are used in parallel. In that sense the transformers are a generalization of the encoding/decoding idea. Examples of this strategy include all the modern large language models (such as GPT).

Today we will talk about these two approaches.

Section 2: The NLP pipeline

Tokenize, pretrain, fine-tune

Video 2: NLP pipeline


Today we will practise embedding techniques and continue our march toward large language models and transformers by discussing one of the critical developments of the modern NLP stack: tokenization. Tokenizers convert input text into a sequence of discrete tokens.

Learning Goals

  • Understand the concept of tokenization and why it is useful.

  • Learn how to write a tokenizer from scratch, taking advantage of context.

  • Get an intuition for how modern tokenizers work by playing with a few pre-trained tokenizers from industry.

Generating a dataset

As we continue to move closer to “production-grade” NLP, we’ll start to use industry standards such as the HuggingFace libraries. HuggingFace is a company that hosts and facilitates the sharing of models, datasets, and other components of modern deep learning systems.

We’ll start by generating a training dataset. HuggingFace (hf) has a convenient datasets module that allows us to download a variety of datasets, including the Wikipedia text corpus. We’ll use this to generate a dataset of text from Wikipedia:

from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
Downloading and preparing dataset wikitext/wikitext-103-raw-v1 to /home/spiros/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...
Dataset wikitext downloaded and prepared to /home/spiros/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.
{'text': ' Gray wolves howl to assemble the pack ( usually before and after hunts ) , to pass on an alarm ( particularly at a den site ) , to locate each other during a storm or unfamiliar territory and to communicate across great distances . Wolf howls can under certain conditions be heard over areas of up to 130 km2 ( 50 sq mi ) . Wolf howls are generally indistinguishable from those of large dogs . Male wolves give voice through an octave , passing to a deep bass with a stress on " O " , while females produce a modulated nasal baritone with stress on " U " . Pups almost never howl , while yearling wolves produce howls ending in a series of dog @-@ like yelps . Howling consists of a fundamental frequency that may lie between 150 and 780 Hz , and consists of up to 12 harmonically related overtones . The pitch usually remains constant or varies smoothly , and may change direction as many as four or five times . Howls used for calling pack mates to a kill are long , smooth sounds similar to the beginning of the cry of a horned owl . When pursuing prey , they emit a higher pitched howl , vibrating on two notes . When closing in on their prey , they emit a combination of a short bark and a howl . When howling together , wolves harmonize rather than chorus on the same note , thus creating the illusion of there being more wolves than there actually are . Lone wolves typically avoid howling in areas where other packs are present . Wolves from different geographic locations may howl in different fashions : the howls of European wolves are much more protracted and melodious than those of North American wolves , whose howls are louder and have a stronger emphasis on the first syllable . The two are however mutually intelligible , as North American wolves have been recorded to respond to European @-@ style howls made by biologists . \n'}
def generate_n_examples(dataset, n=512):
  """
  Produce a generator that yields n examples at a time from the dataset.
  """
  for i in range(0, len(dataset), n):
    yield dataset[i:i + n]['text']

Now we are going to create the actual Tokenizer, adhering to the hf.Tokenizer protocol. (Adhering to a standard protocol enables us to swap in our tokenizer for any tokenizer in the huggingface ecosystem, or to apply our own tokenizer to any model in the huggingface ecosystem.)

Let’s sketch out the steps of writing a Tokenizer. We need to solve two problems:

  • Given a string, split it into a list of tokens.

  • If you don’t recognize a word, still figure out a way to tokenize it!

This may feel like we’re reinventing our one-hot encoder, but with a richer vocabulary. Why is it that the One-Hot-Encoder, which outputs a vector of length \(|V|\), where \(|V|\) is the size of our vocabulary, is not sufficient, but a tokenizer that outputs a list of indices into a vocabulary of size \(|V|\) is sufficient? The answer is that while our encoder was responsible for embedding words into a high-dimensional space, our tokenizer is NOT; the “win” of a tokenizer is that it breaks up a string into in-vocab elements. For certain workflows, the very next step might be adding an embedder onto the end of the tokenizer. (As we’ll soon see, this is exactly the strategy employed by modern Transformer models.)
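The division of labour between tokenizer and embedder can be sketched in a few lines. The vocabulary and the two-dimensional embedding table below are hypothetical toy values; the point is only that the tokenizer emits indices, and that looking up a row by index gives the same result as multiplying a one-hot vector by the table.

```python
# Hypothetical tiny vocabulary and embedding table, for illustration only.
vocab = {"[UNK]": 0, "hello": 1, "world": 2, "!": 3}
embedding_table = [
    [0.0, 0.0],    # row 0: [UNK]
    [0.5, -1.0],   # row 1: hello
    [1.5, 2.0],    # row 2: world
    [-0.5, 0.25],  # row 3: !
]

def tokenize_to_ids(tokens):
    """Tokenizer side: map each token to an integer index (unknowns -> [UNK])."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def one_hot_times_table(index):
    """Multiply a one-hot row vector by the embedding table."""
    vec = one_hot(index, len(embedding_table))
    dim = len(embedding_table[0])
    return [sum(vec[r] * embedding_table[r][c] for r in range(len(vec)))
            for c in range(dim)]

ids = tokenize_to_ids(["hello", "world", "toastersocks"])
looked_up = [embedding_table[i] for i in ids]        # embedder side: row lookup
multiplied = [one_hot_times_table(i) for i in ids]   # same result, more work
print(ids)
print(looked_up == multiplied)
```

Because the lookup and the one-hot product agree, the tokenizer never needs to materialize a length-\(|V|\) vector; it only has to decide which in-vocab pieces a string is made of.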

Tokens will almost always be different from words; for example, we might want to split “don’t” into “do” and “n’t”, or we might want to split “don’t” into “do” and “not”. Or we might even want to split “don’t” into “d”, “o”, “n”, and “t”. We can choose any strategy we want here; unlike Word2Vec, our tokenizer will NOT be limited to outputting one vector per English word. Here, we’ll use an off-the-shelf subword splitter, which we discuss below.

# Try playing with these hyperparameters!
VOCAB_SIZE = 12_000
# Create a tokenizer object that uses the "WordPiece" model. The WordPiece model
# is a subword tokenizer that uses a vocabulary of common words and word pieces
# to tokenize text. The "unk_token" parameter specifies the token to use for
# unknown tokens, i.e. tokens that are not in the vocabulary. (Remember that the
# vocabulary will be built from our dataset, so it will include subchunks of
# English words.)
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

Tokenizer Features

Now let’s start dressing up our tokenizer with some useful features. First, let’s clean up the text. This process is formally called “normalization”, and it is a critical step in any NLP pipeline. We’ll start by replacing every whitespace character with a plain space, and then we’ll convert all the text to lowercase. We’ll also remove diacritics (accents) from the text:

# Think of a Normalizer Sequence the same way you would think of a PyTorch
# Sequential model. It is a sequence of normalizers that are applied to the
# text before tokenization, in the order that they are added to the sequence.

tokenizer.normalizer = normalizers.Sequence([
    normalizers.Replace(Regex(r"[\s]"), " "), # Replace each whitespace character with a space
    normalizers.Lowercase(), # Convert all text to lowercase
    normalizers.NFD(), # Decompose all characters into their base characters
    normalizers.StripAccents(), # Remove all accents
])
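
To see what these normalization steps do to a string, here is a rough standard-library approximation. It is a sketch for building intuition, not the tokenizers library's implementation; the `normalize` helper and the sample text are made up for this example.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Stdlib approximation of the normalizer sequence (an illustration only)."""
    text = re.sub(r"\s", " ", text)            # each whitespace char -> space
    text = text.lower()                         # lowercase
    text = unicodedata.normalize("NFD", text)   # split letters from accents
    # Drop combining marks (Unicode category "Mn"), i.e. the accents.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(normalize("Café\tNaïve\nRÉSUMÉ"))
```

The NFD step matters: it turns "é" into "e" followed by a separate combining accent, which is what makes the accent removable in the final step.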

Next we’ll add a pre-tokenizer. The pre-tokenizer is applied to the text after we normalize it, but before it’s tokenized. The pre-tokenizer is useful for splitting text into chunks that are easier to tokenize. For example, we can split text into chunks that are separated by punctuation or whitespace:

tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(), # Split on whitespace
    pre_tokenizers.Digits(individual_digits=True), # Split digits into individual tokens
    pre_tokenizers.Punctuation(), # Split punctuation into individual tokens
])
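
The effect of a pre-tokenizer like this can be approximated with a regular expression. This is a simplified sketch (the `pre_tokenize` helper is hypothetical, and its notion of "punctuation" is cruder than the library's), meant only to show the kind of chunks the subword model receives.

```python
import re

def pre_tokenize(text: str) -> list:
    """Split on whitespace, then isolate each digit and punctuation mark."""
    chunks = []
    for piece in text.split():  # whitespace split
        # \d matches one digit; [^\w\s] one punctuation-like symbol;
        # [^\W\d]+ a run of word characters that are not digits.
        chunks.extend(re.findall(r"\d|[^\w\s]|[^\W\d]+", piece))
    return chunks

print(pre_tokenize("hello, world! it is 2023."))
```

Splitting digits individually keeps the vocabulary from filling up with arbitrary numbers: any number can then be spelled out of ten digit tokens.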

Finally, we’ll train the tokenizer on our dataset; after all, we want a tokenizer that works well on this dataset. There are a few different algorithms for training tokenizers. Here are two common ones:

  • BPE Algorithm: Start with a vocabulary of each character in the dataset. Count all pairs of adjacent vocabulary items in the dataset and merge the pair with the highest frequency into a new vocabulary item. Repeat until the target vocabulary size is reached. (So “ee” is more likely to get merged than “zf” in an English corpus.)

  • Top-Down WordPiece Algorithm: Generate all substrings of each word from the dataset and count occurrences in the training data. Keep any string that occurs more than a threshold number of times. Repeat this process until the vocabulary size is reached. (For a more thorough explanation of this process, see the TensorFlow Guide.)
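
The BPE merge loop described in the first bullet can be sketched on a toy corpus. The word counts below are made up, and this is a bare-bones illustration of the idea rather than the implementation used by any particular library.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pair_counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    return pair_counts.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> count. Made-up numbers.
words = {tuple("see"): 5, tuple("seen"): 3, tuple("bee"): 2, tuple("zebra"): 1}
vocab = {ch for word in words for ch in word}

for _ in range(2):  # two merge steps
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    vocab.add(pair[0] + pair[1])
    print("merged", pair)
```

In this toy corpus the first merge joins "e" and "e", and the second joins "s" with the new "ee" symbol, so frequent words quickly become single vocabulary items while rare ones stay split into characters.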

We’ll use WordPiece:

tokenizer_trainer = trainers.WordPieceTrainer(
    vocab_size=VOCAB_SIZE,
    # We have to specify the special tokens that we want to use. These will be
    # added to the vocabulary no matter what the vocab-building algorithm does.
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

Those special tokens are important because they tell the WordPiece training process how to treat phrases, masks, and unknown tokens. (Note that we can also add our own special tokens, such as [CITE] to indicate when a citation is about to be used, if we wanted to train a model to predict the presence of citations in a text.) Training this will take a bit of time.

sample_ratio = 0.2
keep = int(len(dataset)*sample_ratio)
dataset_small = load_dataset("wikitext", "wikitext-103-raw-v1", split=f"train[:{keep}]")
tokenizer.train_from_iterator(generate_n_examples(dataset_small), trainer=tokenizer_trainer, length=len(dataset_small))

# In "real life", we'd probably want to save the tokenizer to disk so that we
# can use it later. We can do this with the "save" method:

# Let's try it out!
print("Hello, world!")
print(
    *zip(
        tokenizer.encode("Hello, world!").tokens,
        tokenizer.encode("Hello, world!").ids,
    )
)
# Can we also tokenize made-up words?
print(tokenizer.encode("These toastersocks are so groommpy!").tokens)
Hello, world!
('hell', 9140) ('##o', 2266) (',', 16) ('world', 4375) ('!', 5)
['these', 'to', '##aster', '##so', '##ck', '##s', 'are', 'so', 'gro', '##omm', '##p', '##y', '!']

(The ## means that the token is a continuation of the previous chunk.)
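
How a trained WordPiece vocabulary gets applied to a new word can be sketched as a greedy longest-match search. The toy vocabulary below is hypothetical (the tokenizer trained above will have learned different pieces), so the splits differ from the output shown earlier.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (the trained tokenizer's vocabulary will differ).
toy_vocab = {"to", "toast", "##aster", "##er", "##so", "##ck", "##s", "hell", "##o"}
print(wordpiece_encode("hello", toy_vocab))
print(wordpiece_encode("toastersocks", toy_vocab))
print(wordpiece_encode("xyz", toy_vocab))
```

A made-up word is still tokenized as long as it can be covered by known pieces; only when no piece fits at some position does the word fall back to the unknown token.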

Try playing around with the hyperparameters and the tokenizing algorithms to see how they affect the tokenizer’s output. There can be some very major differences!

In summary, we created a tokenizer pipeline that:

  • Normalizes the text (standardizes whitespace, lowercases, and strips diacritics)

  • Splits the text into chunks (using whitespace and punctuation)

  • Trains the tokenizer on the dataset (using the WordPiece algorithm)

In common use, this would be the first step of any modern NLP pipeline. The next step would be to add an embedder to the end of the tokenizer, so that we can feed high-dimensional vectors to our model. But unlike Word2Vec, we can now separate the tokenization step from the embedding step, which means our encoding/embedding process can be task-specific, custom to our downstream neural net architecture, instead of general-purpose.


We established that the tokenizer is a better move than the One-Hot-Encoder because it can handle out-of-vocabulary words. But what if we just made a one-hot encoding where the vocabulary is all possible two-character combinations? Would there still be an advantage to the tokenizer? (Hint: Re-read the section on the BPE and WordPiece algorithms, and how the tokens are selected.)

tokenizer_vs_combinatorial_OHE = "" #@param {type:"string"}

Let’s think about a language like Chinese, where words are each composed of relatively few characters compared to English (hungry is six Unicode characters, but 饿 is one Unicode character), but there are many more unique Chinese characters than there are letters in the English alphabet. In a one- or two-sentence high-level sketch, what properties would be desirable for a Chinese tokenizer to have?

useful_chinese_tokenizer_properties = "" #@param {type:"string"}

Section 3: Using BERT

In this section we will learn about using the BERT model from huggingface.

Learning Goals

  • Understand the idea behind BERT

  • Understand the idea of pre-training and fine-tuning

  • Understand how freezing parts of the network is useful

Video 3: BERT

Section 4: NLG with GPT

In this section we will learn about Natural Language Generation with Generative Pretrained Transformers

Learning goals

  • How to produce language with GPTs

Video 4: NLG

Using SotA Models

Unless you are writing your own experimental DL research (and sometimes even then!) it is far more common these days to use the HuggingFace model library to quickly import and start working with state of the art models. In this section we will show you how to do that.

We will download a pretrained model from the hf transformers library that is used to generate text. We will then fine-tune it on a different dataset, using the hf.datasets library and the HuggingFace Trainer classes to make the process as easy as possible, and we’ll see that we can accomplish all of this in just a few lines of easily maintained code.

At the end, we will have a working generator… for code!

from transformers import AutoTokenizer
from datasets import load_dataset

We’re first going to pick a tokenizer. You can see some of the options here. We’ll use the CodeParrot tokenizer, which is a BPE tokenizer. But you can choose (or build!) another if you’d like to try offroading!

Quiz: Why can you use a different tokenizer than the one that was originally used? What requirements must another tokenizer for this task have?

Hint: You couldn’t, for example, use the very popular bert-base-uncased tokenizer, even though it’s a popular choice for text generation tasks that was trained on the English Wikipedia and the BookCorpus datasets (which are both available in the hf.datasets library).

why_tokenizer_choice = "" #@param{type:"string"}
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small")

Next we’ll download a pre-built model architecture. CodeParrot (the model) is a GPT-2 model, which is a transformer-based language model. You can see some of the options here. But you can choose (or build!) another!

Note that codeparrot/codeparrot is about 7GB to download (so it may take a while, or it may be too large for your runtime if you’re on a free Colab). Instead, we will use a smaller model, codeparrot/codeparrot-small, which is only ~500MB.

To run everything together — tokenization, model, and de-tokenization, we can use the pipeline function from transformers:

from transformers import AutoModelForCausalLM
from transformers import pipeline

# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead for GPT-style models
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small")
generation_pipeline = pipeline(
    "text-generation", # The task to run. This tells hf what the pipeline steps are
    model=model, # The model to use; can also pass the string name here.
    tokenizer=tokenizer, # The tokenizer to use; can also pass the string name here.
)

input_prompt = '''\
def simple_add(a: int, b: int) -> int:
    """
    Adds two numbers together and returns the result.
    """'''

# Return tensors for PyTorch:
inputs = tokenizer(input_prompt, return_tensors="pt")

Recall that these tokens are integer indices into the vocabulary of the tokenizer. We can use the tokenizer to decode these tokens into a string, which we can then print out to see what the model is generating.

input_token_ids = inputs["input_ids"]
input_strs = tokenizer.convert_ids_to_tokens(*input_token_ids.tolist())

print(*zip(input_strs, input_token_ids[0]))
('def', tensor(318)) ('Ġsimple', tensor(3486)) ('_', tensor(63)) ('add', tensor(525)) ('(', tensor(8)) ('a', tensor(65)) (':', tensor(26)) ('Ġint', tensor(1109)) (',', tensor(12)) ('Ġb', tensor(330)) (':', tensor(26)) ('Ġint', tensor(1109)) (')', tensor(9)) ('Ġ->', tensor(1035)) ('Ġint', tensor(1109)) (':', tensor(26)) ('ĊĠĠĠ', tensor(272)) ('Ġ"""', tensor(408)) ('ĊĠĠĠ', tensor(272)) ('ĠAdds', tensor(15747)) ('Ġtwo', tensor(2877)) ('Ġnumbers', tensor(5579)) ('Ġtogether', tensor(10451)) ('Ġand', tensor(436)) ('Ġreturns', tensor(2529)) ('Ġthe', tensor(314)) ('Ġresult', tensor(754)) ('.', tensor(14)) ('ĊĠĠĠ', tensor(272)) ('Ġ"""', tensor(408))

(Quick knowledge-check: what are the weirdly-rendering characters representing?)

This model is already ready to use! Let’s give it a try. (Note that we don’t use inputs — we just generated that to show the initial tokenization steps.)

Here, we use the pipeline that we created earlier to chain all of our components together. If you were writing a Copilot-style code-completer, you could get away with wrapping this single line in a nice API and calling it a day!

Play with the hyperparameters and see what kinds of outputs you can get. Temperature controls how much randomness there is in the model’s predictions: the model’s next-token scores are divided by the temperature before a token is sampled. Higher temperature means more randomness, and lower temperature means less randomness. More randomness will lead to wilder predictions, but potentially more creative answers as well. A good place to start is 0.2. You can also try changing the max_length parameter, which controls how long the generated code can be (though the model can opt to put a “stop” token in the middle of the sequence, so it may not always generate exactly this many tokens).
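
To build intuition for temperature, here is a small sketch of the usual formulation: divide the next-token scores (logits) by the temperature, then apply softmax. The logits are made-up numbers, and the helper is an illustration rather than the code path the pipeline uses internally.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up next-token scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature nearly all of the probability lands on the top-scoring token, so sampling is close to deterministic; at a higher temperature the distribution flattens and less likely tokens are sampled more often.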

outputs = generation_pipeline(input_prompt, max_length=100, num_return_sequences=1, temperature=0.2)
print(outputs[0]["generated_text"])
def simple_add(a: int, b: int) -> int:
    Adds two numbers together and returns the result.
    return a + b

def simple_mul(a: int, b: int) -> int:
    Multiplies two numbers together and returns the result.
    return a * b

def simple_div(a: int, b: int) -> int:
    Divides two numbers together

Let’s see if we can fool our model now! The huggingface documentation tells us that the codeparrot model was trained to generate Python code (docs). Let’s see if we can get it to generate some JavaScript:

input_prompt = "class SimpleAdder {"

print(generation_pipeline(input_prompt, max_length=100, num_return_sequences=1, temperature=0.2)[0]["generated_text"])
class SimpleAdder {
        def __init__(self, name, args, kwargs):
            self.name = name
            self.args = args
            self.kwargs = kwargs

        def __call__(self, *args, **kwargs):
            return self.args + self.kwargs

class SimpleAdder2 {
        def __init__(self, name, args, kwargs):
            self.name = name
            self.args = args

Yikes! I don’t know what it generated for you, but what it made for me was:

class SimpleAdder {
        class SimpleAdder(object):
            def __init__(self, a, b):
                self.a = a
                self.b = b

            def __call__(self, x):
                return self.a + x

Ew! That’s wrong in a lot of ways. But it’s understandable: Our model can’t really generalize outside of the domain in which it was trained. And so probably there were a few Python files that included syntax of other languages (perhaps generators for other code?) and so the model knows that there’s some mysterious syntax that uses curly brackets… But it’s not sure about anything else. (For the programming-language hobbyists among you: The public notation looks to me a lot like the model is trying to do something C-flavored and perhaps something Java-flavored; I like it! But it’s definitely not JavaScript.)

What are the major observations?

  • The syntax it’s generating rapidly devolves into Python; it’s able to predict only a few characters of non-Python before falling back on its familiar training territory.

  • The part of the code that follows Python syntax is valid, and even resembles a useful class definition. (Although if you look closely, it really doesn’t seem to do anything useful with the b attribute…) This tells us that the model “understands” its problem domain, but just hasn’t been trained on the right data to solve our new problem.

What are your other observations about the code it generated for you? You’re now aware of how Transformers work. (1) Think specifically and remark about the observations a machine learning practitioner would make here if your role were to diagnose the error in a production system. Now, (2) how would a nonexpert user interpret the issues? (3) Do you think the model-reported confidence for this output would be high, low, in between…?

out_of_domain_generation_observations = "" #@param {type:"string"}


Alright, so we have a model that can generate code. But now we want to fine-tune it to generate JavaScript.

Assuming the data will be too large to fit on disk on Colab, we’ll use the load_dataset function to download only part of the dataset. There’s actually a JavaScript subset to the codeparrot dataset, which we’ll use as an example… but you can use any dataset you like! We recommend filtering datasets by task category (e.g. text generation) to get the most relevant datasets, but you can use any dataset you like if you can configure the data-loader to use it. (Consider, for example, this one.)

Choose a dataset from the HuggingFace datasets library:

# Unlike _some_ code-generator models on the market, we'll limit our training data by license :)
dataset = load_dataset(
    licenses=["mit", "isc", "apache-2.0"],
)
# Print the schema of the first example from the training set:
print({k: type(v) for k, v in next(iter(dataset)).items()})
{'code': <class 'str'>, 'repo_name': <class 'str'>, 'path': <class 'str'>, 'language': <class 'str'>, 'license': <class 'str'>, 'size': <class 'int'>}

Just like training any model, we need to define a training loop and an evaluation metric.

This is made overwhelmingly easy with the transformers library. Specifically, take a look below at all of the code you can avoid by using the huggingface infrastructure. (In the past, we’ve used PyTorch Lightning, which had a similar training-loop abstraction. Do you have preferences between these two libraries?)

Here are the big pieces of what we do below:

  • Create a TrainingArguments object. This is a serializable object (i.e., you can save it to memory or to disk) that makes it easy to train a model reproducibly with the same hyperparameters. (This certainly beats having a bunch of global variables in your notebook!)

  • Encode the dataset. This is effectively just passing everything through the tokenizer, with a padding step that fills the end of each sequence with the padding token.

  • Define our metrics. We use the accuracy metric here (look at the 4th line in the code cell). Why might accuracy be a bad metric for this task? (Hint: What does it mean to be “accurate” in this task?)

accuracy_metric_observations = "" #@param {type:"string"}
  • Create a data collator. This is a function that takes a list of examples and returns a batch of examples. The DataCollatorForLanguageModeling class is a convenient way to do this.

  • Create a Trainer object. This is a class that wraps the training loop and makes it easy to train a model. It’s a bit like the Trainer class in PyTorch Lightning, but it is built around transformers models, tokenizers, and data collators.
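
The padding and collating steps in the list above can be sketched without any library. The helpers and the padding id below are hypothetical; the label-masking convention (copy the inputs, and mark padded positions with -100 so the loss ignores them) reflects how DataCollatorForLanguageModeling behaves with mlm=False, as far as we understand it.

```python
def pad_batch(sequences, pad_id, max_length):
    """Truncate/pad each sequence of token ids to exactly `max_length`."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

def causal_lm_labels(input_ids, pad_id, ignore_index=-100):
    """Labels are a copy of the inputs, with padding positions ignored by the loss."""
    return [[ignore_index if tok == pad_id else tok for tok in row]
            for row in input_ids]

PAD_ID = 0  # hypothetical padding id for this toy example
ids, mask = pad_batch([[5, 6, 7], [8, 9, 10, 11, 12, 13]], PAD_ID, max_length=5)
labels = causal_lm_labels(ids, PAD_ID)
print(ids)
print(mask)
print(labels)
```

One consequence worth keeping in mind: when the padding token is set to the end-of-sequence token, genuine end-of-sequence positions are masked out of the loss too, because they share the same id.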

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
import numpy as np
from evaluate import load
metric = load("accuracy")

# Trainer:
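training_args = TrainingArguments(
    output_dir="./codeparrot",  # Assumed placeholder path; the original value is not shown
    max_steps=100,  # Matches the 100 steps in the training log below
)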

tokenizer.pad_token = tokenizer.eos_token

encoded_dataset = dataset.map(
    lambda x: tokenizer(x["code"], truncation=True, padding="max_length"),
)

# Metrics for loss:
def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=-1)
  return metric.compute(predictions=predictions, references=labels)

# Data collator:
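data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Run the actual training:
trainer.train()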
[100/100 00:56, Epoch 1/9223372036854775807]
Step Training Loss

TrainOutput(global_step=100, training_loss=2.1988786315917968, metrics={'train_runtime': 61.1261, 'train_samples_per_second': 3.272, 'train_steps_per_second': 1.636, 'total_flos': 104516812800000.0, 'train_loss': 2.1988786315917968, 'epoch': 1.0})

Finally, we will try our model on the same code snippet to see how it performs after fine-tuning:

# Move the model to the CPU for inference:
model.to("cpu")

print(
    generation_pipeline(
        input_prompt, max_length=100, num_return_sequences=1, temperature=0.2
    )[0]["generated_text"]
)
class SimpleAdder {
    function () {

module.exports = SimpleAdder

module.exports.prototype = SimpleAdder

module.exports.prototype.prototype = SimpleAdder

module.exports.prototype.prototype.prototype = SimpleAdder

module.exports.prototype.prototype.prototype.prototype = SimpleAdder

module.exports.prototype.prototype.prototype.prototype = SimpleAdder

Of course, your results will be slightly different. Here’s what I got:

class SimpleAdder {
    constructor(a, b) {
        this.a = a;
        this.b = b;


Much better! The model is no longer generating Python code, and it’s not trying to jam Python-flavored syntax into other languages. It’s still not perfect, but it’s a lot better than before! (And of course, remember that this is just a small model, and we didn’t train it for very long. You can try training it for longer, or using a larger model, to get better results.)

best_out_of_domain_example = "" #@param {type:"string"}

GPT Today and Tomorrow

Limitation of the current models

Video 5: Conclusion
