{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {},
"id": "view-in-github"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"# Tutorial 1: Introduction to processing time series\n",
"\n",
"**Week 3, Day 1: Time Series And Natural Language Processing**\n",
"\n",
"**By Neuromatch Academy**\n",
"\n",
"__Content creators:__ Lyle Ungar, Kelson Shilling-Scrivo, Alish Dipani\n",
"\n",
"__Content reviewers:__ Kelson Shilling-Scrivo\n",
"\n",
"__Content editors:__ Gagana B, Spiros Chavlis, Kelson Shilling-Scrivo\n",
"\n",
"__Production editors:__ Gagana B, Spiros Chavlis\n",
"\n",
"
\n",
"\n",
"_Based on Content from: Anushree Hede, Pooja Consul, Ann-Katrin Reuel_"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"----\n",
"# Tutorial objectives\n",
"\n",
"Before we explore how Recurrent Neural Networks (RNNs) excel at modeling sequences, we will explore other ways to model sequences, encode the text, and make meaningful measurements using such encodings and embeddings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @markdown\n",
"from IPython.display import IFrame\n",
"from ipywidgets import widgets\n",
"out = widgets.Output()\n",
"with out:\n",
" print(f\"If you want to download the slides: https://osf.io/download/n263c/\")\n",
" display(IFrame(src=f\"https://mfr.ca-1.osf.io/render?url=https://osf.io/n263c/?direct%26mode=render%26action=download%26mode=render\", width=730, height=410))\n",
"display(out)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install dependencies\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" There may be *errors* and/or *warnings* reported during the installation. However, they are to be ignored.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Install dependencies\n",
"\n",
"# @markdown There may be *errors* and/or *warnings* reported during the installation. However, they are to be ignored.\n",
"!pip install --upgrade gensim --quiet\n",
"!pip install nltk --quiet\n",
"!pip install python-Levenshtein --quiet\n",
"!pip install portalocker>=2.0.0 --quiet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install and import feedback gadget\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Install and import feedback gadget\n",
"\n",
"!pip3 install vibecheck datatops --quiet\n",
"\n",
"from vibecheck import DatatopsContentReviewContainer\n",
"def content_review(notebook_section: str):\n",
" return DatatopsContentReviewContainer(\n",
" \"\", # No text prompt\n",
" notebook_section,\n",
" {\n",
" \"url\": \"https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab\",\n",
" \"name\": \"neuromatch_dl\",\n",
" \"user_key\": \"f379rz8y\",\n",
" },\n",
" ).render()\n",
"\n",
"\n",
"feedback_prefix = \"W3D1_T1\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install fastText\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" If you want to see the original code, go to [GitHub repo](https://github.com/facebookresearch/fastText.git).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Install fastText\n",
"# @markdown If you want to see the original code, go to [GitHub repo](https://github.com/facebookresearch/fastText.git).\n",
"\n",
"# !pip install git+https://github.com/facebookresearch/fastText.git --quiet\n",
"\n",
"import os, zipfile, requests\n",
"\n",
"url = \"https://osf.io/vkuz7/download\"\n",
"fname = \"fastText-main.zip\"\n",
"\n",
"print('Downloading Started...')\n",
"# Downloading the file by sending the request to the URL\n",
"r = requests.get(url, stream=True)\n",
"\n",
"# Writing the file to the local file system\n",
"with open(fname, 'wb') as f:\n",
" f.write(r.content)\n",
"print('Downloading Completed.')\n",
"\n",
"# opening the zip file in READ mode\n",
"with zipfile.ZipFile(fname, 'r') as zipObj:\n",
" # extracting all the files\n",
" print('Extracting all the files now...')\n",
" zipObj.extractall()\n",
" print('Done!')\n",
" os.remove(fname)\n",
"\n",
"# Install the package\n",
"!pip install fastText-main/ --quiet"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# Imports\n",
"import time\n",
"import nltk\n",
"import fasttext\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from collections import Counter\n",
"\n",
"from nltk.corpus import brown\n",
"from nltk.tokenize import word_tokenize\n",
"from gensim.models import Word2Vec\n",
"\n",
"import torch.nn as nn\n",
"from torch.nn import functional as F\n",
"from torch.utils.data import DataLoader\n",
"from torch.nn.utils.rnn import pad_sequence\n",
"\n",
"from torchtext import datasets\n",
"from torchtext.vocab import FastText\n",
"from torchtext.vocab import vocab as Vocab\n",
"from torchtext.datasets import IMDB, AG_NEWS\n",
"from torchtext.data.utils import get_tokenizer\n",
"from torch.utils.data.dataset import random_split\n",
"from torchtext.vocab import build_vocab_from_iterator\n",
"from torchtext.data.functional import to_map_style_dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Figure Settings\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Figure Settings\n",
"import logging\n",
"logging.getLogger('matplotlib.font_manager').disabled = True\n",
"\n",
"import ipywidgets as widgets\n",
"%matplotlib inline\n",
"%config InlineBackend.figure_format = 'retina'\n",
"plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Dataset from `nltk`\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Load Dataset from `nltk`\n",
"# No critical warnings, so we suppress it\n",
"import warnings\n",
"warnings.simplefilter(\"ignore\")\n",
"\n",
"nltk.download('punkt')\n",
"nltk.download('brown')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Helper functions\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Helper functions\n",
"import requests\n",
"\n",
"\n",
"def download_file_from_google_drive(id, destination):\n",
" URL = \"https://docs.google.com/uc?export=download\"\n",
" session = requests.Session()\n",
" response = session.get(URL, params={'id': id}, stream=True)\n",
" token = get_confirm_token(response)\n",
"\n",
" if token:\n",
" params = {'id': id, 'confirm': token}\n",
" response = session.get(URL, params=params, stream=True)\n",
"\n",
" save_response_content(response, destination)\n",
"\n",
"\n",
"def get_confirm_token(response):\n",
" for key, value in response.cookies.items():\n",
" if key.startswith('download_warning'):\n",
" return value\n",
"\n",
" return None\n",
"\n",
"\n",
"def save_response_content(response, destination):\n",
" CHUNK_SIZE = 32768\n",
" with open(destination, \"wb\") as f:\n",
" for chunk in response.iter_content(CHUNK_SIZE):\n",
" if chunk: # filter out keep-alive new chunks\n",
" f.write(chunk)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set random seed\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Executing `set_seed(seed=seed)` you are setting the seed\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Set random seed\n",
"\n",
"# @markdown Executing `set_seed(seed=seed)` you are setting the seed\n",
"\n",
"# For DL its critical to set the random seed so that students can have a\n",
"# baseline to compare their results to expected results.\n",
"# Read more here: https://pytorch.org/docs/stable/notes/randomness.html\n",
"\n",
"# Call `set_seed` function in the exercises to ensure reproducibility.\n",
"import random\n",
"import torch\n",
"\n",
"def set_seed(seed=None, seed_torch=True):\n",
" \"\"\"\n",
" Function that controls randomness.\n",
" NumPy and random modules must be imported.\n",
"\n",
" Args:\n",
" seed : Integer\n",
" A non-negative integer that defines the random state. Default is `None`.\n",
" seed_torch : Boolean\n",
" If `True` sets the random seed for pytorch tensors, so pytorch module\n",
" must be imported. Default is `True`.\n",
"\n",
" Returns:\n",
" Nothing.\n",
" \"\"\"\n",
" if seed is None:\n",
" seed = np.random.choice(2 ** 32)\n",
" random.seed(seed)\n",
" np.random.seed(seed)\n",
" if seed_torch:\n",
" torch.manual_seed(seed)\n",
" torch.cuda.manual_seed_all(seed)\n",
" torch.cuda.manual_seed(seed)\n",
" torch.backends.cudnn.benchmark = False\n",
" torch.backends.cudnn.deterministic = True\n",
"\n",
" print(f'Random seed {seed} has been set.')\n",
"\n",
"# In case that `DataLoader` is used\n",
"def seed_worker(worker_id):\n",
" \"\"\"\n",
" DataLoader will reseed workers following randomness in\n",
" multi-process data loading algorithm.\n",
"\n",
" Args:\n",
" worker_id: integer\n",
" ID of subprocess to seed. 0 means that\n",
" the data will be loaded in the main process\n",
" Refer: https://pytorch.org/docs/stable/data.html#data-loading-randomness for more details\n",
"\n",
" Returns:\n",
" Nothing\n",
" \"\"\"\n",
" worker_seed = torch.initial_seed() % 2**32\n",
" np.random.seed(worker_seed)\n",
" random.seed(worker_seed)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set device (GPU or CPU). Execute `set_device()`\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Set device (GPU or CPU). Execute `set_device()`\n",
"\n",
"# Inform the user if the notebook uses GPU or CPU.\n",
"\n",
"def set_device():\n",
" \"\"\"\n",
" Set the device. CUDA if available, CPU otherwise\n",
"\n",
" Args:\n",
" None\n",
"\n",
" Returns:\n",
" Nothing\n",
" \"\"\"\n",
" device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
" if device != \"cuda\":\n",
" print(\"WARNING: For this notebook to perform best, \"\n",
" \"if possible, in the menu under `Runtime` -> \"\n",
" \"`Change runtime type.` select `GPU` \")\n",
" else:\n",
" print(\"GPU is enabled in this notebook.\")\n",
"\n",
" return device"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"DEVICE = set_device()\n",
"SEED = 2021\n",
"set_seed(seed=SEED)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 1: Intro: What time series are there?\n",
"\n",
"*Time estimate: 20 mins*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video 1: Time Series and NLP\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 1: Time Series and NLP\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', 'W4RTRXt7pO0'), ('Bilibili', 'BV1E94y117Nf')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Time_Series_and_NLP_Video\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video 2: What is NLP?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 2: What is NLP?\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', 'Q-PGZyaBQVk'), ('Bilibili', 'BV18v4y1M7GF')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_What_is_NLP_Video\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 2: Embeddings\n",
"\n",
"*Time estimate: 50 mins*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video 3: NLP Tokenization\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 3: NLP Tokenization\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', 'GLreyXm4rg8'), ('Bilibili', 'BV1ov4y1M7bQ')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_NLP_tokenization_Video\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Section 2.1: Introduction\n",
"\n",
"[Word2vec](https://rare-technologies.com/word2vec-tutorial/) is a group of related models that produce word embeddings. These models are shallow, two-layer neural networks trained to reconstruct linguistic contexts of words. Word2vec takes a large corpus of text as input and produces a vector space, with each unique word in the corpus being assigned a corresponding vector in the space."
]
},
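{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To make the idea of reconstructing linguistic contexts concrete, here is a minimal sketch of how the skip-gram variant of word2vec (one of its two architectures, alongside CBOW) turns a sentence into (center word, context word) training pairs. The helper `skipgram_pairs`, the toy sentence, and the window size are illustrative assumptions; gensim builds such pairs internally and its implementation differs in detail.\n",
"\n",
"```python\n",
"def skipgram_pairs(tokens, window=2):\n",
"  # Hypothetical helper: pair each center word with every word\n",
"  # that lies at most `window` positions to its left or right\n",
"  pairs = []\n",
"  for i, center in enumerate(tokens):\n",
"    lo = max(0, i - window)\n",
"    hi = min(len(tokens), i + window + 1)\n",
"    for j in range(lo, hi):\n",
"      if j != i:\n",
"        pairs.append((center, tokens[j]))\n",
"  return pairs\n",
"\n",
"\n",
"sentence = ['the', 'quick', 'brown', 'fox', 'jumps']\n",
"pairs = skipgram_pairs(sentence, window=1)\n",
"print(pairs)\n",
"```\n",
"\n",
"A skip-gram model is trained to predict the second element of each pair from the first, so words that share many context words end up with similar vectors."
]
},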
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Creating Word Embeddings\n",
"\n",
"We will create embeddings for a subset of categories in [Brown corpus](https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html). To achieve this task we will use [gensim](https://radimrehurek.com/gensim/) library to create word2vec embeddings. Gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words.\n",
"\n",
"Calling Word2Vec(sentences, `iter=1`) will run two passes over the sentences iterator (generally, `iter+1` passes). The first pass collects words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model.\n",
"Word2vec accepts several parameters that affect both training speed and quality.\n",
"\n",
"One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there are not enough data to make any meaningful training on those words, so it’s best to ignore them:\n",
"\n",
"```python\n",
"model = Word2Vec(sentences, min_count=10) # default value is 5\n",
"```\n",
"\n",
"A reasonable value for `min_count` is bewteen 0-100, depending on the size of your dataset.\n",
"\n",
"Another parameter is the `size` of the NN layers, which correspond to the “degrees” of freedom the training algorithm has:\n",
"\n",
"```python\n",
"model = Word2Vec(sentences, size=200) # default value is 100\n",
"```\n",
"\n",
"Bigger `size` values require more training data but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.\n",
"\n",
"The last of the major parameters (full list [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)) is for training parallelization, to speed up training:\n",
"\n",
"```python\n",
"model = Word2Vec(sentences, workers=4) # default = 1 worker = no parallelization\n",
"```"
]
},
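{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"The effect of `min_count` can be sketched without gensim: count every word and keep only those seen at least `min_count` times. The helper `prune_vocab` and the toy sentences below are hypothetical and only mimic the pruning step; they are not gensim's actual code.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"\n",
"def prune_vocab(sentences, min_count=2):\n",
"  # Count every word across all sentences\n",
"  counts = Counter(word for sent in sentences for word in sent)\n",
"  # Keep only the words that occur at least `min_count` times\n",
"  return {word: c for word, c in counts.items() if c >= min_count}\n",
"\n",
"\n",
"toy_sentences = [['the', 'cat', 'sat'],\n",
"                 ['the', 'dog', 'sat'],\n",
"                 ['the', 'catt', 'ran']]\n",
"vocab = prune_vocab(toy_sentences, min_count=2)\n",
"print(vocab)\n",
"```\n",
"\n",
"In this toy corpus the misspelling `catt` is dropped, but so are the legitimate words `cat`, `dog`, and `ran`, which is why the right threshold depends on the size of the dataset."
]
},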
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# Categories used for the Brown corpus\n",
"category = ['editorial', 'fiction', 'government', 'mystery', 'news', 'religion',\n",
" 'reviews', 'romance', 'science_fiction']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Word2Vec model\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown Word2Vec model\n",
"\n",
"def create_word2vec_model(category='news', size=50, sg=1, min_count=5):\n",
" sentences = brown.sents(categories=category)\n",
" model = Word2Vec(sentences, vector_size=size,\n",
" sg=sg, min_count=min_count)\n",
" return model\n",
"\n",
"\n",
"def model_dictionary(model):\n",
" print(w2vmodel.wv)\n",
" words = list(w2vmodel.wv)\n",
" return words\n",
"\n",
"\n",
"def get_embedding(word, model):\n",
" if word in w2vmodel.wv:\n",
" return model.wv[word]\n",
" else:\n",
" return None"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"The cell will take 30-45 seconds to run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# Create a word2vec model based on categories from Brown corpus\n",
"w2vmodel = create_word2vec_model(category)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"You can get the embedding vector for a word in the dictionary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# get word list from Brown corpus\n",
"brown_wordlist = list(brown.words(categories=category))\n",
"# generate a random word\n",
"random_word = random.sample(brown_wordlist, 1)[0]\n",
"# get embedding of the random word\n",
"random_word_embedding = get_embedding(random_word, w2vmodel)\n",
"print(f'Embedding of \"{random_word}\" is {random_word_embedding}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Visualizing Word Embeddings\n",
"\n",
"We can now obtain the word embeddings for any word in the dictionary using word2vec. Let's visualize these embeddings to get an intuition of what these embeddings mean. The word embeddings obtained from the word2vec model are in high dimensional space, and we will use tSNE to pick the two features that capture the most variance in the embeddings to represent them in a 2D space.\n",
"\n",
"For each word in `keys`, we pick the top 10 similar words (using cosine similarity) and plot them.\n",
"\n",
"Before you run the code, spend some time to think:\n",
"\n",
"- What should be the arrangement of similar words?\n",
"- What should be the arrangement of the critical clusters with respect to each other?"
]
},
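{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"As a reminder of what similarity means here, the sketch below computes cosine similarity, the cosine of the angle between two vectors, and uses it to rank neighbors. The three-dimensional `toy_vectors` are made-up values chosen for illustration, and this `most_similar` is a simplified stand-in for the gensim method of the same name.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def cosine_similarity(u, v):\n",
"  # Dot product divided by the product of the vector lengths\n",
"  dot = sum(a * b for a, b in zip(u, v))\n",
"  norm_u = math.sqrt(sum(a * a for a in u))\n",
"  norm_v = math.sqrt(sum(b * b for b in v))\n",
"  return dot / (norm_u * norm_v)\n",
"\n",
"\n",
"def most_similar(word, vectors, topn=1):\n",
"  # Score every other word against `word` and return the best `topn`\n",
"  scores = [(other, cosine_similarity(vectors[word], vec))\n",
"            for other, vec in vectors.items() if other != word]\n",
"  return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]\n",
"\n",
"\n",
"# Made-up embeddings, for illustration only\n",
"toy_vectors = {\n",
"  'king': [0.9, 0.8, 0.1],\n",
"  'queen': [0.85, 0.9, 0.15],\n",
"  'apple': [0.1, 0.2, 0.95],\n",
"}\n",
"nearest = most_similar('king', toy_vectors, topn=1)\n",
"print(nearest)\n",
"```\n",
"\n",
"Cosine similarity ignores vector length and depends only on direction, so it ranges from -1 (opposite directions) to 1 (same direction)."
]
},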
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"keys = ['voters', 'magic', 'love', 'God', 'evidence', 'administration', 'governments']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown ### Cluster embeddings related functions\n",
"\n",
"# @markdown **Note:** We import [sklearn.manifold.TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)\n",
"from sklearn.manifold import TSNE\n",
"import matplotlib.cm as cm\n",
"\n",
"def get_cluster_embeddings(keys):\n",
" embedding_clusters = []\n",
" word_clusters = []\n",
"\n",
" # find closest words and add them to cluster\n",
" for word in keys:\n",
" embeddings = []\n",
" words = []\n",
" if not word in w2vmodel.wv:\n",
" print(f'The word {word} is not in the dictionary')\n",
" continue\n",
"\n",
" for similar_word, _ in w2vmodel.wv.most_similar(word, topn=10):\n",
" words.append(similar_word)\n",
" embeddings.append(w2vmodel.wv[similar_word])\n",
" embeddings.append(get_embedding(word, w2vmodel))\n",
" words.append(word)\n",
" embedding_clusters.append(embeddings)\n",
" word_clusters.append(words)\n",
"\n",
" # get embeddings for the words in clusers\n",
" embedding_clusters = np.array(embedding_clusters)\n",
" n, m, k = embedding_clusters.shape\n",
" tsne_model_en_2d = TSNE(perplexity=10, n_components=2, init='pca', n_iter=3500, random_state=32)\n",
" embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)\n",
" return embeddings_en_2d, word_clusters\n",
"\n",
"\n",
"def tsne_plot_similar_words(title, labels, embedding_clusters,\n",
" word_clusters, opacity, filename=None):\n",
" plt.figure(figsize=(16, 9))\n",
" colors = cm.rainbow(np.linspace(0, 1, len(labels)))\n",
" for label, embeddings, words, color in zip(labels, embedding_clusters, word_clusters, colors):\n",
" x = embeddings[:, 0]\n",
" y = embeddings[:, 1]\n",
" plt.scatter(x, y, color=color, alpha=opacity, label=label)\n",
" # Plot the cluster centroids\n",
" plt.plot(np.mean(x), np.mean(y), 'x', color=color, markersize=16)\n",
" for i, word in enumerate(words):\n",
" size = 10 if i < 10 else 14\n",
" plt.annotate(word, alpha=0.5, xy=(x[i], y[i]), xytext=(5, 2),\n",
" textcoords='offset points',\n",
" ha='right', va='bottom', size=size)\n",
" plt.legend()\n",
" plt.title(title)\n",
" plt.grid(True)\n",
" if filename:\n",
" plt.savefig(filename, format='png', dpi=150, bbox_inches='tight')\n",
" plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# Get closest words to the keys and get clusters of these words\n",
"embeddings_en_2d, word_clusters = get_cluster_embeddings(keys)\n",
"# tSNE plot of similar words to keys\n",
"tsne_plot_similar_words(title='Similar words from Brown Corpus',\n",
" labels=keys,\n",
" embedding_clusters=embeddings_en_2d,\n",
" word_clusters=word_clusters,\n",
" opacity=0.7,\n",
" filename='similar_words.png')"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Think! 2.1: Similarity\n",
"\n",
"1. What does having higher similarity between two word embeddings mean?\n",
"2. Why are cluster centroids (represented with X in the plot) close to some keys (represented with larger fonts) but farther from others?"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/solutions/W3D1_Tutorial1_Solution_61e9bed5.py)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Similarity_Discussion\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Section 2.2: Embedding exploration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Video 4: Embeddings rule!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 4: Embeddings rule!\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', '7ijjjFpcOwI'), ('Bilibili', 'BV1KN4y1G7sL')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Embeddings_rule_Video\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Video 5: Distributional Similarity and Vector Embeddings\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 5: Distributional Similarity and Vector Embeddings\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
"  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
"    self.id = id\n",
"    if source == 'Bilibili':\n",
"      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
"    elif source == 'Osf':\n",
"      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
"    super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
"  tab_contents = []\n",
"  for i, video_id in enumerate(video_ids):\n",
"    out = widgets.Output()\n",
"    with out:\n",
"      if video_ids[i][0] == 'Youtube':\n",
"        video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
"                             height=H, fs=fs, rel=0)\n",
"        print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
"      else:\n",
"        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
"                          height=H, fs=fs, autoplay=False)\n",
"        if video_ids[i][0] == 'Bilibili':\n",
"          print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
"        elif video_ids[i][0] == 'Osf':\n",
"          print(f'Video available at https://osf.io/{video.id}')\n",
"      display(video)\n",
"    tab_contents.append(out)\n",
"  return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', '0vTuEIAnrII'), ('Bilibili', 'BV1sa411W7ks')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Distributional_Similarity_and_Vector_Embeddings_Video\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Words or subword units such as morphemes are the basic units we use to express meaning in language. The technique of mapping words to vectors of real numbers is known as word embedding.\n",
"\n",
"In this section, we will use pretrained `fastText` embeddings, a context-oblivious embedding similar to `word2vec`."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Embedding Manipulation\n",
"\n",
"Let's use the [FastText](https://fasttext.cc/) library to manipulate the embeddings. First, find the embedding for the word \"king\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown ### Download FastText English Embeddings of dimension 100\n",
"# @markdown This will take 1-2 minutes to run\n",
"\n",
"import os, zipfile, requests\n",
"\n",
"url = \"https://osf.io/2frqg/download\"\n",
"fname = \"cc.en.100.bin.gz\"\n",
"\n",
"print('Downloading Started...')\n",
"# Downloading the file by sending the request to the URL\n",
"r = requests.get(url, stream=True)\n",
"\n",
"# Writing the file to the local file system\n",
"with open(fname, 'wb') as f:\n",
" f.write(r.content)\n",
"print('Downloading Completed.')\n",
"\n",
"# opening the zip file in READ mode\n",
"with zipfile.ZipFile(fname, 'r') as zipObj:\n",
" # extracting all the files\n",
" print('Extracting all the files now...')\n",
" zipObj.extractall()\n",
" print('Done!')\n",
" os.remove(fname)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# Load 100 dimension FastText Vectors using FastText library\n",
"ft_en_vectors = fasttext.load_model('cc.en.100.bin')\n",
"print(f\"Length of the embedding is: {len(ft_en_vectors.get_word_vector('king'))}\")\n",
"print(f\"\\nEmbedding for the word King is:\\n {ft_en_vectors.get_word_vector('king')}\")"
]
},
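{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Cosine similarity measures how similar two word vectors are. It is a scalar between -1 and 1: values near 1 mean the vectors point in nearly the same direction, values near 0 mean they are roughly orthogonal, and negative values mean they point in opposing directions. A higher value corresponds to higher similarity.\n",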
"\n",
"Now find the 10 most similar words to \"king\"."
]
},
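{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"As a small worked example of the formula behind cosine similarity, $\\cos(\\theta) = \\frac{a \\cdot b}{\\|a\\| \\, \\|b\\|}$, here is a minimal sketch in plain Python. The toy vectors are made up for illustration and are not fastText embeddings, and `cosine` is a hypothetical helper name.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def cosine(a, b):\n",
"  \"\"\"Cosine similarity of two equal-length vectors (hypothetical helper).\"\"\"\n",
"  dot = sum(x * y for x, y in zip(a, b))\n",
"  norm_a = math.sqrt(sum(x * x for x in a))\n",
"  norm_b = math.sqrt(sum(y * y for y in b))\n",
"  return dot / (norm_a * norm_b)\n",
"\n",
"\n",
"# Toy 2-D vectors, chosen only to show the range of values\n",
"v = [1.0, 2.0]\n",
"same_direction = [2.0, 4.0]  # v scaled by 2\n",
"orthogonal = [-2.0, 1.0]     # perpendicular to v\n",
"opposite = [-1.0, -2.0]      # v flipped\n",
"\n",
"print(cosine(v, same_direction))  # close to 1\n",
"print(cosine(v, orthogonal))      # 0\n",
"print(cosine(v, opposite))        # close to -1\n",
"```\n",
"\n",
"Because the result depends only on the angle between the vectors, rescaling a vector does not change its similarity to any other vector."
]
},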
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"ft_en_vectors.get_nearest_neighbors(\"king\", 10) # Most similar by key"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Word Similarity\n",
"\n",
"Let's look more closely at similarity between words by checking how similar different pairs of words are."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"def cosine_similarity(vec_a, vec_b):\n",
" \"\"\"Compute cosine similarity between vec_a and vec_b\"\"\"\n",
" return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))\n",
"\n",
"\n",
"def getSimilarity(word1, word2):\n",
" v1 = ft_en_vectors.get_word_vector(word1)\n",
" v2 = ft_en_vectors.get_word_vector(word2)\n",
" return cosine_similarity(v1, v2)\n",
"\n",
"\n",
"print(f\"Similarity between the words King and Queen: {getSimilarity('king', 'queen')}\")\n",
"print(f\"Similarity between the words King and Knight: {getSimilarity('king', 'knight')}\")\n",
"print(f\"Similarity between the words King and Rock: {getSimilarity('king', 'rock')}\")\n",
"print(f\"Similarity between the words King and Twenty: {getSimilarity('king', 'twenty')}\")\n",
"\n",
"print(f\"\\nSimilarity between the words Dog and Cat: {getSimilarity('dog', 'cat')}\")\n",
"print(f\"Similarity between the words Ascending and Descending: {getSimilarity('ascending', 'descending')}\")\n",
"print(f\"Similarity between the words Victory and Defeat: {getSimilarity('victory', 'defeat')}\")\n",
"print(f\"Similarity between the words Less and More: {getSimilarity('less', 'more')}\")\n",
"print(f\"Similarity between the words True and False: {getSimilarity('true', 'false')}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Interactive Demo 2.2.1: Check similarity between words"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Type two words and run the cell!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown Type two words and run the cell!\n",
"word1 = 'King' # @param \\ {type:\"string\"}\n",
"word2 = 'Frog' # @param \\ {type:\"string\"}\n",
"word_similarity = getSimilarity(word1, word2)\n",
"print(f'Similarity between {word1} and {word2}: {word_similarity}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Using embeddings, we can find words that appear in similar contexts. But what happens if a word has several different meanings?"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Homonym Similarity\n",
"\n",
"Homonyms are words that have the same spelling or pronunciation but different meanings depending on the context. Let's explore how these words are embedded and their similarity in different contexts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"####################### Words with multiple meanings ##########################\n",
"print(f\"Similarity between the words Cricket and Insect: {getSimilarity('cricket', 'insect')}\")\n",
"print(f\"Similarity between the words Cricket and Sport: {getSimilarity('cricket', 'sport')}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Check_similarity_between_words_Interactive_Demo\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Interactive Demo 2.2.2: Explore homonyms"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown Type the words and run the cell!\n",
"# @markdown examples - minute (time/small), pie (graph/food)\n",
"\n",
"word = 'minute' # @param \\ {type:\"string\"}\n",
"context_word_1 = 'time' # @param \\ {type:\"string\"}\n",
"context_word_2 = 'small' # @param \\ {type:\"string\"}\n",
"word_similarity_1 = getSimilarity(word, context_word_1)\n",
"word_similarity_2 = getSimilarity(word, context_word_2)\n",
"print(f'Similarity between {word} and {context_word_1}: {word_similarity_1}')\n",
"print(f'Similarity between {word} and {context_word_2}: {word_similarity_2}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Word Analogies"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Embeddings can be used to find word analogies.\n",
"Let's try it:\n",
"1. Man : Woman :: King : _____\n",
"2. Germany: Berlin :: France : _____\n",
"3. Leaf : Tree :: Petal : _____\n",
"\n",
"\n",
"\n"
]
},
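{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"One common way to answer such analogies is vector arithmetic followed by a nearest-neighbour search: compute `woman - man + king` and look for the closest remaining word. The sketch below assumes that approach on a made-up 2-D vocabulary (the `toy` vectors and the `analogy` helper are hypothetical). It illustrates the idea only and is not fastText's own implementation.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"# Made-up 2-D \"embeddings\": axis 0 ~ royalty, axis 1 ~ gender (an assumption)\n",
"toy = {\n",
"    \"king\": [1.0, 1.0],\n",
"    \"queen\": [1.0, -1.0],\n",
"    \"man\": [0.0, 1.0],\n",
"    \"woman\": [0.0, -1.0],\n",
"    \"apple\": [-1.0, 0.0],\n",
"}\n",
"\n",
"\n",
"def cosine(a, b):\n",
"  \"\"\"Cosine similarity of two 2-D vectors.\"\"\"\n",
"  dot = sum(x * y for x, y in zip(a, b))\n",
"  return dot / (math.hypot(*a) * math.hypot(*b))\n",
"\n",
"\n",
"def analogy(positive_1, negative, positive_2, vectors):\n",
"  \"\"\"Return the word closest to positive_1 - negative + positive_2.\"\"\"\n",
"  query = [p1 - n + p2 for p1, n, p2 in zip(vectors[positive_1],\n",
"                                             vectors[negative],\n",
"                                             vectors[positive_2])]\n",
"  # Exclude the three query words, otherwise one of them often wins\n",
"  candidates = [w for w in vectors if w not in (positive_1, negative, positive_2)]\n",
"  return max(candidates, key=lambda w: cosine(query, vectors[w]))\n",
"\n",
"\n",
"# Man : Woman :: King : _____\n",
"print(analogy(\"woman\", \"man\", \"king\", toy))  # queen\n",
"```\n",
"\n",
"The argument order mirrors the positive, negative, positive convention used with `get_analogies()` in the next cell."
]
},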
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"## Use get_analogies() funnction.\n",
"# The words have to be in the order Positive, negative, Positve\n",
"\n",
"# Man : Woman :: King : _____\n",
"# Positive=(woman, king), Negative=(man)\n",
"print(ft_en_vectors.get_analogies(\"woman\", \"man\", \"king\", 1))\n",
"\n",
"# Germany: Berlin :: France : ______\n",
"# Positive=(berlin, frannce), Negative=(germany)\n",
"print(ft_en_vectors.get_analogies(\"berlin\", \"germany\", \"france\", 1))\n",
"\n",
"# Leaf : Tree :: Petal : _____\n",
"# Positive=(tree, petal), Negative=(leaf)\n",
"print(ft_en_vectors.get_analogies(\"tree\", \"leaf\", \"petal\", 1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"But, does it always work?\n",
"\n",
"\n",
"1. Poverty : Wealth :: Sickness : _____\n",
"2. train : board :: horse : _____"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# Poverty : Wealth :: Sickness : _____\n",
"print(ft_en_vectors.get_analogies(\"wealth\", \"poverty\", \"sickness\", 1))\n",
"\n",
"# train : board :: horse : _____\n",
"print(ft_en_vectors.get_analogies(\"board\", \"train\", \"horse\", 1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Explore_homonyms_Interactive_Demo\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Section 2.3: Neural Net with word embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Video 6: Using Embeddings\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 6: Using Embeddings\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
"  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
"    self.id = id\n",
"    if source == 'Bilibili':\n",
"      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
"    elif source == 'Osf':\n",
"      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
"    super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
"  tab_contents = []\n",
"  for i, video_id in enumerate(video_ids):\n",
"    out = widgets.Output()\n",
"    with out:\n",
"      if video_ids[i][0] == 'Youtube':\n",
"        video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
"                             height=H, fs=fs, rel=0)\n",
"        print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
"      else:\n",
"        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
"                          height=H, fs=fs, autoplay=False)\n",
"        if video_ids[i][0] == 'Bilibili':\n",
"          print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
"        elif video_ids[i][0] == 'Osf':\n",
"          print(f'Video available at https://osf.io/{video.id}')\n",
"      display(video)\n",
"    tab_contents.append(out)\n",
"  return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', '9ujUgNoPeF0'), ('Bilibili', 'BV1cU4y1Q7Fh')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Using_Embeddings_Video\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Training context-oblivious word embeddings is relatively cheap, but most people still use pre-trained word embeddings. After we cover context-sensitive word embeddings, we'll see how to \"fine tune\" embeddings (adjust them to the task at hand).\n",
"\n",
"Let's use the pretrained FastText embeddings to train a neural network on the AG_NEWS dataset.\n",
"\n",
"The data consists of news articles, each labeled with one of four topics (World, Sports, Business, Sci/Tec), so this is a four-class classification task."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Coding Exercise 1: Simple feed forward net\n",
"\n",
"Define a vanilla neural network with linear layers. The network averages the word embeddings of a text (using `nn.EmbeddingBag`) to get a single embedding for the entire text, then passes it through one hidden layer of size 128."
]
},
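{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To make the averaging step concrete: `nn.EmbeddingBag` (in its default `mean` mode) receives the token indices of a whole batch concatenated into one flat sequence, plus `offsets` marking where each text starts, and returns one averaged vector per text. The plain-Python sketch below mimics that behaviour on a made-up 2-D embedding table. `table` and `mean_bags` are hypothetical names for illustration, not part of PyTorch.\n",
"\n",
"```python\n",
"# Made-up embedding table: row i is the vector of token index i\n",
"table = [\n",
"    [0.0, 0.0],\n",
"    [1.0, 3.0],\n",
"    [3.0, 5.0],\n",
"    [4.0, 4.0],\n",
"]\n",
"\n",
"\n",
"def mean_bags(token_ids, offsets, table):\n",
"  \"\"\"Average the embeddings of each bag; offsets[i] is the start of bag i.\"\"\"\n",
"  dim = len(table[0])\n",
"  bags = []\n",
"  for i, start in enumerate(offsets):\n",
"    # A bag ends where the next one starts (or at the end of the sequence)\n",
"    end = offsets[i + 1] if i + 1 < len(offsets) else len(token_ids)\n",
"    rows = [table[t] for t in token_ids[start:end]]\n",
"    bags.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])\n",
"  return bags\n",
"\n",
"\n",
"# Two texts, [1, 2] and [3, 0], concatenated into one flat list\n",
"token_ids = [1, 2, 3, 0]\n",
"offsets = [0, 2]\n",
"\n",
"print(mean_bags(token_ids, offsets, table))  # [[2.0, 4.0], [2.0, 2.0]]\n",
"```\n",
"\n",
"This flat layout avoids padding texts of different lengths, which is why the `forward` method in the exercise takes both `inputs` and `offsets`."
]
},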
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"class NeuralNet(nn.Module):\n",
"  \"\"\" A vanilla neural network. \"\"\"\n",
"  def __init__(self, output_size, hidden_size, vocab_size,\n",
"               embedding_length, word_embeddings):\n",
"    \"\"\"\n",
"    Constructs a vanilla Neural Network Instance.\n",
"\n",
"    Args:\n",
"      output_size: Integer\n",
"        Specifies the size of output vector\n",
"      hidden_size: Integer\n",
"        Specifies the size of hidden layer\n",
"      vocab_size: Integer\n",
"        Specifies the size of the vocabulary\n",
"        i.e. the number of tokens in the vocabulary\n",
"      embedding_length: Integer\n",
"        Specifies the size of the embedding vector\n",
"      word_embeddings\n",
"        Pretrained embeddings whose `.vectors` tensor provides\n",
"        the weights used to embed the vocabulary.\n",
"\n",
"    Returns:\n",
"      Nothing\n",
"    \"\"\"\n",
"    super(NeuralNet, self).__init__()\n",
"\n",
"    self.output_size = output_size\n",
"    self.hidden_size = hidden_size\n",
"    self.vocab_size = vocab_size\n",
"    self.embedding_length = embedding_length\n",
"\n",
"    # self.word_embeddings = nn.EmbeddingBag(vocab_size, embedding_length, sparse=False)\n",
"    # Use the embeddings passed to the constructor and keep them frozen\n",
"    self.word_embeddings = nn.EmbeddingBag.from_pretrained(word_embeddings.vectors)\n",
"    self.word_embeddings.weight.requires_grad = False\n",
"    self.fc1 = nn.Linear(embedding_length, hidden_size)\n",
"    self.fc2 = nn.Linear(hidden_size, output_size)\n",
"    self.init_weights()\n",
"\n",
"  def init_weights(self):\n",
"    initrange = 0.5\n",
"    # self.word_embeddings.weight.data.uniform_(-initrange, initrange)\n",
"    self.fc1.weight.data.uniform_(-initrange, initrange)\n",
"    self.fc1.bias.data.zero_()\n",
"    self.fc2.weight.data.uniform_(-initrange, initrange)\n",
"    self.fc2.bias.data.zero_()\n",
"\n",
"  def forward(self, inputs, offsets):\n",
"    \"\"\"\n",
"    Compute the final labels by taking tokens as input.\n",
"\n",
"    Args:\n",
"      inputs: Tensor\n",
"        Flat tensor of token indices for all texts in the batch\n",
"      offsets: Tensor\n",
"        Starting position of each text within `inputs`\n",
"\n",
"    Returns:\n",
"      out: Tensor\n",
"        Final prediction Tensor\n",
"    \"\"\"\n",
"    embedded = self.word_embeddings(inputs, offsets)  # convert text to embeddings\n",
"    #################################################\n",
"    # Implement a vanilla neural network\n",
"    raise NotImplementedError(\"Neural Net `forward`\")\n",
"    #################################################\n",
"    # Pass the embeddings through the neural net\n",
"    # Use ReLU as the non-linearity\n",
"    x = ...\n",
"    x = ...\n",
"    x = ...\n",
"    output = F.log_softmax(x, dim=1)\n",
"    return output"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/solutions/W3D1_Tutorial1_Solution_ea214171.py)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Simple_feed_forward_net_Exercise\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown ### Helper functions\n",
"\n",
"# @markdown - `train(dataloader)`\n",
"\n",
"# @markdown - `evaluate(dataloader)`\n",
"\n",
"# @markdown - `load_dataset(train_iter, device='cpu', seed=0, batch_size=32, valid_split=0.7)`\n",
"\n",
"# @markdown - `plot_train_val(x, train, val, train_label, val_label, title, ylabel)`\n",
"\n",
"# @markdown - `tokenize(sentences)`\n",
"\n",
"\n",
"# Training\n",
"import time\n",
"\n",
"def train(dataloader):\n",
"  # Relies on the globals `model`, `optimizer`, `criterion` and `epoch`\n",
"  model.train()\n",
"  total_acc, total_count = 0, 0\n",
"  running_loss = 0\n",
"  log_interval = 500\n",
"  start_time = time.time()\n",
"\n",
"  for idx, (label, text, offsets) in enumerate(dataloader):\n",
"    optimizer.zero_grad()\n",
"    predicted_label = model(text, offsets)\n",
"    loss = criterion(predicted_label, label)\n",
"    loss.backward()\n",
"    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)\n",
"    optimizer.step()\n",
"    total_acc += (predicted_label.argmax(1) == label).sum().item()\n",
"    total_count += label.size(0)\n",
"    if idx % log_interval == 0 and idx > 0:\n",
"      elapsed = time.time() - start_time\n",
"      print(f'| epoch {epoch:3d} | {idx:5d}/{len(dataloader):5d} batches '\n",
"            f'| accuracy {total_acc/total_count:8.3f}')\n",
"\n",
"      start_time = time.time()\n",
"\n",
"    running_loss += loss.item()\n",
"  # Returns the epoch accuracy and the loss of the last batch (a tensor)\n",
"  return total_acc/total_count, loss\n",
"\n",
"\n",
"def evaluate(dataloader):\n",
"  # Relies on the globals `model` and `criterion`\n",
"  model.eval()\n",
"  total_acc, total_count = 0, 0\n",
"  running_loss = 0\n",
"\n",
"  with torch.no_grad():\n",
"    for idx, (label, text, offsets) in enumerate(dataloader):\n",
"      predicted_label = model(text, offsets)\n",
"      loss = criterion(predicted_label, label)\n",
"      total_acc += (predicted_label.argmax(1) == label).sum().item()\n",
"      total_count += label.size(0)\n",
"      running_loss += loss\n",
"  # Returns the accuracy and the loss of the last batch (a tensor)\n",
"  return total_acc/total_count, loss\n",
"\n",
"\n",
"def load_dataset(train_iter, device='cpu', seed=0, batch_size=32, valid_split=0.7):\n",
"  # Note: `valid_split` is the fraction of the training set kept for training;\n",
"  # the remainder is used for validation\n",
"\n",
"  # Prepare data processing pipelines\n",
"  tokenizer = get_tokenizer('basic_english')\n",
"\n",
"  def yield_tokens(data_iter):\n",
"    for _, text in data_iter:\n",
"      yield tokenizer(text)\n",
"\n",
"  # Create a vocabulary block; unknown tokens map to \"<unk>\"\n",
"  vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=[\"<unk>\"])\n",
"  vocab.set_default_index(vocab[\"<unk>\"])\n",
"\n",
"  # Prepare the text processing pipeline with the tokenizer and vocabulary\n",
"  text_pipeline = lambda x: vocab(tokenizer(x))\n",
"  label_pipeline = lambda x: int(x) - 1\n",
"\n",
"  def collate_batch(batch):\n",
"    # Concatenate all texts into one flat tensor; offsets mark where each text starts\n",
"    label_list, text_list, offsets = [], [], [0]\n",
"    for (_label, _text) in batch:\n",
"      label_list.append(label_pipeline(_label))\n",
"      processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)\n",
"      text_list.append(processed_text)\n",
"      offsets.append(processed_text.size(0))\n",
"    label_list = torch.tensor(label_list, dtype=torch.int64)\n",
"    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)\n",
"    text_list = torch.cat(text_list)\n",
"    return label_list.to(device), text_list.to(device), offsets.to(device)\n",
"\n",
"  # Generate data batch and iterator and split the data\n",
"  train_iter, test_iter = AG_NEWS()\n",
"  train_dataset = to_map_style_dataset(train_iter)\n",
"  test_dataset = to_map_style_dataset(test_iter)\n",
"  num_train = int(len(train_dataset) * valid_split)\n",
"  num_valid = len(train_dataset) - num_train\n",
"  generator = torch.Generator().manual_seed(seed)\n",
"  split_train_, split_valid_ = random_split(train_dataset, [num_train, num_valid], generator=generator)\n",
"\n",
"  train_dataloader = DataLoader(split_train_, batch_size=batch_size,\n",
"                                shuffle=True, collate_fn=collate_batch)\n",
"  valid_dataloader = DataLoader(split_valid_, batch_size=batch_size,\n",
"                                shuffle=True, collate_fn=collate_batch)\n",
"  test_dataloader = DataLoader(test_dataset, batch_size=batch_size,\n",
"                               shuffle=True, collate_fn=collate_batch)\n",
"\n",
"  return vocab, train_dataloader, valid_dataloader, test_dataloader\n",
"\n",
"\n",
"# Plotting\n",
"def plot_train_val(x, train, val, train_label, val_label, title, ylabel):\n",
"  plt.plot(x, train, label=train_label)\n",
"  plt.plot(x, val, label=val_label)\n",
"  plt.legend()\n",
"  plt.xlabel('epoch')\n",
"  plt.ylabel(ylabel)\n",
"  plt.title(title)\n",
"  plt.show()\n",
"\n",
"\n",
"# Dataset\n",
"def tokenize(sentences):\n",
"  # Tokenize the sentence\n",
"  # from nltk.tokenize library use word_tokenize\n",
"  token = word_tokenize(sentences)\n",
"  return token"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown ### Download embeddings and load the dataset\n",
"\n",
"# @markdown This will load 300 dim FastText embeddings.\n",
"\n",
"# @markdown It will take around 3-4 minutes.\n",
"\n",
"embedding_fasttext = FastText('simple')\n",
"\n",
"# load the training data\n",
"VOCAB, train_data, valid_data, test_data = load_dataset(iter(AG_NEWS(split='train')), device=DEVICE, seed=SEED, batch_size=32, valid_split=0.7)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"train_iter = AG_NEWS(split='train')\n",
"num_class = len(set([label for (label, text) in train_iter]))\n",
"vocab_size = len(VOCAB)\n",
"hidden_size = 128\n",
"embedding_length = 300\n",
"model = NeuralNet(num_class, hidden_size, vocab_size, embedding_length, embedding_fasttext).to(DEVICE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# Hyperparameters\n",
"EPOCHS = 10  # number of epochs\n",
"LR = 5  # learning rate\n",
"BATCH_SIZE = 64  # not used below: the dataloaders were built with batch_size=32\n",
"\n",
"criterion = torch.nn.CrossEntropyLoss()\n",
"optimizer = torch.optim.SGD(model.parameters(), lr=LR)\n",
"scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)\n",
"total_accu = None\n",
"\n",
"train_loss, val_loss = [], []\n",
"train_acc, val_acc = [], []\n",
"\n",
"for epoch in range(1, EPOCHS + 1):\n",
"  epoch_start_time = time.time()\n",
"  accu_train, loss_train = train(train_data)\n",
"  accu_val, loss_val = evaluate(valid_data)\n",
"  train_loss.append(loss_train)\n",
"  val_loss.append(loss_val)\n",
"  train_acc.append(accu_train)\n",
"  val_acc.append(accu_val)\n",
"\n",
"  # Decay the learning rate only when validation accuracy fails to improve\n",
"  if total_accu is not None and total_accu > accu_val:\n",
"    scheduler.step()\n",
"  else:\n",
"    total_accu = accu_val\n",
"  print('-' * 59)\n",
"  print('| end of epoch {:3d} | time: {:5.2f}s | '\n",
"        'valid accuracy {:8.3f} '.format(epoch,\n",
"                                         time.time() - epoch_start_time,\n",
"                                         accu_val))\n",
"  print('-' * 59)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"print('Checking the results of test dataset.')\n",
"accu_test, loss_test = evaluate(test_data)\n",
"print('test accuracy {:8.3f}'.format(accu_test))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"plot_train_val(np.arange(EPOCHS), train_acc, val_acc,\n",
" 'training_accuracy', 'validation_accuracy',\n",
" 'Neural Net on AG_NEWS text classification', 'accuracy')\n",
"plot_train_val(np.arange(EPOCHS), [x.detach().cpu().numpy() for x in train_loss],\n",
" [x.detach().cpu().numpy() for x in val_loss],\n",
" 'training_loss', 'validation_loss',\n",
" 'Neural Net on AG_NEWS text classification', 'loss')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"ag_news_label = {1: \"World\",\n",
"                 2: \"Sports\",\n",
"                 3: \"Business\",\n",
"                 4: \"Sci/Tec\"}\n",
"\n",
"# Prepare data processing pipelines\n",
"tokenizer = get_tokenizer('basic_english')\n",
"\n",
"def yield_tokens(data_iter):\n",
"  for _, text in data_iter:\n",
"    yield tokenizer(text)\n",
"\n",
"# Create a vocabulary block; unknown tokens map to \"<unk>\"\n",
"vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=[\"<unk>\"])\n",
"vocab.set_default_index(vocab[\"<unk>\"])\n",
"\n",
"# Prepare the text processing pipeline with the tokenizer and vocabulary\n",
"text_pipeline = lambda x: vocab(tokenizer(x))\n",
"label_pipeline = lambda x: int(x) - 1\n",
"\n",
"def predict(text, text_pipeline):\n",
"  # A single text forms one bag, so its offset is 0\n",
"  with torch.no_grad():\n",
"    text = torch.tensor(text_pipeline(text))\n",
"    output = model(text, torch.tensor([0]))\n",
"    return output.argmax(1).item() + 1\n",
"\n",
"ex_text_str = \"MEMPHIS, Tenn. – Four days ago, Jon Rahm was \\\n",
" enduring the season’s worst weather conditions on Sunday at The \\\n",
" Open on his way to a closing 75 at Royal Portrush, which \\\n",
" considering the wind and the rain was a respectable showing. \\\n",
" Thursday’s first round at the WGC-FedEx St. Jude Invitational \\\n",
" was another story. With temperatures in the mid-80s and hardly any \\\n",
" wind, the Spaniard was 13 strokes better in a flawless round. \\\n",
" Thanks to his best putting performance on the PGA Tour, Rahm \\\n",
" finished with an 8-under 62 for a three-stroke lead, which \\\n",
" was even more impressive considering he’d never played the \\\n",
" front nine at TPC Southwind.\"\n",
"\n",
"model = model.to(\"cpu\")\n",
"\n",
"print(f\"This is a {ag_news_label[predict(ex_text_str, text_pipeline)]} news\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Summary\n",
"\n",
"In this tutorial, we introduced how to process time series, taking language as an example. To process text, we first tokenize it and then convert the tokens into context-oblivious or context-dependent embeddings. Finally, we saw how these word embeddings can be used in applications such as text classification.\n",
"\n",
"If you want to learn about **Multilingual Embeddings** see the Bonus tutorial on [colab](https://colab.research.google.com/github/NeuromatchAcademy/course-content-dl/blob/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/W3D1_Tutorial3.ipynb) or [kaggle](https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content-dl/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/W3D1_Tutorial3.ipynb). But first, we suggest completing Tutorial 2!"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"gpuType": "T4",
"include_colab_link": true,
"name": "W3D1_Tutorial1",
"provenance": [],
"toc_visible": true
},
"gpuClass": "standard",
"kernel": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}