{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "execution": {}, "id": "view-in-github" }, "source": [ "\"Open   \"Open" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Bonus Tutorial: Multilingual Embeddings\n", "\n", "**Week 3, Day 1: Time Series And Natural Language Processing**\n", "\n", "**By Neuromatch Academy**\n", "\n", "__Content creators:__ Alish Dipani, Kelson Shilling-Scrivo, Lyle Ungar\n", "\n", "__Content reviewers:__ Kelson Shilling-Scrivo\n", "\n", "__Content editors:__ Kelson Shilling-Scrivo\n", "\n", "__Production editors:__ Gagana B, Spiros Chavlis\n", "\n", "
\n", "\n", "_Based on Content from: Anushree Hede, Pooja Consul, Ann-Katrin Reuel_" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "----\n", "# Tutorial objectives\n", "\n", "Before we begin with exploring how RNNs excel at modelling sequences, we will explore some of the other ways we can model sequences, encode text, and make meaningful measurements using such encodings and embeddings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# @markdown\n", "from IPython.display import IFrame\n", "from ipywidgets import widgets\n", "out = widgets.Output()\n", "with out:\n", " print(f\"If you want to download the slides: https://osf.io/download/n263c/\")\n", " display(IFrame(src=f\"https://mfr.ca-1.osf.io/render?url=https://osf.io/n263c/?direct%26mode=render%26action=download%26mode=render\", width=730, height=410))\n", "display(out)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install dependencies\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " There may be *errors* and/or *warnings* reported during the installation. However, they are to be ignored.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Install dependencies\n", "# @markdown There may be *errors* and/or *warnings* reported during the installation. However, they are to be ignored.\n", "!pip install python-Levenshtein --quiet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install and import feedback gadget\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Install and import feedback gadget\n", "\n", "!pip3 install vibecheck datatops --quiet\n", "\n", "from vibecheck import DatatopsContentReviewContainer\n", "def content_review(notebook_section: str):\n", " return DatatopsContentReviewContainer(\n", " \"\", # No text prompt\n", " notebook_section,\n", " {\n", " \"url\": \"https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab\",\n", " \"name\": \"neuromatch_dl\",\n", " \"user_key\": \"f379rz8y\",\n", " },\n", " ).render()\n", "\n", "\n", "feedback_prefix = \"W3D1_T3_Bonus\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install fastText\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " If you want to see the original code, go to repo: https://github.com/facebookresearch/fastText.git\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Install fastText\n", "# @markdown If you want to see the original code, go to repo: https://github.com/facebookresearch/fastText.git\n", "\n", "# !pip install git+https://github.com/facebookresearch/fastText.git --quiet\n", "\n", "import os, zipfile, requests\n", "\n", "url = \"https://osf.io/vkuz7/download\"\n", "fname = \"fastText-main.zip\"\n", "\n", "print('Downloading Started...')\n", "# Downloading the file by sending the request to the URL\n", "r = requests.get(url, stream=True)\n", "\n", "# Writing the file to the local file system\n", "with open(fname, 
'wb') as f:\n", " f.write(r.content)\n", "print('Downloading Completed.')\n", "\n", "# opening the zip file in READ mode\n", "with zipfile.ZipFile(fname, 'r') as zipObj:\n", " # extracting all the files\n", " print('Extracting all the files now...')\n", " zipObj.extractall()\n", " print('Done!')\n", " os.remove(fname)\n", "\n", "# Install the package\n", "!pip install fastText-main/ --quiet" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# Imports\n", "import fasttext\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from tqdm import tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Figure Settings\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Figure Settings\n", "import logging\n", "logging.getLogger('matplotlib.font_manager').disabled = True\n", "\n", "import ipywidgets as widgets\n", "%matplotlib inline\n", "%config InlineBackend.figure_format = 'retina'\n", "plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Helper functions\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Helper functions\n", "def cosine_similarity(vec_a, vec_b):\n", " \"\"\"Compute cosine similarity between vec_a and vec_b\"\"\"\n", " return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))\n", "\n", "\n", "def getSimilarity(word1, word2):\n", " v1 = ft_en_vectors.get_word_vector(word1)\n", " v2 = ft_en_vectors.get_word_vector(word2)\n", " return cosine_similarity(v1, v2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set random seed\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Executing `set_seed(seed=seed)` sets the seed\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Set random seed\n", "\n", "# @markdown Executing `set_seed(seed=seed)` sets the seed\n", "\n", "# For DL it's critical to set the random seed so that students can have a\n", "# baseline to compare their results to expected results.\n", "# Read more here: https://pytorch.org/docs/stable/notes/randomness.html\n", "\n", "# Call the `set_seed` function in the exercises to ensure reproducibility.\n", "import random\n", "import torch\n", "\n", "def set_seed(seed=None, seed_torch=True):\n", " \"\"\"\n", " Function that controls randomness.\n", " NumPy and random modules must be imported.\n", "\n", " Args:\n", " seed : Integer\n", " A non-negative integer that defines the random state. Default is `None`.\n", " seed_torch : Boolean\n", " If `True` sets the random seed for pytorch tensors, so pytorch module\n", " must be imported. 
Default is `True`.\n", "\n", " Returns:\n", " Nothing.\n", " \"\"\"\n", " if seed is None:\n", " seed = np.random.choice(2 ** 32)\n", " random.seed(seed)\n", " np.random.seed(seed)\n", " if seed_torch:\n", " torch.manual_seed(seed)\n", " torch.cuda.manual_seed_all(seed)\n", " torch.cuda.manual_seed(seed)\n", " torch.backends.cudnn.benchmark = False\n", " torch.backends.cudnn.deterministic = True\n", "\n", " print(f'Random seed {seed} has been set.')\n", "\n", "# In case `DataLoader` is used\n", "def seed_worker(worker_id):\n", " \"\"\"\n", " DataLoader will reseed workers following the randomness of the\n", " multi-process data loading algorithm.\n", "\n", " Args:\n", " worker_id: integer\n", " ID of subprocess to seed. 0 means that\n", " the data will be loaded in the main process\n", " Refer: https://pytorch.org/docs/stable/data.html#data-loading-randomness for more details\n", "\n", " Returns:\n", " Nothing\n", " \"\"\"\n", " worker_seed = torch.initial_seed() % 2**32\n", " np.random.seed(worker_seed)\n", " random.seed(worker_seed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set device (GPU or CPU). Execute `set_device()`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Set device (GPU or CPU). Execute `set_device()`\n", "\n", "# Inform the user if the notebook uses GPU or CPU.\n", "\n", "def set_device():\n", " \"\"\"\n", " Set the device. CUDA if available, CPU otherwise\n", "\n", " Args:\n", " None\n", "\n", " Returns:\n", " Nothing\n", " \"\"\"\n", " device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", " if device != \"cuda\":\n", " print(\"WARNING: For this notebook to perform best, \"\n", " \"if possible, in the menu under `Runtime` -> \"\n", " \"`Change runtime type`, select `GPU`.\")\n", " else:\n", " print(\"GPU is enabled in this notebook.\")\n", "\n", " return device" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "DEVICE = set_device()\n", "SEED = 2021\n", "set_seed(seed=SEED)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Section 1: Multilingual Embeddings\n", "\n", "Traditionally, word embeddings have been language-specific, with embeddings for each language trained separately and existing in entirely different vector spaces. But what if we want to compare words in one language to words in another? Say we want to create a text classifier with a corpus of both English and Spanish words.\n", "\n", "We use the multilingual word embeddings provided by fastText. More information can be found [here](https://engineering.fb.com/2018/01/24/ml-applications/under-the-hood-multilingual-embeddings/)." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Training multilingual embeddings\n", "\n", "We first train separate embeddings for each language using fastText and a combination of data from Facebook and Wikipedia. Then, we find a dictionary of common words between the two languages. The dictionaries are automatically induced from parallel data - datasets consisting of pairs of sentences in two languages with the same meaning.\n", "\n", "Next, we find a matrix that projects the embeddings into a common space between the given languages. The matrix is chosen to minimize the distance between each word $x_i$ and the projection of its translation $y_i$. If our dictionary consists of pairs $(x_i, y_i)$, our projector $M$ would be:\n", "\n", "\\begin{equation}\n", "M = \\underset{W}{\\operatorname{argmin}} \\sum_i ||x_i - Wy_i||^2\n", "\\end{equation}\n", "\n", "Also, the projector matrix $M$ is constrained to be orthogonal, so actual distances between word embedding vectors are preserved. Multilingual models are trained by using our multilingual word embeddings as the base representations in DeepText and “freezing” them, i.e., leaving them unchanged during the training process.\n", "\n", "After going through this, try to replicate the above exercises but in different languages!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @markdown ### Download FastText English Embeddings of dimension 100\n", "# @markdown This will take 1-2 minutes to run\n", "\n", "import os, zipfile, requests\n", "\n", "url = \"https://osf.io/2frqg/download\"\n", "fname = \"cc.en.100.bin.gz\"\n", "\n", "print('Downloading Started...')\n", "# Downloading the file by sending the request to the URL\n", "r = requests.get(url, stream=True)\n", "\n", "# Writing the file to the local file system\n", "with open(fname, 'wb') as f:\n", " f.write(r.content)\n", "print('Downloading Completed.')\n", "\n", "# opening the zip file in READ mode\n", "with zipfile.ZipFile(fname, 'r') as zipObj:\n", " # extracting all the files\n", " print('Extracting all the files now...')\n", " zipObj.extractall()\n", " print('Done!')\n", " os.remove(fname)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# Load 100 dimension FastText Vectors using FastText library\n", "ft_en_vectors = fasttext.load_model('cc.en.100.bin')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @markdown ### Download FastText French Embeddings of dimension 100\n", "\n", "# @markdown **Note:** This cell might take 2-4 minutes to run\n", "\n", "import os, zipfile, requests\n", "\n", "url = \"https://osf.io/rqadk/download\"\n", "fname = \"cc.fr.100.bin.gz\"\n", "\n", "print('Downloading Started...')\n", "# Downloading the file by sending the request to the URL\n", "r = requests.get(url, stream=True)\n", "\n", "# Writing the file to the local file system\n", "with open(fname, 'wb') as f:\n", " f.write(r.content)\n", "print('Downloading Completed.')\n", "\n", "# opening the zip file in READ mode\n", "with zipfile.ZipFile(fname, 'r') as zipObj:\n", " # extracting all the files\n", " print('Extracting all the files now...')\n", " zipObj.extractall()\n", " print('Done!')\n", " os.remove(fname)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# Load 100 dimension FastText Vectors using FastText library\n", "french = fasttext.load_model('cc.fr.100.bin')" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "First, we look at the cosine similarity between different languages without projecting them into the same vector space. As you can see, words with the same meaning in different languages have a cosine similarity close to $0$ - so they are neither similar nor dissimilar."
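] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Before we measure similarities on the real embeddings, here is a minimal sketch of the alignment step itself, using random toy vectors rather than the actual fastText embeddings. Under the orthogonality constraint, the objective above has a closed-form solution, known as the *orthogonal Procrustes* problem: if $X$ and $Y$ stack the paired dictionary vectors row-wise and $X^TY = U\\Sigma V^T$ is its SVD, the optimal map is $W = UV^T$. This is exactly what `learn_transformation` computes later in this tutorial. The toy cell below (the dimensions `d` and `n` are arbitrary choices) builds data with a known orthogonal map and checks that the SVD recovers it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# Minimal sketch on toy data (not the real embeddings)\n", "import numpy as np\n", "\n", "rng = np.random.default_rng(0)\n", "d, n = 4, 50  # toy embedding dimension and number of dictionary pairs\n", "\n", "# A known random orthogonal map (QR of a random matrix)\n", "W_true, _ = np.linalg.qr(rng.normal(size=(d, d)))\n", "\n", "X = rng.normal(size=(n, d))  # toy 'source' vectors\n", "Y = X @ W_true  # toy 'target' vectors: an orthogonal image of X\n", "\n", "# Orthogonal Procrustes: the SVD of X^T Y gives W = U V^T\n", "U, s, Vt = np.linalg.svd(X.T @ Y)\n", "W = U @ Vt\n", "\n", "print('W is orthogonal:', np.allclose(W @ W.T, np.eye(d)))\n", "print('W recovers the true map:', np.allclose(W, W_true))\n", "print('Alignment error:', np.linalg.norm(X @ W - Y))" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Now, back to the real embeddings and their unaligned similarities:"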
] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "hello = ft_en_vectors.get_word_vector('hello')\n", "hi = ft_en_vectors.get_word_vector('hi')\n", "bonjour = french.get_word_vector('bonjour')\n", "\n", "print(f\"Cosine Similarity between HI and HELLO: {cosine_similarity(hello, hi)}\")\n", "print(f\"Cosine Similarity between BONJOUR and HELLO: {cosine_similarity(hello, bonjour)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "cat = ft_en_vectors.get_word_vector('cat')\n", "chatte = french.get_word_vector('chatte')\n", "chat = french.get_word_vector('chat')\n", "\n", "print(f\"Cosine Similarity between cat and chatte: {cosine_similarity(cat, chatte)}\")\n", "print(f\"Cosine Similarity between cat and chat: {cosine_similarity(cat, chat)}\")\n", "print(f\"Cosine Similarity between chatte and chat: {cosine_similarity(chatte, chat)}\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "First, let's define a list of words that are in common between English and French. We'll be using this to make our training matrices." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "en_words = set(ft_en_vectors.words)\n", "fr_words = set(french.words)\n", "overlap = list(en_words & fr_words)\n", "bilingual_dictionary = [(entry, entry) for entry in overlap]" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "We define a few functions to make our lives a bit easier: `make_training_matrices` takes in the source words, target language words, and the set of common words. It then creates a matrix of all the word embeddings of all common words between the languages (in each language). These are our training matrices.\n", "\n", "The function `learn_transformation` then takes in these matrices, normalizes them, and performs SVD, which aligns the source language to the target and returns a transformation matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "def make_training_matrices(source_dictionary, target_dictionary,\n", " bilingual_dictionary):\n", " source_matrix = []\n", " target_matrix = []\n", " for (source, target) in tqdm(bilingual_dictionary):\n", " # if source in source_dictionary.words and target in target_dictionary.words:\n", " source_matrix.append(source_dictionary.get_word_vector(source))\n", " target_matrix.append(target_dictionary.get_word_vector(target))\n", " # return training matrices\n", " return np.array(source_matrix), np.array(target_matrix)\n", "\n", "\n", "# from https://stackoverflow.com/questions/21030391/how-to-normalize-array-numpy\n", "def normalized(a, axis=-1, order=2):\n", " \"\"\"Utility function to normalize the rows of a numpy array.\"\"\"\n", " l2 = np.atleast_1d(np.linalg.norm(a, order, axis))\n", " l2[l2==0] = 1\n", " return a / np.expand_dims(l2, axis)\n", "\n", "\n", "def learn_transformation(source_matrix, target_matrix, normalize_vectors=True):\n", " \"\"\"\n", " Source and target matrices are numpy arrays, shape\n", " (dictionary_length, embedding_dimension). 
These contain paired\n", " word vectors from the bilingual dictionary.\n", " \"\"\"\n", " # optionally normalize the training vectors\n", " if normalize_vectors:\n", " source_matrix = normalized(source_matrix)\n", " target_matrix = normalized(target_matrix)\n", " # perform the SVD of (source^T target)\n", " product = np.matmul(source_matrix.transpose(), target_matrix)\n", " # note: np.linalg.svd returns the third factor already transposed,\n", " # so V here holds V^T and np.matmul(U, V) is the Procrustes solution U V^T\n", " U, s, V = np.linalg.svd(product)\n", " # return orthogonal transformation which aligns source language to the target\n", " return np.matmul(U, V)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Now, we just have to put it all together!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "source_training_matrix, target_training_matrix = make_training_matrices(ft_en_vectors, french, bilingual_dictionary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "transform = learn_transformation(source_training_matrix, target_training_matrix)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Let's run the same examples as above, but this time, whenever we use French words, we multiply the embedding by the transpose of the transformation matrix. Since the transformation is orthogonal, its transpose is its inverse, so this maps the French vectors into the English vector space. That works a lot better!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "hello = ft_en_vectors.get_word_vector('hello')\n", "hi = ft_en_vectors.get_word_vector('hi')\n", "bonjour = np.matmul(french.get_word_vector('bonjour'), transform.T)\n", "\n", "print(f\"Cosine Similarity between HI and HELLO: {cosine_similarity(hello, hi)}\")\n", "print(f\"Cosine Similarity between BONJOUR and HELLO: {cosine_similarity(hello, bonjour)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "cat = ft_en_vectors.get_word_vector('cat')\n", "chatte = np.matmul(french.get_word_vector('chatte'), transform.T)\n", "chat = np.matmul(french.get_word_vector('chat'), transform.T)\n", "\n", "print(f\"Cosine Similarity between cat and chatte: {cosine_similarity(cat, chatte)}\")\n", "print(f\"Cosine Similarity between cat and chat: {cosine_similarity(cat, chat)}\")\n", "print(f\"Cosine Similarity between chatte and chat: {cosine_similarity(chatte, chat)}\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Now, try a couple of your own examples. Try some of the examples you looked at in Tutorial 1, Section 2.1, but with English and French. Does it work as expected?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Multilingual_Embeddings_Bonus_Activity\")" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "include_colab_link": true, "name": "W3D1_Tutorial3", "provenance": [], "toc_visible": true }, "gpuClass": "standard", "kernel": { "display_name": "Python 3", "language": "python", "name": "python3" }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }