Bonus Tutorial: Multilingual Embeddings


Week 3, Day 1: Time Series And Natural Language Processing

By Neuromatch Academy

Content creators: Alish Dipani, Kelson Shilling-Scrivo, Lyle Ungar

Content reviewers: Kelson Shilling-Scrivo

Content editors: Kelson Shilling-Scrivo

Production editors: Gagana B, Spiros Chavlis

Based on Content from: Anushree Hede, Pooja Consul, Ann-Katrin Reuel

Tutorial objectives#

Before exploring how RNNs excel at modelling sequences, we will look at some other ways to model sequences, encode text, and make meaningful measurements using such encodings and embeddings.

Setup#

Install dependencies#

Errors and/or warnings may be reported during the installation. These can usually be ignored.

Install and import feedback gadget#

Install fastText#

If you want to see the original code, go to repo: https://github.com/facebookresearch/fastText.git


# Imports
import fasttext
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

Figure Settings#

Helper functions#

Set random seed#

Executing `set_seed(seed=seed)` sets the random seed.

Set device (GPU or CPU). Execute `set_device()`#

DEVICE = set_device()
SEED = 2021
set_seed(seed=SEED)

WARNING: For this notebook to perform best, if possible, in the menu under `Runtime` -> `Change runtime type.`  select `GPU` 
Random seed 2021 has been set.

Section 1: Multilingual Embeddings#

Traditionally, word embeddings have been language-specific, with embeddings for each language trained separately and existing in entirely different vector spaces. But, what if we wanted to compare words in one language to another? Say we want to create a text classifier with a corpus of English and Spanish words.

We use the multilingual word embeddings provided in fastText. More information can be found here.

Training multilingual embeddings#

We first train separate embeddings for each language using fastText and a combination of data from Facebook and Wikipedia. Then, we find a dictionary of common words between the two languages. The dictionaries are automatically induced from parallel data, that is, datasets consisting of pairs of sentences in two languages with the same meaning.

Then, we find a matrix that projects the embeddings into a common space between the given languages. The matrix is chosen to minimize the distance between a word \(x_i\) and the projection of its translation \(y_i\). If our dictionary consists of pairs \((x_i, y_i)\), our projector \(M\) would be:

\[\begin{equation} M = \underset{W}{\operatorname{argmin}} \sum_i ||x_i - Wy_i||^2 \end{equation}\]

Also, the projector matrix \(M\) is constrained to be orthogonal, so that distances between word embedding vectors are preserved. Multilingual models are then trained by using these multilingual word embeddings as the base representations in DeepText and “freezing” them, that is, leaving them unchanged during the training process.
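As a brief aside, the orthogonality constraint is what lets this problem be solved with a single SVD, which is the route the code later in this tutorial takes. The derivation below is a sketch of the standard orthogonal Procrustes argument, not something stated in the original text. The matrices \(X\) and \(Y\) (rows \(x_i^\top\) and \(y_i^\top\)) and the SVD factors \(U\), \(S\), \(V\) are notation introduced here for illustration.

```latex
% For orthogonal W, ||W y_i|| = ||y_i||, so the squared distance expands to
\begin{align}
\sum_i \|x_i - W y_i\|^2
  &= \sum_i \|x_i\|^2 + \sum_i \|y_i\|^2 - 2 \sum_i x_i^\top W y_i .
\end{align}
% Only the last term depends on W, so minimizing the distance is the same as maximizing
\begin{align}
\sum_i x_i^\top W y_i &= \operatorname{tr}\!\left(W\, Y^\top X\right).
\end{align}
% Write the SVD X^\top Y = U S V^\top, so that Y^\top X = V S U^\top and
\begin{align}
\operatorname{tr}\!\left(W\, V S U^\top\right)
  &= \operatorname{tr}\!\left(U^\top W V\, S\right) \le \operatorname{tr}(S),
\end{align}
% with equality when U^\top W V = I, which gives the closed-form solution
\begin{align}
M &= U V^\top .
\end{align}
```

In row-vector form this means \(y_i^\top M^\top \approx x_i^\top\): multiplying a vector from the second language by the transpose of the learned matrix moves it into the space of the first.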

After going through this, try to replicate the above exercises but in different languages!


# Load 100 dimension FastText Vectors using FastText library
ft_en_vectors = fasttext.load_model('cc.en.100.bin')


# Load 100 dimension FastText Vectors using FastText library
french = fasttext.load_model('cc.fr.100.bin')

First, we look at the cosine similarity between words in different languages without projecting them into the same vector space. As you can see, a word and its translation have a cosine similarity much closer to \(0\) than two related words within one language, so the unaligned vectors are neither clearly similar nor clearly dissimilar.

hello = ft_en_vectors.get_word_vector('hello')
hi = ft_en_vectors.get_word_vector('hi')
bonjour = french.get_word_vector('bonjour')

print(f"Cosine Similarity between HI and HELLO: {cosine_similarity(hello, hi)}")
print(f"Cosine Similarity between BONJOUR and HELLO: {cosine_similarity(hello, bonjour)}")

Cosine Similarity between HI and HELLO: 0.7028388977050781
Cosine Similarity between BONJOUR and HELLO: 0.20523205399513245

cat = ft_en_vectors.get_word_vector('cat')
chatte = french.get_word_vector('chatte')
chat = french.get_word_vector('chat')

print(f"Cosine Similarity between cat and chatte: {cosine_similarity(cat, chatte)}")
print(f"Cosine Similarity between cat and chat: {cosine_similarity(cat, chat)}")
print(f"Cosine Similarity between chatte and chat: {cosine_similarity(chatte, chat)}")

Cosine Similarity between cat and chatte: -0.013087842613458633
Cosine Similarity between cat and chat: -0.02490561455488205
Cosine Similarity between chatte and chat: 0.6003134250640869
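The `cosine_similarity` helper used above is defined in the tutorial's helper cell, which is not shown here. The following is a minimal sketch of what such a helper might look like, assuming the standard definition (dot product divided by the product of the vector norms) and non-zero input vectors. The toy vectors are made up for illustration and are not fastText embeddings.

```python
import numpy as np


def cosine_similarity(vec_a, vec_b):
  """Cosine of the angle between two 1-D vectors (assumed non-zero)."""
  return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))


# Toy vectors standing in for word embeddings
same_direction = cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
perpendicular = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
print(same_direction, perpendicular, opposite)
```

Values near 1 indicate vectors pointing the same way, values near 0 indicate unrelated directions, and values near -1 indicate opposite directions.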

Next, let’s define a list of words that appear in both the English and the French vocabularies. Each shared string is paired with itself, and we’ll use these pairs to build our training matrices.

en_words = set(ft_en_vectors.words)
fr_words = set(french.words)
overlap = list(en_words & fr_words)
bilingual_dictionary = [(entry, entry) for entry in overlap]
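To make the construction concrete, here is a toy version of the same set intersection. The two small vocabularies are hypothetical and stand in for `ft_en_vectors.words` and `french.words`. Note that identical spelling does not guarantee identical meaning: "chat" is shared by both toy vocabularies but means different things in English and French, so a dictionary built this way is only an approximate, noisy source of translation pairs.

```python
# Hypothetical toy vocabularies (the real ones come from the fastText models)
toy_en_words = {"cat", "hello", "table", "chat", "paris"}
toy_fr_words = {"chat", "bonjour", "table", "paris"}

# Words spelled identically in both vocabularies, sorted for a stable order
toy_overlap = sorted(toy_en_words & toy_fr_words)

# Each shared string is paired with itself, as in bilingual_dictionary
toy_dictionary = [(entry, entry) for entry in toy_overlap]
print(toy_dictionary)
```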

We define a few functions to make our lives a bit easier: make_training_matrices takes the source-language model, the target-language model, and the list of common word pairs. It then looks up the embedding of every common word in each language and stacks them into two matrices, one per language. These are our training matrices.

The function learn_transformation then takes these matrices, normalizes their rows, and performs an SVD of their product. From the SVD factors it builds and returns an orthogonal transformation matrix that aligns the source language to the target.

def make_training_matrices(source_dictionary, target_dictionary,
                           bilingual_dictionary):
  source_matrix = []
  target_matrix = []
  for (source, target) in tqdm(bilingual_dictionary):
    # if source in source_dictionary.words and target in target_dictionary.words:
    source_matrix.append(source_dictionary.get_word_vector(source))
    target_matrix.append(target_dictionary.get_word_vector(target))
  # return training matrices
  return np.array(source_matrix), np.array(target_matrix)


# from https://stackoverflow.com/questions/21030391/how-to-normalize-array-numpy
def normalized(a, axis=-1, order=2):
  """Utility function to normalize the rows of a numpy array."""
  l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
  l2[l2==0] = 1
  return a / np.expand_dims(l2, axis)


def learn_transformation(source_matrix, target_matrix, normalize_vectors=True):
  """
  Source and target matrices are numpy arrays, shape
  (dictionary_length, embedding_dimension). These contain paired
  word vectors from the bilingual dictionary.
  """
  # optionally normalize the training vectors
  if normalize_vectors:
    source_matrix = normalized(source_matrix)
    target_matrix = normalized(target_matrix)
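  # perform the SVD of the product of the paired embedding matrices
  product = np.matmul(source_matrix.transpose(), target_matrix)
  # np.linalg.svd returns the third factor already transposed (V^T)
  U, s, Vt = np.linalg.svd(product)
  # return orthogonal transformation which aligns source language to the target
  return np.matmul(U, Vt)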
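One way to build confidence in this recipe is to test it on synthetic data where the answer is known. The sketch below is an illustration rather than part of the original tutorial: it creates random "target" vectors, rotates them with a hidden orthogonal matrix to obtain "source" vectors, and checks that the same SVD step recovers that matrix. All names (`true_rotation`, `recovered`, and so on) are made up for this example, and the normalization step is skipped for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 5, 200

# Hidden orthogonal matrix (Q factor of a random matrix) and random "target" vectors
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
target = rng.normal(size=(n_pairs, dim))

# Construct "source" vectors so that source @ true_rotation == target
source = target @ true_rotation.T

# Same SVD step as in learn_transformation
U, _, Vt = np.linalg.svd(source.T @ target)
recovered = U @ Vt

print(np.allclose(recovered, true_rotation))              # hidden matrix is recovered
print(np.allclose(recovered @ recovered.T, np.eye(dim)))  # result is orthogonal
print(np.allclose(source @ recovered, target))            # source rows map onto target rows
```

With real embeddings the two spaces are not exact rotations of each other, so the alignment is only approximate, but the direction of the mapping is the same: source rows times the transform approximate target rows, and target rows times its transpose approximate source rows.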

Now, we just have to put it all together!

source_training_matrix, target_training_matrix = make_training_matrices(ft_en_vectors, french, bilingual_dictionary)

transform = learn_transformation(source_training_matrix, target_training_matrix)

Let’s run the same examples as above, but this time, whenever we use a French word, we multiply its embedding by the transpose of the transform matrix, which maps it into the English vector space. That works a lot better!

hello = ft_en_vectors.get_word_vector('hello')
hi = ft_en_vectors.get_word_vector('hi')
bonjour = np.matmul(french.get_word_vector('bonjour'), transform.T)

print(f"Cosine Similarity between HI and HELLO: {cosine_similarity(hello, hi)}")
print(f"Cosine Similarity between BONJOUR and HELLO: {cosine_similarity(hello, bonjour)}")

Cosine Similarity between HI and HELLO: 0.7028388977050781
Cosine Similarity between BONJOUR and HELLO: 0.5818603038787842

cat = ft_en_vectors.get_word_vector('cat')
chatte = np.matmul(french.get_word_vector('chatte'), transform.T)
chat = np.matmul(french.get_word_vector('chat'), transform.T)

print(f"Cosine Similarity between cat and chatte: {cosine_similarity(cat, chatte)}")
print(f"Cosine Similarity between cat and chat: {cosine_similarity(cat, chat)}")
print(f"Cosine Similarity between chatte and chat: {cosine_similarity(chatte, chat)}")

Cosine Similarity between cat and chatte: 0.43272829055786133
Cosine Similarity between cat and chat: 0.6866635680198669
Cosine Similarity between chatte and chat: 0.6003133654594421
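Notice that the similarity between chatte and chat is essentially unchanged by the projection (about 0.6003 before and after). This is what the orthogonality constraint predicts: an orthogonal matrix preserves lengths and angles, so similarities within a language are untouched. The sketch below illustrates this with random vectors and a random orthogonal matrix; the data and names are made up for illustration and do not come from the fastText models.

```python
import numpy as np


def cosine(vec_a, vec_b):
  """Cosine similarity between two non-zero 1-D vectors."""
  return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))


rng = np.random.default_rng(1)
dim = 8

# Random orthogonal matrix (Q factor of a random square matrix)
orthogonal, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

# Two made-up "word vectors" from the same language
word_a = rng.normal(size=dim)
word_b = rng.normal(size=dim)

before = cosine(word_a, word_b)
after = cosine(word_a @ orthogonal.T, word_b @ orthogonal.T)
print(np.isclose(before, after))
```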

Now, try a couple of your own examples. Revisit some of the examples you looked at in Tutorial 1, Section 2.1, but with English and French. Does it work as expected?
