{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {},
"id": "view-in-github"
},
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"# Twitter Sentiment Analysis\n",
"\n",
"**By Neuromatch Academy**\n",
"\n",
"__Content creators:__ Juan Manuel Rodriguez, Salomey Osei, Gonzalo Uribarri\n",
"\n",
"__Production editors:__ Amita Kapoor, Spiros Chavlis"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Welcome to the NLP project template"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Step 1: Questions and goals\n",
"\n",
"* Can we infer emotion from the text of a tweet?\n",
"* How are words distributed across the dataset?\n",
"* Are particular words associated with one kind of emotion?"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Step 2: Literature review\n",
"\n",
"[Original Dataset Paper](https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf)\n",
"\n",
"[Papers with code](https://paperswithcode.com/dataset/imdb-movie-reviews)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Step 3: Load and explore the dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install dependencies\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
"torchvision 0.14.0 requires torch==1.13.0, but you have torch 2.3.1 which is incompatible.\u001b[0m\u001b[31m\n",
"\u001b[0m\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
"torchvision 0.14.0 requires torch==1.13.0, but you have torch 2.3.1 which is incompatible.\u001b[0m\u001b[31m\n",
"\u001b[0m"
]
}
],
"source": [
"# @title Install dependencies\n",
"!pip install pandas --quiet\n",
"!pip install torchtext --quiet\n",
"!pip install datasets --quiet"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/yuda/code/neuromatch/course-content-dl/venv/lib/python3.9/site-packages/torchtext/data/__init__.py:4: UserWarning: \n",
"/!\\ IMPORTANT WARNING ABOUT TORCHTEXT STATUS /!\\ \n",
"Torchtext is deprecated and the last released version will be 0.18 (this one). You can silence this warning by calling the following at the beginnign of your scripts: `import torchtext; torchtext.disable_torchtext_deprecation_warning()`\n",
" warnings.warn(torchtext._TORCHTEXT_DEPRECATION_MSG)\n"
]
}
],
"source": [
"# Import the libraries used to load, explore, and model the dataset\n",
"import os\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from collections import Counter\n",
"from tqdm.notebook import tqdm\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.optim as optim\n",
"import torch.nn.functional as F\n",
"from torch.utils.data import TensorDataset, DataLoader\n",
"\n",
"import torchtext\n",
"from torchtext.data import get_tokenizer\n",
"\n",
"from sklearn.utils import shuffle\n",
"from sklearn.metrics import classification_report\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.feature_extraction.text import CountVectorizer"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"The dataset we are going to use, Sentiment140, is available on [the Hugging Face Hub](https://huggingface.co/datasets/stanfordnlp/sentiment140)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "cd6b527daf25440a8acc7a7645af679d",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading builder script: 0%| | 0.00/4.03k [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5b7c93b4ee524a01adec2669170ee097",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading readme: 0%| | 0.00/6.84k [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "19e6cbe3cb7648b4b5e0b2ce60c12d6b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading data: 0%| | 0.00/81.4M [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "e5bfc518fd6a48deb991309cec03facd",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating train split: 0%| | 0/1600000 [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "983915e475a747bd920d53cd924eb7ca",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating test split: 0%| | 0/498 [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from datasets import load_dataset\n",
"\n",
"dataset = load_dataset(\"stanfordnlp/sentiment140\", trust_remote_code=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>polarity</th><th>user</th><th>date</th><th>query</th><th>user</th><th>text</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>_TheSpecialOne_</td><td>Mon Apr 06 22:19:45 PDT 2009</td><td>NO_QUERY</td><td>_TheSpecialOne_</td><td>@switchfoot http://twitpic.com/2y1zl - Awww, t...</td></tr>\n",
"    <tr><th>1</th><td>0</td><td>scotthamilton</td><td>Mon Apr 06 22:19:49 PDT 2009</td><td>NO_QUERY</td><td>scotthamilton</td><td>is upset that he can't update his Facebook by ...</td></tr>\n",
"    <tr><th>2</th><td>0</td><td>mattycus</td><td>Mon Apr 06 22:19:53 PDT 2009</td><td>NO_QUERY</td><td>mattycus</td><td>@Kenichan I dived many times for the ball. Man...</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>ElleCTF</td><td>Mon Apr 06 22:19:57 PDT 2009</td><td>NO_QUERY</td><td>ElleCTF</td><td>my whole body feels itchy and like its on fire</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>Karoli</td><td>Mon Apr 06 22:19:57 PDT 2009</td><td>NO_QUERY</td><td>Karoli</td><td>@nationwideclass no, it's not behaving at all....</td></tr>\n",
"  </tbody>\n",
"</table>"