Predicting the Sentiment of IMDB Movie Reviews using LSTM in PyTorch
Credits
This notebook takes inspiration and ideas from the following sources.
- “Machine learning with PyTorch and Scikit-Learn” by “Sebastian Raschka, Yuxi (Hayden) Liu, and Vahid Mirjalili”. You can get the book from its website: Machine learning with PyTorch and Scikit-Learn. In addition, the GitHub repository for this book has valuable notebooks: github.com/rasbt/machine-learning-book. Parts of the code you see in this notebook are taken from chapter 15 notebook of the same book.
- “Intro to Deep Learning and Generative Models Course” lecture series from “Sebastian Raschka”. Course website: stat453-ss2021. YouTube Link: Intro to Deep Learning and Generative Models Course. Lectures that are related to this post are L15.5 Long Short-Term Memory and L15.7 An RNN Sentiment Classifier in PyTorch
Environment
This notebook is prepared with Google Colab.
- GitHub: 2022-11-09-pytorch-lstm-imdb-sentiment-prediction.ipynb
- Open In Colab:
Under “Runtime type”, choose “GPU” as the hardware accelerator. The training will take a long time to complete without a GPU.
This notebook also depends on the PyTorch library TorchText, which we will use to fetch the IMDB review data. While using the latest torchtext version, I ran into additional dependencies on other libraries such as torchdata. Even after resolving them, it threw strange encoding errors while fetching the IMDB data. So I downgraded torchtext to the most recent version I found to work without external dependencies. Consequently, torch is also downgraded to a compatible version, but I did not run into any issues using this lower version of PyTorch for this notebook. It is best to restart the runtime after the library installation is complete.
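The install cell itself is not shown in this post; in Colab, pinning the versions listed below would look something like this (a sketch):
!pip install torchtext==0.11.0  # torch==1.10.0 is pulled in as a compatible dependency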
Code
python==3.7.15
numpy==1.21.6
torch==1.10.0+cu102
torchtext==0.11.0
matplotlib==3.2.2
Data Preparation
Download data
Let’s download our movie review dataset. This dataset is also known as the Large Movie Review Dataset and can also be obtained as a compressed archive from this link. Using the torchtext library makes downloading, extracting, and reading the files a lot easier. ‘torchtext.datasets’ comes with many more NLP-related datasets, and a full list can be found here.
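The download cell is not shown here; a minimal sketch using torchtext 0.11 (the variable names train_dataset_raw and test_dataset_raw are assumptions chosen to match the later cells):
##
# fetch the raw IMDB train and test splits via torchtext
from torchtext.datasets import IMDB
train_dataset_raw, test_dataset_raw = IMDB(split=("train", "test"))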
Check the size of the downloaded data.
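One way to do that (assuming torchtext’s default download root of .data):
!du -sh .data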
Split train data further into train and validation set
Both train and test datasets have 25000 reviews. Therefore, we can split the training set further into the train and validation sets.
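The split cell is not shown; a sketch using torch.utils.data.random_split with a 20000/5000 split (the split sizes and the seed are assumptions):
##
# split the 25000 training reviews into 20000 train and 5000 validation examples
import torch
from torch.utils.data.dataset import random_split

torch.manual_seed(1)
train_dataset, valid_dataset = random_split(list(train_dataset_raw), [20000, 5000])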
How does this data look?
The data we have is in the form of tuples. The first index has the sentiment label, and the second contains the review text. Let’s check the first element in our training dataset.
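For example (a sketch; after the split above, train_dataset supports plain indexing):
train_dataset[0]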
('pos',
'An extra is called upon to play a general in a movie about the Russian Revolution. However, he is not any ordinary extra. He is Serguis Alexander, former commanding general of the Russia armies who is now being forced to relive the same scene, which he suffered professional and personal tragedy in, to satisfy the director who was once a revolutionist in Russia and was humiliated by Alexander. It can now be the time for this broken man to finally "win" his penultimate battle. This is one powerful movie with meticulous direction by Von Sternberg, providing the greatest irony in Alexander\'s character in every way he can. Jannings deserved his Oscar for the role with a very moving performance playing the general at his peak and at his deepest valley. Powell lends a sinister support as the revenge minded director and Brent is perfect in her role with her face and movements showing so much expression as Jannings\' love. All around brilliance. Rating, 10.')
Check the first element of the validation set.
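Again as a sketch:
valid_dataset[0]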
('neg',
'The Dereks did seem to struggle to find rolls for Bo after "10".<br /><br />I used to work for a marine park in the Florida Keys. One day, the script for "Ghosts Can\'t Do It" was circulating among the trainers in the "fish house" where food was prepared for the dolphins. There was one scene where a -dolphin- supposedly propositions Bo (or Bo the dolphin), asking to "go make eggs." Reading the script, we -lauuughed-...<br /><br />We did not end up doing any portion of this movie at our facility, although our dolphins -were- in "The Big Blue!"<br /><br />This must have been very close to the end of Anthony Quinn\'s life. I hope he had fun in this film, as it certainly didn\'t do anything for his legacy.')
Data preprocessing steps
From these two reviews, we can deduce that
- We have two labels. ‘pos’ for a positive and ‘neg’ for a negative review
- From the second review (from valid_dataset), we can also see that the text may contain HTML tags, special characters, and emoticons besides normal English words. Some preprocessing will be required to handle them for proper word tokenization.
- Reviews can have varying text lengths. It will require some padding to make all review texts the same size.
Let’s take a simple text example and walk through these steps to understand why they are essential in preprocessing. In the last step, we will create tokens from the preprocessed text.
example_text = '''This is awesome movie <br /><br />. I loved it so much :-) I\'m goona watch it again :)'''
example_text
"This is awesome movie <br /><br />. I loved it so much :-) I'm goona watch it again :)"
##
# step 1. remove HTML tags. they are not helpful in understanding the sentiments of a review
import re
text = re.sub('<[^>]*>', '', example_text)
text
"This is awesome movie . I loved it so much :-) I'm goona watch it again :)"
"this is awesome movie . i loved it so much :-) i'm goona watch it again :)"
##
# step 3: extract emoticons. keep them as they are important sentiment signals
emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
emoticons
[':-)', ':)']
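The step-4 cell is not shown; it replaces runs of punctuation and other non-word characters with a single space:
##
# step 4: remove punctuation marks
text = re.sub(r"[\W]+", " ", text)
text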
'this is awesome movie i loved it so much i m goona watch it again '
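Next, the extracted emoticons are appended back to the text (step 5, cell not shown):
##
# step 5: put back emoticons, dropping the "-" nose character
text = text + " ".join(emoticons).replace("-", "")
text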
'this is awesome movie i loved it so much i m goona watch it again :) :)'
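Finally, the text is split into word tokens (step 6, cell not shown):
##
# step 6: generate word tokens
text.split()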
['this',
'is',
'awesome',
'movie',
'i',
'loved',
'it',
'so',
'much',
'i',
'm',
'goona',
'watch',
'it',
'again',
':)',
':)']
Let’s put all the preprocessing steps in a nice function and give it a name.
def tokenizer(text):
    # step 1: remove HTML tags; they are not helpful in understanding the sentiment of a review
    text = re.sub(r"<[^>]*>", "", text)
    # step 2: use lowercase for all text to keep symmetry
    text = text.lower()
    # step 3: extract emoticons and keep them, as they are important sentiment signals
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # step 4: remove punctuation marks
    text = re.sub(r"[\W]+", " ", text)
    # step 5: put back emoticons, dropping the "-" nose character
    text = text + " ".join(emoticons).replace("-", "")
    # step 6: generate word tokens
    tokenized = text.split()
    return tokenized
Apply tokenizer on the example_text to verify the output.
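A sketch of that cell, storing the result as example_tokens (which the next cell uses):
example_tokens = tokenizer(example_text)
example_tokens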
Preparing data dictionary
We are successful in creating word tokens from our example_text. But there is one more problem: some of the tokens repeat. If we convert these tokens into a dictionary along with their frequency counts, we can represent the reviews much more compactly. Let’s do that.
from collections import Counter
token_counts = Counter()
token_counts.update(example_tokens)
token_counts
Counter({'this': 1,
'is': 1,
'awesome': 1,
'movie': 1,
'i': 2,
'loved': 1,
'it': 2,
'so': 1,
'much': 1,
'm': 1,
'goona': 1,
'watch': 1,
'again': 1,
':)': 2})
Let’s sort the output to have the most common words at the top.
sorted_by_freq_tuples = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)
sorted_by_freq_tuples
[('i', 2),
('it', 2),
(':)', 2),
('this', 1),
('is', 1),
('awesome', 1),
('movie', 1),
('loved', 1),
('so', 1),
('much', 1),
('m', 1),
('goona', 1),
('watch', 1),
('again', 1)]
It shows that in our example text, the top places are taken by pronouns (i and it), followed by the emoticon. Though our data is now correctly processed, it still needs to be prepared before it can be fed to a model, because machine learning models work with numbers exclusively. To convert our dictionary of word tokens into integers, we can take help from the vocab factory in torchtext.vocab. Its purpose is defined in the official documentation (link here) as:
Factory method for creating a vocab object which maps tokens to indices.
Note that the ordering in which key value pairs were inserted in the ordered_dict will be respected when building the vocab. Therefore if sorting by token frequency is important to the user, the ordered_dict should be created in a way to reflect this.
It highlights three points:
- It maps tokens to indices
- It requires an ordered dictionary (OrderedDict) to work
- Tokens at the starting indices of the vocab reflect higher frequency
##
# step 1: convert our sorted list of tokens to OrderedDict
from collections import OrderedDict
ordered_dict = OrderedDict(sorted_by_freq_tuples)
ordered_dict
OrderedDict([('i', 2),
('it', 2),
(':)', 2),
('this', 1),
('is', 1),
('awesome', 1),
('movie', 1),
('loved', 1),
('so', 1),
('much', 1),
('m', 1),
('goona', 1),
('watch', 1),
('again', 1)])
##
# step 2: convert the ordered dict to torchtext.vocab
from torchtext.vocab import vocab
vb = vocab(ordered_dict)
vb.get_stoi()
{'goona': 11,
'much': 9,
'm': 10,
'loved': 7,
'watch': 12,
'so': 8,
'movie': 6,
'it': 1,
'again': 13,
'this': 3,
'i': 0,
'awesome': 5,
':)': 2,
'is': 4}
This generated vocabulary shows that tokens with higher frequency (i, it) have been assigned lower indices (or integers). This vocabulary will act as a lookup table for us, and during training, for each word token, we will find the corresponding index from this vocab and pass it to our model.
We have done many steps while processing our example_text. Let’s summarize them here before moving further.
Summary of data dictionary preparation steps
- Generate tokens from text using the tokenizer function
- Find the frequency of tokens using Python collections.Counter
- Sort the tokens based on their frequency in descending order
- Put the sorted tokens in Python collections.OrderedDict
- Convert the tokens into integers using torchtext.vocab
Let’s apply all these steps on our IMDB reviews training dataset.
##
# step 1: convert reviews into tokens
# step 2: find frequency of tokens
token_counts = Counter()
for label, line in train_dataset:
tokens = tokenizer(line)
token_counts.update(tokens)
print('IMDB vocab size:', len(token_counts))
IMDB vocab size: 69023
After tokenizing the IMDB reviews, we find that there are 69023 unique tokens.
##
# step 3: sort the token based on their frequency
# step 4: put the sorted tokens in OrderedDict
# step 5: convert token to integers using vocab object
sorted_by_freq_tuples = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)
vb = vocab(ordered_dict)
vb.insert_token("<pad>", 0) # special token for padding
vb.insert_token("<unk>", 1) # special token for unknown words
vb.set_default_index(1)
# print some token indexes from vocab
for token in ["this", "is", "an", "example"]:
print(token, " --> ", vb[token])
this --> 11
is --> 7
an --> 35
example --> 457
We have added two extra tokens to our vocabulary.
- “<pad>” for padding. This token will come in handy when we pad our reviews to make them the same length
- “<unk>” for unknown words. This token will come in handy if we find any token in the validation or test set that was not part of the training set
Let’s also print the tokens present at the first ten indices of our vocab object.
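That cell is not shown; a sketch using the vocab’s index-to-string list:
##
# tokens at the first ten indices of the vocab
vb.get_itos()[:10]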
It shows that articles, prepositions, and pronouns are the most common words in the training dataset. So let’s also check the least common words.
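Again as a sketch, the least frequent tokens sit at the end of the index-to-string list:
##
# tokens at the last ten indices of the vocab (lowest frequency)
vb.get_itos()[-10:]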
['hairband',
'ratt',
'bettiefile',
'queueing',
'johansen',
'hemmed',
'jardine',
'morland',
'seriousuly',
'fictive']
The least common words seem to be names of people or places, rare words like ‘queueing’, or misspellings like ‘seriousuly’.
Define data processing pipelines
At this point, we have our tokenizer function and vocabulary lookup ready. For each review item from the dataset, we are supposed to perform the following preprocessing steps:
For review text
- Create tokens from the review text
- Assign a unique integer to each token from the vocab lookup
For review label
- Assign 1 for a pos label and 0 for a neg label
Let’s create two simple functions (inline lambdas) for review text and label processing, as sketched below.
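A minimal sketch of these two pipelines (the names text_pipeline and label_pipeline are reused by the later cells; the exact lambdas are my assumption, consistent with the output below):
##
# map review text to a list of vocab indices, and labels to 1.0 (pos) / 0.0 (neg)
text_pipeline = lambda x: [vb[token] for token in tokenizer(x)]
label_pipeline = lambda x: 1.0 if x == "pos" else 0.0

# applying the text pipeline to the earlier example_text
text_pipeline(example_text)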
[11, 7, 1166, 18, 10, 450, 8, 37, 74, 10, 142, 1, 104, 8, 174, 2287, 2287]
Instead of processing a single review at a time, we always prefer to work with a batch of reviews during model training. For each review item in the batch, we perform the same preprocessing steps, i.e., review text processing and label processing. To handle this at the batch level, we can create another higher-level function that applies the preprocessing to a whole batch.
##
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda
##
# a function to apply pre-processing steps at a batch level
import torch.nn as nn
def collate_batch(batch):
label_list, text_list, lengths = [], [], []
# iterate over all reviews in a batch
for _label, _text in batch:
# label preprocessing
label_list.append(label_pipeline(_label))
# text preprocessing
processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
# store the processed text in a list
text_list.append(processed_text)
# store the length of processed text
# this will come handy in future when we want to know the original size of a text (without padding)
lengths.append(processed_text.size(0))
label_list = torch.tensor(label_list)
lengths = torch.tensor(lengths)
# pad the processed reviews to make their lengths consistent
padded_text_list = nn.utils.rnn.pad_sequence(
text_list, batch_first=True)
# return
# 1. a list of processed and padded review texts
# 2. a list of processed labels
# 3. a list of review text original lengths (before padding)
return padded_text_list.to(device), label_list.to(device), lengths.to(device)
Sequence padding
In the above collate_batch function, I added one extra padding step:
padded_text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True)
We intend to make all review texts in a batch the same length. For this, we take the maximum text length in the batch and pad all the shorter texts with extra dummy tokens (<pad>) to make their sizes equal. Finally, with all the data in the batch having the same dimensions, we convert it into a tensor matrix for faster processing.
To understand how the PyTorch utility nn.utils.rnn.pad_sequence works, we can take a simple example of three tensors (a, b, c) of varying sizes (1, 3, 5).
##
# initialize three tensors of varying sizes
a = torch.tensor([1])
b = torch.tensor([2, 3, 4])
c = torch.tensor([5, 6, 7, 8, 9])
a, b, c
(tensor([1]), tensor([2, 3, 4]), tensor([5, 6, 7, 8, 9]))
Now let’s pad them to make their sizes consistent.
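The padding cell is not shown; a sketch (pad_seq is reused by the packing cell below; with the default batch_first=False the result is a max_length x batch matrix):
##
# pad the three tensors into a single matrix; shorter tensors are filled with 0
pad_seq = nn.utils.rnn.pad_sequence([a, b, c])
pad_seq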
Sequence packing
From the above output, we can see that after padding tensors of varying sizes, we can convert them into a single matrix for faster processing. The drawback of this approach is that we can end up with many padded tokens in our matrix; they do not help us in any way and instead occupy a lot of machine memory. To avoid this, we can squish these matrices into a much more condensed form called packed padded sequences using the PyTorch utility nn.utils.rnn.pack_padded_sequence.
pack_pad_seq = nn.utils.rnn.pack_padded_sequence(
pad_seq, [1, 3, 5], enforce_sorted=False, batch_first=False
)
pack_pad_seq.data
tensor([5, 2, 1, 6, 3, 7, 4, 8, 9])
Here the tensor still holds all the original values (1 to 9) but is very condensed and has no extra padding tokens. So how does this tensor know which values belong to which sequence? For this, it stores some additional information:
- batch sizes (how many sequences are still active at each time step, derived from the original tensor lengths)
- the sorted/unsorted tensor indices
We can move back and forth between the packed and unpacked (padded) sequences using this information, as sketched below.
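A sketch of that bookkeeping and of the round trip back to the padded form:
##
# inspect the extra information stored in the packed sequence
print(pack_pad_seq.batch_sizes)     # number of active sequences at each time step
print(pack_pad_seq.sorted_indices)  # mapping between sorted and original batch order

# unpack to recover the padded matrix and the original lengths
unpacked, lens = nn.utils.rnn.pad_packed_sequence(pack_pad_seq, batch_first=False)
print(unpacked)
print(lens)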
Run data preprocessing pipelines on an example batch
Let’s load our data into the PyTorch DataLoader class and create a small batch of 4 reviews, preprocessing the batch with the collate_batch function.
from torch.utils.data import DataLoader
dataloader = DataLoader(
train_dataset, batch_size=4, shuffle=False, collate_fn=collate_batch
)
text_batch, label_batch, length_batch = next(iter(dataloader))
print("text_batch.shape: ", text_batch.shape)
print("label_batch: ", label_batch)
print("length_batch: ", length_batch)
text_batch.shape: torch.Size([4, 218])
label_batch: tensor([1., 1., 1., 0.], device='cuda:0')
length_batch: tensor([165, 86, 218, 145], device='cuda:0')
- text_batch.shape: torch.Size([4, 218]) tells us that this batch contains four reviews (as token sequences), all padded to the same length of 218
- label_batch: tensor([1., 1., 1., 0.]) tells us that the first three reviews are positive and the last one is negative
- length_batch: tensor([165, 86, 218, 145]) gives the original lengths of the review token sequences before padding
Let’s check what the first review in this batch looks like after preprocessing and padding.
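A sketch of the corresponding cell, indexing into the padded batch:
text_batch[0]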
tensor([ 35, 1739, 7, 449, 721, 6, 301, 4, 787, 9,
4, 18, 44, 2, 1705, 2460, 186, 25, 7, 24,
100, 1874, 1739, 25, 7, 34415, 3568, 1103, 7517, 787,
5, 2, 4991, 12401, 36, 7, 148, 111, 939, 6,
11598, 2, 172, 135, 62, 25, 3199, 1602, 3, 928,
1500, 9, 6, 4601, 2, 155, 36, 14, 274, 4,
42945, 9, 4991, 3, 14, 10296, 34, 3568, 8, 51,
148, 30, 2, 58, 16, 11, 1893, 125, 6, 420,
1214, 27, 14542, 940, 11, 7, 29, 951, 18, 17,
15994, 459, 34, 2480, 15211, 3713, 2, 840, 3200, 9,
3568, 13, 107, 9, 175, 94, 25, 51, 10297, 1796,
27, 712, 16, 2, 220, 17, 4, 54, 722, 238,
395, 2, 787, 32, 27, 5236, 3, 32, 27, 7252,
5118, 2461, 6390, 4, 2873, 1495, 15, 2, 1054, 2874,
155, 3, 7015, 7, 409, 9, 41, 220, 17, 41,
390, 3, 3925, 807, 37, 74, 2858, 15, 10297, 115,
31, 189, 3506, 667, 163, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
device='cuda:0')
To complete the picture, I have re-printed the original text of the first review and manually processed a part of it. You can verify that the tokens match.
('pos',
'An extra is called upon to play a general in a movie about the Russian Revolution. However, he is not any ordinary extra. He is Serguis Alexander, former commanding general of the Russia armies who is now being forced to relive the same scene, which he suffered professional and personal tragedy in, to satisfy the director who was once a revolutionist in Russia and was humiliated by Alexander. It can now be the time for this broken man to finally "win" his penultimate battle. This is one powerful movie with meticulous direction by Von Sternberg, providing the greatest irony in Alexander\'s character in every way he can. Jannings deserved his Oscar for the role with a very moving performance playing the general at his peak and at his deepest valley. Powell lends a sinister support as the revenge minded director and Brent is perfect in her role with her face and movements showing so much expression as Jannings\' love. All around brilliance. Rating, 10.')
Batching the training, validation, and test dataset
Let’s proceed with creating DataLoaders for the train, validation, and test data with batch_size = 32.
batch_size = 32
train_dl = DataLoader(
train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch
)
valid_dl = DataLoader(
valid_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_batch
)
test_dl = DataLoader(
test_dataset_raw, batch_size=batch_size, shuffle=False, collate_fn=collate_batch
)
Define model training and evaluation pipelines
I have defined two simple functions to train and evaluate the model in this section.
##
# model training pipeline
# https://github.com/rasbt/machine-learning-book/blob/main/ch15/ch15_part2.ipynb
def train(dataloader):
model.train()
total_acc, total_loss = 0, 0
for text_batch, label_batch, lengths in dataloader:
optimizer.zero_grad()
pred = model(text_batch, lengths)[:, 0]
loss = loss_fn(pred, label_batch)
loss.backward()
optimizer.step()
total_acc += ((pred >= 0.5).float() == label_batch).float().sum().item()
total_loss += loss.item() * label_batch.size(0)
return total_acc / len(dataloader.dataset), total_loss / len(dataloader.dataset)
# model evaluation pipeline
def evaluate(dataloader):
model.eval()
total_acc, total_loss = 0, 0
with torch.no_grad():
for text_batch, label_batch, lengths in dataloader:
pred = model(text_batch, lengths)[:, 0]
loss = loss_fn(pred, label_batch)
total_acc += ((pred >= 0.5).float() == label_batch).float().sum().item()
total_loss += loss.item() * label_batch.size(0)
return total_acc / len(dataloader.dataset), total_loss / len(dataloader.dataset)
RNN model configuration, loss function, and optimizer
We have seen that review texts can be long sequences. We will use an LSTM layer to capture the long-term dependencies. Our sentiment analysis model is composed of the following layers:
- Start with an Embedding layer. An alternative would be one-hot-encoding, where each word token is converted to a separate feature (or vector or column), but this leads to too many features (the curse of dimensionality). To avoid this, we map tokens to fixed-size dense vectors. During training, the positions of these vectors are learned and updated so that similar tokens end up closer and closer together. Such a layer is termed an embedding layer.
- After the embedding layer, there is the RNN layer (LSTM to be specific).
- Then we have a fully connected layer followed by activation and another fully connected layer.
- Finally, we have a logistic sigmoid layer for the prediction.
##
# https://github.com/rasbt/machine-learning-book/blob/main/ch15/ch15_part2.ipynb
class RNN(nn.Module):
def __init__(self, vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
self.fc1 = nn.Linear(rnn_hidden_size, fc_hidden_size)
self.relu = nn.ReLU()
self.fc2 = nn.Linear(fc_hidden_size, 1)
self.sigmoid = nn.Sigmoid()
def forward(self, text, lengths):
out = self.embedding(text)
out = nn.utils.rnn.pack_padded_sequence(
out, lengths.cpu().numpy(), enforce_sorted=False, batch_first=True
)
out, (hidden, cell) = self.rnn(out)
out = hidden[-1, :, :]
out = self.fc1(out)
out = self.relu(out)
out = self.fc2(out)
out = self.sigmoid(out)
return out
Define model loss function and optimizer
For the loss function (or criterion), I have used Binary Cross Entropy, and for loss optimization, I have used the Adam algorithm.
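The configuration cell is not shown in this section; a sketch of it (the embedding and hidden sizes here are assumptions, not necessarily the exact values behind the results below):
##
# instantiate the model, loss function, and optimizer
vocab_size = len(vb)
embed_dim = 20
rnn_hidden_size = 64
fc_hidden_size = 64

torch.manual_seed(1)
model = RNN(vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size).to(device)
loss_fn = nn.BCELoss()          # Binary Cross Entropy
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)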
Model training and evaluation
Let’s run the pipeline for ten epochs and compare the training and validation accuracy.
num_epochs = 10
for epoch in range(num_epochs):
acc_train, loss_train = train(train_dl)
acc_valid, loss_valid = evaluate(valid_dl)
print(
f"Epoch {epoch} train accuracy: {acc_train:.4f}; val accuracy: {acc_valid:.4f}"
)
Epoch 0 train accuracy: 0.6085; val accuracy: 0.6502
Epoch 1 train accuracy: 0.7206; val accuracy: 0.7462
Epoch 2 train accuracy: 0.7613; val accuracy: 0.6250
Epoch 3 train accuracy: 0.8235; val accuracy: 0.8232
Epoch 4 train accuracy: 0.8819; val accuracy: 0.8482
Epoch 5 train accuracy: 0.9132; val accuracy: 0.8526
Epoch 6 train accuracy: 0.9321; val accuracy: 0.8374
Epoch 7 train accuracy: 0.9504; val accuracy: 0.8502
Epoch 8 train accuracy: 0.9643; val accuracy: 0.8608
Epoch 9 train accuracy: 0.9747; val accuracy: 0.8636
Evaluate sentiments on random texts
Let’s create another helper method to evaluate sentiments on random texts.
def classify_review(text):
text_list, lengths = [], []
# process review text with text_pipeline
# note: "text_pipeline" has dependency on data vocabulary
processed_text = torch.tensor(text_pipeline(text), dtype=torch.int64)
text_list.append(processed_text)
# get processed review tokens length
lengths.append(processed_text.size(0))
lengths = torch.tensor(lengths)
# add a batch dimension, e.g. from torch.Size([8]) to torch.Size([1, 8])
# nn.utils.rnn.pad_sequence(text_list, batch_first=True) does this too
padded_text_list = torch.unsqueeze(processed_text, 0)
# move tensors to correct device
padded_text_list = padded_text_list.to(device)
lengths = lengths.to(device)
# get prediction
model.eval()
pred = model(padded_text_list, lengths)
print("model pred: ", pred)
# positive or negative review
review_class = 'negative' # else case
if (pred>=0.5) == 1:
review_class = "positive"
print("review type: ", review_class)
model pred: tensor([[0.9388]], device='cuda:0', grad_fn=<SigmoidBackward0>)
review type: positive