Predicting the Sentiment of IMDB Movie Reviews using LSTM in PyTorch
Credits
This notebook takes inspiration and ideas from the following sources.
- “Machine learning with PyTorch and Scikit-Learn” by “Sebastian Raschka, Yuxi (Hayden) Liu, and Vahid Mirjalili”. You can get the book from its website: Machine learning with PyTorch and Scikit-Learn. In addition, the GitHub repository for this book has valuable notebooks: github.com/rasbt/machine-learning-book. Parts of the code you see in this notebook are taken from chapter 15 notebook of the same book.
- “Intro to Deep Learning and Generative Models Course” lecture series from “Sebastian Raschka”. Course website: stat453-ss2021. YouTube Link: Intro to Deep Learning and Generative Models Course. Lectures that are related to this post are L15.5 Long Short-Term Memory and L15.7 An RNN Sentiment Classifier in PyTorch
Environment
This notebook is prepared with Google Colab.
- GitHub: 2022-11-09-pytorch-lstm-imdb-sentiment-prediction.ipynb
- Open In Colab:
Under “Runtime type”, choose “GPU” as the hardware accelerator. The training will take a long time to complete without a GPU.
This notebook also depends on the PyTorch library TorchText, which we will use to fetch the IMDB review data. While using the latest torchtext version, I ran into additional dependencies on other libraries such as torchdata. Even after resolving them, it threw strange encoding errors while fetching the IMDB data. So I downgraded torchtext to the most recent version I found to work without external dependencies. Consequently, torch is also downgraded to a compatible version, but I did not run into any issues using this lower version of PyTorch for this notebook. It is best to restart the runtime after the library installation is complete.
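The install cell itself is not shown in this post; in Colab, pinning the versions listed below would look something like this (a sketch):
!pip install torchtext==0.11.0  # torch==1.10.0 is pulled in as a compatible dependency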
Code
python==3.7.15
numpy==1.21.6
torch==1.10.0+cu102
torchtext==0.11.0
matplotlib==3.2.2
Data Preparation
Download data
Let’s download our movie review dataset. This dataset is also known as the Large Movie Review Dataset and can also be obtained as a compressed archive from this link. Using the torchtext library makes downloading, extracting, and reading the files a lot easier. ‘torchtext.datasets’ comes with many more NLP-related datasets, and a full list can be found here.
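The download cell is not shown here; a minimal sketch using torchtext 0.11 (the variable names train_dataset_raw and test_dataset_raw are assumptions chosen to match the later cells):
##
# fetch the raw IMDB train and test splits via torchtext
from torchtext.datasets import IMDB
train_dataset_raw, test_dataset_raw = IMDB(split=("train", "test"))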
Check the size of the downloaded data.
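One way to do that (assuming torchtext’s default download root of .data):
!du -sh .data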
Split train data further into train and validation set
Both train and test datasets have 25000 reviews. Therefore, we can split the training set further into the train and validation sets.
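The split cell is not shown; a sketch using torch.utils.data.random_split with a 20000/5000 split (the split sizes and the seed are assumptions):
##
# split the 25000 training reviews into 20000 train and 5000 validation examples
import torch
from torch.utils.data.dataset import random_split

torch.manual_seed(1)
train_dataset, valid_dataset = random_split(list(train_dataset_raw), [20000, 5000])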
How does this data look?
The data we have is in the form of tuples. The first index has the sentiment label, and the second contains the review text. Let’s check the first element in our training dataset.
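For example (a sketch; after the split above, train_dataset supports plain indexing):
train_dataset[0]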
('pos',
'An extra is called upon to play a general in a movie about the Russian Revolution. However, he is not any ordinary extra. He is Serguis Alexander, former commanding general of the Russia armies who is now being forced to relive the same scene, which he suffered professional and personal tragedy in, to satisfy the director who was once a revolutionist in Russia and was humiliated by Alexander. It can now be the time for this broken man to finally "win" his penultimate battle. This is one powerful movie with meticulous direction by Von Sternberg, providing the greatest irony in Alexander\'s character in every way he can. Jannings deserved his Oscar for the role with a very moving performance playing the general at his peak and at his deepest valley. Powell lends a sinister support as the revenge minded director and Brent is perfect in her role with her face and movements showing so much expression as Jannings\' love. All around brilliance. Rating, 10.')
Check the first element of the validation set.
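Again as a sketch:
valid_dataset[0]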
('neg',
'The Dereks did seem to struggle to find rolls for Bo after "10".<br /><br />I used to work for a marine park in the Florida Keys. One day, the script for "Ghosts Can\'t Do It" was circulating among the trainers in the "fish house" where food was prepared for the dolphins. There was one scene where a -dolphin- supposedly propositions Bo (or Bo the dolphin), asking to "go make eggs." Reading the script, we -lauuughed-...<br /><br />We did not end up doing any portion of this movie at our facility, although our dolphins -were- in "The Big Blue!"<br /><br />This must have been very close to the end of Anthony Quinn\'s life. I hope he had fun in this film, as it certainly didn\'t do anything for his legacy.')
Data preprocessing steps
From these two reviews, we can deduce that
- We have two labels. ‘pos’ for a positive and ‘neg’ for a negative review
- From the second review (from valid_dataset), we can also see that the text may contain HTML tags, special characters, and emoticons besides normal English words. Some preprocessing will be required to handle them for proper word tokenization.
- Reviews can have varying text lengths. It will require some padding to make all review texts the same size.
Let’s take a simple text example and walk through these steps to understand why they are essential in preprocessing. In the last step, we will create tokens from the preprocessed text.
example_text = '''This is awesome movie <br /><br />. I loved it so much :-) I\'m goona watch it again :)'''
example_text
"This is awesome movie <br /><br />. I loved it so much :-) I'm goona watch it again :)"
##
# step 1. remove HTML tags. they are not helpful in understanding the sentiments of a review
import re
text = re.sub('<[^>]*>', '', example_text)
text
"This is awesome movie . I loved it so much :-) I'm goona watch it again :)"
"this is awesome movie . i loved it so much :-) i'm goona watch it again :)"
##
# step 3: extract emoticons. keep them as they are important sentiment signals
emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
emoticons
[':-)', ':)']
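The step-4 cell is not shown; it replaces runs of punctuation and other non-word characters with a single space:
##
# step 4: remove punctuation marks
text = re.sub(r"[\W]+", " ", text)
text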
'this is awesome movie i loved it so much i m goona watch it again '
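Next, the extracted emoticons are appended back to the text (step 5, cell not shown):
##
# step 5: put back emoticons, dropping the "-" nose character
text = text + " ".join(emoticons).replace("-", "")
text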
'this is awesome movie i loved it so much i m goona watch it again :) :)'
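Finally, the text is split into word tokens (step 6, cell not shown):
##
# step 6: generate word tokens
text.split()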
['this',
'is',
'awesome',
'movie',
'i',
'loved',
'it',
'so',
'much',
'i',
'm',
'goona',
'watch',
'it',
'again',
':)',
':)']
Let’s put all the preprocessing steps in a nice function and give it a name.
def tokenizer(text):
    # step 1: remove HTML tags; they are not helpful in understanding the sentiment of a review
    text = re.sub(r"<[^>]*>", "", text)
    # step 2: use lowercase for all text to keep symmetry
    text = text.lower()
    # step 3: extract emoticons and keep them, as they are important sentiment signals
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # step 4: remove punctuation marks
    text = re.sub(r"[\W]+", " ", text)
    # step 5: put back emoticons, dropping the "-" nose character
    text = text + " ".join(emoticons).replace("-", "")
    # step 6: generate word tokens
    tokenized = text.split()
    return tokenized
Apply tokenizer on the example_text to verify the output.
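A sketch of that cell, storing the result as example_tokens (which the next cell uses):
example_tokens = tokenizer(example_text)
example_tokens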
Preparing data dictionary
We are successful in creating word tokens from our example_text. But there is one more problem: some of the tokens repeat. If we convert these tokens into a dictionary along with their frequency counts, we can represent the reviews much more compactly. Let’s do that.
from collections import Counter
token_counts = Counter()
token_counts.update(example_tokens)
token_counts
Counter({'this': 1,
'is': 1,
'awesome': 1,
'movie': 1,
'i': 2,
'loved': 1,
'it': 2,
'so': 1,
'much': 1,
'm': 1,
'goona': 1,
'watch': 1,
'again': 1,
':)': 2})
Let’s sort the output to have the most common words at the top.
sorted_by_freq_tuples = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)
sorted_by_freq_tuples
[('i', 2),
('it', 2),
(':)', 2),
('this', 1),
('is', 1),
('awesome', 1),
('movie', 1),
('loved', 1),
('so', 1),
('much', 1),
('m', 1),
('goona', 1),
('watch', 1),
('again', 1)]
It shows that in our example text, the top places are taken by pronouns (i and it), followed by the emoticon. Though our data is now correctly processed, it still needs to be prepared before it can be fed to a model, because machine learning models work with numbers exclusively. To convert our dictionary of word tokens into integers, we can take help from the vocab factory in torchtext.vocab. Its purpose is defined in the official documentation (link here) as:
Factory method for creating a vocab object which maps tokens to indices.
Note that the ordering in which key value pairs were inserted in the ordered_dict will be respected when building the vocab. Therefore if sorting by token frequency is important to the user, the ordered_dict should be created in a way to reflect this.
It highlights three points:
- It maps tokens to indices
- It requires an ordered dictionary (OrderedDict) to work
- Tokens at the starting indices of the vocab reflect higher frequency
##
# step 1: convert our sorted list of tokens to OrderedDict
from collections import OrderedDict
ordered_dict = OrderedDict(sorted_by_freq_tuples)
ordered_dict
OrderedDict([('i', 2),
('it', 2),
(':)', 2),
('this', 1),
('is', 1),
('awesome', 1),
('movie', 1),
('loved', 1),
('so', 1),
('much', 1),
('m', 1),
('goona', 1),
('watch', 1),
('again', 1)])
##
# step 2: convert the ordered dict to torchtext.vocab
from torchtext.vocab import vocab
vb = vocab(ordered_dict)
vb.get_stoi()
{'goona': 11,
'much': 9,
'm': 10,
'loved': 7,
'watch': 12,
'so': 8,
'movie': 6,
'it': 1,
'again': 13,
'this': 3,
'i': 0,
'awesome': 5,
':)': 2,
'is': 4}
This generated vocabulary shows that tokens with higher frequency (i, it) have been assigned lower indices (or integers). This vocabulary will act as a lookup table for us, and during training, for each word token, we will find the corresponding index from this vocab and pass it to our model.
We have done many steps while processing our example_text. Let’s summarize them here before moving further.
Summary of data dictionary preparation steps
- Generate tokens from text using the tokenizer function
- Find the frequency of tokens using Python collections.Counter
- Sort the tokens based on their frequency in descending order
- Put the sorted tokens in Python collections.OrderedDict
- Convert the tokens into integers using torchtext.vocab
Let’s apply all these steps on our IMDB reviews training dataset.
##
# step 1: convert reviews into tokens
# step 2: find frequency of tokens
token_counts = Counter()
for label, line in train_dataset:
tokens = tokenizer(line)
token_counts.update(tokens)
print('IMDB vocab size:', len(token_counts))
IMDB vocab size: 69023
After tokenizing the IMDB reviews, we find that there are 69023 unique tokens.
##
# step 3: sort the token based on their frequency
# step 4: put the sorted tokens in OrderedDict
# step 5: convert token to integers using vocab object
sorted_by_freq_tuples = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)
vb = vocab(ordered_dict)
vb.insert_token("<pad>", 0) # special token for padding
vb.insert_token("<unk>", 1) # special token for unknown words
vb.set_default_index(1)
# print some token indexes from vocab
for token in ["this", "is", "an", "example"]:
print(token, " --> ", vb[token])
this --> 11
is --> 7
an --> 35
example --> 457
We have added two extra tokens to our vocabulary.
- “<pad>” for padding. This token will come in handy when we pad our reviews to make them the same length
- “<unk>” for unknown words. This token will come in handy if we find any token in the validation or test set that was not part of the training set
Let’s also print the tokens present at the first ten indices of our vocab object.
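That cell is not shown; a sketch using the vocab’s index-to-string list:
##
# tokens at the first ten indices of the vocab
vb.get_itos()[:10]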
It shows that articles, prepositions, and pronouns are the most common words in the training dataset. So let’s also check the least common words.
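Again as a sketch, the least frequent tokens sit at the end of the index-to-string list:
##
# tokens at the last ten indices of the vocab (lowest frequency)
vb.get_itos()[-10:]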
['hairband',
'ratt',
'bettiefile',
'queueing',
'johansen',
'hemmed',
'jardine',
'morland',
'seriousuly',
'fictive']
The least common words seem to be names of people or places, rare words like ‘queueing’, or misspellings like ‘seriousuly’.
Define data processing pipelines
At this point, we have our tokenizer function and vocabulary lookup ready. For each review item from the dataset, we are supposed to perform the following preprocessing steps:
For review text
- Create tokens from the review text
- Assign a unique integer to each token from the vocab lookup
For review label
- Assign 1 for a pos label and 0 for a neg label
Let’s create two simple functions (inline lambdas) for review text and label processing, as sketched below.
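A minimal sketch of these two pipelines (the names text_pipeline and label_pipeline are reused by the later cells; the exact lambdas are my assumption, consistent with the output below):
##
# map review text to a list of vocab indices, and labels to 1.0 (pos) / 0.0 (neg)
text_pipeline = lambda x: [vb[token] for token in tokenizer(x)]
label_pipeline = lambda x: 1.0 if x == "pos" else 0.0

# applying the text pipeline to the earlier example_text
text_pipeline(example_text)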
[11, 7, 1166, 18, 10, 450, 8, 37, 74, 10, 142, 1, 104, 8, 174, 2287, 2287]
Instead of processing a single review at a time, we always prefer to work with a batch of reviews during model training. For each review item in the batch, we perform the same preprocessing steps, i.e., review text processing and label processing. To handle this at the batch level, we can create another higher-level function that applies the preprocessing to a whole batch.
##
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda
##
# a function to apply pre-processing steps at a batch level
import torch.nn as nn
def collate_batch(batch):
label_list, text_list, lengths = [], [], []
# iterate over all reviews in a batch
for _label, _text in batch:
# label preprocessing
label_list.append(label_pipeline(_label))
# text preprocessing
processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
# store the processed text in a list
text_list.append(processed_text)
# store the length of processed text
# this will come handy in future when we want to know the original size of a text (without padding)
lengths.append(processed_text.size(0))
label_list = torch.tensor(label_list)
lengths = torch.tensor(lengths)
# pad the processed reviews to make their lengths consistent
padded_text_list = nn.utils.rnn.pad_sequence(
text_list, batch_first=True)
# return
# 1. a list of processed and padded review texts
# 2. a list of processed labels
# 3. a list of review text original lengths (before padding)
return padded_text_list.to(device), label_list.to(device), lengths.to(device)
Sequence padding
In the above collate_batch function, I added one extra padding step:
padded_text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True)
We intend to make all review texts in a batch the same length. For this, we take the maximum text length in the batch and pad all the shorter texts with extra dummy tokens (<pad>) to make their sizes equal. Finally, with all the data in the batch having the same dimensions, we convert it into a tensor matrix for faster processing.
To understand how the PyTorch utility nn.utils.rnn.pad_sequence works, we can take a simple example of three tensors (a, b, c) of varying sizes (1, 3, 5).
##
# initialize three tensors of varying sizes
a = torch.tensor([1])
b = torch.tensor([2, 3, 4])
c = torch.tensor([5, 6, 7, 8, 9])
a, b, c
(tensor([1]), tensor([2, 3, 4]), tensor([5, 6, 7, 8, 9]))
Now let’s pad them to make their sizes consistent.
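The padding cell is not shown; a sketch (pad_seq is reused by the packing cell below; with the default batch_first=False the result is a max_length x batch matrix):
##
# pad the three tensors into a single matrix; shorter tensors are filled with 0
pad_seq = nn.utils.rnn.pad_sequence([a, b, c])
pad_seq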
Sequence packing
From the above output, we can see that after padding tensors of varying sizes, we can convert them into a single matrix for faster processing. The drawback of this approach is that we can end up with many padded tokens in our matrix; they do not help us in any way and instead occupy a lot of machine memory. To avoid this, we can squish these matrices into a much more condensed form called packed padded sequences using the PyTorch utility nn.utils.rnn.pack_padded_sequence.
pack_pad_seq = nn.utils.rnn.pack_padded_sequence(
pad_seq, [1, 3, 5], enforce_sorted=False, batch_first=False
)
pack_pad_seq.data
tensor([5, 2, 1, 6, 3, 7, 4, 8, 9])
Here the tensor still holds all the original values (1 to 9) but is very condensed and has no extra padding tokens. So how does this tensor know which values belong to which sequence? For this, it stores some additional information:
- batch sizes (how many sequences are still active at each time step, derived from the original tensor lengths)
- the sorted/unsorted tensor indices
We can move back and forth between the packed and unpacked (padded) sequences using this information, as sketched below.
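A sketch of that bookkeeping and of the round trip back to the padded form:
##
# inspect the extra information stored in the packed sequence
print(pack_pad_seq.batch_sizes)     # number of active sequences at each time step
print(pack_pad_seq.sorted_indices)  # mapping between sorted and original batch order

# unpack to recover the padded matrix and the original lengths
unpacked, lens = nn.utils.rnn.pad_packed_sequence(pack_pad_seq, batch_first=False)
print(unpacked)
print(lens)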
Run data preprocessing pipelines on an example batch
Let’s load our data into the PyTorch DataLoader class and create a small batch of 4 reviews, preprocessing the batch with the collate_batch function.
from torch.utils.data import DataLoader
dataloader = DataLoader(
train_dataset, batch_size=4, shuffle=False, collate_fn=collate_batch
)
text_batch, label_batch, length_batch = next(iter(dataloader))
print("text_batch.shape: ", text_batch.shape)
print("label_batch: ", label_batch)
print("length_batch: ", length_batch)
text_batch.shape: torch.Size([4, 218])
label_batch: tensor([1., 1., 1., 0.], device='cuda:0')
length_batch: tensor([165, 86, 218, 145], device='cuda:0')
- text_batch.shape: torch.Size([4, 218]) tells us that this batch contains four reviews (as token sequences), all padded to the same length of 218
- label_batch: tensor([1., 1., 1., 0.]) tells us that the first three reviews are positive and the last one is negative
- length_batch: tensor([165, 86, 218, 145]) gives the original lengths of the review token sequences before padding
Let’s check what the first review in this batch looks like after preprocessing and padding.
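A sketch of the corresponding cell, indexing into the padded batch:
text_batch[0]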
tensor([ 35, 1739, 7, 449, 721, 6, 301, 4, 787, 9,
4, 18, 44, 2, 1705, 2460, 186, 25, 7, 24,
100, 1874, 1739, 25, 7, 34415, 3568, 1103, 7517, 787,
5, 2, 4991, 12401, 36, 7, 148, 111, 939, 6,
11598, 2, 172, 135, 62, 25, 3199, 1602, 3, 928,
1500, 9, 6, 4601, 2, 155, 36, 14, 274, 4,
42945, 9, 4991, 3, 14, 10296, 34, 3568, 8, 51,
148, 30, 2, 58, 16, 11, 1893, 125, 6, 420,
1214, 27, 14542, 940, 11, 7, 29, 951, 18, 17,
15994, 459, 34, 2480, 15211, 3713, 2, 840, 3200, 9,
3568, 13, 107, 9, 175, 94, 25, 51, 10297, 1796,
27, 712, 16, 2, 220, 17, 4, 54, 722, 238,
395, 2, 787, 32, 27, 5236, 3, 32, 27, 7252,
5118, 2461, 6390, 4, 2873, 1495, 15, 2, 1054, 2874,
155, 3, 7015, 7, 409, 9, 41, 220, 17, 41,
390, 3, 3925, 807, 37, 74, 2858, 15, 10297, 115,
31, 189, 3506, 667, 163, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
device='cuda:0')
To complete the picture, I have re-printed the original text of the first review and manually processed a part of it. You can verify that the tokens match.
('pos',
'An extra is called upon to play a general in a movie about the Russian Revolution. However, he is not any ordinary extra. He is Serguis Alexander, former commanding general of the Russia armies who is now being forced to relive the same scene, which he suffered professional and personal tragedy in, to satisfy the director who was once a revolutionist in Russia and was humiliated by Alexander. It can now be the time for this broken man to finally "win" his penultimate battle. This is one powerful movie with meticulous direction by Von Sternberg, providing the greatest irony in Alexander\'s character in every way he can. Jannings deserved his Oscar for the role with a very moving performance playing the general at his peak and at his deepest valley. Powell lends a sinister support as the revenge minded director and Brent is perfect in her role with her face and movements showing so much expression as Jannings\' love. All around brilliance. Rating, 10.')
Batching the training, validation, and test dataset
Let’s proceed with creating DataLoaders for the train, validation, and test data with batch_size = 32.
batch_size = 32
train_dl = DataLoader(
train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch
)
valid_dl = DataLoader(
valid_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_batch
)
test_dl = DataLoader(
test_dataset_raw, batch_size=batch_size, shuffle=False, collate_fn=collate_batch
)
Define model training and evaluation pipelines
I have defined two simple functions to train and evaluate the model in this section.
##
# model training pipeline
# https://github.com/rasbt/machine-learning-book/blob/main/ch15/ch15_part2.ipynb
def train(dataloader):
model.train()
total_acc, total_loss = 0, 0
for text_batch, label_batch, lengths in dataloader:
optimizer.zero_grad()
pred = model(text_batch, lengths)[:, 0]
loss = loss_fn(pred, label_batch)
loss.backward()
optimizer.step()
total_acc += ((pred >= 0.5).float() == label_batch).float().sum().item()
total_loss += loss.item() * label_batch.size(0)
return total_acc / len(dataloader.dataset), total_loss / len(dataloader.dataset)
# model evaluation pipeline
def evaluate(dataloader):
model.eval()
total_acc, total_loss = 0, 0
with torch.no_grad():
for text_batch, label_batch, lengths in dataloader:
pred = model(text_batch, lengths)[:, 0]
loss = loss_fn(pred, label_batch)
total_acc += ((pred >= 0.5).float() == label_batch).float().sum().item()
total_loss += loss.item() * label_batch.size(0)
return total_acc / len(dataloader.dataset), total_loss / len(dataloader.dataset)
RNN model configuration, loss function, and optimizer
We have seen that review texts can be long sequences. We will use an LSTM layer to capture the long-term dependencies. Our sentiment analysis model is composed of the following layers:
- Start with an Embedding layer. An alternative would be one-hot-encoding, where each word token is converted to a separate feature (or vector or column), but this leads to too many features (the curse of dimensionality). To avoid this, we map tokens to fixed-size dense vectors. During training, the positions of these vectors are learned and updated so that similar tokens end up closer and closer together. Such a layer is termed an embedding layer.
- After the embedding layer, there is the RNN layer (LSTM to be specific).
- Then we have a fully connected layer followed by activation and another fully connected layer.
- Finally, we have a logistic sigmoid layer for the prediction.
##
# https://github.com/rasbt/machine-learning-book/blob/main/ch15/ch15_part2.ipynb
class RNN(nn.Module):
def __init__(self, vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
self.fc1 = nn.Linear(rnn_hidden_size, fc_hidden_size)
self.relu = nn.ReLU()
self.fc2 = nn.Linear(fc_hidden_size, 1)
self.sigmoid = nn.Sigmoid()
def forward(self, text, lengths):
out = self.embedding(text)
out = nn.utils.rnn.pack_padded_sequence(
out, lengths.cpu().numpy(), enforce_sorted=False, batch_first=True
)
out, (hidden, cell) = self.rnn(out)
out = hidden[-1, :, :]
out = self.fc1(out)
out = self.relu(out)
out = self.fc2(out)
out = self.sigmoid(out)
return out
Define model loss function and optimizer
For the loss function (or criterion), I have used Binary Cross Entropy, and for loss optimization, I have used the Adam algorithm.
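The configuration cell is not shown in this section; a sketch of it (the embedding and hidden sizes here are assumptions, not necessarily the exact values behind the results below):
##
# instantiate the model, loss function, and optimizer
vocab_size = len(vb)
embed_dim = 20
rnn_hidden_size = 64
fc_hidden_size = 64

torch.manual_seed(1)
model = RNN(vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size).to(device)
loss_fn = nn.BCELoss()          # Binary Cross Entropy
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)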
Model training and evaluation
Let’s run the pipeline for ten epochs and compare the training and validation accuracy.
num_epochs = 10
for epoch in range(num_epochs):
acc_train, loss_train = train(train_dl)
acc_valid, loss_valid = evaluate(valid_dl)
print(
f"Epoch {epoch} train accuracy: {acc_train:.4f}; val accuracy: {acc_valid:.4f}"
)
Epoch 0 train accuracy: 0.6085; val accuracy: 0.6502
Epoch 1 train accuracy: 0.7206; val accuracy: 0.7462
Epoch 2 train accuracy: 0.7613; val accuracy: 0.6250
Epoch 3 train accuracy: 0.8235; val accuracy: 0.8232
Epoch 4 train accuracy: 0.8819; val accuracy: 0.8482
Epoch 5 train accuracy: 0.9132; val accuracy: 0.8526
Epoch 6 train accuracy: 0.9321; val accuracy: 0.8374
Epoch 7 train accuracy: 0.9504; val accuracy: 0.8502
Epoch 8 train accuracy: 0.9643; val accuracy: 0.8608
Epoch 9 train accuracy: 0.9747; val accuracy: 0.8636
Evaluate sentiments on random texts
Let’s create another helper method to evaluate sentiments on random texts.
def classify_review(text):
text_list, lengths = [], []
# process review text with text_pipeline
# note: "text_pipeline" has dependency on data vocabulary
processed_text = torch.tensor(text_pipeline(text), dtype=torch.int64)
text_list.append(processed_text)
# get processed review tokens length
lengths.append(processed_text.size(0))
lengths = torch.tensor(lengths)
# add a batch dimension, e.g. from torch.Size([8]) to torch.Size([1, 8])
# nn.utils.rnn.pad_sequence(text_list, batch_first=True) does this too
padded_text_list = torch.unsqueeze(processed_text, 0)
# move tensors to correct device
padded_text_list = padded_text_list.to(device)
lengths = lengths.to(device)
# get prediction
model.eval()
pred = model(padded_text_list, lengths)
print("model pred: ", pred)
# positive or negative review
review_class = 'negative' # else case
if (pred>=0.5) == 1:
review_class = "positive"
print("review type: ", review_class)
model pred: tensor([[0.9388]], device='cuda:0', grad_fn=<SigmoidBackward0>)
review type: positive