November 19, 2022
This notebook takes inspiration and ideas from the following sources.
This notebook is prepared with Google Colab.
python==3.7.15
numpy==1.21.6
torch==1.12.1+cu113
matplotlib==3.2.2
Recurrent Neural Networks (RNNs) work well for sequence problems, i.e., predicting the next item in a sequence. Stock prices, for example, are a type of sequence data more commonly known as time-series data. The same notion can be applied in the NLP domain to build a character-level language model. Here the textual data becomes the sequence data, and our model tries to predict the next character in the input text. For training, the input text is broken into sequences of characters and fed to the model one character at a time. The network processes each new character in relation to the previously seen characters and uses this information to predict the next character.
For the input text, we will use a famous English folk tale, Cinderella (though any other text will work equally well). To download the story text, you can use the Project Gutenberg site or Archive.org.
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 45278 100 45278 0 0 38865 0 0:00:01 0:00:01 --:--:-- 38831
The download is complete. We can now open the file and read its contents.
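The cell that reads the file isn't shown in this writeup; a minimal sketch, assuming the download was saved under a hypothetical filename cinderella.txt, could look like this.
# read the downloaded story text into a single string
with open("cinderella.txt", "r", encoding="utf-8") as fp:
    text = fp.read()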
The downloaded text was published as a volunteer effort under Project Gutenberg, which adds project and license information after the original story text as part of its requirements. We are not interested in that boilerplate text, so let's omit it and limit our input text to the folk tale itself.
##
# truncate text to story start and end
start_indx = text.find(
    "There once lived a gentleman and his wife, who were the parents of a\nlovely little daughter."
)
end_indx = text.find("* * * * *")

text = text[start_indx:end_indx]

# total length of the text
print("Total Length (character count):", len(text))
Total Length (character count): 21831
Let’s view the first 500 characters from the story text.
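The display cell isn't shown here; in a Colab cell, evaluating the slice directly prints its repr, which is what appears below (the last 500 characters can be viewed the same way with text[-500:]).
# view the first 500 characters of the story
text[:500]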
'There once lived a gentleman and his wife, who were the parents of a\nlovely little daughter.\n\nWhen this child was only nine years of age, her mother fell sick.\nFinding her death coming on, she called her child to her and said to\nher, "My child, always be good; bear every thing that happens to you\nwith patience, and whatever evil and troubles you may suffer, you will\nbe happy in the end if you are so." Then the poor lady died, and her\ndaughter was full of great grief at the loss of a mother so go'
And the last 500 characters.
'their affection.\nShe was then taken to the palace of the young prince, in whose eyes she\nappeared yet more lovely than before, and who married her shortly after.\n\nCinderella, who was as good as she was beautiful, allowed her sisters to\nlodge in the palace, and gave them in marriage, that same day, to two\nlords belonging to the court.\n\n[Illustration: MARRIAGE OF THE PRINCE AND CINDERELLA.]\n\nThe amiable qualities of Cinderella were as conspicuous after as they\nhad been before marriage.\n\n\n\n\n '
Our data is a string and can't be used to train a model directly, so we have to convert it into integers. For this encoding, we will use a simple methodology: each unique character in the text is assigned an integer, and then all occurrences of that character in the text are replaced with that integer value.
For this, let’s first create a set of all the unique characters in the text.
import numpy as np
# find unique chars from text
char_set = set(text)
print("Unique Characters:", len(char_set))
# sort char set
chars_sorted = sorted(char_set)
print(chars_sorted)
Unique Characters: 65
['\n', ' ', '!', '"', "'", ',', '-', '.', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
We now know all the unique characters in our input text. Accordingly, we can create a dictionary and assign each character in chars_sorted a unique integer.
# encode chars
char2int = {ch: i for i, ch in enumerate(chars_sorted)}
# `char2int` dictionary for char -> int
print(char2int)
{'\n': 0, ' ': 1, '!': 2, '"': 3, "'": 4, ',': 5, '-': 6, '.': 7, ':': 8, ';': 9, '?': 10, 'A': 11, 'B': 12, 'C': 13, 'D': 14, 'E': 15, 'F': 16, 'G': 17, 'H': 18, 'I': 19, 'J': 20, 'K': 21, 'L': 22, 'M': 23, 'N': 24, 'O': 25, 'P': 26, 'Q': 27, 'R': 28, 'S': 29, 'T': 30, 'U': 31, 'V': 32, 'W': 33, 'Y': 34, 'Z': 35, '[': 36, ']': 37, '_': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
Besides the encoding, we also need a way to convert the encoded characters back to their original form. For this, we will use a separate array, int2char, that holds the character for each integer index. Together, char2int and int2char let us move back and forth between encoded and decoded characters.
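The cell that builds the reverse mapping isn't shown; a minimal sketch that matches the numpy-array output below would be the following.
# reverse mapping: int2char[i] gives back the character encoded as integer i
int2char = np.array(chars_sorted)
print(int2char)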
['\n' ' ' '!' '"' "'" ',' '-' '.' ':' ';' '?' 'A' 'B' 'C' 'D' 'E' 'F' 'G'
'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'Y' 'Z'
'[' ']' '_' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o'
'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z']
In this step, we will use the char2int dictionary to encode our story text. The encoded version of text is called text_encoded.
##
# encode original text
text_encoded = np.array([char2int[ch] for ch in text], dtype=np.int32)
print("Text encoded shape: ", text_encoded.shape)
Text encoded shape: (21831,)
Let's use int2char to decode a few characters and return the original text.
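The decoding cell isn't shown; a sketch that reproduces the output below simply iterates over the first few encoded values.
# decode the first five encoded characters back to text
for code in text_encoded[:5]:
    print(code, "->", int2char[code])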
30 -> T
46 -> h
43 -> e
56 -> r
43 -> e
Here is another example of encoding and decoding, this time using multiple words together.
print(text[:18], " == Encoding ==> ", text_encoded[:18])
print(text_encoded[19:41], " == Reverse ==> ", "".join(int2char[text_encoded[19:41]]))
There once lived a == Encoding ==> [30 46 43 56 43 1 53 52 41 43 1 50 47 60 43 42 1 39]
[45 43 52 58 50 43 51 39 52 1 39 52 42 1 46 47 57 1 61 47 44 43] == Reverse ==> gentleman and his wife
We have our encoded data ready. Next, we will convert it into sequences of fixed length. In the chunk inspection below, the last sequence element acts as a target and the remaining elements are the input. For the chunks, we will use a length of 41 (a 40-character input plus the next character).
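The chunking cell isn't shown in this writeup; a sketch, assuming a sliding window of 41 characters over the encoded text (which matches the overlapping chunks inspected below), could be:
seq_length = 40
chunk_size = seq_length + 1  # 41: input plus target

# sliding window of overlapping chunks over the encoded text
text_chunks = [
    text_encoded[i : i + chunk_size]
    for i in range(len(text_encoded) - chunk_size + 1)
]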
##
# inspect the first chunk
for seq in text_chunks[:1]:
    input_seq = seq[:-1]
    target = seq[-1]
    print(input_seq, " -> ", target)
    print(repr("".join(int2char[input_seq])), " -> ", repr("".join(int2char[target])))
[30 46 43 56 43 1 53 52 41 43 1 50 47 60 43 42 1 39 1 45 43 52 58 50
43 51 39 52 1 39 52 42 1 46 47 57 1 61 47 44] -> 43
'There once lived a gentleman and his wif' -> 'e'
##
# inspect the second chunk
for seq in text_chunks[1:2]:
    input_seq = seq[:-1]
    target = seq[-1]
    print(input_seq, " -> ", target)
    print(repr("".join(int2char[input_seq])), " -> ", repr("".join(int2char[target])))
[46 43 56 43 1 53 52 41 43 1 50 47 60 43 42 1 39 1 45 43 52 58 50 43
51 39 52 1 39 52 42 1 46 47 57 1 61 47 44 43] -> 5
'here once lived a gentleman and his wife' -> ','
In this section, we will load our encoded data sequences into the PyTorch Dataset and DataLoader classes to prepare batches for model training.
The class TextDataset is derived from the PyTorch Dataset class. When we fetch a sequence through this class, it returns the sequence as a tuple of input and target.
import torch
from torch.utils.data import Dataset


class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)

    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()  # return input, target


seq_dataset = TextDataset(torch.tensor(text_chunks))
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:15: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:201.)
from ipykernel import kernelapp as app
Each element from the seq_dataset consists of:
- input data that we will feed to the model for training
- target data that we will compare the model output against
Remember that both the input and target sequences are derived from the same encoded text. We train our model to predict the next character from the given input: one character goes into the model, and one character comes out. In the ideal case, the model's output character is the next character in the sequence. Our target sequence is exactly that: the input sequence shifted by one position, so each target element is the next character after the corresponding input element.
for i, (seq, target) in enumerate(seq_dataset):
    print(" Input (x):", repr("".join(int2char[seq])))
    print("Target (y):", repr("".join(int2char[target])))
    print()
    if i == 1:
        break
Input (x): 'There once lived a gentleman and his wif'
Target (y): 'here once lived a gentleman and his wife'
Input (x): 'here once lived a gentleman and his wife'
Target (y): 'ere once lived a gentleman and his wife,'
In this step, we prepare training batches using the PyTorch DataLoader class (a sketch of the setup is shown below).
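The DataLoader cell itself isn't visible in this writeup; a minimal sketch, with the batch size as an assumption (the notebook's actual value isn't shown), could be:
from torch.utils.data import DataLoader

batch_size = 64  # assumed value; the original batch size isn't visible in the outputs
torch.manual_seed(1)
seq_dl = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)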
In this section, we will configure a model for character-level language modeling. This model will have an Embedding layer at the start. Next, output from the embedding layer will be passed to the LSTM layer. Finally, at the output, we have a fully connected linear layer.
For an in-depth analysis of how an Embedding layer works, I recommend the article Embeddings in Machine Learning: Everything You Need to Know.
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"


class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden.to(device), cell.to(device)
torch.manual_seed(1)
# define model dimensions
vocab_size = len(int2char)
embed_dim = 256
rnn_hidden_size = 512
# initialize model
model = RNN(vocab_size, embed_dim, rnn_hidden_size)
model = model.to(device)
model
RNN(
(embedding): Embedding(65, 256)
(rnn): LSTM(256, 512, batch_first=True)
(fc): Linear(in_features=512, out_features=65, bias=True)
)
The final fully connected layer maps the LSTM hidden state to output logits over the vocab_size of 65 classes.
All parts are ready, so let's start the training. The Google Colab "CPU" runtime can take significantly longer to train; I would suggest using the "GPU" runtime instead.
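The loss function and optimizer cells aren't shown above; a common setup for this kind of character-level model, with the learning rate as an assumption, would be:
# multi-class loss over the 65 character classes
loss_fn = nn.CrossEntropyLoss()
# Adam optimizer; the learning rate here is an assumed value
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)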
# for execution time measurement
from timeit import default_timer as timer

num_epochs = 10000

model.train()

start = timer()  # timer start

for epoch in range(num_epochs):
    hidden, cell = model.init_hidden(batch_size)
    seq_batch, target_batch = next(iter(seq_dl))
    seq_batch = seq_batch.to(device)
    target_batch = target_batch.to(device)
    optimizer.zero_grad()
    loss = 0
    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
        loss += loss_fn(pred, target_batch[:, c])
    loss.backward()
    optimizer.step()
    loss = loss.item() / seq_length
    if epoch % 500 == 0:
        print(f"Epoch {epoch} loss: {loss:.4f}")

end = timer()  # timer end
print("Total execution time in seconds: ", "%.2f" % (end - start))
print("Device type: ", device)
Epoch 0 loss: 2.6252
Epoch 500 loss: 0.3377
Epoch 1000 loss: 0.2502
Epoch 1500 loss: 0.2403
Epoch 2000 loss: 0.2501
Epoch 2500 loss: 0.2374
Epoch 3000 loss: 0.2368
Epoch 3500 loss: 0.2499
Epoch 4000 loss: 0.2643
Epoch 4500 loss: 0.2555
Epoch 5000 loss: 0.3854
Epoch 5500 loss: 0.2326
Epoch 6000 loss: 0.2390
Epoch 6500 loss: 0.2270
Epoch 7000 loss: 0.2663
Epoch 7500 loss: 0.3403
Epoch 8000 loss: 0.2475
Epoch 8500 loss: 0.2370
Epoch 9000 loss: 0.2126
Epoch 9500 loss: 0.2308
Total execution time in seconds: 378.14
Device type: cuda
Getting a prediction (text generation) from the model takes some extra work. Since the model is trained on encoded text, the output it generates is also encoded. Furthermore, any input used for prediction needs to be encoded with the same dictionary the model was trained with. For this, we have defined a helper function that does two things:
- It takes the input text and encodes it before passing it to the model.
- It takes the output from the model and decodes it before returning the generated text.
Note that the LSTM model output consists of logits, a hidden state, and a cell state. The logits give us the next predicted character. The hidden state and cell state keep the context (or memory) of the characters processed so far and are supplied back to the model for the next prediction.
From the output logits, we could simply pick the index of the highest logit value as the next character, but then the model would produce exactly the same text for the same input every time. To introduce some randomness, we take help from the PyTorch class torch.distributions.categorical.Categorical, which samples the next character from the predicted distribution. This is how it works.
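To illustrate (this small demo is mine, not part of the original notebook): Categorical treats the logits as an unnormalized distribution and draws indices in proportion to their softmax probabilities, so the class with the highest logit is sampled most often but not always.
import torch
from torch.distributions.categorical import Categorical

torch.manual_seed(1)
# three classes with unequal logits; class 2 is the most likely but not guaranteed
m = Categorical(logits=torch.tensor([1.0, 1.0, 3.0]))
print(m.sample((10,)))  # e.g. mostly 2s with occasional 0s and 1s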
from torch.distributions.categorical import Categorical


def sample(model, starting_str, len_generated_text=500, scale_factor=1.0):
    # encode the starting string
    encoded_input = torch.tensor([char2int[s] for s in starting_str])
    encoded_input = torch.reshape(encoded_input, (1, -1))

    generated_str = starting_str

    model.eval()
    hidden, cell = model.init_hidden(1)
    hidden = hidden.to("cpu")
    cell = cell.to("cpu")

    # feed the starting string (except its last char) to build up the hidden state
    for c in range(len(starting_str) - 1):
        _, hidden, cell = model(encoded_input[:, c].view(1), hidden, cell)

    # generate new characters one at a time, sampling from the scaled logits
    last_char = encoded_input[:, -1]
    for i in range(len_generated_text):
        logits, hidden, cell = model(last_char.view(1), hidden, cell)
        logits = torch.squeeze(logits, 0)
        scaled_logits = logits * scale_factor
        m = Categorical(logits=scaled_logits)
        last_char = m.sample()
        generated_str += str(int2char[last_char])

    return generated_str
We process the text and the model output on the "CPU" device in the sample function, so let's also move the model to the same device.
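The cell isn't shown, but moving the model is a one-liner; evaluating model afterwards prints the summary below.
model = model.to("cpu")
model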
RNN(
(embedding): Embedding(65, 256)
(rnn): LSTM(256, 512, batch_first=True)
(fc): Linear(in_features=512, out_features=65, bias=True)
)
Before generating some lengthy text, let’s experiment with simple words and see if our model can complete them.
At first, I used the string "fat" and asked the model to generate the next three characters to complete the word. At the same time, I passed a tiny scaling factor, which decreases the model's predictability.
Next, I asked the model to complete the same input with the next three characters, but this time I increased the model's predictability ten times (both calls are sketched below). Let's see the output this time.
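The exact cells and scale factors aren't visible in this writeup, so the values here are assumptions; the two calls would look something like this, with the second factor ten times the first.
# assumed scale factors for illustration: a small factor means less predictable,
# a ten-times-larger factor means more predictable
print(sample(model, starting_str="fat", len_generated_text=3, scale_factor=0.2))
print(sample(model, starting_str="fat", len_generated_text=3, scale_factor=2.0))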
The second time, the model generated the correct word "father", which it had seen in the training text. So let's now generate some lengthy texts.
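The generation cells themselves aren't shown; each passage below appears to start from its prompt, so a call like the following sketch would produce similar output.
# generate ~500 characters from a starting prompt matching the first output below
print(sample(model, starting_str="The father"))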
The father too was she was one of those good faeries who protect children. Her
spirits revived, and she wiped away her tears.
The faery took Cinderella by the hand, and old woman, assuming her character of Queen of the
Faeries, that only jumped up behind the
carriage as nimbly as if they had been footmen and laced so tight, touched Cinderella's clothes with her wand, and said, "Now, my dear good child," said the faery, "here you have a coach and
horses, much handsomer than your sisters', to say the least
The mother so good crust. But
if you like to give the household. It was she who washed the dishes, and
scrubbed down the step-sisters were very cruel to Cinderella,
that he did not eat one morsel of the supper.
Cinderella drew the fellow slipper
out of her godmother
would do with it. Her godmother took the pumpkin, and scooped out the
inside of it, leaving nothing but rind; she then struck it with her
godmother then said, "My dear Cinderella,
that he did not eat one morsel of the supper.
Cinderella drew
The three sisters were very cruel to Cinderella,
that he delicacies which she had
received from the prince: but they did not eat one morsel for a
couple of days. They spent their whole time before a looking-glass, and
they would be laced so tight, tossing her head disdainfully, "that I
should lend my clothes to a dirty Cinderella like you!"
Cinderella quite amazed; but their
astonishment at her dancing was still greater.
Gracefulness seemed to play in the attempt.
The long-wished-for evening came at last, an
The lovely prince
immediately jumped up behind the
carriage as nimbly as conspicuous after as they
had been before mocking me," replied the poor girl to do all the
drudgery of the household. It was she who washed the dishes, and
scrubbed down the stairs, who tried with all their might to force their unwould stration: CINDERELLA IS PRESENTED BY THE PRINCE TO THE KING AND
QUEEN, WHO WELCOME HER WITH THE HONORS DUE TO A GREAT PRINCESS, AND IS
THEN LED INTO THE ROYAL BY THE HER WITH THE HONORS DUE TO A GREAT PRINCES