Generating Text with Recurrent Neural Networks in PyTorch

pytorch
lstm
This is a practice notebook for building a character-level language model with an LSTM in PyTorch. We will train the model on an input text, and our goal will be to generate some new text.
Published

November 19, 2022

Credits

This notebook takes inspiration and ideas from the following sources.

Environment

This notebook is prepared with Google Colab.

Code
from platform import python_version
import numpy, matplotlib, torch

print("python==" + python_version())
print("numpy==" + numpy.__version__)
print("torch==" + torch.__version__)
print("matplotlib==" + matplotlib.__version__)
python==3.7.15
numpy==1.21.6
torch==1.12.1+cu113
matplotlib==3.2.2

Introduction

Recurrent Neural Networks (RNNs) work well for sequence problems, i.e., predicting the next item in a sequence. Stock prices, for example, are a type of sequence data more commonly known as time-series data. The same notion can be applied in the NLP domain to build a character-level language model. Here the textual data becomes the sequence data, and our model tries to predict the next character in the input text. For training, the input text is broken into a sequence of characters and fed to the model one character at a time. The network processes each new character in relation to the previously seen characters and uses this information to predict the next character.
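To make the idea concrete, here is a toy illustration (separate from the model we will build) of the (input context, next character) pairs a character-level model learns from:

# toy illustration: next-character prediction pairs from the word "hello"
word = "hello"
pairs = [(word[:i], word[i]) for i in range(1, len(word))]
print(pairs)  # [('h', 'e'), ('he', 'l'), ('hel', 'l'), ('hell', 'o')]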

Data Preparation

Download data

For the input text, we will use the famous English folk tale Cinderella (though any other text would work equally well). To download the story text, you may use the Project Gutenberg site or Archive.org.

download_link = "https://ia600204.us.archive.org/30/items/cinderella10830gut/10830.txt"

## alternate download link
# download_link = "https://www.gutenberg.org/cache/epub/10830/pg10830.txt"

file_name = 'input.txt'
##
# download the story text and save it as {file_name}
! curl {download_link} -o {file_name}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 45278  100 45278    0     0  38865      0  0:00:01  0:00:01 --:--:-- 38831

The download is complete. We can now open the file and read its contents.

##
# Reading and processing text
with open(file_name, "r", encoding="utf8") as fp:
    text = fp.read()

Preprocess data

The downloaded text was published as a volunteer effort under Project Gutenberg. Project and license information has been added after the original story text as part of the project requirements. We are not interested in that boilerplate text, so let’s omit it and limit our input text to the folk story.

##
# truncate text till story start and end
start_indx = text.find(
    "There once lived a gentleman and his wife, who were the parents of a\nlovely little daughter."
)
end_indx = text.find("*       *       *       *       *")

text = text[start_indx:end_indx]

# total length of the text
print("Total Length (character count):", len(text))
Total Length (character count): 21831

How does the data look?

Let’s view the first 500 characters from the story text.

# view the text start
text[:500]
'There once lived a gentleman and his wife, who were the parents of a\nlovely little daughter.\n\nWhen this child was only nine years of age, her mother fell sick.\nFinding her death coming on, she called her child to her and said to\nher, "My child, always be good; bear every thing that happens to you\nwith patience, and whatever evil and troubles you may suffer, you will\nbe happy in the end if you are so." Then the poor lady died, and her\ndaughter was full of great grief at the loss of a mother so go'

And the last 500 characters.

# view the text end
text[-500:]
'their affection.\nShe was then taken to the palace of the young prince, in whose eyes she\nappeared yet more lovely than before, and who married her shortly after.\n\nCinderella, who was as good as she was beautiful, allowed her sisters to\nlodge in the palace, and gave them in marriage, that same day, to two\nlords belonging to the court.\n\n[Illustration: MARRIAGE OF THE PRINCE AND CINDERELLA.]\n\nThe amiable qualities of Cinderella were as conspicuous after as they\nhad been before marriage.\n\n\n\n\n       '

Preparing data dictionary

Our data is a string and can’t be used directly to train a model; we first have to convert it into integers. For this encoding, we will use a simple methodology: each unique character in the text is assigned an integer, and all occurrences of that character in the text are then replaced with that integer value.

For this, let’s first create a set of all the unique characters in the text.

import numpy as np

# find unique chars from text
char_set = set(text)
print("Unique Characters:", len(char_set))

# sort char set
chars_sorted = sorted(char_set)
print(chars_sorted)
Unique Characters: 65
['\n', ' ', '!', '"', "'", ',', '-', '.', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

We now know all the unique characters in our input text. Accordingly, we can create a dictionary and assign each character in char_set a unique integer.

# encode chars
char2int = {ch: i for i, ch in enumerate(chars_sorted)}

# `char2int` dictionary for char -> int
print(char2int)
{'\n': 0, ' ': 1, '!': 2, '"': 3, "'": 4, ',': 5, '-': 6, '.': 7, ':': 8, ';': 9, '?': 10, 'A': 11, 'B': 12, 'C': 13, 'D': 14, 'E': 15, 'F': 16, 'G': 17, 'H': 18, 'I': 19, 'J': 20, 'K': 21, 'L': 22, 'M': 23, 'N': 24, 'O': 25, 'P': 26, 'Q': 27, 'R': 28, 'S': 29, 'T': 30, 'U': 31, 'V': 32, 'W': 33, 'Y': 34, 'Z': 35, '[': 36, ']': 37, '_': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}

Besides the encoding, we also need a way to convert the encoded characters back to their original form. For this, we will use a separate array in which each position holds the character assigned to that integer. Together, char2int and int2char let us move back and forth between encoded and decoded text.

int2char = np.array(chars_sorted)

# `int2char` for int -> char
print(int2char)
['\n' ' ' '!' '"' "'" ',' '-' '.' ':' ';' '?' 'A' 'B' 'C' 'D' 'E' 'F' 'G'
 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'Y' 'Z'
 '[' ']' '_' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o'
 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z']

Encode input text

In this step, we will use the char2int dictionary to encode our story text. The encoded version of the text is called text_encoded.

##
# encode original text
text_encoded = np.array([char2int[ch] for ch in text], dtype=np.int32)

print("Text encoded shape: ", text_encoded.shape)
Text encoded shape:  (21831,)

Let’s use int2char to decode the encoded values back to the original characters.

##
# decoding original text
for ex in text_encoded[:5]:
    print("{} -> {}".format(ex, int2char[ex]))
30 -> T
46 -> h
43 -> e
56 -> r
43 -> e

Here is another example of encoding and decoding, this time with multiple words together.

print(text[:18], "     == Encoding ==> ", text_encoded[:18])
print(text_encoded[19:41], " == Reverse  ==> ", "".join(int2char[text_encoded[19:41]]))
There once lived a      == Encoding ==>  [30 46 43 56 43  1 53 52 41 43  1 50 47 60 43 42  1 39]
[45 43 52 58 50 43 51 39 52  1 39 52 42  1 46 47 57  1 61 47 44 43]  == Reverse  ==>  gentleman and his wife

Prepare data sequences

We have our encoded data ready. Next, we will convert it into overlapping sequences of fixed length. For sequencing, we will use chunks of length 41.

  • The first 40 characters in a sequence form the input
  • The last character in a sequence (the 41st) represents the target
##
# make sequences of encoded text as `text_chunks`
seq_length = 40
chunk_size = seq_length + 1

text_chunks = [
    text_encoded[i : i + chunk_size] for i in range(len(text_encoded) - chunk_size + 1)
]
##
# inspect the first chunk
for seq in text_chunks[:1]:
    input_seq = seq[:-1]
    target = seq[-1]

    print(input_seq, " -> ", target)
    print(repr("".join(int2char[input_seq])), " -> ", repr("".join(int2char[target])))
[30 46 43 56 43  1 53 52 41 43  1 50 47 60 43 42  1 39  1 45 43 52 58 50
 43 51 39 52  1 39 52 42  1 46 47 57  1 61 47 44]  ->  43
'There once lived a gentleman and his wif'  ->  'e'
##
# inspect the second chunk
for seq in text_chunks[1:2]:
    input_seq = seq[:-1]
    target = seq[-1]

    print(input_seq, " -> ", target)
    print(repr("".join(int2char[input_seq])), " -> ", repr("".join(int2char[target])))
[46 43 56 43  1 53 52 41 43  1 50 47 60 43 42  1 39  1 45 43 52 58 50 43
 51 39 52  1 39 52 42  1 46 47 57  1 61 47 44 43]  ->  5
'here once lived a gentleman and his wife'  ->  ','

Load Data into Dataset and DataLoader class

In this section, we will load our encoded data sequences into the Dataset and DataLoader classes to prepare batches for model training.

Load data into Dataset class

The TextDataset class is derived from the PyTorch Dataset class. When we fetch a sequence through this class, it returns the sequence as a tuple of input and target, where the target is the input shifted by one character.

Code
import torch
from torch.utils.data import Dataset


class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)

    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()  # input and target, shifted by one character


seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))

Each element from the seq_dataset consists of

  • input data that we will feed to the model for training
  • target data that we will use to compare the model output

Remember that both the input and target sequences are derived from the same encoded text. We train our model to predict the next character from the given input: one character goes into the model, and one character comes out. Ideally, the output character should be the next character in the sequence. Our target sequence is exactly that: the input sequence shifted forward by one character.

for i, (seq, target) in enumerate(seq_dataset):
    print(" Input (x):", repr("".join(int2char[seq])))
    print("Target (y):", repr("".join(int2char[target])))
    print()
    if i == 1:
        break
 Input (x): 'There once lived a gentleman and his wif'
Target (y): 'here once lived a gentleman and his wife'

 Input (x): 'here once lived a gentleman and his wife'
Target (y): 'ere once lived a gentleman and his wife,'

Load data into DataLoader class to prepare batches

In this step, we prepare training batches using the PyTorch DataLoader class.

from torch.utils.data import DataLoader

batch_size = 64

torch.manual_seed(1)
seq_dl = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
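As a quick sanity check (an extra snippet, not part of the original training flow), we can draw one batch and confirm its dimensions: with batch_size 64 and seq_length 40, both tensors should be of shape 64 x 40.

# inspect one batch from the DataLoader
seq_batch, target_batch = next(iter(seq_dl))
print(seq_batch.shape)     # expected: torch.Size([64, 40])
print(target_batch.shape)  # expected: torch.Size([64, 40])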

Model Configuration and Training

In this section, we will configure a model for character-level language modeling. The model starts with an Embedding layer; the embedding output is passed to an LSTM layer, and a fully connected linear layer produces the final output.

For an in-depth analysis of the working of an Embedding layer, I recommend the article Embeddings in Machine Learning: Everything You Need to Know.

Code
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        # x holds one character index per batch element; unsqueeze(1) adds
        # the sequence-length dimension expected by a batch_first LSTM
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)  # logits: (batch, vocab_size)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden.to(device), cell.to(device)
torch.manual_seed(1)

# define model dimensions
vocab_size = len(int2char)
embed_dim = 256
rnn_hidden_size = 512

# initialize model
model = RNN(vocab_size, embed_dim, rnn_hidden_size)
model = model.to(device)
model
RNN(
  (embedding): Embedding(65, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=65, bias=True)
)
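The forward pass consumes one character per batch element at a time. A quick shape check (an illustrative snippet, not from the original notebook) makes the tensor shapes concrete:

# feed a dummy batch of single characters through the model
dummy_x = torch.zeros(batch_size, dtype=torch.long).to(device)  # 64 character indices
h, c = model.init_hidden(batch_size)
out, h, c = model(dummy_x, h, c)
print(out.shape)  # expected: torch.Size([64, 65]) -- logits over the vocabulary
print(h.shape)    # expected: torch.Size([1, 64, 512])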

Configure loss function and optimizer

  • For the loss function, we will use CrossEntropyLoss. We are dealing with a classification problem: the model has to predict the next character out of vocab_size (65) classes. CrossEntropyLoss expects raw logits and integer class targets, as the quick shape check below illustrates.
  • For optimization, we will use torch.optim.Adam
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
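A minimal sketch (with random dummy tensors) of the shape contract CrossEntropyLoss works with here: raw logits of shape (batch, vocab_size) against integer class targets of shape (batch,), which is exactly what each per-character step of the training loop below produces.

# dummy shape check for the loss function
demo_logits = torch.randn(8, vocab_size)           # raw logits: (batch, classes)
demo_targets = torch.randint(0, vocab_size, (8,))  # class indices: (batch,)
print(loss_fn(demo_logits, demo_targets))          # prints a scalar loss tensor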

Model training

All parts are ready, so let’s start the training. The Google Colab “CPU” runtime can take significantly longer to train; I would suggest using the “GPU” runtime instead.

Code
# for execution time measurement
from timeit import default_timer as timer

num_epochs = 10000
model.train()

start = timer()  # timer start
for epoch in range(num_epochs):
    hidden, cell = model.init_hidden(batch_size)

    # note: each "epoch" here trains on a single randomly drawn batch,
    # since a fresh DataLoader iterator is created on every iteration
    seq_batch, target_batch = next(iter(seq_dl))
    seq_batch = seq_batch.to(device)
    target_batch = target_batch.to(device)

    optimizer.zero_grad()
    loss = 0

    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
        loss += loss_fn(pred, target_batch[:, c])

    loss.backward()
    optimizer.step()

    loss = loss.item() / seq_length
    if epoch % 500 == 0:
        print(f"Epoch {epoch} loss: {loss:.4f}")

end = timer()  # timer end
print("Total execution time in seconds: ", "%.2f" % (end - start))
print("Device type: ", device)
Epoch 0 loss: 2.6252
Epoch 500 loss: 0.3377
Epoch 1000 loss: 0.2502
Epoch 1500 loss: 0.2403
Epoch 2000 loss: 0.2501
Epoch 2500 loss: 0.2374
Epoch 3000 loss: 0.2368
Epoch 3500 loss: 0.2499
Epoch 4000 loss: 0.2643
Epoch 4500 loss: 0.2555
Epoch 5000 loss: 0.3854
Epoch 5500 loss: 0.2326
Epoch 6000 loss: 0.2390
Epoch 6500 loss: 0.2270
Epoch 7000 loss: 0.2663
Epoch 7500 loss: 0.3403
Epoch 8000 loss: 0.2475
Epoch 8500 loss: 0.2370
Epoch 9000 loss: 0.2126
Epoch 9500 loss: 0.2308
Total execution time in seconds:  378.14
Device type:  cuda

Process output from the model

Getting a prediction (text generation) from the model takes some extra work. Since the model was trained on encoded text, the output it generates is also encoded. Further, any input used for prediction needs to be encoded with the same dictionary the model was trained with. For this, we have defined a helper function.

  • This function will take the input text and encode it before passing it to the model

  • It will take the output from the model and decode it before returning

  • Note that the LSTM model output consists of logits, a hidden state, and a cell state. The logits give us the next predicted character. The hidden state and cell state keep the context (or memory) of the characters processed so far and are supplied back to the model for the next prediction.

  • For the output logits, we could simply predict the next character using the index of the highest logit value, but this would make our model predict the exact same text for the same input every time. To introduce some randomness, we take help from the PyTorch class torch.distributions.categorical.Categorical. This is how it works:

    • We pass the logits to a Categorical object, which applies softmax internally to obtain the output probabilities and creates a distribution over the characters.
    • We draw a sample from the Categorical object. Samples drawn from the same distribution may differ, which is how we get different outputs for the same input text.
    • We can also control the predictability of the output by scaling the logits before passing them to Categorical. Scaling the logits down (scale_factor < 1) makes the probabilities more uniform, so the sampled characters become more random. Scaling the logits up makes the probabilities further apart (more peaked), so the output becomes more predictable. A toy demonstration follows.
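Below is a toy demonstration (with made-up logits, separate from the sample helper that follows) of how the scale factor shapes the distribution:

# effect of logit scaling on the sampling distribution
toy_logits = torch.tensor([1.0, 2.0, 3.0])
for scale in (0.1, 1.0, 10.0):
    probs = torch.softmax(toy_logits * scale, dim=0)
    print(f"scale={scale}: probs={probs.numpy().round(3)}")
# scale=0.1 gives a near-uniform distribution (more random samples);
# scale=10.0 puts almost all mass on the largest logit (more predictable)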
Code
from torch.distributions.categorical import Categorical

def sample(model, starting_str, len_generated_text=500, scale_factor=1.0):

    encoded_input = torch.tensor([char2int[s] for s in starting_str])
    encoded_input = torch.reshape(encoded_input, (1, -1))

    generated_str = starting_str

    model.eval()
    hidden, cell = model.init_hidden(1)
    hidden = hidden.to("cpu")
    cell = cell.to("cpu")
    # warm up the hidden and cell states on all but the last input character
    for c in range(len(starting_str) - 1):
        _, hidden, cell = model(encoded_input[:, c].view(1), hidden, cell)

    last_char = encoded_input[:, -1]
    # generate new characters one at a time, feeding each sample back in
    for i in range(len_generated_text):
        logits, hidden, cell = model(last_char.view(1), hidden, cell)
        logits = torch.squeeze(logits, 0)
        scaled_logits = logits * scale_factor
        m = Categorical(logits=scaled_logits)
        last_char = m.sample()
        generated_str += str(int2char[last_char])

    return generated_str

Generating new text passages

In the sample function, we process the input text and model output on the CPU, so let’s move the model to the same device.

##
# move model to cpu
model.to('cpu')
RNN(
  (embedding): Embedding(65, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=65, bias=True)
)

Before generating some lengthy text, let’s experiment with simple words and see if our model can complete them.

At first, I used the string “fat” and asked the model to generate the next three characters to complete the word. At the same time, I passed a tiny scaling factor (0.1), which decreases the model’s predictability.

print(sample(model, starting_str="fat", len_generated_text=3, scale_factor=0.1))
fat, i

Next, I asked the model to predict the next three characters for the same input, but with the scale factor increased tenfold (to 1.0), making the model more predictable. Let’s see the output this time.

print(sample(model, starting_str='fat', len_generated_text=3, scale_factor=1.0))
father

This time the model generated the correct word “father”, which it had seen in the training text. Let’s now generate some lengthier texts.

##
# text generation example 1
print(sample(model, starting_str="The father"))
The father too was she was one of those good faeries who protect children. Her
spirits revived, and she wiped away her tears.

The faery took Cinderella by the hand, and old woman, assuming her character of Queen of the
Faeries, that only jumped up behind the
carriage as nimbly as if they had been footmen and laced so tight, touched Cinderella's clothes with her wand, and said, "Now, my dear good child," said the faery, "here you have a coach and
horses, much handsomer than your sisters', to say the least
##
# text generation example 2
print(sample(model, starting_str="The mother"))
The mother so good crust. But
if you like to give the household. It was she who washed the dishes, and
scrubbed down the step-sisters were very cruel to Cinderella,
that he did not eat one morsel of the supper.

Cinderella drew the fellow slipper
out of her godmother
would do with it. Her godmother took the pumpkin, and scooped out the
inside of it, leaving nothing but rind; she then struck it with her
godmother then said, "My dear Cinderella,
that he did not eat one morsel of the supper.

Cinderella drew
##
# text generation example 3
print(sample(model, starting_str="The three sisters"))
The three sisters were very cruel to Cinderella,
that he delicacies which she had
received from the prince:  but they did not eat one morsel for a
couple of days. They spent their whole time before a looking-glass, and
they would be laced so tight, tossing her head disdainfully, "that I
should lend my clothes to a dirty Cinderella like you!"

Cinderella quite amazed; but their
astonishment at her dancing was still greater.

Gracefulness seemed to play in the attempt.

The long-wished-for evening came at last, an
##
# text generation example 4
print(sample(model, starting_str="The lovely prince"))
The lovely prince
immediately jumped up behind the
carriage as nimbly as conspicuous after as they
had been before mocking me," replied the poor girl to do all the
drudgery of the household. It was she who washed the dishes, and
scrubbed down the stairs, who tried with all their might to force their unwould stration: CINDERELLA IS PRESENTED BY THE PRINCE TO THE KING AND
QUEEN, WHO WELCOME HER WITH THE HONORS DUE TO A GREAT PRINCESS, AND IS
THEN LED INTO THE ROYAL BY THE HER WITH THE HONORS DUE TO A GREAT PRINCES