Text-to-Text Transformer (T5-Base Model) Testing For Summarization, Sentiment Classification, and Translation Using Pytorch and Torchtext Skip to main content

Text-to-Text Transformer (T5-Base Model) Testing For Summarization, Sentiment Classification, and Translation Using Pytorch and Torchtext

The Text-to-Text Transformer is a type of neural network architecture that is particularly well-suited for natural language processing tasks involving the generation of text. It was introduced in the paper "Attention is All You Need" by Vaswani et al. and has since become a popular choice for many NLP tasks, including language translation, summarization, and text generation.
One of the key features of the Transformer architecture is its use of self-attention mechanisms, which allow the model to "attend" to different parts of the input text and weights their importance in generating the output. This is in contrast to traditional sequence-to-sequence models, which rely on recurrent neural networks (RNNs) and can be more difficult to parallelize and optimize.
To fine-tune a text-to-text Transformer in Python, you will need to start by installing the necessary libraries, such as TensorFlow or PyTorch. You will then need to prepare your dataset, which should consist of pairs of input and output text. Next, you will need to define the model architecture and select the appropriate hyperparameters. Finally, you can train the model using an appropriate loss function and optimize it using a suitable optimizer.
One common way to fine-tune a Transformer model is to use a pre-trained model as a starting point and then "fine-tune" it on your specific dataset by continuing to train the model using your data. This can be particularly useful if you have a small dataset and want to take advantage of the knowledge and information learned by the pre-trained model on a larger dataset.

It's also important to mention that there are many open-source implementations of the Transformer architecture available online, which can serve as useful starting points for your own projects. These implementations often include code for training and evaluating the model on various tasks, as well as helpful utilities for preprocessing and working with text data.

T5-Base Model Fine-Tuning Using TensorFlow

The T5 (Text-To-Text Transfer Transformer) model is a large-scale, multi-task version of the Transformer architecture that was introduced by Google in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". It was designed to be a general-purpose model that can be fine-tuned for a wide range of natural language processing tasks, including translation, summarization, question answering, and text generation.

To fine-tune a T5 model in TensorFlow, you will need to start by installing the necessary libraries and downloading the T5 model weights. You will then need to prepare your dataset, which should consist of input text and corresponding output text for the task you want to perform. Next, you will need to define the model architecture and select the appropriate hyperparameters. Finally, you can train the model using an appropriate loss function and optimize it using a suitable optimizer.

One advantage of using the T5 model is that it is designed to be easily fine-tuned for a wide range of tasks using a single codebase. This means that you can reuse the same model architecture and training code for multiple tasks, rather than having to write separate code for each task.

It's also worth noting that the T5 model is a very large and computationally intensive model, so you may need to use a powerful machine with a GPU to fine-tune it effectively. Additionally, you may need to carefully tune the model hyperparameters and training settings to achieve good performance on your specific task.

pip install tensorflow-datasets
import tensorflow_datasets as tfds
import tensorflow as tf

# Download the T5 model weights
model = tfds.image.t5.T5Model.from_pretrained('t5-base')
# Prepare the dataset
# The dataset should consist of input text and corresponding output text
# for the task you want to perform
input_texts = ['input 1', 'input 2', 'input 3']
output_texts = ['output 1', 'output 2', 'output 3']

# Define the model architecture and select the appropriate hyperparameters
# You can use the pre-defined model architectures provided by TensorFlow,
# or you can define your own custom architecture
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(input_texts, output_texts, epochs=5)

T5-Base Model Testing for Summarization, Sentiment Classification, and Translation Using Pytorch and Torchtext

This tutorial shows how to do summarization, sentiment classification, and translation tasks using a pre-trained T5 Model. In this example, we'll show you how to utilize the torchtext library to Create a T5 model text pre-processing pipeline.

  • Initiate a base-configured, pre-trained T5 model.
  • Pre-process the texts in the CNNDM, IMDB, and Multi30k datasets to prepare them for the model.
  • Conduct text synthesis, sentiment analysis, and translation
To start testing the T5-base model, let's first start by importing some common Libraries:

import torch
import torch.nn.functional as F
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

Next is data transformation. Raw text is incompatible with the T5 model. Instead, in order to accomplish training and inference, the text must be converted into a numerical representation. For the T5 model, the following transformations are necessary:
  • Text can be tokenized by converting it to (integer) IDs.
  • The sequences should be shortened to a certain maximum length.
  • Add padding token IDs and end-of-sequence (EOS) token IDs
Text tokenization in T5 is done using the SentencePiece model. In the example below, we establish the text pre-processing pipeline using torchtext's T5Transform and a pre-trained SentencePiece model. However, the T5 model anticipates the input to be batched. It should be noted that the transform accepts both batched and non-batched text input (i.e., one can pass a single sentence or a list of sentences).
from torchtext.prototype.models import T5Transform

padding_idx = 0
eos_idx = 1
max_seq_len = 512
t5_sp_model_path = "https://download.pytorch.org/models/text/t5_tokenizer_base.model"

transform = T5Transform(
Alternatively, we can also use the transform shipped with the pre-trained models that does all of the above out-of-the-box
from torchtext.prototype.models import T5_BASE_GENERATION
transform = T5_BASE_GENERATION.transform()

Preparing The Model

Torchtext offers SOTA pre-trained models that may be applied directly to NLP tasks or improved upon for tasks further down the line. Below, we summarize, categorize, and translate text using the pre-trained T5 model with the default base setup.
from torchtext.prototype.models import T5_BASE_GENERATION
transform = t5_base.transform()
model = t5_base.get_model()

Sequence Generator

A sequence generator can be created to generate an output sequence based on an input sequence. The encoder and decoder of the model are invoked, and the decoded sequences are expanded iteratively until the end-of-sequence token is generated for each sequence in the batch. The sequences are produced using a beam search in the create manner depicted below. A beam size of 1 is equal to a greedy decoder, but larger beam sizes can lead to improved generation at the expense of increased computing complexity.
from torch import Tensor
from torchtext.prototype.models import T5Model

def beam_search(
    beam_size: int,
    step: int,
    bsz: int,
    decoder_output: Tensor,
    decoder_tokens: Tensor,
    scores: Tensor,
    incomplete_sentences: Tensor,
    probs = F.log_softmax(decoder_output[:, -1], dim=-1)
    top = torch.topk(probs, beam_size)

    # N is number of sequences in decoder_tokens, L is length of sequences, B is beam_size
    # decoder_tokens has shape (N,L) -> (N,B,L)
    # top.indices has shape (N,B) - > (N,B,1)
    # x has shape (N,B,L+1)
    # note that when step == 1, N = batch_size, and when step > 1, N = batch_size * beam_size
    x = torch.cat([decoder_tokens.unsqueeze(1).repeat(1, beam_size, 1), top.indices.unsqueeze(-1)], dim=-1)

    # beams are first created for a given sequence
    if step == 1:
        # x has shape (batch_size, B, L+1) -> (batch_size * B, L+1)
        # new_scores has shape (batch_size,B)
        # incomplete_sentences has shape (batch_size * B) = (N)
        new_decoder_tokens = x.view(-1, step + 1)
        new_scores = top.values
        new_incomplete_sentences = incomplete_sentences

    # beams already exist, want to expand each beam into possible new tokens to add
    # and for all expanded beams beloning to the same sequences, choose the top k
        # scores has shape (batch_size,B) -> (N,1) -> (N,B)
        # top.values has shape (N,B)
        # new_scores has shape (N,B) -> (batch_size, B^2)
        new_scores = (scores.view(-1, 1).repeat(1, beam_size) + top.values).view(bsz, -1)

        # v, i have shapes (batch_size, B)
        v, i = torch.topk(new_scores, beam_size)

        # x has shape (N,B,L+1) -> (batch_size, B, L+1)
        # i has shape (batch_size, B) -> (batch_size, B, L+1)
        # new_decoder_tokens has shape (batch_size, B, L+1) -> (N, L)
        x = x.view(bsz, -1, step + 1)
        new_decoder_tokens = x.gather(index=i.unsqueeze(-1).repeat(1, 1, step + 1), dim=1).view(-1, step + 1)

        # need to update incomplete sentences in case one of the beams was kicked out
        # y has shape (N) -> (N, 1) -> (N, B) -> (batch_size, B^2)
        y = incomplete_sentences.unsqueeze(-1).repeat(1, beam_size).view(bsz, -1)

        # now can use i to extract those beams that were selected
        # new_incomplete_sentences has shape (batch_size, B^2) -> (batch_size, B) -> (N, 1) -> N
        new_incomplete_sentences = y.gather(index=i, dim=1).view(bsz * beam_size, 1).squeeze(-1)

        # new_scores has shape (batch_size, B)
        new_scores = v

    return new_decoder_tokens, new_scores, new_incomplete_sentences

def generate(encoder_tokens: Tensor, eos_idx: int, model: T5Model, beam_size: int) -> Tensor:

    # pass tokens through encoder
    bsz = encoder_tokens.size(0)
    encoder_padding_mask = encoder_tokens.eq(model.padding_idx)
    encoder_embeddings = model.dropout1(model.token_embeddings(encoder_tokens))
    encoder_output = model.encoder(encoder_embeddings, tgt_key_padding_mask=encoder_padding_mask)[0]

    encoder_output = model.norm1(encoder_output)
    encoder_output = model.dropout2(encoder_output)

    # initialize decoder input sequence; T5 uses padding index as starter index to decoder sequence
    decoder_tokens = torch.ones((bsz, 1), dtype=torch.long) * model.padding_idx
    scores = torch.zeros((bsz, beam_size))

    # mask to keep track of sequences for which the decoder has not produced an end-of-sequence token yet
    incomplete_sentences = torch.ones(bsz * beam_size, dtype=torch.long)

    # iteratively generate output sequence until all sequences in the batch have generated the end-of-sequence token
    for step in range(model.config.max_seq_len):

        if step == 1:
            # duplicate and order encoder output so that each beam is treated as its own independent sequence
            new_order = torch.arange(bsz).view(-1, 1).repeat(1, beam_size).view(-1)
            new_order = new_order.to(encoder_tokens.device).long()
            encoder_output = encoder_output.index_select(0, new_order)
            encoder_padding_mask = encoder_padding_mask.index_select(0, new_order)

        # causal mask and padding mask for decoder sequence
        tgt_len = decoder_tokens.shape[1]
        decoder_mask = torch.triu(torch.ones((tgt_len, tgt_len), dtype=torch.float64), diagonal=1).bool()
        decoder_padding_mask = decoder_tokens.eq(model.padding_idx)

        # T5 implemention uses padding idx to start sequence. Want to ignore this when masking
        decoder_padding_mask[:, 0] = False

        # pass decoder sequence through decoder
        decoder_embeddings = model.dropout3(model.token_embeddings(decoder_tokens))
        decoder_output = model.decoder(

        decoder_output = model.norm2(decoder_output)
        decoder_output = model.dropout4(decoder_output)
        decoder_output = decoder_output * (model.config.embedding_dim ** -0.5)
        decoder_output = model.lm_head(decoder_output)

        decoder_tokens, scores, incomplete_sentences = beam_search(
            beam_size, step + 1, bsz, decoder_output, decoder_tokens, scores, incomplete_sentences
        # ignore newest tokens for sentences that are already complete
        decoder_tokens[:, -1] *= incomplete_sentences

        # update incomplete_sentences to remove those that were just ended
        incomplete_sentences = incomplete_sentences - (decoder_tokens[:, -1] == eos_idx).long()

        # early stop if all sentences have been ended
        if (incomplete_sentences == 0).all():

    # take most likely sequence
    decoder_tokens = decoder_tokens.view(bsz, beam_size, -1)[:, 0, :]
    return decoder_tokens

Preparing the Dataset

torchtext provides several standard NLP datasets. These datasets offer typical flow-control and mapping/transformation using user-defined functions and transformations because they were constructed using composable torchdata datapipes.
In the example below, we show how to pre-process the CNNDM dataset so that it contains the prefix required for the model to recognize the task it is carrying out. There is a train, validation, and test split in the CNNDM dataset. We demonstrate the test split below.
For text summarizing, the T5 model employs the prefix "summarize".

from torchtext.datasets import IMDB
from functools import partial
from torch.utils.data import DataLoader
from torchtext.datasets import CNNDM

cnndm_batch_size = 5
cnndm_datapipe = CNNDM(split="test")
task = "summarize"

def apply_prefix(task, x):
    return f"{task}: " + x[0], x[1]

cnndm_datapipe = cnndm_datapipe.map(partial(apply_prefix, task))
cnndm_datapipe = cnndm_datapipe.batch(cnndm_batch_size)
cnndm_datapipe = cnndm_datapipe.rows2columnar(["article", "abstract"])
cnndm_dataloader = DataLoader(cnndm_datapipe, batch_size=None)

We can also utilize batched API as an alternative (i.e apply the prefix on the whole batch)

def batch_prefix(task, x):
 return {
     "article": [f'{task}: ' + y for y in x["article"]],
     "abstract": x["abstract"]
cnndm_batch_size = 5
cnndm_datapipe = CNNDM(split="test")
task = 'summarize'
cnndm_datapipe = cnndm_datapipe.batch(cnndm_batch_size).rows2columnar(["article", "abstract"])
cnndm_datapipe = cnndm_datapipe.map(partial(batch_prefix, task))
cnndm_dataloader = DataLoader(cnndm_datapipe, batch_size=None)

We may also import the IMDB dataset, which will be used to show how the T5 model can classify sentiment. There is a train/test split in this dataset. We demonstrate the test split below.
Using the prefix "sst2 sentence," the T5 model was trained on the SST2 dataset (also available in torchtext) for sentiment categorization. In order to do sentiment classification on the IMDB dataset, we will make use of this prefix.
from torchtext.datasets import IMDB
imdb_batch_size = 3
imdb_datapipe = IMDB(split="test")
task = "sst2 sentence"
labels = {"neg": "negative", "pos": "positive"}
def process_labels(labels, x):
    return x[1], labels[x[0]]
imdb_datapipe = imdb_datapipe.map(partial(process_labels, labels))
imdb_datapipe = imdb_datapipe.map(partial(apply_prefix, task))
imdb_datapipe = imdb_datapipe.batch(imdb_batch_size)
imdb_datapipe = imdb_datapipe.rows2columnar(["text", "label"])
imdb_dataloader = DataLoader(imdb_datapipe, batch_size=None)
In order to show how the T5 model can translate from English to German, we may additionally import the Multi30k dataset. There are splits for the dataset for train, validation, and testing. We demonstrate the test split below.
For this purpose, the T5 model employs the prefix "translate English to German"
from torchtext.datasets import Multi30k
multi_batch_size = 5
language_pair = ("en", "de")
multi_datapipe = Multi30k(split="test", language_pair=language_pair)
task = "translate English to German"

multi_datapipe = multi_datapipe.map(partial(apply_prefix, task))
multi_datapipe = multi_datapipe.batch(multi_batch_size)
multi_datapipe = multi_datapipe.rows2columnar(["english", "german"])
multi_dataloader = DataLoader(multi_datapipe, batch_size=None)

Generate Summarization

Using a beam size of 3, we can combine all the elements to provide summaries for the initial collection of articles in the CNNDM test set.
batch = next(iter(cnndm_dataloader))
input_text = batch["article"]
target = batch["abstract"]
beam_size = 3

model_input = transform(input_text)
model_output = generate(model=model, encoder_tokens=model_input, eos_idx=eos_idx, beam_size=beam_size)
output_text = transform.decode(model_output.tolist())

for i in range(cnndm_batch_size):
    print(f"Example {i+1}:\n")
    print(f"prediction: {output_text[i]}\n")
    print(f"target: {target[i]}\n\n")

Example 1:

prediction: the Palestinians become the 123rd member of the international criminal
court . the accession was marked by a ceremony at the Hague, where the court is based .
the ICC opened a preliminary examination into the situation in the occupied
Palestinian territory .

target: Membership gives the ICC jurisdiction over alleged crimes committed in
Palestinian territories since last June . Israel and the United States opposed the
move, which could open the door to war crimes investigations against Israelis .

Example 2:

prediction: a stray pooch has used up at least three of her own after being hit by a
car and buried in a field . the dog managed to stagger to a nearby farm, dirt-covered
and emaciated, where she was found . she suffered a dislocated jaw, leg injuries and a
caved-in sinus cavity -- and still requires surgery to help her breathe .

target: Theia, a bully breed mix, was apparently hit by a car, whacked with a hammer
and buried in a field . "She's a true miracle dog and she deserves a good life," says
Sara Mellado, who is looking for a home for Theia .

Recommended Books:

                                                             Buy Now
                                                              Buy Now


You may like

Latest Posts

SwiGLU Activation Function

Position Embedding: A Detailed Explanation

How to create a 1D- CNN in TensorFlow

Introduction to CNNs with Attention Layers

Meta Pseudo Labels (MPL) Algorithm

Video Classification Using CNN and Transformer: Hybrid Model

Liquid Neural Networks: Introduction