Text-to-Text Transformer (T5-Base Model) Testing for Summarization, Sentiment Classification, and Translation Using PyTorch and torchtext
One of the key features of the Transformer architecture is its use of self-attention mechanisms, which allow the model to "attend" to different parts of the input text and weight their importance when generating the output. This is in contrast to traditional sequence-to-sequence models, which rely on recurrent neural networks (RNNs) and can be more difficult to parallelize and optimize.
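As a concrete illustration, here is a minimal sketch of scaled dot-product self-attention in PyTorch (single head, random projection matrices for the demo; T5 itself uses multi-head attention with learned weights):

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # every position attends to every other position; softmax turns the
    # scaled dot products into importance weights over the input
    weights = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return weights @ v

x = torch.randn(4, 16)                      # a toy 4-token sequence
w = [torch.randn(16, 8) for _ in range(3)]  # random projections for the demo
print(self_attention(x, *w).shape)          # torch.Size([4, 8])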
One common way to fine-tune a Transformer model is to use a pre-trained model as a starting point and then "fine-tune" it on your specific dataset by continuing to train the model using your data. This can be particularly useful if you have a small dataset and want to take advantage of the knowledge and information learned by the pre-trained model on a larger dataset.
T5-Base Model Fine-Tuning Using TensorFlow
The T5 (Text-To-Text Transfer Transformer) model is a large-scale, multi-task version of the Transformer architecture that was introduced by Google in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". It was designed to be a general-purpose model that can be fine-tuned for a wide range of natural language processing tasks, including translation, summarization, question answering, and text generation.
To fine-tune a T5 model in TensorFlow, start by installing the necessary libraries and downloading the T5 model weights. Then prepare your dataset, which should consist of input text and corresponding output text for the task you want to perform. Next, define the model architecture and select the appropriate hyperparameters. Finally, train the model with a suitable loss function and optimizer.
One advantage of using the T5 model is that it is designed to be easily fine-tuned for a wide range of tasks using a single codebase. This means that you can reuse the same model architecture and training code for multiple tasks, rather than having to write separate code for each task.
It's also worth noting that the T5 model is a very large and computationally intensive model, so you may need to use a powerful machine with a GPU to fine-tune it effectively. Additionally, you may need to carefully tune the model hyperparameters and training settings to achieve good performance on your specific task.
Since TensorFlow itself does not ship T5 weights, the sketch below uses the Hugging Face transformers library (a minimal example; the tokenization and training details will vary with your task):

pip install transformers tensorflow sentencepiece

import tensorflow as tf
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

# Download the pre-trained T5 model weights and tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = TFT5ForConditionalGeneration.from_pretrained("t5-base")

# Prepare the dataset
# The dataset should consist of input text and corresponding output text
# for the task you want to perform
input_texts = ["input 1", "input 2", "input 3"]
output_texts = ["output 1", "output 2", "output 3"]

# Tokenize the texts into padded tensors of (integer) token IDs
inputs = tokenizer(input_texts, padding=True, return_tensors="tf")
labels = tokenizer(output_texts, padding=True, return_tensors="tf").input_ids

# Select the appropriate hyperparameters; compiled without an explicit loss,
# the model falls back to its built-in cross-entropy loss over the labels
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4))

# Train the model
model.fit(
    x={"input_ids": inputs.input_ids,
       "attention_mask": inputs.attention_mask,
       "labels": labels},
    epochs=5,
)
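After fine-tuning, inference follows the same transformers API (illustrative; output quality depends entirely on your training data):

# Generate output text for a new input
test = tokenizer("input 1", return_tensors="tf")
ids = model.generate(test.input_ids, max_length=32)
print(tokenizer.decode(ids[0], skip_special_tokens=True))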
T5-Base Model Testing for Summarization, Sentiment Classification, and Translation Using PyTorch and torchtext
This tutorial shows how to perform summarization, sentiment classification, and translation with a pre-trained T5 model. In this example, we'll use the torchtext library to:
- Build a text pre-processing pipeline for the T5 model.
- Instantiate a pre-trained T5 model with base configuration.
- Pre-process the texts in the CNNDM, IMDB, and Multi30k datasets so they are ready for the model.
- Perform text summarization, sentiment classification, and translation.
import torch
import torch.nn.functional as F
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
T5 uses a SentencePiece model for text pre-processing. The transform below will:
- Tokenize text by converting it to (integer) token IDs.
- Truncate the sequences to a specified maximum length.
- Add padding token IDs and end-of-sequence (EOS) token IDs.
from torchtext.prototype.models import T5Transform

padding_idx = 0
eos_idx = 1
max_seq_len = 512
t5_sp_model_path = "https://download.pytorch.org/models/text/t5_tokenizer_base.model"

transform = T5Transform(
    sp_model_path=t5_sp_model_path,
    max_seq_len=max_seq_len,
    eos_idx=eos_idx,
    padding_idx=padding_idx,
)
Alternatively, we can use the transform that ships with the pre-trained model, which performs all of the above out of the box:
from torchtext.prototype.models import T5_BASE_GENERATION
transform = T5_BASE_GENERATION.transform()
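As a quick sanity check (illustrative; the exact IDs depend on the SentencePiece model), the transform maps raw strings to a padded tensor of token IDs:

# each row is truncated to max_seq_len and terminated with eos_idx
ids = transform(["summarize: studies have shown that owning a dog is good for you"])
print(ids.shape, ids.dtype)  # (batch, seq_len), torch.int64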
Preparing the Model
from torchtext.prototype.models import T5_BASE_GENERATION

t5_base = T5_BASE_GENERATION
transform = t5_base.transform()
model = t5_base.get_model()
model.eval()  # inference only: disable dropout
model.to(DEVICE)
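The generator defined next relies on a few attributes of this model object; printing them is a useful orientation (the values noted are for the base configuration and are assumptions worth verifying on your install):

print(model.config.embedding_dim)  # 768 for t5-base
print(model.config.max_seq_len)    # 512
print(model.padding_idx)           # 0; also used as the decoder start token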
Sequence Generator
The generator produces the output sequence token by token using beam search: at each step the top-k continuations of every hypothesis are scored, the best k beams per input sequence are kept, and a beam is frozen once it emits the end-of-sequence token.

from torch import Tensor
from torchtext.prototype.models import T5Model

def beam_search(
    beam_size: int,
    step: int,
    bsz: int,
    decoder_output: Tensor,
    decoder_tokens: Tensor,
    scores: Tensor,
    incomplete_sentences: Tensor,
):
    probs = F.log_softmax(decoder_output[:, -1], dim=-1)
    top = torch.topk(probs, beam_size)

    # N is number of sequences in decoder_tokens, L is length of sequences, B is beam_size
    # decoder_tokens has shape (N, L) -> (N, B, L)
    # top.indices has shape (N, B) -> (N, B, 1)
    # x has shape (N, B, L+1)
    # note that when step == 1, N = batch_size, and when step > 1, N = batch_size * beam_size
    x = torch.cat([decoder_tokens.unsqueeze(1).repeat(1, beam_size, 1), top.indices.unsqueeze(-1)], dim=-1)

    # beams are first created for a given sequence
    if step == 1:
        # x has shape (batch_size, B, L+1) -> (batch_size * B, L+1)
        # new_scores has shape (batch_size, B)
        # incomplete_sentences has shape (batch_size * B) = (N)
        new_decoder_tokens = x.view(-1, step + 1)
        new_scores = top.values
        new_incomplete_sentences = incomplete_sentences

    # beams already exist; expand each beam into its possible new tokens,
    # and for all expanded beams belonging to the same sequence, choose the top k
    else:
        # scores has shape (batch_size, B) -> (N, 1) -> (N, B)
        # top.values has shape (N, B)
        # new_scores has shape (N, B) -> (batch_size, B^2)
        new_scores = (scores.view(-1, 1).repeat(1, beam_size) + top.values).view(bsz, -1)

        # v, i have shapes (batch_size, B)
        v, i = torch.topk(new_scores, beam_size)

        # x has shape (N, B, L+1) -> (batch_size, B, L+1)
        # i has shape (batch_size, B) -> (batch_size, B, L+1)
        # new_decoder_tokens has shape (batch_size, B, L+1) -> (N, L)
        x = x.view(bsz, -1, step + 1)
        new_decoder_tokens = x.gather(index=i.unsqueeze(-1).repeat(1, 1, step + 1), dim=1).view(-1, step + 1)

        # need to update incomplete sentences in case one of the beams was kicked out
        # y has shape (N) -> (N, 1) -> (N, B) -> (batch_size, B^2)
        y = incomplete_sentences.unsqueeze(-1).repeat(1, beam_size).view(bsz, -1)

        # now we can use i to extract the beams that were selected
        # new_incomplete_sentences has shape (batch_size, B^2) -> (batch_size, B) -> (N, 1) -> (N)
        new_incomplete_sentences = y.gather(index=i, dim=1).view(bsz * beam_size, 1).squeeze(-1)

        # new_scores has shape (batch_size, B)
        new_scores = v

    return new_decoder_tokens, new_scores, new_incomplete_sentences
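To see what the step > 1 branch is doing, here is the score expansion on dummy numbers (beam_size=2, batch_size=1; the values are arbitrary log-probabilities):

scores = torch.tensor([[0.0, -1.0]])                     # (batch_size, B) running beam scores
top_values = torch.tensor([[-0.5, -2.0], [-0.1, -3.0]])  # (N, B) candidate-token log-probs
expanded = (scores.view(-1, 1).repeat(1, 2) + top_values).view(1, -1)  # (batch_size, B^2)
print(torch.topk(expanded, 2))  # keeps the best 2 of 4 candidate beams: -0.5 and -1.1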
def generate(encoder_tokens: Tensor, eos_idx: int, model: T5Model, beam_size: int) -> Tensor:
    # pass tokens through encoder
    bsz = encoder_tokens.size(0)
    encoder_padding_mask = encoder_tokens.eq(model.padding_idx)
    encoder_embeddings = model.dropout1(model.token_embeddings(encoder_tokens))
    encoder_output = model.encoder(encoder_embeddings, tgt_key_padding_mask=encoder_padding_mask)[0]
    encoder_output = model.norm1(encoder_output)
    encoder_output = model.dropout2(encoder_output)

    # initialize decoder input sequence; T5 uses the padding index as the decoder start token
    decoder_tokens = torch.ones((bsz, 1), dtype=torch.long) * model.padding_idx
    scores = torch.zeros((bsz, beam_size))

    # mask to keep track of sequences for which the decoder has not yet produced an end-of-sequence token
    incomplete_sentences = torch.ones(bsz * beam_size, dtype=torch.long)

    # iteratively generate the output sequence until all sequences in the batch have emitted the end-of-sequence token
    for step in range(model.config.max_seq_len):
        if step == 1:
            # duplicate and order encoder output so that each beam is treated as its own independent sequence
            new_order = torch.arange(bsz).view(-1, 1).repeat(1, beam_size).view(-1)
            new_order = new_order.to(encoder_tokens.device).long()
            encoder_output = encoder_output.index_select(0, new_order)
            encoder_padding_mask = encoder_padding_mask.index_select(0, new_order)

        # causal mask and padding mask for decoder sequence
        tgt_len = decoder_tokens.shape[1]
        decoder_mask = torch.triu(torch.ones((tgt_len, tgt_len), dtype=torch.float64), diagonal=1).bool()
        decoder_padding_mask = decoder_tokens.eq(model.padding_idx)

        # the T5 implementation uses the padding idx to start the sequence; ignore it when masking
        decoder_padding_mask[:, 0] = False

        # pass decoder sequence through decoder
        decoder_embeddings = model.dropout3(model.token_embeddings(decoder_tokens))
        decoder_output = model.decoder(
            decoder_embeddings,
            memory=encoder_output,
            tgt_mask=decoder_mask,
            tgt_key_padding_mask=decoder_padding_mask,
            memory_key_padding_mask=encoder_padding_mask,
        )[0]

        decoder_output = model.norm2(decoder_output)
        decoder_output = model.dropout4(decoder_output)
        decoder_output = decoder_output * (model.config.embedding_dim ** -0.5)
        decoder_output = model.lm_head(decoder_output)

        decoder_tokens, scores, incomplete_sentences = beam_search(
            beam_size, step + 1, bsz, decoder_output, decoder_tokens, scores, incomplete_sentences
        )
        # ignore newest tokens for sentences that are already complete
        decoder_tokens[:, -1] *= incomplete_sentences

        # update incomplete_sentences to remove those that were just ended
        incomplete_sentences = incomplete_sentences - (decoder_tokens[:, -1] == eos_idx).long()

        # early stop if all sentences have ended
        if (incomplete_sentences == 0).all():
            break

    # take the most likely sequence
    decoder_tokens = decoder_tokens.view(bsz, beam_size, -1)[:, 0, :]
    return decoder_tokens
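A quick smoke test of the generator (illustrative; the prompt is arbitrary, this assumes CPU execution since generate allocates its working tensors on the CPU, and beam_size=1 reduces the search to greedy decoding):

sample = transform(["summarize: PyTorch is an open source machine learning framework."])
tokens = generate(encoder_tokens=sample, eos_idx=eos_idx, model=model, beam_size=1)
print(transform.decode(tokens.tolist()))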
Preparing the Dataset
torchtext provides several standard NLP datasets; here we use CNNDM for summarization, IMDB for sentiment classification, and Multi30k for translation. Because T5 is a text-to-text model, each input is prepended with a short task prefix (e.g. "summarize: ") that tells the model which task to perform.
from functools import partial
from torch.utils.data import DataLoader
from torchtext.datasets import CNNDM, IMDB, Multi30k
cnndm_batch_size = 5
cnndm_datapipe = CNNDM(split="test")
task = "summarize"
def apply_prefix(task, x):
    return f"{task}: " + x[0], x[1]
cnndm_datapipe = cnndm_datapipe.map(partial(apply_prefix, task))
cnndm_datapipe = cnndm_datapipe.batch(cnndm_batch_size)
cnndm_datapipe = cnndm_datapipe.rows2columnar(["article", "abstract"])
cnndm_dataloader = DataLoader(cnndm_datapipe, batch_size=None)
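For instance, on a hypothetical (article, abstract) pair, the prefix marks the task for the model:

pair = ("The quick brown fox jumps over the lazy dog.", "A fox jumps a dog.")
print(apply_prefix("summarize", pair)[0])
# -> 'summarize: The quick brown fox jumps over the lazy dog.'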
Alternately, we can use the batched API and apply the prefix to the whole batch at once:
def batch_prefix(task, x):
    return {
        "article": [f"{task}: " + y for y in x["article"]],
        "abstract": x["abstract"],
    }

cnndm_batch_size = 5
cnndm_datapipe = CNNDM(split="test")
task = "summarize"

cnndm_datapipe = cnndm_datapipe.batch(cnndm_batch_size).rows2columnar(["article", "abstract"])
cnndm_datapipe = cnndm_datapipe.map(partial(batch_prefix, task))
cnndm_dataloader = DataLoader(cnndm_datapipe, batch_size=None)
imdb_batch_size = 3
imdb_datapipe = IMDB(split="test")
task = "sst2 sentence"
labels = {"neg": "negative", "pos": "positive"}
def process_labels(labels, x):
    return x[1], labels[x[0]]
imdb_datapipe = imdb_datapipe.map(partial(process_labels, labels))
imdb_datapipe = imdb_datapipe.map(partial(apply_prefix, task))
imdb_datapipe = imdb_datapipe.batch(imdb_batch_size)
imdb_datapipe = imdb_datapipe.rows2columnar(["text", "label"])
imdb_dataloader = DataLoader(imdb_datapipe, batch_size=None)
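To make the label handling concrete (a made-up review), process_labels swaps the (label, text) tuple into (text, verbalized label) so the data matches T5's text-to-text format:

print(process_labels({"neg": "negative", "pos": "positive"}, ("pos", "a great movie!")))
# -> ('a great movie!', 'positive')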
multi_batch_size = 5
language_pair = ("en", "de")
multi_datapipe = Multi30k(split="test", language_pair=language_pair)
task = "translate English to German"
multi_datapipe = multi_datapipe.map(partial(apply_prefix, task))
multi_datapipe = multi_datapipe.batch(multi_batch_size)
multi_datapipe = multi_datapipe.rows2columnar(["english", "german"])
multi_dataloader = DataLoader(multi_datapipe, batch_size=None)
Generating Summaries
batch = next(iter(cnndm_dataloader))
input_text = batch["article"]
target = batch["abstract"]
beam_size = 3
model_input = transform(input_text)
model_output = generate(model=model, encoder_tokens=model_input, eos_idx=eos_idx, beam_size=beam_size)
output_text = transform.decode(model_output.tolist())
for i in range(cnndm_batch_size):
    print(f"Example {i+1}:\n")
    print(f"prediction: {output_text[i]}\n")
    print(f"target: {target[i]}\n\n")
Example 1:
prediction: the Palestinians become the 123rd member of the international criminal
court . the accession was marked by a ceremony at the Hague, where the court is based .
the ICC opened a preliminary examination into the situation in the occupied
Palestinian territory .
target: Membership gives the ICC jurisdiction over alleged crimes committed in
Palestinian territories since last June . Israel and the United States opposed the
move, which could open the door to war crimes investigations against Israelis .
Example 2:
prediction: a stray pooch has used up at least three of her own after being hit by a
car and buried in a field . the dog managed to stagger to a nearby farm, dirt-covered
and emaciated, where she was found . she suffered a dislocated jaw, leg injuries and a
caved-in sinus cavity -- and still requires surgery to help her breathe .
target: Theia, a bully breed mix, was apparently hit by a car, whacked with a hammer
and buried in a field . "She's a true miracle dog and she deserves a good life," says
Sara Mellado, who is looking for a home for Theia .
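Generating Sentiment Classifications and Translations
The same generate-and-decode loop covers the remaining two tasks. A sketch following the summarization pattern above (beam_size=1 here, i.e. greedy decoding):

batch = next(iter(imdb_dataloader))
model_input = transform(batch["text"])
model_output = generate(model=model, encoder_tokens=model_input, eos_idx=eos_idx, beam_size=1)
for pred, target in zip(transform.decode(model_output.tolist()), batch["label"]):
    print(f"prediction: {pred}\ntarget: {target}\n")

batch = next(iter(multi_dataloader))
model_input = transform(batch["english"])
model_output = generate(model=model, encoder_tokens=model_input, eos_idx=eos_idx, beam_size=1)
for pred, target in zip(transform.decode(model_output.tolist()), batch["german"]):
    print(f"prediction: {pred}\ntarget: {target}\n")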