This project implements the byte-pair encoding (BPE) algorithm and adapts it for tokenization: the tokenizer splits a given input sequence into sub-word tokens.
A vocabulary interface is also implemented, which handles the mapping between sub-word tokens and their corresponding indices in the vocabulary, and provides utility methods for use within the context of a Transformer model.
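The core BPE procedure can be sketched as follows. This is an illustrative implementation of the standard algorithm, not this project's exact code: starting from single-character symbols, it repeatedly merges the most frequent adjacent pair into a new sub-word symbol.

```python
from collections import Counter

def learn_merges(corpus, num_merges):
    """Illustrative BPE training sketch (not the project's actual API)."""
    # Represent each word as a tuple of single-character symbols.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # Nothing left to merge.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        merged_words = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] = freq
        words = merged_words
    return merges
```

Tokenizing then amounts to replaying these learned merge rules, in order, on a new input sequence.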
from tokenizer import BPETokenizer
# Initialize the tokenizer with a corpus and a vocabulary size
tokenizer = BPETokenizer(corpus, vocabulary_size)
# Tokenize an input sequence
tokens = tokenizer.tokenize(input_sequence)
# Decode an encoded sequence
decoded_sequence = tokenizer.decode(tokens)
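The vocabulary mapping described above can be sketched as a pair of dictionaries. The class and method names here are illustrative assumptions, not the project's actual interface:

```python
class Vocabulary:
    """Minimal sketch of a token <-> index mapping (illustrative only)."""

    def __init__(self, tokens):
        # Assign each unique sub-word token a stable integer index.
        self.token_to_index = {}
        for token in tokens:
            if token not in self.token_to_index:
                self.token_to_index[token] = len(self.token_to_index)
        self.index_to_token = {i: t for t, i in self.token_to_index.items()}

    def encode(self, tokens):
        # Map sub-word tokens to their vocabulary indices.
        return [self.token_to_index[t] for t in tokens]

    def decode(self, indices):
        # Map indices back to sub-word tokens.
        return [self.index_to_token[i] for i in indices]
```

In a Transformer pipeline, the indices produced by `encode` are what get fed into the embedding layer.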
To run the project, use:
make run
To run the tests, use:
make test