This project implements the byte-pair encoding (BPE) algorithm and adapts it for tokenization: the tokenizer splits a given input sequence into sub-word tokens.
A vocabulary interface is also implemented, which handles the mapping between sub-word tokens and their corresponding indices in the vocabulary, and provides utility methods for use within the context of a Transformer model.
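The core BPE procedure can be sketched as follows. This is an illustrative implementation of the standard algorithm, not this project's exact code: starting from single-character symbols, it repeatedly merges the most frequent adjacent pair into a new sub-word symbol.

```python
from collections import Counter

def learn_merges(corpus, num_merges):
    """Illustrative BPE training sketch (not the project's actual API)."""
    # Represent each word as a tuple of single-character symbols.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # Nothing left to merge.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        merged_words = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] = freq
        words = merged_words
    return merges
```

Tokenizing then amounts to replaying these learned merge rules, in order, on a new input sequence.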
from tokenizer import BPETokenizer
# Initialize the tokenizer with a corpus and a vocabulary size
tokenizer = BPETokenizer(corpus, vocabulary_size)
# Tokenize an input sequence
tokens = tokenizer.tokenize(input_sequence)
# Decode an encoded sequence
decoded_sequence = tokenizer.decode(tokens)
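The vocabulary mapping described above can be sketched as a pair of dictionaries. The class and method names here are illustrative assumptions, not the project's actual interface:

```python
class Vocabulary:
    """Minimal sketch of a token <-> index mapping (illustrative only)."""

    def __init__(self, tokens):
        # Assign each unique sub-word token a stable integer index.
        self.token_to_index = {}
        for token in tokens:
            if token not in self.token_to_index:
                self.token_to_index[token] = len(self.token_to_index)
        self.index_to_token = {i: t for t, i in self.token_to_index.items()}

    def encode(self, tokens):
        # Map sub-word tokens to their vocabulary indices.
        return [self.token_to_index[t] for t in tokens]

    def decode(self, indices):
        # Map indices back to sub-word tokens.
        return [self.index_to_token[i] for i in indices]
```

In a Transformer pipeline, the indices produced by `encode` are what get fed into the embedding layer.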
To run the project, use:
make run
To run the tests, use:
make test