BPE dropout not working as expected #201

Closed
rgdl opened this issue Mar 19, 2020 · 7 comments · Fixed by #519

rgdl commented Mar 19, 2020

I'm using ByteLevelBPETokenizer together with fastai.text to make a text classifier in Python, and experimenting with BPE dropout.

When I create an empty model and train on my own text corpus, BPE dropout doesn't seem to work.

import json
import pathlib

import tokenizers


# Make an empty tokeniser then train on some text

with open('train.txt', 'w') as f:
    f.writelines(line + '\n' for line in ['hi, how are you?', 'fine, how about you?', 'ababab'])

bpe_tokeniser_from_empty = tokenizers.ByteLevelBPETokenizer(dropout=0.5)
bpe_tokeniser_from_empty.train('train.txt', vocab_size=500)

print('dropout =', bpe_tokeniser_from_empty._parameters['dropout'])
for _ in range(20):
    print(bpe_tokeniser_from_empty.encode('ab').tokens)

output:

dropout = 0.5
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']

However, when I create a ByteLevelBPETokenizer from vocab and merges files, BPE dropout does work as I expect (probabilistic merging). Did I get something wrong with the first method, or is this a bug?

# Create dummy vocab and merges files to build tokeniser from

with open('vocab.json', 'w') as f:
    json.dump({'a': 0, 'b': 1, 'ab': 2}, f)

with open('merges.txt', 'w') as f:
    f.write('#version 1\na b')

bpe_tokeniser_from_files = tokenizers.ByteLevelBPETokenizer(
    vocab_file='vocab.json',
    merges_file='merges.txt',
    dropout=0.5,
)

print('dropout =', bpe_tokeniser_from_files._parameters['dropout'])
for _ in range(20):
    print(bpe_tokeniser_from_files.encode('ab').tokens)

output:

dropout = 0.5
['a', 'b']
['ab']
['ab']
['ab']
['a', 'b']
['ab']
['a', 'b']
['a', 'b']
['ab']
['a', 'b']
['a', 'b']
['ab']
['ab']
['ab']
['ab']
['a', 'b']
['a', 'b']
['ab']
['a', 'b']
['ab']

I came up with this workaround: train on the source text, then save the vocab and merges files and use them to build a new tokeniser.

# My workaround

# Train a tokeniser on my data (BPE dropout will not work for this one)
bpe_tokeniser_from_empty = tokenizers.ByteLevelBPETokenizer(dropout=0.5)
bpe_tokeniser_from_empty.train('train.txt', vocab_size=500)

# Export the files
vocab_file, merges_file = bpe_tokeniser_from_empty.save('.')

# Create from files
bpe_tokeniser_from_files = tokenizers.ByteLevelBPETokenizer(
    vocab_file=pathlib.Path(vocab_file).name,
    merges_file=pathlib.Path(merges_file).name,
    dropout=0.5,
)

print('dropout =', bpe_tokeniser_from_files._parameters['dropout'])
for _ in range(20):
    print(bpe_tokeniser_from_files.encode('ab').tokens)

output:

dropout = 0.5
['a', 'b']
['a', 'b']
['a', 'b']
['ab']
['ab']
['a', 'b']
['a', 'b']
['a', 'b']
['a', 'b']
['a', 'b']
['ab']
['ab']
['ab']
['a', 'b']
['a', 'b']
['ab']
['a', 'b']
['ab']
['a', 'b']
['ab']

KappaDistributive (Contributor) commented May 24, 2020

This indeed appears to be a bug.

bpe_tokeniser_from_empty.train(...) creates a new BPE from a trainer, which is itself created on line 578 of tokenizers/src/models/bpe/trainer.rs. However, no dropout is communicated to that Builder, so it defaults to None (regardless of _parameters["dropout"] being set correctly on the Python side).

Since I've only just started looking at this codebase, I'm not sure what the best way of fixing this might be. But if someone is willing to point me in the right direction, I'd be happy to help and implement a fix.

n1t0 (Member) commented May 28, 2020

Yes indeed, the dropout parameter does not get forwarded to the new BPE created during training. This is a bit tricky though, and it may require a lot of changes to fix properly.

esceptico (Contributor) commented

Any updates? Or does anyone have a workaround for this (other than the original poster's solution)?

n1t0 (Member) commented Oct 23, 2020

Yes, as a workaround you can still reload the model like this:

from tokenizers.models import BPE

files = tokenizer.model.save("./", "workaround")
tokenizer.model = BPE.from_files(*files, dropout=0.1, unk_token="[UNK]")
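
For the ByteLevelBPETokenizer from the original report, the BPE model lives on the wrapped backend tokenizer, so the reassignment goes through _tokenizer. Below is a minimal end-to-end sketch, assuming the train.txt from the first snippet and the tokenizers API available at the time of this thread (newer releases expose BPE.from_file instead of BPE.from_files):

import tokenizers
from tokenizers.models import BPE

# Train as in the original report; the dropout passed here is ignored by the trained model
tokeniser = tokenizers.ByteLevelBPETokenizer(dropout=0.5)
tokeniser.train('train.txt', vocab_size=500)

# Save the trained BPE model, then reload it with dropout so it actually takes effect
files = tokeniser._tokenizer.model.save('.', 'workaround')
tokeniser._tokenizer.model = BPE.from_files(*files, dropout=0.5)

# Tokenisation should now be stochastic: sometimes ['ab'], sometimes ['a', 'b']
for _ in range(10):
    print(tokeniser.encode('ab').tokens)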

memray commented Nov 2, 2020

Hi @n1t0!

I'm using RobertaTokenizerFast, but this trick doesn't seem to work. Do you have any idea why?

tokenizer = RobertaTokenizerFast.from_pretrained(args.pretrained_vocab, cache_dir=args.cache_dir, dropout=args.bpe_dropout)
workaround_files = tokenizer._tokenizer.model.save(args.cache_dir, 'workaround')
tokenizer.model = type(tokenizer._tokenizer.model)(*workaround_files, dropout=float(args.bpe_dropout))

Thanks!
Rui

n1t0 (Member) commented Nov 2, 2020

In your snippet, I think you should be reassigning to tokenizer._tokenizer.model as well, so the backend model is actually overridden. It should work that way.
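
For reference, a sketch of the corrected version of the snippet above, reusing the same (hypothetical) args from that message, with the new model assigned to the backend tokenizer rather than to the wrapper:

from transformers import RobertaTokenizerFast

# Load the fast tokenizer as before (args comes from the snippet above)
tokenizer = RobertaTokenizerFast.from_pretrained(args.pretrained_vocab, cache_dir=args.cache_dir)

# Rebuild the backend BPE model with dropout and reassign it on tokenizer._tokenizer,
# not on the transformers wrapper itself
workaround_files = tokenizer._tokenizer.model.save(args.cache_dir, 'workaround')
BackendBPE = type(tokenizer._tokenizer.model)
tokenizer._tokenizer.model = BackendBPE(*workaround_files, dropout=float(args.bpe_dropout))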

memray commented Nov 2, 2020

Ahh my bad. Yes, it works!

Thanks!
Rui

n1t0 closed this as completed in #519 on Nov 20, 2020.