BPE dropout not working as expected #201

Closed
rgdl opened this issue Mar 19, 2020 · 7 comments · Fixed by #519

rgdl commented Mar 19, 2020

I'm using ByteLevelBPETokenizer together with fastai.text to make a text classifier in Python, and experimenting with BPE dropout.

When I create an empty model and train on my own text corpus, BPE dropout doesn't seem to work.

import json
import pathlib

import tokenizers


# Make an empty tokeniser then train on some text

with open('train.txt', 'w') as f:
    f.writelines(line + '\n' for line in ['hi, how are you?', 'fine, how about you?', 'ababab'])

bpe_tokeniser_from_empty = tokenizers.ByteLevelBPETokenizer(dropout=0.5)
bpe_tokeniser_from_empty.train('train.txt', vocab_size=500)

print('dropout =', bpe_tokeniser_from_empty._parameters['dropout'])
for _ in range(20):
    print(bpe_tokeniser_from_empty.encode('ab').tokens)

output:

dropout = 0.5
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']
['ab']

However, when I create a ByteLevelBPETokenizer from vocab and merges files, BPE dropout does work as I expect (probabilistic merging). Did I get something wrong with the first method, or is this a bug?

# Create dummy vocab and merges files to build tokeniser from

with open('vocab.json', 'w') as f:
    json.dump({'a': 0, 'b': 1, 'ab': 2}, f)

with open('merges.txt', 'w') as f:
    f.write('#version 1\na b')

bpe_tokeniser_from_files = tokenizers.ByteLevelBPETokenizer(
    vocab_file='vocab.json',
    merges_file='merges.txt',
    dropout=0.5,
)

print('dropout =', bpe_tokeniser_from_files._parameters['dropout'])
for _ in range(20):
    print(bpe_tokeniser_from_files.encode('ab').tokens)

output:

dropout = 0.5
['a', 'b']
['ab']
['ab']
['ab']
['a', 'b']
['ab']
['a', 'b']
['a', 'b']
['ab']
['a', 'b']
['a', 'b']
['ab']
['ab']
['ab']
['ab']
['a', 'b']
['a', 'b']
['ab']
['a', 'b']
['ab']

I came up with this workaround: train on the source text, then save the vocab and merges files and use them to build a new tokeniser.

# My workaround

# Train a tokeniser on my data (BPE dropout will not work for this one)
bpe_tokeniser_from_empty = tokenizers.ByteLevelBPETokenizer(dropout=0.5)
bpe_tokeniser_from_empty.train('train.txt', vocab_size=500)

# Export the files
vocab_file, merges_file = bpe_tokeniser_from_empty.save('.')

# Create from files
bpe_tokeniser_from_files = tokenizers.ByteLevelBPETokenizer(
    vocab_file=pathlib.Path(vocab_file).name,
    merges_file=pathlib.Path(merges_file).name,
    dropout=0.5,
)

print('dropout =', bpe_tokeniser_from_files._parameters['dropout'])
for _ in range(20):
    print(bpe_tokeniser_from_files.encode('ab').tokens)

output:

dropout = 0.5
['a', 'b']
['a', 'b']
['a', 'b']
['ab']
['ab']
['a', 'b']
['a', 'b']
['a', 'b']
['a', 'b']
['a', 'b']
['ab']
['ab']
['ab']
['a', 'b']
['a', 'b']
['ab']
['a', 'b']
['ab']
['a', 'b']
['ab']

KappaDistributive (Contributor) commented May 24, 2020

This indeed appears to be a bug.

bpe_tokeniser_from_empty.train(...) creates a new BPE from a trainer, which is itself created on line 578 of tokenizers/src/models/bpe/trainer.rs. However, no dropout is communicated to that Builder, so it defaults to None (regardless of _parameters["dropout"] being set correctly on the Python side).

Since I've only just started looking at this codebase, I'm not sure what the best way of fixing this might be. But if someone is willing to point me in the right direction, I'd be happy to help and implement a fix.

n1t0 (Member) commented May 28, 2020

Yes indeed, the dropout parameter does not get forwarded to the new BPE created during training. This is a bit tricky though, and it may require a lot of changes to fix properly.

esceptico (Contributor) commented

Any updates? Or does anyone have a workaround for this (other than the original poster's solution)?

n1t0 (Member) commented Oct 23, 2020

Yes, as a workaround you can still reload the model like this:

from tokenizers.models import BPE

files = tokenizer.model.save("./", "workaround")
tokenizer.model = BPE.from_files(*files, dropout=0.1, unk_token="[UNK]")
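
For the ByteLevelBPETokenizer from the original report, the BPE model lives on the wrapped backend tokenizer, so the reassignment goes through _tokenizer. Below is a minimal end-to-end sketch, assuming the train.txt from the first snippet and the tokenizers API available at the time of this thread (newer releases expose BPE.from_file instead of BPE.from_files):

import tokenizers
from tokenizers.models import BPE

# Train as in the original report; the dropout passed here is ignored by the trained model
tokeniser = tokenizers.ByteLevelBPETokenizer(dropout=0.5)
tokeniser.train('train.txt', vocab_size=500)

# Save the trained BPE model, then reload it with dropout so it actually takes effect
files = tokeniser._tokenizer.model.save('.', 'workaround')
tokeniser._tokenizer.model = BPE.from_files(*files, dropout=0.5)

# Tokenisation should now be stochastic: sometimes ['ab'], sometimes ['a', 'b']
for _ in range(10):
    print(tokeniser.encode('ab').tokens)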

memray commented Nov 2, 2020

Hi @n1t0!

I'm using RobertaTokenizerFast, but this trick doesn't seem to work. Do you have any idea why?

tokenizer = RobertaTokenizerFast.from_pretrained(args.pretrained_vocab, cache_dir=args.cache_dir, dropout=args.bpe_dropout)
workaround_files = tokenizer._tokenizer.model.save(args.cache_dir, 'workaround')
tokenizer.model = type(tokenizer._tokenizer.model)(*workaround_files, dropout=float(args.bpe_dropout))

Thanks!
Rui

n1t0 (Member) commented Nov 2, 2020

In your snippet, I think you should be reassigning to tokenizer._tokenizer.model as well, so the backend model is actually overridden. It should work that way.
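
For reference, a sketch of the corrected version of the snippet above, reusing the same (hypothetical) args from that message, with the new model assigned to the backend tokenizer rather than to the wrapper:

from transformers import RobertaTokenizerFast

# Load the fast tokenizer as before (args comes from the snippet above)
tokenizer = RobertaTokenizerFast.from_pretrained(args.pretrained_vocab, cache_dir=args.cache_dir)

# Rebuild the backend BPE model with dropout and reassign it on tokenizer._tokenizer,
# not on the transformers wrapper itself
workaround_files = tokenizer._tokenizer.model.save(args.cache_dir, 'workaround')
BackendBPE = type(tokenizer._tokenizer.model)
tokenizer._tokenizer.model = BackendBPE(*workaround_files, dropout=float(args.bpe_dropout))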

memray commented Nov 2, 2020

Ahh my bad. Yes, it works!

Thanks!
Rui

n1t0 closed this as completed in #519 on Nov 20, 2020.