How to tokenize a testing phrase #37

edgarmg91 · 2023-12-01T17:48:54Z

Hi everyone! Thanks a lot for this nice tutorial and the code for learning transformers!

I am trying to recreate the sample of the tutorial:

And I was able to train and serialize a model for the IMDB Dataset.

Currently, I want to test the model with new validation phrases. Nevertheless, I cannot find a way to tokenize the phrase into the required data shape, as in the provided sample:

#Load dataset
tdata, _ = datasets.IMDB.splits(TEXT, LABEL)
train, test = tdata.split(split_ratio=0.8)

#Preprocess data
TEXT.build_vocab(train, max_size=50_000 - 2)
LABEL.build_vocab(train)

#Create iterators
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_size=4, device=util.d())

I see that the tokens are generated in some part of the BucketIterator (or the dataset itself):

for batch in tqdm.tqdm(test_iter):

    input = batch.text[0]
    label = batch.label - 1

In the dataset itself, I can see the phrases split into words:

print(test_iter.data()[0].text)
print(test_iter.data()[0].label)

generates:

['i', "wouldn't", 'rent', 'this', 'one', 'even', 'on', 'dollar', 'rental', 'night.']
neg

So, suppose I want to test a phrase with the model, like:

#Try the model
input = ["this", "movie", "is", "incredible", "boring"]
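My understanding is that each word has to be looked up in the vocabulary created by TEXT.build_vocab and the resulting indices wrapped as a batch of size one, matching what batch.text[0] contains. Here is a minimal sketch of that idea in plain Python. The helpers build_vocab and numericalize, the "<unk>"/"<pad>" entries, and the toy training data are hypothetical stand-ins for illustration, not the tutorial's or torchtext's actual API:

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # assumed special tokens, placed first in the vocabulary


def build_vocab(token_lists, max_size=None):
    """Map each token to an integer index, most frequent tokens first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(tokens, stoi):
    """Turn one tokenized phrase into a batch of size 1 of vocabulary indices.

    Words missing from the vocabulary fall back to the UNK index.
    """
    return [[stoi.get(tok, stoi[UNK]) for tok in tokens]]


# Toy stand-in for the training split (hypothetical data).
train_tokens = [
    ["this", "movie", "is", "boring"],
    ["this", "movie", "is", "great"],
]
stoi = build_vocab(train_tokens)

phrase = ["this", "movie", "is", "incredible", "boring"]
batch = numericalize(phrase, stoi)  # nested list with shape (1, len(phrase))
print(batch)
```

With the real objects, I assume the lookup table would be TEXT.vocab.stoi instead of the toy dictionary, and the nested list would be converted to a tensor on the same device as the model. The phrase would also need the same preprocessing (splitting, casing) that was applied to the training data.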

How can I tokenize the words correctly so that I can feed them into the model?

Thanks in advance for your response.

Greetings!
