Custom vocab.txt #2649

NatLee · 2022-06-27T06:19:01Z

Hello everyone,

I'd like to ask whether the vocab.txt of the pretrained ernie-1.0 can be extended by the user.

For example, with the following tokenizer:

import paddlenlp as ppnlp
tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained("ernie-1.0")

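Can we add new tokens of our own?

I searched the issue tracker and found that someone had asked the same thing about ErnieGramTokenizer:

#2022

But I'm not sure whether ErnieForSequenceClassification likewise cannot be extended.

Thanks!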


ZHUI · 2022-06-27T06:37:21Z

The ernie-1.0 vocabulary contains some unused tokens. If you are not adding many new tokens, you can try replacing the unused ones.
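For readers who want to try the replacement approach, here is a minimal sketch in plain Python. It assumes the placeholder entries are named `[unused1]`, `[unused2]`, and so on (the BERT-style convention; check the actual vocab.txt before relying on it). The helper name `replace_unused_tokens` is hypothetical and is not part of PaddleNLP.

```python
import re

# Assumption: placeholder entries look like "[unused1]", "[unused2]", ...
UNUSED_PATTERN = re.compile(r"^\[unused\d+\]$")


def replace_unused_tokens(vocab, new_tokens):
    """Return a copy of `vocab` with unused placeholders swapped for `new_tokens`.

    Token ids (list positions) do not change, so the pretrained embedding
    matrix keeps its original shape.
    """
    existing = set(vocab)
    # Skip tokens already in the vocabulary and drop duplicates, keeping order.
    to_add = []
    for token in new_tokens:
        if token not in existing and token not in to_add:
            to_add.append(token)

    # Positions of the placeholder entries, in file order.
    slots = [i for i, entry in enumerate(vocab) if UNUSED_PATTERN.match(entry)]
    if len(to_add) > len(slots):
        raise ValueError(
            f"{len(to_add)} new tokens but only {len(slots)} unused slots"
        )

    result = list(vocab)
    for slot, token in zip(slots, to_add):
        result[slot] = token
    return result


# Toy vocabulary; in practice the list would come from the lines of vocab.txt.
vocab = ["[PAD]", "[CLS]", "[unused1]", "[unused2]", "的"]
print(replace_unused_tokens(vocab, ["新詞", "的"]))
```

The rows that used to belong to the placeholders start out effectively untrained, so the new tokens would still need fine-tuning to become useful.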

NatLee · 2022-06-27T06:43:59Z

@ZHUI Thanks for the reply!

But there are only ninety-odd unused tokens. If I need more than that, does that mean I can't add any more?

ZHUI · 2022-06-27T06:53:45Z

If needed, you can resize the vocab yourself. There is an example of resize_position_embeddings here: #2513

NatLee · 2022-06-27T07:16:24Z

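@ZHUI Thanks for the reply!

That example looks like it concerns the resize_position_embeddings parameter.

In the default configuration, I see that vocab_size in init_args is only 18000.

My question is: if I load the pretrained model, is there a way for me to change this setting?

Thanks!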

ZHUI · 2022-06-27T07:34:55Z

That doesn't matter. The code here simply assigns the embedding again:

PaddleNLP/paddlenlp/transformers/layoutlmv2/modeling.py

Lines 825 to 834 in c92810b

    self.embeddings.position_embeddings = nn.Embedding(
        self.config["max_position_embeddings"], self.config["hidden_size"])
    with paddle.no_grad():
        if num_position_embeds_diff > 0:
            self.embeddings.position_embeddings.weight[:
                -num_position_embeds_diff] = old_position_embeddings_weight
        else:
            self.embeddings.position_embeddings.weight = old_position_embeddings_weight[:
                num_position_embeds_diff]
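The same copy-then-extend idea can be applied to a token embedding table when the vocabulary grows. The sketch below uses plain Python lists rather than Paddle tensors so that it runs without any framework. The function name `resize_embedding_rows`, the normal initialisation, and the default `init_std=0.02` are assumptions for illustration, not PaddleNLP's actual implementation.

```python
import random


def resize_embedding_rows(old_weight, new_num_rows, init_std=0.02, seed=0):
    """Return a new embedding table with `new_num_rows` rows.

    Rows present in both tables are copied from `old_weight`; any extra
    rows are drawn from a normal distribution with std `init_std`.
    """
    if not old_weight:
        raise ValueError("old_weight must contain at least one row")
    if new_num_rows < 0:
        raise ValueError("new_num_rows must be non-negative")

    hidden_size = len(old_weight[0])
    rng = random.Random(seed)

    # Shrinking or same size: keep only the leading rows.
    new_weight = [list(row) for row in old_weight[:new_num_rows]]

    # Growing: append freshly initialised rows for the new token ids.
    for _ in range(new_num_rows - len(new_weight)):
        new_weight.append([rng.gauss(0.0, init_std) for _ in range(hidden_size)])
    return new_weight


# Toy table with 2 tokens and hidden size 2, grown to 4 tokens.
old = [[0.1, 0.2], [0.3, 0.4]]
grown = resize_embedding_rows(old, 4)
print(len(grown), grown[:2] == old)
```

Existing token ids keep their pretrained vectors, while new ids start from small random values and are learned during fine-tuning. In a real model, any output layer tied to the vocabulary size would have to be resized as well.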

NatLee · 2022-06-27T08:03:18Z

@ZHUI Roughly when will this feature be merged into the main branch?

ZHUI · 2022-06-27T08:28:35Z

Sorry, you can try this one instead: https://github.com/PaddlePaddle/PaddleNLP/pull/2423/files
resize_token_embeddings

NatLee · 2022-06-28T02:40:16Z

@ZHUI This is on the develop branch. Is it expected to be released later?

ZHUI · 2022-06-28T02:53:41Z

There should be a release within this week.

ZHUI self-assigned this Jun 27, 2022

NatLee closed this as completed Aug 17, 2022
