Question about sentence length #14

Open
MWXGOD opened this issue Sep 3, 2023 · 1 comment

MWXGOD commented Sep 3, 2023

Hello, I have a question about this code:
def gen_features(tokens,labels,tokenizer,tag2id,max_len):
    tags,input_ids,token_type_ids,attention_masks,lengths = [],[],[],[],[]
    for i,(token,label) in enumerate(zip(tokens,labels)):
        sentence = ''.join(token)
        lengths.append(len(sentence))
        if len(token) >= max_len - 2:
            label = labels[i][0:max_len - 2]
        label = [tag2id['O']] + [tag2id[i] for i in label] + [tag2id['O']]
        if len(label) < max_len:
            label = label + [tag2id['O']] * (max_len - len(label))

        assert len(label) == max_len
        tags.append(label)

        inputs = tokenizer.encode_plus(sentence, max_length=max_len,pad_to_max_length=True,return_tensors='pt')
        input_id,token_type_id,attention_mask = inputs['input_ids'],inputs['token_type_ids'],inputs['attention_mask']
        input_ids.append(input_id)
        token_type_ids.append(token_type_id)
        attention_masks.append(attention_mask)
    return input_ids,token_type_ids,attention_masks,tags,lengths

On the fourth line of the code, sentence = ''.join(token), the sentence length is the number of characters, but decoding uses the code below:
def trans2label(id2tag,data,lengths):
    new = []
    for i,line in enumerate(data):
        tmp = [id2tag[word] for word in line]
        tmp = tmp[1:1 + lengths[i]]
        new.append(tmp)
    return new
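My understanding is this: if a sentence has 10 words but the output label sequence has length 128, tmp takes the first ten labels, with position 0 being the label for [CLS].
So the slice [1:1+lengths[i]] should cover exactly the original sentence length (the length without padding). In that case, shouldn't lengths be the token-level length? Why does the code use the char-level length?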
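To make the mismatch concrete, here is a minimal sketch that runs the same slicing logic on toy data. The sentence, the tag set, and `max_len = 16` are made up for illustration and do not come from the repository. It suggests that char-level and token-level lengths coincide when every token is a single character (as in character-tokenized Chinese data, which may be what the original code assumes), but diverge for English word tokens. Note also that `''.join(token)` drops the spaces between English words before the text reaches the tokenizer.

```python
def trans2label(id2tag, data, lengths):
    """Same slicing logic as the function quoted above."""
    new = []
    for i, line in enumerate(data):
        tmp = [id2tag[word] for word in line]
        tmp = tmp[1:1 + lengths[i]]  # skip position 0 ([CLS]), keep lengths[i] labels
        new.append(tmp)
    return new


# Hypothetical toy data, not taken from NCBI or the repository.
id2tag = {0: "O", 1: "B-Disease", 2: "I-Disease"}
max_len = 16
tokens = ["patients", "with", "diabetes", "mellitus"]
# [CLS] label + 4 word labels + [SEP] label, then padding up to max_len.
label_ids = [0, 0, 0, 1, 2, 0] + [0] * (max_len - 6)

char_len = len("".join(tokens))  # 28: characters of "patientswithdiabetesmellitus"
token_len = len(tokens)          # 4: number of word tokens

by_char = trans2label(id2tag, [label_ids], [char_len])[0]
by_token = trans2label(id2tag, [label_ids], [token_len])[0]

print(len(by_char))  # 15: the slice runs past the sentence into [SEP] and padding
print(by_token)      # ['O', 'O', 'B-Disease', 'I-Disease']

# When each token is one character, both lengths agree.
zh_tokens = list("糖尿病患者")
print(len("".join(zh_tokens)) == len(zh_tokens))  # True
```

With char-level lengths on word-tokenized English, the decoded sequence no longer has one label per word, so it would not line up with the gold labels during evaluation.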
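Also, I could not get the OntoNotes 5 dataset, so I used an arbitrary disease dataset, NCBI, for pre-training; that is, I ran LabelSemantics.py, with Microsoft's pubmed model. I tried:
(1) char-level lengths: over 40 epochs the loss started very high and decreased very slowly, and Acc: 0, Recall: 0, F1: 0.
(2) token-level lengths: after 1 epoch the loss was already very small, but Acc: 0, Recall: 0, F1: 0 were still all zero.
I am not sure whether I made a mistake somewhere or did something that should be avoided.
I am new to this field, so my question may be basic; I would appreciate your help. Thank you very much.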

@Qian-Xiong

Check whether the input and labels passed to the model are correct.
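One way such a check might look is sketched below. This is not the repository's code: `align_labels`, `continuation`, and `toy_tokenize` are hypothetical helpers, and `toy_tokenize` only imitates a WordPiece-style split so the example runs without extra dependencies. In practice one would pass the real tokenizer's `tokenize` method instead. The idea is to expand word-level labels to subword pieces so that every position in the model input has exactly one label, then print both side by side.

```python
def toy_tokenize(word, size=4):
    """Hypothetical stand-in for a subword tokenizer: chunks of `size` characters."""
    chunks = [word[i:i + size] for i in range(0, len(word), size)]
    return chunks[:1] + ["##" + c for c in chunks[1:]]


def continuation(label):
    """Label for non-initial pieces of a word: B-X becomes I-X, others stay."""
    return "I-" + label[2:] if label.startswith("B-") else label


def align_labels(words, word_labels, tokenize, max_len, pad_label="O"):
    """Expand word-level labels to subword level (one simple convention)."""
    pieces, piece_labels = [], []
    for word, label in zip(words, word_labels):
        sub = tokenize(word)
        pieces.extend(sub)
        piece_labels.extend([label] + [continuation(label)] * (len(sub) - 1))
    # Truncate to leave room for [CLS] and [SEP].
    pieces = pieces[:max_len - 2]
    piece_labels = piece_labels[:max_len - 2]
    length = len(pieces)  # subword-level length, usable for decoding slices
    pieces = ["[CLS]"] + pieces + ["[SEP]"]
    piece_labels = [pad_label] + piece_labels + [pad_label]
    pad = max_len - len(pieces)
    pieces += ["[PAD]"] * pad
    piece_labels += [pad_label] * pad
    return pieces, piece_labels, length


# Hypothetical toy sentence, not taken from NCBI.
words = ["patients", "with", "diabetes", "mellitus"]
word_labels = ["O", "O", "B-Disease", "I-Disease"]

pieces, piece_labels, length = align_labels(words, word_labels, toy_tokenize, max_len=12)

assert len(pieces) == len(piece_labels) == 12
for piece, label in zip(pieces, piece_labels):
    print(f"{piece:10s} {label}")
```

If the printed pairs for a real batch show entity labels sitting on the wrong pieces, or on `[PAD]` positions, that would be consistent with metrics staying at zero even while the loss decreases.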
