A custom dictionary bin file generated by reload throws ArrayIndexOutOfBoundsException when loaded again #1028

BlackPoint-CX · 2018-11-21T07:25:09Z

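Checklist

Please confirm the following:

I have carefully read the following documents and found no answer:
I have searched for my problem with Google and the issue search and found no answer.
I understand that the open-source community is a free community formed out of interest and bears no responsibility or obligation. I will be polite and thank everyone who helps me.
I put an x in these brackets to confirm the items above.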

Version

The current latest version is: 1.7.0
The version I am using is: 1.6.8

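My problem

When a new custom dictionary file is added, I call the reload method of CustomDictionary to generate and load a new CustomDictionary.txt.bin file, so that words from the newly added custom dictionary are segmented correctly.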

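Reproducing the problem

The problem: the new bin file generated by reload cannot be read correctly the next time the program starts, and the following is thrown: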

十一月 21, 2018 2:41:38 下午 com.hankcs.hanlp.dictionary.CustomDictionary loadDat
警告: 读取失败，问题发生在java.lang.ArrayIndexOutOfBoundsException: 149
	at com.hankcs.hanlp.dictionary.CustomDictionary.loadDat(CustomDictionary.java:327)
	at com.hankcs.hanlp.dictionary.CustomDictionary.loadMainDictionary(CustomDictionary.java:64)
	at com.hankcs.hanlp.dictionary.CustomDictionary.<clinit>(CustomDictionary.java:51)
	at com.hankcs.hanlp.seg.Segment.combineByCustomDictionary(Segment.java:203)
	at com.hankcs.hanlp.seg.Viterbi.ViterbiSegment.segSentence(ViterbiSegment.java:57)
	at com.hankcs.hanlp.seg.Segment.seg(Segment.java:573)
	at com.hankcs.hanlp.tokenizer.StandardTokenizer.segment(StandardTokenizer.java:50)
	at com.hankcs.hanlp.HanLP.segment(HanLP.java:626)

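In addition, the bin file generated by reload differs in size from the one generated when HanLP initializes.
In my case, the bin file generated by reload is 28788656 bytes, while the one generated by HanLP initialization is 28788888 bytes.
I have confirmed that this is not caused by the process exiting before the file was fully written.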

Steps

Trigger code

from pyhanlp import HanLP, SafeJClass

# Rebuild the custom dictionary cache in this process, then segment once.
custom_dictionary = SafeJClass('com.hankcs.hanlp.dictionary.CustomDictionary')
custom_dictionary.reload()
HanLP.segment('今天天气不错')

After running the code above, a bin file is generated; for clarity it is referred to as reload.bin.

from pyhanlp import HanLP, SafeJClass

# reload() is deliberately skipped here, so HanLP tries to load the existing reload.bin.
# custom_dictionary = SafeJClass('com.hankcs.hanlp.dictionary.CustomDictionary')
# custom_dictionary.reload()
HanLP.segment('今天天气不错')

Running the code above throws the exception described earlier and also generates a new bin file, referred to as init.bin.

reload.bin is 28788656 bytes
init.bin is 28788888 bytes
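To make the size comparison repeatable, here is a minimal sketch. It assumes each generated bin file has been copied aside after the corresponding run; the helper name `size_gap` and any file names passed to it are hypothetical and not part of HanLP or pyhanlp.

```python
import os


def size_gap(path_a, path_b):
    """Return how many bytes larger path_b is than path_a (negative if smaller)."""
    return os.path.getsize(path_b) - os.path.getsize(path_a)


# Sizes reported in this issue, in bytes.
RELOAD_BIN_SIZE = 28788656
INIT_BIN_SIZE = 28788888
print(INIT_BIN_SIZE - RELOAD_BIN_SIZE)  # prints 232
```

With saved copies of both files, `size_gap(reload_copy, init_copy)` should reproduce the same 232-byte difference.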

Expected output

Expected output

Actual output

Actual output

Other information

2018-11-21 17:00:55
My guess is that some logic in loadDat runs during program initialization but not during reload, so the content written to the file differs.

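2018-11-22 10:33:15
I reproduced the process again and looked through other issues. Past issues were all caused by the reflection mechanism in older versions, which made custom natures (part-of-speech tags) fail. Since I am on version 1.6.8, that problem should not apply. I also tested with several strategies: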

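Strategy: init.bin exists locally. Result: it loads without error and segmentation works.
Strategy: no init.bin locally. Result: init.bin is generated normally, loads without error, and segmentation works.
Strategy: init.bin exists locally, and reload is called between the first and second segmentation. Result: reload.bin is generated normally and both segmentations work; after the task ends, segmentation in another task cannot load reload.bin, and init.bin is regenerated.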

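reload also deletes the local bin file and then calls loadMainDictionary again to generate a new one, which is the same logic as initialization, yet the generated file has a different size and cannot be read. I suspect that some global variables are not initialized during reload, so the bin file it generates is missing some content.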


BlackPoint-CX · 2018-11-22T02:44:41Z

@hankcs Could you please take a look at this issue? Thanks.

BlackPoint-CX · 2018-11-22T05:19:27Z

@hankcs The cause is the line defaultNature = LexiconUtility.convertStringToNature(nature, customNatureCollector); in the loadMainDictionary method of CustomDictionary.java.
The detailed logic:
During reload, the earlier custom natures have already been recorded in Nature.values, so defaultNature is returned as the corresponding Nature instance from values and is not recorded in customNatureCollector. As a result, customNatureCollector is empty by the time IOUtil.writeCustomNature(out, customNatureCollector); runs.
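The mechanism described above can be illustrated with a simplified Python model. This is a sketch under stated assumptions, not HanLP's code: the class `NatureRegistry`, the functions `write_cache` and `load_cache`, the three stand-in built-in natures, and the byte layout are all invented for illustration. It only shows how a collector that records newly created natures ends up empty on a second build in the same process, and how a fresh process then fails on the stored ordinal.

```python
import io
import struct

BUILTIN = ["n", "v", "a"]  # stand-ins for the built-in natures


class NatureRegistry:
    """Process-wide nature table; a nature's ordinal is its index in `names`."""

    def __init__(self):
        self.names = list(BUILTIN)

    def convert(self, name, collector):
        # Only a nature that is new to the registry is recorded in the collector.
        if name not in self.names:
            self.names.append(name)
            collector.append(name)
        return self.names.index(name)


def write_cache(registry, entries):
    """Serialize entries; the header lists only the natures collected in this call."""
    collector = []
    ordinals = [registry.convert(nature, collector) for _, nature in entries]
    out = io.BytesIO()
    out.write(struct.pack(">i", len(collector)))
    for name in collector:
        data = name.encode("utf-8")
        out.write(struct.pack(">i", len(data)) + data)
    out.write(struct.pack(">i", len(ordinals)))
    for ordinal in ordinals:
        out.write(struct.pack(">i", ordinal))
    return out.getvalue()


def load_cache(blob):
    """Load in a fresh process, where the registry holds only the built-ins."""
    registry = NatureRegistry()
    buf = io.BytesIO(blob)

    def read_int():
        return struct.unpack(">i", buf.read(4))[0]

    for _ in range(read_int()):
        length = read_int()
        registry.names.append(buf.read(length).decode("utf-8"))
    count = read_int()
    return [registry.names[read_int()] for _ in range(count)]


entries = [("word", "mynature")]
registry = NatureRegistry()
init_blob = write_cache(registry, entries)    # first build: "mynature" is collected
reload_blob = write_cache(registry, entries)  # same process: collector stays empty

print(load_cache(init_blob))                  # prints ['mynature']
print(len(reload_blob) < len(init_blob))      # prints True
try:
    load_cache(reload_blob)
except IndexError as error:
    print("load failed:", error)
```

In this model the second file is smaller because its header is empty, and loading it fails on the first ordinal that lies beyond the built-in natures. The 232-byte gap reported in this issue is at least consistent with a missing header, although the model does not prove it.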

BlackPoint-CX · 2018-11-22T05:29:15Z

Suggested fix: IOUtil.writeCustomNature(out, Nature.values());
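The effect of this suggestion can be sketched with a small self-contained Python model. It is an illustration only, not the code of the eventual commit: `load_table`, the three stand-in built-in natures, and the create-if-absent behaviour on load are assumptions made for the example.

```python
BUILTIN = ("n", "v", "a")  # stand-ins for the built-in natures


def load_table(header):
    """Rebuild the nature table in a fresh process from a cache header."""
    table = list(BUILTIN)
    for name in header:
        if name not in table:  # assumed create-if-absent behaviour
            table.append(name)
    return table


# State of a long-running process that has already registered one custom nature.
live_table = ["n", "v", "a", "mynature"]
collector_on_reload = []  # nothing is new during reload, so nothing is collected
stored_ordinal = live_table.index("mynature")  # ordinal written into the cache body

old_header = list(collector_on_reload)  # header built from the collector
new_header = list(live_table)           # proposal: write every known nature

print(stored_ordinal < len(load_table(old_header)))  # prints False
print(stored_ordinal < len(load_table(new_header)))  # prints True
```

Writing every known nature makes the header independent of what happened earlier in the process, at the cost of also storing the built-in names.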

hankcs · 2018-11-24T17:18:43Z

Thanks for the feedback. This has been fixed; please refer to the commit above.
If you still have problems, feel free to reopen the issue.

BlackPoint-CX changed the title ~~Custom dictionary bin file problem~~ A custom dictionary bin file generated by reload throws ArrayIndexOutOfBoundsException when loaded again Nov 22, 2018

hankcs closed this as completed in 9acb8c4 Nov 24, 2018

hankcs added the bug label Nov 24, 2018

hankcs added a commit that referenced this issue Jan 10, 2020

Make the cache file produced by hot reloading include user-defined natures fix #1028

91df40b
