TokenizationBenchmarks

Comparison of various supervised and unsupervised tokenization algorithms, evaluated via sentiment analysis on a Chinese corpus (ChnSentiCorp).

Results

Results are from a regularized logistic regression classifier trained on 5,205 examples and tested on 579 examples (a 90/10 split).

| Tokenizer    | Accuracy (%) |
| ------------ | ------------ |
| no tokenizer | 83.07        |
| jieba        | 89.32        |

| SPM                | vocab_size=2000 | vocab_size=4000 | vocab_size=8000 | vocab_size=16000 |
| ------------------ | --------------- | --------------- | --------------- | ---------------- |
| Unigram            | Aborted         | 87.21           | 90.43           | 90.08            |
| Byte Pair Encoding | Aborted         | 86.70           | 90.81           | 90.81            |
| Char               | 53.36           | 48.46           | 48.98           | 47.35            |
| Word               | 85.18           | 85.73           | 84.59           | Aborted          |

Aborted = the requested vocab_size was either too small or too large for that particular model to train.
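
For context, below is a minimal sketch of such a benchmark pipeline, not the repository's code: the file names (texts.txt, labels.txt), the TF-IDF bag-of-subwords features, and the hyperparameters are illustrative assumptions; only the SentencePiece tokenization, the 90/10 split, and the regularized logistic regression come from the description above.

```python
import sentencepiece as spm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed inputs: one review per line in texts.txt, one 0/1 label per line in labels.txt.
spm.SentencePieceTrainer.train(
    input="texts.txt",
    model_prefix="spm_unigram_8k",
    model_type="unigram",        # one of: unigram, bpe, char, word
    vocab_size=8000,
)
sp = spm.SentencePieceProcessor(model_file="spm_unigram_8k.model")

with open("texts.txt", encoding="utf-8") as f:
    # Encode each review into subword pieces, joined by spaces for the vectorizer below.
    texts = [" ".join(sp.encode(line.strip(), out_type=str)) for line in f]
with open("labels.txt", encoding="utf-8") as f:
    labels = [int(line.strip()) for line in f]

# 90/10 train/test split, bag-of-subwords features, L2-regularized logistic regression.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=0, stratify=labels
)
vec = TfidfVectorizer(token_pattern=r"\S+")   # each whitespace-separated piece is one feature
clf = LogisticRegression(max_iter=1000)       # penalty="l2" is the scikit-learn default
clf.fit(vec.fit_transform(X_train), y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(vec.transform(X_test))))
```

Varying model_type and vocab_size would cover the grid in the SPM table above; for the jieba row, the SentencePiece encoding step would be replaced by jieba.lcut.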
