Chinese-LS Logo

English|简体中文

What is Chinese-LS?

Lexical simplification (LS) aims to replace complex words in a given sentence with their simpler alternatives of equivalent meaning. Chinese-LS is the first attempt in the field of Chinese Lexical Simplification. It includes a high-quality benchmark dataset and five baseline approaches:

  • Synonym dictionary-based approach

  • Word embedding-based approach

  • Pretrained language model-based approach

  • Sememe-based approach

  • Hybrid approach

The entire framework of Chinese-LS is shown below:

Chinese-LS Framework

Quick start

Requirements

  • Python==3.7.6
  • transformers==3.5.0
  • numpy==1.18.1
  • jieba==0.42.1
  • torch==1.4.0
  • OpenHowNet==0.0.1a11
  • gensim==3.8.2

You can find the complete requirements here.
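
If you want to confirm that your environment matches these pins, an optional check like the one below can help; the expected versions simply restate the list above.

# Optional sanity check that installed package versions match the pins above.
import importlib

pins = {
    "transformers": "3.5.0",
    "numpy": "1.18.1",
    "jieba": "0.42.1",
    "torch": "1.4.0",
    "OpenHowNet": "0.0.1a11",
    "gensim": "3.8.2",
}
for name, expected in pins.items():
    module = importlib.import_module(name)
    installed = getattr(module, "__version__", "unknown")
    status = "OK" if installed == expected else "MISMATCH"
    print(f"{name}: installed {installed}, expected {expected} [{status}]")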

Preparations

Download Pretrained Models

Chinese-LS uses the following pretrained models:

Please place the models under the ./model directory after downloading.
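
For example, a downloaded Chinese BERT checkpoint placed under ./model can then be loaded with the transformers API as sketched below; the directory name bert-base-chinese is only an assumption, so substitute whichever model you actually downloaded.

# Hypothetical example of loading a local checkpoint from ./model.
from transformers import BertTokenizer, BertForMaskedLM

model_dir = "./model/bert-base-chinese"  # assumed directory name
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForMaskedLM.from_pretrained(model_dir)
model.eval()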

Run

We have already run the code, so the intermediate results can be found in ./data.

You can find the details of the code and algorithms in our paper: Chinese Lexical Simplification

To reproduce the results, please run the scripts in the following order:

Generate

  1. Synonym dictionary-based approach

    Run dict_generate.py

  2. Word embedding-based approach

    Run vector_generate.py

  3. Pretrained language model-based approach

    Run bert_generate.sh (a generic sketch of this step appears after this list)

  4. Sememe-based approach

    Run hownet_generate.py

  5. Hybrid approach

    Run hybrid_approach.py
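
To give a rough feel for what the pretrained language model step does, the sketch below generates candidate fillers with a masked language model. It only illustrates the general technique, not the actual logic of bert_generate.sh; the checkpoint path is an assumption, and the real pipeline produces word-level substitutes rather than the character-level fillers shown here.

# Generic masked-language-model candidate generation (not the repo's exact code).
import torch
from transformers import BertTokenizer, BertForMaskedLM

model_dir = "./model/bert-base-chinese"  # assumed local checkpoint
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForMaskedLM.from_pretrained(model_dir)
model.eval()

sentence = "这道题非常艰深。"
complex_word = "艰深"

# Replace the complex word with one [MASK] per character and query the model.
masked = sentence.replace(complex_word, tokenizer.mask_token * len(complex_word), 1)
inputs = tokenizer(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]  # transformers 3.x returns a tuple by default

# Top predictions for the first masked position.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
_, top_ids = logits[0, mask_index].topk(10)
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))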

Select

Run substitute_selection.py
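
substitute_selection.py filters the candidates produced in the generation step. As a generic illustration of substitute selection (not a description of the script's actual criteria), the sketch below keeps only candidates that are at least as frequent as the complex word, using made-up frequency counts.

# Generic substitute-selection sketch; the frequency counts are made up.
word_frequency = {"艰深": 120, "深奥": 300, "难懂": 2500, "晦涩": 90}

def select_substitutes(complex_word, candidates, freq):
    """Keep candidates that are at least as frequent as (and differ from) the complex word."""
    base = freq.get(complex_word, 0)
    return [c for c in candidates if c != complex_word and freq.get(c, 0) >= base]

print(select_substitutes("艰深", ["深奥", "难懂", "晦涩"], word_frequency))
# -> ['深奥', '难懂']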

Rank

Run substitute_ranking.py
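
substitute_ranking.py orders the selected candidates. As one common (but here purely illustrative) way to do this, the sketch below ranks candidates by a weighted sum of normalized feature scores; the features, weights, and numbers are assumptions, not the script's actual implementation.

# Generic substitute-ranking sketch: combine normalized feature scores.
def rank_substitutes(candidates, freq_score, sim_score, w_freq=0.5, w_sim=0.5):
    """Return candidates sorted best-first by a weighted sum of feature scores."""
    combined = {c: w_freq * freq_score.get(c, 0.0) + w_sim * sim_score.get(c, 0.0)
                for c in candidates}
    return sorted(candidates, key=lambda c: combined[c], reverse=True)

freq_score = {"深奥": 0.4, "难懂": 0.9}  # toy frequency scores in [0, 1]
sim_score = {"深奥": 0.8, "难懂": 0.6}   # toy similarity scores in [0, 1]
print(rank_substitutes(["深奥", "难懂"], freq_score, sim_score))
# -> ['难懂', '深奥']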

Experiments

Chinese-LS includes five experiments that evaluate the quality of our dataset and the performance of the five approaches. You can obtain the experiment results by running experiment.py.

Citation

@article{qiang2021chinese,
    title={Chinese Lexical Simplification},
    author={Qiang, Jipeng and Lu, Xinyu and Li, Yun and Yuan, Yun-Hao and Wu, Xindong},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    year={2021},
    volume={29},
    pages={1819-1828},
    doi={10.1109/TASLP.2021.3078361},
    publisher={IEEE}
}

Contact

This repo may still contain bugs, and we are working on improving reproducibility. Feel free to open an issue or submit a pull request to report or fix them.

Email: luxinyu12345@foxmail.com

License

Chinese-LS is under the Apache License, Version 2.0.
