HanziGraph


License: 996.icu

Visualization of information about Chinese characters via Neo4j, and text augmentation based on Chinese characters and words (typos, synonyms, antonyms, similar entities, numerics, etc.).

Introduction:

  • I try to integrate several open-source Chinese character and word corpora into one visualized graph, named HanziGraph, motivated by the need for character-level similarity comparison (sketched below), NLP data augmentation, and curiosity (。・ω・。)ノ

  • Furthermore, a lightweight Chinese text augmentation tool is provided, based on several clean word-level corpora that I integrated from open-source datasets such as The Contemporary Chinese Dictionary and BigCilin.
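
To make the similarity demand concrete, here is a minimal Python sketch (not the repository's code) of character-level similarity based on shared components; the component lists are written out by hand for illustration:

# Jaccard similarity over two characters' component sets.
def component_similarity(parts_a, parts_b):
    set_a, set_b = set(parts_a), set(parts_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# 清 and 晴 share the component 青, so they score higher than 清 vs. 明:
print(component_similarity(["氵", "青"], ["日", "青"]))  # ≈ 0.33
print(component_similarity(["氵", "青"], ["日", "月"]))  # 0.0

In the actual graph, such component sets would come from the has_atom/has_part fields described in the Dataset section below.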


File Dependency:

-> corpus -> char_number: stroke-count data for Chinese characters
         |-> char_part: radical data for Chinese characters
         |-> char_pronunciation: pinyin data for Chinese characters
         |-> char_similar: character structure classes, four-corner codes, visually similar and phonetically similar characters
         |-> char_split: character decomposition and simplified/traditional mappings
         |-> basic_dictionary_similar.json
         |-> basic_triple.xlsx
         |-> corpus_handian -> word_handian: The Contemporary Chinese Dictionary (Handian) material and generated JSON data
                           |-> get_handian.py  # script to process the Handian data
                           |-> combine_n.py  # script to merge the Handian and DaCilin noun data
         |-> corpus_dacilin -> word_dacilin: DaCilin (BigCilin) material and generated JSON data
                           |-> get_dacilin.py  # script to process the DaCilin data
-> prepro.py  # script to preprocess the character datasets
-> build_graph.py  # script to build the character graph
-> text_augmentation.py  # script to augment text using the dictionaries
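
The intermediate files can be inspected directly. Below is a minimal sketch using pandas (listed in the requirements); the column layout of basic_triple.xlsx is an assumption here, not confirmed by the repository:

# Load the triple table and summarize it; reading .xlsx requires an
# Excel engine (e.g. xlrd/openpyxl) installed alongside pandas.
import pandas as pd

triples = pd.read_excel("corpus/basic_triple.xlsx")
print(triples.shape)   # how many triples were generated
print(triples.head())  # first few rows, assuming a (head, relation, tail) layout
print(triples.iloc[:, 1].value_counts())  # triples per relation type, assuming column 1 is the relation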

Dataset

entry (i.e. each character) = {
                               "split_to (decomposition schemes)": ["part(radical) atom(component) ...", ...],
                               "has_atom (which components it contains)": [atom(component), ...],
                               "is_atom_of (which characters it is a component of)": [char(character), ...],
                               "has_part (which radicals it contains)": [part(radical), ...],
                               "is_part_of (which characters it is a radical of)": [char(character), ...],
                               "pronunciation (its pronunciations)": [pronunciation(pinyin with tone), ...],
                               "number (stroke count)": number(stroke count),
                               "is_simple_to (which traditional characters it is the simplified form of)": [char(character), ...],
                               "is_traditional_to (which simplified characters it is the traditional form of)": [char(character), ...],
                               "similar_to (which characters are similar to it)": [char(character), ...]
                               }
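
For illustration, a hypothetical entry for the character 好 might look as follows (plain keys, with the parenthetical glosses above omitted); the values are examples, not taken from the generated dataset:

# Hypothetical entry for 好; values are illustrative only.
entry_hao = {
    "split_to": ["女 子"],            # one decomposition: 女 + 子
    "has_atom": ["女", "子"],
    "is_atom_of": [],
    "has_part": ["女"],               # radical 女
    "is_part_of": [],
    "pronunciation": ["hǎo", "hào"],
    "number": 6,                      # stroke count
    "is_simple_to": [],               # 好 is unchanged in traditional script
    "is_traditional_to": [],
    "similar_to": ["妤"],             # illustrative
}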

Command Line:

  • prepro: this code is used to integrate the corpora into the large dictionary, and then transform the dictionary into triple-based data:
python prepro.py
  • build_graph: this code is used to transform the triple-based data into entity/relation-based data:
python build_graph.py
  • import data into Neo4j: run the following in the command line to import the entity/relation-based data into Neo4j:
./neo4j-import --into /your_path/neo4j-community-3.5.5/data/databases/graph.db/ --nodes /your_path/hanzi_entity.csv --relationships /your_path/hanzi_relation.csv --ignore-duplicate-nodes=true --ignore-missing-nodes=true
./neo4j console
  • generate dictionaries for text augmentation: use the functions from get_handian.py and get_dacilin.py, together with the function create_corpus4typos in prepro.py, to generate the dictionaries for text augmentation.

  • text augmentation: use the functions in this script to generate new samples; run it to see examples (a toy sketch follows this section). The detailed design can be found here.

python text_augmentation.py
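
To show the kind of output the last step produces, here is a minimal sketch (not the repository's implementation) of typo-style augmentation: characters are randomly replaced with visually or phonetically similar ones. The SIMILAR dictionary below is a tiny hand-made stand-in for the dictionaries generated above:

import random

# Tiny stand-in for the generated similar-character dictionary.
SIMILAR = {"天": ["夭", "无"], "气": ["汽", "器"], "好": ["妤"]}

def augment_typos(text, p=0.3, seed=0):
    """Swap each character for a similar one with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        candidates = SIMILAR.get(ch)
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)

print(augment_typos("今天天气好"))  # prints a variant with some characters swapped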

Requirements

  • Python = 3.6.9
  • Neo4j = 3.5.5
  • pypinyin = 0.41.0
  • pandas = 0.22.0
  • fuzzywuzzy = 0.17.0
  • LTP 4 = 4.0.9
  • tqdm = 4.39.0
