- GlotScript-Resource: provides a resource displaying the writing systems for various languages.
- GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.
What writing system is each language written in?
Example:
Language | CORE | AUXILIARY |
---|---|---|
Turkish (tur) | Latn | Arab, Cyrl, Grek |
Thai (tha) | Thai | Latn |
Vietnamese (vie) | Latn | Hani |
See metadata folder for more languages.
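As a minimal sketch of how the resource might be consumed, the example rows above can be treated as a lookup from ISO 639-3 codes to core and auxiliary scripts. The dictionary below hard-codes only those three rows; the names `LANGUAGE_SCRIPTS` and `script_status` are hypothetical and not part of GlotScript, and the files in the metadata folder may be laid out differently.

```python
# Hypothetical in-memory copy of the three example rows shown above.
# This is an illustration, not the format of the metadata folder.
LANGUAGE_SCRIPTS = {
    "tur": {"core": ["Latn"], "auxiliary": ["Arab", "Cyrl", "Grek"]},
    "tha": {"core": ["Thai"], "auxiliary": ["Latn"]},
    "vie": {"core": ["Latn"], "auxiliary": ["Hani"]},
}


def script_status(lang, script):
    """Classify an ISO 15924 script code for a given ISO 639-3 language code."""
    entry = LANGUAGE_SCRIPTS.get(lang)
    if entry is None:
        return "unknown language"
    if script in entry["core"]:
        return "core"
    if script in entry["auxiliary"]:
        return "auxiliary"
    return "unexpected"


print(script_status("tur", "Latn"))  # core
print(script_status("tur", "Cyrl"))  # auxiliary
print(script_status("tha", "Arab"))  # unexpected
```

A check like this could be used to flag text whose detected script is neither core nor auxiliary for its labeled language.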
GlotScript-Tool is a Python library that detects the script (writing system) of text based on ISO 15924.
- Unicode version: 15.0.0
- The script codes were sourced from the Wikipedia ISO_15924 page.
- Unicode ranges were extracted from the Unicode Character Database.
Special codes
Zinh
code is the Unicode script property value of characters that may be used with multiple scripts, and that inherit their script from a preceding base character. In some cases, we opted to integrate parts of the Zinh code (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.Zyyy
code is the Unicode script for "Common" characters.Zzzz
code is for Unicode script for "uncoded" script.
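To illustrate what "inherit their script from a preceding base character" means for Zinh, here is a minimal sketch that works on a list of per-character script labels. The function `resolve_inherited` is hypothetical and is not how GlotScript handles these codes internally; it only shows the inheritance idea.

```python
# The three special ISO 15924 codes described above.
SPECIAL_CODES = {"Zinh", "Zyyy", "Zzzz"}


def resolve_inherited(char_scripts):
    """Replace each 'Zinh' label with the nearest preceding non-special script.

    A 'Zinh' label with no preceding base script is left unchanged.
    """
    resolved = []
    last_base = None
    for code in char_scripts:
        if code == "Zinh" and last_base is not None:
            resolved.append(last_base)
        else:
            resolved.append(code)
            if code not in SPECIAL_CODES:
                last_base = code
    return resolved


# Labels for: 'e', U+0301 COMBINING ACUTE ACCENT, a space, an Arabic letter.
labels = ["Latn", "Zinh", "Zyyy", "Arab"]
print(resolve_inherited(labels))  # ['Latn', 'Latn', 'Zyyy', 'Arab']
```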
from pip
pip3 install GlotScript
from git
pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript
Script Detection
from GlotScript import sp
sp('これは日本人です')
>> ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})
sp('This is Latin')[:1]
>> ('Latn', 1.0)
sp('මේක සිංහල')[0]
>> 'Sinh'
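The scores in the output above look like per-script character proportions. The following toy sketch reproduces that shape with a tiny hand-picked table of Unicode ranges. It assumes the score is the fraction of characters with a known script and that other characters (such as spaces) are ignored; `TOY_RANGES`, `toy_script`, and `toy_sp` are hypothetical names, and this is not GlotScript's implementation.

```python
from collections import Counter

# Tiny subset of Unicode ranges, for illustration only.
TOY_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x3041, 0x3096, "Hira"),  # Hiragana letters
    (0x4E00, 0x9FFF, "Hani"),  # CJK Unified Ideographs
]


def toy_script(ch):
    """Return the toy script code for one character, or None if not covered."""
    cp = ord(ch)
    for start, end, code in TOY_RANGES:
        if start <= cp <= end:
            return code
    return None


def toy_sp(text):
    """Return (main script, its share, all shares) over covered characters."""
    counts = Counter(s for s in map(toy_script, text) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0, {}
    details = {code: n / total for code, n in counts.most_common()}
    main = max(details, key=details.get)
    return main, details[main], details


print(toy_sp("これは日本人です"))
# ('Hira', 0.625, {'Hira': 0.625, 'Hani': 0.375})
```

Five of the eight characters are Hiragana and three are Han ideographs, which gives the same 0.625 and 0.375 shares as in the documented output.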
Script Separation
from GlotScript import sc
sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
sc(sent)
>> {
"Latn":"Hello Salut ",
"Hebr":" שלום ",
"Arab":" سلام مرحبا",
"Hani":" 你好 ",
"Hira":" こんにちは "
}
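The segments in the output above keep the spaces that surrounded them in the input. A small post-processing sketch, operating on a dictionary copied from that output so it runs without the library, might normalize whitespace and pick the script with the most non-space characters. The helpers `tidy_segments` and `dominant_script` are hypothetical and not part of GlotScript.

```python
# Dictionary copied from the documented sc() output above.
segments = {
    "Latn": "Hello Salut ",
    "Hebr": " שלום ",
    "Arab": " سلام مرحبا",
    "Hani": " 你好 ",
    "Hira": " こんにちは ",
}


def tidy_segments(segments):
    """Strip each segment and collapse runs of whitespace to single spaces."""
    return {code: " ".join(text.split()) for code, text in segments.items()}


def dominant_script(segments):
    """Return the script code whose segment has the most non-space characters."""
    cleaned = tidy_segments(segments)
    return max(cleaned, key=lambda code: len(cleaned[code].replace(" ", "")))


print(tidy_segments(segments)["Latn"])  # Hello Salut
print(dominant_script(segments))        # Latn
```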
- List of Unicode characters - Wikipedia
- Lightweight Plain-Text Editor for macOS - CotEditor
- The Cygwin Terminal – terminal emulator for Cygwin, MSYS, and WSL - mintty
- ISO_15924 Wikipedia
- Unicode Character Database (Blocks) - Unicode
- Unicode Character Database (Scripts) - Unicode
- A free, web-based font editor, focusing on font design hobbyists. - Glyphr-Studio-1
- Kotlin - JetBrains
- UNIX-like reverse engineering framework and command-line toolset - radare2
- FreeOrion Game
- DOMinator - Firefox
- SHSans-derived CJK font family - glow-sans
- Unicode Subset Bitfields - Microsoft
- Stops - FAIR NLLB FB
- Gradient Boosting on Decision Trees - catboost
- Blender
- Unicode Wikipedia
If you use any part of our resource or tool in your research, please cite it using the following BibTeX entry.
@inproceedings{kargaran-etal-2024-glotscript-resource,
title = "{G}lot{S}cript: A Resource and Tool for Low Resource Writing System Identification",
author = {Kargaran, Amir Hossein and
Yvon, Fran{\c{c}}ois and
Sch{\"u}tze, Hinrich},
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.687",
pages = "7774--7784"
}