Corpora
jacobvsdanniel edited this page Jul 8, 2020 · 1 revision
Corpus | Description |
---|---|
CNA | Chinese Gigaword 5, CNA (Central News Agency) part |
Wiki | Wikipedia, Chinese part, 2019-05-20 pages-articles dump |
ASBC | Sinica corpus 4.0 |
OntoNotes | OntoNotes 5.0, Chinese part |
- Transform to ZhTW (Traditional Chinese, Taiwan)
- Unicode normalization:
  `normalized_string = unicodedata.normalize("NFKD", raw_string)`
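
As a minimal sketch of the normalization step above: NFKD folds full-width letters, digits, and the ideographic space into their ASCII counterparts, while ordinary Chinese characters pass through unchanged. The helper name `normalize_text` and the sample strings are illustrative assumptions, not part of the original preprocessing code. The ZhTW conversion is a separate step and is not shown here.

```python
import unicodedata


def normalize_text(raw_string: str) -> str:
    """Apply the NFKD normalization described above (hypothetical helper name)."""
    return unicodedata.normalize("NFKD", raw_string)


# Illustrative inputs: full-width Latin letters and digits, and a string
# containing an ideographic space (U+3000) between Chinese characters.
samples = ["ＡＢＣ１２３", "中文\u3000文本"]
for raw in samples:
    print(repr(raw), "->", repr(normalize_text(raw)))
```

Because NFKD is a decomposing form, precomposed accented Latin letters come out as a base letter followed by a combining mark, so any text compared against the corpora or embeddings should go through the same call.
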
Corpus | #sents | #words | #characters | #words/sent | #chars/sent | "sent" Type |
---|---|---|---|---|---|---|
CNA | 13,366,581 | 632,289,913 | 1,098,546,752 | 47.3 | 82.2 | Paragraph |
Wiki | 5,557,141 | 247,714,633 | 461,862,002 | 44.6 | 83.1 | Paragraph |
ASBC | 1,297,793 | 10,409,751 | 16,331,383 | 8.0 | 12.6 | Clause |
OntoNotes | 46,905 | 958,345 | 1,515,151 | 20.4 | 32.3 | Sentence |
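
The per-"sent" averages in the table follow directly from the raw counts. The short sketch below recomputes them as a sanity check; the counts are copied from the table, while the dictionary layout and function name are assumptions made for illustration.

```python
# (#sents, #words, #characters) copied from the statistics table above.
stats = {
    "CNA": (13_366_581, 632_289_913, 1_098_546_752),
    "Wiki": (5_557_141, 247_714_633, 461_862_002),
    "ASBC": (1_297_793, 10_409_751, 16_331_383),
    "OntoNotes": (46_905, 958_345, 1_515_151),
}


def per_sentence_averages(n_sents: int, n_words: int, n_chars: int) -> tuple:
    """Return (#words/sent, #chars/sent), rounded to one decimal place."""
    return round(n_words / n_sents, 1), round(n_chars / n_sents, 1)


averages = {name: per_sentence_averages(*counts) for name, counts in stats.items()}
for name, (words_per_sent, chars_per_sent) in averages.items():
    print(f"{name}: {words_per_sent} words/sent, {chars_per_sent} chars/sent")
```

Note that a "sent" is a paragraph, clause, or sentence depending on the corpus, so the averages are not directly comparable across rows.
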
Embedding | Corpora | Corpora size (#characters or #words) | Final embedding size (#entries) | Dimension |
---|---|---|---|---|
Character | CNA, Wiki | 1,560,408,754 | 13,136 | 300 |
Word | CNA, Wiki, ASBC-train | 890,414,297 | 1,355,791 | 300 |
- Call unicodedata.normalize (see above) before using the embeddings for custom models
- Word corpora are segmented by CkipTagger WS
- Words are removed from the final embedding unless they are among the 30,000 most frequent words or have length <= 20
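
The notes above can be sketched roughly as follows. This is not the original build script: the function names, the use of `collections.Counter`, the toy data, and the reading of "length" as the number of characters are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter

TOP_K = 30_000   # threshold taken from the note above
MAX_LENGTH = 20  # threshold taken from the note above; assumed to count characters


def filter_vocabulary(word_counts: Counter, top_k: int = TOP_K,
                      max_length: int = MAX_LENGTH) -> set:
    """Keep a word if it is among the top_k most frequent OR has len <= max_length."""
    most_frequent = {word for word, _ in word_counts.most_common(top_k)}
    return {w for w in word_counts if w in most_frequent or len(w) <= max_length}


def lookup(embedding: dict, word: str):
    """Normalize the query with NFKD before the lookup; returns None if absent."""
    return embedding.get(unicodedata.normalize("NFKD", word))


# Toy example with a small top_k so the rule is visible.
counts = Counter({"的": 100, "臺灣": 50, "x" * 25: 3, "y" * 30: 1})
kept = filter_vocabulary(counts, top_k=3)
print(sorted(kept, key=len))  # the 30-character word is dropped

toy_embedding = {"ABC": [0.1, 0.2]}  # hypothetical 2-dimensional vectors
print(lookup(toy_embedding, "ＡＢＣ"))  # full-width query matches after NFKD
```

In the toy run, the 25-character word survives only because it is among the three most frequent, and the 30-character word fails both conditions. Ties in frequency at the cutoff are broken by insertion order in `Counter.most_common`, which may differ from whatever the original pipeline did.
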