Support both traditional and simplified Chinese name localizations #533

nvkelso · 2021-05-13T22:15:34Z

In support of #302 and tilezen/vector-datasource#1955, Natural Earth needs to include name localizations for both Traditional and Simplified Chinese. Currently we have only an ambiguous name_zh property.

In the case of Chinese (and some other languages), the "spoken" language has multiple "written" character sets (Traditional and Simplified) and is spoken and written in multiple countries using different conventions (e.g. zh-CN implies zh-Hans).
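As a rough illustration of how a region tag implies a script, here is a minimal sketch. The `IMPLIED_SCRIPT` table and the `implied_script` helper are hypothetical names, not part of the Natural Earth scripts; the table only lists the common default script for a handful of regions.

```python
# Hypothetical lookup: default script implied by common Chinese region tags.
IMPLIED_SCRIPT = {
    "zh-cn": "zh-hans",  # Mainland China
    "zh-sg": "zh-hans",  # Singapore
    "zh-tw": "zh-hant",  # Taiwan
    "zh-hk": "zh-hant",  # Hong Kong
    "zh-mo": "zh-hant",  # Macau
}


def implied_script(tag):
    """Return the script-qualified tag for a locale tag, or None if unknown."""
    tag = tag.lower().replace("_", "-")
    if tag in ("zh-hans", "zh-hant"):
        return tag  # already script-qualified
    return IMPLIED_SCRIPT.get(tag)
```

A bare `zh` deliberately returns `None` here, since that is exactly the ambiguous case this issue is about.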

When we harvest localized names from Wikidata we need to source Traditional Chinese separately from Simplified Chinese and put them in two different properties, such as name_zh-hs or name_zhs (Simplified Chinese, irrespective of country) and name_zh-ht or name_zht (Traditional Chinese, irrespective of country). Normally these would be name_zh-hans (Simplified Chinese) and name_zh-hant (Traditional Chinese), but a shapefile's DBF has a 10-character limit on column names.

This should also take into account, and stay compatible with, the point-of-view / worldview support being introduced in v5.


nvkelso · 2021-06-14T06:23:19Z

@ImreSamu can you give me any tips on how to modify the Wikidata names script to harvest separate Chinese simplified versus Chinese traditional characters, please? Usually a language code is simple 1:1 match but this one has multiple variants (more than just these 2!). Thanks :)

ImreSamu · 2021-06-14T08:11:13Z

IMHO:

complicated: https://en.wikipedia.org/wiki/Chinese_Wikipedia
probably a zh-hans, zh-hant should be added ( I see values in the FREQ report ) so just a simple 1:1

Based on this table: https://www.wikidata.org/wiki/Help:Languages

wikimedia language codes	Language
`zh-hans`	Simplified Chinese (Q13414913)
`zh-hant`	Traditional Chinese (Q18130932)

Simple FREQ stat - on wikidata "locality" related items :

--   wikidata "locality" related items 
--   count  zh* labels
whosonfirst=# 
WITH locality_wdlabels as 
(
 SELECT
   wd.wd_id 
  ,clean_wdlabel( wd.data->'labels'->'en'->>'value')           as name_en             
  ,clean_wdlabel( wd.data->'labels'->'zh'->>'value')           as name_zh          
  ,clean_wdlabel( wd.data->'labels'->'zh-classical'->>'value') as name_zh_classical
  ,clean_wdlabel( wd.data->'labels'->'zh-hans'->>'value')      as name_zh_hans     
  ,clean_wdlabel( wd.data->'labels'->'zh-hant'->>'value')      as name_zh_hant     
  ,clean_wdlabel( wd.data->'labels'->'zh-hk'->>'value')        as name_zh_hk        
  ,clean_wdlabel( wd.data->'labels'->'zh-min-nan'->>'value')   as name_zh_min_nan   
  ,clean_wdlabel( wd.data->'labels'->'zh-yue'->>'value')       as name_yue          
 FROM wd.wd_ok as wd 
 WHERE
  (a_wof_type  @> ARRAY['locality' ,'hasP625'] ) and not iscebuano
)
select 
  count(*) AS _cnt_base_wikidata
 ,count(*) FILTER (WHERE name_en           IS NOT NULL) AS _cnt_name_en          
 ,count(*) FILTER (WHERE name_zh           IS NOT NULL) AS _cnt_name_zh          
 ,count(*) FILTER (WHERE name_zh_classical IS NOT NULL) AS _cnt_name_zh_classical
 ,count(*) FILTER (WHERE name_zh_hans      IS NOT NULL) AS _cnt_name_zh_hans     
 ,count(*) FILTER (WHERE name_zh_hant      IS NOT NULL) AS _cnt_name_zh_hant     
 ,count(*) FILTER (WHERE name_zh_hk        IS NOT NULL) AS _cnt_name_zh_hk       
 ,count(*) FILTER (WHERE name_zh_min_nan   IS NOT NULL) AS _cnt_name_zh_min_nan  
 ,count(*) FILTER (WHERE name_yue          IS NOT NULL) AS _cnt_name_yue                
from locality_wdlabels
;
+-[ RECORD 1 ]-----------+--------+
| _cnt_base_wikidata     | 987949 |
| _cnt_name_en           | 788403 |
| _cnt_name_zh           | 197732 |
| _cnt_name_zh_classical | 0      |
| _cnt_name_zh_hans      | 91037  |
| _cnt_name_zh_hant      | 68299  |
| _cnt_name_zh_hk        | 51997  |
| _cnt_name_zh_min_nan   | 0      |
| _cnt_name_yue          | 0      |
+------------------------+--------+
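To make the counts easier to compare, here is a small sketch that turns them into coverage percentages. The numbers are copied from the report above; the variable names are only for illustration.

```python
# Counts copied from the FREQ report above (zero-count variants omitted).
counts = {
    "name_en": 788403,
    "name_zh": 197732,
    "name_zh_hans": 91037,
    "name_zh_hant": 68299,
    "name_zh_hk": 51997,
}
total = 987949  # _cnt_base_wikidata

# Share of locality items that carry each label, in percent.
coverage = {name: 100.0 * n / total for name, n in counts.items()}

# Print from most to least covered.
for name, pct in sorted(coverage.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14}{pct:6.2f}%")
```

Roughly a fifth of the items have a plain zh label, and fewer than a tenth have an explicit zh-hans or zh-hant label, so some kind of fallback to zh is probably needed.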

Normally these might be name_zh-hans (Chinese simplified) and name_zh-hant
but shapefile's DBF has a 10 character limit on the column names.

I don't have a good solution,
just brainstorming:

name_zhans ; name_zhant
namezhhans ; namezhhant
name_hans ; name_hant
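A quick way to check candidates against the DBF limit, as a sketch: `DBF_MAX_FIELD_NAME` and `fits_dbf` are hypothetical names, and the check simply counts characters, as described in the issue.

```python
DBF_MAX_FIELD_NAME = 10  # DBF column names are limited to 10 characters

candidates = [
    "name_zh-hans", "name_zh-hant",  # the "natural" names
    "name_zhans", "name_zhant",
    "namezhhans", "namezhhant",
    "name_hans", "name_hant",
]


def fits_dbf(field_name):
    """True if the name is short enough for a DBF column."""
    return len(field_name) <= DBF_MAX_FIELD_NAME


for name in candidates:
    status = "ok" if fits_dbf(name) else "too long"
    print(f"{name:<13}{len(name):>3}  {status}")
```

Only the two "natural" names are too long (12 characters); all three brainstormed pairs fit.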

nvkelso · 2021-06-16T05:05:10Z

Nice, thanks for the stats and Wikidata tips @ImreSamu!

The related Tilezen PR is tilezen/vector-datasource#1956, which has some logic for how to detect and backfill against the various options. I'll apply similar changes to the Python script in this repo.
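To give a feel for what "backfill" could mean here, a minimal sketch follows. It does not reproduce the logic in the Tilezen PR; the function name `backfill_chinese` and the fallback order are assumptions for illustration only.

```python
def backfill_chinese(labels):
    """Pick Simplified/Traditional names from a dict of Wikidata labels.

    `labels` maps Wikidata language codes to label strings. The fallback
    order below is an assumption, not the order used by Tilezen.
    """
    def first(*codes):
        # Return the first non-empty label among the given codes.
        for code in codes:
            value = labels.get(code)
            if value:
                return value
        return None

    return {
        "simplified": first("zh-hans", "zh-cn", "zh"),
        "traditional": first("zh-hant", "zh-tw", "zh-hk", "zh"),
    }
```

With only a plain `zh` label available, both outputs fall back to it, which keeps the ambiguity but avoids empty values.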

nvkelso · 2021-07-12T06:26:14Z

Hey @ImreSamu do you have any tips on how to extend the existing (well, branch) script to include the two new language variants you mentioned?

I tried the following, but it barfs in Python, and using https://query.wikidata.org/ it complains about the - in the new label variants name_zh-hans and name_zh-hant in the SELECT and OPTIONAL sections.

I also tried searching for https://www.wikidata.org/wiki/Q62 which is San Francisco since I know it has all 3 variants but no dice.

        SELECT
            ?e ?i ?r ?population
            ?name_ar
            ?name_bn
            ?name_de
            ?name_el
            ?name_en
            ?name_es
            ?name_fa
            ?name_fr
            ?name_he
            ?name_hi
            ?name_hu
            ?name_id
            ?name_it
            ?name_ja
            ?name_ko
            ?name_nl
            ?name_pl
            ?name_pt
            ?name_ru
            ?name_sv
            ?name_tr
            ?name_uk
            ?name_ur
            ?name_vi
            ?name_zh
            ?name_zh-hans
            ?name_zh-hant
        WHERE {
            {
                SELECT DISTINCT  ?e ?i ?r
                WHERE{
                    VALUES ?i { wd:Q2102493 wd:Q1781    }
                    OPTIONAL{ ?i owl:sameAs ?r. }
                    BIND(COALESCE(?r, ?i) AS ?e).
                }
            }
            SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
            OPTIONAL{?e wdt:P1082 ?population .}
            OPTIONAL{?e rdfs:label ?name_ar FILTER((LANG(?name_ar))="ar").}
            OPTIONAL{?e rdfs:label ?name_bn FILTER((LANG(?name_bn))="bn").}
            OPTIONAL{?e rdfs:label ?name_de FILTER((LANG(?name_de))="de").}
            OPTIONAL{?e rdfs:label ?name_el FILTER((LANG(?name_el))="el").}
            OPTIONAL{?e rdfs:label ?name_en FILTER((LANG(?name_en))="en").}
            OPTIONAL{?e rdfs:label ?name_es FILTER((LANG(?name_es))="es").}
            OPTIONAL{?e rdfs:label ?name_fa FILTER((LANG(?name_fa))="fa").}
            OPTIONAL{?e rdfs:label ?name_fr FILTER((LANG(?name_fr))="fr").}
            OPTIONAL{?e rdfs:label ?name_he FILTER((LANG(?name_he))="he").}
            OPTIONAL{?e rdfs:label ?name_hi FILTER((LANG(?name_hi))="hi").}
            OPTIONAL{?e rdfs:label ?name_hu FILTER((LANG(?name_hu))="hu").}
            OPTIONAL{?e rdfs:label ?name_id FILTER((LANG(?name_id))="id").}
            OPTIONAL{?e rdfs:label ?name_it FILTER((LANG(?name_it))="it").}
            OPTIONAL{?e rdfs:label ?name_ja FILTER((LANG(?name_ja))="ja").}
            OPTIONAL{?e rdfs:label ?name_ko FILTER((LANG(?name_ko))="ko").}
            OPTIONAL{?e rdfs:label ?name_nl FILTER((LANG(?name_nl))="nl").}
            OPTIONAL{?e rdfs:label ?name_pl FILTER((LANG(?name_pl))="pl").}
            OPTIONAL{?e rdfs:label ?name_pt FILTER((LANG(?name_pt))="pt").}
            OPTIONAL{?e rdfs:label ?name_ru FILTER((LANG(?name_ru))="ru").}
            OPTIONAL{?e rdfs:label ?name_sv FILTER((LANG(?name_sv))="sv").}
            OPTIONAL{?e rdfs:label ?name_tr FILTER((LANG(?name_tr))="tr").}
            OPTIONAL{?e rdfs:label ?name_uk FILTER((LANG(?name_uk))="uk").}
            OPTIONAL{?e rdfs:label ?name_ur FILTER((LANG(?name_ur))="ur").}
            OPTIONAL{?e rdfs:label ?name_vi FILTER((LANG(?name_vi))="vi").}
            OPTIONAL{?e rdfs:label ?name_zh FILTER((LANG(?name_zh))="zh").}
            OPTIONAL{?e rdfs:label ?name_zh-hans FILTER((LANG(?name_zh-hans))="zh-hans").}
            OPTIONAL{?e rdfs:label ?name_zh-hant FILTER((LANG(?name_zh-hant))="zh-hant").}
        }

ImreSamu · 2021-07-12T15:15:23Z

it complains about the - in the new label variants name_zh-hans and name_zh-hant in the SELECT and OPTIONAL sections.

ouch ... https://stackoverflow.com/questions/11075261/special-characters-in-sparql-variables

IMHO:

try using name_zh_hans variable name in SPARQL
and if the '-' char is important then you should rename the variable name in python ( name_zh_hans -> name_zh-hs ? )

minimal SPARQL example

https://w.wiki/3dBR ( Short URL of Wikidata Query Service - with the minimal sparql example )
backup of the minimal SPARQL example :

SELECT
    ?e ?i ?r ?population
    ?name_de
    ?name_en
    ?name_zh
    ?name_zh_hans
    ?name_zh_hant
WHERE {
    {
        SELECT DISTINCT  ?e ?i ?r
        WHERE{
            VALUES ?i { wd:Q2102493 wd:Q1781 }
            OPTIONAL{ ?i owl:sameAs ?r. }
            BIND(COALESCE(?r, ?i) AS ?e).
        }
    }
    SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
    OPTIONAL{?e wdt:P1082 ?population .}
    OPTIONAL{?e rdfs:label ?name_de FILTER((LANG(?name_de))="de").}
    OPTIONAL{?e rdfs:label ?name_en FILTER((LANG(?name_en))="en").}
    OPTIONAL{?e rdfs:label ?name_zh FILTER((LANG(?name_zh))="zh").}
    OPTIONAL{?e rdfs:label ?name_zh_hans FILTER((LANG(?name_zh_hans))="zh-hans").}
    OPTIONAL{?e rdfs:label ?name_zh_hant FILTER((LANG(?name_zh_hant))="zh-hant").}
}

EDIT:

example with San Francisco / Q62: --> https://w.wiki/3dBu VALUES ?i { wd:Q62 wd:Q2102493 wd:Q1781 }
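On the Python side, one option is to generate the query from a list of language codes, so the hyphen is replaced only in the SPARQL variable name and each variable stays in sync with its language filter. This is a sketch, not the script in this repo; `LANGUAGES`, `sparql_var`, and `build_query` are hypothetical names.

```python
LANGUAGES = ["de", "en", "zh", "zh-hans", "zh-hant"]


def sparql_var(lang):
    """SPARQL variable name for a language code ('-' is not allowed in names)."""
    return "name_" + lang.replace("-", "_")


def build_query(qids, languages=LANGUAGES):
    """Build the label query for the given Wikidata IDs (e.g. 'Q62')."""
    select_vars = "\n".join(f"    ?{sparql_var(lang)}" for lang in languages)
    optionals = "\n".join(
        f'    OPTIONAL{{?e rdfs:label ?{sparql_var(lang)} '
        f'FILTER((LANG(?{sparql_var(lang)}))="{lang}").}}'
        for lang in languages
    )
    values = " ".join(f"wd:{qid}" for qid in qids)
    return f"""SELECT
    ?e ?i ?r ?population
{select_vars}
WHERE {{
    {{
        SELECT DISTINCT ?e ?i ?r
        WHERE {{
            VALUES ?i {{ {values} }}
            OPTIONAL{{ ?i owl:sameAs ?r. }}
            BIND(COALESCE(?r, ?i) AS ?e).
        }}
    }}
    SERVICE wikibase:label {{bd:serviceParam wikibase:language "en".}}
    OPTIONAL{{?e wdt:P1082 ?population .}}
{optionals}
}}"""


print(build_query(["Q62", "Q2102493", "Q1781"]))
```

The language filter keeps the real code (`"zh-hans"`), so only the variable name changes; any renaming to the final column name can then happen after the results come back.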

nvkelso · 2021-07-12T23:35:44Z

That works, thanks!

Now to determine when someone has put English text into one of the Chinese values, oy vey.
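One possible heuristic, as a sketch only: treat a label as Chinese if it contains at least one Han ideograph. The `looks_chinese` helper is hypothetical, and the check covers just the main CJK Unified Ideographs block plus Extension A, so rarer characters outside those ranges would be missed and mixed Latin/Han strings would still pass.

```python
import re

# Han ideographs: CJK Unified Ideographs Extension A and the main block.
HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def looks_chinese(label):
    """True if the label contains at least one Han ideograph."""
    return bool(label) and HAN.search(label) is not None
```

Labels that fail the check (for example a plain "San Francisco" stored under zh-hans) could then be dropped or flagged for review.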

nvkelso · 2021-07-13T06:40:17Z

I got this working locally and will push a branch soon with support for Simplified and Traditional Chinese names, thanks @ImreSamu ! I also fixed Italian and unborked the new Farsi so 2x win.

nvkelso · 2021-08-04T05:07:38Z

This work is reflected in #446 (which is too crazy big for a PR)

nvkelso · 2021-08-29T06:12:59Z

Fixed via #446.

nvkelso added housekeeping adm1 pop_places adm0 labels May 13, 2021

nvkelso added this to the v5.1.0 milestone May 13, 2021

nvkelso modified the milestones: v5.0.0 part 2 (FB), v5.0.0 (part 1) May 29, 2021

nvkelso mentioned this issue Aug 4, 2021

v5 prequel #446

Merged

nvkelso closed this as completed Aug 29, 2021
