Support both traditional and simplified Chinese name localizations #533

Closed
nvkelso opened this issue May 13, 2021 · 9 comments

Comments

@nvkelso
Owner

nvkelso commented May 13, 2021

In support of #302 and tilezen/vector-datasource#1955, Natural Earth needs to include name localizations for both Traditional and Simplified Chinese. Currently we only have an ambiguous name_zh property.

In the case of Chinese (and some other languages), the "spoken" language has multiple "written" character sets (Traditional and Simplified), and it is spoken and written in multiple countries using different conventions (e.g. zh-CN implies zh-Hans).

When we harvest localized names from Wikidata we need to source Traditional Chinese separately from Simplified Chinese and put them in two different properties, such as name_zh-hs or name_zhs (Simplified Chinese irrespective of country) and name_zh-ht or name_zht (Traditional Chinese irrespective of country). Normally these would be name_zh-hans (Simplified Chinese) and name_zh-hant (Traditional Chinese), but the shapefile's DBF has a 10-character limit on column names.

There should also be some consideration and compatibility with the point-of-view / worldview being introduced in v5.

@nvkelso
Owner Author

nvkelso commented Jun 14, 2021

@ImreSamu can you give me any tips on how to modify the Wikidata names script to harvest Simplified Chinese separately from Traditional Chinese, please? Usually a language code is a simple 1:1 match, but this one has multiple variants (more than just these 2!). Thanks :)

@ImreSamu
Collaborator

IMHO:

Based on this table: https://www.wikidata.org/wiki/Help:Languages

Wikimedia language code   Language
zh-hans                   Simplified Chinese (Q13414913)
zh-hant                   Traditional Chinese (Q18130932)

A simple frequency count of zh* labels on Wikidata "locality"-related items:

--   wikidata "locality" related items 
--   count  zh* labels
whosonfirst=# 
WITH locality_wdlabels as 
(
 SELECT
   wd.wd_id 
  ,clean_wdlabel( wd.data->'labels'->'en'->>'value')           as name_en             
  ,clean_wdlabel( wd.data->'labels'->'zh'->>'value')           as name_zh          
  ,clean_wdlabel( wd.data->'labels'->'zh-classical'->>'value') as name_zh_classical
  ,clean_wdlabel( wd.data->'labels'->'zh-hans'->>'value')      as name_zh_hans     
  ,clean_wdlabel( wd.data->'labels'->'zh-hant'->>'value')      as name_zh_hant     
  ,clean_wdlabel( wd.data->'labels'->'zh-hk'->>'value')        as name_zh_hk        
  ,clean_wdlabel( wd.data->'labels'->'zh-min-nan'->>'value')   as name_zh_min_nan   
  ,clean_wdlabel( wd.data->'labels'->'zh-yue'->>'value')       as name_yue          
 FROM wd.wd_ok as wd 
 WHERE
  (a_wof_type  @> ARRAY['locality' ,'hasP625'] ) and not iscebuano
)
select 
  count(*) AS _cnt_base_wikidata
 ,count(*) FILTER (WHERE name_en           IS NOT NULL) AS _cnt_name_en          
 ,count(*) FILTER (WHERE name_zh           IS NOT NULL) AS _cnt_name_zh          
 ,count(*) FILTER (WHERE name_zh_classical IS NOT NULL) AS _cnt_name_zh_classical
 ,count(*) FILTER (WHERE name_zh_hans      IS NOT NULL) AS _cnt_name_zh_hans     
 ,count(*) FILTER (WHERE name_zh_hant      IS NOT NULL) AS _cnt_name_zh_hant     
 ,count(*) FILTER (WHERE name_zh_hk        IS NOT NULL) AS _cnt_name_zh_hk       
 ,count(*) FILTER (WHERE name_zh_min_nan   IS NOT NULL) AS _cnt_name_zh_min_nan  
 ,count(*) FILTER (WHERE name_yue          IS NOT NULL) AS _cnt_name_yue                
from locality_wdlabels
;
+-[ RECORD 1 ]-----------+--------+
| _cnt_base_wikidata     | 987949 |
| _cnt_name_en           | 788403 |
| _cnt_name_zh           | 197732 |
| _cnt_name_zh_classical | 0      |
| _cnt_name_zh_hans      | 91037  |
| _cnt_name_zh_hant      | 68299  |
| _cnt_name_zh_hk        | 51997  |
| _cnt_name_zh_min_nan   | 0      |
| _cnt_name_yue          | 0      |
+------------------------+--------+
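To put those counts in proportion, here is a small Python sketch that turns them into coverage percentages of the base set. The numbers are copied from the output above; the `coverage` helper is just an illustrative name, not part of any script in this repo.

```python
# Counts copied from the psql output above.
COUNTS = {
    "_cnt_base_wikidata": 987949,
    "_cnt_name_en": 788403,
    "_cnt_name_zh": 197732,
    "_cnt_name_zh_classical": 0,
    "_cnt_name_zh_hans": 91037,
    "_cnt_name_zh_hant": 68299,
    "_cnt_name_zh_hk": 51997,
    "_cnt_name_zh_min_nan": 0,
    "_cnt_name_yue": 0,
}


def coverage(counts, base_key="_cnt_base_wikidata"):
    """Return each count as a percentage of the base row count."""
    base = counts[base_key]
    return {
        key: 100.0 * value / base
        for key, value in counts.items()
        if key != base_key
    }


if __name__ == "__main__":
    for key, pct in coverage(COUNTS).items():
        print(f"{key:<24} {pct:5.1f}%")
```

Both zh-hans and zh-hant come out well below the plain zh share, which is why some kind of backfill from the other variants is likely to be needed.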

Normally these might be name_zh-hans (Chinese simplified) and name_zh-hant
but shapefile's DBF has a 10 character limit on the column names.

I don't have a good solution.
Just brainstorming:

  • name_zhans ; name_zhant
  • namezhhans ; namezhhant
  • name_hans ; name_hant
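A quick way to sanity-check the candidates is to measure them against the 10-character DBF limit mentioned in this issue. This is only a sketch: the candidate list is collected from this thread, and `fits_dbf` is a hypothetical helper name.

```python
# Candidate column names collected from this thread.
CANDIDATES = [
    "name_zh-hans", "name_zh-hant",
    "name_zh-hs", "name_zh-ht",
    "name_zhs", "name_zht",
    "name_zhans", "name_zhant",
    "namezhhans", "namezhhant",
    "name_hans", "name_hant",
]

DBF_MAX_FIELD_NAME = 10  # limit cited earlier in this issue


def fits_dbf(name, limit=DBF_MAX_FIELD_NAME):
    """Return True if the name is short enough for a DBF column name."""
    return len(name) <= limit


if __name__ == "__main__":
    for name in CANDIDATES:
        status = "ok" if fits_dbf(name) else "too long"
        print(f"{name:<14} {len(name):>2}  {status}")
```

Only the full name_zh-hans / name_zh-hant forms exceed the limit; all of the shortened variants fit, so the choice between them is about readability rather than length.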

@nvkelso
Owner Author

nvkelso commented Jun 16, 2021

Nice, thanks for the stats and Wikidata tips @ImreSamu!

The related Tilezen PR is tilezen/vector-datasource#1956, which has some logic for how to detect and backfill against the various options. I'll apply similar changes to the Python script in this repo.
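As a rough illustration of what "detect and backfill" could look like in Python, here is a minimal sketch. The fallback order below is an assumption made for the example and is not copied from the Tilezen PR; `FALLBACKS` and `backfill` are hypothetical names, and only language codes already mentioned in this thread are used.

```python
# ASSUMED fallback order per target script; not taken from the Tilezen PR.
FALLBACKS = {
    "zh-hans": ["zh-hans", "zh"],
    "zh-hant": ["zh-hant", "zh-hk", "zh"],
}


def backfill(labels, target):
    """Return the first non-empty label along the fallback chain, else None.

    `labels` maps Wikidata language codes to label strings.
    """
    for code in FALLBACKS[target]:
        value = labels.get(code)
        if value:
            return value
    return None


if __name__ == "__main__":
    labels = {"zh": "label-zh", "zh-hk": "label-zh-hk"}
    print(backfill(labels, "zh-hans"))  # falls back to the plain zh label
    print(backfill(labels, "zh-hant"))  # prefers zh-hk over plain zh
```

Falling back to plain zh is the weakest step, since that label can be in either script, so the real logic may need to be stricter than this.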

@nvkelso
Owner Author

nvkelso commented Jul 12, 2021

Hey @ImreSamu do you have any tips on how to extend the existing (well, branch) script to include the two new language variants you mentioned?

I tried the following, but it barfs in Python, and using https://query.wikidata.org/ it complains about the - in the new label variants name_zh-hans and name_zh-hant in the SELECT and OPTIONAL sections.

I also tried searching for https://www.wikidata.org/wiki/Q62 which is San Francisco since I know it has all 3 variants but no dice.

        SELECT
            ?e ?i ?r ?population
            ?name_ar
            ?name_bn
            ?name_de
            ?name_el
            ?name_en
            ?name_es
            ?name_fa
            ?name_fr
            ?name_he
            ?name_hi
            ?name_hu
            ?name_id
            ?name_it
            ?name_ja
            ?name_ko
            ?name_nl
            ?name_pl
            ?name_pt
            ?name_ru
            ?name_sv
            ?name_tr
            ?name_uk
            ?name_ur
            ?name_vi
            ?name_zh
            ?name_zh-hans
            ?name_zh-hant
        WHERE {
            {
                SELECT DISTINCT  ?e ?i ?r
                WHERE{
                    VALUES ?i { wd:Q2102493 wd:Q1781    }
                    OPTIONAL{ ?i owl:sameAs ?r. }
                    BIND(COALESCE(?r, ?i) AS ?e).
                }
            }
            SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
            OPTIONAL{?e wdt:P1082 ?population .}
            OPTIONAL{?e rdfs:label ?name_ar FILTER((LANG(?name_ar))="ar").}
            OPTIONAL{?e rdfs:label ?name_bn FILTER((LANG(?name_bn))="bn").}
            OPTIONAL{?e rdfs:label ?name_de FILTER((LANG(?name_de))="de").}
            OPTIONAL{?e rdfs:label ?name_el FILTER((LANG(?name_el))="el").}
            OPTIONAL{?e rdfs:label ?name_en FILTER((LANG(?name_en))="en").}
            OPTIONAL{?e rdfs:label ?name_es FILTER((LANG(?name_es))="es").}
            OPTIONAL{?e rdfs:label ?name_fa FILTER((LANG(?name_fr))="fa").}
            OPTIONAL{?e rdfs:label ?name_fr FILTER((LANG(?name_fr))="fr").}
            OPTIONAL{?e rdfs:label ?name_he FILTER((LANG(?name_he))="he").}
            OPTIONAL{?e rdfs:label ?name_hi FILTER((LANG(?name_hi))="hi").}
            OPTIONAL{?e rdfs:label ?name_hu FILTER((LANG(?name_hu))="hu").}
            OPTIONAL{?e rdfs:label ?name_id FILTER((LANG(?name_id))="id").}
            OPTIONAL{?e rdfs:label ?name_it FILTER((LANG(?name_it))="it").}
            OPTIONAL{?e rdfs:label ?name_ja FILTER((LANG(?name_ja))="ja").}
            OPTIONAL{?e rdfs:label ?name_ko FILTER((LANG(?name_ko))="ko").}
            OPTIONAL{?e rdfs:label ?name_nl FILTER((LANG(?name_nl))="nl").}
            OPTIONAL{?e rdfs:label ?name_pl FILTER((LANG(?name_pl))="pl").}
            OPTIONAL{?e rdfs:label ?name_pt FILTER((LANG(?name_pt))="pt").}
            OPTIONAL{?e rdfs:label ?name_ru FILTER((LANG(?name_ru))="ru").}
            OPTIONAL{?e rdfs:label ?name_sv FILTER((LANG(?name_sv))="sv").}
            OPTIONAL{?e rdfs:label ?name_tr FILTER((LANG(?name_tr))="tr").}
            OPTIONAL{?e rdfs:label ?name_uk FILTER((LANG(?name_uk))="uk").}
            OPTIONAL{?e rdfs:label ?name_ur FILTER((LANG(?name_ur))="ur").}
            OPTIONAL{?e rdfs:label ?name_vi FILTER((LANG(?name_vi))="vi").}
            OPTIONAL{?e rdfs:label ?name_zh FILTER((LANG(?name_zh))="zh").}
            OPTIONAL{?e rdfs:label ?name_zh-hans FILTER((LANG(?name_zh-hans))="zh-hans").}
            OPTIONAL{?e rdfs:label ?name_zh-hant FILTER((LANG(?name_zh-hant))="zh-hant").}

@ImreSamu
Collaborator

ImreSamu commented Jul 12, 2021

it complains about the - in the new label variants name_zh-hans and name_zh-hant in the SELECT and OPTIONAL sections.

ouch ... https://stackoverflow.com/questions/11075261/special-characters-in-sparql-variables

IMHO:

  • try using name_zh_hans as the variable name in SPARQL
  • and if the '-' char is important, then rename the variable afterwards in Python ( name_zh_hans -> name_zh-hs ? )

minimal SPARQL example

  • https://w.wiki/3dBR ( Short URL of Wikidata Query Service - with the minimal sparql example )
  • backup of the minimal SPARQL example :
SELECT
    ?e ?i ?r ?population
    ?name_de
    ?name_en
    ?name_zh
    ?name_zh_hans
    ?name_zh_hant
WHERE {
    {
        SELECT DISTINCT  ?e ?i ?r
        WHERE{
            VALUES ?i { wd:Q2102493 wd:Q1781 }
            OPTIONAL{ ?i owl:sameAs ?r. }
            BIND(COALESCE(?r, ?i) AS ?e).
        }
    }
    SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
    OPTIONAL{?e wdt:P1082 ?population .}
    OPTIONAL{?e rdfs:label ?name_de FILTER((LANG(?name_de))="de").}
    OPTIONAL{?e rdfs:label ?name_en FILTER((LANG(?name_en))="en").}
    OPTIONAL{?e rdfs:label ?name_zh FILTER((LANG(?name_zh))="zh").}
    OPTIONAL{?e rdfs:label ?name_zh_hans FILTER((LANG(?name_zh_hans))="zh-hans").}
    OPTIONAL{?e rdfs:label ?name_zh_hant FILTER((LANG(?name_zh_hant))="zh-hant").}
    }

EDIT:

  • example with San Francisco / Q62: --> https://w.wiki/3dBu VALUES ?i { wd:Q62 wd:Q2102493 wd:Q1781 }
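One way to avoid hand-editing each OPTIONAL line is to generate the query from a list of language tags, replacing '-' with '_' only in the variable name while keeping the original tag in the FILTER. This is a sketch under assumptions, not the script in this repo: `sparql_var`, `optional_clause` and `build_query` are hypothetical helpers, and the query shape simply mirrors the minimal example above.

```python
LANGUAGES = ["de", "en", "zh", "zh-hans", "zh-hant"]


def sparql_var(lang):
    """Map a language tag to a SPARQL-safe variable name (no '-')."""
    return "name_" + lang.replace("-", "_")


def optional_clause(lang):
    """Build one OPTIONAL label clause; variable and FILTER always agree."""
    var = sparql_var(lang)
    return f'OPTIONAL{{?e rdfs:label ?{var} FILTER((LANG(?{var}))="{lang}").}}'


def build_query(qids, languages=LANGUAGES):
    """Assemble the full query for the given Wikidata ids (e.g. 'Q62')."""
    select_vars = "\n    ".join("?" + sparql_var(lang) for lang in languages)
    optionals = "\n    ".join(optional_clause(lang) for lang in languages)
    values = " ".join("wd:" + qid for qid in qids)
    return f"""SELECT
    ?e ?i ?r ?population
    {select_vars}
WHERE {{
    {{
        SELECT DISTINCT ?e ?i ?r
        WHERE {{
            VALUES ?i {{ {values} }}
            OPTIONAL{{ ?i owl:sameAs ?r. }}
            BIND(COALESCE(?r, ?i) AS ?e).
        }}
    }}
    SERVICE wikibase:label {{bd:serviceParam wikibase:language "en".}}
    OPTIONAL{{?e wdt:P1082 ?population .}}
    {optionals}
}}"""


if __name__ == "__main__":
    print(build_query(["Q62", "Q2102493", "Q1781"]))
```

Because each clause is built from a single tag, the variable bound and the variable tested in the FILTER cannot drift apart the way they can with copy-pasted lines.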

@nvkelso
Owner Author

nvkelso commented Jul 12, 2021

That works, thanks!

Now to determine when someone has put English text into one of the Chinese values, oy vey.
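A simple heuristic for that check is to require at least one Han character in the label. The sketch below is an assumption about how it could be done, not the logic that ended up in the script; `looks_chinese` is a hypothetical name, and only the main CJK Unified Ideographs block plus Extension A are covered.

```python
import re

# Han ideographs: CJK Unified Ideographs (U+4E00-U+9FFF) plus
# Extension A (U+3400-U+4DBF). ASSUMPTION: enough for place names;
# rarer extension blocks are ignored here.
HAN_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def looks_chinese(label):
    """Return True if the label contains at least one Han character."""
    return bool(label) and HAN_RE.search(label) is not None


if __name__ == "__main__":
    for label in ["旧金山", "舊金山", "San Francisco", "", None]:
        print(repr(label), looks_chinese(label))
```

This only catches labels with no Han characters at all. It cannot tell Simplified from Traditional, and Han characters also appear in Japanese text, so it is a first filter rather than a full validation.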

@nvkelso
Owner Author

nvkelso commented Jul 13, 2021

I got this working locally and will push a branch soon with support for Simplified and Traditional Chinese names, thanks @ImreSamu ! I also fixed Italian and unborked the new Farsi so 2x win.

@nvkelso nvkelso mentioned this issue Aug 4, 2021
@nvkelso
Owner Author

nvkelso commented Aug 4, 2021

This work is reflected in #446 (which is too crazy big for a PR)

@nvkelso
Owner Author

nvkelso commented Aug 29, 2021

Fixed via #446.

@nvkelso nvkelso closed this as completed Aug 29, 2021