Support both traditional and simplified Chinese name localizations #533
Comments
@ImreSamu can you give me any tips on how to modify the Wikidata names script to harvest separate Chinese simplified versus Chinese traditional characters, please? Usually a language code is simple 1:1 match but this one has multiple variants (more than just these 2!). Thanks :) |
IMHO:
Based on this table: https://www.wikidata.org/wiki/Help:Languages
Simple FREQ stat on Wikidata "locality"-related items:
-- wikidata "locality" related items
-- count zh* labels
whosonfirst=#
WITH locality_wdlabels as
(
SELECT
wd.wd_id
,clean_wdlabel( wd.data->'labels'->'en'->>'value') as name_en
,clean_wdlabel( wd.data->'labels'->'zh'->>'value') as name_zh
,clean_wdlabel( wd.data->'labels'->'zh-classical'->>'value') as name_zh_classical
,clean_wdlabel( wd.data->'labels'->'zh-hans'->>'value') as name_zh_hans
,clean_wdlabel( wd.data->'labels'->'zh-hant'->>'value') as name_zh_hant
,clean_wdlabel( wd.data->'labels'->'zh-hk'->>'value') as name_zh_hk
,clean_wdlabel( wd.data->'labels'->'zh-min-nan'->>'value') as name_zh_min_nan
,clean_wdlabel( wd.data->'labels'->'zh-yue'->>'value') as name_yue
FROM wd.wd_ok as wd
WHERE
(a_wof_type @> ARRAY['locality' ,'hasP625'] ) and not iscebuano
)
select
count(*) AS _cnt_base_wikidata
,count(*) FILTER (WHERE name_en IS NOT NULL) AS _cnt_name_en
,count(*) FILTER (WHERE name_zh IS NOT NULL) AS _cnt_name_zh
,count(*) FILTER (WHERE name_zh_classical IS NOT NULL) AS _cnt_name_zh_classical
,count(*) FILTER (WHERE name_zh_hans IS NOT NULL) AS _cnt_name_zh_hans
,count(*) FILTER (WHERE name_zh_hant IS NOT NULL) AS _cnt_name_zh_hant
,count(*) FILTER (WHERE name_zh_hk IS NOT NULL) AS _cnt_name_zh_hk
,count(*) FILTER (WHERE name_zh_min_nan IS NOT NULL) AS _cnt_name_zh_min_nan
,count(*) FILTER (WHERE name_yue IS NOT NULL) AS _cnt_name_yue
from locality_wdlabels
;
+-[ RECORD 1 ]-----------+--------+
| _cnt_base_wikidata | 987949 |
| _cnt_name_en | 788403 |
| _cnt_name_zh | 197732 |
| _cnt_name_zh_classical | 0 |
| _cnt_name_zh_hans | 91037 |
| _cnt_name_zh_hant | 68299 |
| _cnt_name_zh_hk | 51997 |
| _cnt_name_zh_min_nan | 0 |
| _cnt_name_yue | 0 |
+------------------------+--------+
I don't have good solutions.
|
Nice, thanks for the stats and Wikidata tips @ImreSamu! The related Tilezen PR is tilezen/vector-datasource#1956, which has some logic for how to detect and backfill against the various options. I'll apply similar changes to the Python script in this repo. |
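The "detect and backfill" idea could be sketched roughly as below. This is not the logic from the Tilezen PR: the fallback orders, the helper names `pick_label` and `chinese_names`, and the choice to let plain `zh` backfill both fields are all assumptions made for illustration. The output keys `name_zhs` and `name_zht` are the short names proposed in the issue description.

```python
# Hypothetical preference orders (assumption, not taken from the Tilezen PR).
SIMPLIFIED_FALLBACKS = ("zh-hans", "zh-cn", "zh-sg", "zh")
TRADITIONAL_FALLBACKS = ("zh-hant", "zh-tw", "zh-hk", "zh-mo", "zh")


def pick_label(labels, fallbacks):
    """Return the first non-empty label found in `fallbacks` order, else None."""
    for code in fallbacks:
        value = (labels.get(code) or "").strip()
        if value:
            return value
    return None


def chinese_names(labels):
    """Split a {language code: label} dict into simplified/traditional fields."""
    return {
        "name_zhs": pick_label(labels, SIMPLIFIED_FALLBACKS),
        "name_zht": pick_label(labels, TRADITIONAL_FALLBACKS),
    }
```

Passing a dict that only has a `zh` label fills both fields with that label, which is exactly the ambiguous case the issue describes, so a real script would probably want to flag such rows rather than trust them.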
Hey @ImreSamu do you have any tips on how to extend the existing (well, branch) script to include the two new language variants you mentioned? I tried the following, but it barfs in Python, and using https://query.wikidata.org/ it complains about the I also tried searching for https://www.wikidata.org/wiki/Q62 which is San Francisco since I know it has all 3 variants but no dice.
|
ouch ... https://stackoverflow.com/questions/11075261/special-characters-in-sparql-variables IMHO:
minimal SPARQL example
SELECT
?e ?i ?r ?population
?name_de
?name_en
?name_zh
?name_zh_hans
?name_zh_hant
WHERE {
{
SELECT DISTINCT ?e ?i ?r
WHERE{
VALUES ?i { wd:Q2102493 wd:Q1781 }
OPTIONAL{ ?i owl:sameAs ?r. }
BIND(COALESCE(?r, ?i) AS ?e).
}
}
SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
OPTIONAL{?e wdt:P1082 ?population .}
OPTIONAL{?e rdfs:label ?name_de FILTER((LANG(?name_de))="de").}
OPTIONAL{?e rdfs:label ?name_en FILTER((LANG(?name_en))="en").}
OPTIONAL{?e rdfs:label ?name_zh FILTER((LANG(?name_zh))="zh").}
OPTIONAL{?e rdfs:label ?name_zh_hans FILTER((LANG(?name_zh_hans))="zh-hans").}
OPTIONAL{?e rdfs:label ?name_zh_hant FILTER((LANG(?name_zh_hant))="zh-hant").}
}
EDIT:
|
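The query above keeps the hyphen in the language tag ("zh-hans") but uses an underscore in the variable name (?name_zh_hans), since a hyphen is not allowed in a SPARQL variable name. A minimal sketch of generating those clauses from a list of language codes follows; `label_variable` and `label_clauses` are hypothetical helper names, not functions from the existing script.

```python
def label_variable(lang):
    """Turn a language code such as 'zh-hans' into a SPARQL-safe variable name."""
    return "name_" + lang.lower().replace("-", "_")


def label_clauses(langs, subject="?e"):
    """Build one OPTIONAL rdfs:label clause per language code."""
    lines = []
    for lang in langs:
        var = "?" + label_variable(lang)
        # The language tag keeps its hyphen; only the variable name is rewritten.
        lines.append(
            'OPTIONAL{%s rdfs:label %s FILTER((LANG(%s))="%s").}'
            % (subject, var, var, lang)
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(label_clauses(["de", "en", "zh", "zh-hans", "zh-hant"]))
```

The printed lines have the same shape as the OPTIONAL clauses in the query above, so adding another variant only means adding one more code to the list.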
That works, thanks! Now to determine when someone has put English text into one of the Chinese values, oy vey. |
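One rough way to spot English text in a Chinese field is to check whether the value contains any Han characters at all. The sketch below is an assumption about how that check might look, not what the script actually does; `looks_chinese` is a hypothetical name, and the ranges cover only the Basic Multilingual Plane blocks (CJK Unified Ideographs, Extension A, and Compatibility Ideographs).

```python
import re

# Han ideographs in the BMP: Extension A, Unified Ideographs, Compatibility Ideographs.
HAN_RE = re.compile("[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]")


def looks_chinese(value):
    """Return True if `value` contains at least one Han character."""
    return bool(value) and HAN_RE.search(value) is not None
```

This only separates Han text from non-Han text. It cannot tell Simplified from Traditional, and it would also accept Japanese kanji, so it is a first filter rather than a full validation.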
I got this working locally and will push a branch soon with support for Simplified and Traditional Chinese names, thanks @ImreSamu ! I also fixed Italian and unborked the new Farsi so 2x win. |
This work is reflected in #446 (which is too crazy big for a PR) |
Fixed via #446. |
In support of #302 and tilezen/vector-datasource#1955, Natural Earth needs to include name localization for both traditional and simplified Chinese. Now we just have an ambiguous name_zh property.
In the case of Chinese (and some other languages), the "spoken" language has multiple "written" character sets (Traditional and Simplified) and is spoken and written in multiple countries using different configs (eg zh-CN implies zh-Hans).
When we harvest localized names from Wikidata we need to source Traditional Chinese separately from Simplified Chinese, and put them in two different properties like name_zh-hs or name_zhs (Chinese simplified irrespective of country) and name_zh-ht or name_zht (Chinese traditional irrespective of country). Normally these might be name_zh-hans (Chinese simplified) and name_zh-hant (Chinese traditional), but shapefile's DBF has a 10 character limit on the column names.
There should also be some consideration and compatibility with the point-of-view / worldview being introduced in v5.
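The 10 character limit can be checked mechanically before any column is written. In this sketch, the `FIELD_NAMES` mapping uses the short names proposed above; the helper names `fits_dbf` and `too_long` are hypothetical.

```python
DBF_FIELD_NAME_LIMIT = 10  # the column-name limit cited in the issue

# Candidate short names from the issue text, keyed by Wikidata language code.
FIELD_NAMES = {"zh-hans": "name_zhs", "zh-hant": "name_zht"}


def fits_dbf(field_name, limit=DBF_FIELD_NAME_LIMIT):
    """Return True if the column name fits within the DBF name limit."""
    return len(field_name) <= limit


def too_long(names, limit=DBF_FIELD_NAME_LIMIT):
    """Return the names that would be rejected or truncated in a DBF."""
    return [name for name in names if not fits_dbf(name, limit)]


if __name__ == "__main__":
    print(too_long(["name_zhs", "name_zht", "name_zh-hans", "name_zh-hant"]))
```

The two long forms are 12 characters each, which is why the issue proposes the shorter `name_zhs` and `name_zht`.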