Default language for.. say metadata.title? #52

Open
jccr opened this issue Mar 19, 2020 · 14 comments

@jccr
Contributor

jccr commented Mar 19, 2020

Given I have a publication with a title in metadata like this:

{
  "metadata": {
    "title": {
      "fr": "Vingt mille lieues sous les mers",
      "en": "Twenty Thousand Leagues Under the Sea",
      "ja": "海底二万里"
    }
  }
}

What would the default language be if all I want is any string, with no localization preference? Would it be the first entry in the "list", i.e. the value of "fr"?

If so, the order of the keys might be a problem.

@llemeurfr
Contributor

IMO there is no default language embedded in the publication. There is instead a preferred language (or a list of) in the reading app.

@jccr
Contributor Author

jccr commented Mar 19, 2020

I'm looking at this from the Shared Models API perspective.

I'm trying to deal with two types of data, for example in TypeScript:

interface LocalizedString {
  [key: string]: string
}

interface Metadata {
  title: string | LocalizedString
}

When I want to grab a value from title...
I have to deal with the union type first with some "unwrapping" code that IMO is too cumbersome.
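
For illustration, the "unwrapping" could look something like the sketch below. The helper name `anyTitle` is hypothetical and not part of any Readium API. It assumes that "any string" means the first value in key order when the title is a language map.

```typescript
interface LocalizedString {
  [key: string]: string;
}

interface Metadata {
  title: string | LocalizedString;
}

// Hypothetical helper: every caller has to narrow the union before reading a value.
function anyTitle(metadata: Metadata): string | undefined {
  const title = metadata.title;
  if (typeof title === "string") {
    // Bare string form: nothing to unwrap.
    return title;
  }
  // Language-map form: take the first value in key order, with no preference.
  const keys = Object.keys(title);
  return keys.length > 0 ? title[keys[0]] : undefined;
}
```

Every consumer of `title` would need a branch like this, which is the cumbersome part.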

Ideally I think I want this:

interface Metadata {
  title: LocalizedString
}

Where all data is normalized to that structure.

{
  "metadata": {
    "title": {
      "fr": "Vingt mille lieues sous les mers",
      "en": "Twenty Thousand Leagues Under the Sea",
      "ja": "海底二万里"
    }
  }
}

would work fine as is, and would fit into LocalizedString nicely.

But.. what about the case if the data is just a simple bare string? Like this:

{
  "metadata": {
    "title": "Twenty Thousand Leagues Under the Sea"
  }
}

In my interface design it would end up being parsed like this:

title = {
  "": "Twenty Thousand Leagues Under the Sea"
}

Still ugly, but it's normalized. (Is it better? I'm asking myself.)
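
A minimal sketch of that normalization step, assuming the empty-string key stands for "no language given". The function name `normalizeTitle` is made up for this example.

```typescript
interface LocalizedString {
  [key: string]: string;
}

// Hypothetical normalizer: a bare string becomes a one-entry map keyed by "",
// while an existing language map is passed through unchanged.
function normalizeTitle(raw: string | LocalizedString): LocalizedString {
  return typeof raw === "string" ? { "": raw } : raw;
}
```

After this step, the rest of the code only ever sees the LocalizedString shape.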

@jccr
Contributor Author

jccr commented Mar 19, 2020

Alright, I'm now leaning towards your suggestion, @llemeurfr.

@jccr
Contributor Author

jccr commented Mar 19, 2020

I'd still like to draft up a design for a convenient API though, and IMO it's easier with normalized data.

@HadrienGardeur
Member

@jccr have you looked at the APIs in the Swift version?

@jccr
Contributor Author

jccr commented Mar 19, 2020

@HadrienGardeur I have actually. I'll go back and iterate my thoughts on that approach too.

@mickael-menu
Member

Actually the Kotlin version is more up-to-date now. But thank you for raising this issue, we improvised a bit there when this should be specified and shared among platforms.

Here's how it works on Kotlin:

  • We normalize the JSON to a LocalizedString object holding a Map<String?, Translation>.
    • LocalizedString.Translation only contains a String for now, but could be extended to support text direction, for example.
  • If we don't know the language, then the key can be null (e.g. when parsing an RWPM). But with EPUB, we try to use the xml:lang attribute, or fall back on the publication's language (@qnga might chime in on this).
  • When serializing LocalizedString to JSON, if a key is null then we use the BCP-47 language code und, which is made for that.

    The 'und' (Undetermined) primary language subtag identifies linguistic content whose language is not determined. IETF

  • In the shared model, we decided to offer a simple API, considering that most reading apps won't care about the translations (the test app doesn't use it, for example). Therefore, for Metadata.title, we actually have two properties:
    • localizedTitle which is the LocalizedString object.
    • title which is an alias to localizedTitle.string.
    • This choice was also guided by the need to stay backward-compatible with the previous API.
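
To make the null-key and und convention above concrete, here is a rough TypeScript transposition. It is a sketch only, not the Kotlin code: the class shape and the `toJSON` method are assumptions made for this example.

```typescript
interface Translation {
  string: string;
}

class LocalizedString {
  // A null key means "language unknown", as in the description above.
  readonly translations: ReadonlyMap<string | null, Translation>;

  constructor(translations: ReadonlyMap<string | null, Translation>) {
    this.translations = translations;
  }

  // Serializes to a BCP-47 language map; a null key is written out as "und".
  toJSON(): Record<string, string> {
    const json: Record<string, string> = {};
    this.translations.forEach((translation, language) => {
      json[language === null ? "und" : language] = translation.string;
    });
    return json;
  }
}
```

Note that this sketch does not decide what should happen if the map holds both a null key and an explicit "und" key; the later entry simply wins.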

Here's the API of LocalizedString:

  • (property) translations: Map<String?, Translation>
    • Provides a direct access to the translations map.
  • getOrFallback(language: String?): Translation?
    • Returns the translation matching the given BCP-47 tag.
    • If not found (or if no language code is given), falls back on these language codes, in order:
      1. the default user locale
      2. null
      3. und
      4. en
      5. or the first translation found in the map (this might be a problem since maps are not ordered)
  • (property) defaultTranslation: Translation? = getOrFallback(null)
  • (property) string: String = defaultTranslation.string
  • (static) fromJSON(json): LocalizedString?
    • Creates a LocalizedString from a JSON string or JSON BCP-47 language map.
  • (static) fromString(strings: Map<String?, String>): LocalizedString
    • Creates a LocalizedString from a map of strings. It's convenient when parsing a package.
  • There are some additional APIs to help build or modify a LocalizedString, since it is immutable.
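
The fallback order of getOrFallback described above can be sketched as follows. This is a TypeScript approximation under a few assumptions, not the Kotlin implementation: the user locale is passed in as an explicit `userLocale` parameter to keep the function deterministic, and tags are compared by exact match only, with no BCP-47 range matching.

```typescript
interface Translation {
  string: string;
}

type Translations = ReadonlyMap<string | null, Translation>;

function getOrFallback(
  translations: Translations,
  language: string | null,
  userLocale: string | null // assumption: supplied by the caller
): Translation | undefined {
  // Build the lookup order: requested tag, user locale, null, "und", "en".
  const candidates: Array<string | null> = [];
  if (language !== null) candidates.push(language);
  if (userLocale !== null) candidates.push(userLocale);
  candidates.push(null, "und", "en");

  for (const key of candidates) {
    const found = translations.get(key);
    if (found !== undefined) {
      return found;
    }
  }
  // Last resort: the first translation in the map, if there is one.
  const first = translations.values().next();
  return first.done ? undefined : first.value;
}
```

A JavaScript Map iterates in insertion order, so the last-resort step is at least deterministic here. That guarantee does not hold for every map type on other platforms, which is the ordering concern raised in step 5.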

So as you can see, metadata.title is actually an alias to metadata.localizedTitle.getOrFallback(null).string, which ideally returns the translation matching the user's locale. This matches what @llemeurfr said:

IMO there is no default language embedded in the publication. There is instead a preferred language (or a list of) in the reading app.

One thing we might want to discuss is the heuristics used to decide how to fall back on the default translation. It would be nice to be able to use the publication's first language instead of null or en, but we don't have access to it in LocalizedString, unless we provide it at construction.

@qnga
Member

qnga commented Mar 20, 2020

If we don't know the language, then the key can be null (e.g. when parsing an RWPM). But with EPUB, we try to use the xml:lang attribute, or fall back on the publication's language (@qnga might chime in on this).

Sure, I can chime in. I think I already suggested somewhere to drop this fallback on the publication's language. This behaviour looks like an unjustified and unnecessary assertion, since RWPM supports an unspecified language. When directly parsing an RWPM title with no specified language, no such assertion is made, and as far as I know, this interpretation is in no way favoured by the EPUB specification.

@mickael-menu
Member

I think I already suggested somewhere to drop this fallback on the publication's language.

I agree with you, and it would lead to simpler parsing. I think only the Kotlin implementation falls back on the publication language right now.

@danielweck
Member

In the TypeScript implementation, for "contributors" metadata (e.g. author) as well as for title and subtitle metadata, we use the underscore _ pseudo-language key as a fallback when "alternative scripts" are declared in the package OPF (as per the EPUB3 definition) and the parser cannot determine the language of the string from the xml:lang attribute (on the meta itself, or on the package OPF root element) or, failing that, from the "primary" package OPF meta language (i.e. "primary" = first item in the array). Obviously, _ is not a great solution, so I will migrate to und instead. Thanks Mickael for pointing this out.

The current parser algorithm is inspired by:
https://github.com/readium/architecture/blob/master/streamer/parser/metadata.md#title
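
As a rough idea of what moving from _ to und could involve for already-parsed data, here is a hypothetical helper. `migrateUnderscoreKey` is not part of the TypeScript implementation. It assumes that an existing und entry should take precedence if both keys are present.

```typescript
type LanguageMap = { [language: string]: string };

// Hypothetical migration helper: rename the "_" pseudo-key to "und".
function migrateUnderscoreKey(map: LanguageMap): LanguageMap {
  // Split off the "_" entry (undefined at runtime when the key is absent).
  const { _: undetermined, ...rest } = map;
  if (undetermined === undefined || "und" in rest) {
    // Nothing to rename, or an explicit "und" entry already exists: keep it.
    return rest;
  }
  return { ...rest, und: undetermined };
}
```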

@jccr
Contributor Author

jccr commented Mar 20, 2020

Man! I was looking for something like und

Thanks for the analysis, everyone! 👍

@llemeurfr
Contributor

I think I already suggested somewhere to drop this [language of the publication] fallback on the publication's language.

This is exactly what I did myself in the Go implementation for the LCP server when parsing W3C Manifests: the low-level JSON unmarshalling of a Localizable string would otherwise rely on a global variable (the global language of the publication), and this would lead to a terrible implementation.

As qnga said, in EPUB the language of the publication (which may be multiple) is not directly related to the language of its metadata.

W3C Publications are slightly different because there are two different properties: inLanguage for the publication, and a top-level language (here) for the manifest and therefore its metadata. But we can be pretty certain that the latter will not be used any time soon, and there is no corresponding property in the RWPM.

In conclusion, I think we can rephrase Mickaël's wording as: If we don't know the language (because the property is expressed as a plain string), then the key is "und".
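
Under that rephrased rule, parsing could look like the sketch below. The function name `parseLocalizedString` and the choice to return null for invalid input are assumptions made for this example. It is not the Go or Kotlin code.

```typescript
type LanguageMap = { [language: string]: string };

// Sketch: parse a localizable JSON value into a language map.
function parseLocalizedString(json: unknown): LanguageMap | null {
  if (typeof json === "string") {
    // Plain string: the language is unknown, so the key is "und".
    return { und: json };
  }
  if (typeof json !== "object" || json === null || Array.isArray(json)) {
    return null;
  }
  const record = json as Record<string, unknown>;
  const result: LanguageMap = {};
  for (const language of Object.keys(record)) {
    const value = record[language];
    if (typeof value !== "string") {
      return null; // every translation must be a string
    }
    result[language] = value;
  }
  return result;
}
```

No publication-level language is consulted here, so the parser needs no global state.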

@danielweck
Member

W3C Publication are slightly different because ...

For all intents and purposes, isn't EPUB OPF's xml:lang the same as W3C WebPub's @context language? (and EPUB OPF's metadata dc:language the same as W3C WebPub's inLanguage)

@llemeurfr
Contributor

@danielweck you're right, xml:lang in EPUB has the same use as the @context / language in JSON-LD.
