Special collations #8523

Merged
merged 1 commit into from
Oct 12, 2020

Conversation

ericharmeling
Contributor

Fixes #1471.

  • Added docs on collation subtags.
  • Added note on deterministic and non-deterministic support.

@ericharmeling ericharmeling requested a review from bdarnell October 1, 2020 18:42
@cockroach-teamcity
Member

This change is Reviewable

Member

@RaduBerinde RaduBerinde left a comment

:lgtm: 💯 🎉

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @bdarnell)

Contributor

@bdarnell bdarnell left a comment

I really got nerd-sniped on this one...

It's great that we have the pg_collation table; now that we have that I'd remove the talk about Go.

In general it's time to make this more self-contained instead of linking out to external resources. For example, the link to "unicode language and locale identifiers" includes a lot of stuff that is irrelevant (the ca calendar modifiers) and more importantly it includes some stuff that is relevant but not supported in Go (such as the kf case-first modifier reference).

I'm also seeing some more issues now that I've opened this can of worms. The pg_collation table shows the collation names containing hyphens instead of underscores, which requires using double quotes (not the usual SQL single quotes) in the collation syntax. Can we change that table to use underscores? If not, we should probably use hyphens and double quotes throughout the docs to match.

And versioning of collations is a potential nightmare. Collations rarely change, but now that I've found the source data I see that the Chinese and Japanese collations have both had changes in the last couple of years. So maybe we should add some best practices to always use uncollated strings for real columns and to only use collated string types via computed columns (and soon computed indexes).

So let me take a stab at the whole thing:

Supported Collations

CockroachDB supports collations identified by Unicode locale identifiers. For example, en-US identifies US English, es is Spanish, and fr-CA is Canadian French. Collation names are case-insensitive, and hyphens and underscores are interchangeable. If a hyphen is used, the collation name must be enclosed in double quotes when used in SQL syntax (not single quotes, which are used for SQL string literals).
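
As a rough illustration of the quoting rule (an untested sketch, assuming the fr-CA collation is available on the cluster), the following two statements should name the same collation:

~~~ sql
-- Underscore form: a plain identifier, so no quoting is needed.
> SELECT 'a' COLLATE fr_CA < 'b' COLLATE fr_CA;

-- Hyphen form: must be double-quoted, because a hyphen is not valid in a bare identifier.
> SELECT 'a' COLLATE "fr-CA" < 'b' COLLATE "fr-CA";
~~~

Each statement uses one spelling on both sides of the comparison, to avoid relying on how differently spelled collation names are matched against each other.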

A list of supported collations can be found in the pg_catalog.pg_collation table (SELECT collname FROM pg_catalog.pg_collation;). In addition, some aliases for these collations are also supported, although they do not appear in the table. For example, es-419 (Latin American Spanish) and zh-Hans (Simplified Chinese) are supported, but they do not appear in the pg_collation table because they are equivalent to the basic es and zh collations.
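
For instance, checking the table for a locale and then using an alias might look like the sketch below (untested; the LIKE filter is only an illustration):

~~~ sql
-- List the collations whose names start with "es".
> SELECT collname FROM pg_catalog.pg_collation WHERE collname LIKE 'es%';

-- es-419 is not expected to appear in that list, but it should still be accepted as an alias.
> SELECT 'a' COLLATE "es-419" < 'b' COLLATE "es-419";
~~~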

Some Unicode locale extensions are also supported. For example, the ks modifier changes the "strength" of a collation, causing it to treat certain classes of characters as equivalent (PostgreSQL calls these "non-deterministic collations"). Setting ks to level2 makes a collation case-insensitive (for languages that have this concept). To use one of these extensions, append -u- to the base locale name followed by the extension: en-US-u-ks-level2 is case-insensitive US English. Currently supported extensions are co (collation type), ks (strength), kc (case level), kb (backwards second level weight), kn (numeric), and ka (alternate handling). These extensions are defined in the Unicode Collation Algorithm.
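
A minimal sketch of the ks extension in use (the string values are made up for illustration, and no output is shown because it has not been verified here):

~~~ sql
-- With ks-level2, the two strings are expected to compare as equal despite the case difference.
> SELECT 'Hello' COLLATE "en-US-u-ks-level2" = 'hello' COLLATE "en-US-u-ks-level2";

-- With the plain en-US collation, case is significant, so the same comparison is expected to be false.
> SELECT 'Hello' COLLATE "en-US" = 'hello' COLLATE "en-US";
~~~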

Collation Versioning

CockroachDB updates with new versions of the Unicode standard each year, and does not currently provide a mechanism to specify the version of Unicode in use. While changes to collations are rare, they are possible, especially in languages that use large numbers of characters such as Chinese. It is possible for a collation change to invalidate existing collated string data. For this reason we recommend storing data in columns with an uncollated string type, then using a computed column for the desired collation. This way, in the event a collation change has undesired effects, the computed column can be dropped and recreated.
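
One way this recommendation could look in practice is sketched below. The table and column names (users, name, name_de) are hypothetical, and the DDL has not been verified against a running cluster:

~~~ sql
-- Store the source data uncollated; derive the collated form in a computed column.
> CREATE TABLE users (
    id INT PRIMARY KEY,
    name STRING,
    name_de STRING COLLATE de AS (name COLLATE de) STORED
);

-- Sort using the collated computed column.
> SELECT name FROM users ORDER BY name_de;

-- If a collation change has undesired effects, drop and recreate only the derived column.
> ALTER TABLE users DROP COLUMN name_de;
> ALTER TABLE users ADD COLUMN name_de STRING COLLATE de AS (name COLLATE de) STORED;
~~~

The uncollated name column is never rewritten, so the original data survives any change to the collation rules.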

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @bdarnell and @ericharmeling)


v19.2/collate.md, line 28 at r1 (raw file):

CockroachDB supports the collation locales provided by Go's [language package](https://godoc.org/golang.org/x/text/language#Tag), using [BCP47-based](https://tools.ietf.org/html/bcp47) identifiers and extensions.

For example, the identifier `es-419` specifies Latin American Spanish, and `en_US_u_ks_level1` specifies case-insensitive American English.

es-419 doesn't show up in pg_collation (although it appears to work). I think this is because Latin American Spanish uses the same collation rules as the Spanish of Spain. fr-CA might be a better example, since Canadian French does have different rules than the French of France.

Down below you use ks_level2 for case-insensitive instead of ks_level1. I think level2 is the right answer here (level1 is both case- and accent-insensitive).
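
To make the difference concrete, something like the following could be tried (an untested sketch; the sample strings are illustrative):

~~~ sql
-- level2 ignores case but is expected to still distinguish accents.
> SELECT 'resume' COLLATE "en-US-u-ks-level2" = 'Résumé' COLLATE "en-US-u-ks-level2";

-- level1 is expected to ignore both case and accents.
> SELECT 'resume' COLLATE "en-US-u-ks-level1" = 'Résumé' COLLATE "en-US-u-ks-level1";
~~~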

Contributor Author

@ericharmeling ericharmeling left a comment

TFTR @RaduBerinde !

And thank you for the thorough review and feedback @bdarnell !

Following your feedback, I've updated the 19.2 files for a second review (will propagate to other versions when approved). I basically just copyedited what you wrote above, and then updated the example following your advice. Fortunately, there are no other instances of locale extensions in the docs, so we don't need to replace any other examples with double quotes and hyphens.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell)

Contributor

@bdarnell bdarnell left a comment

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell)

Contributor

@lnhsingh lnhsingh left a comment

LGTM! One comment on a format suggestion

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell, @ericharmeling, and @lnhsingh)


v19.2/collate.md, line 52 at r3 (raw file):

CockroachDB supports standard aliases for the collations listed in `pg_collation`. For example, `es-419` (Latin American Spanish) and `zh-Hans` (Simplified Chinese) are supported, but they do not appear in the `pg_collation` table because they are equivalent to the `es` and `zh` collations listed in the table.

CockroachDB also supports the following Unicode locale extensions: `co` (collation type), `ks` (strength), `kc` (case level), `kb` (backwards second level weight), `kn` (numeric), and `ka` (alternate handling). To use a locale extension, append `-u-` to the base locale name, followed by the extension. For example, `en-US-u-ks-level2` is case-insensitive US English. The `ks` modifier changes the "strength" of the collation, causing it to treat certain classes of characters as equivalent (PostgreSQL calls these "non-deterministic collations"). Setting `ks` to `level2` makes the collation case-insensitive (for languages that have this concept).

Format suggestion: Make the Unicode locale extensions a bulleted list for better readability (this applies to the other versions too)


v19.2/collate.md, line 124 at r3 (raw file):

{% include copy-clipboard.html %}
~~~ sql
> INSERT INTO nocase_strings VALUES ('Hello, friend.' COLLATE "en-US-u-ks-level2"), ('Hi. My name is Petee.' COLLATE "en-US-u-ks-level2");
~~~

I like this example lol 🐶

Contributor Author

@ericharmeling ericharmeling left a comment

TFTR, @lnhsingh !

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell and @lnhsingh)


v19.2/collate.md, line 52 at r3 (raw file):

Previously, lnhsingh (Lauren Hirata Singh) wrote…

Format suggestion: Make the Unicode locale extensions a bulleted list for better readability (this applies to the other versions too)

Done.


v19.2/collate.md, line 124 at r3 (raw file):

Previously, lnhsingh (Lauren Hirata Singh) wrote…

I like this example lol 🐶

:)

@ericharmeling ericharmeling merged commit 0351cb9 into master Oct 12, 2020
@ericharmeling ericharmeling deleted the special-collations branch October 12, 2020 16:29
Development

Successfully merging this pull request may close these issues.

Document special collations
5 participants