Special collations #8523

ericharmeling · 2020-10-01T18:42:31Z

Added docs on collation subtags.
Added note on deterministic and non-deterministic support.

cockroach-teamcity · 2020-10-01T18:42:37Z

cockroach-teamcity · 2020-10-01T18:46:25Z

Online preview: http://cockroach-docs-review.s3-website-us-east-1.amazonaws.com/8be14c8671eef4450f26e61eb9f1042763110389/

Edited pages:

RaduBerinde

💯 🎉

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @bdarnell)

bdarnell

I really got nerd-sniped on this one...

It's great that we have the pg_collation table; now that we have that I'd remove the talk about Go.

In general it's time to make this more self-contained instead of linking out to external resources. For example, the link to "unicode language and locale identifiers" includes a lot of stuff that is irrelevant (the ca calendar modifiers) and more importantly it includes some stuff that is relevant but not supported in Go (such as the kf case-first modifier reference).

I'm also seeing some more issues now that I've opened this can of worms. The pg_collation table shows the collation names containing hyphens instead of underscores, which requires using double quotes (not the usual SQL single quotes) in the collation syntax. Can we change that table to use underscores? If not, we should probably use hyphens and double quotes throughout the docs to match.

And versioning of collations is a potential nightmare. Collations rarely change, but now that I've found the source data I see that the Chinese and Japanese collations have both had changes in the last couple of years. So maybe we should add some best practices to always use uncollated strings for real columns and to only use collated string types via computed columns (and soon computed indexes).

So let me take a stab at the whole thing:

Supported Collations

CockroachDB supports collations identified by Unicode locale identifiers. For example, en-US identifies US English, es is Spanish, and fr-CA is Canadian French. Collation names are case-insensitive, and hyphens and underscores are interchangeable. If a hyphen is used, the collation name must be enclosed in double quotes when used in SQL syntax (not single quotes, which are used for SQL string literals).
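A quick sketch of the quoting rule described above, assuming `fr-CA` is available on the cluster (the comparison itself is only illustrative):

~~~ sql
-- Underscore form: a plain identifier, so no quoting is needed.
SELECT 'a' COLLATE fr_CA < 'b' COLLATE fr_CA;

-- Hyphen form: must be double-quoted, because a hyphen is not valid
-- in an unquoted identifier. Single quotes would make it a string
-- literal, which the COLLATE clause is not expected to accept.
SELECT 'a' COLLATE "fr-CA" < 'b' COLLATE "fr-CA";
~~~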

A list of supported collations can be found in the pg_catalog.pg_collation table (SELECT collname FROM pg_catalog.pg_collation;). In addition, some aliases for these collations are also supported, although they do not appear in the table. For example, es-419 (Latin American Spanish) and zh-Hans (Simplified Chinese) are supported, but they do not appear in the pg_collation table because they are equivalent to the basic es and zh collations.

Some Unicode locale extensions are also supported. For example, the ks modifier changes the "strength" of a collation, causing it to treat certain classes of characters as equivalent (PostgreSQL calls these "non-deterministic collations"). For example, setting ks to level2 makes a collation case-insensitive (for languages that have this concept). To use one of these extensions, append -u- to the base locale name, followed by the extension: en-US-u-ks-level2 is case-insensitive US English. Currently supported extensions are co (collation type), ks (strength), kc (case level), kb (backwards second level weight), kn (numeric), and ka (alternate handling). These extensions are defined in the Unicode Collation Algorithm.
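As a minimal sketch of the `-u-` extension syntax in use (the table and column names here are hypothetical, chosen only for illustration):

~~~ sql
-- Hypothetical table whose column uses case-insensitive US English.
CREATE TABLE greetings (msg STRING COLLATE "en-US-u-ks-level2");

INSERT INTO greetings VALUES ('Hello' COLLATE "en-US-u-ks-level2");

-- With ks set to level2, this lookup is expected to match the row
-- above even though the case differs.
SELECT * FROM greetings WHERE msg = ('hello' COLLATE "en-US-u-ks-level2");
~~~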

Collation Versioning

CockroachDB updates with new versions of the Unicode standard each year, and does not currently provide a mechanism to specify the version of Unicode in use. While changes to collations are rare, they are possible, especially in languages that use large numbers of characters such as Chinese. It is possible for a collation change to invalidate existing collated string data. For this reason we recommend storing data in columns with an uncollated string type, then using a computed column for the desired collation. This way, in the event a collation change has undesired effects, the computed column can be dropped and recreated.
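A sketch of what that recommendation could look like, assuming a stored computed column accepts a `COLLATE` expression (the table and column names are hypothetical):

~~~ sql
-- The source of truth stays in an uncollated STRING column.
CREATE TABLE users (
    id INT PRIMARY KEY,
    name STRING,
    -- Collated copy derived from name; safe to throw away and rebuild.
    name_de STRING COLLATE de AS (name COLLATE de) STORED
);

-- If a collation change has undesired effects, drop and recreate the
-- computed column. The built-in SQL shell may first require
-- SET sql_safe_updates = false before it allows DROP COLUMN.
ALTER TABLE users DROP COLUMN name_de;
ALTER TABLE users ADD COLUMN name_de STRING COLLATE de AS (name COLLATE de) STORED;
~~~

Because `name` itself is never collated, nothing is lost when `name_de` is dropped.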

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @bdarnell and @ericharmeling)

v19.2/collate.md, line 28 at r1 (raw file):

CockroachDB supports the collation locales provided by Go's [language package](https://godoc.org/golang.org/x/text/language#Tag), using [BCP47-based](https://tools.ietf.org/html/bcp47) identifiers and extensions.

For example, the identifier `es-419` specifies Latin American Spanish, and `en_US_u_ks_level1` specifies case-insensitive American English.

es-419 doesn't show up in pg_collation (although it appears to work). I think this is because Latin American Spanish uses the same collation rules as Spain's Spanish. fr-CA might be a better example, since Canadian French does have different rules than French in France.

Down below you use ks_level2 for case-insensitive instead of ks_level1. I think level2 is the right answer here (level1 is both case- and accent-insensitive)
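One way to check the difference between the two strengths, as a sketch (the sample strings are arbitrary):

~~~ sql
-- level1 compares base letters only, so case and accents are both
-- expected to be ignored here.
SELECT 'resume' COLLATE "en-US-u-ks-level1" = 'Résumé' COLLATE "en-US-u-ks-level1";

-- level2 still ignores case but is expected to distinguish accents.
SELECT 'resume' COLLATE "en-US-u-ks-level2" = 'Résumé' COLLATE "en-US-u-ks-level2";
~~~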

ericharmeling

TFTR @RaduBerinde !

And thank you for the thorough review and feedback @bdarnell !

Following your feedback, I've updated the 19.2 files for a second review (will propagate to other versions when approved). I basically just copyedited what you wrote above, and then updated the example following your advice. Fortunately, there are no other instances of locale extensions in the docs, so we don't need to replace any other examples with double quotes and hyphens.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell)

cockroach-teamcity · 2020-10-06T20:54:42Z

Online preview: http://cockroach-docs-review.s3-website-us-east-1.amazonaws.com/8d4ae3f60be0f389e16a1160049ca10db4444430/

Edited pages:

bdarnell

Reviewed 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell)

cockroach-teamcity · 2020-10-08T21:47:45Z

Online preview: http://cockroach-docs-review.s3-website-us-east-1.amazonaws.com/17c6a7f829f184870cb78195dd268541b557f504/

Edited pages:

lnhsingh

LGTM! One comment on a format suggestion

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell, @ericharmeling, and @lnhsingh)

v19.2/collate.md, line 52 at r3 (raw file):

CockroachDB supports standard aliases for the collations listed in `pg_collation`. For example, `es-419` (Latin American Spanish) and `zh-Hans` (Simplified Chinese) are supported, but they do not appear in the `pg_collations` table because they are equivalent to the `es` and `zh` collations listed in the table.

CockroachDB also supports the following Unicode locale extensions: `co` (collation type), `ks` (strength), `kc` (case level), `kb` (backwards second level weight), `kn` (numeric), `ks` (strength), and `ka` (alternate handling). To use a locale extension, append `-u-` to the base locale name, followed by the extension. For example, `en-US-u-ks-level2` is case-insensitive US English. The `ks` modifier changes the "strength" of the collation, causing it to treat certain classes of characters as equivalent (PostgreSQL calls these "non-deterministic collations"). Setting the `ks` to `level2` makes the collation case-insensitive (for languages that have this concept).

Format suggestion: Make the Unicode locale extensions a bulleted list for better readability (this applies to the other versions too)

v19.2/collate.md, line 124 at r3 (raw file):

{% include copy-clipboard.html %}
~~~ sql
> INSERT INTO nocase_strings VALUES ('Hello, friend.' COLLATE "en-US-u-ks-level2"), ('Hi. My name is Petee.' COLLATE "en-US-u-ks-level2");
~~~

I like this example lol 🐶

ericharmeling

TFTR, @lnhsingh !

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell and @lnhsingh)

v19.2/collate.md, line 52 at r3 (raw file):

Previously, lnhsingh (Lauren Hirata Singh) wrote…

Format suggestion: Make the Unicode locale extensions a bulleted list for better readability (this applies to the other versions too)

Done.

v19.2/collate.md, line 124 at r3 (raw file):

Previously, lnhsingh (Lauren Hirata Singh) wrote…

I like this example lol 🐶

:)

cockroach-teamcity · 2020-10-12T16:27:56Z

Online preview: http://cockroach-docs-review.s3-website-us-east-1.amazonaws.com/c9523285548bf925b393b5c283d4e9c654e5708d/

Edited pages:

ericharmeling requested a review from bdarnell October 1, 2020 18:42

ericharmeling mentioned this pull request Oct 1, 2020

Document special collations #1471

Closed

RaduBerinde approved these changes Oct 1, 2020

bdarnell reviewed Oct 1, 2020

ericharmeling commented Oct 6, 2020

bdarnell approved these changes Oct 8, 2020

ericharmeling force-pushed the special-collations branch from 8d4ae3f to 17c6a7f Compare October 8, 2020 21:44

ericharmeling requested a review from lnhsingh October 8, 2020 21:44

lnhsingh reviewed Oct 9, 2020

Collation support extensions and versioning documented

c952328

ericharmeling force-pushed the special-collations branch from 17c6a7f to c952328 Compare October 12, 2020 16:24

ericharmeling commented Oct 12, 2020

ericharmeling merged commit 0351cb9 into master Oct 12, 2020

ericharmeling deleted the special-collations branch October 12, 2020 16:29

ericharmeling mentioned this pull request Oct 25, 2021

deps: update golang.org/x/text to 0.3.6 #11317

Closed
