-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert all tables to use UTF8MB4 charset #9767
Conversation
387325f
to
1d3bfa7
Compare
1d3bfa7
to
9ae166d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please drop commits 2 and 3 though, not much value in committing those.
# Disable foreign key constraints because foreign keys using strings will prevent conversion due to a charset mismatch. | ||
connection.execute "SET foreign_key_checks = 0;" | ||
connection.tables.each do |table| | ||
next if table == "schema_migrations" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth a quick inline comment on why this is necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, this is a data migration, so this isn't necessary any longer. I think I'm right in saying Active Record migrations lock the migrations table (or used to? don't know if they still do actually) while the migration is running to stop multiple migrations running simultaneously. This would prevent the conversion from running for the table. Data migrations don't appear to do that, so we no longer need that line.
d64ef8b
to
93b7648
Compare
93b7648
to
85b56cf
Compare
We have seen an increasing number of errors recently relating to publishers attempting to use 4 byte characters in Whitehall. This commit adds a data migration which will convert all of the tables to use the UTF8MB4 charset. This uses a data migration because it is not subject to the 15 minute limit for schema migrations applied in the kubernetes configuration. The migration will need to be run out of hours as the tables will need to be locked during conversion.
85b56cf
to
6bab3a5
Compare
This PR: - No longer sets `slug` to `nil` when the locale is non-English. Every locale is considered "sluggable". - This also means we no longer need the `sluggable_locale?`, `slug_eligible?` methods and ASCII checks. (The latter has caused a number of Zendesk tickets, whereby even English HTML Attachments have been given a content-ID based URL, because they contain a non-ASCII character like a fancy single quote). - `sluggable_string` doesn't appear to have been in use, so it was deleted. (`sluggable_string` is used in the Document class: `friendly_id :sluggable_string`. But the Attachment class uses `friendly_id :title`, so overriding the `sluggable_string` method had no effect). - Adds a test for friendly_id's native "uniqueness" feature (which we've opted into via `use: :scoped, scope: :attachable` on the Attachment class), so in theory we should be protected from clashes in the attachment slug namespace. Therefore the issue alluded to in #5321 (which reverts the last time this was attempted) should not happen again. - Adds a basic safeguard that the slug-generation results in a slug that contains at least 4 a-z or A-Z characters. This is to protect against generating a slug like `-`, which could happen from sluggifying a title like `{chinese}-{chinese}`. - That said, if we somehow DO get into a state whereby an attachment has what we'd call an invalid slug (<4 azAZ chars), we don't want to fix it if the attachment is already live. So we check for `safely_resluggable?`, introduced in #5942. Result: - Most HTML attachments should now have human-readable slugs. Only languages that are non-ASCII and cannot be converted to ASCII, such as Chinese, will continue to fall back to the content-ID based URLs. Next steps: - Now that all Whitehall tables use the UTF8MB4 charset (#9767), we could potentially use CJK character sets natively in slugs - but it may not always show nicely in the browser. Compare `https://example.com/你好` with `https://example.com/%E4%BD%A0%E5%A5%BD`. - The option we'd likely want to go with is conversion to their Pinyin representations, e.g. `https://example.com/ni-hao`, probably using the `pinyin` or `ruby-pinyin` gem. Trello: https://trello.com/c/YGsEG9YV/3418-improve-html-attachment-slug-creation
This PR: - No longer sets `slug` to `nil` when the locale is non-English. Every locale is considered "sluggable". - This also means we no longer need the `sluggable_locale?`, `slug_eligible?` methods and ASCII checks. (The latter has caused a number of Zendesk tickets, whereby even English HTML Attachments have been given a content-ID based URL, because they contain a non-ASCII character like a fancy single quote). - `sluggable_string` doesn't appear to have been in use, so it was deleted. (`sluggable_string` is used in the Document class: `friendly_id :sluggable_string`. But the Attachment class uses `friendly_id :title`, so overriding the `sluggable_string` method had no effect). - Adds a test for friendly_id's native "uniqueness" feature (which we've opted into via `use: :scoped, scope: :attachable` on the Attachment class), so in theory we should be protected from clashes in the attachment slug namespace. Therefore the issue alluded to in #5321 (which reverts the last time this was attempted) should not happen again. - Adds a basic safeguard that the slug-generation results in a slug that contains at least 4 a-z or A-Z characters. This is to protect against generating a slug like `-`, which could happen from sluggifying a title like `{chinese}-{chinese}`. - That said, if we somehow DO get into a state whereby an attachment has what we'd call an invalid slug (<4 azAZ chars), we don't want to fix it if the attachment is already live. So we check for `safely_resluggable?`, introduced in #5942. Result: - Most HTML attachments should now have human-readable slugs. Only languages that are non-ASCII and cannot be converted to ASCII, such as Chinese, will continue to fall back to the content-ID based URLs. Next steps: - Now that all Whitehall tables use the UTF8MB4 charset (#9767), we could potentially use CJK character sets natively in slugs - but it may not always show nicely in the browser. Compare `https://example.com/你好` with `https://example.com/%E4%BD%A0%E5%A5%BD`. - The option we'd likely want to go with is conversion to their Pinyin representations, e.g. `https://example.com/ni-hao`, probably using the `pinyin` or `ruby-pinyin` gem. Trello: https://trello.com/c/YGsEG9YV/3418-improve-html-attachment-slug-creation
This PR: - No longer sets `slug` to `nil` when the locale is non-English. Every locale is considered "sluggable". - This also means we no longer need the `sluggable_locale?`, `slug_eligible?` methods and ASCII checks. (The latter has caused a number of Zendesk tickets, whereby even English HTML Attachments have been given a content-ID based URL, because they contain a non-ASCII character like a fancy single quote). - `sluggable_string` doesn't appear to have been in use, so it was deleted. (`sluggable_string` is used in the Document class: `friendly_id :sluggable_string`. But the Attachment class uses `friendly_id :title`, so overriding the `sluggable_string` method had no effect). - Adds a test for friendly_id's native "uniqueness" feature (which we've opted into via `use: :scoped, scope: :attachable` on the Attachment class), so in theory we should be protected from clashes in the attachment slug namespace. Therefore the issue alluded to in #5321 (which reverts the last time this was attempted) should not happen again. - Adds a basic safeguard that the slug-generation results in a slug that contains at least 4 a-z or A-Z characters. This is to protect against generating a slug like `-`, which could happen from sluggifying a title like `{chinese}-{chinese}`. - That said, if we somehow DO get into a state whereby an attachment has what we'd call an invalid slug (<4 azAZ chars), we don't want to fix it if the attachment is already live. So we check for `safely_resluggable?`, introduced in #5942. Result: - Most HTML attachments should now have human-readable slugs. Only languages that are non-ASCII and cannot be converted to ASCII, such as Chinese, will continue to fall back to the content-ID based URLs. Next steps: - Now that all Whitehall tables use the UTF8MB4 charset (#9767), we could potentially use CJK character sets natively in slugs - but it may not always show nicely in the browser. Compare `https://example.com/你好` with `https://example.com/%E4%BD%A0%E5%A5%BD`. - The option we'd likely want to go with is conversion to their Pinyin representations, e.g. `https://example.com/ni-hao`, probably using the `pinyin` or `ruby-pinyin` gem. Trello: https://trello.com/c/YGsEG9YV/3418-improve-html-attachment-slug-creation
We have seen an increasing number of errors recently relating to publishers attempting to use 4 byte characters in Whitehall. This commit adds a data migration which will convert all of the tables to use the UTF8MB4 charset. This uses a data migration because it is not subject to the 15 minute limit for schema migrations applied in the kubernetes configuration.
The migration will need to be run out of hours as the tables will need to be locked during conversion.
Trello: https://trello.com/c/aTlLraLX