-
-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[API] Non-English (unicode, extended ASCII, emoji) Characters Cannot Be Slug-ified in Tags #71
Comments
Adding in these links in case they're helpful. These are some references I've come across in researching this: |
Hi @BethanyG, First of all, thanks for documenting this issue so well! I went through the discussions and it seems the issue @billglover found in #69 (comment) is caused by the code here. The Here's a snippet to demonstrate the problem.
So the solution to our problem here is to install unidecode as a dependency. After that taggit will handle unicode tags just fine! Licensing is not an issue for us. Unidecode is GPL2+ and our code is GPL3 (https://github.com/codebuddies/backend/blob/master/cbv3_django_prototype/COPYING). So it's perfectly compatible. |
@sunu - Thank you so much for your careful reading and sleuthing! I am very happy to hear that we may have a straightforward solution to an ugly problem (and I am really happy we're GPL3!!). I am concerned that unidecode transliterates as opposed to preserving the unicode characters (the further from ascii you get, the less sense any sort of transliteration makes), so I went digging some more and found awesome slugify. Wonder if we could use that? ..or alter/override the What are other's thoughts? @lpatmo , @billglover - do we want to transliterate non-ascii characters into ascii, or preserve the unicode as written? |
I'm not a fan of the transliteration although it does seem to work. Is there any reason we need to use slugs and can't just return the slug guid for use in URLs? This feels like it would deliver all the functionality we'd need while saving it as a future enhancement to work on a common solution for tag to slug generation. |
I don't think we'd be using tag slugs as a piece of a URL (unless I'm missing a scenario), so I don't think we'd have concerns about Unicode characters there. But if we do have URL concerns and want to return a guid, we'd need to generate one. Taggit only assigns simple numeric key by default, and uses a slug for things like uniquification. The guid in earlier mock ups was something I was assigning in the DB when we were saving tags as JSON. I'll take a look at what the customization options are, I think we can add a field that's a guid if we want to go with that for now. |
It doesn't feel like we have a use-case for slugs outside the model (yet) so we may be able to avoid this paint. If we are going to use |
@BethanyG Just wanted to point out that the unicode text is preserved - not in Tag.slug but in Tag.name. The most common scenario where we want tags in the URLs is where we browse resources, hangouts etc by a tag. For example, lets say I want to see what hangouts and resources are there under the tag Django. I don't think we need an additional GUID field though. We can just use the slug with an unique constraint as the unique identifier in URLs. Personally I would prefer to get rid of slugs all together and just use the text. Unicode is supported in IRIs and web browsers handle them well (https://www.w3.org/International/articles/idn-and-iri/#iriworks) and I assume Django can as well. Better yet, consider the technique StackOverflow uses for its questions. The URL looks like |
@sunu - I understand that the unicode text is preserved in the tag name...but the slug is currently the field the lookup happens on, and is also the field used to determine the uniqueness of a tag. And I don't think transliteration is a good strategy for making the slug. I am right there with you - I don't have a good use for the slug field as it currently behaves, and the text field works for all the purposes I can think of. So I think we have a few options:
Any of these have implications across endpoints. What we decide for I really like your example of the way SO does urls - where there is an identifier (probably in our case a guid) followed by the name...and that would totally make sense with the tag-browsing scenario. Any options I'm leaving out (besides not using taggit)? I may play around with scenarios 2 & 3 in a branch and see how involved they are. |
I am playing with various additions/solutions in a branch -- but other approaches welcomed and encouraged!! @sunu -- please feel free to add yourself as an assignee, and play with some options (if you want to do that). I branched off the current code in @billglover & @lpatmo -- I am going to implement a |
Would be good to solve this, sounds like you have some sensible ideas for how to approach it. I've been biting my tongue on suggesting an array of strings 😄. I'm keen to focus on settling on the client facing contract (the JSON spec for requests/responses). This is harder to change as people build clients for the API. I am less concerned with how we implement this under the covers as we have some flexibility to change this as we learn more. |
Ok. Guid Added to representation. @billglover @lpatmo how does this look? |
Looks great!! 👏 |
This is looking really good! |
During the review of PR #69, @billglover uncovered a bug with
tags
that include Non-English characters (Unicode, extended ASCII, emoji): #69 (comment)In addition to not respecting the characters, the resulting strings fed to the serializer screw up the ordering and return of the
tag:slug
pairs, and are saved into the DB in garbled form.On the surface, this appeared to be a relatively straightforward problem: have Django or taggit respect the non-english characters. It is, however a rather complex situation.
Original PR filed for taggit
Second PR filed for taggit
linked issue (still open)
linked proposal for dealing with the issue (still open)
additional open issue blocking the above
Since we cannot rely on
django-taggit
to come up with an "easy" solution to this (nor do we have the forms and front end code complications that team does), we'll have to find a way to apprehend the strings and translate/transliterate or otherwise format them in a way that they can be slugified appropriately.This may mean using an alternative
slugify
process, or altering/patching thetaggit
code. Here is a link to thetaggit
code that saves tags to the DB, as a place to start.Not super-clear on the path forward. First thoughts are
Apprehend the strings ahead of
taggit
, and translate or transliterate them (not optimal unless we can re-translate them for appropriate display)Alter
taggit
code to slugify the strings in a way that can be stored in the DB and retrieved in their original unicode form, since we do not have the restriction of making them nice for URLs, etc.The text was updated successfully, but these errors were encountered: