Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API] Non-English (unicode, extended ASCII, emoji) Characters Cannot Be Slug-ified in Tags #71

Closed
BethanyG opened this issue Feb 19, 2020 · 14 comments · Fixed by #96
Closed
Assignees
Labels
bug Something isn't working

Comments

@BethanyG
Copy link
Member

BethanyG commented Feb 19, 2020

During the review of PR #69, @billglover uncovered a bug with tags that include Non-English characters (Unicode, extended ASCII, emoji): #69 (comment)

In addition to not respecting the characters, the resulting strings fed to the serializer screw up the ordering and return of the tag:slug pairs, and are saved into the DB in garbled form.

On the surface, this appeared to be a relatively straightforward problem: have Django or taggit respect the non-english characters. It is, however a rather complex situation.

Original PR filed for taggit
Second PR filed for taggit
linked issue (still open)
linked proposal for dealing with the issue (still open)
additional open issue blocking the above


Since we cannot rely on django-taggit to come up with an "easy" solution to this (nor do we have the forms and front end code complications that team does), we'll have to find a way to apprehend the strings and translate/transliterate or otherwise format them in a way that they can be slugified appropriately.

This may mean using an alternative slugify process, or altering/patching the taggit code. Here is a link to the taggit code that saves tags to the DB, as a place to start.

Not super-clear on the path forward. First thoughts are

  1. Apprehend the strings ahead of taggit, and translate or transliterate them (not optimal unless we can re-translate them for appropriate display)

  2. Alter taggit code to slugify the strings in a way that can be stored in the DB and retrieved in their original unicode form, since we do not have the restriction of making them nice for URLs, etc.

@BethanyG BethanyG added the bug Something isn't working label Feb 19, 2020
@BethanyG
Copy link
Member Author

Adding in these links in case they're helpful. These are some references I've come across in researching this:

@sunu
Copy link

sunu commented Feb 20, 2020

Hi @BethanyG,

First of all, thanks for documenting this issue so well! I went through the discussions and it seems the issue @billglover found in #69 (comment) is caused by the code here.

The taggit library is using a combination of unidecode and django.utils.text.slugify to generate slugs. But if unidecode is not installed, taggit uses a no-op function in its place (https://github.com/jazzband/django-taggit/blob/954d8360102fb2102db246bde1336a7e58ff10f2/taggit/models.py#L11). But then the problem is django.utils.text.slugify can't handle unicode text on its own.

Here's a snippet to demonstrate the problem.

In [1]: text = '这是中文'                                                                               

In [2]: from django.utils.text import slugify                                                           

In [3]: slugify(text)                                                                                   
Out[3]: ''

In [4]: from unidecode import unidecode                                                                 

In [5]: slugify(unidecode(text))                                                                        
Out[5]: 'zhe-shi-zhong-wen'

So the solution to our problem here is to install unidecode as a dependency. After that taggit will handle unicode tags just fine!

Licensing is not an issue for us. Unidecode is GPL2+ and our code is GPL3 (https://github.com/codebuddies/backend/blob/master/cbv3_django_prototype/COPYING). So it's perfectly compatible.

@BethanyG
Copy link
Member Author

BethanyG commented Feb 20, 2020

@sunu - Thank you so much for your careful reading and sleuthing! I am very happy to hear that we may have a straightforward solution to an ugly problem (and I am really happy we're GPL3!!).

I am concerned that unidecode transliterates as opposed to preserving the unicode characters (the further from ascii you get, the less sense any sort of transliteration makes), so I went digging some more and found awesome slugify. Wonder if we could use that?

..or alter/override the taggit code to simply call Djangos slugify with the unicode tag (e.g.slugify("ванна", allow_unicode=True))

What are other's thoughts? @lpatmo , @billglover - do we want to transliterate non-ascii characters into ascii, or preserve the unicode as written?

@billglover
Copy link
Contributor

billglover commented Feb 21, 2020

I'm not a fan of the transliteration although it does seem to work. Is there any reason we need to use slugs and can't just return the slug guid for use in URLs? This feels like it would deliver all the functionality we'd need while saving it as a future enhancement to work on a common solution for tag to slug generation.

@BethanyG
Copy link
Member Author

I don't think we'd be using tag slugs as a piece of a URL (unless I'm missing a scenario), so I don't think we'd have concerns about Unicode characters there.

But if we do have URL concerns and want to return a guid, we'd need to generate one. Taggit only assigns simple numeric key by default, and uses a slug for things like uniquification. The guid in earlier mock ups was something I was assigning in the DB when we were saving tags as JSON.

I'll take a look at what the customization options are, I think we can add a field that's a guid if we want to go with that for now.

@billglover
Copy link
Contributor

It doesn't feel like we have a use-case for slugs outside the model (yet) so we may be able to avoid this paint.

If we are going to use guid as a unique identifier for resources, and not the auto incrementing ID then I think we should try to be consisted across other types (e.g. tags, users, hangouts, etc.)

@sunu
Copy link

sunu commented Feb 21, 2020

@BethanyG Just wanted to point out that the unicode text is preserved - not in Tag.slug but in Tag.name.

The most common scenario where we want tags in the URLs is where we browse resources, hangouts etc by a tag. For example, lets say I want to see what hangouts and resources are there under the tag Django. I don't think we need an additional GUID field though. We can just use the slug with an unique constraint as the unique identifier in URLs.

Personally I would prefer to get rid of slugs all together and just use the text. Unicode is supported in IRIs and web browsers handle them well (https://www.w3.org/International/articles/idn-and-iri/#iriworks) and I assume Django can as well. Better yet, consider the technique StackOverflow uses for its questions. The URL looks like codebuddies.org/tags/234/python%20programming. The numeric tag id is used as the identifier and the text is just to make the URL descriptive.

@BethanyG
Copy link
Member Author

@sunu - I understand that the unicode text is preserved in the tag name...but the slug is currently the field the lookup happens on, and is also the field used to determine the uniqueness of a tag. And I don't think transliteration is a good strategy for making the slug.

I am right there with you - I don't have a good use for the slug field as it currently behaves, and the text field works for all the purposes I can think of.

So I think we have a few options:

  1. Use Unidecode, live with its transliteration.

  2. Customize taggit via a through model, and re-define the slug to use the unicode slugifier of our choice...or write one.

  3. Customize taggit via a through model and add a guid for tags that could be used in place of a slug, or as a supplement to a slug .

  4. Do something analogous to this Russian Slugs for taggit solution - but wow, another generic key...not sure we want that bit of business in the mix.

Any of these have implications across endpoints. What we decide for resources, we'll probably want to stick to for everything else, so we should think about it a little bit. ;-)

I really like your example of the way SO does urls - where there is an identifier (probably in our case a guid) followed by the name...and that would totally make sense with the tag-browsing scenario.

Any options I'm leaving out (besides not using taggit)? I may play around with scenarios 2 & 3 in a branch and see how involved they are.

@BethanyG BethanyG added in progress needs discussion The fix for this issue needs discussion labels Feb 25, 2020
@BethanyG
Copy link
Member Author

BethanyG commented Feb 25, 2020

I am playing with various additions/solutions in a branch -- but other approaches welcomed and encouraged!!

@sunu -- please feel free to add yourself as an assignee, and play with some options (if you want to do that). I branched off the current code in master, so that would be the place to start from if you'd like to play.

@billglover & @lpatmo -- I am going to implement a through model, attempt to use a unicode slugifier, and also add a guid for each tag to the mix. Please chime in if you have thoughts/concerns, or something of higher priority.

@BethanyG BethanyG self-assigned this Feb 25, 2020
@billglover
Copy link
Contributor

billglover commented Feb 25, 2020

Would be good to solve this, sounds like you have some sensible ideas for how to approach it. I've been biting my tongue on suggesting an array of strings 😄.

I'm keen to focus on settling on the client facing contract (the JSON spec for requests/responses). This is harder to change as people build clients for the API. I am less concerned with how we implement this under the covers as we have some flexibility to change this as we learn more.

@BethanyG
Copy link
Member Author

...not yet all the way there, but I think this might be an encouraging start:

image

...now, if I could just get the guid to show up for my serializer, we'd be golden. Until, of course, we find the next set of bugs! 😄

@BethanyG
Copy link
Member Author

Ok. Guid Added to representation. @billglover @lpatmo how does this look?

image

@lpatmo
Copy link
Member

lpatmo commented Feb 27, 2020

Looks great!! 👏

@billglover
Copy link
Contributor

This is looking really good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants