Alphabetical order issues in sorting #171

Closed
cessda-bitbucket-importer opened this issue Apr 30, 2020 · 12 comments

@cessda-bitbucket-importer

Original report on BitBucket by Taina Jääskeläinen.


Alphabetical ordering by titles: some issues.

  1. The labels are the wrong way round, I think (A-Z is actually Z-A).
  2. A-Z sorting goes haywire if the title starts with ', “, ( or *, or with a 4-digit number (a year). Can the system easily be taught to ignore these? I will at some point raise issues with the SPs about these anyway, except for years, which are allowed.
  3. What about ‘A’ and ‘The’: does English sort by them or by the first ‘real’ word in the title? If the latter, the system should ignore the A and The.
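
To make points 2 and 3 concrete, here is a minimal Python sketch of a sort key that skips leading punctuation and a leading ‘A’ or ‘The’. It is only an illustration of the requested behaviour, not how the catalogue is implemented: the helper name `title_sort_key`, the two patterns and the sample titles are all made up for this example. Leading years are deliberately left in place, since those are allowed.

```python
import re

# Hypothetical patterns: the characters and articles to skip are taken from
# the list above, not from the project's actual configuration.
LEADING_PUNCTUATION = re.compile(r"^[\s'\"“”‘’(*]+")
LEADING_ARTICLE = re.compile(r"^(?:a|the)\s+", re.IGNORECASE)


def title_sort_key(title: str) -> str:
    """Build a sort key that ignores leading punctuation and 'A'/'The'."""
    key = LEADING_PUNCTUATION.sub("", title)  # drop ', ", “, (, * at the start
    key = LEADING_ARTICLE.sub("", key)        # drop a leading article
    return key.casefold()                     # compare without regard to case


# Made-up titles, for illustration only.
titles = [
    "The British Election Study",
    "(Provisional) Census Data",
    "“Youth” Survey",
    "A Study of Ageing",
    "1998 Labour Force Survey",
]
sorted_titles = sorted(titles, key=title_sort_key)
```

With this key the titles sort on 1998, British, Provisional, Study and Youth respectively, while the displayed titles stay unchanged.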

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


See also #154

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


I’ll fix the label issues (via the linked issue), but cannot fix the behavioural ones, which will have to wait for the next maintenance phase.

@cessda-bitbucket-importer

Original comment by Taina Jääskeläinen.


I have raised an issue with the service providers whose titles begin with brackets or with single or double quotation marks.

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


The Elasticsearch config file for each language points to the default stopword list for that language (where available):

czech, danish, german, greek, english, finnish, french, hungarian, italian, dutch, norwegian, portuguese, swedish.

Elasticsearch provides the following predefined list of stopword languages:

_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_.

So, no stopword lists are available for Estonian, Slovak and Slovenian.
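
As a rough illustration of what "points to the default stopword list" means, the sketch below builds an Elasticsearch `stop` token filter definition for a language, or returns nothing when no predefined list exists. The `"type": "stop"` filter and the `_language_` naming are standard Elasticsearch; the function name `stop_filter_for` is hypothetical, and the set of languages is simply the one quoted above, so other Elasticsearch versions may differ. This is not the project's actual config file.

```python
from typing import Optional

# Predefined stopword languages as quoted in the comment above; treat this
# as a snapshot, since the set depends on the Elasticsearch version in use.
PREDEFINED_STOPWORD_LANGUAGES = frozenset({
    "arabic", "armenian", "basque", "brazilian", "bulgarian", "catalan",
    "czech", "danish", "dutch", "english", "finnish", "french", "galician",
    "german", "greek", "hindi", "hungarian", "indonesian", "irish",
    "italian", "latvian", "norwegian", "persian", "portuguese", "romanian",
    "russian", "sorani", "spanish", "swedish", "thai", "turkish",
})


def stop_filter_for(language: str) -> Optional[dict]:
    """Return a `stop` token filter definition, or None if no list exists.

    Hypothetical helper for illustration; not taken from the project.
    """
    name = language.lower()
    if name not in PREDEFINED_STOPWORD_LANGUAGES:
        return None  # e.g. Estonian, Slovak, Slovenian
    return {"type": "stop", "stopwords": f"_{name}_"}


english_filter = stop_filter_for("English")
estonian_filter = stop_filter_for("Estonian")
```

For the three languages without a predefined list, a custom stopword list would have to be supplied instead.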

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


1 - fixed via #154

2 - TODO (see also https://github.com/cessda/cessda.metadata.office/issues/55 and https://github.com/cessda/cessda.metadata.office/issues/56)

3 - fixed via #204

@cessda-bitbucket-importer

Original comment by Taina Jääskeläinen.


Adding a sub-issue number 4:

Looking at Z-A sorting, it seems that if the title starts with a small letter rather than a capital letter, the sorting goes haywire. Can the system be taught to treat small and capital letters alike?

Sometimes the title needs to start with a small letter, for instance elderLUCID: London UCL Older adults' clear speech in interaction database. Here elderLUCID is the database name.

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


@matthew-morris-cessda Are you able to fix this? If so, please self-assign.

@cessda-bitbucket-importer

Original comment by Taina Jääskeläinen.


https://github.com/cessda/cessda.metadata.office/issues/56 is fixed and closed.

@cessda-bitbucket-importer

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


I’ve discovered the root cause:

Computers represent letters as numbers; for example, the letter G is represented by the number 71.

This issue arises because lowercase letters are represented by larger numbers than uppercase ones (e.g. g is represented by the number 103). Elasticsearch sorts by these numbers by default.

This has been fixed as of cessda/cessda.cdc.osmh-indexer.cmm@8940f35 but a reindex is required in order for the fix to take effect.
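
The effect is easy to reproduce outside Elasticsearch. The Python sketch below is only an illustration of the root cause, not the change made in the commit: it shows the two code points and how a case-insensitive key changes the order. Apart from the elderLUCID title quoted earlier in this thread, the titles are made up. In Elasticsearch itself, one common approach (an assumption here, since the commit contents are not shown) is to sort on a keyword field with a lowercase normalizer.

```python
# Code points for the two letters mentioned above.
upper_g = ord("G")  # 71
lower_g = ord("g")  # 103

titles = [
    "Zambia Household Survey",  # made-up title
    "elderLUCID: London UCL Older adults' clear speech in interaction database",
    "Attitudes Study",          # made-up title
]

# Default comparison uses code points, so every uppercase initial sorts
# before any lowercase initial and elderLUCID ends up last.
by_code_point = sorted(titles)

# Folding case in the sort key puts elderLUCID between A and Z.
case_insensitive = sorted(titles, key=str.casefold)
```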

@cessda-bitbucket-importer

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


[link to pull request removed](link to pull request removed)

@cessda-bitbucket-importer

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


171 - Resolved.PNG

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


Checked using Swedish alphabet
