[Fixes #9799] Thesaurus selectbox performance in metadata editor is very slow #9800

Merged
merged 16 commits into from
Aug 9, 2022

mwallschlaeger
Member

@mwallschlaeger mwallschlaeger commented Aug 2, 2022

References: #9799

As described in the issue, the requests to the database are very slow. The cause is a for loop in the current implementation that issues separate queries for every keyword. To evaluate my changes, I wrote a small benchmark that shows the performance improvement achieved by this commit.

import argparse
import time

import django
django.setup()

from geonode.base.models import ThesaurusKeyword, ThesaurusKeywordLabel

parser = argparse.ArgumentParser(description='Test thesaurus performance in GeoNode')
parser.add_argument('-tid', dest='thesaurus_id', type=int, required=True, help="pass the thesaurus id from the Thesaurus Table to select catalog to run on")
parser.add_argument('-lang', dest='language', default="en", type=str, help="pass language the results are searched for")
args = parser.parse_args()

start_time = time.time()
print("starting run with current GeoNode implementation ...")
qs_local = []
qs_non_local = [("", "------")]
for key in ThesaurusKeyword.objects.filter(thesaurus_id=args.thesaurus_id):
    label = ThesaurusKeywordLabel.objects.filter(keyword=key).filter(lang=args.language)
    if label.exists():
        qs_local.append((label.get().keyword.id, label.get().label))
    else:
        qs_non_local.append((key.id, key.alt_label))

qs = qs_local + qs_non_local
print("run finished in: {} and found {} objects ...".format(round(time.time() - start_time, 5), len(qs)))

print()

start_time = time.time()
print("starting run with suggested improvement ...")

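# Keywords of the selected thesaurus (lazy queryset, reused as a subquery below).
keyword_id_for_given_thesaurus = ThesaurusKeyword.objects.filter(thesaurus_id=args.thesaurus_id)
# Keyword ids of this thesaurus that have a label in the requested language.
qs_keyword_ids = ThesaurusKeywordLabel.objects.filter(lang=args.language, keyword_id__in=keyword_id_for_given_thesaurus).values("keyword_id")
# Keyword ids (across all thesauri) without a label in the requested language.
# Note: distinct() with a field name requires PostgreSQL.
not_qs_ids = ThesaurusKeywordLabel.objects.exclude(keyword_id__in=qs_keyword_ids).order_by("keyword_id").distinct("keyword_id").values("keyword_id")

# Localized labels, fetched with a single query.
qs_local = list(ThesaurusKeywordLabel.objects.filter(lang=args.language, keyword_id__in=keyword_id_for_given_thesaurus).values_list("keyword_id", "label"))
# Fall back to alt_label for keywords of this thesaurus without a localized label.
qs_non_local = list(keyword_id_for_given_thesaurus.filter(id__in=not_qs_ids).values_list("id", "alt_label"))

qs = qs_local + [("", "-------")] + qs_non_local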
print("run finished in: {} and found {} objects ...".format(round(time.time() - start_time, 5), len(qs)))

When I run the script with AGROVOC loaded into the database (https://github.com/zalf-rdm/geonode-agrovoc-importer) on my desktop PC, I get the following results:

❯ python debug_test.py -tid 13
starting run with current GeoNode implementation ...
run finished in: 112.2935 and found 40251 objects ...

starting run with suggested improvement ...
run finished in: 0.26096 and found 40251 objects ...

So the new implementation is roughly 430 times faster than the current implementation (112.2935 s vs. 0.26096 s).
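
For illustration, the difference between the two approaches can be reproduced without Django. The following is a minimal sketch using Python's sqlite3 module with a hypothetical, simplified two-table schema (keyword, label); the table names, column names and function names are assumptions and do not match GeoNode's actual models or the exact queries in this PR. It contrasts one query per keyword with a fixed number of set-based queries and counts the queries issued.

```python
import sqlite3

# Hypothetical, simplified schema; the real GeoNode tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keyword (id INTEGER PRIMARY KEY, thesaurus_id INTEGER, alt_label TEXT);
    CREATE TABLE label (keyword_id INTEGER, lang TEXT, label TEXT);
""")
conn.executemany(
    "INSERT INTO keyword VALUES (?, ?, ?)",
    [(1, 13, "alt-1"), (2, 13, "alt-2"), (3, 13, "alt-3")],
)
conn.executemany(
    "INSERT INTO label VALUES (?, ?, ?)",
    [(1, "en", "soil"), (2, "de", "Wasser"), (3, "en", "crop")],
)


def per_keyword(thesaurus_id, lang):
    """Slow pattern: one additional query for every keyword."""
    local, non_local = [], []
    rows = conn.execute(
        "SELECT id, alt_label FROM keyword WHERE thesaurus_id = ? ORDER BY id",
        (thesaurus_id,),
    ).fetchall()
    queries = 1
    for key_id, alt_label in rows:
        row = conn.execute(
            "SELECT label FROM label WHERE keyword_id = ? AND lang = ?",
            (key_id, lang),
        ).fetchone()
        queries += 1
        if row:
            local.append((key_id, row[0]))
        else:
            non_local.append((key_id, alt_label))
    return local, non_local, queries


def set_based(thesaurus_id, lang):
    """Fast pattern: two queries, independent of the number of keywords."""
    local = conn.execute(
        "SELECT l.keyword_id, l.label FROM label l "
        "JOIN keyword k ON k.id = l.keyword_id "
        "WHERE k.thesaurus_id = ? AND l.lang = ? ORDER BY l.keyword_id",
        (thesaurus_id, lang),
    ).fetchall()
    non_local = conn.execute(
        "SELECT id, alt_label FROM keyword WHERE thesaurus_id = ? "
        "AND id NOT IN (SELECT keyword_id FROM label WHERE lang = ?) ORDER BY id",
        (thesaurus_id, lang),
    ).fetchall()
    return local, non_local, 2


slow = per_keyword(13, "en")
fast = set_based(13, "en")
print("queries issued:", slow[2], "vs", fast[2])
```

Both functions return the same localized and fallback entries. The query count of the per-keyword variant grows linearly with the number of keywords, while the set-based variant stays constant.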

In my opinion, the handling of alt_labels, i.e. the fallback for keywords that have no label in the current language, could also be improved.

Checklist

Reviewing is a process done by project maintainers, mostly on a volunteer basis. We try to keep the overhead as small as possible and appreciate if you help us to do so by completing the following items. Feel free to ask in a comment if you have troubles with any of them.

For all pull requests:

  • Confirm you have read the contribution guidelines
  • You have sent a Contribution Licence Agreement (CLA) as necessary (not required for small changes, e.g., fixing typos in the documentation)
  • Make sure the first PR targets the master branch, eventual backports will be managed later. This can be ignored if the PR is fixing an issue that only happens in a specific branch, but not in newer ones.

The following are required only for core and extension modules (they are welcomed, but not required, for contrib modules):

  • There is a ticket in https://github.com/GeoNode/geonode/issues describing the issue/improvement/feature (a notable exemption is, changes not visible to end-users)
  • The issue connected to the PR must have Labels and Milestone assigned
  • PR for bug fixes and small new features are presented as a single commit
  • Commit message must be in the form "[Fixes #<issue_number>] Title of the Issue"
  • New unit tests have been added covering the changes, unless there is an explanation on why the tests are not necessary/implemented
  • This PR passes all existing unit tests (test results will be reported by travis-ci after opening this PR)
  • This PR passes the QA checks: flake8 geonode
  • Commits changing the settings, UI, existing user workflows, or adding new functionality, need to include documentation updates
  • Commits adding new texts do use gettext and have updated .po / .mo files (without location infos)

Submitting the PR does not require you to check all items, but by the time it gets merged, they should be either satisfied or inapplicable.

@cla-bot cla-bot bot added the cla-signed CLA Bot: community license agreement signed label Aug 2, 2022
@afabiani afabiani requested a review from mattiagiupponi August 2, 2022 13:42
@afabiani afabiani requested a review from etj August 2, 2022 13:42
@afabiani afabiani added this to the 4.0.0 milestone Aug 2, 2022
Contributor

@mattiagiupponi mattiagiupponi left a comment

Hi @mwallschlaeger, I see that no tests were added. Are the old ones still valid, or do they need some improvement?
(I'm asking since the build was failing :) )

codecov bot commented Aug 3, 2022

Codecov Report

Merging #9800 (6151ceb) into master (afc9194) will increase coverage by 0.01%.
The diff coverage is 72.91%.

@@            Coverage Diff             @@
##           master    #9800      +/-   ##
==========================================
+ Coverage   61.33%   61.35%   +0.01%     
==========================================
  Files         822      822              
  Lines       50277    50298      +21     
  Branches     7748     7745       -3     
==========================================
+ Hits        30837    30858      +21     
  Misses      17760    17760              
  Partials     1680     1680              

@mwallschlaeger
Member Author

OK. The main problem with the languages is that in the Django shell, django.utils.translation.get_language() returns "en", but when it is executed for a request from the browser, it returns a language code that includes the country, such as "en-us". Therefore, I added a function which removes the country code. Sadly, there is no function in the Django translation module which reliably returns the language without the country code. The search functions in geonode.base.view and geonode.base.form now first try to find entries for the language code including the country, because for some languages this makes sense. If no entries are found, we reduce the code to the language only ("en"). If that still returns nothing, GeoNode shows the alt_labels.
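
The fallback order described above could be sketched roughly as follows. This is only an illustration in plain Python: strip_country_code, resolve_labels and the dictionary-based lookup are hypothetical names and data structures chosen for the example, not the functions added in this PR.

```python
def strip_country_code(language_code):
    """Return the bare language part of a code such as "en-us" or "pt_BR"."""
    return language_code.replace("_", "-").split("-")[0].lower()


def resolve_labels(language_code, labels_by_lang, alt_labels):
    """Try the full code first, then the bare language, then the alt_labels.

    labels_by_lang is a hypothetical mapping from language code to the
    labels found for it; in GeoNode this would be a database lookup.
    """
    for candidate in (language_code.lower(), strip_country_code(language_code)):
        labels = labels_by_lang.get(candidate)
        if labels:
            return labels
    return alt_labels


# Labels exist only for "en", so "en-us" falls back to the bare language.
print(resolve_labels("en-us", {"en": ["soil"]}, ["alt-soil"]))
```

Trying the full code first keeps region-specific labels usable where they exist, and the bare language code is only used when that lookup is empty.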

I also added some tests which set the language with and without a country code and try to get results from the _get_thesauro_keyword_label function in ThesaurusAvailableForm.

In the current implementation, GeoNode always returns alt_labels because it is not able to handle a language code that includes the country code.

Contributor

@mattiagiupponi mattiagiupponi left a comment

Overall this looks OK; there is just an issue with the flake8 formatting. Can you please fix it so that the CircleCI tests will run?

@mattiagiupponi mattiagiupponi modified the milestones: 4.0.0, 4.0.1 Aug 8, 2022
@mattiagiupponi mattiagiupponi merged commit baa3268 into GeoNode:master Aug 9, 2022
mattiagiupponi added a commit that referenced this pull request Aug 9, 2022
…ery slow (#9800)

* [Fixes #9799] Thesaurus selectbox performance in metadata editor is very slow

Co-authored-by: mattiagiupponi <[email protected]>
mattiagiupponi added a commit that referenced this pull request Aug 9, 2022
…ery slow (#9800) (#9836)

* [Fixes #9799] Thesaurus selectbox performance in metadata editor is very slow

Co-authored-by: mattiagiupponi <[email protected]>

Co-authored-by: Marcel Wallschläger <[email protected]>
@mwallschlaeger mwallschlaeger deleted the issue_9799 branch August 9, 2022 11:46
Labels
cla-signed CLA Bot: community license agreement signed
3 participants