Proof of concept - Do not merge - Tests to decide labels field model #3283

jredrejo · 2021-09-16T18:59:09Z

Summary

Description of the change(s) you made

Creates new fields in the ContentNode model to test which one fits better for our new labels
- Four string fields
- Four array fields
- One json field
Create a test to check adding and filtering labels, and its descendants work
Create a test to measure performance when having many nodes in the table

Output when creating 1000 nodes

without indexes

USING ARRAY: Assigning random labels to 1003 nodes took 2.9013161659240723 seconds
USING ARRAY: Finding for resource label in 594 nodes took 0.32822537422180176 seconds
USING ARRAY: Finding for maths and descendants labell in 774 nodes took 0.43134069442749023 seconds
.
USING JSONB: Assigning random labels to 1003 nodes took 2.9802944660186768 seconds
USING JSONB: Finding for resource label in 578 nodes took 0.425093412399292 seconds
USING JSONB: Finding for maths and descendants labell in 788 nodes took 0.4425227642059326 seconds
.
USING STRING: Assigning random labels to 1003 nodes took 3.032484292984009 seconds
USING STRING: Finding for resource label in 598 nodes took 0.3240506649017334 seconds
USING STRING: Finding for maths and descendants labell in 761 nodes took 0.4070718288421631 seconds

with indexes

USING ARRAY: Assigning random labels to 1003 nodes took 3.483994722366333 seconds
USING ARRAY: Finding for resource label in 592 nodes took 0.34262681007385254 seconds
USING ARRAY ICONTAINS: Finding for maths and descendants label in 777 nodes took 0.5528428554534912 seconds
USING ARRAY OVERLAP: Finding for maths and descendants label in 777 nodes took 0.609926700592041 seconds
.
USING JSONB: Assigning random labels to 1003 nodes took 3.262275218963623 seconds
USING JSONB: Finding for resource label in 573 nodes took 0.33663392066955566 seconds
USING JSONB: Finding for maths and descendants label in 783 nodes took 0.458510160446167 seconds
.
USING STRING: Assigning random labels to 1003 nodes took 3.1667089462280273 seconds
USING STRING: Finding for resource label in 593 nodes took 0.45276808738708496 seconds
USING STRING: Finding for maths and descendants label in 781 nodes took 0.6456122398376465 seconds

Output when creating 100,000 nodes

Without indexes

USING ARRAY: Assigning random labels to 100003 nodes took 349.22260642051697 seconds
USING ARRAY: Finding for resource label in 58235 nodes took 36.27563977241516 seconds
USING ARRAY: Finding for maths and descendants labell in 76966 nodes took 49.18357467651367 seconds
.
USING JSONB: Assigning random labels to 100003 nodes took 323.29920983314514 seconds
USING JSONB: Finding for resource label in 58088 nodes took 37.14835238456726 seconds
USING JSONB: Finding for maths and descendants labell in 76971 nodes took 50.23621129989624 seconds
.
USING STRING: Assigning random labels to 100003 nodes took 318.8516821861267 seconds
USING STRING: Finding for resource label in 58325 nodes took 36.119210720062256 seconds
USING STRING: Finding for maths and descendants labell in 77391 nodes took 56.88925266265869 seconds

With indexes

USING ARRAY: Assigning random labels to 100003 nodes took 365.611202955246 seconds
USING ARRAY: Finding for resource label in 58275 nodes took 37.73640298843384 seconds
USING ARRAY ICONTAINS: Finding for maths and descendants label in 77110 nodes took 58.37085294723511 seconds
USING ARRAY OVERLAP: Finding for maths and descendants label in 77110 nodes took 68.87573957443237 seconds
.
USING JSONB: Assigning random labels to 100003 nodes took 383.17922043800354 seconds
USING JSONB: Finding for resource label in 58261 nodes took 42.6976261138916 seconds
USING JSONB: Finding for maths and descendants label in 77010 nodes took 53.09521484375 seconds
.
USING STRING: Assigning random labels to 100003 nodes took 385.1260995864868 seconds
USING STRING: Finding for resource label in 58501 nodes took 37.58138680458069 seconds
USING STRING: Finding for maths and descendants label in 77097 nodes took 52.43117928504944 seconds

Output when creating 10,000 nodes with indexes

USING ARRAY: Assigning random labels to 10003 nodes took 33.20896887779236 seconds
USING ARRAY: Finding for resource label in 5839 nodes took 3.9657087326049805 seconds
USING ARRAY ICONTAINS: Finding for maths and descendants label in 7664 nodes took 4.852238416671753 seconds
USING ARRAY OVERLAP: Finding for maths and descendants label in 7664 nodes took 4.907971382141113 seconds
.
USING JSONB: Assigning random labels to 10003 nodes took 35.136247634887695 seconds
USING JSONB: Finding for resource label in 5887 nodes took 3.6936919689178467 seconds
USING JSONB: Finding for maths and descendants label in 7704 nodes took 5.103723526000977 seconds
.
USING STRING: Assigning random labels to 10003 nodes took 34.664490938186646 seconds
USING STRING: Finding for resource label in 5853 nodes took 3.510707378387451 seconds
USING STRING: Finding for maths and descendants label in 7735 nodes took 6.093660116195679 seconds

Output when nodes are created using TreeBuilder

without indexes

11116 nodes created
USING ARRAY: Assigning random labels to 11116 nodes took 58.65201187133789 seconds
USING ARRAY: Finding for resource label in 6474 nodes took 3.9374704360961914 seconds
USING ARRAY ICONTAINS: Finding for maths and descendants label in 8533 nodes took 5.4344847202301025 seconds
USING ARRAY OVERLAP: Finding for maths and descendants label in 8533 nodes took 5.7019617557525635 seconds
.
USING JSONB: Assigning random labels to 11116 nodes took 54.31049466133118 seconds
USING JSONB: Finding for resource label in 6462 nodes took 4.321487188339233 seconds
USING JSONB: Finding for maths and descendants label in 8576 nodes took 5.750378608703613 seconds
.
USING STRING: Assigning random labels to 11116 nodes took 66.23195552825928 seconds
USING STRING: Finding for resource label in 6501 nodes took 3.68668532371521 seconds
USING STRING: Finding for maths and descendants label in 8537 nodes took 5.358809947967529 seconds

with indexes

11116 nodes created
USING ARRAY: Assigning random labels to 11116 nodes took 64.20341920852661 seconds
USING ARRAY: Finding for resource label in 6421 nodes took 3.920243501663208 seconds
USING ARRAY ICONTAINS: Finding for maths and descendants label in 8581 nodes took 6.055354356765747 seconds
USING ARRAY OVERLAP: Finding for maths and descendants label in 8581 nodes took 7.290720224380493 seconds
.
USING JSONB: Assigning random labels to 11116 nodes took 65.31257438659668 seconds
USING JSONB: Finding for resource label in 6472 nodes took 4.850388526916504 seconds
USING JSONB: Finding for maths and descendants label in 8570 nodes took 6.467301607131958 seconds
.
USING STRING: Assigning random labels to 11116 nodes took 54.25987720489502 seconds
USING STRING: Finding for resource label in 6528 nodes took 4.061512231826782 seconds
USING STRING: Finding for maths and descendants label in 8557 nodes took 7.213321208953857 seconds

Reviewer guidance

These tests were run in a laptop with 16 GB of RAM and i7-8550U CPU @ 1.80GHz
It would be good to know if results in another hardware are too different

To run the massive test:
LABELS_MASSIVE=true pytest -s contentcuration/contentcuration/tests/test_labels.py::LabelsMassiveTestCase
The number of nodes to be created can be changed in line 92 of contentcuration/contentcuration/tests/test_labels.py

Are there any risky areas that deserve extra testing?

After looking at this PR a decission on the model field should be taken

References

#3263

Comments

Notice no indexes were applied in the new fields. After indexes are applied, inserting labels will take longer, but searching for them will take less time.

Contributor's Checklist

PR process:

If this is an important user-facing change, PR or related issue the CHANGELOG label been added to this PR. Note: items with this label will be added to the CHANGELOG at a later time
If this includes an internal dependency change, a link to the diff is provided
The docs label has been added if this introduces a change that needs to be updated in the user docs?
If any Python requirements have changed, the updated requirements.txt files also included in this PR
Opportunities for using Google Analytics here are noted
Migrations are safe for a large db

Studio-specifc:

All user-facing strings are translated properly
The notranslate class been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)
All UI components are LTR and RTL compliant
Views are organized into pages, components, and layouts directories as described in the docs
Users' storage used is recalculated properly on any changes to main tree files
If there new ways this uses user data that needs to be factored into our Privacy Policy, it has been noted.

Testing:

Code is clean and well-commented
Contributor has fully tested the PR manually
If there are any front-end changes, before/after screenshots are included
Critical user journeys are covered by Gherkin stories
Any new interactions have been added to the QA Sheet
Critical and brittle code paths are covered by unit tests

Reviewer's Checklist

This section is for reviewers to fill out.

Automated test coverage is satisfactory
PR is fully functional
PR has been tested for accessibility regressions
External dependency files were updated if necessary (yarn and pip)
Documentation is updated
Contributor is in AUTHORS.md

rtibbles · 2021-09-16T19:34:56Z

contentcuration/contentcuration/tests/test_labels.py

+        )
+
+        init_time = time()
+        strings = len(ContentNode.objects.filter(categoryArray__icontains=MATH))


An alternative to this might be to pass in an array of MATH and all its descendant categories and instead of icontains do overlap:

strings = len(ContentNode.objects.filter(categoryArray__overlap=[MATH, ....]))

Done, and also tested it using indexes, and creating the nodes using TreeBuilder. I have updated the PR description with all the numbers.
Conclusions I can extract with the current logic to search for tags:

Indexes make things slower

Using overlap instead of contains is slower in arrays

When searching for not-categories, i.e. pure arrays without descendants, strings is the fastest format

When searching for categories, i.e. having descendants, jsonb is the fastest format

sonarqubecloud · 2021-09-17T18:55:44Z

SonarCloud Quality Gate failed.

0 Bugs
0 Vulnerabilities
1 Security Hotspot
2 Code Smells

No Coverage information
No Duplication information

Tests for labels field

7f5682c

jredrejo added the work-in-progress label Sep 16, 2021

jredrejo requested review from bjester and rtibbles September 16, 2021 18:59

rtibbles reviewed Sep 16, 2021

View reviewed changes

Adding indexes & using overlap in arrays searching

dcee7d3

jredrejo closed this Jan 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proof of concept - Do not merge - Tests to decide labels field model #3283

Proof of concept - Do not merge - Tests to decide labels field model #3283

jredrejo commented Sep 16, 2021 •

edited

Loading

rtibbles Sep 16, 2021

jredrejo Sep 17, 2021 •

edited

Loading

sonarqubecloud bot commented Sep 17, 2021

Proof of concept - Do not merge - Tests to decide labels field model #3283

Proof of concept - Do not merge - Tests to decide labels field model #3283

Conversation

jredrejo commented Sep 16, 2021 • edited Loading

Summary

Description of the change(s) you made

Output when creating 1000 nodes

without indexes

with indexes

Output when creating 100,000 nodes

Without indexes

With indexes

Output when creating 10,000 nodes with indexes

Output when nodes are created using TreeBuilder

without indexes

with indexes

Reviewer guidance

Are there any risky areas that deserve extra testing?

References

Comments

Contributor's Checklist

Reviewer's Checklist

This section is for reviewers to fill out.

rtibbles Sep 16, 2021

Choose a reason for hiding this comment

jredrejo Sep 17, 2021 • edited Loading

Choose a reason for hiding this comment

sonarqubecloud bot commented Sep 17, 2021

jredrejo commented Sep 16, 2021 •

edited

Loading

jredrejo Sep 17, 2021 •

edited

Loading