Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elastic search search_after pagination #4473

Merged
merged 22 commits into from
Dec 26, 2024
Merged

Elastic search search_after pagination #4473

merged 22 commits into from
Dec 26, 2024

Conversation

hamza-56
Copy link
Contributor

@hamza-56 hamza-56 commented Oct 23, 2024

Testing instructions:

GET /api/v2/search/all/

  • page_size parameter limits the number of results returned.
  • search_after parameter enables pagination by returning results that appear after a specified document.

Example:

GET http://localhost:18381/api/v2/search/all/?page_size=1&search_after=[1728521062000,%22program:0b41be9d-10eb-4e10-ba2e-b2e47479e7b9%22]

@hamza-56 hamza-56 self-assigned this Oct 23, 2024
@hamza-56 hamza-56 marked this pull request as ready for review November 29, 2024 00:46
---------
ElasticSearch enforces a strict limit on the number of records that can be retrieved in a single query.
This limit is controlled by the `MAX_RESULT_WINDOW` setting, which defaults to 10,000.
When this limit is exceeded, data loss occurs in responses retrieved from the `api/v1/search/all/` endpoint.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the issue was restricted to search/all only, any endpoint using ES (like catalog endpoints) would also face the issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated ✅


To address this issue, we need a more efficient way to paginate large query results.
The solution must allow for seamless and reliable pagination without imposing excessive resource demands on the system.
Furthermore, it should ensure that the existing search functionality and search responses remain unaffected in the current version of the endpoint.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular line serves no purpose w.r.t context.

@@ -99,6 +101,41 @@ class CustomPageNumberPagination(PageNumberPagination):
page_size_query_param = 'page_size'


class SearchAfterPagination(PageNumberPagination):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can move this to pagination.py file

Copy link
Contributor Author

@hamza-56 hamza-56 Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already an existing CustomPageNumberPagination class in viewsets.py. Ideally, both classes should be moved to pagination.py. However, following the current pattern, I created the new pagination class in the same file for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i think we can move both of these classes to pagination.py

@@ -77,6 +77,9 @@ class ProgramDocument(BaseDocument, OrganizationsMixin):
def prepare_aggregation_key(self, obj):
return 'program:{}'.format(obj.uuid)

def prepare_aggregation_uuid(self, obj):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[thinking] Can we add prepare_aggregation_uuid to BaseDocument since it is the same across all the Model Documents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not the same, as the model names are different; each aggregation UUID includes the model name followed by the UUID.

course_discovery/apps/api/v2/views/search.py Outdated Show resolved Hide resolved
Copy link
Contributor

@AfaqShuaib09 AfaqShuaib09 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a one comment to address, Overall very nice work on this 🎉

Copy link
Contributor

@Ali-D-Akbar Ali-D-Akbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, functionality wise. Great work 👏

@@ -0,0 +1,131 @@
from course_discovery.apps.course_metadata.search_indexes import documents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Add Docstring please

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little consistency among the docstrings in classes would be nice i.e. add docstrings for all classes except for meta. 🫡

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added ✅

@@ -0,0 +1,113 @@
import json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please add docstring in all the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added ✅



@ddt.ddt
class AggregateSearchViewSetV2Tests(mixins.LoginMixin, ElasticsearchTestMixin, mixins.APITestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little docstring with updated descriptions for classes and tests would be really nice since the unit tests are a little hard to understand by the first look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added ✅

if search_after:
query_params["search_after"] = search_after
response = self.client.get(self.list_path, data=query_params)
assert response.status_code == 200
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optionally, please update normal assert == to assertEqual to make it look cool.


class Meta(CourseRunSearchDocumentSerializer.Meta):
document = CourseRunDocument
fields = CourseRunSearchDocumentSerializer.Meta.fields + SEARCH_INDEX_ADDITIONAL_FIELDS_V2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optionally, you can turn these into a mixin like:

class V2SerializerMixin:
    """
    Mixin to extend the fields attribute in the Meta class of serializers.
    """

    @staticmethod
    def extend_fields(base_fields):
        return base_fields + SEARCH_INDEX_ADDITIONAL_FIELDS_V2

and use it like
fields = V2SerializerMixin.extend_fields(CourseRunSearchDocumentSerializer.Meta.fields)

@@ -1,3 +1,5 @@
import json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File Docstring plis.

@hamza-56 hamza-56 merged commit c3261d4 into master Dec 26, 2024
14 checks passed
@hamza-56 hamza-56 deleted the hamza/PROD-4012 branch December 26, 2024 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants