Search endpoint #233

ckyrouac · 2019-04-24T14:48:40Z

No description provided.

ckyrouac · 2019-04-24T14:51:14Z

@Glutexo ready for a review.

dehort · 2019-04-26T18:31:00Z

@Glutexo has this been reviewed?

Glutexo · 2019-04-26T18:32:52Z

Started reviewing today, but left it for maintenance of the pagination pull request. Will have to do it on Monday.

Glutexo · 2019-04-26T18:33:54Z

From a quick glance it looks fine, but wanted to thoroughly test it.

Glutexo

This pull request introduces two rather independent changes:

Field exclusion from the get hosts result
Adding the POST search operation

I think they can come in two separate pull requests. Mainly because I have some comments to both of them and they can be addressed and then merged independently.

I am not 100% sure what is the point of this search operation. It introduces only search by host IDs, which is already available on the GET endpoint. But it can be just a base for more sophisticated search system, so ok.

But still, having an operation that gets resources but uses POST is not RESTful. What is the reasoning behind this? Is it because the GET parameters can get too big? There are still some arbitrary limits on the size of the query. And even if it‘s because of the size of the request… Search queries written by humans probably never hit the GET limit. Queries by zillions of host IDs are going to be composed by machines and those can just do more than one query. I wouldn’t be afraid of decreasing performance, since it’s already necessary to paginate if the result set is too big – doing more queries and then paginating is the same as paginating a single query. I’d rather limit the number of hosts in the GET parameter than getting by POST.

/hosts/search?host_ids[]=47363ef2-720c-4c88-89be-11983bda65e6&host_ids[]=337a287f-e2d9-4692-856d-19efc548a04d

A query like this would be more RESTful.

That’s for the philosophical part. For the code, see my inline comments.

app/__init__.py

app/error_handlers.py

app/models.py

Glutexo · 2019-04-29T12:27:00Z

test_api.py

+    def test_query_with_exclude_system_profile(self):
+        host_list = self.added_hosts
+        host_id_list = [host.id for host in host_list]
+        query_doc = {"host_id_list": host_id_list, "exclude_fields": ["system_profile_facts"]}


This line is too long.

Since there is no linter defined on this project, what is the accepted line length?

Shortened the line

I’d stick to PEP 8 which says:

Limit all lines to a maximum of 79 characters.

For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.

api/search.py

api/host.py

Glutexo · 2019-04-29T12:44:23Z

api/search.py

@@ -0,0 +1,24 @@
+import logging


I don’t like much the fact that the search is out of the hosts namespace and is not presented at /hosts/search. Like this, it’s logical to have this separate module, but still it shares many functions with the api.host module resulting in a spaghetti code. Also it requires (although softly) renaming the shared methods, which increases confusion rate in that module.

If the search operation will be moved to the /hosts/search endpoint, then I’d move this only operation to the api.host module. Refactoring that module can be then easier and done in one place.

Even if just renamed to hosts_search, I’d still put it to the bulky api.host module. Some extraction can be done later as a separate step.

I don't see an issue with having a /search endpoint.

Eventually I would like to move the host related logic (including searching) out of the api.host module and into a separate module so that logic can be shared.

api/host.py

ckyrouac · 2019-04-29T13:45:44Z

IMO /hosts/search is not RESTful (although REST can be subjective at times). I don't like using /hosts/search because it overloads the already defined /hosts/{id} path. Am I asking for a host resource with ID search?

POST to retrieve objects has become fairly common in REST interfaces when doing a "search" operation. Elastic Search actually does a GET with a body for it's search operations. It's a hotly debated topic which has no single best answer.

https://stackoverflow.com/questions/5020704/how-to-design-restful-search-filtering
https://softwareengineering.stackexchange.com/questions/233164/how-do-searches-fit-into-a-restful-interface

I agree needing this endpoint is not ideal. The reason we need this is because we do not yet have a good way to join data across the platform. One use case I'm aware of is that advisor has a long list of host ids it needs to retrieve details for to join with it's host specific data. The list is too large to put in a query parameter. Regardless of pagination, they only want to retrieve those specific hosts.

I'd rather not create separate PRs for this (extra overhead with minimal benefit). I think it all fits into the "search endpoint" task. I will fix the other comments in a new commit.

Glutexo · 2019-04-30T08:24:51Z

I was thinking the same with /hosts/search; had the same issue with /hosts/download. However since the host ID is always a UUID, it’s at least unambiguous. Rails even uses /resource/new as a default for new resource page. But agreed. If it’s not going to be a completely generic search, but a search only for hosts, I’d at least rename it to host_search and then rename the host_id_list argument to id_list, or even better, just id. Like that it can reside in the api.hosts module, which can be somehow split by refactoring later.

I see the advantage of POST and I can even bend my mind around it like – we are creating a new (ephemeral) search object that contains the results. If this is only for Advisor that needs the retrieve many hosts at once, I’d just use GET and tell them to use more requests. They already have to do this, because there is a limited number of hosts per page. So no difference in performance and not even in the complexity – reduced on Advisor side, but increased on the inventory side. I’d save the POST for times when some complex filtering (probably from the UI) comes into play – then we can store a real search object by POST and then GET its results by its ID.

And thanks for the fixes!

ckyrouac · 2019-04-30T14:22:06Z

What's the disadvantage to creating a more generic /search endpoint which could be enhanced later? I'd rather start with a foundation to improve upon than create a specific endpoint for each use case. I also don't want to design a public API simply to fit into a single code file.

I think the discussion on using GET/POST is out of scope. I've been told (see RHCLOUD-144) the requirement is to use POST to retrieve a list of hosts by id. At this point, I don't think we can (or should) put more work on the consumers of inventory to retrieve host details. We will likely deprecate this /search endpoint when the real search solution is complete next release.

dehort · 2019-05-01T00:09:25Z

I haven't had a chance to look at this in detail. I did notice that you have to explicitly exclude the facts and system_profile fields. Can you modify this so that they are excluded from the response by default and only included when explicitly requested? Our main performance issues have been due to the large json document processing.

ckyrouac · 2019-05-01T13:51:12Z

I can do that if you think it's necessary. I'd rather not because I think it muddies the API.

As a user of an API, I would intuitively expect all fields to be returned unless I specify to exclude fields (or am required to explicitly specify all the fields I want). It seems odd to exclude arbitrary (at least from a users perspective) fields unless specified to include them.

Couldn't the client (only advisor in this case?) specify the exclude in their query?

dehort · 2019-05-01T16:46:18Z

Unfortunately, I feel like its necessary at this point. Processing large amounts of json has proven to hurt performance. So I'd like to return as little data as possible by default and make the client explicitly ask for the larger data sets.

ckyrouac · 2019-05-01T17:12:00Z

ok. Pushed a new commit switching to include_fields.

Glutexo

Reviewed the new pieces of code and also added some more comments. I agree that once it’s decided we’ll have a POST endpoint, it’s pointless to argue about it here. I’d just rename it to host_search. Even if it’s may get more generic in the future, it’ll still search for hosts. I think it’s improbable to have a single endpoint that would search for different entities, at least in this stage of development.

I get the point of gradual improvements as well as that of solving multiple issues in one pull requests. Although I don’t support it much, I can live with it and don’t let it block this pull request.

Glutexo · 2019-05-02T13:48:01Z

api/search.py

+    include = validated_input.data["include_fields"]
+    exclude = []
+
+    if "facts" not in include:


This dance around including and excluding the fields is confusing. We can at least keep this translation in just one place – in this method.

I think we can remove the argument from the get_host_list_by_id_list function. It serves optimization there, but since its sister function get_host_system_profile_by_id doesn’t do this, I think we can save the DB fields exclusion from the query for some later PR.

In the Host.to_json method the current exclude list parameter is more a boolean parameter. It can be either directly a boolean parameter include_facts, or at least an include list parameter. I don’t like these to_json methods at all, but this cat fit at least for now.

The latter would require to extract the Host.to_json part of the build_paginated_host_list_response function, or change the exclude argument in its signature as well, but that shouldn’t be an issue.

Like that, the include won’t be translated to exclude anywhere.

I would like to remove the Host.to_json() methods and replace them with an export() method that simply returns a "complete" host dict. We could then have an exporter object that is responsible for filtering out the unneeded fields.

Here is an example that shows what I was thinking:
https://github.com/dehort/insights-host-inventory/tree/add_exporter_object_to_render_host_query_output

Glutexo · 2019-05-02T14:02:47Z

api/search.py

+    if "facts" not in include:
+        exclude.append("facts")
+    if "system_profile_facts" not in include:
+        exclude.append("system_profile_facts")


If the system_profile_facts is present in the include list, it still doesn’t find its way to the response. That’s because the Host.to_json method doesn’t add the field to its result no matter what is in its exclude parameter.

Glutexo · 2019-05-02T14:43:18Z

app/models.py

@@ -83,15 +83,21 @@ def from_json(cls, d):
            d.get("system_profile", {}),
        )

-    def to_json(self):
+    def to_json(self, exclude=None):


The exclude is essentially a boolean parameter. Let’s change it to something like include_facts.

Glutexo · 2019-05-02T14:48:02Z

app/models.py

+
+class SearchSchema(Schema):
+    host_id_list = fields.List(fields.Str(validate=verify_uuid_format),
+                               validate=validate.Length(min=1, max=5000),


I’d also add the constraints to the API specification too.

The minimum and maximum number of items.

That this key is required.

Glutexo · 2019-05-02T14:50:44Z

app/models.py

+    host_id_list = fields.List(fields.Str(validate=verify_uuid_format),
+                               validate=validate.Length(min=1, max=5000),
+                               required=True)
+    include_fields = fields.List(fields.Str(validate=validate.Length(min=1, max=255)),


The field value is essentially an enum("facts", "system_profile_facts"). Let’s describe it like that both here and in the API specification. The length argument would be then unnecessary.

It would be nice to list out to list out the valid values. I think an enum makes sense here.

Glutexo · 2019-05-02T15:15:33Z

test_api.py

-HOST_URL = "/api/inventory/v1/hosts"
+BASE_URL = "/api/inventory/v1"
+HOST_URL = BASE_URL + "/hosts"
+SEARCH_URL = BASE_URL + "/search"


If the endpoint is renamed, it’d have to be changed here to..

Suggested change

SEARCH_URL = BASE_URL + "/search"

SEARCH_URL = BASE_URL + "/host_search"

Glutexo · 2019-05-02T15:30:15Z

test_api.py

+        for host in response["results"]:
+            assert host["id"] in host_id_list
+            assert "facts" in host
+            assert "system_profile_facts" not in host


The system_profile_facts field is never present in the response, regardless of the included fields value.

Glutexo · 2019-05-02T15:31:30Z

test_api.py

+            assert "system_profile_facts" not in host
+
+        self._post_paging_test(SEARCH_URL, query_doc, len(self.added_hosts))
+


Either test the inclusion of the system_profile_facts field and see it fails, or remove the hints that this field can be included.

Glutexo · 2019-05-02T15:32:38Z

test_api.py

+
+        self._post_paging_test(SEARCH_URL, query_doc, len(self.added_hosts))
+
+    def test_query_with_exclude_facts_and_system_profile(self):


Now, regardless of the included_fields value.

Glutexo · 2019-05-02T15:40:35Z

api/search.py

@@ -0,0 +1,24 @@
+import logging


Even if just renamed to hosts_search, I’d still put it to the bulky api.host module. Some extraction can be done later as a separate step.

Glutexo · 2019-05-06T15:41:36Z

test_api.py

+        host_id_list = [host.id for host in host_list]
+        query_doc = {"host_id_list": host_id_list, "include_fields": ["invalid"]}
+        response = self.post(SEARCH_URL, query_doc, 400)
+        assert "Invalid include_fields. Valid values are: ['facts', 'system_profile_facts']" in response["detail"]["include_fields"][0]


Suggested change

assert "Invalid include_fields. Valid values are: ['facts', 'system_profile_facts']" in response["detail"]["include_fields"][0]

self.assertIn("Invalid include_fields. Valid values are: ['facts', 'system_profile_facts']", response["detail"]["include_fields"][0])

Glutexo · 2019-05-06T16:06:49Z

test_api.py

@@ -1488,6 +1501,71 @@ def test_query_with_no_matching_insights_id(self):
        self._base_query_test(uuid_that_does_not_exist_in_db, 0)


+class SearchTests(PreCreatedHostsBaseTestCase):


If the endpoint gets renamed to /hosts_search, I’d rename this class to HostsSearchTests for symmetry.

Glutexo · 2019-05-20T09:56:48Z

Talked a bit with @karelhala and finally understood, where does the need for the search by many host IDs comes from. The main use case for this – having a list of hosts filtered by conditions not known to the Inventory – can’t be viably solved just by increasing the number of requestable host IDs. The applications possibly need to pass hundreds of IDs to the query and then request separate pages of the result set, grab the results in different order, use additional filters etc. I think that the only way to this is to use this POST to create a search query object in the Inventory, add host IDs to it and then use a generated ID of this object to request the hosts.

dehort · 2019-06-03T21:23:49Z

api/search.py

@@ -0,0 +1,24 @@
+import logging


I don't see an issue with having a /search endpoint.

Eventually I would like to move the host related logic (including searching) out of the api.host module and into a separate module so that logic can be shared.

dehort · 2019-06-03T21:31:10Z

api/search.py

+    include = validated_input.data["include_fields"]
+    exclude = []
+
+    if "facts" not in include:


I would like to remove the Host.to_json() methods and replace them with an export() method that simply returns a "complete" host dict. We could then have an exporter object that is responsible for filtering out the unneeded fields.

Here is an example that shows what I was thinking:
https://github.com/dehort/insights-host-inventory/tree/add_exporter_object_to_render_host_query_output

dehort · 2019-06-03T21:33:04Z

app/models.py

+
+class SearchSchema(Schema):
+    host_id_list = fields.List(fields.Str(validate=verify_uuid_format),
+                               validate=validate.Length(min=1, max=5000),


dehort · 2019-06-03T21:33:48Z

app/models.py

+    host_id_list = fields.List(fields.Str(validate=verify_uuid_format),
+                               validate=validate.Length(min=1, max=5000),
+                               required=True)
+    include_fields = fields.List(fields.Str(validate=validate.Length(min=1, max=255)),


It would be nice to list out to list out the valid values. I think an enum makes sense here.

dehort · 2019-06-03T21:35:32Z

app/models.py

+
+    @validates("exclude_fields")
+    def validate_exclude_fields(self, fields):
+        valid_fields = ["facts", "system_profile_facts"]


I was thinking the search by post would allow for including / excluding the facts and system_profile.

ckyrouac mentioned this pull request Apr 24, 2019

Use post to query hosts #193

Closed

Glutexo requested changes Apr 29, 2019

View reviewed changes

Glutexo reviewed Apr 29, 2019

View reviewed changes

api/host.py Outdated Show resolved Hide resolved

ckyrouac force-pushed the search_endpoint branch from 4c88982 to 3e9e26b Compare April 29, 2019 14:17

ckyrouac added 3 commits May 1, 2019 13:11

Search endpoint

fee0fa3

Search endpoint fixes for review

cb20b25

Switch from exclude_fields to include_fields in /search

626e62d

ckyrouac force-pushed the search_endpoint branch from 8dfe226 to 626e62d Compare May 1, 2019 17:11

Glutexo requested changes May 2, 2019

View reviewed changes

Glutexo reviewed May 6, 2019

View reviewed changes

dehort requested changes Jun 3, 2019

View reviewed changes

jharting added the abandoned May be done in the future, wontfix now label Oct 29, 2019

jharting closed this Oct 29, 2019

	SEARCH_URL = BASE_URL + "/search"
	SEARCH_URL = BASE_URL + "/host_search"

		assert "system_profile_facts" not in host

		self._post_paging_test(SEARCH_URL, query_doc, len(self.added_hosts))


		self._post_paging_test(SEARCH_URL, query_doc, len(self.added_hosts))

		def test_query_with_exclude_facts_and_system_profile(self):

	assert "Invalid include_fields. Valid values are: ['facts', 'system_profile_facts']" in response["detail"]["include_fields"][0]
	self.assertIn("Invalid include_fields. Valid values are: ['facts', 'system_profile_facts']", response["detail"]["include_fields"][0])

		@@ -1488,6 +1501,71 @@ def test_query_with_no_matching_insights_id(self):
		self._base_query_test(uuid_that_does_not_exist_in_db, 0)


		class SearchTests(PreCreatedHostsBaseTestCase):

Search endpoint #233

Search endpoint #233

Conversation

ckyrouac commented Apr 24, 2019

ckyrouac commented Apr 24, 2019

dehort commented Apr 26, 2019

Glutexo commented Apr 26, 2019

Glutexo commented Apr 26, 2019

Glutexo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Glutexo Apr 29, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ckyrouac commented Apr 29, 2019

Glutexo commented Apr 30, 2019 • edited Loading

ckyrouac commented Apr 30, 2019

dehort commented May 1, 2019

ckyrouac commented May 1, 2019

dehort commented May 1, 2019

ckyrouac commented May 1, 2019

Glutexo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Glutexo May 2, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Glutexo commented May 20, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Glutexo Apr 29, 2019 •

edited

Loading

Glutexo commented Apr 30, 2019 •

edited

Loading

Glutexo May 2, 2019 •

edited

Loading