Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search endpoint #233

Closed
wants to merge 3 commits into from
Closed

Conversation

ckyrouac
Copy link
Contributor

No description provided.

@ckyrouac ckyrouac mentioned this pull request Apr 24, 2019
@ckyrouac
Copy link
Contributor Author

@Glutexo ready for a review.

@dehort
Copy link
Collaborator

dehort commented Apr 26, 2019

@Glutexo has this been reviewed?

@Glutexo
Copy link
Collaborator

Glutexo commented Apr 26, 2019

Started reviewing today, but left it for maintenance of the pagination pull request. Will have to do it on Monday.

@Glutexo
Copy link
Collaborator

Glutexo commented Apr 26, 2019

From a quick glance it looks fine, but wanted to thoroughly test it.

Copy link
Collaborator

@Glutexo Glutexo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request introduces two rather independent changes:

  • Field exclusion from the get hosts result
  • Adding the POST search operation

I think they can come in two separate pull requests. Mainly because I have some comments to both of them and they can be addressed and then merged independently.

I am not 100% sure what is the point of this search operation. It introduces only search by host IDs, which is already available on the GET endpoint. But it can be just a base for more sophisticated search system, so ok.

But still, having an operation that gets resources but uses POST is not RESTful. What is the reasoning behind this? Is it because the GET parameters can get too big? There are still some arbitrary limits on the size of the query. And even if it‘s because of the size of the request… Search queries written by humans probably never hit the GET limit. Queries by zillions of host IDs are going to be composed by machines and those can just do more than one query. I wouldn’t be afraid of decreasing performance, since it’s already necessary to paginate if the result set is too big – doing more queries and then paginating is the same as paginating a single query. I’d rather limit the number of hosts in the GET parameter than getting by POST.

/hosts/search?host_ids[]=47363ef2-720c-4c88-89be-11983bda65e6&host_ids[]=337a287f-e2d9-4692-856d-19efc548a04d

A query like this would be more RESTful.

That’s for the philosophical part. For the code, see my inline comments.

test_api.py Outdated
def test_query_with_exclude_system_profile(self):
host_list = self.added_hosts
host_id_list = [host.id for host in host_list]
query_doc = {"host_id_list": host_id_list, "exclude_fields": ["system_profile_facts"]}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is too long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is no linter defined on this project, what is the accepted line length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shortened the line

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d stick to PEP 8 which says:

Limit all lines to a maximum of 79 characters.

For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.

@@ -0,0 +1,24 @@
import logging
Copy link
Collaborator

@Glutexo Glutexo Apr 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t like much the fact that the search is out of the hosts namespace and is not presented at /hosts/search. Like this, it’s logical to have this separate module, but still it shares many functions with the api.host module resulting in a spaghetti code. Also it requires (although softly) renaming the shared methods, which increases confusion rate in that module.

If the search operation will be moved to the /hosts/search endpoint, then I’d move this only operation to the api.host module. Refactoring that module can be then easier and done in one place.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if just renamed to hosts_search, I’d still put it to the bulky api.host module. Some extraction can be done later as a separate step.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an issue with having a /search endpoint.

Eventually I would like to move the host related logic (including searching) out of the api.host module and into a separate module so that logic can be shared.

@ckyrouac
Copy link
Contributor Author

IMO /hosts/search is not RESTful (although REST can be subjective at times). I don't like using /hosts/search because it overloads the already defined /hosts/{id} path. Am I asking for a host resource with ID search?

POST to retrieve objects has become fairly common in REST interfaces when doing a "search" operation. Elastic Search actually does a GET with a body for it's search operations. It's a hotly debated topic which has no single best answer.

https://stackoverflow.com/questions/5020704/how-to-design-restful-search-filtering
https://softwareengineering.stackexchange.com/questions/233164/how-do-searches-fit-into-a-restful-interface

I agree needing this endpoint is not ideal. The reason we need this is because we do not yet have a good way to join data across the platform. One use case I'm aware of is that advisor has a long list of host ids it needs to retrieve details for to join with it's host specific data. The list is too large to put in a query parameter. Regardless of pagination, they only want to retrieve those specific hosts.

I'd rather not create separate PRs for this (extra overhead with minimal benefit). I think it all fits into the "search endpoint" task. I will fix the other comments in a new commit.

@Glutexo
Copy link
Collaborator

Glutexo commented Apr 30, 2019

I was thinking the same with /hosts/search; had the same issue with /hosts/download. However since the host ID is always a UUID, it’s at least unambiguous. Rails even uses /resource/new as a default for new resource page. But agreed. If it’s not going to be a completely generic search, but a search only for hosts, I’d at least rename it to host_search and then rename the host_id_list argument to id_list, or even better, just id. Like that it can reside in the api.hosts module, which can be somehow split by refactoring later.

I see the advantage of POST and I can even bend my mind around it like – we are creating a new (ephemeral) search object that contains the results. If this is only for Advisor that needs the retrieve many hosts at once, I’d just use GET and tell them to use more requests. They already have to do this, because there is a limited number of hosts per page. So no difference in performance and not even in the complexity – reduced on Advisor side, but increased on the inventory side. I’d save the POST for times when some complex filtering (probably from the UI) comes into play – then we can store a real search object by POST and then GET its results by its ID.

And thanks for the fixes!

@ckyrouac
Copy link
Contributor Author

What's the disadvantage to creating a more generic /search endpoint which could be enhanced later? I'd rather start with a foundation to improve upon than create a specific endpoint for each use case. I also don't want to design a public API simply to fit into a single code file.

I think the discussion on using GET/POST is out of scope. I've been told (see RHCLOUD-144) the requirement is to use POST to retrieve a list of hosts by id. At this point, I don't think we can (or should) put more work on the consumers of inventory to retrieve host details. We will likely deprecate this /search endpoint when the real search solution is complete next release.

@dehort
Copy link
Collaborator

dehort commented May 1, 2019

I haven't had a chance to look at this in detail. I did notice that you have to explicitly exclude the facts and system_profile fields. Can you modify this so that they are excluded from the response by default and only included when explicitly requested? Our main performance issues have been due to the large json document processing.

@ckyrouac
Copy link
Contributor Author

ckyrouac commented May 1, 2019

I can do that if you think it's necessary. I'd rather not because I think it muddies the API.

As a user of an API, I would intuitively expect all fields to be returned unless I specify to exclude fields (or am required to explicitly specify all the fields I want). It seems odd to exclude arbitrary (at least from a users perspective) fields unless specified to include them.

Couldn't the client (only advisor in this case?) specify the exclude in their query?

@dehort
Copy link
Collaborator

dehort commented May 1, 2019

Unfortunately, I feel like its necessary at this point. Processing large amounts of json has proven to hurt performance. So I'd like to return as little data as possible by default and make the client explicitly ask for the larger data sets.

@ckyrouac ckyrouac force-pushed the search_endpoint branch from 8dfe226 to 626e62d Compare May 1, 2019 17:11
@ckyrouac
Copy link
Contributor Author

ckyrouac commented May 1, 2019

ok. Pushed a new commit switching to include_fields.

Copy link
Collaborator

@Glutexo Glutexo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the new pieces of code and also added some more comments. I agree that once it’s decided we’ll have a POST endpoint, it’s pointless to argue about it here. I’d just rename it to host_search. Even if it’s may get more generic in the future, it’ll still search for hosts. I think it’s improbable to have a single endpoint that would search for different entities, at least in this stage of development.

I get the point of gradual improvements as well as that of solving multiple issues in one pull requests. Although I don’t support it much, I can live with it and don’t let it block this pull request.

include = validated_input.data["include_fields"]
exclude = []

if "facts" not in include:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dance around including and excluding the fields is confusing. We can at least keep this translation in just one place – in this method.

  • I think we can remove the argument from the get_host_list_by_id_list function. It serves optimization there, but since its sister function get_host_system_profile_by_id doesn’t do this, I think we can save the DB fields exclusion from the query for some later PR.
  • In the Host.to_json method the current exclude list parameter is more a boolean parameter. It can be either directly a boolean parameter include_facts, or at least an include list parameter. I don’t like these to_json methods at all, but this cat fit at least for now.

The latter would require to extract the Host.to_json part of the build_paginated_host_list_response function, or change the exclude argument in its signature as well, but that shouldn’t be an issue.

Like that, the include won’t be translated to exclude anywhere.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to remove the Host.to_json() methods and replace them with an export() method that simply returns a "complete" host dict. We could then have an exporter object that is responsible for filtering out the unneeded fields.

Here is an example that shows what I was thinking:
https://github.com/dehort/insights-host-inventory/tree/add_exporter_object_to_render_host_query_output

if "facts" not in include:
exclude.append("facts")
if "system_profile_facts" not in include:
exclude.append("system_profile_facts")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the system_profile_facts is present in the include list, it still doesn’t find its way to the response. That’s because the Host.to_json method doesn’t add the field to its result no matter what is in its exclude parameter.

@@ -83,15 +83,21 @@ def from_json(cls, d):
d.get("system_profile", {}),
)

def to_json(self):
def to_json(self, exclude=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exclude is essentially a boolean parameter. Let’s change it to something like include_facts.


class SearchSchema(Schema):
host_id_list = fields.List(fields.Str(validate=verify_uuid_format),
validate=validate.Length(min=1, max=5000),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d also add the constraints to the API specification too.

host_id_list = fields.List(fields.Str(validate=verify_uuid_format),
validate=validate.Length(min=1, max=5000),
required=True)
include_fields = fields.List(fields.Str(validate=validate.Length(min=1, max=255)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field value is essentially an enum("facts", "system_profile_facts"). Let’s describe it like that both here and in the API specification. The length argument would be then unnecessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to list out to list out the valid values. I think an enum makes sense here.

HOST_URL = "/api/inventory/v1/hosts"
BASE_URL = "/api/inventory/v1"
HOST_URL = BASE_URL + "/hosts"
SEARCH_URL = BASE_URL + "/search"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the endpoint is renamed, it’d have to be changed here to..

Suggested change
SEARCH_URL = BASE_URL + "/search"
SEARCH_URL = BASE_URL + "/host_search"

for host in response["results"]:
assert host["id"] in host_id_list
assert "facts" in host
assert "system_profile_facts" not in host
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The system_profile_facts field is never present in the response, regardless of the included fields value.

assert "system_profile_facts" not in host

self._post_paging_test(SEARCH_URL, query_doc, len(self.added_hosts))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either test the inclusion of the system_profile_facts field and see it fails, or remove the hints that this field can be included.


self._post_paging_test(SEARCH_URL, query_doc, len(self.added_hosts))

def test_query_with_exclude_facts_and_system_profile(self):
Copy link
Collaborator

@Glutexo Glutexo May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, regardless of the included_fields value.

@@ -0,0 +1,24 @@
import logging
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if just renamed to hosts_search, I’d still put it to the bulky api.host module. Some extraction can be done later as a separate step.

host_id_list = [host.id for host in host_list]
query_doc = {"host_id_list": host_id_list, "include_fields": ["invalid"]}
response = self.post(SEARCH_URL, query_doc, 400)
assert "Invalid include_fields. Valid values are: ['facts', 'system_profile_facts']" in response["detail"]["include_fields"][0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert "Invalid include_fields. Valid values are: ['facts', 'system_profile_facts']" in response["detail"]["include_fields"][0]
self.assertIn("Invalid include_fields. Valid values are: ['facts', 'system_profile_facts']", response["detail"]["include_fields"][0])

@@ -1488,6 +1501,71 @@ def test_query_with_no_matching_insights_id(self):
self._base_query_test(uuid_that_does_not_exist_in_db, 0)


class SearchTests(PreCreatedHostsBaseTestCase):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the endpoint gets renamed to /hosts_search, I’d rename this class to HostsSearchTests for symmetry.

@Glutexo
Copy link
Collaborator

Glutexo commented May 20, 2019

Talked a bit with @karelhala and finally understood, where does the need for the search by many host IDs comes from. The main use case for this – having a list of hosts filtered by conditions not known to the Inventory – can’t be viably solved just by increasing the number of requestable host IDs. The applications possibly need to pass hundreds of IDs to the query and then request separate pages of the result set, grab the results in different order, use additional filters etc. I think that the only way to this is to use this POST to create a search query object in the Inventory, add host IDs to it and then use a generated ID of this object to request the hosts.

@@ -0,0 +1,24 @@
import logging
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an issue with having a /search endpoint.

Eventually I would like to move the host related logic (including searching) out of the api.host module and into a separate module so that logic can be shared.

include = validated_input.data["include_fields"]
exclude = []

if "facts" not in include:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to remove the Host.to_json() methods and replace them with an export() method that simply returns a "complete" host dict. We could then have an exporter object that is responsible for filtering out the unneeded fields.

Here is an example that shows what I was thinking:
https://github.com/dehort/insights-host-inventory/tree/add_exporter_object_to_render_host_query_output


class SearchSchema(Schema):
host_id_list = fields.List(fields.Str(validate=verify_uuid_format),
validate=validate.Length(min=1, max=5000),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

host_id_list = fields.List(fields.Str(validate=verify_uuid_format),
validate=validate.Length(min=1, max=5000),
required=True)
include_fields = fields.List(fields.Str(validate=validate.Length(min=1, max=255)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to list out to list out the valid values. I think an enum makes sense here.


@validates("exclude_fields")
def validate_exclude_fields(self, fields):
valid_fields = ["facts", "system_profile_facts"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the search by post would allow for including / excluding the facts and system_profile.

@jharting jharting added the abandoned May be done in the future, wontfix now label Oct 29, 2019
@jharting jharting closed this Oct 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
abandoned May be done in the future, wontfix now
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants