Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow more configuration with validate_all #345

Closed
philvarner opened this issue May 17, 2021 · 14 comments · Fixed by #1150
Closed

Allow more configuration with validate_all #345

philvarner opened this issue May 17, 2021 · 14 comments · Fixed by #1150
Assignees
Milestone

Comments

@philvarner
Copy link
Collaborator

The Catalog method validate_all currently attempts to iterate all child catalogs, collections, and items. For large catalogs, this can take a huge amount of time. I think it would be useful to provide additional parameters to this method to allow more configurability.

One possibility is to simply allow a user to only validate certain types of objects, e.g., only catalogs and collections, with:

  • catalog: bool = True
  • collection: bool = True
  • items: bool = True

A user could then do validate_all(catalog = True, collection = True, items = False) to only recursively validate the catalogs and collections.

A more complex option would be to provide a max count of each type to validate:

  • catalog: Optional[int] = None
  • collection: Optional[int] = None
  • items: Optional[int] = None

This way a user could say validate_all(catalog = None, collection = None, items = 100), and it would validate all their catalog and collection objects, but only validate 100 items. This would give some good validation that the small-c catalog is implemented correctly, but actually finish when there are more than thousands of items.

@gadomski gadomski added this to the 1.8 milestone Jan 31, 2023
@jsignell jsignell self-assigned this Jun 1, 2023
@jsignell
Copy link
Member

jsignell commented Jun 5, 2023

Just to circle back here. I opened #1141 which covers the scenario described in this issue of limiting the amount of objects that are iterated over.

I ended up taking a slightly different path mostly because I couldn't quite figure out whether count the items contained within a child using the original proposal. So instead I added a single max number of objects and a recursive flag. Here is what it looks like:

In [1]: from pystac_client import Client
   ...: 
   ...: URL  = "https://planetarycomputer.microsoft.com/api/stac/v1"
   ...: catalog = Client.open(URL)

In [2]: catalog.validate_all(max_n=10)
Out[2]: 10

In [3]: catalog.validate_all(recursive=False, max_n=10)
Out[3]: 10

It would be good to get some 👍s on approach though.

@gadomski
Copy link
Member

gadomski commented Jun 5, 2023

I think Phil's complex suggestion is the best, with one modification -- I don't think there should be a distinction between Catalog and Collection, since you don't know what kind you're going to get when you read a child link. So, I think my proposed method signature is:

def validate_all(self, max_children: Optional[int] = None, max_items: Optional[int] = None):
    """Validate this catalog and all of its children and items, recursively.

    - If `max_children=None`, all children will be validated
    - If `max_children=0`, no children will be validated (they will still be read because they are needed to get the items)
    - If `max_items=None`, all items will be resolved and validated
    - If `max_items=0`, no items will be resolved and validated

    If both `max_children=0` and `max_items=0`, you should be using `validate` instead.
    """
    ...

@philvarner
Copy link
Collaborator Author

Agree that descendent catalogs and collections should be handled as one count.

Validate this catalog and some or all of its children and items, recursively

I do have a doubt about what the semantics of max_children should be -- should be the max number of C&Cs to validate, or should be the max number to traverse in the DAG? Reason being is that I had mind that the intention of this was to limit the number of network requests that are made (to something lower than "all of them"), rather than limiting the number that get validated after being retrieved (which is relatively wall clock time cheap operation compared to network I/O).

@jsignell
Copy link
Member

jsignell commented Jun 6, 2023

should be the max number of C&Cs to validate, or should be the max number to traverse in the DAG?

Is this solved if I keep the recursive flag so that plus the max_children and max_items would be:

def validate_all(self, max_children: Optional[int] = None, max_items: Optional[int] = None, recursive: bool = True):
    """Validate this catalog and all of its children and items, recursively (as long as recursive is True).

    - If `max_children=None`, all children will be validated
    - If `max_children=0`, no children will be validated (they will still be read because they are needed to get the items)
    - If `max_items=None`, all items will be resolved and validated
    - If `max_items=0`, no items will be resolved and validated
    - If `recursive=False`, children will not be recursed into- only the child object itself will be validated

    If both `max_children=0` and `max_items=0`, you should be using `validate` instead.
    """
    ...

@gadomski
Copy link
Member

gadomski commented Jun 6, 2023

Should be the max number of C&Cs to validate, or should be the max number to traverse in the DAG?

I'm not convinced that it is valuable to limit the number of child links traversed -- in almost all cases, there are orders (and orders) of magnitude more items than children, so if you're worried about network requests, you'll want to limit items, not children.

@philvarner
Copy link
Collaborator Author

Good point, so as long as that's clearly documented. Is there any reason to limit the number of c&c's validated at all in that case?

@gadomski
Copy link
Member

gadomski commented Jun 6, 2023

Is there any reason to limit the number of c&c's validated at all in that case?

Good point, maybe not. This seems nice and clean:

def validate_all(self, max_items: Optional[int] = None, recursive: bool = True):
    """Validate this catalog and all of its items.

    - If `max_items=None`, all items will be resolved and validated.
    - If `max_items=0`, no items will be resolved and validated.
    - If `max_items=n`, the first `n` items will be resolved and validated.
    - If `recursive=True`, `validate_all` will traverse this Catalog's children and validate them and their items as well.
    - If `recursive=True` and `max_items` is a non-zero integer, recursion will stop after `max_items` Items have been validated.
    """

@philvarner
Copy link
Collaborator Author

- If max_items=None, all items will be resolved and validated. This may never complete in your lifetime.

@jsignell
Copy link
Member

jsignell commented Jun 6, 2023

Ok, I like that solution (getting closer to my PR 😄), but I think the number of items within the c&cs should count towards the max_items as well. If you run validate_all on the root catalog, there are no items on that, but there are many many within each collection right?

@gadomski
Copy link
Member

gadomski commented Jun 6, 2023

I think the number of items within the c&cs should count towards the max_items as well

Agreed, I think that's what I'm saying with "If recursive=True and max_items is a non-zero integer, recursion will stop after max_items Items have been validated." but maybe that's not clear?

@jsignell
Copy link
Member

jsignell commented Jun 6, 2023

I think that is clear

@jsignell
Copy link
Member

jsignell commented Jun 6, 2023

Thanks for the feedback! I adjusted the PR.

@jsignell
Copy link
Member

jsignell commented Jun 6, 2023

I am wondering now though if the default should be non-null. If the goal is to prevent an accidental total catalog fetch then a non-null default seems necessary.

@gadomski
Copy link
Member

gadomski commented Jun 6, 2023

I am wondering now though if the default should be non-null. If the goal is to prevent an accidental total catalog fetch then a non-null default seems necessary.

We did this with pystac-client and it made folks (incl Tom) unhappy: stac-utils/pystac-client#237 (comment)

I think validate_all should by default to it all. There's a lot of use cases (e.g. local catalogs) where a total validation is what you want, and anything less is surprising. Also, folks shouldn't be validating all of the Planetary Computer :-P.

We could have a warning if we hit some sort of limit but that feels like squeezing a bit too much juice out of the orange to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants