Replacement functionality for get_all_items #237

TomAugspurger · 2022-06-08T20:47:49Z

(Followup to the deprecations from #206 which I haven't tested yet, so apologies if I missed something)

Over in https://github.com/microsoft/planetarycomputerexamples we use search.get_all_items() to get a single ItemCollection, which is then handed off to stackstac or odc.stac. Our use case requires having a concrete sequence of items ahead of time, not an iterator / iterable. What's the recommended replacement API? Something like

import pystac
import pystac_client

catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

time_range = "2020-12-01/2020-12-31"
bbox = [-122.2751, 47.5469, -121.9613, 47.7458]

search = catalog.search(collections=["landsat-c2-l2"], bbox=bbox, datetime=time_range)

pystac.ItemCollection(search.get_items(), clone_items=False)  # the replacement?

I have a couple concerns with this replacement, if it's the recommended alternative:

Previously, we had some surprisingly slow performance related to copying items and setting the right owner. It's not immediately clear to me whether the replacement does the right thing everywhere.
I worry about how easy it would be forget clone_items=False and now our users will get slower results until they profile things.

Given how common this use case is (at least in my corner of the world), I think it makes sense to keep a method on the search object that returns a concrete ItemCollection.

As an aside, I don't actually see a warning when running the old code on the Planetary Computer Hub using pystac 0.4.0, since pystac-client is emitting a DeprecationWarning. It's changed over the years, but DeprecationWarning isn't always visible to the end user. I don't know if pystac-client has a formal policy, but pandas uses either a FutureWarning for deprecations that will affect user-code, and DeprecationWarning for code that will likely affect library code (which is then upgraded to a FutureWarning in a future release).

The text was updated successfully, but these errors were encountered:

gadomski · 2022-06-14T22:11:02Z

If a user wants a List[pystac.Item], they can

items = list(item_search.items())

If they want un-PySTAC'd dictionaries:

items = list(item_search.items_as_dicts())

After discussion with yourself, @lossyrob, @matthewhanson, and @pjhartzell, it seems like the consensus is that pystac-client fulfills the role of "user convenience library", i.e. its API should solve the 80% use-case for the average user, rather than being a strict implementation (PySTAC is the "strict implementation" library). If that's the case, then we should:

Un-deprecate ItemSearch.get_all_items and ItemSearch.get_all_items_as_dict
- Question: should we keep those methods deprecated in favor of ItemSearch.all_items and ItemSearch.all_items_as_dict to bring them in line with the iterator versions?
Decide on the behavior when an items or items_as_dicts iterator gets to max_items. Should it:
- error?
- warn?
- something else?
Bring Client.get_all_items in line with ItemSearch.get_all_items (right now Client.get_all_items returns an iterator, which is different than ItemSearch.get_all_items)

I don't know if pystac-client has a formal policy, but pandas uses either a FutureWarning for deprecations that will affect user-code, and DeprecationWarning for code that will likely affect library code (which is then upgraded to a FutureWarning in a future release).

There isn't a formal policy. I don't know if I agree that FutureWarning is appropriate in this case. From https://docs.python.org/3/library/exceptions.html#FutureWarning:

when those warnings are intended for end users of applications that are written in Python

To me, the changes implemented in #206 are API changes, not application changes. To me, an "application change" would be a change to CLI behavior (e.g.). The docs for DeprecationWarning state:

when those warnings are intended for other Python developers

While notebooks blur the line between "Python developers" and "application users", you're still writing Python code in a notebook, so I'd say those users are "Python developers." Very happy to be argued out of this position, especially from the "what people actually see via the default warning filters" side of things.

TomAugspurger · 2022-07-08T20:02:39Z

Decide on the behavior when an items or items_as_dicts iterator gets to max_items. Should it:

xref #258, but one thing I didn't appreciate about the 0.4.0 release was that the default max_items behavior changed. Regardless of the behavior when max_items is reached, I think the default should be changed back to None.

And if users are explicitly saying max_items=100, then it's perfectly reasonable to (silently) return the 100 items even if there are (maybe) additional items in the result set. IMO there's no need for a warning or error if users are being explicit.

stac-utils/pystac-client#237

TomAugspurger · 2022-07-22T15:42:03Z

@gadomski, @philvarner do either of you have thoughts on the max_items change (leaving aside the other deprecations)?

I'm happy to submit a PR updating the behavior of max_items to match my expected behavior in #237 (comment).

gadomski · 2022-07-25T19:17:44Z

@gadomski, @philvarner do either of you have thoughts on the max_items change (leaving aside the other deprecations)?

I'm still of two minds about which I prefer, but since you have a concrete use-case I'd say a PR would be good, and we can use that PR's discussion to come to a decision (I expect I'll be 👍 because you've got a use-case :-) ).

I'll also open PR(s) for my other two proposed changes, and we can use those for discussion on those (since they're a little separate):

Un-deprecate ItemSearch.get_all_items and ItemSearch.get_all_items_as_dict

and

Bring Client.get_all_items in line with ItemSearch.get_all_items (right now Client.get_all_items returns an iterator, which is different than ItemSearch.get_all_items)

xref stac-utils#237 (comment) Doesn't close that issue, which also deals with the replacement for the deprecated methods. This just reverts the change to the default `max_items` to return all matching items by default.

philvarner · 2022-07-26T13:55:24Z

I think what happened here is we (mostly "I") made two distinct changes that didn't work well together.

The overall goal was to change the default behavior seen in get_all_items so that, by default, it wouldn't, for example, try to retrieve millions of items and put them all into memory before returning a value to the caller.

I would propose two changes:

set max_items back to None, but it only applies to the deprecated ItemCollection or List returning methods. Keep those methods deprecated.
items() (and any other iterator methods) do not have max_items applied to them, and they just keep iterating. If a user only wants a certain number of items, they should (pythonically) islice the iterator rather than using our proprietary parameters (e.g., max_items)

TomAugspurger · 2022-07-26T14:54:14Z

Thanks for the context @philvarner!

The overall goal was to change the default behavior seen in get_all_items so that, by default, it wouldn't, for example, try to retrieve millions of items and put them all into memory before returning a value to the caller.

I don't think I agree with this goal: it goes against what I would expect (and presumably what @duckontheweb or @matthewhanson expected when they initially wrote this). If I execute a query I expect to get back all the matching items. Having the ability to easily limit the result set is great, but is a surprising default (to me).

If we're worried about the UX of returning a potentially large result set, then there are perhaps other ways to indicate that to the user (logging progress or using tqdm come to mind).

If we want to take inspiration from other, perhaps similar contexts, then APIs like https://peps.python.org/pep-0249/ and libraries like https://www.psycopg.org/docs/ come to mind.

In that SQL context, the APIs provide separate methods fetchone, fetchmany, and fetchall (fetchall being similar to get_all_items()). The cursor objects are also iterable, returning all the objects matched by the query. And the maxinum number of results is specified within the query itself using SQL LIMIT clause (I suppose the STAC API doesn't have something similar, so this must be done client side).

So the analogy to pystac-client would be:

max_items instead of embedding a limit in the query
get_all_items instead of fetchall
The default behavior of iterating over a search result (search.items()) would return all matching items.

they should (pythonically) islice the iterator rather than using our proprietary parameters

It's worth emphasizing that this library will often be used interactively, and that should inform the design. Doing catalog.search(..., max_items=100) is easier than importing itertools and calling islice on the result.

I'd personally find max_items only applying to methods like get_all_items() but not .items() surprising.

philvarner · 2022-07-26T15:47:39Z

I think fetchall is not a good design. A call to catalog.search().get_all_items() to PC would likely never complete -- it would block indefinitely until either the server returned an error or memory was exhausted. I think having a method with a default behavior like this not an ideal design choice, where there are better alternatives.

I forgot max_items was in search(...), so I could see instead moving it to the items method(s), which is probably where it should have been anyway, which would get rid of the need for islice:

list_of_items = list(catalog.search(...).items(max_items=100))

get_all_items is the same as list(catalog.search(...).items()), so I don't think there's a need for both if max_items is reverted to default to None.

TomAugspurger · 2022-07-26T16:28:51Z

so I could see instead moving it to the items method(s)

Just to confirm, you're proposing adding a new parameter to .items() and friends with a default of max_items=None, implying that all the matching items are returned? As long as the default is to return all matching items, then I don't have a preference where the user specifies it.

get_all_items is the same as list(catalog.search(...).items()), so I don't think there's a need for both if max_items is reverted to default to None.

It's not quite the same. get_all_items returns a pystac.ItemCollection rather than a plain list. Some downstream users (microsoft/planetary-computer-sdk-for-python#38, I think places in stackstac) handle ItemCollections differently than list[pystac.Item].

philvarner · 2022-07-26T17:52:57Z

Correct.
Ah, I forgot get_all_items returned an IC rather than List[Item]. In that case, pystac.ItemCollection(catalog.search(...).items()) might work? How do they handle them differently?

TomAugspurger · 2022-07-26T18:51:01Z

Actually, did the return type of get_all_items change in the most recent pystac-client too? At least in prior versions it returned an ItemCollection. Now I'm not sure just from looking at the source code.

pystac.ItemCollection(catalog.search(...).items())

Close, but to avoid #49 you'd also need to pass clone_items=False. I'm worried that users will miss that and suffer poorer performance than what we have today (at least in the last release) from get_all_items() which we know does the right thing.

In the case of planetary_computer we're able to sign objects with known types (ItemCollection, Item, etc.). We could add support for signing sequences of things if necessary.

philvarner · 2022-07-27T14:29:42Z

get_all_items still returns an ItemCollection, doesn't it? Maybe I'm looking at the wrong code too.

So something like:

catalog.search(...).item_collection(max_items=None) # required keyword arg

which does apply clone_items=False. get_all_items would still be deprecated in favor of this method, that's more clearly named and explicit in it's danger.

TomAugspurger · 2022-07-27T15:32:28Z

Yeah, I was confused. ItemSearch.get_all_items() does return an ItemCollection in the latest pystac-client. I was looking at pystac_client.Client.get_all_items().

get_all_items would still be deprecated in favor of this method [.item_collection()], that's more clearly named

Makes sense. I can take care of that PR.

I really don't think that max_items should be a required keyword. I think the previous behavior of (by default) returning all the items was the right one. I suppose we have different values on what the right thing to do here is, but I'll try to re-state the case.

If I make a .search(collections=..., datetime=..., intersects=...) I really, really expect that search to return everything matching the conditions (regardless of how I consume it). That just matches all of my mental models for what's going on here, and I'd be very surprised to (silently!) drop items matching my query. It's so surprising I'll update essentially every use of pystac-client in the Planetary Computer docs to always set max_items if / when we update pystac-client (I should probably update them today, for users reading our docs and using the latest pystac-client).

Last thing regarding the API Client.search(..., max_items=10) vs. Client.search(...).items(max_items=10) I don't have a strong preference. I don't imagine people are repeatedly using .items() or using the various .get_* methods from the same search object, so it's the same amount of typing either way. My question would be whether it's worth the churn to go through a deprecation cycle for a new API. The proposed new .items(max_items=....) doesn't seems too much better to me.

That said, perhaps the fact that other parameters to client.search(...) map to the STAC API, while max_items doesn't is sufficient motivation to do the deprecation?

Adds an ItemSearch.item_collection method as a replacement for the deprecated `get_all_items()`. A few other notes: 1. I haven't implemented any of the `max_items` changes being worked out in stac-utils#237. That can be done before / after this PR once things are ironed out 2. I changed the `get_all_items` method to use Sphinx's builtin `deprecated` directive. 3. I updated the tests (a bit) I'd like to enable pytest's `-W error` flag to fail on warnings. But it looks like the `cli` is using the (deprecated) `get_all_items_as_dict` method. I haven't touched that in this PR, but I don't think the CLI should be using a deprecated function.

lossyrob · 2022-07-27T16:21:48Z

I really don't think that max_items should be a required keyword. I think the previous behavior of (by default) returning all the items was the right one. I suppose we have different values on what the right thing to do here is, but I'll try to re-state the case.

Voicing strong support for this perspective. The previous behavior matches what users expect; making things unexpected to real use cases in order to protect users from themselves isn't good design pattern to me.

Unless there are others who want to voice strong support that we need required protections in this case, I suggest that we move forward with Tom's suggested implementation and revert to the original max_items behavior.

duckontheweb · 2022-07-27T22:25:43Z

I’m also in favor of reverting to the old behavior of not limiting the number of Items returned. I got bitten by this recently when I was aggregating response geometries and it took me a long time to figure out why they were not adding up. I think the possibility of using up a bunch of resources by blindly calling this on a large response set it just one of the dangers of working with big data and is best managed by the user.

philvarner · 2022-07-27T23:47:23Z

philvarner · 2022-07-28T00:23:41Z

(1) use of ItemCollection in odcstac and stackstac

Earlier Tom said:

Over in https://github.com/microsoft/planetarycomputerexamples we use search.get_all_items() to get a single ItemCollection, which is then handed off to stackstac or odc.stac. Our use case requires having a concrete sequence of items ahead of time, not an iterator / iterable. What's the recommended replacement API? Something like

The odc.stac.load method takes an Interable[Item] (https://github.com/opendatacube/odc-stac/blob/1db2b1a7a10fb1a1fbd1bfee0a44b50fee22e911/odc/stac/_load.py#L203):

def load(
    items: Iterable[pystac.item.Item],

An ItemCollection is also an Interable[Item] because it implements __iter__ to return Items

I checked and there aren't any uses of ItemCollection in odc_stac.

In stackstac, the stack() method accepts a lot of types, but not Sequence[pystac.Item]:

def stack(
    items: Union[ItemCollectionIsh, ItemIsh],
...

ItemSequence = Sequence[ItemDict]

ItemIsh = Union[SatstacItem, PystacItem, ItemDict]
ItemCollectionIsh = Union[
    SatstacItemCollection, PystacCatalog, PystacItemCollection, ItemSequence
]

I've filed an issue requesting support for that:

Support Sequence[PystacItem] as an argument to stackstac.stack() method gjoseph92/stackstac#163

(2) ItemCollection mangles Items in default constructor

Filed an issue to change the default to cloning rather than not, since correct behavior is better than speed by default, and this has bitten nearly everyone who's used it.

ItemCollection constructor should clone items in iterator by default pystac#859

(3) moving max_items out of search() method

One other issue we have with the method is that the search() method has a bunch of parameters that configure things that aren't actually part of the search query. For example, max_items is a behavior on getting the results of the search rather than the query. We should move those out into the accessor methods rather than having them in search

Adds an ItemSearch.item_collection method as a replacement for the deprecated `get_all_items()`. A few other notes: 1. I haven't implemented any of the `max_items` changes being worked out in #237. That can be done before / after this PR once things are ironed out 2. I changed the `get_all_items` method to use Sphinx's builtin `deprecated` directive. 3. I updated the tests (a bit) I'd like to enable pytest's `-W error` flag to fail on warnings. But it looks like the `cli` is using the (deprecated) `get_all_items_as_dict` method. I haven't touched that in this PR, but I don't think the CLI should be using a deprecated function.

philvarner · 2022-07-28T01:27:06Z

Filed an issue to default max_items back to None

#278

* Restore default max_items behavior xref #237 (comment) Doesn't close that issue, which also deals with the replacement for the deprecated methods. This just reverts the change to the default `max_items` to return all matching items by default. * vcr * alias

TomAugspurger · 2022-08-08T20:29:35Z

I think everything here is addressed on main. Thanks everyone.

TomAugspurger mentioned this issue Jun 14, 2022

Support signing list[pystac.Item] microsoft/planetary-computer-sdk-for-python#38

Open

TomAugspurger mentioned this issue Jul 8, 2022

max_items change #258

Closed

TomAugspurger pushed a commit to microsoft/planetary-computer-containers that referenced this issue Jul 14, 2022

Pin pystac-client==0.3.5

f099295

stac-utils/pystac-client#237

TomAugspurger mentioned this issue Jul 25, 2022

Restore default max_items behavior #273

Merged

TomAugspurger mentioned this issue Jul 27, 2022

Added ItemSearch.item_collection #277

Merged

3 tasks

philvarner mentioned this issue Jul 28, 2022

revert max_items to default to None (retrieve all items) #278

Closed

3 tasks

TomAugspurger mentioned this issue Aug 5, 2022

Implement new ItemSearch result consumers #284

Merged

TomAugspurger closed this as completed Aug 8, 2022

TomAugspurger mentioned this issue Aug 15, 2022

update python to latest microsoft/planetary-computer-containers#49

Closed

wildintellect mentioned this issue Apr 21, 2023

Update pystac-client return methods. developmentseed/asdi-examples#7

Closed

gadomski mentioned this issue Jun 6, 2023

Allow more configuration with validate_all stac-utils/pystac#345

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replacement functionality for get_all_items #237

Replacement functionality for get_all_items #237

TomAugspurger commented Jun 8, 2022

gadomski commented Jun 14, 2022

TomAugspurger commented Jul 8, 2022 •

edited

Loading

TomAugspurger commented Jul 22, 2022

gadomski commented Jul 25, 2022

philvarner commented Jul 26, 2022

TomAugspurger commented Jul 26, 2022 •

edited

Loading

philvarner commented Jul 26, 2022

TomAugspurger commented Jul 26, 2022 •

edited

Loading

philvarner commented Jul 26, 2022

TomAugspurger commented Jul 26, 2022 •

edited

Loading

philvarner commented Jul 27, 2022

TomAugspurger commented Jul 27, 2022 •

edited

Loading

lossyrob commented Jul 27, 2022

duckontheweb commented Jul 27, 2022

philvarner commented Jul 27, 2022

philvarner commented Jul 28, 2022

philvarner commented Jul 28, 2022

TomAugspurger commented Aug 8, 2022

Replacement functionality for get_all_items #237

Replacement functionality for get_all_items #237

Comments

TomAugspurger commented Jun 8, 2022

gadomski commented Jun 14, 2022

TomAugspurger commented Jul 8, 2022 • edited Loading

TomAugspurger commented Jul 22, 2022

gadomski commented Jul 25, 2022

philvarner commented Jul 26, 2022

TomAugspurger commented Jul 26, 2022 • edited Loading

philvarner commented Jul 26, 2022

TomAugspurger commented Jul 26, 2022 • edited Loading

philvarner commented Jul 26, 2022

TomAugspurger commented Jul 26, 2022 • edited Loading

philvarner commented Jul 27, 2022

TomAugspurger commented Jul 27, 2022 • edited Loading

lossyrob commented Jul 27, 2022

duckontheweb commented Jul 27, 2022

philvarner commented Jul 27, 2022

philvarner commented Jul 28, 2022

philvarner commented Jul 28, 2022

TomAugspurger commented Aug 8, 2022

TomAugspurger commented Jul 8, 2022 •

edited

Loading

TomAugspurger commented Jul 26, 2022 •

edited

Loading

TomAugspurger commented Jul 26, 2022 •

edited

Loading

TomAugspurger commented Jul 26, 2022 •

edited

Loading

TomAugspurger commented Jul 27, 2022 •

edited

Loading