
Paginate methods that return iterators #196

Closed
lemon24 opened this issue Oct 9, 2020 · 3 comments


lemon24 commented Oct 9, 2020

Requested in #192.

The underlying storage API is already paginated, we just need to find a nice way of exposing it.


lemon24 commented Oct 9, 2020

Some thoughts regarding the API.

Current state

Currently (1.8), Reader methods that return multiple things return a plain iterator:

def get_things(
    ...
) -> Iterator[Thing]: ...

There is no way to start this iterator from the middle.

The underlying (private) Storage API already supports pagination; it looks something like:

def get_things_page(
    ...,
    chunk_size: Optional[int] = None,
    last: Optional[T] = None
) -> Iterable[Tuple[Thing, T]]: ...

T is an opaque object that identifies each Thing (in practice, a Tuple[Any, ...] of some combination of basic types like str, int, datetime.datetime). Calling the method again with the same arguments, but with last=<the T corresponding to some Thing>, makes the iterable start after that Thing. Calling the method with different arguments than the ones that generated the last value is "undefined" (it should be an error, but we're not checking).

chunk_size can be used to limit the number of things returned; 0/None means return all of them.

The current implementation uses these with scrolling window queries. The alternative, limit+offset queries, tends to cause performance issues, so the API should not be based on them.
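To illustrate, a scrolling-window (keyset) query for such a page method might look like the sketch below; the `things` table, its columns, and the yielded values are all made up for this example and don't match the real storage schema:

```python
import sqlite3
from typing import Iterable, Optional, Tuple

def get_things_page(
    db: sqlite3.Connection,
    chunk_size: Optional[int] = None,
    last: Optional[Tuple[str, str]] = None,
) -> Iterable[Tuple[str, Tuple[str, str]]]:
    # Keyset pagination: resume strictly after the (feed, id) key of the
    # last row seen, instead of using OFFSET (whose cost grows with depth).
    query = "SELECT feed, id FROM things"
    params: tuple = ()
    if last is not None:
        query += " WHERE feed > ? OR (feed = ? AND id > ?)"
        params = (last[0], last[0], last[1])
    query += " ORDER BY feed, id"
    if chunk_size:
        query += " LIMIT ?"
        params += (chunk_size,)
    for feed, id in db.execute(query, params):
        yield id, (feed, id)
```

Each yielded pair is (thing, key); passing the key of the last consumed row back as `last` resumes the scan exactly where it left off, with per-page cost independent of how deep into the result set you are.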

Note that Reader.get_things() only calls Storage.get_things(). In the current SQLite implementation, Storage.get_things() then calls Storage.get_things_page(); this happens for two reasons (details here):

  • to avoid locking the database for too long (not consuming a generator should not lock the database), and
  • to avoid consuming too much memory (if returning a list instead of a generator)

Other implementations that support MVCC might not do this; Storage.get_things_page() was left as a "public" storage method specifically with this issue in mind.

Possible APIs

New Reader methods

The simplest thing that would work is to expose get_things_page() methods on Reader. This has two obvious disadvantages:

  • we'd be duplicating a lot of methods
  • the Iterable[Tuple[Thing, T]] return type is not easy to extend (e.g. if we later want to offer additional pagination info, like the last for the first/previous/last page).

Augmented iterable

Alternatively, we could make get_things() return an augmented Iterable that has additional attributes (initially, just .next, but later we can add object/page counts etc.).

This model is somewhat similar to the Stripe API pagination; we already have "auto-pagination" (the iterator gets the next page automatically), we need to expose normal pagination.
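A minimal sketch of such an augmented iterable (the class and attribute names are hypothetical, not the actual API):

```python
from typing import Generic, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")  # the thing yielded to the user
K = TypeVar("K")  # the opaque key ("last") identifying each thing

class Page(Generic[T, K]):
    """Wrap an iterator of (thing, key) pairs; expose the things,
    remembering the key of the last one seen in .next."""

    def __init__(self, pairs: Iterator[Tuple[T, K]]):
        self._pairs = pairs
        self.next: Optional[K] = None  # updated as the page is consumed

    def __iter__(self) -> Iterator[T]:
        for thing, key in self._pairs:
            self.next = key
            yield thing
```

The .next attribute only becomes meaningful once the iterable has been (at least partly) consumed, which mirrors how the storage yields (Thing, T) pairs; later, counts or other pagination info could be added as further attributes without changing the return type.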

Object id (primary key)

An interesting thing Stripe does is that the type of its last is the same as the object id. It's true that our Tuple[Any, ...] contains the object primary key plus some of its other attributes, which we can always retrieve based on the primary key.

The only downside is that we need to do a bit of additional work to get the extra attributes.

It would be nice to have a uniform way of getting the value of last. We could make it a property on the returned objects, but with the current API, doing so is not always possible, because the returned objects don't always have all the data:

  • Feed methods yield Feed: rv.url. We can have a property.
  • Entry methods yield Entry/EntrySearchResult: (rv.feed_url, rv.id). We can have a property.
  • Feed metadata methods yield (key, value) tuples: (feed_url, rv[0]). We cannot have a property, because we don't have the feed URL; we can add it in a backwards-compatible way. Note that the return value should remain a 2-tuple (subclass), so we can do dict(iter_feed_metadata(...)). Unlike feeds/entries, it would be harder to maintain next(get_things(id=id)) == get_thing(id); we could maybe do it by making the get_feed_metadata() return value a proxy; OTOH, it's not like that at the moment anyway.
  • Tag methods yield str: (Optional[feed_url], rv). We cannot have a property, because we don't have the feed URL. Subclassing str to add it would be ickier than subclassing tuple. Alternatively, we could add a new get_feed_tag_info() -> TagInfo method as proposed in I have to go through all the entries to find out how many there are #185 (comment) for other reasons.
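For the feed/entry cases, where the returned object already carries everything needed, the property could look roughly like this (the property name is made up here; #159 discusses naming):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Entry:
    feed_url: str
    id: str

    @property
    def object_id(self) -> Tuple[str, str]:
        # Uniform "id usable as last" accessor; possible for entries
        # because the returned object already has the feed URL.
        return self.feed_url, self.id
```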

However, we can add a uniform way of getting last later, in a largely backwards-compatible way (but it would help with testing to do it earlier).

Also, we could postpone adding pagination for feed metadata and tags until 2.0, when we can just change the return types.

For search_entries(), last contains the rank. From a few tests I did with the current implementation, the time for searching for a specific entry (search_entries(query, entry=entry)) does increase with the size of the result set, but does not increase with how far into the result set you are.

The type of last

In Python-land, our current use of a Tuple[Any, ...] for last is likely OK. There are some issues with it, though:

  1. It should be easier to serialize, to make the Reader API easier to use in non-Python settings.
  2. It should not be possible to use a last value generated from different arguments passed to the same method ("undefined" behavior bad).
  3. It may be preferable to obfuscate the value of last, to prevent people from depending on it.

Signed bytes / str

A possible solution would be to use ItsDangerous to serialize last into a string (solves 1). We can check the method call that generated the last value by either including the method name and arguments (or their hash) in the serialized value, or by using them as salt (solves 2). We can also obfuscate the value by using URLSafeSerializer to serialize it into a Base64-like string (solves 3).

Note that by default ItsDangerous only supports JSON-serializable types. Flask's TaggedJSONSerializer allows serializing additional types like tuple and datetime; if we don't want to depend on / vendor that, we could write our own (prototype).
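If we don't want the ItsDangerous dependency, a minimal stdlib prototype of the same signed-string idea might look like this (the key handling and names are simplified for illustration; a real version would also fold the method arguments into the salt):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"not-a-real-secret"  # illustration only; a real key would be configured

def dumps(value, salt: str) -> str:
    # Sign the payload with the method name as salt, so a token generated
    # by one method call can't be replayed against a different one.
    payload = base64.urlsafe_b64encode(json.dumps(value).encode())
    sig = hmac.new(SECRET + salt.encode(), payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def loads(token: str, salt: str):
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET + salt.encode(), payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature (or last from a different method call)")
    return json.loads(base64.urlsafe_b64decode(payload))
```

Note that, like plain ItsDangerous, this round-trips tuples as JSON lists, which is exactly the gap TaggedJSONSerializer (or a custom serializer) would fill.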

Object id (primary key)

Using the object id as last would fix everything:

  1. The most complex keys are 2-tuples: (feed URL, entry id), (feed URL, metadata key), (feed URL, tag). These are all relatively easy to serialize.
  2. It's impossible to misuse last, since the only data is the object id.
  3. The id will never change, so there's no need to obfuscate it.

Argument names

I am not sure chunk_size and last are the best argument names.

Looking at some other things for inspiration:

                                               chunk_size  last                         notes
  Stripe                                       limit       starting_after               they also have ending_before
  OpenStack                                    limit       marker
  WordPress                                    per_page    page (int) or offset (int)
  Django REST framework CursorPagination       page_size   cursor (opaque string)       like our "signed bytes / str"
  Django REST framework PageNumberPagination   page_size   page (int)
  Django REST framework LimitOffsetPagination  limit       offset (int)

I think limit + starting_after is the best combination. This way, we can add ending_before later, and use cursor/marker for the opaque string if it's useful.
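To make the names concrete, here is a toy, in-memory stand-in for a Reader method using them (the data and function body are illustrative only):

```python
from typing import List, Optional, Tuple

# Hypothetical sorted data: (feed_url, entry_id) pairs.
DATA = [("f1", "e1"), ("f1", "e2"), ("f2", "e1")]

def get_entries(
    limit: Optional[int] = None,
    starting_after: Optional[Tuple[str, str]] = None,
) -> List[Tuple[str, str]]:
    # starting_after is an object id; results begin strictly after it.
    start = 0
    if starting_after is not None:
        start = DATA.index(starting_after) + 1
    end = start + limit if limit else len(DATA)
    return DATA[start:end]
```

Calling get_entries(limit=2) gets the first page, and get_entries(limit=2, starting_after=<last id of the previous page>) gets the next one; an ending_before argument could later be added symmetrically.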

Storage private pagination

How does this interact with the (private) pagination of the current SQLite Storage implementation?

The reasons for internal pagination remain valid: avoid locking the database, avoid consuming too much memory (#167); paginated calls should still use the internal pagination.

Some optimizations we can do later:

  • for paginated calls with limit > Storage.chunk_size: for the final page, if fewer than chunk_size rows remain, only the remaining rows should be requested
  • for paginated calls with limit <= Storage.chunk_size: limit rows should be requested
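Both cases reduce to requesting min(rows remaining, chunk_size) per internal query; a sketch, with a made-up helper name and assuming limit is always an int here:

```python
def rows_to_request(limit: int, chunk_size: int, returned_so_far: int) -> int:
    """How many rows the next internal query should request, given the
    user-supplied limit and the storage's internal chunk_size."""
    remaining = limit - returned_so_far
    if limit <= chunk_size:
        # a single query is enough; ask for exactly what's needed
        return remaining
    # ask for a full chunk, except on the (possibly smaller) final page
    return min(remaining, chunk_size)
```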

Initial implementation notes

What do we implement?

Pagination for get_feeds(), get_entries(), and search_entries(). A property returning the object id for the corresponding type; #159 has a discussion on naming that may be relevant.

We can do tags and metadata later, both because of the issues mentioned above, and because they're less expensive to call as-is (they are likely fewer and have less data).

What changes are required to support it?

Storage methods to go from "object id" to last.

Whatever changes are needed for the optimizations described in Storage private pagination.

How do we test it?

Individual tests for limit and starting_after.

Parametrized tests; some options (not mutually exclusive):

  1. Add new functions to decorators like with_call_entries_recent_method that mimic the non-paginated API but use the paginated API; add similar decorators for get_feeds() and search_entries(). This is nice because it reuses existing tests, but may complicate existing tests and increase test run time.
  2. Add new tests that check the non-paginated API results == chained paginated API results. This is ideal, but we'd need to have representative test data (some of it may be trapped in tests now).
  3. Make a reader fixture that does (2) for each method call. We could limit it to specific tests using markers. This kind of testing before the actual test might lead to unrelated failures.
  4. Like (3), but find a way to duplicate the tests in a way that allows excluding them from normal test runs.
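Option (2) can be sketched as a driver that chains pages and compares the result against a full iteration; the backend here is a toy in-memory stand-in, not the real storage:

```python
from typing import Callable, Iterator, List, Optional, Tuple

def chain_pages(get_page: Callable[..., List[Tuple[str, tuple]]],
                chunk_size: int) -> Iterator[str]:
    # Drive the paginated API page by page, resuming from the last key,
    # until an empty page signals the end.
    last: Optional[tuple] = None
    while True:
        page = get_page(chunk_size=chunk_size, last=last)
        if not page:
            return
        for thing, key in page:
            last = key
            yield thing

# Toy paginated backend over a sorted list of (thing, key) pairs.
DATA = [("a", (1,)), ("b", (2,)), ("c", (3,))]

def get_page(chunk_size=None, last=None):
    start = 0 if last is None else [k for _, k in DATA].index(last) + 1
    return DATA[start : start + chunk_size]
```

A test along these lines would assert that chaining pages, for any chunk_size, yields exactly the same sequence as the non-paginated method; the hard part remains having representative test data.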


lemon24 commented Dec 1, 2020

To do:

  • get_feeds()
    • starting_after + basic test
      • Storage method to go from "id" to last
    • limit (no optimizations) + basic test
    • [ ] parametrized tests
      • "id" property on the data objects
  • get_entries()
    • starting_after + basic test
      • Storage method to go from "id" to last
    • limit (no optimizations) + basic test
    • [ ] parametrized tests
      • "id" property on the data objects
  • search_entries()
    • starting_after + basic test
      • Storage method to go from "id" to last
    • limit (no optimizations) + basic test
    • [ ] parametrized tests
      • "id" property on the data objects
  • sort='random' behavior (test and/or document)
  • limit / starting_after optimizations
  • documentation
    • docstrings (including versionadded/versionchanged)
    • user guide
    • changelog
  • use limit for the web app /?limit=64 argument

lemon24 added a commit that referenced this issue Dec 4, 2020
lemon24 added a commit that referenced this issue Dec 5, 2020
lemon24 added a commit that referenced this issue Dec 8, 2020
lemon24 added a commit that referenced this issue Dec 8, 2020
lemon24 added a commit that referenced this issue Dec 9, 2020
lemon24 added a commit that referenced this issue Dec 9, 2020
lemon24 added a commit that referenced this issue Dec 9, 2020
lemon24 added a commit that referenced this issue Dec 9, 2020

lemon24 commented Dec 9, 2020

We're not gonna do the parametrized tests at the moment, test_reader.py isn't organized enough.

lemon24 added a commit that referenced this issue Dec 9, 2020
lemon24 added a commit that referenced this issue Dec 9, 2020
lemon24 added a commit that referenced this issue Dec 12, 2020
lemon24 added a commit that referenced this issue Dec 13, 2020
@lemon24 lemon24 closed this as completed Dec 13, 2020
lemon24 added a commit that referenced this issue Dec 13, 2020