
Paginate methods that return iterators #196

Closed
lemon24 opened this issue Oct 9, 2020 · 3 comments


lemon24 commented Oct 9, 2020

Requested in #192.

The underlying storage API is already paginated, we just need to find a nice way of exposing it.


lemon24 commented Oct 9, 2020

Some thoughts regarding the API.

Current state

Currently (1.8), Reader methods that return multiple things return a plain iterator:

def get_things(
    ...
) -> Iterator[Thing]: ...

There is no way to start this iterator from the middle.

The underlying (private) Storage API already supports pagination; it looks something like:

def get_things_page(
    ...,
    chunk_size: Optional[int] = None,
    last: Optional[T] = None
) -> Iterable[Tuple[Thing, T]]: ...

T is an opaque object that identifies each Thing (in practice, a Tuple[Any, ...] of some combination of basic types like str, int, datetime.datetime). Calling the method again with the same arguments, but with last=<the T corresponding to some Thing>, makes the iterable start after that Thing. Calling the method with different arguments than the ones that generated the last value is "undefined" (it should be an error, but we're not checking).

chunk_size can be used to limit the number of things returned; 0/None means return all of them.

The current implementation uses these with scrolling window queries. The alternative, limit+offset queries, tends to cause performance issues, so the API should not be based on them.
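To illustrate, a scrolling-window (keyset) query for such a page method might look like the sketch below; the `things` table, its columns, and the yielded values are all made up for this example and don't match the real storage schema:

```python
import sqlite3
from typing import Iterable, Optional, Tuple

def get_things_page(
    db: sqlite3.Connection,
    chunk_size: Optional[int] = None,
    last: Optional[Tuple[str, str]] = None,
) -> Iterable[Tuple[str, Tuple[str, str]]]:
    # Keyset pagination: resume strictly after the (feed, id) key of the
    # last row seen, instead of using OFFSET (whose cost grows with depth).
    query = "SELECT feed, id FROM things"
    params: tuple = ()
    if last is not None:
        query += " WHERE feed > ? OR (feed = ? AND id > ?)"
        params = (last[0], last[0], last[1])
    query += " ORDER BY feed, id"
    if chunk_size:
        query += " LIMIT ?"
        params += (chunk_size,)
    for feed, id in db.execute(query, params):
        yield id, (feed, id)
```

Each yielded pair is (thing, key); passing the key of the last consumed row back as `last` resumes the scan exactly where it left off, with per-page cost independent of how deep into the result set you are.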

Note that Reader.get_things() only calls Storage.get_things(). In the current SQLite implementation, Storage.get_things() then calls Storage.get_things_page(); this happens for two reasons (details here):

  • to avoid locking the database for too long (not consuming a generator should not lock the database), and
  • to avoid consuming too much memory (if returning a list instead of a generator)

Other implementations that support MVCC might not do this; Storage.get_things_page() was left as a "public" storage method specifically with this issue in mind.

Possible APIs

New Reader methods

The simplest thing that would work is to expose get_things_page() methods on Reader. This has two obvious disadvantages:

  • we'd be duplicating a lot of methods
  • the Iterable[Tuple[Thing, T]] return type is not easy to extend (e.g. if we later want to offer additional pagination info, like the last for the first/previous/last page).

Augmented iterable

Alternatively, we could make get_things() return an augmented Iterable that has additional attributes (initially, just .next, but later we can add object/page counts etc.).

This model is somewhat similar to the Stripe API pagination; we already have "auto-pagination" (the iterator gets the next page automatically), we need to expose normal pagination.
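A minimal sketch of such an augmented iterable (the class and attribute names are hypothetical, not the actual API):

```python
from typing import Generic, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")  # the thing yielded to the user
K = TypeVar("K")  # the opaque key ("last") identifying each thing

class Page(Generic[T, K]):
    """Wrap an iterator of (thing, key) pairs; expose the things,
    remembering the key of the last one seen in .next."""

    def __init__(self, pairs: Iterator[Tuple[T, K]]):
        self._pairs = pairs
        self.next: Optional[K] = None  # updated as the page is consumed

    def __iter__(self) -> Iterator[T]:
        for thing, key in self._pairs:
            self.next = key
            yield thing
```

The .next attribute only becomes meaningful once the iterable has been (at least partly) consumed, which mirrors how the storage yields (Thing, T) pairs; later, counts or other pagination info could be added as further attributes without changing the return type.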

Object id (primary key)

An interesting thing Stripe does is that the type of its last is the same as the object id. It's true that our Tuple[Any, ...] contains the object primary key plus some of its other attributes, which we can always retrieve based on the primary key.

The only downside is that we need to do a bit of additional work to get the extra attributes.

It would be nice to have a uniform way of getting the value of last. We could make it a property on the returned objects, but with the current API, doing so is not always possible, because the returned objects don't always have all the data:

  • Feed methods yield Feed: rv.url. We can have a property.
  • Entry methods yield Entry/EntrySearchResult: (rv.feed_url, rv.id). We can have a property.
  • Feed metadata methods yield (key, value) tuples: (feed_url, rv[0]). We cannot have a property, because we don't have the feed URL; we can add it in a backwards-compatible way. Note that the return value should remain a 2-tuple (subclass), so we can do dict(iter_feed_metadata(...)). Unlike feeds/entries, it would be harder to maintain next(get_things(id=id)) == get_thing(id); we could maybe do it by making the get_feed_metadata() return value a proxy; OTOH, it's not like that at the moment anyway.
  • Tag methods yield str: (Optional[feed_url], rv). We cannot have a property, because we don't have the feed URL. Subclassing str to add it would be ickier than subclassing tuple. Alternatively, we could add a new get_feed_tag_info() -> TagInfo method as proposed in I have to go through all the entries to find out how many there are #185 (comment) for other reasons.
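For the feed/entry cases, where the returned object already carries everything needed, the property could look roughly like this (the property name is made up here; #159 discusses naming):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Entry:
    feed_url: str
    id: str

    @property
    def object_id(self) -> Tuple[str, str]:
        # Uniform "id usable as last" accessor; possible for entries
        # because the returned object already has the feed URL.
        return self.feed_url, self.id
```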

However, we can add a uniform way of getting last later, in a largely backwards-compatible way (but it would help with testing to do it earlier).

Also, we could postpone adding pagination for feed metadata and tags until 2.0, when we can just change the return types.

For search_entries(), last contains the rank. From a few tests I did with the current implementation, the time for searching for a specific entry (search_entries(query, entry=entry)) does increase with the size of the result set, but does not increase with how far into the result set you are.

The type of last

In Python-land, our current use of a Tuple[Any, ...] for last is likely OK. There are some issues with it, though:

  1. It should be easier to serialize, to make the Reader API easier to use in non-Python settings.
  2. It should not be possible to use a last value generated from different arguments passed to the same method ("undefined" behavior bad).
  3. It may be preferable to obfuscate the value of last, to prevent people from depending on it.

Signed bytes / str

A possible solution would be to use ItsDangerous to serialize last into a string (solves 1). We can check the method call that generated the last value by either including the method name and arguments (or their hash) in the serialized value, or by using them as salt (solves 2). We can also obfuscate the value by using URLSafeSerializer to serialize it into a Base64-like string (solves 3).

Note that by default ItsDangerous only supports JSON-serializable types. Flask's TaggedJSONSerializer allows serializing additional types like tuple and datetime; if we don't want to depend on / vendor that, we could write our own (prototype).
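If we don't want the ItsDangerous dependency, a minimal stdlib prototype of the same signed-string idea might look like this (the key handling and names are simplified for illustration; a real version would also fold the method arguments into the salt):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"not-a-real-secret"  # illustration only; a real key would be configured

def dumps(value, salt: str) -> str:
    # Sign the payload with the method name as salt, so a token generated
    # by one method call can't be replayed against a different one.
    payload = base64.urlsafe_b64encode(json.dumps(value).encode())
    sig = hmac.new(SECRET + salt.encode(), payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def loads(token: str, salt: str):
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET + salt.encode(), payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature (or last from a different method call)")
    return json.loads(base64.urlsafe_b64decode(payload))
```

Note that, like plain ItsDangerous, this round-trips tuples as JSON lists, which is exactly the gap TaggedJSONSerializer (or a custom serializer) would fill.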

Object id (primary key)

Using the object id as last would fix everything:

  1. The most complex keys are 2-tuples: (feed URL, entry id), (feed URL, metadata key), (feed URL, tag). These are all relatively easy to serialize.
  2. It's impossible to misuse last, since the only data is the object id.
  3. The id will never change, so there's no need to obfuscate it.

Argument names

I am not sure chunk_size and last are the best argument names.

Looking at some other things for inspiration:

                                               chunk_size  last                         notes
  Stripe                                       limit       starting_after               they also have ending_before
  OpenStack                                    limit       marker
  WordPress                                    per_page    page (int) or offset (int)
  Django REST framework CursorPagination       page_size   cursor (opaque string)       like our "signed bytes / str"
  Django REST framework PageNumberPagination   page_size   page (int)
  Django REST framework LimitOffsetPagination  limit       offset (int)

I think limit + starting_after is the best combination. This way, we can add ending_before later, and use cursor/marker for the opaque string if it's useful.
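To make the names concrete, here is a toy, in-memory stand-in for a Reader method using them (the data and function body are illustrative only):

```python
from typing import List, Optional, Tuple

# Hypothetical sorted data: (feed_url, entry_id) pairs.
DATA = [("f1", "e1"), ("f1", "e2"), ("f2", "e1")]

def get_entries(
    limit: Optional[int] = None,
    starting_after: Optional[Tuple[str, str]] = None,
) -> List[Tuple[str, str]]:
    # starting_after is an object id; results begin strictly after it.
    start = 0
    if starting_after is not None:
        start = DATA.index(starting_after) + 1
    end = start + limit if limit else len(DATA)
    return DATA[start:end]
```

Calling get_entries(limit=2) gets the first page, and get_entries(limit=2, starting_after=<last id of the previous page>) gets the next one; an ending_before argument could later be added symmetrically.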

Storage private pagination

How does this interact with the (private) pagination of the current SQLite Storage implementation?

The reasons for internal pagination remain valid: avoid locking the database, avoid consuming too much memory (#167); paginated calls should still use the internal pagination.

Some optimizations we can do later:

  • for paginated calls with limit > Storage.chunk_size: for the final page, if fewer than chunk_size rows remain, only the remaining rows should be requested
  • for paginated calls with limit <= Storage.chunk_size: limit rows should be requested
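Both cases reduce to requesting min(rows remaining, chunk_size) per internal query; a sketch, with a made-up helper name and assuming limit is always an int here:

```python
def rows_to_request(limit: int, chunk_size: int, returned_so_far: int) -> int:
    """How many rows the next internal query should request, given the
    user-supplied limit and the storage's internal chunk_size."""
    remaining = limit - returned_so_far
    if limit <= chunk_size:
        # a single query is enough; ask for exactly what's needed
        return remaining
    # ask for a full chunk, except on the (possibly smaller) final page
    return min(remaining, chunk_size)
```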

Initial implementation notes

What do we implement?

Pagination for get_feeds(), get_entries(), and search_entries(). A property returning the object id for the corresponding type; #159 has a discussion on naming that may be relevant.

We can do tags and metadata later, both because of the issues mentioned above, and because they're less expensive to call as-is (they are likely fewer and have less data).

What changes are required to support it?

Storage methods to go from "object id" to last.

Whatever changes are needed for the optimizations described in Storage private pagination.

How do we test it?

Individual tests for limit and starting_after.

Parametrized tests; some options (not mutually exclusive):

  1. Add new functions to decorators like with_call_entries_recent_method that mimic the non-paginated API but use the paginated API; add similar decorators for get_feeds() and search_entries(). This is nice because it reuses existing tests, but may complicate existing tests and increase test run time.
  2. Add new tests that check the non-paginated API results == chained paginated API results. This is ideal, but we'd need to have representative test data (some of it may be trapped in tests now).
  3. Make a reader fixture that does (2) for each method call. We could limit it to specific tests using markers. This kind of testing before the actual test might lead to unrelated failures.
  4. Like (3), but find a way to duplicate the tests in a way that allows excluding them from normal test runs.
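Option (2) can be sketched as a driver that chains pages and compares the result against a full iteration; the backend here is a toy in-memory stand-in, not the real storage:

```python
from typing import Callable, Iterator, List, Optional, Tuple

def chain_pages(get_page: Callable[..., List[Tuple[str, tuple]]],
                chunk_size: int) -> Iterator[str]:
    # Drive the paginated API page by page, resuming from the last key,
    # until an empty page signals the end.
    last: Optional[tuple] = None
    while True:
        page = get_page(chunk_size=chunk_size, last=last)
        if not page:
            return
        for thing, key in page:
            last = key
            yield thing

# Toy paginated backend over a sorted list of (thing, key) pairs.
DATA = [("a", (1,)), ("b", (2,)), ("c", (3,))]

def get_page(chunk_size=None, last=None):
    start = 0 if last is None else [k for _, k in DATA].index(last) + 1
    return DATA[start : start + chunk_size]
```

A test along these lines would assert that chaining pages, for any chunk_size, yields exactly the same sequence as the non-paginated method; the hard part remains having representative test data.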


lemon24 commented Dec 1, 2020

To do:

  • get_feeds()
    • starting_after + basic test
      • Storage method to go from "id" to last
    • limit (no optimizations) + basic test
    • [ ] parametrized tests
      • "id" property on the data objects
  • get_entries()
    • starting_after + basic test
      • Storage method to go from "id" to last
    • limit (no optimizations) + basic test
    • [ ] parametrized tests
      • "id" property on the data objects
  • search_entries()
    • starting_after + basic test
      • Storage method to go from "id" to last
    • limit (no optimizations) + basic test
    • [ ] parametrized tests
      • "id" property on the data objects
  • sort='random' behavior (test and/or document)
  • limit / starting_after optimizations
  • documentation
    • docstrings (including versionadded/versionchanged)
    • user guide
    • changelog
  • use limit for the web app /?limit=64 argument

lemon24 added a commit that referenced this issue Dec 4, 2020
lemon24 added a commit that referenced this issue Dec 5, 2020
lemon24 added a commit that referenced this issue Dec 8, 2020
lemon24 added a commit that referenced this issue Dec 8, 2020
lemon24 added a commit that referenced this issue Dec 9, 2020
lemon24 added a commit that referenced this issue Dec 9, 2020
lemon24 added a commit that referenced this issue Dec 9, 2020
lemon24 added a commit that referenced this issue Dec 9, 2020

lemon24 commented Dec 9, 2020

We're not gonna do the parametrized tests at the moment, test_reader.py isn't organized enough.

lemon24 added a commit that referenced this issue Dec 9, 2020
lemon24 added a commit that referenced this issue Dec 9, 2020
lemon24 added a commit that referenced this issue Dec 12, 2020
lemon24 added a commit that referenced this issue Dec 13, 2020
@lemon24 lemon24 closed this as completed Dec 13, 2020
lemon24 added a commit that referenced this issue Dec 13, 2020