Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose random sampling via API #1168

Merged
merged 50 commits into from
Aug 22, 2019
Merged

Expose random sampling via API #1168

merged 50 commits into from
Aug 22, 2019

Conversation

jc-harrison
Copy link
Member

Closes #1007

I have:

  • Formatted any Python files with black
  • Brought the branch up to date with master
  • Added any relevant Github labels
  • Added tests for any new additions
  • Added or updated any relevant documentation
  • Added an Architectural Decision Record (ADR), if appropriate
  • Added an MPLv2 License Header if appropriate
  • Updated the Changelog

Description

Adds a "sampling" parameter to all query schemas except top-level aggregate queries.
The random sample parameter validation is handled by a RandomSampleSchema, which post-load returns a RandomSampler object with a make_random_sample_object method. The make_random_sample_object method takes a Query object and calls its random_sample method with the appropriate parameters.

Adds a corresponding random_sample function to FlowClient, to generate a random sample spec.

Other changes:

  • Splits the RandomBase class into multiple classes for the different sampling methods, to simplify argument checking and avoid having a large if/else in _make_query.
  • Modifies the FlowDB random_ints function, so that it always returns exactly the requested number of random ints.
  • Removes the redundant Query._db_store_cache_metadata method.

Note: While it wouldn't make sense to expose a random sampling parameter for top-level exposed queries anyway (since these are all aggregates), it's worth noting that doing so would not be possible at the moment because un-seeded random samples cannot be stored, and so their results could not be retrieved by the API.

@jc-harrison jc-harrison added the FlowMachine Issues related to FlowMachine label Aug 16, 2019
@cypress
Copy link

cypress bot commented Aug 16, 2019



Test summary

54 0 0 0


Run details

Project FlowAuth
Status Passed
Commit 3bc6eda
Started Aug 22, 2019 8:04 AM
Ended Aug 22, 2019 8:07 AM
Duration 02:36 💡
OS Linux Debian - 8.11
Browser Electron 61

View run in Cypress Dashboard ➡️


This comment has been generated by cypress-bot as a result of this project's GitHub integration settings. You can manage this integration in this project's settings in the Cypress Dashboard

@codecov
Copy link

codecov bot commented Aug 16, 2019

Codecov Report

Merging #1168 into master will increase coverage by 0.23%.
The diff coverage is 95.65%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1168      +/-   ##
==========================================
+ Coverage   93.95%   94.18%   +0.23%     
==========================================
  Files         144      152       +8     
  Lines        7111     7435     +324     
  Branches      699      695       -4     
==========================================
+ Hits         6681     7003     +322     
- Misses        321      326       +5     
+ Partials      109      106       -3
Flag Coverage Δ
#flowapi_unit_tests 82.4% <ø> (ø) ⬆️
#flowauth_unit_tests 94.18% <ø> (ø) ⬆️
#flowclient_unit_tests 78.78% <20%> (-1.31%) ⬇️
#flowetl_unit_tests 96.63% <ø> (?)
#flowkit_jwt_generator_unit_tests 100% <ø> (ø) ⬆️
#flowmachine_unit_tests 90.55% <80%> (-0.18%) ⬇️
#integration_tests 67.13% <78.69%> (+1.37%) ⬆️
Impacted Files Coverage Δ
...e/flowmachine/core/server/query_schemas/handset.py 100% <100%> (ø) ⬆️
...achine/core/server/query_schemas/daily_location.py 100% <100%> (ø) ⬆️
...ine/core/server/query_schemas/subscriber_degree.py 100% <100%> (ø) ⬆️
...wmachine/core/server/query_schemas/displacement.py 100% <100%> (ø) ⬆️
flowmachine/flowmachine/core/query.py 93.22% <100%> (+0.85%) ⬆️
...wmachine/core/server/query_schemas/topup_amount.py 100% <100%> (ø) ⬆️
flowmachine/flowmachine/core/cache.py 96.29% <100%> (+0.52%) ⬆️
...ore/server/query_schemas/unique_location_counts.py 100% <100%> (ø) ⬆️
...ne/core/server/query_schemas/radius_of_gyration.py 100% <100%> (ø) ⬆️
...e/core/server/query_schemas/pareto_interactions.py 100% <100%> (ø) ⬆️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b76f230...2d676c9. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 16, 2019

Codecov Report

Merging #1168 into master will increase coverage by 0.14%.
The diff coverage is 96.04%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1168      +/-   ##
==========================================
+ Coverage   93.98%   94.13%   +0.14%     
==========================================
  Files         151      153       +2     
  Lines        7336     7437     +101     
  Branches      697      693       -4     
==========================================
+ Hits         6895     7001     +106     
+ Misses        332      330       -2     
+ Partials      109      106       -3
Flag Coverage Δ
#flowapi_unit_tests 82.55% <ø> (ø) ⬆️
#flowauth_unit_tests 93.65% <ø> (ø) ⬆️
#flowclient_unit_tests 78.11% <14.28%> (-1.98%) ⬇️
#flowetl_unit_tests 96.63% <ø> (ø) ⬆️
#flowkit_jwt_generator_unit_tests 100% <ø> (ø) ⬆️
#flowmachine_unit_tests 90.75% <89.83%> (-0.03%) ⬇️
#integration_tests 66.98% <80.23%> (+1.28%) ⬆️
Impacted Files Coverage Δ
...e/flowmachine/core/server/query_schemas/handset.py 100% <100%> (ø) ⬆️
...achine/core/server/query_schemas/daily_location.py 100% <100%> (ø) ⬆️
...ine/core/server/query_schemas/subscriber_degree.py 100% <100%> (ø) ⬆️
...wmachine/core/server/query_schemas/displacement.py 100% <100%> (ø) ⬆️
flowmachine/flowmachine/core/query.py 93.22% <100%> (+0.85%) ⬆️
...wmachine/core/server/query_schemas/topup_amount.py 100% <100%> (ø) ⬆️
flowmachine/flowmachine/core/cache.py 96.29% <100%> (+0.52%) ⬆️
...ore/server/query_schemas/unique_location_counts.py 100% <100%> (ø) ⬆️
...ne/core/server/query_schemas/radius_of_gyration.py 100% <100%> (ø) ⬆️
...e/core/server/query_schemas/pareto_interactions.py 100% <100%> (ø) ⬆️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8367d8...3bc6eda. Read the comment docs.

Copy link
Member

@greenape greenape left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice :)

Just wondering if we can be a little more concise in a few places.

@@ -19,6 +20,7 @@ class UniqueLocationCountsSchema(Schema):
end_date = fields.Date(required=True)
aggregation_unit = AggregationUnit()
subscriber_subset = SubscriberSubset()
sampling = fields.Nested(RandomSampleSchema, allow_none=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular bit is pretty repetitive, would it be worth putting it in a shared parent class?

@@ -121,7 +122,7 @@ def __init__(self, name=None, schema=None, columns=None):
q_state_machine = QueryStateMachine(self.redis, self.md5)
q_state_machine.enqueue()
q_state_machine.execute()
self._db_store_cache_metadata(compute_time=0)
write_cache_metadata(self.connection, self, compute_time=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -617,6 +617,7 @@ def daily_location(
aggregation_unit: str,
method: str,
subscriber_subset: Union[dict, None] = None,
sampling: Union[dict, None] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in API world it makes sense for sampling to be rolled into the query params, but I wonder if client side it makes more sense to be able to do random_sample(other_query, ..)?

vals = [x["id"] for x in cursor.fetchall()]
assert [9, 4, 8] == vals
assert len(vals) == 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇

"""
spec = {
sampled_query = query.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly more pythonic to do dict(query), although obviously in both cases any nested dicts still refer to their underlying object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good idea. I'm hoping a shallow copy is sufficient since random_sample doesn't modify any existing values in query, just adds a new one.

# be stored by accident.
@property
def table_name(self):
if hasattr(self, "seed") and self.seed is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if, rather than needing to check the logic in a bunch of places, it would be nicer to have a SeededRandom class to mix in, or SeededRandom(RandomBase, metaclass=ABCMeta) to use as another step in inheritance chain?

"""
Parameters passed when initialising this query.
"""
return {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

return dict(seed=self.seed, **super()._sample_params)

return sampled_query


class SeededRandom(RandomBase, metaclass=ABCMeta):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change this to SeedableRandom, because there's no guarantee it is actually seeded.

Base class for queries used to obtain a random sample from a table.
"""

def __init__(self, query, *, size=None, fraction=None, estimate_count=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type annotations while we're here.

Copy link
Member

@greenape greenape left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bonza :)

@jc-harrison jc-harrison added the ready-to-merge Label indicating a PR is OK to automerge label Aug 22, 2019
@mergify mergify bot merged commit c34a45e into master Aug 22, 2019
@mergify mergify bot deleted the expose-random-sampling branch August 22, 2019 09:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FlowClient Issues related to FlowClient FlowMachine Issues related to FlowMachine ready-to-merge Label indicating a PR is OK to automerge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose random sampling via api
2 participants