-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose random sampling via API #1168
Conversation
Test summaryRun details
View run in Cypress Dashboard ➡️ This comment has been generated by cypress-bot as a result of this project's GitHub integration settings. You can manage this integration in this project's settings in the Cypress Dashboard |
Codecov Report
@@ Coverage Diff @@
## master #1168 +/- ##
==========================================
+ Coverage 93.95% 94.18% +0.23%
==========================================
Files 144 152 +8
Lines 7111 7435 +324
Branches 699 695 -4
==========================================
+ Hits 6681 7003 +322
- Misses 321 326 +5
+ Partials 109 106 -3
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #1168 +/- ##
==========================================
+ Coverage 93.98% 94.13% +0.14%
==========================================
Files 151 153 +2
Lines 7336 7437 +101
Branches 697 693 -4
==========================================
+ Hits 6895 7001 +106
+ Misses 332 330 -2
+ Partials 109 106 -3
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice :)
Just wondering if we can be a little more concise in a few places.
@@ -19,6 +20,7 @@ class UniqueLocationCountsSchema(Schema): | |||
end_date = fields.Date(required=True) | |||
aggregation_unit = AggregationUnit() | |||
subscriber_subset = SubscriberSubset() | |||
sampling = fields.Nested(RandomSampleSchema, allow_none=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular bit is pretty repetitive, would it be worth putting it in a shared parent class?
@@ -121,7 +122,7 @@ def __init__(self, name=None, schema=None, columns=None): | |||
q_state_machine = QueryStateMachine(self.redis, self.md5) | |||
q_state_machine.enqueue() | |||
q_state_machine.execute() | |||
self._db_store_cache_metadata(compute_time=0) | |||
write_cache_metadata(self.connection, self, compute_time=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
flowclient/flowclient/client.py
Outdated
@@ -617,6 +617,7 @@ def daily_location( | |||
aggregation_unit: str, | |||
method: str, | |||
subscriber_subset: Union[dict, None] = None, | |||
sampling: Union[dict, None] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in API world it makes sense for sampling to be rolled into the query params, but I wonder if client side it makes more sense to be able to do random_sample(other_query, ..)
?
vals = [x["id"] for x in cursor.fetchall()] | ||
assert [9, 4, 8] == vals | ||
assert len(vals) == 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥇
flowclient/flowclient/client.py
Outdated
""" | ||
spec = { | ||
sampled_query = query.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly more pythonic to do dict(query)
, although obviously in both cases any nested dicts still refer to their underlying object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good idea. I'm hoping a shallow copy is sufficient since random_sample
doesn't modify any existing values in query
, just adds a new one.
# be stored by accident. | ||
@property | ||
def table_name(self): | ||
if hasattr(self, "seed") and self.seed is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if, rather than needing to check the logic in a bunch of places, it would be nicer to have a SeededRandom
class to mix in, or SeededRandom(RandomBase, metaclass=ABCMeta)
to use as another step in inheritance chain?
""" | ||
Parameters passed when initialising this query. | ||
""" | ||
return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
return dict(seed=self.seed, **super()._sample_params)
return sampled_query | ||
|
||
|
||
class SeededRandom(RandomBase, metaclass=ABCMeta): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's change this to SeedableRandom
, because there's no guarantee it is actually seeded.
Base class for queries used to obtain a random sample from a table. | ||
""" | ||
|
||
def __init__(self, query, *, size=None, fraction=None, estimate_count=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type annotations while we're here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bonza :)
Closes #1007
I have:
Description
Adds a
"sampling"
parameter to all query schemas except top-level aggregate queries.The random sample parameter validation is handled by a
RandomSampleSchema
, which post-load returns aRandomSampler
object with amake_random_sample_object
method. Themake_random_sample_object
method takes a Query object and calls itsrandom_sample
method with the appropriate parameters.Adds a corresponding
random_sample
function to FlowClient, to generate a random sample spec.Other changes:
RandomBase
class into multiple classes for the different sampling methods, to simplify argument checking and avoid having a large if/else in_make_query
.random_ints
function, so that it always returns exactly the requested number of random ints.Query._db_store_cache_metadata
method.Note: While it wouldn't make sense to expose a random sampling parameter for top-level exposed queries anyway (since these are all aggregates), it's worth noting that doing so would not be possible at the moment because un-seeded random samples cannot be stored, and so their results could not be retrieved by the API.