Expose random sampling via API #1168

jc-harrison · 2019-08-16T16:43:20Z

I have:

Formatted any Python files with black
Brought the branch up to date with master
Added any relevant GitHub labels
Added tests for any new additions
Added or updated any relevant documentation
Added an Architectural Decision Record (ADR), if appropriate
Added an MPLv2 License Header if appropriate
Updated the Changelog

Description

Adds a "sampling" parameter to all query schemas except top-level aggregate queries.
Validation of the random sample parameters is handled by a RandomSampleSchema, which on post-load returns a RandomSampler object with a make_random_sample_object method. That method takes a Query object and calls its random_sample method with the appropriate parameters.
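A minimal sketch of the RandomSampler hand-off described above, not the code in this PR. StubQuery is a hypothetical stand-in for a flowmachine Query, the "bernoulli" value is only illustrative, and dropping unset (None) options before the call is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


class StubQuery:
    """Hypothetical stand-in for a flowmachine Query; only random_sample is modelled."""

    def random_sample(self, **params: Any) -> Dict[str, Any]:
        # A real Query would return a sampled query object; the stub just
        # records which parameters it was called with.
        return {"sampled_from": self, "params": params}


@dataclass
class RandomSampler:
    """Holds validated sampling parameters, as a post-load step might return them."""

    params: Dict[str, Any] = field(default_factory=dict)

    def make_random_sample_object(self, query: StubQuery) -> Dict[str, Any]:
        # Assumption: options left as None are omitted so the query's own
        # defaults apply.
        set_params = {k: v for k, v in self.params.items() if v is not None}
        return query.random_sample(**set_params)


sampler = RandomSampler(
    {"sampling_method": "bernoulli", "size": 10, "fraction": None, "seed": 0.5}
)
query = StubQuery()
sample = sampler.make_random_sample_object(query)
```

The point of the indirection is that each query schema only needs to hold a sampler and call one method on it, rather than knowing the sampling arguments itself.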

Adds a corresponding random_sample function to FlowClient, to generate a random sample spec.

Other changes:

Splits the RandomBase class into multiple classes for the different sampling methods, to simplify argument checking and avoid having a large if/else in _make_query.
Modifies the FlowDB random_ints function, so that it always returns exactly the requested number of random ints.
Removes the redundant Query._db_store_cache_metadata method.
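As a rough Python illustration of the random_ints change listed above: the real function lives in FlowDB's SQL, and this sketch assumes the previous shortfall came from discarding duplicate draws, which the description does not state. The signature and parameter names here are hypothetical.

```python
import random
from typing import List, Optional


def random_ints(seed: Optional[float], n_samples: int, max_val: int) -> List[int]:
    """Return exactly n_samples distinct ints in [1, max_val] (illustrative only)."""
    if n_samples > max_val:
        raise ValueError("Cannot draw more distinct ints than max_val.")
    rng = random.Random(seed)
    seen = set()
    result = []
    # Keep drawing until enough distinct values are collected, rather than
    # drawing a fixed batch and dropping duplicates afterwards.
    while len(result) < n_samples:
        candidate = rng.randint(1, max_val)
        if candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


vals = random_ints(0.5, 5, 10)
```

With a guaranteed count, callers no longer need to over-request and trim.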

Note: Exposing a random sampling parameter for the top-level exposed queries wouldn't make sense anyway, since these are all aggregates. It also would not be possible at the moment, because un-seeded random samples cannot be stored, and so their results could not be retrieved by the API.

cypress · 2019-08-16T17:04:25Z

Test summary

54 • 0 • 0 • 0

Run details


Project	FlowAuth
Status	Passed
Commit	`3bc6eda`
Started	Aug 22, 2019 8:04 AM
Ended	Aug 22, 2019 8:07 AM
Duration	02:36 💡
OS	Linux Debian - 8.11
Browser	Electron 61

codecov · 2019-08-16T17:10:56Z

Codecov Report

Merging #1168 into master will increase coverage by 0.23%.
The diff coverage is 95.65%.

@@            Coverage Diff             @@
##           master    #1168      +/-   ##
==========================================
+ Coverage   93.95%   94.18%   +0.23%     
==========================================
  Files         144      152       +8     
  Lines        7111     7435     +324     
  Branches      699      695       -4     
==========================================
+ Hits         6681     7003     +322     
- Misses        321      326       +5     
+ Partials      109      106       -3

Flag	Coverage Δ
#flowapi_unit_tests	`82.4% <ø> (ø)`	⬆️
#flowauth_unit_tests	`94.18% <ø> (ø)`	⬆️
#flowclient_unit_tests	`78.78% <20%> (-1.31%)`	⬇️
#flowetl_unit_tests	`96.63% <ø> (?)`
#flowkit_jwt_generator_unit_tests	`100% <ø> (ø)`	⬆️
#flowmachine_unit_tests	`90.55% <80%> (-0.18%)`	⬇️
#integration_tests	`67.13% <78.69%> (+1.37%)`	⬆️

Impacted Files	Coverage Δ
...e/flowmachine/core/server/query_schemas/handset.py	`100% <100%> (ø)`	⬆️
...achine/core/server/query_schemas/daily_location.py	`100% <100%> (ø)`	⬆️
...ine/core/server/query_schemas/subscriber_degree.py	`100% <100%> (ø)`	⬆️
...wmachine/core/server/query_schemas/displacement.py	`100% <100%> (ø)`	⬆️
flowmachine/flowmachine/core/query.py	`93.22% <100%> (+0.85%)`	⬆️
...wmachine/core/server/query_schemas/topup_amount.py	`100% <100%> (ø)`	⬆️
flowmachine/flowmachine/core/cache.py	`96.29% <100%> (+0.52%)`	⬆️
...ore/server/query_schemas/unique_location_counts.py	`100% <100%> (ø)`	⬆️
...ne/core/server/query_schemas/radius_of_gyration.py	`100% <100%> (ø)`	⬆️
...e/core/server/query_schemas/pareto_interactions.py	`100% <100%> (ø)`	⬆️
... and 18 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b76f230...2d676c9. Read the comment docs.

codecov · 2019-08-16T17:10:58Z

Codecov Report

Merging #1168 into master will increase coverage by 0.14%.
The diff coverage is 96.04%.

@@            Coverage Diff             @@
##           master    #1168      +/-   ##
==========================================
+ Coverage   93.98%   94.13%   +0.14%     
==========================================
  Files         151      153       +2     
  Lines        7336     7437     +101     
  Branches      697      693       -4     
==========================================
+ Hits         6895     7001     +106     
+ Misses        332      330       -2     
+ Partials      109      106       -3

Flag	Coverage Δ
#flowapi_unit_tests	`82.55% <ø> (ø)`	⬆️
#flowauth_unit_tests	`93.65% <ø> (ø)`	⬆️
#flowclient_unit_tests	`78.11% <14.28%> (-1.98%)`	⬇️
#flowetl_unit_tests	`96.63% <ø> (ø)`	⬆️
#flowkit_jwt_generator_unit_tests	`100% <ø> (ø)`	⬆️
#flowmachine_unit_tests	`90.75% <89.83%> (-0.03%)`	⬇️
#integration_tests	`66.98% <80.23%> (+1.28%)`	⬆️

Impacted Files	Coverage Δ
...e/flowmachine/core/server/query_schemas/handset.py	`100% <100%> (ø)`	⬆️
...achine/core/server/query_schemas/daily_location.py	`100% <100%> (ø)`	⬆️
...ine/core/server/query_schemas/subscriber_degree.py	`100% <100%> (ø)`	⬆️
...wmachine/core/server/query_schemas/displacement.py	`100% <100%> (ø)`	⬆️
flowmachine/flowmachine/core/query.py	`93.22% <100%> (+0.85%)`	⬆️
...wmachine/core/server/query_schemas/topup_amount.py	`100% <100%> (ø)`	⬆️
flowmachine/flowmachine/core/cache.py	`96.29% <100%> (+0.52%)`	⬆️
...ore/server/query_schemas/unique_location_counts.py	`100% <100%> (ø)`	⬆️
...ne/core/server/query_schemas/radius_of_gyration.py	`100% <100%> (ø)`	⬆️
...e/core/server/query_schemas/pareto_interactions.py	`100% <100%> (ø)`	⬆️
... and 13 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8367d8...3bc6eda. Read the comment docs.

greenape

Very nice :)

Just wondering if we can be a little more concise in a few places.

greenape · 2019-08-19T16:40:04Z

flowmachine/flowmachine/core/server/query_schemas/unique_location_counts.py

@@ -19,6 +20,7 @@ class UniqueLocationCountsSchema(Schema):
    end_date = fields.Date(required=True)
    aggregation_unit = AggregationUnit()
    subscriber_subset = SubscriberSubset()
+    sampling = fields.Nested(RandomSampleSchema, allow_none=True)


This particular bit is pretty repetitive, would it be worth putting it in a shared parent class?

greenape · 2019-08-19T16:40:20Z

flowmachine/flowmachine/core/table.py

@@ -121,7 +122,7 @@ def __init__(self, name=None, schema=None, columns=None):
        q_state_machine = QueryStateMachine(self.redis, self.md5)
        q_state_machine.enqueue()
        q_state_machine.execute()
-        self._db_store_cache_metadata(compute_time=0)
+        write_cache_metadata(self.connection, self, compute_time=0)


greenape · 2019-08-19T16:42:29Z

flowclient/flowclient/client.py

@@ -617,6 +617,7 @@ def daily_location(
    aggregation_unit: str,
    method: str,
    subscriber_subset: Union[dict, None] = None,
+    sampling: Union[dict, None] = None,


I think in API world it makes sense for sampling to be rolled into the query params, but I wonder if client side it makes more sense to be able to do random_sample(other_query, ..)?

greenape · 2019-08-19T16:43:39Z

flowdb/tests/test_utility_functions.py

    vals = [x["id"] for x in cursor.fetchall()]
-    assert [9, 4, 8] == vals
+    assert len(vals) == 5


greenape · 2019-08-20T12:17:39Z

flowclient/flowclient/client.py

    """
-    spec = {
+    sampled_query = query.copy()


Slightly more pythonic to do dict(query), although obviously in both cases any nested dicts still refer to their underlying object.

Ah, good idea. I'm hoping a shallow copy is sufficient since random_sample doesn't modify any existing values in query, just adds a new one.
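For reference, a small sketch of the shallow-copy behaviour being discussed. This random_sample is a simplified, hypothetical version of the FlowClient helper, and the query contents and sampling keys are made up for the example.

```python
from typing import Any, Dict, Optional


def random_sample(
    query: Dict[str, Any],
    *,
    size: Optional[int] = None,
    seed: Optional[float] = None,
) -> Dict[str, Any]:
    """Return a copy of the query spec with a 'sampling' entry added."""
    sampled_query = dict(query)  # shallow copy: only top-level keys are copied
    sampled_query["sampling"] = {"size": size, "seed": seed}
    return sampled_query


query = {"query_kind": "daily_location", "subscriber_subset": {"ids": [1, 2]}}
sampled = random_sample(query, size=10, seed=0.2)

# The original spec gains no new key, but nested values are shared objects.
shares_nested = sampled["subscriber_subset"] is query["subscriber_subset"]
```

Because only a new top-level key is added, the original spec is left untouched. If a later version needed to modify nested values, copy.deepcopy would be the safer choice.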

greenape · 2019-08-21T11:43:38Z

flowmachine/flowmachine/core/random.py

+    # be stored by accident.
+    @property
+    def table_name(self):
+        if hasattr(self, "seed") and self.seed is not None:


Wondering if, rather than needing to check the logic in a bunch of places, it would be nicer to have a SeededRandom class to mix in, or SeededRandom(RandomBase, metaclass=ABCMeta) to use as another step in inheritance chain?
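Something along these lines, as a sketch only. The classes are cut-down stand-ins rather than the ones in random.py, and the table name format is invented for illustration.

```python
from abc import ABCMeta
from typing import Optional


class RandomBase(metaclass=ABCMeta):
    """Hypothetical, cut-down stand-in for the base class in random.py."""

    def __init__(self, query_name: str):
        self.query_name = query_name

    @property
    def table_name(self) -> str:
        # Without a seed the sample is not reproducible, so it gets no
        # cache table name and cannot be stored by accident.
        raise NotImplementedError("Unseeded random samples cannot be stored.")


class SeededRandom(RandomBase, metaclass=ABCMeta):
    """Extra step in the inheritance chain that owns all of the seed logic."""

    def __init__(self, query_name: str, seed: Optional[float] = None):
        super().__init__(query_name)
        self.seed = seed

    @property
    def table_name(self) -> str:
        if self.seed is None:
            return super().table_name  # raises, same as an unseedable sample
        # Invented naming scheme, for illustration only.
        return f"sample_{self.query_name}_{self.seed}"


seeded_name = SeededRandom("calls", seed=0.5).table_name
```

Sampling methods that accept a seed would inherit from the seedable class, and the rest would inherit from RandomBase directly, so no hasattr checks are needed elsewhere.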

greenape · 2019-08-21T15:29:27Z

flowmachine/flowmachine/core/random.py

+        """
+        Parameters passed when initialising this query.
+        """
+        return {


How about:

return dict(seed=self.seed, **super()._sample_params)
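In context, that could look roughly like the sketch below. The classes are simplified and hypothetical: the query argument is omitted for brevity, and SeedableSample is a made-up name for a subclass that accepts a seed.

```python
from typing import Any, Dict, Optional


class RandomBase:
    """Simplified stand-in: holds the parameters common to all sampling methods."""

    def __init__(
        self,
        *,
        size: Optional[int] = None,
        fraction: Optional[float] = None,
        estimate_count: bool = True,
    ):
        self.size = size
        self.fraction = fraction
        self.estimate_count = estimate_count

    @property
    def _sample_params(self) -> Dict[str, Any]:
        """Parameters passed when initialising this query."""
        return {
            "size": self.size,
            "fraction": self.fraction,
            "estimate_count": self.estimate_count,
        }


class SeedableSample(RandomBase):
    def __init__(self, *, seed: Optional[float] = None, **kwargs: Any):
        super().__init__(**kwargs)
        self.seed = seed

    @property
    def _sample_params(self) -> Dict[str, Any]:
        # Extend the parent's parameters instead of listing them all again.
        return dict(seed=self.seed, **super()._sample_params)


params = SeedableSample(size=10, seed=0.5)._sample_params
```

One caveat: dict(seed=..., **other) raises a TypeError if the parent's dict already contains a "seed" key, so this relies on the base class never reporting a seed itself.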

greenape · 2019-08-21T15:30:07Z

flowmachine/flowmachine/core/random.py

+        return sampled_query
+
+
+class SeededRandom(RandomBase, metaclass=ABCMeta):


Let's change this to SeedableRandom, because there's no guarantee it is actually seeded.

greenape · 2019-08-21T15:48:34Z

flowmachine/flowmachine/core/random.py

+    Base class for queries used to obtain a random sample from a table.
+    """
+
+    def __init__(self, query, *, size=None, fraction=None, estimate_count=True):


Type annotations while we're here.

greenape

Bonza :)

jc-harrison added 30 commits July 30, 2019 17:27

Define random sample schema

f90e226

Update Random docstring

48c59df

Change 'seed' value validation

0e01212

Add post-load random sample object with method to create flowmachine Random object

a69e10e

Add missing import

41c4e04

Use a OneOffSchema for RandomSampleSchema

c644fb7

Add sampling parameter to daily_location

d696245

Small fixes

83178f9

Re-implement random_ints to return the specified number of ints

ccb2e53

Fix incorrect calling signature in random.py

8ee3404

Relax seed validity condition in query.py

a7b3614

No longer need a size_buffer for random_ids

972736b

Use random_sample method in random_sample schema

7849440

Add tests for random sampling schema

be641b2

Update docstring

1b1ffa8

Move seed check from Query class to Random class

1e792bd

Refactor Random classes

84102d7

Fix tests

c5f0ef2

Add sampling_method field for API spec visibility

90d5afd

Approve integration tests

8ea9a23

Allow sampling=None

fa55c60

Add random_sample function to FlowClient

12b7945

Fix default values in flowclient.handset

8c73a19

Add docstring for random_sample

a8d955d

Approve integration tests

2871525

Add integration tests for random sampling

ec3a4a0

Remove _db_store_cache_metadata method

c18226c

Make random samples picklable

46414d8

Add test for pickling Random objects

619c582

Fix tests that were skipped due to bad names

9e80c19

jc-harrison added the FlowMachine label (Issues related to FlowMachine) Aug 16, 2019

Approve integration tests

2d676c9

greenape requested changes Aug 19, 2019

jc-harrison added 2 commits August 20, 2019 11:39

Merge branch 'master' of github.com:Flowminder/FlowKit into expose-random-sampling

1b4c107

Pass query to random_sample in FlowClient

0102a7a

greenape reviewed Aug 20, 2019

jc-harrison added 4 commits August 20, 2019 14:31

Use dict() instead of copy()

8eff4ec

Merge branch 'master' of github.com:Flowminder/FlowKit into expose-random-sampling

847a0c1

Move sampling stuff into parent classes

1c7b426

Merge branch 'master' into expose-random-sampling

9fda795

greenape reviewed Aug 21, 2019

jc-harrison added 3 commits August 21, 2019 15:33

Merge branch 'master' of github.com:Flowminder/FlowKit into expose-random-sampling

d3af3b4

Add SeededRandom class

ad275e0

Merge branch 'master' of github.com:Flowminder/FlowKit into expose-random-sampling

7f2e06d

greenape reviewed Aug 21, 2019

jc-harrison added 3 commits August 21, 2019 16:40

Fix __init__

95ec982

Rename SeededRandom to SeedableRandom

ae0a353

Don't duplicate _sample_params

d7b87b4

greenape reviewed Aug 21, 2019

jc-harrison added 3 commits August 21, 2019 17:42

Type annotations

91db44d

Merge branch 'master' of github.com:Flowminder/FlowKit into expose-random-sampling

de922fb

Merge branch 'master' into expose-random-sampling

3bc6eda

greenape approved these changes Aug 22, 2019

jc-harrison added the ready-to-merge label (Label indicating a PR is OK to automerge) Aug 22, 2019

mergify bot merged commit c34a45e into master Aug 22, 2019

mergify bot deleted the expose-random-sampling branch August 22, 2019 09:03
