-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sql search_metric_contribution #1077
Add sql search_metric_contribution #1077
Conversation
@fbertsch would you mind taking a look at this? Ding is an intern who's working on productionizing a table to help us look at questions like the following on a daily basis: "What percent of Ad Clicks are contributed by users in the top [10, 20, 30] percent of the SAP distribution?" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, our tables are partitioned by submission_date
. Right now this table doesn't have that field, so this will end up failing.
Additionally, this table will only really be useful at the end of the month; during the middle of the money, the fields (e.g. search_with_ads
) will only have half the current month.
Additionally, months aren't even; some are longer than others.
I recommend instead including the @submission_date
parameter as the submission_date
field. Then read the past 27 days before that day (submission_date >= DATE_SUB(@submission_date, INTERVAL 27 DAY)
, and get the aggregates of those fields over that entire time period. That way we get daily updates of this table, every day is getting quantiles over the same time period, and it should match other metrics we use over monthly intervals.
@din-din-din, let me know if you'd like to meet to discuss those changes. Happy to help out. |
This LGTM, @fbertsch please confirm :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more changes. Getting there! Thanks for persevering.
description: > | ||
A table shows how much of one metric is contributed by another metric. | ||
owners: | ||
- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -0,0 +1,9 @@ | |||
friendly_name: Search Metric Contribution | |||
description: > | |||
A table shows how much of one metric is contributed by another metric. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could use a bit more fleshing out: What specific questions does this solve? What are the fields? What is it derived from?
labels: | ||
schedule: daily | ||
scheduling: | ||
dag_name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also seems to be empty.
FROM | ||
`moz-fx-data-shared-prod.search.search_clients_daily` | ||
WHERE | ||
submission_date >= DATE_SUB(@submission_date, INTERVAL 27 DAY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
submission_date >= DATE_SUB(@submission_date, INTERVAL 27 DAY) | |
submission_date BETWEEN DATE_SUB(@submission_date, INTERVAL 27 DAY) AND @submission_date |
This prevents us from reading the future during backfills.
WHERE | ||
submission_date >= DATE_SUB(@submission_date, INTERVAL 27 DAY) | ||
GROUP BY | ||
1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per our sql style guide, we want explicitly named columns.
GROUP BY | ||
1 | ||
), | ||
monthly_aggregates AS ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the CTE below can be merged into one (with all SUM
fields and all APPROX_QUANTILE
fields).
@@ -0,0 +1,213 @@ | |||
-- temp function returns how much of one metric is contributed by the quantile of another metric | |||
CREATE TEMP FUNCTION get_quantile(metric1 FLOAT64, metric2 FLOAT64, quantile FLOAT64) AS ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be moved into the udfs directory, and given a more descriptive name. The name of the file you create there will be udf/{udf_name}.sql
, and inside that file use CREATE OR REPLACE FUNCTION udf.{udf_name}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added udf.quantile_search_metric_contribution and called this function in query. But the log showed a 'function not found' error. Why was this issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind reviewing the new udf I added? Otherwise I can't add derived datasets I believe. Let me know if there were any problems! Thanks! :)
labels: | ||
schedule: daily | ||
scheduling: | ||
dag_name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dag_name: | |
dag_name: bqetl_search |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the comments. I will review these and get back to you!
- add quantile_search_metric_contribution.sql in udf folder - some fixes on search_metric_contribution - update search_metric_contribution metadata
dthorn mentioned in #1081
@fbertsch are you able to "manually deploy" this UDF as dthorn describes? Or should we move the udf to a separate PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I can get this working. Thanks for the contribution @din-din-din!
assert_equals(1.0, udf.quantile_search_metric_contribution(73, 1, 60)), | ||
assert_null(udf.quantile_search_metric_contribution(29, 9, 60)), | ||
assert_equals(8.0, udf.quantile_search_metric_contribution(118, 8, 21)), | ||
assert_null(udf.quantile_search_metric_contribution(58, 5, 21)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is incorrect. It returns 5 because 58 > 21.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fbertsch! I fixed that. Let me know if there were any other problems!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
Add sql folder search_metric_contribution under search_derived. Please review and let @benmiroglio and me know if there are any problems! Thanks!