Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sql search_metric_contribution #1077

Merged

Conversation

din-din-din
Copy link

Add sql folder search_metric_contribution under search_derived. Please review and let @benmiroglio and me know if there are any problems! Thanks!

@benmiroglio
Copy link
Contributor

@fbertsch would you mind taking a look at this? Ding is an intern who's working on productionizing a table to help us look at questions like the following on a daily basis:

"What percent of Ad Clicks are contributed by users in the top [10, 20, 30] percent of the SAP distribution?"

Copy link
Contributor

@fbertsch fbertsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, our tables are partitioned by submission_date. Right now this table doesn't have that field, so this will end up failing.

Additionally, this table will only really be useful at the end of the month; during the middle of the money, the fields (e.g. search_with_ads) will only have half the current month.

Additionally, months aren't even; some are longer than others.

I recommend instead including the @submission_date parameter as the submission_date field. Then read the past 27 days before that day (submission_date >= DATE_SUB(@submission_date, INTERVAL 27 DAY), and get the aggregates of those fields over that entire time period. That way we get daily updates of this table, every day is getting quantiles over the same time period, and it should match other metrics we use over monthly intervals.

@fbertsch
Copy link
Contributor

@din-din-din, let me know if you'd like to meet to discuss those changes. Happy to help out.

@benmiroglio
Copy link
Contributor

This LGTM, @fbertsch please confirm :)

Copy link
Contributor

@fbertsch fbertsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more changes. Getting there! Thanks for persevering.

description: >
A table shows how much of one metric is contributed by another metric.
owners:
-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@@ -0,0 +1,9 @@
friendly_name: Search Metric Contribution
description: >
A table shows how much of one metric is contributed by another metric.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use a bit more fleshing out: What specific questions does this solve? What are the fields? What is it derived from?

labels:
schedule: daily
scheduling:
dag_name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems to be empty.

FROM
`moz-fx-data-shared-prod.search.search_clients_daily`
WHERE
submission_date >= DATE_SUB(@submission_date, INTERVAL 27 DAY)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
submission_date >= DATE_SUB(@submission_date, INTERVAL 27 DAY)
submission_date BETWEEN DATE_SUB(@submission_date, INTERVAL 27 DAY) AND @submission_date

This prevents us from reading the future during backfills.

WHERE
submission_date >= DATE_SUB(@submission_date, INTERVAL 27 DAY)
GROUP BY
1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per our sql style guide, we want explicitly named columns.

GROUP BY
1
),
monthly_aggregates AS (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the CTE below can be merged into one (with all SUM fields and all APPROX_QUANTILE fields).

@@ -0,0 +1,213 @@
-- temp function returns how much of one metric is contributed by the quantile of another metric
CREATE TEMP FUNCTION get_quantile(metric1 FLOAT64, metric2 FLOAT64, quantile FLOAT64) AS (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be moved into the udfs directory, and given a more descriptive name. The name of the file you create there will be udf/{udf_name}.sql, and inside that file use CREATE OR REPLACE FUNCTION udf.{udf_name}.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added udf.quantile_search_metric_contribution and called this function in query. But the log showed a 'function not found' error. Why was this issue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind reviewing the new udf I added? Otherwise I can't add derived datasets I believe. Let me know if there were any problems! Thanks! :)

labels:
schedule: daily
scheduling:
dag_name:
Copy link
Contributor

@fbertsch fbertsch Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dag_name:
dag_name: bqetl_search

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the comments. I will review these and get back to you!

Ding Ding added 3 commits June 18, 2020 15:13
- add quantile_search_metric_contribution.sql in udf folder
- some fixes on search_metric_contribution
- update search_metric_contribution metadata
@benmiroglio
Copy link
Contributor

dthorn mentioned in #1081

Yeah, that's a known-issue with PRs that add a persistent UDF and use it in the same PR. The process for this is that we deploy the UDF manually right before merging the PR, and dry-run again to ensure it works as expected. Alternately the UDF can be moved to a separate PR.

@fbertsch are you able to "manually deploy" this UDF as dthorn describes? Or should we move the udf to a separate PR?

Copy link
Contributor

@fbertsch fbertsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I can get this working. Thanks for the contribution @din-din-din!

assert_equals(1.0, udf.quantile_search_metric_contribution(73, 1, 60)),
assert_null(udf.quantile_search_metric_contribution(29, 9, 60)),
assert_equals(8.0, udf.quantile_search_metric_contribution(118, 8, 21)),
assert_null(udf.quantile_search_metric_contribution(58, 5, 21));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is incorrect. It returns 5 because 58 > 21.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fbertsch! I fixed that. Let me know if there were any other problems!

Copy link
Contributor

@fbertsch fbertsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@fbertsch fbertsch merged commit 2cba4ac into mozilla:master Jun 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants