Consider channel-specific main_v4 tables, filtering probes per channel #1216
Comments
cc @acmiyaguchi Do you have a good sense of what percentage of probes we'd expect to never be populated in main pings?
I'm not familiar with the sparsity of the main ping, but I expect a fairly large fraction of the columns are empty. Because of how we initially created the table, we have been using all probes since Firefox 30 when generating the schema. There are the probe dictionary stats, which show the number of reported probes.

I also took a look at the mozaggregator dumps to see the distribution of total users per probe. We haven't had release data in there for over a year, so I took a look at the log(total) for all of the non-keyed probes on 2018-11-01. The plot gives some reference for how populated each of the columns in the main ping table should be.

If we counted the number of null values for each probe in the main ping, I expect a large fraction of the columns would be unpopulated in release. I think opt-in would account for about 1-2k of these columns, and another 1k would be sparse due to expired probes. The fraction of rows populated with expired probes would probably be related to the relative size of old versions in the table. We probably don't care about older probes, so it may be good to cut these too if we wanted to make a more compact table.

I can take a look and gather more definitive stats. There is some code for generating the
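To make the null-counting idea above concrete, here is a minimal sketch that computes, for each column, the fraction of rows where the value is missing. It runs on a tiny in-memory sample. The column names (`probe_a`, etc.) and the helper `null_fraction_per_column` are hypothetical and not taken from the actual main ping tooling; a real check would more likely be a query against the table itself.

```python
from collections import Counter


def null_fraction_per_column(rows, columns):
    """Return {column: fraction of rows where the column is missing or None}."""
    if not rows:
        return {}
    null_counts = Counter()
    for row in rows:
        for col in columns:
            # A column absent from the row is treated the same as an explicit null.
            if row.get(col) is None:
                null_counts[col] += 1
    return {col: null_counts[col] / len(rows) for col in columns}


# Hypothetical sample: each dict stands in for one row of a wide ping table.
sample_rows = [
    {"probe_a": 3, "probe_b": None, "probe_c": None},
    {"probe_a": 1, "probe_b": 7, "probe_c": None},
    {"probe_a": None, "probe_b": None},
    {"probe_a": 5, "probe_b": None, "probe_c": None},
]
columns = ["probe_a", "probe_b", "probe_c"]

fractions = null_fraction_per_column(sample_rows, columns)

# Columns that are null in every row are candidates for dropping from the schema.
never_populated = [col for col in columns if fractions[col] == 1.0]
```

Columns with a null fraction of 1.0 are the "never populated" ones asked about earlier in the thread; a lower cutoff could be used to flag sparse columns as well.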
There is a quick and dirty way to see what the schema-generator will decide is a release probe:
For reference, this is a bit more than half the size of all probes:
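A minimal sketch of that kind of check, not the original command: it filters a list of probe definitions down to those collected by default on release and compares the count with the full list. The record layout, the `opt_out_on_release` flag, and the helper `release_probe_names` are assumptions made for illustration; the schema-generator's real inputs and logic may differ.

```python
def release_probe_names(probes):
    """Return names of probes that are collected by default (opt-out) on release."""
    return [p["name"] for p in probes if p["opt_out_on_release"]]


# Hypothetical probe definitions; real ones would come from probe metadata.
all_probes = [
    {"name": "probe_a", "opt_out_on_release": True},
    {"name": "probe_b", "opt_out_on_release": False},
    {"name": "probe_c", "opt_out_on_release": True},
    {"name": "probe_d", "opt_out_on_release": False},
    {"name": "probe_e", "opt_out_on_release": True},
]

release_names = release_probe_names(all_probes)

# Share of the full schema that a release-only table would keep.
release_share = len(release_names) / len(all_probes)
```

On real probe metadata, `release_share` would indicate roughly how much narrower a release-only schema could be.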
I'm under the impression that a large number of potential probes (the majority?) in the main ping are not included in release. Or maybe they're included in release, but we basically never see values in release because you have to opt-in, whereas those probes are on by default in nightly and beta.
I think it would be worth spending a little time on this to see if a main_v4_release table could have a significantly smaller schema if we never added opt-in probes to it. That would mean much smaller data volume in the problematic large-schema main_v4_nightly table.