Counts estimate significantly off from actual count on raw IE datatable #3948
Comments
Results from the
Results
|
Here's a WIP PR that uses the actual count for Schedule E efile only. We can override the |
This query is based on the view public.real_efile_se_f57_vw and real_efile.reps. real_efile_se_f57_vw itself has several source tables/views: Since we recently did a mass cleanup of old raw data, I wondered whether the statistics were off for the main source tables f57 and se. I checked the statistics of all related tables, and they had been updated recently. To be sure, I also ran a manual analysis of these base tables, and the estimate still went haywire. After discussing with Laura: since none of the source tables/views for this query hold a huge amount of data, a direct count does not take long. We therefore decided that this particular endpoint should skip the estimate and go directly for the count. |
Thanks @fecjjeng! Based on our conversation, because Schedule E efile doesn't have that much data, we can just tell the API to use true counts. |
Quick thinking, Laura! |
What we're after:
A bug was discovered with the estimated counts on the raw IE datatable. As of 9/10/19 it was displaying 9,919,673,000 when the actual count was 545,952. We have it set so that an estimate is considered sufficient when the count is over 500,000. It seems that because the actual count is over 500,000, it's not taken into account, so the API is returning an estimate that is severely inaccurate.
We need to determine how to fix this in a way that provides a more accurate count of what actually exists in the data. Without a solution, this problem will persist as we continue to receive new data.
Example:
https://api.open.fec.gov/v1/schedules/schedule_e/efile/?api_key=DEMO_KEY&sort_hide_null=false&sort_nulls_last=true&data_type=efiling&is_notice=true&sort=-expenditure_date&per_page=30&page=1
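To make the threshold behavior described above concrete, here is a minimal sketch in Python. It is not the actual API code: `choose_count`, `ESTIMATE_THRESHOLD`, and the `use_estimated_counts` flag are hypothetical names used only for illustration. It assumes the API first obtains an estimate, trusts it when it exceeds 500,000, and otherwise falls back to an exact count. With that logic a wildly wrong estimate is returned as-is, which matches the numbers reported here. Turning estimates off for a single endpoint (the approach discussed in the comments for Schedule E efile) forces the exact count.

```python
# Hypothetical sketch of "estimate above a threshold, exact count otherwise".
# None of these names are taken from the real codebase.
ESTIMATE_THRESHOLD = 500_000


def choose_count(estimated_count, exact_count_fn, use_estimated_counts=True,
                 threshold=ESTIMATE_THRESHOLD):
    """Return (count, is_estimate).

    estimated_count: the cheap estimate (assumed to come from table statistics).
    exact_count_fn: zero-argument callable that runs the real COUNT query.
    use_estimated_counts: set to False to force an exact count for an endpoint.
    """
    if use_estimated_counts and estimated_count > threshold:
        # The estimate is trusted without ever being checked against reality.
        return estimated_count, True
    return exact_count_fn(), False


def exact_count():
    # Stand-in for the real COUNT query; returns the actual count from this issue.
    return 545_952


# Default behavior: the bad estimate exceeds the threshold, so it is returned.
buggy = choose_count(9_919_673_000, exact_count)

# Per-endpoint override: skip the estimate and count directly.
fixed = choose_count(9_919_673_000, exact_count, use_estimated_counts=False)
```

The override trades query time for accuracy, which is acceptable here only because the source tables for this endpoint are small enough for a direct count to be fast.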
Completion criteria: