Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Counts estimate significantly off from actual count on raw IE datatable #3948

Closed
2 tasks done
JonellaCulmer opened this issue Sep 10, 2019 · 5 comments · Fixed by #4002
Closed
2 tasks done

Counts estimate significantly off from actual count on raw IE datatable #3948

JonellaCulmer opened this issue Sep 10, 2019 · 5 comments · Fixed by #4002

Comments

@JonellaCulmer
Copy link
Contributor

JonellaCulmer commented Sep 10, 2019

What we're after:
A bug was discovered with the estimated counts on the raw IE datatable. As of 9/10/19 it was displaying 9,919,673,000 when the actual count was 545,952. We have it set that an estimate is sufficient when the count is over 500,000. It seems that because the actual count is over 500,000, it's not taken into account, so it's spitting out something that is severely inaccurate.

We need to determine how to fix this in a way that provides a more accurate count of what actually exists in the data. Without a solution, this is a problem that will perpetuate as we continue to receive new data.

Screen Shot 2019-09-10 at 12 29 05 PM

Example:
https://api.open.fec.gov/v1/schedules/schedule_e/efile/?api_key=DEMO_KEY&sort_hide_null=false&sort_nulls_last=true&data_type=efiling&is_notice=true&sort=-expenditure_date&per_page=30&page=1

Completion criteria:

  • Raw filing data is being actively being reduced to 4 months. We are testing in staging and will be moving to apply to production soon. See ticket: Only keep four months of raw filings data in PG #3789
  • After stage and production data is reduced, check the actual database counts of raw IE data vs what's displayed on the page and in the API call. The assumption is the totals should be more accurate.
@JonellaCulmer JonellaCulmer added this to the Sprint 10.2 milestone Sep 10, 2019
@patphongs patphongs modified the milestones: Sprint 10.2, Sprint 10.3 Sep 20, 2019
@lbeaufort
Copy link
Member

Results from the EXPLAIN:
Query

SELECT real_efile_se_f57_vw.line_num, real_efile_se_f57_vw.tran_id, real_efile_se_f57_vw.imageno, real_efile_se_f57_vw.entity, real_efile_se_f57_vw.amend, real_efile_se_f57_vw.br_tran_id, real_efile_se_f57_vw.br_sname, real_efile_se_f57_vw.create_dt, real_efile_se_f57_vw.repid, real_efile_se_f57_vw.rel_lineno, real_efile_se_f57_vw.comid, real_efile_se_f57_vw.prefix, real_efile_se_f57_vw.fname, real_efile_se_f57_vw.mname, real_efile_se_f57_vw.lname, real_efile_se_f57_vw.suffix, real_efile_se_f57_vw.str1, real_efile_se_f57_vw.str2, real_efile_se_f57_vw.city, real_efile_se_f57_vw.state, real_efile_se_f57_vw.zip, real_efile_se_f57_vw.pcf_lname, real_efile_se_f57_vw.pcf_mname, real_efile_se_f57_vw.pcf_fname, real_efile_se_f57_vw.pcf_suffix, real_efile_se_f57_vw.pcf_prefix, real_efile_se_f57_vw.so_canid, real_efile_se_f57_vw.so_can_name, real_efile_se_f57_vw.so_can_prefix, real_efile_se_f57_vw.so_can_fname, real_efile_se_f57_vw.so_can_mname, real_efile_se_f57_vw.so_can_suffix, real_efile_se_f57_vw.so_can_off, real_efile_se_f57_vw.so_can_state, real_efile_se_f57_vw.so_can_dist, real_efile_se_f57_vw.exp_desc, real_efile_se_f57_vw.exp_date, real_efile_se_f57_vw.amount, real_efile_se_f57_vw.ytd, real_efile_se_f57_vw.cat_code, real_efile_se_f57_vw.supop, real_efile_se_f57_vw.not_date, real_efile_se_f57_vw.dissem_dt, real_efile_se_f57_vw.cand_pty_affiliation, reps_1.repid, reps_1.form, reps_1.comid, reps_1.com_name, reps_1.timestamp, reps_1.create_dt, reps_1.from_date, reps_1.through_date, reps_1.starting, reps_1.ending, reps_1.rptcode, reps_1.previd, reps_1.rptnum, reps_1.filed_date, reps_1.superceded, real_efile_se_f57_vw.memo_code, real_efile_se_f57_vw.memo_text, real_efile_se_f57_vw.cand_fulltxt, real_efile_se_f57_vw.most_recent, efiling_amendment_chain_vw_1.repid, efiling_amendment_chain_vw_1.previd, efiling_amendment_chain_vw_1.amendment_chain, efiling_amendment_chain_vw_1.longest_chain, efiling_amendment_chain_vw_1.most_recent_filing, efiling_amendment_chain_vw_1.depth, efiling_amendment_chain_vw_1.last, ofec_committee_history_mv_1.idx, ofec_committee_history_mv_1.name, ofec_committee_history_mv_1.committee_id, ofec_committee_history_mv_1.cycles, ofec_committee_history_mv_1.treasurer_name, ofec_committee_history_mv_1.treasurer_text, ofec_committee_history_mv_1.committee_type, ofec_committee_history_mv_1.committee_type_full, ofec_committee_history_mv_1.filing_frequency, ofec_committee_history_mv_1.designation, ofec_committee_history_mv_1.designation_full, ofec_committee_history_mv_1.organization_type, ofec_committee_history_mv_1.organization_type_full, ofec_committee_history_mv_1.affiliated_committee_name, ofec_committee_history_mv_1.party, ofec_committee_history_mv_1.party_full, ofec_committee_history_mv_1.state, ofec_committee_history_mv_1.street_1, ofec_committee_history_mv_1.street_2, ofec_committee_history_mv_1.city, ofec_committee_history_mv_1.state_full, ofec_committee_history_mv_1.zip, ofec_committee_history_mv_1.candidate_ids, ofec_committee_history_mv_1.cycle 
FROM real_efile_se_f57_vw JOIN real_efile.reps AS reps_1 ON real_efile_se_f57_vw.repid = reps_1.repid LEFT OUTER JOIN efiling_amendment_chain_vw AS efiling_amendment_chain_vw_1 ON efiling_amendment_chain_vw_1.repid = reps_1.repid LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_1 ON real_efile_se_f57_vw.comid = ofec_committee_history_mv_1.committee_id AND EXTRACT(year FROM real_efile_se_f57_vw.create_dt) + CAST(EXTRACT(year FROM real_efile_se_f57_vw.create_dt) AS INTEGER) % 2 = ofec_committee_history_mv_1.cycle

Results

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                                                                                                                                                  |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Hash Join  (cost=416559.18..46244192.95 rows=367838276 width=2853)                                                                                                                                                                                                          |
|   Hash Cond: (combined.repid = reps_1.repid)                                                                                                                                                                                                                                |
|   ->  Nested Loop Left Join  (cost=220359.65..29142065.46 rows=19972728 width=2577)                                                                                                                                                                                         |
|         ->  Merge Left Join  (cost=220359.23..15749221.34 rows=19972728 width=3132)                                                                                                                                                                                         |
|               Merge Cond: (combined.repid = amd.repid)                                                                                                                                                                                                                      |
|               CTE combined                                                                                                                                                                                                                                                  |
|                 ->  Append  (cost=0.00..17188.18 rows=13218 width=1740)                                                                                                                                                                                                     |
|                       ->  Seq Scan on f57  (cost=0.00..10791.76 rows=6076 width=1683)                                                                                                                                                                                       |
|                       ->  Seq Scan on se  (cost=0.00..6396.42 rows=7142 width=1790)                                                                                                                                                                                         |
|               ->  Sort  (cost=13148.77..13181.81 rows=13218 width=2201)                                                                                                                                                                                                     |
|                     Sort Key: combined.repid                                                                                                                                                                                                                                |
|                     ->  Hash Left Join  (cost=5368.40..12243.98 rows=13218 width=2201)                                                                                                                                                                                      |
|                           Hash Cond: (((combined.so_canid)::text = (cand.cand_id)::text) AND ((date_part('year'::text, combined.create_dt) + (((date_part('year'::text, combined.create_dt))::integer % 2))::double precision) = (cand.fec_election_yr)::double precision)) |
|                           ->  CTE Scan on combined  (cost=0.00..264.36 rows=13218 width=2198)                                                                                                                                                                               |
|                           ->  Hash  (cost=3724.16..3724.16 rows=109616 width=18)                                                                                                                                                                                            |
|                                 ->  Seq Scan on cand_valid_fec_yr cand  (cost=0.00..3724.16 rows=109616 width=18)                                                                                                                                                           |
|               ->  Sort  (cost=190022.28..190777.79 rows=302205 width=32)                                                                                                                                                                                                    |
|                     Sort Key: amd.repid                                                                                                                                                                                                                                     |
|                     ->  Subquery Scan on amd  (cost=152692.15..162513.82 rows=302205 width=32)                                                                                                                                                                              |
|                           ->  WindowAgg  (cost=152692.15..159491.77 rows=302205 width=184)                                                                                                                                                                                  |
|                                 CTE oldest_filing                                                                                                                                                                                                                           |
|                                   ->  Recursive Union  (cost=0.00..119139.59 rows=302205 width=75)                                                                                                                                                                          |
|                                         ->  Seq Scan on reps  (cost=0.00..6022.09 rows=12655 width=65)                                                                                                                                                                      |
|                                               Filter: (previd IS NULL)                                                                                                                                                                                                      |
|                                         ->  Hash Join  (cost=6069.02..10707.34 rows=28955 width=75)                                                                                                                                                                         |
|                                               Hash Cond: (oldest.repid = se_1.previd)                                                                                                                                                                                       |
|                                               ->  WorkTable Scan on oldest_filing oldest  (cost=0.00..2531.00 rows=126550 width=52)                                                                                                                                         |
|                                               ->  Hash  (cost=6022.09..6022.09 rows=3754 width=23)                                                                                                                                                                          |
|                                                     ->  Seq Scan on reps se_1  (cost=0.00..6022.09 rows=3754 width=23)                                                                                                                                                      |
|                                                           Filter: (previd IS NOT NULL)                                                                                                                                                                                      |
|                                 ->  Sort  (cost=33552.56..34308.08 rows=302205 width=36)                                                                                                                                                                                    |
|                                       Sort Key: oldest_filing.last, oldest_filing.depth                                                                                                                                                                                     |
|                                       ->  CTE Scan on oldest_filing  (cost=0.00..6044.10 rows=302205 width=36)                                                                                                                                                              |
|         ->  Index Scan using idx_ofec_committee_history_mv_committee_id on ofec_committee_history_mv ofec_committee_history_mv_1  (cost=0.42..0.65 rows=1 width=343)                                                                                                        |
|               Index Cond: ((combined.comid)::text = (committee_id)::text)                                                                                                                                                                                                   |
|               Filter: ((date_part('year'::text, combined.create_dt) + (((date_part('year'::text, combined.create_dt))::integer % 2))::double precision) = (cycle)::double precision)                                                                                        |
|   ->  Hash  (cost=181206.98..181206.98 rows=302205 width=276)                                                                                                                                                                                                               |
|         ->  Hash Right Join  (cost=158919.36..181206.98 rows=302205 width=276)                                                                                                                                                                                              |
|               Hash Cond: (oldest_filing_1.repid = reps_1.repid)                                                                                                                                                                                                             |
|               ->  WindowAgg  (cost=152692.15..167802.40 rows=302205 width=184)                                                                                                                                                                                              |
|                     CTE oldest_filing                                                                                                                                                                                                                                       |
|                       ->  Recursive Union  (cost=0.00..119139.59 rows=302205 width=75)                                                                                                                                                                                      |
|                             ->  Seq Scan on reps reps_2  (cost=0.00..6022.09 rows=12655 width=65)                                                                                                                                                                           |
|                                   Filter: (previd IS NULL)                                                                                                                                                                                                                  |
|                             ->  Hash Join  (cost=6069.02..10707.34 rows=28955 width=75)                                                                                                                                                                                     |
|                                   Hash Cond: (oldest_1.repid = se_2.previd)                                                                                                                                                                                                 |
|                                   ->  WorkTable Scan on oldest_filing oldest_1  (cost=0.00..2531.00 rows=126550 width=52)                                                                                                                                                   |
|                                   ->  Hash  (cost=6022.09..6022.09 rows=3754 width=23)                                                                                                                                                                                      |
|                                         ->  Seq Scan on reps se_2  (cost=0.00..6022.09 rows=3754 width=23)                                                                                                                                                                  |
|                                               Filter: (previd IS NOT NULL)                                                                                                                                                                                                  |
|                     ->  Sort  (cost=33552.56..34308.08 rows=302205 width=100)                                                                                                                                                                                               |
|                           Sort Key: oldest_filing_1.last, oldest_filing_1.depth                                                                                                                                                                                             |
|                           ->  CTE Scan on oldest_filing oldest_filing_1  (cost=0.00..6044.10 rows=302205 width=100)                                                                                                                                                         |
|               ->  Hash  (cost=6022.09..6022.09 rows=16409 width=128)                                                                                                                                                                                                        |
|                     ->  Seq Scan on reps reps_1  (cost=0.00..6022.09 rows=16409 width=128)                                                                                                                                                                                  |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

@lbeaufort
Copy link
Member

lbeaufort commented Oct 7, 2019

Here's a WIP PR that uses the actual count for Schedule E efile only. We can override the get method to use exact count. #4002

@fecjjeng
Copy link
Contributor

fecjjeng commented Oct 7, 2019

This query is based on the the view public.real_efile_se_f57_vw and real_efile.reps. real_efile_se_f57_vw itself has several source tables/views:
real_efile.f57
real_efile.se
disclosure.cand_valid_fec_yr
efiling_amendment_chain_vw

Since we recently did a mass cleanup of old data for raw data, I was wondering if the statistics is off for the main source tables f57 and se. I had checked the statistics of all related tables and the statistics is updated recently. To be sure, I had also run a manual analysis of these base tables and the number of estimate is go haywire.
Then I took a eliminate method to comment out one source table/view a time, realize when comment out the efiling_amendment_chain_vw in the query, the estimate count is normal. In theory, a view will take advantage of the statistic data and indexes, etc from its source tables. However, in this particular case, one of the source (efiling_amendment_chain_vw) itself is a very complicated query, that makes it difficult for the optimizer to do correct estimates based on query/query (one can view views are named query).

After discussing with Laura, since all source tables/views for this query does not have huge amount of data, a direct count does not take long. It is therefore decided for this particular endpoint, skip the estimate, and directly go for the count.

@lbeaufort
Copy link
Member

Thanks @fecjjeng! Based on our conversation, because Schedue E efile doesn't have that much data, we can just tell the API to use true counts.

@fecjjeng
Copy link
Contributor

fecjjeng commented Oct 7, 2019

Quick thinking, Laura!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants