
Add org type, designation to Schedule A and B tables #3787

Closed
9 of 10 tasks
dorothyyeager opened this issue May 23, 2019 · 12 comments · Fixed by #3867

Comments

@dorothyyeager
Contributor

dorothyyeager commented May 23, 2019

User story: I was trying to search all contributions and disbursements made to corporate SSFs in 2019-2020 and set the filters accordingly. The results appeared to yield a list of contributions made to all types of committees. The first results in this screenshot, for example, show contributions to a candidate and to a labor organization SSF.

[screenshot: datatable results showing contributions to a candidate and to a labor organization SSF]

To do

  • Add org_tp (call it org_type) and cmte_dsgn (call it committee_designation) columns to the Schedule A and B tables.
    • Need to consider how to handle the information needed from Line 6 to determine Membership SSFs and nonconnected committees (see SQL below)
  • Joseph to update the Oracle procedures and packages to include org type and designation in the process that updates these fields in A and B (the "nightly refresh") used by Jean's Java program
  • Jean to update the Java programs and shell scripts to include the org_tp and cmte_dsgn columns in the Schedule A and Schedule B daily transfer process.
  • Update programs (Oracle procedures, packages, Java programs, shell scripts) to track changes to cmte_dsgn and org_tp that happen without Schedule A or B data changes.
  • Create scripts to backfill data for these new columns for the existing rows.
  • Populate the data in the Schedule A and B tables
  • Document the process so more than one person can update/maintain it
  • Unblock and merge the Python changes: Add org type to API for Schedules A and B #3841
  • Add a follow-up CMS issue to add the filters back (see fec-cms#2913, "Remove SSFs and Leadership PACs from receipts/individual contribution datatables", for the original issue that removed them)
@lbeaufort
Member

Thanks for entering this issue, @dorothyyeager! After speaking with @PaulClark2, we need organization type and designation in addition to committee type to filter by SSFs and Leadership PACs. We'd need to add these to Schedule A in order to filter by them, which is a big lift.

For now, I think we should remove those options from the receipts/individual contributions datatables, since they don't work.

@lbeaufort
Member

I'll go ahead and copy this to the CMS to remove the filters as a short-term workaround.

@lbeaufort lbeaufort changed the title Search for "all contributions to corporate SSFs" yields all contributions Add org type, designation to Schedule A datatables May 23, 2019
@PaulClark2
Contributor

PaulClark2 commented May 24, 2019

This includes joint fundraising committees. For the statistical releases we exclude joint fundraising committees.

Separate Segregated Funds (SSFs):
Corporate: committee types N, Q or U, committee designations B, J or U, organization type C
Labor: committee types N, Q or U, committee designations B, J or U, organization type L
Trade: committee types N, Q or U, committee designations B, J or U, organization type T
Membership: committee types N, Q or U, committee designations B, J or U, organization type M, connected organization name NOT NULL and not equal to NONE
Cooperative: committee types N, Q or U, committee designations B, J or U, organization type V
Corporations without stock: committee types N, Q or U, committee designations B, J or U, organization type W

Nonconnected Committees:
Independent expenditure-only political committees (Super PACs): committee type O, committee designations B, J or U
Committees with non-contribution accounts (Hybrid PACs): committee types V or W, committee designations B, J or U
Leadership PACs: committee types N or Q, committee designation D
Other nonconnected PACs: committee types N, Q or U, committee designations B, J or U, organization type IS NULL OR (organization type M AND connected organization name IS NULL) OR (organization type M AND connected organization name NONE)

This SQL is similar to what I use for the statistical releases. Note that this SQL includes joint fundraising committees.

SELECT 
(CASE WHEN cm.org_tp = 'C' THEN '1-Corporate'                        
    WHEN cm.org_tp = 'L' THEN '2-Labor'                        
    WHEN cm.org_tp = 'T' THEN '3-Trade'                        
    WHEN  ((cm.org_tp IS NULL) OR (cm.org_tp = 'M' AND cm.connected_org_nm IS NULL) OR (cm.org_tp = 'M' AND cm.connected_org_nm = 'NONE')) THEN '7-NonConnected'                    
    WHEN cm.org_tp = 'M'  AND cm.connected_org_nm IS NOT NULL AND cm.connected_org_nm <> 'NONE'  THEN '4-Membership'                        
    WHEN cm.org_tp = 'V' THEN '5-Cooperative'                        
    WHEN cm.org_tp = 'W' THEN '6-Corporation without stock'  END)  organization_type,
cm.cmte_id,
cm.cmte_nm,
cm.cmte_tp, 
cm.cmte_dsgn, 
cm.org_tp
FROM cmte_valid_fec_yr cm
WHERE cm.fec_election_yr = 2020
AND cm.cmte_tp in ('N', 'Q', 'U')
AND cm.cmte_dsgn not in ('D');

@lbeaufort
Member

The CMS issue is already here: fecgov/fec-cms#2913

@lbeaufort
Member

We should consider asking the database team whether they'd rather do this at the same time as the DB updates for #3798

@lbeaufort lbeaufort changed the title Add org type, designation to Schedule A datatables Add org type, designation to Schedule A and B datatables Jun 13, 2019
@PaulClark2
Contributor

org_tp and cmte_dsgn are in cmte_valid_fec_yr

@lbeaufort lbeaufort changed the title Add org type, designation to Schedule A and B datatables Add org type, designation to Schedule A and B tables Jun 26, 2019
@lbeaufort lbeaufort modified the milestones: Sprint 9.3, Sprint 9.4 Jul 1, 2019
@dorothyyeager dorothyyeager assigned fecjjeng and unassigned rjayasekera Jul 2, 2019
@fecjjeng
Contributor

fecjjeng commented Jul 9, 2019

The org_tp and cmte_dsgn columns have been added to the fec_fitem_sched_a and fec_fitem_sched_b tables. It is better to keep these concise column names:

  • These tables are semi-base tables, so it is better to keep the column names consistent with all the other base tables as much as we can, which makes them easier to understand and maintain.
  • Object names in the database have a length limit, and long object names are truncated. With a lengthy table or MV name plus lengthy column names, derived object names such as indexes (especially indexes with more than one column) will be truncated and cause problems.

@lbeaufort
Member

lbeaufort commented Jul 9, 2019

@fecjjeng sounds good! I think the WIP PR has those column names. https://github.com/fecgov/openFEC/pull/3837/files

@fecjjeng
Contributor

fecjjeng commented Jul 16, 2019

The following is a summary of the steps that need to be done for this ticket. A document with more detail, named ADD_UPDATE_ORG_TP_COLUMNS.txt, is in the shared drive.
Please note: this is a general list of the work needed to add columns to the sched_a/sched_b tables. The actual code changes will vary depending on the nature of the columns added.
-- ***************************************************
-- Prepare migration script. The migration script needs to include
-- ***************************************************

  1. Add column statement

  2. If the added column needs to be used in a filter, then add create index statements.
    For B-tree indexes, include the sorting fields (disb_dt/sub_id and disb_amt/sub_id) to make compound indexes; see the sketch below.

    Also update the following functions as appropriate (they can be used to create the indexes for the corresponding fec_sched_a_xxxx_yyyy/fec_sched_b_xxxx_yyyy partitions in the future):
    FUNCTION disclosure.finalize_itemized_schedule_a_tables
    FUNCTION disclosure.finalize_itemized_schedule_b_tables
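
For illustration, a minimal sketch of what these statements might look like for one Schedule B partition, assuming the fec_fitem_sched_b_YYYY_YYYY naming pattern; the schema, partition, and index names here are placeholders, not taken from the actual migration:

-- add the new columns to one partition (repeat per partition and for Schedule A)
ALTER TABLE disclosure.fec_fitem_sched_b_2019_2020 ADD COLUMN org_tp varchar(1);
ALTER TABLE disclosure.fec_fitem_sched_b_2019_2020 ADD COLUMN cmte_dsgn varchar(1);

-- compound B-tree indexes that include the sorting fields mentioned in step 2
CREATE INDEX idx_sched_b_2019_2020_org_tp_disb_dt_sub_id
    ON disclosure.fec_fitem_sched_b_2019_2020 (org_tp, disb_dt, sub_id);
CREATE INDEX idx_sched_b_2019_2020_org_tp_disb_amt_sub_id
    ON disclosure.fec_fitem_sched_b_2019_2020 (org_tp, disb_amt, sub_id);
CREATE INDEX idx_sched_b_2019_2020_cmte_dsgn_disb_dt_sub_id
    ON disclosure.fec_fitem_sched_b_2019_2020 (cmte_dsgn, disb_dt, sub_id);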

NOTE: These columns will be added to all 3 cloud environments ahead of time, at the same time, because

  1. the Java program that loads the data (the same program, just passed a different database as a parameter) runs against all environments
  2. the Java program that handles the new data needs to be in place and ready to transfer BEFORE the backfill happens.
  3. the update statements that backfill these columns for existing rows, and the index creation, are time- and resource-consuming, so they need to be performed at off-peak hours

The migration file will be submitted and merged AFTER the actual tasks (add columns, backfill data, create indexes) have been done. Therefore it needs to be written in a specific way so it will not error out IF the columns and indexes already exist.
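
One way to write it so it is safe to re-run is IF NOT EXISTS guards (a sketch, assuming PostgreSQL 9.6 or later, which supports this for both columns and indexes; same placeholder names as above):

-- adding the column is a no-op if it is already there
ALTER TABLE disclosure.fec_fitem_sched_b_2019_2020
    ADD COLUMN IF NOT EXISTS org_tp varchar(1);

-- likewise for the indexes
CREATE INDEX IF NOT EXISTS idx_sched_b_2019_2020_org_tp_disb_dt_sub_id
    ON disclosure.fec_fitem_sched_b_2019_2020 (org_tp, disb_dt, sub_id);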

-- ***************************************************
-- Update Oracle packages, procedures, triggers, tables, materialized views, and views in the intermediate database that calculate and control the Java transfer program.
-- ***************************************************
-- ***************************************************
-- Update Java programs and shell script
-- ***************************************************
-- ***************************************************
-- Add columns to the tables, start the daily transfer program, and make sure all Oracle programs, Java programs, and shell scripts execute successfully
-- ***************************************************
-- ***************************************************
-- Prepare scripts to backfill the data.
-- ***************************************************
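-- As a sketch of the kind of backfill statement involved (the join keys and batching by partition are assumptions, not the actual script):

-- backfill one Schedule B partition from the committee master table
UPDATE disclosure.fec_fitem_sched_b_2019_2020 sb
SET org_tp = cm.org_tp,
    cmte_dsgn = cm.cmte_dsgn
FROM disclosure.cmte_valid_fec_yr cm
WHERE sb.cmte_id = cm.cmte_id
  AND cm.fec_election_yr = 2020;
-- in practice this is broken into smaller batches and run off-peak (see below)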
-- ***************************************************
-- ***************************************************
-- on Postgresql Database
-- In this ticket, we also backfill the data for several tsvector columns whose content changed in issue #3798 (3 columns in sched_a and 2 columns in sched_b)
-- Temporary update triggers are added to each sched_a and sched_b partition table, so during the update activity of this ticket the triggers fire and refresh the tsvector columns
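-- For illustration only, such a temporary trigger might look roughly like the following; the real trigger function and tsvector expressions come from issue #3798, and the column names used here (recipient_name_text, recipient_nm) are assumptions:

CREATE OR REPLACE FUNCTION disclosure.tmp_refresh_sched_b_tsvectors()
RETURNS trigger AS $$
BEGIN
    -- recompute the tsvector column(s) changed in #3798 on every update
    NEW.recipient_name_text := to_tsvector('simple', coalesce(NEW.recipient_nm, ''));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tmp_sched_b_tsvector_trg
    BEFORE UPDATE ON disclosure.fec_fitem_sched_b_2019_2020
    FOR EACH ROW EXECUTE PROCEDURE disclosure.tmp_refresh_sched_b_tsvectors();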


-- ***************************************************
-- ON AWS console
-- temporarily upgrade the class of our PostgreSQL database to a more powerful instance
-- NOTE: need to upgrade both master and replica. Otherwise a slow replica will not be able to catch up; that slows down the whole process and the lag becomes too big to recover from.
-- Parameters WAL_MAX_SIZE, WAL_MIN_SIZE, WAL_KEEP_SEGMENTS will need to be increased as well
-- Also anticipate increased storage: update operations in PostgreSQL grow the tables considerably, and the new indexes need space too.
-- ***************************************************
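
-- While the backfill runs, replication lag can be watched on the master with a standard catalog query (PostgreSQL 10+ names shown; on 9.x the function and column are pg_current_xlog_location/replay_location):

select application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes
from pg_stat_replication;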

-- ***************************************************
-- on Postgresql Database


-- get "BEFORE UPDATE" size
select pg_size_pretty(pg_database_size('fec'));
-- ***************************************************
-- on Linux server
-- Execute the shell script that does the update work (it may not be possible to finish the whole job in one run; break it into smaller tasks)
-- NOTE: schedule the execution to avoid the nightly refresh window for all databases
-- NOTE: schedule the execution to avoid peak query time for the PRD database


-- ***************************************************
-- on Postgresql Database


-- get "AFTER UPDATE" size
select pg_size_pretty(pg_database_size('fec'));

-- ****************************
-- validate the update


-- ****************************************************
There are a few rows whose cmte_id does not have corresponding data in cmte_valid_fec_yr. These are all in earlier cycles.
Data from the 2007-2008 cycle onward is complete with cmte_dsgn/org_tp values.

For those few rows, the tsvector columns are not updated. Use the following statement to update them.
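
The statement itself is not reproduced here; the following is only an illustrative sketch, assuming the Schedule B tsvector column recipient_name_text is built from recipient_nm (names not confirmed from the repo):

-- directly refresh the tsvector column for rows whose committee has no cmte_valid_fec_yr match
UPDATE disclosure.fec_fitem_sched_b_2019_2020 sb
SET recipient_name_text = to_tsvector('simple', coalesce(sb.recipient_nm, ''))
WHERE NOT EXISTS
      (SELECT 1 FROM disclosure.cmte_valid_fec_yr cm WHERE cm.cmte_id = sb.cmte_id);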

-- ****************************
-- Run create indexes script


-- ****************************
-- check if all indexes are created
select tablename, count(*)
from pg_indexes
where tablename like 'fec_fitem_sched_a%'
and (indexname like 'idx_sched_a_%org_tp%' or indexname like 'idx_sched_a%_cmte_dsgn%')
group by tablename
order by tablename;

select tablename, indexname
from pg_indexes
where tablename like 'fec_fitem_sched_a%'
and (indexname like 'idx_sched_a_%org_tp%' or indexname like 'idx_sched_a%_cmte_dsgn%')
order by tablename, indexname;
-- ****************************
-- clean up the temporary UPDATE TRIGGERS that were used to update the tsvector columns from another issue
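-- Using the same placeholder names as in the trigger sketch above, the cleanup might be:

DROP TRIGGER IF EXISTS tmp_sched_b_tsvector_trg ON disclosure.fec_fitem_sched_b_2019_2020;
DROP FUNCTION IF EXISTS disclosure.tmp_refresh_sched_b_tsvectors();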


-- ***************************************************
-- ON AWS console
-- downgrade the class back to the regular class we use.
-- NOTE: apply this to both master and replica.
-- Parameters WAL_MAX_SIZE, WAL_MIN_SIZE, WAL_KEEP_SEGMENTS need to be set back to their original values
-- Review the storage size increase and make it official via Terraform.
-- ***************************************************

@fecjjeng
Contributor

The DEV/STG/PRD databases have been backfilled and the indexes created. The following is a summary of the time consumed and space increased by this task.
NOTE: The time indicated is the actual running time, using the most powerful class of AWS database we can use; it does NOT include the lag time in between runs to avoid peak time and the nightly refresh window, so it is a minimum. The local database is usually 2 to 3 times faster than the cloud databases in their regular classes.

Local backup database:
Time: 57 hours

DEV:
Time: 22 hours 48 minutes
Space increased: 442 GB

STG:
Time: 17 hours 26 minutes
Space increased: 458 GB

PRD:
Time: 19 hours 31 minutes
Space increased: 315 GB

@fecjjeng
Contributor

Unblocked #3841 and moved it to the ready column.

@fecjjeng
Contributor

A CMS ticket, fecgov/fec-cms#3054, has been opened to add the filters back.
