As TDP has evolved, so have its validation mechanisms, messages, and expansiveness. As a result, many of the datafiles stored in the database and S3 have not undergone TDP's latest and most stringent validation. Because data quality is important to all TDP stakeholders, we introduced a way to reparse, and subsequently re-validate, datafiles that have already been submitted to TDP, to enhance the integrity and quality of submissions. The following lays out how TDP automates and executes this process, and how the process can be tested locally and in our deployed environments.
As a safety measure, this process must ALWAYS be executed manually by a system administrator. Once executed, all processes thereafter are completely automated. The steps below outline how this process executes.
- System admin logs in to the appropriate backend application, e.g. `tdp-backend-raft`.
  - See OFA Admin Backend App Login instructions below.
- System admin executes the `clean_and_reparse` Django command, e.g. `python manage.py clean_and_reparse ...options`.
- System admin validates the command is selecting the appropriate set of datafiles to reparse and executes the command.
- `clean_and_reparse` collects the appropriate datafiles that match the system admin's command choices.
- `clean_and_reparse` executes a backup of the Postgres database.
- `clean_and_reparse` creates/deletes the appropriate Elastic indices pending the system admin's command choices.
- `clean_and_reparse` deletes documents from the appropriate Elastic indices pending the system admin's command choices.
- `clean_and_reparse` deletes all Postgres rows associated with the selected datafiles.
- `clean_and_reparse` deletes `DataFileSummary` and `ParserError` objects associated with the selected datafiles.
- `clean_and_reparse` re-saves the selected datafiles to the database.
- `clean_and_reparse` pushes a new `parser_task` onto the Redis queue for each of the selected datafiles.
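The automated sequence above can be sketched as an ordered plan. The function and step names below are hypothetical and only illustrate the order of operations; they do not mirror the actual implementation of the Django command.

```python
# Illustrative sketch of the order of operations clean_and_reparse automates.
# All names here are hypothetical, not the real implementation.

def plan_reparse(datafile_ids, new_indices=True, delete_indices=False):
    """Return the ordered list of actions taken for the selected datafiles."""
    steps = ["collect_matching_datafiles", "backup_postgres"]
    if new_indices:
        steps.append("create_new_elastic_indices")
    if delete_indices:
        steps.append("delete_old_elastic_indices")
    steps.append("delete_elastic_documents")
    steps.append("delete_postgres_rows")
    steps.append("delete_datafilesummary_and_parsererror_objects")
    for df in datafile_ids:
        steps.append(f"resave_datafile:{df}")
        steps.append(f"enqueue_parser_task:{df}")
    return steps
```

Note the ordering invariant: the Postgres backup always happens before any deletion, and each datafile is re-saved before its `parser_task` is enqueued.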
Make sure you have submitted a few datafiles, ideally across program types and fiscal timeframes.
- Browse the indices and the DAC, and verify the indices reflect the document counts you expect and the DAC reflects the record counts you expect.
- Exec into the backend container.
- Execute `python manage.py clean_and_reparse -h` to get an idea of what options you might want to specify.
- Execute the `clean_and_reparse` command with your selected options.
- Verify in the above URL that Elastic is consistent with the options you selected.
- Verify the DAC has the same number of records as in step 1.
This section assumes that you have submitted the following files: `ADS.E2J.FTP1.TS06`, `cat_4_edge_case.txt`, and `small_ssp_section1.txt`. After submitting, your indices should match the indices below:
```
index                     docs.count
.kibana_1                          1
dev_ssp_m1_submissions             5
dev_ssp_m2_submissions             6
dev_ssp_m3_submissions             8
dev_tanf_t1_submissions          817
dev_tanf_t2_submissions          884
dev_tanf_t3_submissions         1380
```
Each test below is assumed to be run INDEPENDENTLY. For each test, your Elastic and DAC state should match the initial conditions above. If you want to match the expected output, run the commands below between each test, ALWAYS in the order they appear.
- `curl -X DELETE 'http://localhost:9200/dev*'`
- `python manage.py search_index --rebuild`
- Execute `python manage.py clean_and_reparse -a -n`.
  - If this is the first time you're executing a command with new indices, the old indices will be deleted whether or not you specified `-d`, because an alias must be created in Elastic with the same name as the original index (e.g. `dev_tanf_t1_submissions`). Thereafter, the command will always respect the `-d` switch.
- Expected Elastic results.
  - If this is the first time you have run the command, the indices URL should reflect 21 indices prefixed with `dev`, and they should contain the same number of documents as the original indices did. The new indices will also have a datetime suffix indicating when the reparse occurred.
  - If this is the second time you have run the command, the indices URL should reflect 42 indices prefixed with `dev`, and each should contain the same number of documents as the original indices did. The latest indices will have a new datetime suffix delineating them from the other indices.
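The datetime-suffix naming convention can be sketched as follows; `reparse_index_name` is a hypothetical helper for illustration, not part of the codebase.

```python
from datetime import datetime

def reparse_index_name(base, when):
    """Build a reparsed index name: the original name stays as a prefix
    (and as the Elastic alias), and a datetime suffix records when the
    reparse occurred."""
    return f"{base}_{when:%Y-%m-%d_%H.%M.%S}"
```

For example, a reparse run on 2024-07-05 at 17:26:26 yields index names like `dev_tanf_t1_submissions_2024-07-05_17.26.26`, matching the listing shown later in this document.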
- Expected DAC results.
  - The DAC record counts should be exactly the same no matter how many times the command is run.
  - The primary keys for all reparsed datafiles should no longer be the same.
  - `ParserError` and `DataFileSummary` objects should be consistent with the file.
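A quick way to check the count invariant is to diff before/after snapshots of the DAC record counts; `count_mismatches` below is a hypothetical helper written under that assumption.

```python
def count_mismatches(before, after):
    """Return model names whose record counts changed across a reparse.
    A correct reparse should produce an empty list: only primary keys
    change, never the number of records."""
    return sorted(m for m in before if after.get(m) != before[m])
```

Any non-empty result means records were lost or duplicated during the reparse and warrants investigation before trusting the new indices.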
- Execute `python manage.py clean_and_reparse -a`.
  - The expected results for this command will be exactly the same as above. The only difference is that no matter how many times you execute this command, you should only see 21 indices in Elastic with the `dev` prefix.
```
health status index                                        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .kibana_1                                    VKeA-BPcSQmJJl_AbZr8gQ   1   0          1            0      4.9kb          4.9kb
yellow open   dev_ssp_m1_submissions_2024-07-05_17.26.26   mDIiQxJrRdq0z7W9H_QUYg   1   1          5            0       24kb           24kb
yellow open   dev_ssp_m2_submissions_2024-07-05_17.26.26   OUrgAN1XRKOJgJHwr4xm7w   1   1          6            0     33.6kb         33.6kb
yellow open   dev_ssp_m3_submissions_2024-07-05_17.26.26   60fCBXHGTMK31MyWw4t2gQ   1   1          8            0     32.4kb         32.4kb
yellow open   dev_tanf_t1_submissions_2024-07-05_17.26.26  19f_lawWQKSeuwejo2Qgvw   1   1        817            0    288.2kb        288.2kb
yellow open   dev_tanf_t2_submissions_2024-07-05_17.26.26  dPj2BdNtSJyAxCqnMaV2aw   1   1        884            0    414.4kb        414.4kb
yellow open   dev_tanf_t3_submissions_2024-07-05_17.26.26  e7bEl0AURPmcZ5kiFwclcA   1   1       1380            0    355.2kb        355.2kb
```
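For before/after comparisons, output like the above can be reduced to a mapping of index name to document count. The parser below is a minimal sketch assuming the standard column layout of `_cat/indices?v` output.

```python
def parse_cat_indices(text):
    """Parse `GET _cat/indices?v` output into {index_name: docs_count}.
    Assumes the first non-empty line is the header row containing the
    'index' and 'docs.count' columns."""
    lines = [l for l in text.strip().splitlines() if l.strip()]
    header = lines[0].split()
    i_idx, i_count = header.index("index"), header.index("docs.count")
    counts = {}
    for line in lines[1:]:
        parts = line.split()
        counts[parts[i_idx]] = int(parts[i_count])
    return counts
```

Capturing one snapshot before running `clean_and_reparse` and one after makes it easy to confirm each reparsed index carries the same number of documents as its predecessor.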
Running the `clean_and_reparse` command in a Cloud.gov environment will require the executor to do some exploratory data analysis for the environment to verify things are running correctly. That said, the logic and general expected results for the local example commands above are a one-to-one match with the same commands executed in Cloud.gov. Below are the general steps a system admin will follow to execute a desired command and verify its results.
- System admin logs in to the appropriate backend application, e.g. `tdp-backend-raft`.
- System admin has the DAC open and verifies the counts of records and other models before executing the command.
- System admin logs in to the environment's Elastic proxy, e.g. `cf ssh tdp-elastic-proxy-dev`.
- System admin queries the indices for their counts from the Elastic proxy: `curl 'http://localhost:8080/_cat/indices/?pretty&v&s=index'` (quote the URL so the shell doesn't interpret the `&` characters).
- System admin executes the `clean_and_reparse` Django command from the backend app, e.g. `python manage.py clean_and_reparse -a -n`.
- System admin verifies the DAC is consistent and the Elastic indices match their expectations.
```
API endpoint: api.fr.cloud.gov

$ cf login -a api.fr.cloud.gov --sso
Temporary Authentication Code ( Get one at https://login.fr.cloud.gov/passcode ): <one-time passcode redacted>

Authenticating...
OK

Select an org:
1. hhs-acf-ofa
2. sandbox-hhs

Org (enter to skip): 1
1
Targeted org hhs-acf-ofa.

Select a space:
1. tanf-dev
2. tanf-prod
3. tanf-staging

Space (enter to skip): 1
1
Targeted space tanf-dev.

API endpoint:   https://api.fr.cloud.gov
API version:    3.170.0
user:           <USER_NAME>
org:            hhs-acf-ofa
space:          tanf-dev
```
- Get the app GUID

  ```
  $ cf curl v3/apps/$(cf app tdp-backend-qasp --guid)/processes | jq --raw-output '.resources | .[]? | select(.type == "web").guid'
  <PROCESS_GUID redacted>
  ```

- Get the SSH code

  ```
  $ cf ssh-code
  <SSH_CODE redacted>
  ```

- SSH into the App

  ```
  $ ssh -p 2222 cf:<PROCESS_GUID redacted>/[email protected]
  The authenticity of host '[ssh.fr.cloud.gov]:2222 ([2620:108:d00f::fcd:e8d8]:2222)' can't be established.
  RSA key fingerprint is <KEY redacted>.
  This key is not known by any other names
  Please type 'yes', 'no' or the fingerprint: yes
  Could not create directory '/u/.ssh' (No such file or directory).
  Failed to add the host to the list of known hosts (/u/.ssh/known_hosts).
  cf:<PROCESS_GUID redacted>/[email protected]'s password: <SSH_CODE - will be invisible>
  ```
```
$ /tmp/lifecycle/shell
$ python manage.py clean_and_reparse -h
usage: manage.py clean_and_parse [-h] [-q {Q1,Q2,Q3,Q4}] [-y FISCAL_YEAR] [-a] [-n] [-d] [--configuration CONFIGURATION] [--version] [-v {0,1,2,3}] [--settings SETTINGS] [--pythonpath PYTHONPATH] [--traceback] [--no-color] [--force-color] [--skip-checks]

Delete and reparse a set of datafiles. All reparsed data will be moved into a new set of Elastic indexes.

options:
  -h, --help            show this help message and exit
  -q {Q1,Q2,Q3,Q4}, --fiscal_quarter {Q1,Q2,Q3,Q4}
                        Reparse all files in the fiscal quarter, e.g. Q1.
  -y FISCAL_YEAR, --fiscal_year FISCAL_YEAR
                        Reparse all files in the fiscal year, e.g. 2021.
  -a, --all             Clean and reparse all datafiles. If selected, fiscal_year/quarter aren't necessary.
  --configuration CONFIGURATION
                        The name of the configuration class to load, e.g. "Development". If this isn't provided, the DJANGO_CONFIGURATION environment variable will be used.
  --version             show program's version number and exit
  -v {0,1,2,3}, --verbosity {0,1,2,3}
                        Verbosity level; 0=minimal output, 1=normal output, 2=verbose output, 3=very verbose output
  --settings SETTINGS   The Python path to a settings module, e.g. "myproject.settings.main". If this isn't provided, the DJANGO_SETTINGS_MODULE environment variable will be used.
  --pythonpath PYTHONPATH
                        A directory to add to the Python path, e.g. "/home/djangoprojects/myproject".
  --traceback           Raise on CommandError exceptions
  --no-color            Don't colorize the command output.
  --force-color         Force colorization of the command output.
  --skip-checks         Skip system checks.
```