This repository contains the data processing infrastructure for the Criminal Justice Administrative Records System (CJARS).
The execution of the various processing stages is managed by master_code.do
. The control scripts for each stage refer to the metadata in Dataset.csv
, which tracks the completion date for each stage along with other dataset-specific information used in the production process.
The production pipeline consists of four stages, listed here along with their respective primary control scripts and relevant sub-stages:
- Localization
/1_localization/localize.do
/1_localization/Utility_Programs/localize.ado
/1_localization/<dataset_id>/convert.do
andfix.do
- Standardization
/2_standardization/standardize.do
/2_standardization/<dataset_id>/clean_pii.do
/2_standardization/impute_race_gender.do
- Entity Resolution
/3_entity_resolution/entity_resolution.do
/3_entity_resolution/state_roster.do
/3_entity_resolution/cjars_id_push_state.do
- Harmonization
/4_harmonization/harmonize.do
/4_harmonization/1_prepared/<dataset_id>/clean_<table>.do
/4_harmonization/2_combined/data_coverage.do
/4_harmonization/2_combined/combine.do
/4_harmonization/2_combined/all_state_combine.do
Most processing is handled in dataset-specific scripts stored in a directory structure that follows each dataset's dataset ID, an identifier with the following format:
<state abbreviation>/<sub-state geographic type>/<city or county name>/<agency type>/<intake date>
For example, a dataset from a municipal source:
CA/Mu/Example_City/Police/20220101
State-level datasets (e.g. Department of Corrections) exclude the city or county name:
CA/St/DOC/20220101
Dataset IDs double as identifiers as well as filepaths which, when used in combination with the base folders for different data processing stages, point to the location of production code for each dataset.
For definitions of macros that occur frequently in CJARS code, as well as documentation for the fields in Dataset.csv
, see the Glossary at the bottom of this readme.
Note: The code in this repository (including most of the macros described above) relies on a variety of resources only available on the CJARS secure server, including raw and processed data along with various utility files and supplemental data and crosswalks. The code is not intended to run outside of this environment and is provided for perusal only. All dataset-specific code has been redacted and replaced with a sample dataset with the dataset ID CA/St/DOC/20220101
.
Localization is the process of converting raw files (plaintext, csv, Excel, Microsoft Access, etc.) into Stata's .dta
format, along with any cleaning necessary to make the data usable. This is generally limited to things like restoring missing variable names and checking types, so that the converted data is as 'raw' as possible.
The master localization script in /1_localization/Utility_Programs/localize.ado
handles most raw files without intervention. For datasets that do not automatically convert successfully, a dataset-specific script, /1_localization/<dataset_id>/convert.do
, is used to handle individual files; once all files are converted, any additional custom processing on the entire batch of files is taken care of in the optional fix.do
.
Standardization is the process of extracting personally identifying information (PII) from the raw data and coding it into a standard CJARS roster format. On the level of the individual dataset, standardization is handled by /2_standardization/<dataset_id>/clean_pii.do
. A template clean_pii.do
script is provided that demonstrates the process of loading, cleaning, and extracting PII from the raw data.
As a sub-stage of standardization, race and sex are imputed based on name prevalence by region. No dataset-specific code is required; the imputation script /2_standardization/impute_race_gender.do
runs automatically for each dataset during production.
In order to track individuals across multiple criminal justice events, the entity resolution stage uses a probabilistic matching algorithm to link individual records and with a unique person-level identifier called a cjars_id
. The sub-stages of entity resolution perform this matching within a given dataset, extend this matching process to the state level, and then build state rosters and push the resulting unique identifiers to the original raw data, which are stripped of PII.
Harmonization is the process of bringing raw criminal justice data into the CJARS schema. For example, dates are split into separate year, month, and day fields and assigned to the appropriate variables for offense date, case filing date, disposition date, and so on. Offense grades are standardized, either from explicitly coded variables or from analysis of text descriptions, and a machine learning model matches raw offense descriptions to a uniform set of offense codes. Dataset-level harmonization scripts are located in /4_harmonization/1_prepared/<dataset_id>/clean_<table>.do
, and the provided template files correspond with the CJARS tables for adjudication, arrest, parole, probation, and incarceration.
After harmonization, another sub-stage combines the harmonized files of each type by state, and then across all states. The combine stage also performs deduplication to account for records that appear in more than one extract from the same data source.
codedir
- Path of root code repository
VALIDATE
- Set to
"0"
inmaster_code.do
to skip validation codes during harmonization for faster processing time
- Set to
crossdir
- Root directory for utility data such as crosswalks and documentation files
tempdir
- A temporary directory unique for each instance of Stata - primarily used to store data files during localization that require additional processing
formatdir
- Root output folder for storing formatted data files after successful localization. This macro is concatenated with dataset ID to specify dataset-specific output folder
cleanpiidir
- Root output folder for storing data files containing cleaned PII after standardization. This macro is concatenated with dataset ID to specify dataset-specific output folder that contains
cleaned_pii.dta
,data_level_roster.dta
,pii_var.txt
,pointer_file.dta
, andrecord_id_cjars_id_crosswalk.dta
.
- Root output folder for storing data files containing cleaned PII after standardization. This macro is concatenated with dataset ID to specify dataset-specific output folder that contains
modeldir
- Root code folder for entity resolution codes
rosterdir
- Root output folder for storing state-specific roster files after successful entity resolution run
anondir
- Root output folder for storing anonymized data files after merging in
cjars_id
from entity resolution and dropping all of the variables listed inpii_var.txt
from standardization
- Root output folder for storing anonymized data files after merging in
datasetid
- Unique identifier to record each data extract. See below for syntax used to generate each
datasetid
s
- Unique identifier to record each data extract. See below for syntax used to generate each
pii_sufficient
- Binary variable to indicate that the raw data contains full personally identifiable information (PII). At the minimum, the raw data should provide offender's first and last names, and full date of birth.
active_use
- Binary variable to indicate whether the dataset should be included in the production run. If set to
0
for inactive use, theinactive_reason
column should be filled in with a brief description
- Binary variable to indicate whether the dataset should be included in the production run. If set to
inactive_reason
- Filled in for inactive datasets to describe the reason for excluding them in production run
pivot_table
- Binary variable to indicate whether the data files are relational (set to
1
). If so,pivot_id
must be filled in with relevant variable names
- Binary variable to indicate whether the data files are relational (set to
pivot_id
- Variable name(s) used to link one data file to another. Primarily used to propagate
cjars_id
to non-roster data files (data files without PII) when generating anonymized files
- Variable name(s) used to link one data file to another. Primarily used to propagate
state
- State from where the data was sourced
census_region
- Census region where the state is in for metadata
geography
- Geography of where the data is from. For state-wide data,
geography
is set to the state name while county or city names are used for data from lower-level agencies
- Geography of where the data is from. For state-wide data,
metro
- Metropolitan statistical areas (MSA) of the data provider
public
- Variable to denote whether data is publicly available (set to
1
) or not (set to0
)
- Variable to denote whether data is publicly available (set to
source
- Mechanism for data acquisition (e.g. scraping, data use agreement, freedom of information act)
sde_intake
- Date when the dataset was loaded on to the Secure Data Enclave (SDE) server
*_assigned
- Abbreviation of CJARS staff working on the production code for
*
processing stage
- Abbreviation of CJARS staff working on the production code for
*_cd_start_dt
- Date when the person
*_assigned
started working on the production code
- Date when the person
*_reviewer
- Abbreviation of CJARS staff who reviewed the production code written by
*_assigned
- Abbreviation of CJARS staff who reviewed the production code written by
*_cd_cmpl_dt
- Date when the
*_reviewer
finishes reviewing the production code. These fields must be filled in to enable production run.
- Date when the
*_run
- Date of last production run for
*
processing stage. These variables are automatically filled by runningmaster_code.do
. If the*_run
date is earlier than*_cd_cmpl_dt
(e.g.localize_run
=20220301
andlocalize_cd_cmpl_dt
=20220331
),master_code.do
will re-run the code.
- Date of last production run for
impute_run
- Date of last production run for imputing gender and race. Automatically updated from production run via
impute_race_gender.do
- Date of last production run for imputing gender and race. Automatically updated from production run via
entity_resolution_run
- Date of last entity resolution run. Automatically updated from production run via
entity_resolution.do
- Date of last entity resolution run. Automatically updated from production run via
supercluster_flag
- Binary variable to indicate if supercluster of
cjars_id
was created from entity resolution
- Binary variable to indicate if supercluster of
alias_drop
- Binary variable to indiicate whether alias data should be dropped during entity resolution
state_roster_run
- Date of last time state roster was generated
cjars_id_state_push_run
- Date of last time anonymized files (
cjars_id
merged in) was generated for harmonization
- Date of last time anonymized files (
toc_vintage
- The specific version of text-based offense classification tool to use for classifying charge descriptions