Scripts to generate endpoint, drug, and socioeconomic-demographic matrices in FinRegistry
- Introduction
- Creating matrices from endpoints
- Creating matrices from drug purchases
- Creating matrices from other data sources
- Creating matrices from kanta lab data
The generators can now output multiple records per individual!
The same individual can appear multiple times in your SampleFile, with different dates for the inclusion periods. For each inclusion period of each individual, the program outputs one record with FINREGISTRYID as the first column, followed by LowerAge and UpperAge, which give the lower and upper bounds of the inclusion period in terms of the individual's age. The output is sorted by these three columns, in that order. The code allows at most 20 different inclusion periods per individual in your SampleFile. If that is not enough, you can raise the limit by changing
#define MaxRec 20
to a larger number in row 12/13 of the source code and recompiling :D
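As a rough illustration of the record layout described above, the following Python sketch turns inclusion dates into age bounds and sorts the records by the three key columns. It is not the C implementation: the 365.25-day year and the helper names (age_at, inclusion_rows) are assumptions made for this example only.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: the C code may use a different convention


def age_at(birth: date, when: date) -> float:
    """Age in years at `when`, rounded to two decimals."""
    return round((when - birth).days / DAYS_PER_YEAR, 2)


def inclusion_rows(samples):
    """samples: iterable of (finregistryid, birth, start, end) tuples.

    Returns (FINREGISTRYID, LowerAge, UpperAge) rows sorted by those columns.
    """
    rows = [(fid, age_at(b, s), age_at(b, e)) for fid, b, s, e in samples]
    return sorted(rows)


# Hypothetical input: FRXXXXXX1 appears twice with different inclusion periods.
samples = [
    ("FRXXXXXX2", date(1991, 2, 5), date(2000, 1, 1), date(2015, 12, 31)),
    ("FRXXXXXX1", date(2001, 1, 1), date(2010, 1, 1), date(2020, 12, 31)),
    ("FRXXXXXX1", date(2001, 1, 1), date(2005, 1, 1), date(2009, 12, 31)),
]
for row in inclusion_rows(samples):
    print(*row, sep="\t")
```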
All examples that appear below are hypothetical and do not contain any real data.
You can now use a single parameter file like the one below to run both the wide endpoint generator and the wide drug generator. In this case, some parameters are shared between the two programs. Using separate parameter files still works if you prefer. Please see below for more details.
| Param | Description | Type | Default |
|---|---|---|---|
| SampleFile | List of FinRegistry IDs to include | file | None |
| OutputPrefix | Output path | str | None |
| LongEndPtFile | Longitudinal endpoints file path (input data) | str | None |
| EndPtList | List of endpoints to be included | file | None |
| EndPtByYear | Endpoint output file by year | bool | F |
| EndPtOutputEventCount | Endpoint output file with event counts | bool | F |
| EndPtOutputBinary | Endpoint output file with binary indicators | bool | F |
| EndPtOutputAge | Endpoint output file with age at onset | bool | F |
| LongFile | Detailed longitudinal file path (input data) | str | None |
| DrugList | List of drugs (ATC codes or truncated ATC codes) to be included | file | None |
| DrugByYear | Drug output file by year | bool | F |
| DrugOutputEventCount | Drug output file with event counts | bool | F |
| DrugOutputBinary | Drug output file with binary indicators | bool | F |
| DrugOutputAge | Drug output file with age at first purchase | bool | F |
| RegSource | Source registry to be considered for drugs | str | None |
| DrugMultiplyPackage | Drug output file with counts weighted by the number of packages | bool | T |
| PadNoEvent | Output zeros if the sample does not exist in the longitudinal file; if F, only samples that appear in the longitudinal file will be in the output. | bool | F |
Creates a wide matrix (sample x feature) of disease endpoints from the longitudinal file.
This is still a (hopefully) working prototype, please let Zhiyu know if anything goes wrong when you are using it. She will try to work it out (hopefully).
To compile, download the code and run
gcc MakeEndPtFile.c -o Where/and/what/you/want/it/to/be -lm -fPIC
To execute, run
./Where/and/what/you/want/it/to/be ParamFile
See below for an example of ParamFile:
SampleFile SampleList
OutputPrefix OutputPrefix
LongEndPtFile LongitudinalEndPtFile
EndPtList EndPtList
EndPtByYear T/F
EndPtOutputEventCount T/F
EndPtOutputBinary T/F
EndPtOutputAge T/F
SampleFile is a file with a header row listing the samples you would like to include in the output. It should have four columns in the following order: FINREGISTRYID, date of birth of the sample, lower bound of the record inclusion date for the sample, and upper bound of the record inclusion date for the sample. All dates should be in yyyy-mm-dd format. The output will only include records for the sample that fall between the lower bound date and the upper bound date, including both ends of the window. This allows each sample to have a different inclusion period. The program should be able to accept space, comma or tab separated files (please let Zhiyu know if it doesn't). See below for a demo of this file:
FINREGISTRYID date_of_birth start_of_followup end_of_followup
FRXXXXXX1 2001-01-01 2005-01-01 2020-12-31
FRXXXXXX2 1991-02-05 2000-01-01 2015-12-31
FRXXXXXX5 1984-06-13 2001-09-01 2004-12-31
FRXXXXXX8 2007-10-29 2005-12-01 2021-12-31
FRXXXXX10 1997-04-11 2001-04-01 2020-12-31
...
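Before running the generator, it can help to sanity-check the SampleFile rows. The sketch below is a hypothetical Python helper (parse_sample_line is not part of this repository) that accepts space, comma or tab separated rows, enforces the yyyy-mm-dd format and rejects windows that end before they start. The C program may be more or less strict than this.

```python
import re
from datetime import datetime


def parse_sample_line(line):
    """Split one SampleFile data row on spaces, commas or tabs and parse its dates."""
    fields = [f for f in re.split(r"[ ,\t]+", line.strip()) if f]
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    fid, *raw_dates = fields
    # strptime raises ValueError if a date is not in yyyy-mm-dd form
    birth, start, end = (datetime.strptime(d, "%Y-%m-%d").date() for d in raw_dates)
    if start > end:
        raise ValueError(f"{fid}: inclusion window starts after it ends")
    return fid, birth, start, end


print(parse_sample_line("FRXXXXXX1 2001-01-01 2005-01-01 2020-12-31"))
```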
OutputPrefix is where you would like the output to be written. The program should output a file named OutputPrefix.EndPt.
LongEndPtFile is the longitudinal input file to be converted; it looks like /data/processed_data/endpointer/longitudinal_endpoints_no_omits_DF10_2022_09_29.txt.ALL. It should have at least the FINREGISTRYID, ENDPOINT and EVENT_AGE columns. EVENT_YEAR is optional, but the program will complain if this column is missing while an option that requires year information is specified. Zhiyu is aware that the year is technically implied by EVENT_AGE, but she was too lazy to add that for this version. She will update this in the near future :)
The file should be sorted first by FINREGISTRYID and then by EVENT_AGE. To the best of Zhiyu's knowledge, files generated by Andrius' code are already sorted this way. If you do any preprocessing on the longitudinal file, please make sure it is still sorted.
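If you have preprocessed the longitudinal file, a quick ordering check along the following lines may be useful. This is a minimal Python sketch, not part of the repository. It assumes that "sorted" means plain string order on FINREGISTRYID followed by numeric order on EVENT_AGE, and that the file is comma-delimited; adjust the delimiter argument if yours is not.

```python
import csv
import io


def is_sorted_longitudinal(handle, delimiter=","):
    """Return True if rows are ordered by FINREGISTRYID, then by EVENT_AGE."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    previous = None
    for row in reader:
        key = (row["FINREGISTRYID"], float(row["EVENT_AGE"]))
        if previous is not None and key < previous:
            return False
        previous = key
    return True


# Hypothetical three-row file held in memory; pass an open file handle in practice.
demo = "FINREGISTRYID,ENDPOINT,EVENT_AGE\nFR1,Endpt1,10.5\nFR1,Endpt2,12.0\nFR2,Endpt1,3.2\n"
print(is_sorted_longitudinal(io.StringIO(demo)))
```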
EndPtList is a file without a header, with one column listing the endpoints you would like to include. The program will look for exact matches of the endpoint names in the ENDPOINT column of the longitudinal file.
PadNoEvent is a boolean that specifies whether samples with no events within the inclusion period should be padded with 0s. This parameter works for both the endpoint and the drug file generator. Please input T or F. F means that the output will only include rows with at least one non-zero value. As an example for the SampleFile above, if individual FRXXXXXX2 does not have any endpoints from the given EndPtList during the inclusion period 2000-01-01 to 2015-12-31, then under the default F for PadNoEvent the output will not include the corresponding row, i.e. as below:
FINREGISTRYID Endpt1 EndPt2 Endpt3 ...
FRXXXXXX1 xx xx xx ...
FRXXXXXX5 xx xx xx ...
FRXXXXXX8 xx xx xx ...
FRXXXXX10 xx xx xx ...
...
Whereas if T is selected, the output will look like
FINREGISTRYID Endpt1 EndPt2 Endpt3 ...
FRXXXXXX1 xx xx xx ...
FRXXXXXX2 0 0 0 ...
FRXXXXXX5 xx xx xx ...
FRXXXXXX8 xx xx xx ...
FRXXXXX10 xx xx xx ...
...
EndPtByYear is a boolean indicating whether you would like to output sample info by year. Please input T or F. Output by year means that each row of the output is an individual's record for a certain year within the inclusion period. Note that each row contains only events that occurred in a specific year, and if nothing happens in a year, no row of zeros is written for it. That is, the output can look like below, with some gaps between years:
FINREGISTRYID Year Endpt1 EndPt2 Endpt3 ...
FRXXXXXX1 2001 xx xx xx ...
FRXXXXXX1 2002 xx xx xx ...
FRXXXXXX1 2005 xx xx xx ...
FRXXXXXX2 2001 xx xx xx ...
FRXXXXXX2 2004 xx xx xx ...
...
If F is chosen, the program outputs one row for each sample without the Year column, as below:
FINREGISTRYID Endpt1 EndPt2 Endpt3 ...
FRXXXXXX1 xx xx xx ...
FRXXXXXX2 xx xx xx ...
FRXXXXXX5 xx xx xx ...
FRXXXXXX8 xx xx xx ...
FRXXXXX10 xx xx xx ...
...
EndPtOutputEventCount is a boolean indicating whether you would like to output the number of occurrences of each endpoint. It counts the number of times the endpoint appears in the longitudinal file for a sample within the inclusion period. Zhiyu noticed that the same endpoint occurring at a certain time/age can sometimes appear more than once in the longitudinal file with different EVENT_TYPE values (e.g. HILMO, ERIK_AVO etc.). If that happens, the current code counts them as more than one occurrence. She is not sure if this is the right way to handle it (or if all occurrences at the same age should be counted only once) and will double check with the team. Please also let her know what you think is best. The output should look like this:
FINREGISTRYID Endpt1_nEvent EndPt2_nEvent Endpt3_nEvent ...
FRXXXXXX1 10 0 3 ...
FRXXXXXX2 0 4 1 ...
FRXXXXXX5 0 0 2 ...
FRXXXXXX8 5 1 0 ...
FRXXXXX10 0 0 3 ...
...
EndPtOutputBinary is a boolean indicating whether you would like to output a binary indicator for each endpoint. It outputs one if the endpoint occurs in the longitudinal file for a sample within the inclusion period and zero otherwise. The output will look like this:
FINREGISTRYID Endpt1 EndPt2 Endpt3 ...
FRXXXXXX1 1 0 1 ...
FRXXXXXX2 0 1 1 ...
FRXXXXXX5 0 0 1 ...
FRXXXXXX8 1 1 0 ...
FRXXXXX10 0 0 1 ...
...
EndPtOutputAge is a boolean indicating whether you would like to output the age of onset for each endpoint. It outputs the sample's age at the earliest occurrence of the endpoint in the longitudinal file within the inclusion period, and zero if the endpoint does not occur. The ages are rounded to two decimal places. The output should look like this:
FINREGISTRYID Endpt1_OnsetAge EndPt2_OnsetAge Endpt3_OnsetAge ...
FRXXXXXX1 10.23 0 15.36 ...
FRXXXXXX2 0 20.14 34.25 ...
FRXXXXXX5 0 0 32.30 ...
FRXXXXXX8 24.35 29.31 0 ...
FRXXXXX10 0 0 41.23 ...
...
These boolean output parameters are not mutually exclusive, i.e. you can output both binary indicators and event counts and, on top of them, age of onset, all in a by-year manner. If so, the output will look something like below:
FINREGISTRYID Year Endpt1_nEvent Endpt1 Endpt2_nEvent Endpt2 Endpt3_nEvent Endpt3 Endpt1_OnsetAge EndPt2_OnsetAge Endpt3_OnsetAge ...
FRXXXXXX1 2001 8 1 3 1 0 0 10.23 10.18 0.00 ...
FRXXXXXX1 2002 2 1 0 0 0 0 11.24 0.00 0.00 ...
FRXXXXXX1 2005 0 0 1 1 2 1 0.00 14.13 14.67 ...
FRXXXXXX2 2001 0 0 0 0 3 1 0.00 0.00 23.45 ...
FRXXXXXX2 2004 2 1 0 0 4 1 26.14 0.00 26.73 ...
...
Please choose what you would like to output accordingly. The output file will be tab-separated.
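To make the meaning of the three output types concrete, here is a small Python sketch that derives the event count, binary indicator and onset age for one individual from a list of longitudinal events. It mirrors the description above (inclusive age window, duplicate rows counted separately, zero when there is no event) but is not the C implementation; the function name summarise and the endpoint names are placeholders.

```python
from collections import defaultdict


def summarise(events, endpoints, lower_age, upper_age):
    """events: iterable of (endpoint, event_age) pairs for ONE individual.

    Returns {endpoint: {"nEvent", "binary", "OnsetAge"}} for the listed endpoints,
    using only events with lower_age <= event_age <= upper_age.
    """
    counts = defaultdict(int)
    onset = {}
    for endpoint, age in events:
        if endpoint in endpoints and lower_age <= age <= upper_age:
            counts[endpoint] += 1  # repeated rows at the same age count separately
            onset[endpoint] = min(age, onset.get(endpoint, age))
    return {
        ep: {
            "nEvent": counts[ep],
            "binary": int(counts[ep] > 0),
            "OnsetAge": round(onset.get(ep, 0.0), 2),  # 0 means no event in window
        }
        for ep in endpoints
    }


# Hypothetical events; the Endpt2 event falls outside the 30-60 age window.
events = [("Endpt1", 40.126), ("Endpt1", 41.5), ("Endpt2", 70.0)]
print(summarise(events, ["Endpt1", "Endpt2"], 30.0, 60.0))
```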
Creates a wide matrix (sample x feature) of drug purchases from the detailed longitudinal file. Works similarly to the endpoint file generator above.
This is again just a prototype. Zhiyu has recently tested it on a larger scale (~3M samples) and it was working for her. But please let her know if you find some of the many problems.
To compile, download the code and run
gcc MakeDrugFile.c -o Where/and/what/you/want/it/to/be -lm -fPIC
To execute, run
./Where/and/what/you/want/it/to/be ParamFile
See below for an example of ParamFile:
SampleFile SampleList
OutputPrefix OutputPrefix
LongFile LongitudinalFile
DrugList DrugList
DrugByYear T/F
DrugOutputEventCount T/F
DrugOutputBinary T/F
DrugOutputAge T/F
RegSource PURCH
DrugMultiplyPackage T/F
SampleFile and OutputPrefix are parameters shared with the wide endpoint generator above. The program should output a file named OutputPrefix.Drug. DrugByYear, DrugOutputEventCount, DrugOutputBinary, and DrugOutputAge work in the same way as EndPtByYear, EndPtOutputEventCount, EndPtOutputBinary, and EndPtOutputAge.
LongFile here is the detailed longitudinal input file to be converted; it looks like /data/processed_data/detailed_longitudinal/detailed_longitudinal.csv. It should have at least FINREGISTRYID, SOURCE, EVENT_AGE and a column named CODE1 as the feature name column. It will also look for PVM or EVENT_YRMNTH, and will complain if neither is found when the DrugByYear option is chosen, which Zhiyu will fix at some point. A difference from the endpoint generator: this drug program can output the number of drug purchases multiplied by the number of packages. If this option is chosen, a column named CODE4 indicating the number of packages should be included. Otherwise the option will be overridden and set to false.
RegSource is the source registry whose records are to be considered, and should be PURCH for drug matrix generation. Technically this program can also be used to generate a wide matrix from other sources, as long as the source is specified here and a corresponding DrugList is provided. In that case, the DrugList need not contain "drugs" specifically, but any other register codes you would like to include as analysis features. The program will try to match the listed codes with column CODE1 in the detailed longitudinal file.
DrugMultiplyPackage is a boolean indicating whether you want the output occurrence count to be weighted by the number of packages. If T, the event count basically becomes the number of packages a sample purchased within the inclusion period. Otherwise it is the number of times the sample made a purchase. Be careful using this when multiple drugs/ATC codes fall into one feature in the input drug list. Does it still make sense to count the total number of packages? Zhiyu is not very familiar with drug codes, so please choose according to your analysis goal.
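The difference between the two counting modes can be sketched in a few lines of Python. The helper purchase_count is hypothetical and only illustrates the arithmetic; the package numbers stand in for the CODE4 column.

```python
def purchase_count(purchases, multiply_package):
    """purchases: list of (code, n_packages) rows for one feature and one individual."""
    if multiply_package:
        return sum(n_packages for _, n_packages in purchases)  # total packages
    return len(purchases)  # number of purchase events


# Hypothetical purchases of two different drugs that both fall under feature J01.
purchases = [("J01CA08", 2), ("J01CE02", 1), ("J01CA08", 3)]
print(purchase_count(purchases, multiply_package=True))
print(purchase_count(purchases, multiply_package=False))
```

Note how the weighted count adds up packages of different drugs when a truncated code groups them, which is the situation the caution above refers to.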
DrugList is a list of ATC codes or truncated ATC codes that you want to include in the output. A truncated ATC code is the first n characters of an ATC code. The program compares each code in the CODE1 column of the input detailed longitudinal file, from its beginning, with each code in the given list, and looks for a match on the first n characters. n can vary for each code in the list. For example, you can input a drug list that looks like below:
A02BC02
C07AB
R0
J01
...
where only A02BC02 is a full-length ATC code, which will be matched exactly. The other columns will count occurrences of the sample purchasing any drug whose code starts with the given item, i.e. J01CA08, J01CE02, J01EA01 ... all start with J01, so they will be counted in that column.
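The prefix matching described above can be illustrated with the following Python sketch. It uses a simple linear scan rather than the binary search used by the C program, and the function name matching_feature is an assumption made for this example.

```python
def matching_feature(code, drug_list):
    """Return the first DrugList entry that `code` starts with, or None."""
    for prefix in drug_list:
        if code.startswith(prefix):
            return prefix
    return None


drug_list = ["A02BC02", "C07AB", "R0", "J01"]
for code in ["J01CE02", "A02BC02", "A02BC01"]:
    print(code, "->", matching_feature(code, drug_list))
```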
A known "problem" which Zhiyu is working on fixing: if you input a drug list with ATC codes that encompass each other, e.g. a list with both J01 and J01CE, then a purchase of, say, J01CE02 or J01CE01 will be counted in only one of these two columns, depending on which code is found first through binary search in your list. She thinks it's a rather rare case and is not sure if anyone would want to do something like that, but please avoid it now that you have seen this!
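One way to guard against this is to check your drug list for nested codes before running the generator. The sketch below is a hypothetical Python helper, not part of the repository; it reports every pair of entries where one is a prefix of the other.

```python
def nested_codes(drug_list):
    """Return (shorter, longer) pairs where one list entry is a prefix of another."""
    codes = sorted(set(drug_list))
    return [
        (shorter, longer)
        for i, shorter in enumerate(codes)
        for longer in codes[i + 1:]
        if longer.startswith(shorter)
    ]


print(nested_codes(["J01CE", "A02BC02", "J01", "R0"]))
```

An empty result means no purchase can match more than one entry in the list.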
This script is still quite lightly tested, so please report any issues and/or suggestions to Tuomo.
This script creates FinRegistry matrices from other data sources than endpoints and drug purchases. See figure below for a graphical summary of the scope of the variables that can be included in the output file. All the currently available variables are listed in the file
documents/selected_variables_v2.csv
The script runs using only packages installed in the shared_env
environment of ePouta machines.
As the script is pure Python, it does not require installation or compilation. The easiest way to use the script is to download the code from this repository as a zip-file to your own computer, and then transfer it to ePouta following the instructions written in the Master document.
See sections below for more detailed instructions.
python /path/to/script/MakeRegFile.py -h
usage: MakeRegFile.py [-h] [--configfile CONFIGFILE] [--logfile LOGFILE]
optional arguments:
-h, --help show this help message and exit
--configfile CONFIGFILE
Full path to the configuration file.
--logfile LOGFILE Full path to the log file.
Similarly to the other matrix generation scripts above, a configuration file (must be tab-delimited) is required with the following entries:
CpiFile	CpiFile
MinimalPhenotypeFile	MinimalPhenotypeFile
MarriageHistoryFile	MarriageHistoryFile
PedigreeFile	PedigreeFile
LivingExtendedFile	LivingExtendedFile
SESFile	SESFile
EducationFile	EducationFile
SocialAssistanceFile	SocialAssistanceFile
PensionFile	PensionFile
BenefitsFile	BenefitsFile
IncomeFile	IncomeFile
RelativesFile	RelativesFile
SocialHilmoFile	SocialHilmoFile
BirthFile	BirthFile
SampleFile	SampleList
FeatureFile	VariableList
OutputFile	OutputPrefix
ByYear	T/F
OutputEventCount	T/F
OutputBinary	T/F
OutputAge	T/F
See an example file from example/ses_config
to see which registry files are used as input. One should normally not need to change paths to the input files unless some registry file is updated to a newer version.
The SampleFile specifies which individuals to include in the output and which follow-up periods to use for each of the individuals when collecting the variable values. Note that only data entries occurring within the individual-specific follow-up periods are used to construct the output. The same individual can appear in the SampleFile multiple times as long as the follow-up periods are different (FINREGISTRYID and follow-up start and end dates define unique keys). Below you can see how this file should be structured (notice that the column headers must be exactly as specified here and the columns should be comma-delimited):
FINREGISTRYID,date_of_birth,start_of_followup,end_of_followup
FRXXXXXX1,2001-01-01,2005-01-01,2020-12-31
FRXXXXXX2,1991-02-05,2000-01-01,2015-12-31
FRXXXXXX5,1984-06-13,2001-09-01,2004-12-31
FRXXXXXX8,2007-10-29,2005-12-01,2021-12-31
FRXXXXX10,1997-04-11,2001-04-01,2020-12-31
...
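If you build the SampleFile programmatically, something like the following Python sketch keeps the header exact and rejects repeated (FINREGISTRYID, start, end) keys. The helper write_sample_file is hypothetical and the rows are made-up examples.

```python
import csv
import io

HEADER = ["FINREGISTRYID", "date_of_birth", "start_of_followup", "end_of_followup"]


def write_sample_file(rows, handle):
    """Write rows as comma-delimited text; reject duplicate (ID, start, end) keys."""
    seen = set()
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(HEADER)
    for fid, birth, start, end in rows:
        key = (fid, start, end)
        if key in seen:
            raise ValueError(f"duplicate follow-up period for {fid}")
        seen.add(key)
        writer.writerow([fid, birth, start, end])


# Written to memory here; pass an open file handle to write to disk instead.
buffer = io.StringIO()
write_sample_file(
    [
        ("FRXXXXXX1", "2001-01-01", "2005-01-01", "2009-12-31"),
        ("FRXXXXXX1", "2001-01-01", "2010-01-01", "2020-12-31"),
    ],
    buffer,
)
print(buffer.getvalue())
```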
Here, FeatureFile is a file with one column listing all variables to use in the output. See the example in documents/selected_variables_v2.csv, which contains all the implemented features. If you don't need all of the features, you can make the code run faster by including only the rows that you need.
All other parameters work exactly as described above for the generation of the drug and endpoint matrices, except for OutputEventCount, which has not been implemented yet.
Output matrices are formatted similarly to what is described above for the drug and endpoint matrices. Output is written to the path defined in the config file. Notice that output is only written for variables that are included in the FeatureFile. A log file is also written, including the config used to invoke the script and any warnings. The checks performed are listed below.
- Checks that all input files can be read before starting preprocessing.
- Reports a warning in the log file if requested age ranges are outside the coverage of any of the registries (NOT IMPLEMENTED YET, USER NEEDS TO CHECK THEMSELVES!).
All C++ source files are provided so that you can change, adapt, fix and compile them yourselves; ready-made executables are provided as well. Do note that you will likely need to compile them locally, since some of the code uses boost, which is not available in ePouta.
After adjusting the config file and preparing the sample and OMOP files, you can run the full program with the simple command
make run_kanta_lab_matrix
This runs two steps. It first creates a file with the summary statistics for each pair of OMOP ID and lab unit for each individual, where each set of <FINREGISTRYID, OMOP_ID, LAB_UNIT> has its own row.
exec/indv_sumstats config_file
Then it creates a single file based on the selected summary statistic, e.g. the mean value. Here each row is a single individual and each column is the selected statistic (e.g. the mean) of one selected OMOP_ID and LAB_UNIT pair.
exec/kanta_lab_matrix config_file
This way you can also rerun the second step and choose a different summary stat value. Or you can only run the first step if you are interested in further detailed summary statistics.
You can find an example of a config file under configs/kanta_lab_matrix_mean.config
.
The config file expects at least the following two entries (the delimiter can be tab, comma or semicolon):
- KantaLabFile: The complete path to the kanta lab data file (so likely something like /data/processed_data/kela_lab/kanta_lab_20xx-xx-xx.csv).
- ResDirPath: The path to the results directory where you want your results saved.
It is also a good idea to pass:
- ResFilePrefix: The default is kanta_lab_, but you may want it to be more specific.
Depending on which step you are performing, you additionally need:
- SampleFile: The complete path to the sample file, as described in the sections above.
- OmopFile: The OMOP concepts you are interested in. This file needs at least the OMOP IDs as its first column and the lab units as its third column. Ideally you should create this file based on the section Finding your OMOP IDs.
- RelevantSumstats: The summary statistics you are interested in for each individual in the selected period, for each chosen combination of OMOP ID and lab unit. Currently supported are: MEAN, MEDIAN, SD, FIRST_QUANTILE, THIRD_QUANTILE, MIN, MAX. If you choose multiple summary statistics, separate them with spaces; the results will be written to separate files.
Note that you can comment out any row by putting a # in front of it, and it will be ignored by the config file reader.
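The config conventions described above (key and value separated by tab, comma or semicolon; rows starting with # ignored; space-separated RelevantSumstats) can be illustrated with a small Python reader. This is only a sketch of the file format as documented here, not the C++ config reader, and the paths in the example are placeholders.

```python
import re


def read_config(text):
    """Parse 'Key<delim>Value' rows, where <delim> is a tab, comma or semicolon."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank rows and commented-out rows are skipped
        parts = re.split(r"[\t,;]", line, maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"no delimiter found in config row: {line!r}")
        key, value = parts[0].strip(), parts[1].strip()
        # RelevantSumstats may hold several space-separated statistics
        config[key] = value.split() if key == "RelevantSumstats" else value
    return config


example = (
    "# SampleFile\t/hypothetical/path/samples.csv\n"
    "KantaLabFile\t/hypothetical/path/kanta_lab.csv\n"
    "ResDirPath;/hypothetical/results/\n"
    "RelevantSumstats,MEAN MEDIAN\n"
)
print(read_config(example))
```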
In addition to the sample file, you will need a set of OMOP concepts that you are interested in. To help you choose them, I have added a program that creates summary statistics for all of the OMOP concepts, which you can then use to filter the most relevant ones for you, for example by choosing the top 20 most common measurements.
You can create this list, using the following command:
./omop_sumstats config_file
You can add a minimum number of times the OMOP concepts should occur in the file. I actually recommend this as an initial screening because there are still a lot of mistakes in the data. For example you can use:
./omop_sumstats config_file 100
where each combination of OMOP concept and lab unit has to appear at least 100 times to be considered relevant. The file will be written to <ResDirPath>/<ResFilePrefix>_omop_sumstats.csv. I will later add some script to further process these statistics; for now you can find ready-made lists at /data/projects/project_kdetrois/omop_sumstats/