finregistry-matrices

Scripts to generate endpoint, drug, and socioeconomic-demographic matrices in FinRegistry

Introduction

The programs can now output multiple records per individual! The same individual can appear multiple times in your SampleFile, with different dates for each inclusion period. The program writes one row per individual and inclusion period, with FINREGISTRYID as the first column, followed by LowerAge and UpperAge, the lower and upper bounds of the inclusion period expressed as the individual's age. The output is sorted by these three columns, in that order. The code allows at most 20 different inclusion periods per individual in your SampleFile. If that is not enough, you can raise the limit by changing

#define MaxRec 20 

to a larger number in row 12/13 of the source code and recompiling :D
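As a rough illustration of the multiple-record behaviour described above, the sketch below turns inclusion periods into LowerAge and UpperAge and sorts the rows by the three key columns. It is not the C implementation: the age formula (days divided by 365.25), the rounding, the plain string ordering of IDs, and the helper names `age_in_years` and `inclusion_rows` are all assumptions made for this example.

```python
from datetime import date

MAX_REC = 20  # mirrors "#define MaxRec 20" in the C sources

def age_in_years(birth, when):
    # Assumption: age = elapsed days / 365.25; the C code may compute it differently.
    return (when - birth).days / 365.25

def inclusion_rows(samples):
    """samples: iterable of (FINREGISTRYID, birth, start, end) tuples."""
    periods_per_id = {}
    rows = []
    for fid, birth, start, end in samples:
        periods_per_id[fid] = periods_per_id.get(fid, 0) + 1
        if periods_per_id[fid] > MAX_REC:
            raise ValueError(f"{fid} has more than {MAX_REC} inclusion periods")
        lower = round(age_in_years(birth, start), 2)
        upper = round(age_in_years(birth, end), 2)
        rows.append((fid, lower, upper))
    # Tuples sort by FINREGISTRYID, then LowerAge, then UpperAge.
    return sorted(rows)

# Hypothetical individuals; FRXXXXXX1 has two inclusion periods.
samples = [
    ("FRXXXXXX2", date(1991, 2, 5), date(2000, 1, 1), date(2015, 12, 31)),
    ("FRXXXXXX1", date(2001, 1, 1), date(2010, 1, 1), date(2020, 12, 31)),
    ("FRXXXXXX1", date(2001, 1, 1), date(2005, 1, 1), date(2009, 12, 31)),
]
for row in inclusion_rows(samples):
    print(*row, sep="\t")
```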

All examples below are hypothetical and do not contain any real data.

You can now use a single parameter file like the one below to run both the wide endpoint generator and the wide drug generator. In this case, some parameters are shared between the two programs. Using separate parameter files still works if you prefer. Please see below for more details.

| Param | Description | Type | Default |
|---|---|---|---|
| SampleFile | List of FinRegistry IDs to include | file | None |
| OutputPrefix | Output path | str | None |
| LongEndPtFile | Longitudinal endpoints file path (input data) | str | None |
| EndPtList | List of endpoints to be included | file | None |
| EndPtByYear | Endpoint output file by year | bool | F |
| EndPtOutputEventCount | Endpoint output file with event counts | bool | F |
| EndPtOutputBinary | Endpoint output file with binary indicators | bool | F |
| EndPtOutputAge | Endpoint output file with age at onset | bool | F |
| LongFile | Detailed longitudinal file path (input data) | str | None |
| DrugList | List of drugs (ATC codes or truncated ATC codes) to be included | file | None |
| DrugByYear | Drug output file by year | bool | F |
| DrugOutputEventCount | Drug output file with event counts | bool | F |
| DrugOutputBinary | Drug output file with binary indicators | bool | F |
| DrugOutputAge | Drug output file with age at first purchase | bool | F |
| RegSource | Source registry to be considered for drugs | str | None |
| DrugMultiplyPackage | Drug output file with counts weighted by the number of packages | bool | T |
| PadNoEvent | Output zeros if the sample does not exist in the longitudinal file; if F, only samples that appear in the longitudinal file will be in the output | bool | F |
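
To make the shared parameter file and its defaults concrete, here is a minimal sketch of a reader for it. It is only an illustration, not the parser used by the C programs: it assumes one whitespace-separated `Key Value` pair per line, paths without spaces, and that unknown keys are silently ignored. The function name `parse_params` is hypothetical.

```python
# Defaults taken from the parameter table above.
BOOL_DEFAULTS = {
    "EndPtByYear": False, "EndPtOutputEventCount": False,
    "EndPtOutputBinary": False, "EndPtOutputAge": False,
    "DrugByYear": False, "DrugOutputEventCount": False,
    "DrugOutputBinary": False, "DrugOutputAge": False,
    "DrugMultiplyPackage": True, "PadNoEvent": False,
}
STR_KEYS = ("SampleFile", "OutputPrefix", "LongEndPtFile", "EndPtList",
            "LongFile", "DrugList", "RegSource")

def parse_params(lines):
    """Parse 'Key Value' lines into a dict, filling in the documented defaults."""
    params = dict(BOOL_DEFAULTS)
    params.update({key: None for key in STR_KEYS})
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: skip blank or malformed lines
        key, value = parts
        if key in BOOL_DEFAULTS:
            if value not in ("T", "F"):
                raise ValueError(f"{key} must be T or F, got {value!r}")
            params[key] = value == "T"
        elif key in STR_KEYS:
            params[key] = value
    return params

# Hypothetical parameter file content.
example = [
    "SampleFile   SampleList",
    "OutputPrefix   OutputPrefix",
    "EndPtOutputBinary   T",
    "PadNoEvent   T",
]
print(parse_params(example)["EndPtOutputBinary"])
```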

MakeEndPtFile.c

Creates a wide matrix (sample x feature) for disease endpoints from the longitudinal file.

This is still a (hopefully) working prototype, please let Zhiyu know if anything goes wrong when you are using it. She will try to work it out (hopefully).

To compile, download the code and run

gcc MakeEndPtFile.c -o Where/and/what/you/want/it/to/be -lm -fPIC

To execute, run

./Where/and/what/you/want/it/to/be ParamFile

See below for an example of ParamFile:

SampleFile   SampleList 
OutputPrefix   OutputPrefix 
LongEndPtFile  LongitudinalEndPtFile 
EndPtList  EndPtList 
EndPtByYear   T/F 
EndPtOutputEventCount   T/F 
EndPtOutputBinary   T/F 
EndPtOutputAge  T/F 

SampleFile is a file with a header row listing the samples you would like to include in the output. It should have four columns in the following order: FINREGISTRYID, date of birth, lower bound of the record inclusion date, and upper bound of the record inclusion date for the sample. All dates should be in yyyy-mm-dd format. The output will only include records for the sample that occur between the lower-bound and upper-bound dates, inclusive of both ends of the window. This allows each sample to have a different inclusion period. The program should accept space-, comma- or tab-separated files (please let Zhiyu know if it doesn't). See below for a demo of this file:

FINREGISTRYID date_of_birth start_of_followup end_of_followup
FRXXXXXX1 2001-01-01  2005-01-01  2020-12-31
FRXXXXXX2 1991-02-05  2000-01-01  2015-12-31
FRXXXXXX5 1984-06-13  2001-09-01  2004-12-31
FRXXXXXX8 2007-10-29  2005-12-01  2021-12-31
FRXXXXX10 1997-04-11  2001-04-01  2020-12-31
...
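
The SampleFile layout above can be sanity-checked before running the C programs. The following is a small sketch under stated assumptions: it treats any run of spaces, tabs or commas as a delimiter, skips the first line as the header, and rejects rows whose start date is after the end date. The function name `read_sample_file` is made up for this example, and the real programs may be more or less strict.

```python
import re
from datetime import date

def read_sample_file(lines):
    """Return (FINREGISTRYID, birth, start, end) tuples from SampleFile lines."""
    rows = []
    for index, line in enumerate(lines):
        if index == 0 or not line.strip():
            continue  # skip the header row and blank lines
        fields = [f for f in re.split(r"[ \t,]+", line.strip()) if f]
        if len(fields) != 4:
            raise ValueError(f"line {index + 1}: expected 4 columns, got {len(fields)}")
        fid = fields[0]
        # date.fromisoformat raises ValueError unless the text is yyyy-mm-dd.
        birth, start, end = (date.fromisoformat(text) for text in fields[1:])
        if start > end:
            raise ValueError(f"line {index + 1}: inclusion period ends before it starts")
        rows.append((fid, birth, start, end))
    return rows

demo = [
    "FINREGISTRYID date_of_birth start_of_followup end_of_followup",
    "FRXXXXXX1 2001-01-01  2005-01-01  2020-12-31",
    "FRXXXXXX2,1991-02-05,2000-01-01,2015-12-31",
    "FRXXXXXX5\t1984-06-13\t2001-09-01\t2004-12-31",
]
print(len(read_sample_file(demo)))
```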

OutputPrefix is where you would like the output to be written. The program should output a file named OutputPrefix.EndPt.

LongEndPtFile is the longitudinal file to be converted; it looks like /data/processed_data/endpointer/longitudinal_endpoints_no_omits_DF10_2022_09_29.txt.ALL. It should have at least the FINREGISTRYID, ENDPOINT and EVENT_AGE columns. EVENT_YEAR is optional, but the program will complain if this column is missing and you select options that require year information. The file should be sorted first by FINREGISTRYID and then by EVENT_AGE. To the best of Zhiyu's knowledge, files generated by Andrius' code are already sorted this way. If you do any preprocessing on the longitudinal file, please make sure it is still sorted.

      Zhiyu is aware that the year information is technically implied by EVENT_AGE, but she was too lazy to add that for this version. She will update this in the near future :)
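
If you preprocess the longitudinal file, a quick check like the one below can confirm that it is still sorted. This is only a sketch: the tab delimiter and the plain string comparison of FINREGISTRYID values are assumptions, and `first_unsorted_line` is a hypothetical helper, not part of this repository.

```python
import csv
import io

def first_unsorted_line(handle, delimiter="\t"):
    """Return the 1-based line number of the first out-of-order row, or None."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    previous = None
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        key = (row["FINREGISTRYID"], float(row["EVENT_AGE"]))
        if previous is not None and key < previous:
            return line_no
        previous = key
    return None

# Hypothetical content: the second record is younger than the first.
demo = io.StringIO(
    "FINREGISTRYID\tENDPOINT\tEVENT_AGE\n"
    "FRXXXXXX1\tEndpt1\t10.23\n"
    "FRXXXXXX1\tEndpt2\t10.18\n"
)
print(first_unsorted_line(demo))
```

For a real file you would pass an open file handle instead of the `io.StringIO` demo.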

EndPtList is a headerless file with one column listing the endpoints you would like to include. The program looks for exact matches of endpoint names in the ENDPOINT column of the longitudinal file.

PadNoEvent is a boolean specifying whether samples with no events within the inclusion period should be padded with 0s. This parameter works for both the endpoint and the drug file generator. Please input T or F. F means that the output will only include rows with at least one non-zero value. As an example for the SampleFile above, if individual FRXXXXXX2 does not have any of the endpoints in the given EndPtList during the inclusion period 2000-01-01 to 2015-12-31, then under the default F for PadNoEvent the output will not include the corresponding row, i.e. as below:

FINREGISTRYID  Endpt1  EndPt2  Endpt3  ...
FRXXXXXX1  xx  xx  xx  ...
FRXXXXXX5  xx  xx  xx  ...
FRXXXXXX8  xx  xx  xx  ...
FRXXXXX10  xx  xx  xx  ...
...

Whereas if T is selected, the output will be like

FINREGISTRYID  Endpt1  EndPt2  Endpt3  ...
FRXXXXXX1  xx  xx  xx  ...
FRXXXXXX2  0  0 0 ...
FRXXXXXX5  xx  xx  xx  ...
FRXXXXXX8  xx  xx  xx  ...
FRXXXXX10  xx  xx  xx  ...
...
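
The effect of PadNoEvent can be summarised in a few lines. The sketch below is illustrative only and assumes the wide rows have already been computed for samples with at least one event; `apply_pad_no_event` is a hypothetical name.

```python
def apply_pad_no_event(sample_ids, rows_with_events, n_features, pad_no_event):
    """Return (id, values) pairs in sample order, padding with zeros if requested.

    rows_with_events maps FINREGISTRYID -> list of feature values and only
    contains samples that have at least one event in their inclusion period.
    """
    output = []
    for fid in sample_ids:
        if fid in rows_with_events:
            output.append((fid, rows_with_events[fid]))
        elif pad_no_event:
            output.append((fid, [0] * n_features))  # row of zeros for PadNoEvent T
        # with PadNoEvent F the sample is simply left out
    return output

ids = ["FRXXXXXX1", "FRXXXXXX2", "FRXXXXXX5"]
rows = {"FRXXXXXX1": [1, 0, 1], "FRXXXXXX5": [0, 0, 1]}
print(apply_pad_no_event(ids, rows, 3, pad_no_event=True))
```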

EndPtByYear is a boolean indicating whether you would like to output sample info by year. Please input T or F. Output by year means that each row of the output is an individual's record for a certain year within the inclusion period. Note that each row contains only events that occurred in that specific year, and if nothing happened that year, no row of zeros is written, i.e. the output can look like below, with some gaps between years:

FINREGISTRYID Year  Endpt1  EndPt2  Endpt3  ...
FRXXXXXX1 2001  xx  xx  xx  ...
FRXXXXXX1 2002  xx  xx  xx  ...
FRXXXXXX1 2005  xx  xx  xx  ...
FRXXXXXX2 2001  xx  xx  xx  ...
FRXXXXXX2 2004  xx  xx  xx  ...
...

If F is chosen, the program outputs one row for each sample without the Year column, as below:

FINREGISTRYID  Endpt1  EndPt2  Endpt3  ...
FRXXXXXX1  xx  xx  xx  ...
FRXXXXXX2  xx  xx  xx  ...
FRXXXXXX5  xx  xx  xx  ...
FRXXXXXX8  xx  xx  xx  ...
FRXXXXX10  xx  xx  xx  ...
...

EndPtOutputEventCount is a boolean indicating whether you would like to output the number of occurrences of each endpoint. It counts the number of times the endpoint appears in the longitudinal file for a sample within the inclusion period. Zhiyu noticed that sometimes the same endpoint occurring at a certain time/age appears more than once in the longitudinal file with different EVENT_TYPE values (e.g. HILMO, ERIK_AVO etc.). If that happens, the current code counts them as more than one occurrence. She is not sure if this is the right way to handle it (or if all occurrences at the same age should be counted as only one) and will double check with the team. Please also let her know what you think is best. The output should look like this:

FINREGISTRYID  Endpt1_nEvent  EndPt2_nEvent  Endpt3_nEvent  ...
FRXXXXXX1  10  0  3  ...
FRXXXXXX2  0  4  1  ...
FRXXXXXX5  0  0  2  ...
FRXXXXXX8  5  1  0  ...
FRXXXXX10  0  0  3  ...
...

EndPtOutputBinary is a boolean indicating whether you would like to output a binary indicator for each endpoint. It outputs one if the endpoint occurs in the longitudinal file for a sample within the inclusion period, and zero otherwise. The output will look like this:

FINREGISTRYID  Endpt1  EndPt2  Endpt3  ...
FRXXXXXX1  1  0  1  ...
FRXXXXXX2  0  1  1  ...
FRXXXXXX5  0  0  1  ...
FRXXXXXX8  1  1  0  ...
FRXXXXX10  0  0  1  ...
...

EndPtOutputAge is a boolean indicating whether you would like to output the age of onset for each endpoint. It outputs the sample's age at the earliest occurrence of the endpoint in the longitudinal file within the inclusion period, and zero otherwise. The ages are rounded to two decimal places. The output should look like this:

FINREGISTRYID  Endpt1_OnsetAge  EndPt2_OnsetAge  Endpt3_OnsetAge  ...
FRXXXXXX1  10.23  0  15.36  ...
FRXXXXXX2  0  20.14  34.25  ...
FRXXXXXX5  0  0  32.30  ...
FRXXXXXX8  24.35  29.31  0  ...
FRXXXXX10  0  0  41.23  ...
...

These boolean output parameters are not mutually exclusive, i.e. you can output both the binary indicator and the event count and, on top of them, the age of onset, all by year. If so, the output will look something like below:

FINREGISTRYID  Year Endpt1_nEvent Endpt1  Endpt2_nEvent Endpt2 Endpt3_nEvent Endpt3  Endpt1_OnsetAge  EndPt2_OnsetAge  Endpt3_OnsetAge  ...
FRXXXXXX1 2001  8 1 3 1 0 0 10.23 10.18 0.00  ...
FRXXXXXX1 2002  2 1 0 0 0 0 11.24 0.00  0.00  ...
FRXXXXXX1 2005  0 0 1 1 2 1 0.00  14.13 14.67 ...
FRXXXXXX2 2001  0 0 0 0 3 1 0.00  0.00  23.45 ...
FRXXXXXX2 2004  2 1 0 0 4 1 26.14 0.00  26.73 ...
...
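
To show how the event count, binary indicator and onset age relate to each other, here is a sketch that derives all three from the long-format records of one individual, optionally split by year. It follows the column order of the combined example above (count and binary interleaved per endpoint, then the onset ages), but it is an illustration and not a translation of MakeEndPtFile.c; the function name `summarise` and the input tuple layout are assumptions.

```python
def summarise(events, endpoints, lower_age, upper_age, by_year=False):
    """events: iterable of (ENDPOINT, EVENT_AGE, EVENT_YEAR) for ONE individual.

    Returns {row_key: values}, where row_key is the year (by_year=True) or
    "all", and values are [count, binary] per endpoint followed by onset ages.
    """
    wanted = set(endpoints)
    cells = {}  # row key -> {endpoint: [count, earliest age]}
    for endpoint, age, year in events:
        # Keep only listed endpoints inside the inclusion window (ends included).
        if endpoint not in wanted or not (lower_age <= age <= upper_age):
            continue
        row = cells.setdefault(year if by_year else "all", {})
        cell = row.setdefault(endpoint, [0, age])
        cell[0] += 1  # repeated records at the same age are counted separately
        cell[1] = min(cell[1], age)
    result = {}
    for key in sorted(cells):
        counts_and_flags, onset_ages = [], []
        for endpoint in endpoints:
            count, first_age = cells[key].get(endpoint, (0, 0.0))
            counts_and_flags += [count, int(count > 0)]
            onset_ages.append(round(first_age, 2))
        result[key] = counts_and_flags + onset_ages
    return result

# Hypothetical records; the last one falls outside the inclusion window.
events = [("Endpt1", 10.23, 2001), ("Endpt1", 10.50, 2001),
          ("Endpt2", 10.18, 2001), ("Endpt1", 11.24, 2002),
          ("Endpt3", 50.00, 2040)]
print(summarise(events, ["Endpt1", "Endpt2", "Endpt3"], 0.0, 20.0, by_year=True))
```

Years without any event produce no key in the result, which matches the gaps in the by-year output described above.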

Please choose what you would like to output accordingly. The output file will be tab-separated.

MakeDrugFile.c

Creates a wide matrix (sample x feature) for drugs from the detailed longitudinal file. Works similarly to the endpoint file generator above.

This is again just a prototype. Zhiyu has recently tested it on a larger scale (~3M samples) and it was working for her. But please let her know if you find some of the many problems.

To compile, download the code and run

gcc MakeDrugFile.c -o Where/and/what/you/want/it/to/be -lm -fPIC

To execute, run

./Where/and/what/you/want/it/to/be ParamFile

See below for an example of ParamFile:

SampleFile   SampleList 
OutputPrefix   OutputPrefix 
LongFile  LongitudinalFile 
DrugList  DrugList 
DrugByYear   T/F 
DrugOutputEventCount   T/F 
DrugOutputBinary   T/F 
DrugOutputAge  T/F 
RegSource   PURCH 
DrugMultiplyPackage  T/F 

SampleFile and OutputPrefix are parameters shared with the wide endpoint generator above. The program should output a file named OutputPrefix.Drug. DrugByYear, DrugOutputEventCount, DrugOutputBinary, and DrugOutputAge work in the same way as EndPtByYear, EndPtOutputEventCount, EndPtOutputBinary, and EndPtOutputAge.

LongFile here is the detailed longitudinal file to be converted; it looks like /data/processed_data/detailed_longitudinal/detailed_longitudinal.csv. It should have at least FINREGISTRYID, SOURCE, EVENT_AGE and a column named CODE1 as the feature name column. It will also try to find PVM or EVENT_YRMNTH and throw a complaint if they are missing when the DrugByYear option is chosen, which Zhiyu will fix at some point. A difference from the endpoint generator: the drug program can output the number of drug purchases multiplied by the number of packages. If this option is chosen, a column named CODE4 indicating the number of packages should be included. Otherwise the option will be overridden to false.

RegSource is the source registry whose records are to be considered, and should be PURCH for drug matrix generation. Technically this program can also be used to generate a wide matrix from other sources, as long as the source is specified here and a corresponding DrugList is provided. In that case, DrugList need not contain "drugs" specifically, but any other register codes you would like to include as analysis features. The program will try to match the listed codes with column CODE1 in the detailed longitudinal file.

DrugMultiplyPackage is a boolean indicating whether you want the output event count to be weighted by the number of packages. If T, the event count effectively becomes the number of packages a sample purchased within the inclusion period. Otherwise it is the number of times the sample made a purchase. Be careful using this when multiple drugs/ATC codes fall into one feature in the input drug list. Does it still make sense to count the total number of packages? Zhiyu is not very familiar with drug codes, so please choose according to your analysis goal.

DrugList is a list of ATC codes or truncated ATC codes that you want to include in the output. A truncated ATC code is the first n characters of an ATC code. For each code in the list, the program compares the beginning of the CODE1 column of the input detailed longitudinal file with the listed code, looking for a match on the first n characters. n can vary for each code in the list. For example, you can input a drug list that looks like below

A02BC02
C07AB
R0
J01
...

where only A02BC02 is a full-length ATC code, which will be matched exactly. The other columns count occurrences of the sample purchasing any drug whose code starts with the given item, i.e. J01CA08, J01CE02, J01EA01 ... all start with J01 and so will be counted in that column.
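
The prefix matching and the optional package weighting can be sketched as follows. This is a simplified illustration: it scans the list linearly and takes the first matching entry, whereas the C program is described as using binary search, and the names `match_feature` and `count_purchases` as well as the (CODE1, CODE4) input pairs are assumptions for the example.

```python
def match_feature(code, drug_list):
    """Return the first listed (possibly truncated) code that CODE1 starts with."""
    for prefix in drug_list:
        if code.startswith(prefix):
            return prefix
    return None

def count_purchases(records, drug_list, multiply_package=False):
    """records: iterable of (CODE1, CODE4) pairs for one individual.

    With multiply_package=True each purchase contributes its number of
    packages (CODE4); otherwise every purchase counts as one.
    """
    counts = {prefix: 0 for prefix in drug_list}
    for code, packages in records:
        feature = match_feature(code, drug_list)
        if feature is not None:
            counts[feature] += packages if multiply_package else 1
    return counts

drug_list = ["A02BC02", "C07AB", "R0", "J01"]
# Hypothetical purchases as (ATC code, number of packages).
records = [("J01CA08", 2), ("J01CE02", 1), ("A02BC02", 3), ("N02BE01", 1)]
print(count_purchases(records, drug_list))
print(count_purchases(records, drug_list, multiply_package=True))
```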

      A known "problem" which Zhiyu is working on fixing: if you input a drug list that has ATC codes encompassing each other, e.g. a list with both J01 and J01CE, then a purchase of, say, J01CE02 or J01CE01 will be counted in only one of these two columns, depending on which code is found first through binary search in your list. She thinks it's a rather rare case and is not sure if anyone would want to do something like that, but please don't, if you see this!
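
Until this is fixed, you can check your DrugList for codes that encompass each other before running the program. The helper below is a hypothetical sketch (it is not part of this repository); it relies on the fact that, in a sorted list of unique strings, all codes sharing a prefix sit directly after that prefix.

```python
def overlapping_codes(drug_list):
    """Return (shorter, longer) pairs where one listed code is a prefix of another."""
    codes = sorted(set(drug_list))
    pairs = []
    for index, short in enumerate(codes):
        for longer in codes[index + 1:]:
            if not longer.startswith(short):
                break  # sorted order: no later code can start with `short`
            pairs.append((short, longer))
    return pairs

print(overlapping_codes(["J01", "J01CE", "A02BC02", "R0"]))
```

An empty result means no listed code is a prefix of another.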

MakeRegFile.py

This script is still quite lightly tested, so please report any issues and/or suggestions to Tuomo.

This script creates FinRegistry matrices from data sources other than endpoints and drug purchases. See the figure below for a graphical summary of the scope of the variables that can be included in the output file. All the currently available variables are listed in the file

documents/selected_variables_v2.csv

[Figure: graphical summary of the scope of the variables that can be included in the output file]

Requirements & installation

The script runs using only packages installed in the shared_env environment of ePouta machines.

As the script is pure Python, it does not require installation or compilation. The easiest way to use the script is to download the code from this repository as a zip-file to your own computer, and then transfer it to ePouta following the instructions written in the Master document.

Usage

See sections below for more detailed instructions.

python /path/to/script/MakeRegFile.py -h
usage: MakeRegFile.py [-h] [--configfile CONFIGFILE] [--logfile LOGFILE]

optional arguments:
  -h, --help            show this help message and exit
  --configfile CONFIGFILE
                        Full path to the configuration file.
  --logfile LOGFILE     Full path to the log file.

Required input

Similarly to the other matrix generation scripts above, a configuration file (must be tab-delimited) is required with the following entries:

CpiFile  CpiFile 
MinimalPhenotypeFile  MinimalPhenotypeFile 
MarriageHistoryFile  MarriageHistoryFile 
PedigreeFile  PedigreeFile 
LivingExtendedFile  LivingExtendedFile 
SESFile  SESFile 
EducationFile  EducationFile 
SocialAssistanceFile  SocialAssistanceFile 
PensionFile  PensionFile 
BenefitsFile  BenefitsFile 
IncomeFile  IncomeFile 
RelativesFile  RelativesFile 
SocialHilmoFile  SocialHilmoFile 
BirthFile  BirthFile 
SampleFile   SampleList 
FeatureFile  VariableList 
OutputFile   OutputPrefix 
ByYear   T/F 
OutputEventCount   T/F 
OutputBinary   T/F 
OutputAge  T/F 

See the example file example/ses_config to see which registry files are used as input. One should normally not need to change the paths to the input files unless a registry file is updated to a newer version.

The SampleFile specifies which individuals to include in the output and which follow-up periods to use for each individual when collecting the variable values. Note that only data entries occurring within the individual-specific follow-up periods are used to construct the output. The same individual can appear in the SampleFile multiple times as long as the follow-up periods are different (FINREGISTRYID and the follow-up start and end dates define unique keys). Below you can see how this file should be structured (notice that the column headers must be exactly as specified here and the columns must be comma-delimited):

FINREGISTRYID,date_of_birth,start_of_followup,end_of_followup
FRXXXXXX1,2001-01-01,2005-01-01,2020-12-31
FRXXXXXX2,1991-02-05,2000-01-01,2015-12-31
FRXXXXXX5,1984-06-13,2001-09-01,2004-12-31
FRXXXXXX8,2007-10-29,2005-12-01,2021-12-31
FRXXXXX10,1997-04-11,2001-04-01,2020-12-31
...
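
The header and uniqueness requirements above can be verified with a short check before running the script. This is a sketch under the stated rules only; `check_sample_file` is a hypothetical helper and not part of MakeRegFile.py.

```python
import csv
import io

EXPECTED_HEADER = ["FINREGISTRYID", "date_of_birth",
                   "start_of_followup", "end_of_followup"]

def check_sample_file(handle):
    """Validate header and key uniqueness; return the number of sample rows."""
    reader = csv.DictReader(handle)  # comma-delimited by default
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    seen = set()
    for row in reader:
        key = (row["FINREGISTRYID"], row["start_of_followup"], row["end_of_followup"])
        if key in seen:
            raise ValueError(f"duplicate individual and follow-up period: {key}")
        seen.add(key)
    return len(seen)

# Hypothetical content: the same individual with two different follow-up periods.
demo = io.StringIO(
    "FINREGISTRYID,date_of_birth,start_of_followup,end_of_followup\n"
    "FRXXXXXX1,2001-01-01,2005-01-01,2009-12-31\n"
    "FRXXXXXX1,2001-01-01,2010-01-01,2020-12-31\n"
)
print(check_sample_file(demo))
```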

Here, FeatureFile is a file with one column listing all variables to use in the output. See documents/selected_variables_v2.csv for an example, which contains all the implemented features. If you don't need all of the features, you can make the code run faster by including only the rows that you need.

All other parameters work exactly as described above for the drug and endpoint matrices, except for OutputEventCount, which has not been implemented yet.

Output

Output matrices are formatted similarly to what is described above for the drug and endpoint matrices. Output is written to the path defined in the config file. Note that output is only written for variables that are included in the FeatureFile. A log file is also written, including the config used to invoke the script and possible warnings. The checks performed are listed below.

Checks

  • Checks that all input files can be read before starting preprocessing.
  • Reports a warning in the log file if requested age ranges are outside the coverage of any of the registries (NOT IMPLEMENTED YET, USER NEEDS TO CHECK THEMSELVES!).

Creating matrices from kanta lab data

Usage

All C++ files are provided both as source, so that you can change/adapt/fix and compile them yourselves, and as ready executables. Do note that you will likely need to compile them locally, since some of the code uses Boost, which is not available in ePouta.

After adjusting the config file and preparing the sample and OMOP files, you can run the full program with the simple command

  make run_kanta_lab_matrix

This runs two steps. It first creates a file with the summary statistics for each pair of OMOP ID and lab unit for each individual, where each set of <FINREGISTRYID, OMOP_ID, LAB_UNIT> has its own row.

  exec/indv_sumstats config_file

Then it creates a single file based on the selected relevant summary statistic, e.g. the mean value. Here each row is a single individual and each column is the value (e.g. the mean) for a selected OMOP_ID and LAB_UNIT pair.

  exec/kanta_lab_matrix config_file

This way you can also rerun the second step and choose a different summary stat value. Or you can only run the first step if you are interested in further detailed summary statistics.
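
The two steps can be pictured with the following sketch: first one set of summary statistics per <FINREGISTRYID, OMOP_ID, LAB_UNIT>, then a wide matrix for one chosen statistic. It is written in Python for illustration and is not the C++ implementation; the function names, the subset of statistics shown, and the use of None for missing combinations are assumptions.

```python
from collections import defaultdict
from statistics import mean, median

def indv_sumstats(measurements):
    """Step 1: measurements are (FINREGISTRYID, OMOP_ID, LAB_UNIT, value) tuples."""
    groups = defaultdict(list)
    for fid, omop_id, unit, value in measurements:
        groups[(fid, omop_id, unit)].append(value)
    return {
        key: {"MEAN": mean(values), "MEDIAN": median(values),
              "MIN": min(values), "MAX": max(values)}
        for key, values in groups.items()
    }

def to_matrix(sumstats, stat="MEAN"):
    """Step 2: one row per individual, one column per (OMOP_ID, LAB_UNIT) pair."""
    columns = sorted({(omop_id, unit) for _, omop_id, unit in sumstats})
    matrix = {}
    for (fid, omop_id, unit), stats in sumstats.items():
        row = matrix.setdefault(fid, [None] * len(columns))
        row[columns.index((omop_id, unit))] = stats[stat]
    return columns, matrix

# Hypothetical OMOP IDs, units and values.
measurements = [
    ("FRXXXXXX1", "OMOP_A", "unit1", 140.0),
    ("FRXXXXXX1", "OMOP_A", "unit1", 150.0),
    ("FRXXXXXX2", "OMOP_A", "unit1", 130.0),
    ("FRXXXXXX1", "OMOP_B", "unit2", 70.0),
]
columns, matrix = to_matrix(indv_sumstats(measurements), stat="MEAN")
print(columns)
print(matrix)
```

Because step 2 only reads the output of step 1, switching `stat` to another statistic does not require recomputing the per-individual summaries.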

Config File

You can find an example of a config file under configs/kanta_lab_matrix_mean.config.

The Config file expects at least the following two entries (the delimiter can be tab, comma or semicolon):

  • KantaLabFile: The complete path to the kanta lab data file (so likely something like /data/processed_data/kela_lab/kanta_lab_20xx-xx-xx.csv).
  • ResDirPath: The path to the results directory where you want your results saved to.

It is also a good idea to pass it:

  • ResFilePrefix: The default will be kanta_lab_. But maybe you will want it to be more specific.

Additionally needed depending on which step you are performing are:

  • SampleFile: The complete path to the sample file, as described in the sections above.
  • OmopFile: The OMOP concepts you are interested in. This file needs at least the OMOP IDs as its first column and the lab units as its third column. Ideally you should create this file based on the section Finding your OMOP IDs.
  • RelevantSumstats: The summary statistics you are interested in for each individual in the selected period, for each chosen combination of OMOP ID and lab unit. Currently supported are: MEAN, MEDIAN, SD, FIRST_QUANTILE, THIRD_QUANTILE, MIN, MAX. If you choose multiple summary statistics, separate them with spaces; the results will be written to separate files.

Note that you can comment out any row with a # in front and it will be ignored by the config file reader.
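
The config rules described in this section (key and value separated by a tab, comma or semicolon, # for comments, two required entries, a default prefix) can be sketched as below. This is not the C++ config reader; `read_config`, the handling of unknown keys, and the example paths are assumptions for illustration.

```python
import re

REQUIRED = ("KantaLabFile", "ResDirPath")

def read_config(lines):
    """Parse config lines of the form key<delimiter>value into a dict."""
    config = {"ResFilePrefix": "kanta_lab_"}  # default mentioned above
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and commented-out rows are ignored
        # Split only at the first tab, comma or semicolon.
        parts = re.split(r"[\t,;]", line, maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"cannot parse config line: {raw!r}")
        key, value = parts[0].strip(), parts[1].strip()
        # Multiple summary statistics are space separated.
        config[key] = value.split() if key == "RelevantSumstats" else value
    missing = [key for key in REQUIRED if key not in config]
    if missing:
        raise ValueError(f"missing required entries: {missing}")
    return config

# Hypothetical config content with mixed delimiters.
example = [
    "# SampleFile\t/hypothetical/path/samples.csv",
    "KantaLabFile\t/hypothetical/path/kanta_lab.csv",
    "ResDirPath;/hypothetical/results",
    "RelevantSumstats,MEAN MEDIAN",
]
print(read_config(example))
```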

Finding your OMOP IDs

In addition to the sample file, you will need a set of OMOP concepts that you are interested in. To help figure those out, I have added a program that creates summary statistics for all of the OMOP concepts, which you can then use to filter the most relevant ones for you, for example by choosing the top 20 most common measurements.

You can create this list, using the following command:

  ./omop_sumstats config_file

You can add a minimum number of times the OMOP concepts should occur in the file. I actually recommend this as an initial screening, because there are still a lot of mistakes in the data. For example you can use:

  ./omop_sumstats config_file 100

where each combination of OMOP concept and lab unit has to appear at least 100 times to be considered relevant. The file will be written to <ResDirPath>/<ResFilePrefix>_omop_sumstats.csv. I will later add a script to further process these statistics; for now you can find ready-made lists at /data/projects/project_kdetrois/omop_sumstats/
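
The screening idea (a minimum count, then the most common measurements) can be illustrated as follows. The sketch counts (OMOP_ID, LAB_UNIT) pairs directly from hypothetical records rather than reading the omop_sumstats output, whose column layout is not described here; `common_omop_pairs` is a made-up name.

```python
from collections import Counter

def common_omop_pairs(records, min_count=100, top_n=20):
    """records: iterable of (OMOP_ID, LAB_UNIT) pairs, one per measurement.

    Keeps pairs that occur at least min_count times and returns the top_n
    most common ones as ((OMOP_ID, LAB_UNIT), count) tuples.
    """
    counts = Counter(records)
    kept = [(pair, n) for pair, n in counts.most_common() if n >= min_count]
    return kept[:top_n]

# Hypothetical measurement records.
records = ([("OMOP_A", "unit1")] * 150
           + [("OMOP_B", "unit2")] * 100
           + [("OMOP_C", "unit1")] * 5)
print(common_omop_pairs(records, min_count=100, top_n=20))
```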
