switch to long input format

CDCgov · Dec 18, 2024 · db76734
1 parent 88c6b04
commit db76734

Showing 14 changed files with 4,386 additions and 172 deletions.
diff --git a/.gitignore b/.gitignore
@@ -38,11 +38,9 @@
 # !your_data_file.csv
 # !your_data_directory/
 !input/people_test.csv
-!input/gi_trajectories.csv
-!tests/data/gi_trajectory.csv
-!tests/data/three_columns.csv
+!input/natural_history.csv
+!tests/data/natural_history.csv
 !tests/data/empty.csv
-!tests/data/one_column.csv
 !tests/data/column_size_changes.csv
 
 #####

diff --git a/docs/natural-history-inputs.md b/docs/natural-history-inputs.md
@@ -1,28 +1,41 @@
 # Natural history model inputs
 
-## Infectiousness over time
+## Overview
+We provide a way to read user-specified natural history parameters (infectiousness over time,
+i.e., the generation interval; viral load over time; etc.) from a CSV file. The CSV can be expanded to also include
+symptom onset and improvement times for a given natural history parameter set.
 
-We provide a way for reading in a user-specified infectiousness over time distribution (generation interval)
-and appropriately scheduling infection attempts based on the distribution. The user provides an input file
-that contains samples from the cumulative distribution function (CDF) of the generation interval (GI) over
-time at a specified $\Delta t$, describing the fraction of an individual's infectiousness that has passed
-by a given time. The input data are assumed to have a format where the columns represent the times since
-the infection attempt (so starting at $t = 0$) and the entries in each row describe the value of the GI
-CDF. Each row represents a potential trajectory of the GI CDF.
+Because all of these parameters are specified in a CSV file, the user can provide whatever natural history
+parameters they want. For instance, if natural history parameters are correlated (e.g.,
+generation interval and symptom improvement), this can be modeled by providing a joint parameter set
+in the CSV. Conversely, if the parameters are uncorrelated, that can be modeled by having
+the CSV inputs be independent draws from a distribution.
+
+## Data input format
+
+Data are input in a long format. Columns include `id`, `time`, and `gi_cdf`. Future work includes expanding
+the format to include `viral_load`, `symptom_onset_time`, and `symptom_improvement_time`. Each `id` identifies a distinct
+sample of the natural history parameters, and `time` is the time since the person was first infected. The `gi_cdf`
+column describes the fraction of infectiousness that has occurred by a given `time` for a given parameter set.
+
+## Implementation
 
 People are assigned a trajectory number (row number) when they are infected. This allows for each person
-to have a different GI CDF if each of the trajectories are different. However, that trajectory number will
+to have a different GI CDF if each of the trajectories are different. That trajectory number will
 be used for also drawing the person's other natural history characteristics, such as their symptom onset
 and improvement times or viral load trajectory. This allows easily encoding correlation between natural
-history parameters (the user provides input CSVs where the first row in each CSV is from a joint sample
-of GI, symptom onset, symptom improvement, etc.) or allowing each of the parameters to be independent.
+history parameters (the user provides an input CSV in which the values for each `id` are a joint sample
+of the natural history parameters) or allowing each of the parameters to be independent.
 
-## Overall Assumptions
+## Assumptions
 1. There are no requirements on the number of trajectories fed to the model. Trajectory numbers are assigned
-to people uniformly and randomly. However, this means that an individual who is reinfected could have the exact
-same infectiousness trajectory as their last infection.
-2. There must be the same number of parameter sets for each parameter provided as an input CSV. For now, we are focusing
-only on GI, but we will soon expand our work to also include symptom onset and symptom improvement times.
-3. We have not yet crossed the barrier of how to separately treat individuals who are asymptomatic only. Are their
-GIs drawn from a separate CSV? Should their $R_i$ just be multiplied by a scalar? Part of the reason we are deferring
-this decision is because our previous isolation guidance work focused only on symptomatic individuals.
+to people uniformly and randomly. A user must provide enough trajectories to form a representative
+sample of the underlying natural history parameters.
+2. There must be the same number of values for each parameter provided in the input CSV. In other words, a user
+cannot provide 1000 GI trajectories but only 10 symptom improvement times. The user must ensure that all parameter
+sets are complete, either by assuming independent draws between parameter values or by imposing a correlation.
+3. The current input structure effectively encodes the agent's history of disease as an input parameter.
+Is this a good idea? Does it put too much burden on the user when some of this could be done in Rust instead?
+For example, if an agent is asymptomatic and has a different GI, the natural history CSV file needs to include both of those
+pieces of information to ensure the agent is properly modeled in the simulation. So, the user needs to figure out how to tie
+together clinical symptoms with natural history, and the model just simulates whatever correlations a user describes.
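The long format described in the docs above (`id`, `time`, `gi_cdf`) can be sketched in plain Rust as follows. This is an illustrative sketch only, not the code in this commit: the function names `parse_natural_history` and `gi_cdf_at` are hypothetical, the exact header string is assumed, and the docs do not say how values between sampled times are handled, so the linear interpolation here is an assumption. The validity check relies only on the fact that a CDF lies in [0, 1] and never decreases.

```rust
use std::collections::BTreeMap;

/// One natural history sample: (time, gi_cdf) points sorted by time.
type Trajectory = Vec<(f64, f64)>;

/// Parse long-format CSV text with the (assumed) header `id,time,gi_cdf`
/// into one trajectory per `id`.
fn parse_natural_history(text: &str) -> Result<BTreeMap<u32, Trajectory>, String> {
    let mut lines = text.lines().map(str::trim).filter(|l| !l.is_empty());
    let header = lines.next().ok_or("empty input")?;
    if header != "id,time,gi_cdf" {
        return Err(format!("unexpected header: {header}"));
    }
    let mut by_id: BTreeMap<u32, Trajectory> = BTreeMap::new();
    for line in lines {
        let fields: Vec<&str> = line.split(',').map(str::trim).collect();
        if fields.len() != 3 {
            return Err(format!("expected 3 fields, got {}: {line}", fields.len()));
        }
        let id: u32 = fields[0].parse().map_err(|e| format!("bad id in `{line}`: {e}"))?;
        let time: f64 = fields[1].parse().map_err(|e| format!("bad time in `{line}`: {e}"))?;
        let cdf: f64 = fields[2].parse().map_err(|e| format!("bad gi_cdf in `{line}`: {e}"))?;
        // Rows sharing an `id` belong to the same parameter set.
        by_id.entry(id).or_default().push((time, cdf));
    }
    for (id, traj) in by_id.iter_mut() {
        traj.sort_by(|a, b| a.0.total_cmp(&b.0));
        // A CDF must stay within [0, 1] and never decrease over time.
        let in_range = traj.iter().all(|&(_, c)| (0.0..=1.0).contains(&c));
        let monotone = traj.windows(2).all(|w| w[0].1 <= w[1].1);
        if !in_range || !monotone {
            return Err(format!("id {id}: gi_cdf is not a valid CDF"));
        }
    }
    Ok(by_id)
}

/// Linearly interpolate one trajectory's GI CDF at time `t` (an assumption),
/// clamping to the first or last sampled value outside the sampled range.
fn gi_cdf_at(traj: &[(f64, f64)], t: f64) -> Option<f64> {
    let (first, last) = (traj.first()?, traj.last()?);
    if t <= first.0 {
        return Some(first.1);
    }
    if t >= last.0 {
        return Some(last.1);
    }
    // Find the first pair of samples that brackets `t`.
    traj.windows(2).find(|w| t <= w[1].0).map(|w| {
        let ((t0, c0), (t1, c1)) = (w[0], w[1]);
        c0 + (c1 - c0) * (t - t0) / (t1 - t0)
    })
}
```

A caller would read the file named by `natural_history_inputs` into a string, pass it to `parse_natural_history`, pick one `id` per infected person, and query `gi_cdf_at` with the time since infection. Grouping by `id` is what lets a single file carry jointly sampled parameters once further columns are added.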
diff --git a/input/gi_trajectories.csv b/input/gi_trajectories.csv
diff --git a/input/input.json b/input/input.json
@@ -3,9 +3,8 @@
       "max_time": 200.0,
       "seed": 123,
       "r_0": 2.5,
-      "gi_trajectories_dt": 0.02,
       "report_period": 1.0,
       "synth_population_file": "input/people_test.csv",
-      "gi_trajectories_file": "input/gi_trajectories.csv"
+      "natural_history_inputs": "input/natural_history.csv"
     }
 }