From 9eba3b5f4ee37fc5d5cf19c78724c28a3d45a9ad Mon Sep 17 00:00:00 2001 From: Melissa Gymrek Date: Mon, 14 Feb 2022 13:44:10 -0800 Subject: [PATCH 01/10] initiating haptools readme pages --- .gitignore | 5 +++- README.md | 26 +++++++++++++++++-- haptools/karyogram/README.md | 3 +++ .../{visualization => karyogram}/karyogram.py | 0 .../{visualization => karyogram}/to_remove.py | 0 haptools/simgenotype/README.md | 3 +++ .../admix_storage.py | 0 .../sim_admixture.py | 0 haptools/simphenotype/README.md | 3 +++ 9 files changed, 37 insertions(+), 3 deletions(-) create mode 100644 haptools/karyogram/README.md rename haptools/{visualization => karyogram}/karyogram.py (100%) rename haptools/{visualization => karyogram}/to_remove.py (100%) create mode 100644 haptools/simgenotype/README.md rename haptools/{simulate => simgenotype}/admix_storage.py (100%) rename haptools/{simulate => simgenotype}/sim_admixture.py (100%) create mode 100644 haptools/simphenotype/README.md diff --git a/.gitignore b/.gitignore index 618f9c4a..11548fd4 100644 --- a/.gitignore +++ b/.gitignore @@ -9,4 +9,7 @@ __pycache__ # pytest cache .pytest_cache # poetry -dist/ \ No newline at end of file +dist/ + +# OSX +*.DS_Store* \ No newline at end of file diff --git a/README.md b/README.md index 35127436..23933681 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,29 @@ Please wait until we have published our first tagged release before using our code. # haptools -Simulate phenotypes for fine-mapping. Use real variants to simulate real, biological LD patterns. -The Snakemake pipeline in the `snakemake/` directory uses the results of the simulation to test several fine-mapping methods, including FINEMAP and SuSiE. + +Haptools is a collection of tools for simulating and analyzing genotypes and phenotypes while taking into account haplotype information. It is particularly designed for analysis of individuals with admixed ancestries, although the tools can also be used for non-admixed individuals. 
Homepage: https://haptools.readthedocs.io/ + +## Installation + +UNDER CONSTRUCTION + +## Haptools utilities + +Haptools consists of multiple utilities listed below. Click on a utility to see more detailed usage information. + +* [`haptools simgenome`](haptools/simgenotype/README.md): Simulate genotypes for admixed individuals under user-specified demographic histories. `haptools simgenome` takes as input a reference set of ancestry-labeled haplotypes and a demographic model and outputs a VCF file with local ancestry information annotated for each variant. It also outputs a list of local ancestry breakpoints which can be visualized using `haptools karyogram`. The output VCF file can be used as input to downstream tools such as `haptools simphenotype` to simulate phenotype information. + +* ['haptools simphenotype'](haptools/simphenotype/README.md): Simulate a complex trait, taking into account local ancestry- or haplotype- specific effects. 'haptools simphenotype' takes as input a VCF file and outputs simulated phenotypes for each sample. + +* [`haptools karyogram`](haptools/karyogram/README.md): Visualize a "chromosome painting" of local ancestry labels based on breakpoints output by `haptools simgenome`. + + +## Contributing + +If you are interested in contributing to `haptools`, please get in touch by submitting a Github issue or contacting us at mlamkin@ucsd.edu. 
+ + + diff --git a/haptools/karyogram/README.md b/haptools/karyogram/README.md new file mode 100644 index 00000000..365abb96 --- /dev/null +++ b/haptools/karyogram/README.md @@ -0,0 +1,3 @@ +# Haptools karyogram + +UNDER CONSTRUCTION \ No newline at end of file diff --git a/haptools/visualization/karyogram.py b/haptools/karyogram/karyogram.py similarity index 100% rename from haptools/visualization/karyogram.py rename to haptools/karyogram/karyogram.py diff --git a/haptools/visualization/to_remove.py b/haptools/karyogram/to_remove.py similarity index 100% rename from haptools/visualization/to_remove.py rename to haptools/karyogram/to_remove.py diff --git a/haptools/simgenotype/README.md b/haptools/simgenotype/README.md new file mode 100644 index 00000000..48c1369c --- /dev/null +++ b/haptools/simgenotype/README.md @@ -0,0 +1,3 @@ +# Haptools simgenotype + +UNDER CONSTRUCTION \ No newline at end of file diff --git a/haptools/simulate/admix_storage.py b/haptools/simgenotype/admix_storage.py similarity index 100% rename from haptools/simulate/admix_storage.py rename to haptools/simgenotype/admix_storage.py diff --git a/haptools/simulate/sim_admixture.py b/haptools/simgenotype/sim_admixture.py similarity index 100% rename from haptools/simulate/sim_admixture.py rename to haptools/simgenotype/sim_admixture.py diff --git a/haptools/simphenotype/README.md b/haptools/simphenotype/README.md new file mode 100644 index 00000000..9c46b012 --- /dev/null +++ b/haptools/simphenotype/README.md @@ -0,0 +1,3 @@ +# Haptools simphenotype + +UNDER CONSTRUCTION \ No newline at end of file From 4d73de8338e7b783fbe00958db81b9969c189ca5 Mon Sep 17 00:00:00 2001 From: Melissa Gymrek Date: Mon, 14 Feb 2022 13:45:17 -0800 Subject: [PATCH 02/10] fixing minor readme typos --- README.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 23933681..98234b07 100644 --- a/README.md +++ b/README.md @@ -17,9 +17,13 @@ UNDER CONSTRUCTION Haptools 
consists of multiple utilities listed below. Click on a utility to see more detailed usage information. -* [`haptools simgenome`](haptools/simgenotype/README.md): Simulate genotypes for admixed individuals under user-specified demographic histories. `haptools simgenome` takes as input a reference set of ancestry-labeled haplotypes and a demographic model and outputs a VCF file with local ancestry information annotated for each variant. It also outputs a list of local ancestry breakpoints which can be visualized using `haptools karyogram`. The output VCF file can be used as input to downstream tools such as `haptools simphenotype` to simulate phenotype information. +* [`haptools simgenome`](haptools/simgenotype/README.md): Simulate genotypes for admixed individuals under user-specified demographic histories. -* ['haptools simphenotype'](haptools/simphenotype/README.md): Simulate a complex trait, taking into account local ancestry- or haplotype- specific effects. 'haptools simphenotype' takes as input a VCF file and outputs simulated phenotypes for each sample. +`haptools simgenome` takes as input a reference set of ancestry-labeled haplotypes and a demographic model and outputs a VCF file with local ancestry information annotated for each variant. It also outputs a list of local ancestry breakpoints which can be visualized using `haptools karyogram`. The output VCF file can be used as input to downstream tools such as `haptools simphenotype` to simulate phenotype information. + +* [`haptools simphenotype`](haptools/simphenotype/README.md): Simulate a complex trait, taking into account local ancestry- or haplotype- specific effects. + +'haptools simphenotype' takes as input a VCF file and outputs simulated phenotypes for each sample. * [`haptools karyogram`](haptools/karyogram/README.md): Visualize a "chromosome painting" of local ancestry labels based on breakpoints output by `haptools simgenome`. 
From 2c9da5566f400380a6eb481ee8f3e2f422975f7d Mon Sep 17 00:00:00 2001 From: Melissa Gymrek Date: Mon, 14 Feb 2022 13:47:17 -0800 Subject: [PATCH 03/10] minor readme changes --- README.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 98234b07..4ea87402 100644 --- a/README.md +++ b/README.md @@ -19,14 +19,13 @@ Haptools consists of multiple utilities listed below. Click on a utility to see * [`haptools simgenome`](haptools/simgenotype/README.md): Simulate genotypes for admixed individuals under user-specified demographic histories. -`haptools simgenome` takes as input a reference set of ancestry-labeled haplotypes and a demographic model and outputs a VCF file with local ancestry information annotated for each variant. It also outputs a list of local ancestry breakpoints which can be visualized using `haptools karyogram`. The output VCF file can be used as input to downstream tools such as `haptools simphenotype` to simulate phenotype information. - -* [`haptools simphenotype`](haptools/simphenotype/README.md): Simulate a complex trait, taking into account local ancestry- or haplotype- specific effects. - -'haptools simphenotype' takes as input a VCF file and outputs simulated phenotypes for each sample. +* [`haptools simphenotype`](haptools/simphenotype/README.md): Simulate a complex trait, taking into account local ancestry- or haplotype- specific effects. `haptools simphenotype` takes as input a VCF file and outputs simulated phenotypes for each sample. * [`haptools karyogram`](haptools/karyogram/README.md): Visualize a "chromosome painting" of local ancestry labels based on breakpoints output by `haptools simgenome`. +Outputs produced by these utilities are compatible with each other. For example +`haptools simgenome` outputs a VCF file with local ancestry information annotated for each variant. The output VCF file can be used as input to `haptools simphenotype` to simulate phenotype information. 
`haptools simgenome` also outputs a list of local ancestry breakpoints which can be visualized using `haptools karyogram`. + ## Contributing From f65204b62005b2e0a42009e4591341177a9c6dd6 Mon Sep 17 00:00:00 2001 From: Melissa Gymrek Date: Mon, 14 Feb 2022 13:54:01 -0800 Subject: [PATCH 04/10] adding simgenotypes readme --- haptools/simgenotype/README.md | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/haptools/simgenotype/README.md b/haptools/simgenotype/README.md index 48c1369c..82631756 100644 --- a/haptools/simgenotype/README.md +++ b/haptools/simgenotype/README.md @@ -1,3 +1,22 @@ # Haptools simgenotype -UNDER CONSTRUCTION \ No newline at end of file +`haptools simgenotype` takes as input a reference set of haplotypes in VCF format and a user-specified admixture model. It outputs a VCF file with simulated genotype information for admixed genotypes, as well as a breakpoints file that can be used for visualization. + +## Basic usage + +``` +haptools simgenotype \ + --invcf REFVCF \ + --sample_info SAMPLEINFOFILE \ + --model MODELFILE \ + --map GENETICMAP \ + --out OUTPREFIX +``` + +Detailed information about each option, and example commands using publicly available files, are shown below. 
+ +## Detailed usage + +## File formats + +## Examples \ No newline at end of file From 06fcaaaa2fdc9ae2078754803e59ebd764f027eb Mon Sep 17 00:00:00 2001 From: Michael Lamkin Date: Mon, 14 Feb 2022 22:43:57 -0800 Subject: [PATCH 05/10] Update README.md --- haptools/simgenotype/README.md | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/haptools/simgenotype/README.md b/haptools/simgenotype/README.md index 82631756..03e91667 100644 --- a/haptools/simgenotype/README.md +++ b/haptools/simgenotype/README.md @@ -19,4 +19,15 @@ Detailed information about each option, and example commands using publicly avai ## File formats -## Examples \ No newline at end of file +Model Format +``` +{num_samples} Admixed Pop_label1 Pop_label2 +{num_generations} {admixed_freq} {pop_label1_freq} {pop_label2_freq} +``` + +Map Format + +Outfile Format + + +## Examples From e41bfeedb309fef43bf8a4a0c642fce4784e2a9c Mon Sep 17 00:00:00 2001 From: Michael Lamkin Date: Mon, 14 Feb 2022 22:46:44 -0800 Subject: [PATCH 06/10] Update README.md --- haptools/simgenotype/README.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/haptools/simgenotype/README.md b/haptools/simgenotype/README.md index 03e91667..7197eaa4 100644 --- a/haptools/simgenotype/README.md +++ b/haptools/simgenotype/README.md @@ -20,9 +20,18 @@ Detailed information about each option, and example commands using publicly avai ## File formats Model Format + +Structure of model.dat file +``` +{num_samples} Admixed Pop_label1 Pop_label2 ... Pop_labeln +{num_generations} {admixed_freq} {pop_label1_freq} {pop_label2_freq} ... 
{pop_labeln_freq} +``` + +Example model.dat file + ``` -{num_samples} Admixed Pop_label1 Pop_label2 -{num_generations} {admixed_freq} {pop_label1_freq} {pop_label2_freq} +40 Admixed CEU YRI +6 0 0.2 0.8 ``` Map Format From 058b0e89b4ee0b362ea173870ecb5dee3288f772 Mon Sep 17 00:00:00 2001 From: Michael Lamkin Date: Mon, 14 Feb 2022 23:01:29 -0800 Subject: [PATCH 07/10] Update README.md --- haptools/simgenotype/README.md | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/haptools/simgenotype/README.md b/haptools/simgenotype/README.md index 7197eaa4..7708ba50 100644 --- a/haptools/simgenotype/README.md +++ b/haptools/simgenotype/README.md @@ -22,6 +22,11 @@ Detailed information about each option, and example commands using publicly avai Model Format Structure of model.dat file + +`num_samples` - Total number of samples to be output by the simulator (`num_samples*2` haplotypes) +`num_generations` - Number of generations to simulate admixture, must be > 0 +`*_freq` - Frequency of populations to be present in the simulated samples + ``` {num_samples} Admixed Pop_label1 Pop_label2 ... Pop_labeln {num_generations} {admixed_freq} {pop_label1_freq} {pop_label2_freq} ... {pop_labeln_freq} @@ -33,10 +38,35 @@ Example model.dat file 40 Admixed CEU YRI 6 0 0.2 0.8 ``` +Simulating 6 generations in this case implies the first generation has population freqs `Admixed=0, CEU=0.2, YRI=0.8` and the remaining 2-6 generations have population frequency `Admixed=1, CEU=0, YRI=0` Map Format +chr - chromosome of coordinate (1-22, X) +var - variant identifier +pos cM - Position in centimorgans +pos bp - Base pair coordinate + +``` +{chr}\t{var}\t{pos cM}\t{pos bp} +``` +Beagle Genetic Maps used in simulation (GRCh38): http://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/ + + Outfile Format +Sample Header - Name of sample following the structure `Sample_{number}_{hap}` eg. 
`Sample_10_1` for sample number 10 haplotype +pop - Population label corresponding to the index of the population in the dat file so in the example above CEU = 1, YRI = 2 +chr - chromosome (1-22, X) + +``` +Sample Header +{pop}\t{chr}\t{pos bp} +... +Sample Header 2 +... +``` ## Examples + +Example Command From 2c99a25b18915a9c36b14d80b82605305e929953 Mon Sep 17 00:00:00 2001 From: Michael Lamkin Date: Mon, 14 Feb 2022 23:12:34 -0800 Subject: [PATCH 08/10] Update README.md --- haptools/simgenotype/README.md | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/haptools/simgenotype/README.md b/haptools/simgenotype/README.md index 7708ba50..bf46ba71 100644 --- a/haptools/simgenotype/README.md +++ b/haptools/simgenotype/README.md @@ -17,6 +17,12 @@ Detailed information about each option, and example commands using publicly avai ## Detailed usage +`--invcf` - Input VCF file used to simulate specific haplotypes for resulting samples +`--sample_info` - File used to map samples in `REFVCF` to populations found in `MODELFILE` +`--model` - Parameters for simulating admixture across generations +`--map` - .map file used to determine recombination events during the simulation +`--out` - Output prefix of the structure `/path/to/output` which results in the vcf file `output.vcf.gz` and breakpoints file `output.bp` + ## File formats Model Format @@ -43,7 +49,7 @@ Simulating 6 generations in this case implies the first generation has populatio Map Format chr - chromosome of coordinate (1-22, X) -var - variant identifier +var - variant identifier pos cM - Position in centimorgans pos bp - Base pair coordinate @@ -56,7 +62,7 @@ Beagle Genetic Maps used in simulation (GRCh38): http://bochet.gcc.biostat.washi Outfile Format Sample Header - Name of sample following the structure `Sample_{number}_{hap}` eg.
`Sample_10_1` for sample number 10 haplotype -pop - Population label corresponding to the index of the population in the dat file so in the example above CEU = 1, YRI = 2 +pop - Population label corresponding to the index of the population in the dat file so in the example above CEU = 1, YRI = 2 chr - chromosome (1-22, X) ``` @@ -70,3 +76,11 @@ Sample Header 2 ## Examples Example Command +``` +haptools simgenotype \ + --invcf 1000Genomes.vcf.gz \ + --sample_info /path/to/sampleinfo.csv \ + --model /path/to/model/file.dat \ + --map /path/to/plink/file/ \ + --out /path/to/output +``` From ec872bbf5a1b77c9f696d250fcfb01247a12e25d Mon Sep 17 00:00:00 2001 From: Michael Lamkin Date: Mon, 14 Feb 2022 23:13:14 -0800 Subject: [PATCH 09/10] Update README.md --- haptools/simgenotype/README.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/haptools/simgenotype/README.md b/haptools/simgenotype/README.md index bf46ba71..3117adb5 100644 --- a/haptools/simgenotype/README.md +++ b/haptools/simgenotype/README.md @@ -48,10 +48,10 @@ Simulating 6 generations in this case implies the first generation has populatio Map Format -chr - chromosome of coordinate (1-22, X) -var - variant identifier -pos cM - Position in centimorgans -pos bp - Base pair coordinate +`chr` - chromosome of coordinate (1-22, X) +`var` - variant identifier +`pos cM` - Position in centimorgans +`pos bp` - Base pair coordinate ``` {chr}\t{var}\t{pos cM}\t{pos bp} @@ -61,9 +61,9 @@ Beagle Genetic Maps used in simulation (GRCh38): http://bochet.gcc.biostat.washi Outfile Format -Sample Header - Name of sample following the structure `Sample_{number}_{hap}` eg.
`Sample_10_1` for sample number 10 haplotype +`pop` - Population label corresponding to the index of the population in the dat file so in the example above CEU = 1, YRI = 2 +`chr` - chromosome (1-22, X) ``` Sample Header From 94fa0fef55f28bea6325c1da66307ab21759a5b9 Mon Sep 17 00:00:00 2001 From: Michael Lamkin Date: Mon, 14 Feb 2022 23:13:57 -0800 Subject: [PATCH 10/10] Update README.md --- haptools/simgenotype/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/haptools/simgenotype/README.md b/haptools/simgenotype/README.md index 3117adb5..9f60b60d 100644 --- a/haptools/simgenotype/README.md +++ b/haptools/simgenotype/README.md @@ -61,7 +61,7 @@ Beagle Genetic Maps used in simulation (GRCh38): http://bochet.gcc.biostat.washi Outfile Format -`Sample Header` - Name of sample following the structure `Sample_{number}_{hap}` eg. `Sample_10_1` for sample number 10 haplotype +`Sample Header` - Name of sample following the structure `Sample_{number}_{hap}` eg. `Sample_10_1` for sample number 10 haplotype 1 `pop` - Population label corresponding to the index of the population in the dat file so in the example above CEU = 1, YRI = 2 `chr` - chromosome (1-22, X)
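
The breakpoints layout described under "Outfile Format" (a `Sample_{number}_{hap}` header followed by tab-separated `{pop}\t{chr}\t{pos bp}` rows) could be read back with something like the sketch below. It is not part of haptools: the function name `parse_breakpoints` is hypothetical, the example rows are made up for illustration, and it assumes `pop` is stored as the integer index described above and that every header line starts with `Sample_`.

```python
from typing import Dict, List, Tuple

# (population index, chromosome, base-pair position); an assumed in-memory shape
Breakpoint = Tuple[int, str, int]


def parse_breakpoints(text: str) -> Dict[str, List[Breakpoint]]:
    """Group breakpoint rows under the sample header that precedes them."""
    haplotypes: Dict[str, List[Breakpoint]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("Sample_"):
            # Header line, e.g. Sample_10_1 (sample 10, haplotype 1)
            current = line
            haplotypes[current] = []
            continue
        if current is None:
            raise ValueError("breakpoint row found before any sample header")
        pop, chrom, pos = line.split("\t")
        # Chromosome stays a string because it may be "X"
        haplotypes[current].append((int(pop), chrom, int(pos)))
    return haplotypes


# Made-up rows, only to show the expected layout
example = "Sample_1_1\n1\t1\t59423\n2\t1\t239403\nSample_1_2\n2\tX\t100\n"
segments = parse_breakpoints(example)
```

Keeping the chromosome as a string and the population as an integer index mirrors the field descriptions above; if the real `.bp` file carries extra columns, the tuple unpacking would need to change accordingly.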