diff --git a/.gitignore b/.gitignore
index 618f9c4a..11548fd4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -9,4 +9,7 @@ __pycache__
 # pytest cache
 .pytest_cache
 # poetry
-dist/
\ No newline at end of file
+dist/
+
+# OSX
+*.DS_Store*
\ No newline at end of file
diff --git a/README.md b/README.md
index 35127436..4ea87402 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,32 @@ Please wait until we have published our first tagged release before using our code.
 # haptools
-Simulate phenotypes for fine-mapping. Use real variants to simulate real, biological LD patterns.
-The Snakemake pipeline in the `snakemake/` directory uses the results of the simulation to test several fine-mapping methods, including FINEMAP and SuSiE.
+
+Haptools is a collection of tools for simulating and analyzing genotypes and phenotypes while taking into account haplotype information. It is particularly designed for analysis of individuals with admixed ancestries, although the tools can also be used for non-admixed individuals. Homepage: https://haptools.readthedocs.io/
+
+## Installation
+
+UNDER CONSTRUCTION
+
+## Haptools utilities
+
+Haptools consists of multiple utilities listed below. Click on a utility to see more detailed usage information.
+
+* [`haptools simgenome`](haptools/simgenotype/README.md): Simulate genotypes for admixed individuals under user-specified demographic histories.
+
+* [`haptools simphenotype`](haptools/simphenotype/README.md): Simulate a complex trait, taking into account local ancestry- or haplotype-specific effects. `haptools simphenotype` takes as input a VCF file and outputs simulated phenotypes for each sample.
+
+* [`haptools karyogram`](haptools/karyogram/README.md): Visualize a "chromosome painting" of local ancestry labels based on breakpoints output by `haptools simgenome`.
+
+Outputs produced by these utilities are compatible with each other. For example,
+`haptools simgenome` outputs a VCF file with local ancestry information annotated for each variant. The output VCF file can be used as input to `haptools simphenotype` to simulate phenotype information. `haptools simgenome` also outputs a list of local ancestry breakpoints which can be visualized using `haptools karyogram`.
+
+
+## Contributing
+
+If you are interested in contributing to `haptools`, please get in touch by submitting a GitHub issue or contacting us at mlamkin@ucsd.edu.
+
+
+
diff --git a/haptools/karyogram/README.md b/haptools/karyogram/README.md
new file mode 100644
index 00000000..365abb96
--- /dev/null
+++ b/haptools/karyogram/README.md
@@ -0,0 +1,3 @@
+# Haptools karyogram
+
+UNDER CONSTRUCTION
\ No newline at end of file
diff --git a/haptools/visualization/karyogram.py b/haptools/karyogram/karyogram.py
similarity index 100%
rename from haptools/visualization/karyogram.py
rename to haptools/karyogram/karyogram.py
diff --git a/haptools/visualization/to_remove.py b/haptools/karyogram/to_remove.py
similarity index 100%
rename from haptools/visualization/to_remove.py
rename to haptools/karyogram/to_remove.py
diff --git a/haptools/simgenotype/README.md b/haptools/simgenotype/README.md
new file mode 100644
index 00000000..9f60b60d
--- /dev/null
+++ b/haptools/simgenotype/README.md
@@ -0,0 +1,86 @@
+# Haptools simgenotype
+
+`haptools simgenotype` takes as input a reference set of haplotypes in VCF format and a user-specified admixture model. It outputs a VCF file with simulated genotype information for admixed genotypes, as well as a breakpoints file that can be used for visualization.
+
+## Basic usage
+
+```
+haptools simgenotype \
+  --invcf REFVCF \
+  --sample_info SAMPLEINFOFILE \
+  --model MODELFILE \
+  --map GENETICMAP \
+  --out OUTPREFIX
+```
+
+Detailed information about each option, and example commands using publicly available files, are shown below.
+
+## Detailed usage
+
+`--invcf` - Input VCF file used to simulate specific haplotypes for resulting samples
+`--sample_info` - File used to map samples in `REFVCF` to populations found in `MODELFILE`
+`--model` - Parameters for simulating admixture across generations
+`--map` - .map file used to determine recombination events during the simulation
+`--out` - Output prefix of the form `/path/to/output`, which produces the VCF file `output.vcf.gz` and the breakpoints file `output.bp`
+
+## File formats
+
+Model Format
+
+Structure of model.dat file
+
+`num_samples` - Total number of samples to be output by the simulator (`num_samples*2` haplotypes)
+`num_generations` - Number of generations to simulate admixture, must be > 0
+`*_freq` - Frequency of populations to be present in the simulated samples
+
+```
+{num_samples} Admixed Pop_label1 Pop_label2 ... Pop_labeln
+{num_generations} {admixed_freq} {pop_label1_freq} {pop_label2_freq} ... {pop_labeln_freq}
+```
+
+Example model.dat file
+
+```
+40 Admixed CEU YRI
+6 0 0.2 0.8
+```
+In this example, simulating 6 generations means that the first generation has population frequencies `Admixed=0, CEU=0.2, YRI=0.8` and generations 2-6 have population frequencies `Admixed=1, CEU=0, YRI=0`.
+
+Map Format
+
+`chr` - Chromosome of coordinate (1-22, X)
+`var` - Variant identifier
+`pos cM` - Position in centimorgans
+`pos bp` - Base pair coordinate
+
+```
+{chr}\t{var}\t{pos cM}\t{pos bp}
+```
+Beagle Genetic Maps used in simulation (GRCh38): http://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/
+
+
+Outfile Format
+
+`Sample Header` - Name of sample following the structure `Sample_{number}_{hap}`, e.g. `Sample_10_1` for sample number 10, haplotype 1
+`pop` - Population label corresponding to the index of the population in the dat file; in the example above, CEU = 1 and YRI = 2
+`chr` - Chromosome (1-22, X)
+
+```
+Sample Header
+{pop}\t{chr}\t{pos bp}
+...
+Sample Header 2
+...
+```
+
+## Examples
+
+Example Command
+```
+haptools simgenotype \
+  --invcf 1000Genomes.vcf.gz \
+  --sample_info /path/to/sampleinfo.csv \
+  --model /path/to/model/file.dat \
+  --map /path/to/plink/file/ \
+  --out /path/to/output
+```
diff --git a/haptools/simulate/admix_storage.py b/haptools/simgenotype/admix_storage.py
similarity index 100%
rename from haptools/simulate/admix_storage.py
rename to haptools/simgenotype/admix_storage.py
diff --git a/haptools/simulate/sim_admixture.py b/haptools/simgenotype/sim_admixture.py
similarity index 100%
rename from haptools/simulate/sim_admixture.py
rename to haptools/simgenotype/sim_admixture.py
diff --git a/haptools/simphenotype/README.md b/haptools/simphenotype/README.md
new file mode 100644
index 00000000..9c46b012
--- /dev/null
+++ b/haptools/simphenotype/README.md
@@ -0,0 +1,3 @@
+# Haptools simphenotype
+
+UNDER CONSTRUCTION
\ No newline at end of file
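
Note on the `model.dat` layout documented in `haptools/simgenotype/README.md` above: the two-line format is simple enough to read with a few lines of Python. The sketch below is illustrative only and is not part of this diff or of haptools itself. The names `AdmixModel` and `parse_model` are hypothetical, and it assumes exactly the two-line layout shown in the README: a header line with the sample count followed by population labels, then one line with the generation count followed by one frequency per label.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AdmixModel:
    """Hypothetical container for the fields described in the README."""
    num_samples: int
    num_generations: int
    pop_freqs: Dict[str, float]  # population label -> frequency


def parse_model(text: str) -> AdmixModel:
    """Parse the two-line model.dat layout (assumed, per the README)."""
    # Split each non-empty line on whitespace.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) != 2:
        raise ValueError("expected a header line and one generation line")
    header, params = rows

    # Header: {num_samples} followed by the population labels.
    num_samples = int(header[0])
    labels = header[1:]

    # Second line: {num_generations} followed by one frequency per label.
    num_generations = int(params[0])
    freqs = [float(value) for value in params[1:]]

    if num_generations <= 0:
        raise ValueError("num_generations must be > 0")
    if len(labels) != len(freqs):
        raise ValueError("need exactly one frequency per population label")

    return AdmixModel(num_samples, num_generations, dict(zip(labels, freqs)))


# The example model.dat from the README.
example = "40 Admixed CEU YRI\n6 0 0.2 0.8\n"
model = parse_model(example)
print(model)
```

Applied to the README's example file, this yields 40 samples, 6 generations, and the label-to-frequency mapping for `Admixed`, `CEU`, and `YRI`. Whether the frequencies must sum to 1 is not stated in the README, so the sketch does not enforce it.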