From c617183df5f93efc510d9acb4afcb91a48b01605 Mon Sep 17 00:00:00 2001
From: Smetanin Alexander
Date: Thu, 23 Apr 2020 21:51:23 +0300
Subject: [PATCH] more info in readme

---
 README.md | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index be7be77..daf4511 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,17 @@
 # pylae
 Python for local ancestry estimation
 
+## Requirements and installation:
+
+* Python 3.5+ is required
+* bcftools
+* (optionally) plink / plink2
+
+Installing python requirements:
+```bash
+pip3 install -r requirements.txt
+```
+## Usage:
 ### Data preparation stage:
 (will be performed by script itself in future)
 
@@ -26,7 +37,7 @@ Note: fb is around 20 times slower.
 python3 src/process_individuals.py --mode fb --window-len 200 ..txt
 ```
 
-Example pipeline:
+### Example pipeline:
 ```bash
 plink2 --bfile America.QuechuaCandelaria_3.txt_GENO --recode vcf --out America.QuechuaCandelaria_3_GENO
 
@@ -56,6 +67,20 @@ Tsv (tab-separated) file with a list of all SNPs and probabilities that it came
 
 3. `___stats.csv`
 Csv file with statistics that shows the fraction of windows assigned to each population.
+## Algorithm explanation
+Algorithm can be split into 4 stages:
+* Data preparation
+* Calculating probabilities of assigning each SNP to populations.
+ There are 3 modes in which it can be done, they are explained below.
+* Choosing best population for each window with selected length (in SNPs).
+In this stage we convert probabilities to information with entropy formula:
+-p * log (p). Then this information (I) is summed in each window and the window
+is assigned to population with max I. Pop = argmax(I)
+* Calculating fraction of windows assigned to each population.
+
+
+Depending on your needs you might need only one file or all of them.
+
 ## Modes explanation
 
 ### 1. Bayes
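
The last two stages listed under "Algorithm explanation" in this patch (assigning each window to a population, then computing the fraction of windows per population) can be illustrated with a minimal sketch. This is not pylae's actual code: the function names `snp_information`, `assign_windows` and `window_fractions` are hypothetical, the input is assumed to be a list of per-SNP probability rows (one value per population), the logarithm is assumed to be natural, and `0 * log(0)` is taken to be 0.

```python
import math
from collections import Counter


def snp_information(p):
    """Return -p * log(p) for one probability; 0 * log(0) is treated as 0 (assumption)."""
    return -p * math.log(p) if p > 0 else 0.0


def assign_windows(probs, window_len):
    """Assign each window of `window_len` SNPs to the population with max summed I.

    probs: list of rows, one row per SNP, one probability per population.
    Returns a list with the winning population index for every window.
    The last window may be shorter than `window_len`.
    """
    winners = []
    for start in range(0, len(probs), window_len):
        window = probs[start:start + window_len]
        n_pops = len(window[0])
        # Sum the information I over all SNPs of the window, per population.
        totals = [sum(snp_information(row[k]) for row in window) for k in range(n_pops)]
        # Pop = argmax(I); ties resolve to the lowest population index.
        winners.append(max(range(n_pops), key=totals.__getitem__))
    return winners


def window_fractions(winners, n_pops):
    """Return the fraction of windows assigned to each population."""
    counts = Counter(winners)
    return [counts[k] / len(winners) for k in range(n_pops)]


if __name__ == "__main__":
    # Toy input: 4 SNPs, 2 populations, windows of 2 SNPs.
    toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]
    toy_winners = assign_windows(toy_probs, window_len=2)
    print(toy_winners, window_fractions(toy_winners, n_pops=2))
```

In terms of the README, `window_len` plays the role of `--window-len`, and the result of `window_fractions` is the kind of per-population fraction that the stats file is described as reporting.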