Update README.md
Arkkienkeli authored Jul 5, 2021
1 parent 9177bdd commit 508b11c
Showing 1 changed file with 27 additions and 17 deletions.
44 changes: 27 additions & 17 deletions README.md
@@ -30,19 +30,29 @@ cat <vcf_file> | bcftools view -c 1 -Ou | bcftools +fill-tags -Ou -- -S configs/
If the VCF file is (b)gzipped, use samtools tabix.

### Script usage:
Currently supported mode: bayes.
Currently supported modes:
#### Bayes:

```bash
python3 src/bayesian_pipeline.py --sample <sample_name> --admixtures <admixture_vectors_file> --window-len 50 <group>.<sample>.txt
```
#### Bayes viterbi (used in the paper):

```bash
python3 src/bayes_viterbi.py --sample <sample_name> --admixtures <admixture_vectors_file> --window-len 50 <group>.<sample>.txt -m
```

The `-m` option switches on "merged" window mode (windows will overlap by 1 SNP).
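
As a rough illustration of what a 1-SNP overlap could mean for window boundaries, here is a minimal sketch. It is not taken from `src/bayes_viterbi.py`; the function name `split_into_windows` and the assumption that each merged window starts on the last SNP of the previous window are hypothetical.

```python
def split_into_windows(n_snps, window_len=50, merged=False):
    """Return (start, end) index pairs, end exclusive, covering n_snps SNPs.

    Hypothetical helper: assumes that in merged mode each window starts
    on the last SNP of the previous window (an overlap of exactly 1 SNP).
    """
    if window_len < 2:
        raise ValueError("window_len must be at least 2")
    # Non-merged windows are disjoint; merged windows share one SNP.
    step = window_len - 1 if merged else window_len
    windows = []
    start = 0
    while start < n_snps:
        end = min(start + window_len, n_snps)
        windows.append((start, end))
        if end == n_snps:
            break
        start += step
    return windows


if __name__ == "__main__":
    print(split_into_windows(10, window_len=4))
    print(split_into_windows(10, window_len=4, merged=True))
```

With `merged=True` the last index of one window equals the first index of the next, so a boundary SNP contributes to both windows.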


### Example pipeline:
```bash
plink2 --bfile America.QuechuaCandelaria_3.txt_GENO --recode vcf --out America.QuechuaCandelaria_3_GENO
plink2 --bfile sample.txt_GENO --recode vcf --out sample

cat America.QuechuaCandelaria_3_GENO.vcf | bcftools view -c 1 -Ou | bcftools +fill-tags -Ou -- -S vcf_groups.txt -t AF | bcftools query -H -f "%CHROM %POS %ID %AF_QuechuaCandelaria_3 %AF_Mediterranean %AF_NativeAmerican %AF_NorthEastAsian %AF_NorthernEuropean %AF_Oceanian %AF_SouthAfrican %AF_SouthEastAsian %AF_SouthWestAsian %AF_SubsaharanAfrican\n" > "QuechuaCandelaria_3.GA002786.txt"
cat sample.vcf | bcftools view -c 1 -Ou | bcftools +fill-tags -Ou -- -S vcf_groups.txt -t AF | bcftools query -H -f "%CHROM %POS %ID %AF_QuechuaCandelaria_3 %AF_Mediterranean %AF_NativeAmerican %AF_NorthEastAsian %AF_NorthernEuropean %AF_Oceanian %AF_SouthAfrican %AF_SouthEastAsian %AF_SouthWestAsian %AF_SubsaharanAfrican\n" > "population.sample.txt"

python3 src/bayesian_pipeline.py --window-len 50 "QuechuaCandelaria_3.GA002786.txt"
python3 src/bayesian_pipeline.py --window-len 50 "population.sample.txt"
```


@@ -60,6 +70,19 @@ Csv file with statistics that shows the fraction of windows assigned to each pop
Depending on your needs you might need only one file or all of them.

## Algorithm explanation

### Using PyLAE with different genomes and/or sets of markers
A different set of putative ancestral populations or a different set of markers requires
additional processing. First, we need to collect a database of putatively un-admixed individuals.
If there is an existing validated set of ancestry-informative markers, these markers should be used
to run admixture analysis in supervised mode. For each self-reported ancestry, samples should be clustered
based on their admixture profiles to identify subgroups within each self-reported ancestry. These
subgroups are then examined using information about the studied population's history, and the
most representative subset is retained. Then, putative ancestral populations (15 to 20
individuals per group) are generated for every ancestry. The validity and stability of the ancestral
populations are evaluated using 1) PCA, 2) leave-one-out supervised admixture, and 3)
application of supervised admixture to the original dataset.

The algorithm can be split into 4 stages:
* Data preparation
* Calculating the probabilities of assigning each SNP to populations using the naive Bayes algorithm.
@@ -69,18 +92,5 @@ is assigned to population with max I. Pop = argmax(I)
* Calculating fraction of windows assigned to each population.
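
The last two stages (assigning each window to the population with the highest score `I`, then computing per-population fractions) can be sketched as follows. This is only an illustration: how `I` is computed is not shown in this excerpt, so the scores are taken as given, and the names `assign_window` and `window_fractions` are hypothetical rather than PyLAE functions.

```python
from collections import Counter


def assign_window(scores):
    """Pop = argmax(I): return the population with the highest window score.

    `scores` maps population name -> score I for a single window
    (hypothetical input; the score computation itself is not sketched here).
    """
    return max(scores, key=scores.get)


def window_fractions(assignments):
    """Return the fraction of windows assigned to each population."""
    total = len(assignments)
    if total == 0:
        return {}
    counts = Counter(assignments)
    return {pop: n / total for pop, n in counts.items()}


if __name__ == "__main__":
    # Made-up scores for three windows and two populations.
    per_window_scores = [
        {"Mediterranean": 0.9, "NativeAmerican": 0.4},
        {"Mediterranean": 0.2, "NativeAmerican": 0.7},
        {"Mediterranean": 0.1, "NativeAmerican": 0.8},
    ]
    assigned = [assign_window(s) for s in per_window_scores]
    print(window_fractions(assigned))
```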




## Modes explanation
### 1. Bayes
The probability of assigning a SNP to a population is calculated according to Bayes' formula:
<img src="https://render.githubusercontent.com/render/math?math=P(Population | SNP) = \frac{P(SNP | Population) \cdot P(Population)}{P(SNP)}">
Here,
<img src="https://render.githubusercontent.com/render/math?math=P(SNP | Population)"> can be estimated as the frequency of the SNP in the selected population.
<img src="https://render.githubusercontent.com/render/math?math=P(Population) = \frac{1}{|Populations|}"> - we take prior population probabilities from admixture vectors.
<img src="https://render.githubusercontent.com/render/math?math=P(SNP)"> can be estimated as the average frequency of the SNP among all populations or samples.
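
A minimal sketch of this calculation for a single SNP, assuming the per-population frequencies are already known. The function name `posterior` and its arguments are hypothetical and do not come from the PyLAE sources. With the uniform prior shown above, the denominator reduces to the average frequency across populations.

```python
def posterior(snp_freqs, priors=None):
    """Return P(Population | SNP) for each population via Bayes' formula.

    snp_freqs: dict mapping population -> P(SNP | Population),
               i.e. the SNP frequency in that population.
    priors:    dict mapping population -> P(Population); defaults to the
               uniform prior 1 / |Populations| (assumption for this sketch).
    """
    pops = list(snp_freqs)
    if priors is None:
        priors = {p: 1.0 / len(pops) for p in pops}
    # P(SNP) as the prior-weighted average frequency across populations.
    evidence = sum(snp_freqs[p] * priors[p] for p in pops)
    if evidence == 0:
        # The SNP is absent from every population; nothing can be inferred.
        return {p: 0.0 for p in pops}
    return {p: snp_freqs[p] * priors[p] / evidence for p in pops}


if __name__ == "__main__":
    # Made-up frequencies for illustration only.
    print(posterior({"Mediterranean": 0.8, "NativeAmerican": 0.2}))
```

Passing an explicit `priors` dict instead of relying on the default is one way to plug in values taken from an admixture vector.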



## Preprint about the method
https://www.biorxiv.org/content/10.1101/2020.11.13.380105v1
