Skip to content

Latest commit

 

History

History
44 lines (33 loc) · 1.75 KB

README.md

File metadata and controls

44 lines (33 loc) · 1.75 KB

pylae

Python for local ancestry estimation

Data preparation stage:

(will be performed by script itself in future)

  1. In case we have .bed .bim .fam files, we need to convert to vcf using plink:
plink2 --bfile <bfile_prefix> --recode vcf --out <vcf_file>
  1. Calculate snp frequencies for population groups using bcftools. User groups defined in file configs/vcf_groups.txt:
cat <vcf_file> | bcftools view -c 1 -Ou | bcftools +fill-tags -Ou -- -S configs/vcf_groups.txt -t AF | bcftools query -H -f "%CHROM %POS %ID %AF_<group> %AF_Mediterranean %AF_NativeAmerican %AF_NorthEastAsian %AF_NorthernEuropean %AF_Oceanian %AF_SouthAfrican %AF_SouthEastAsian %AF_SouthWestAsian %AF_SubsaharanAfrican\n" > <group>.<sample>.txt

In case vcf file is (b)gzipped use samtools tabix.

  1. Then use main script. Currently supported modes: bayes, fb. Note: fb is around 20 times slower.
python3 src/process_individuals.py --mode fb --window-len 200  <group>.<sample>.txt

Example pipeline:

plink2 --bfile America.QuechuaCandelaria_3.txt_GENO --recode vcf --out America.QuechuaCandelaria_3_GENO

cat America.QuechuaCandelaria_3_GENO.vcf | bcftools view -c 1 -Ou | bcftools +fill-tags -Ou -- -S vcf_groups.txt -t AF | bcftools query -H -f "%CHROM %POS %ID %AF_QuechuaCandelaria_3 %AF_Mediterranean %AF_NativeAmerican %AF_NorthEastAsian %AF_NorthernEuropean %AF_Oceanian %AF_SouthAfrican %AF_SouthEastAsian %AF_SouthWestAsian %AF_SubsaharanAfrican\n" > "QuechuaCandelaria_3.GA002786.txt"

python3 src/process_individuals.py --mode fb --window-len 200  "QuechuaCandelaria_3.GA002786.txt"

Estimated performance:

for vcf file with around 120k SNPs.

mode exec time, min ?
fb 20 ?
bayes 1 ?
softmax 0.5 ?