Python for local ancestry estimation
(will be performed by script itself in future)
- In case we have .bed .bim .fam files, we need to convert to vcf using plink:
plink2 --bfile <bfile_prefix> --recode vcf --out <vcf_file>
- Calculate snp frequencies for population groups using bcftools.
User groups defined in file
configs/vcf_groups.txt
:
cat <vcf_file> | bcftools view -c 1 -Ou | bcftools +fill-tags -Ou -- -S configs/vcf_groups.txt -t AF | bcftools query -H -f "%CHROM %POS %ID %AF_<group> %AF_Mediterranean %AF_NativeAmerican %AF_NorthEastAsian %AF_NorthernEuropean %AF_Oceanian %AF_SouthAfrican %AF_SouthEastAsian %AF_SouthWestAsian %AF_SubsaharanAfrican\n" > <group>.<sample>.txt
In case vcf file is (b)gzipped use samtools tabix.
- Then use main script. Currently supported modes: bayes, fb. Note: fb is around 20 times slower.
python3 src/process_individuals.py --mode fb --window-len 200 <group>.<sample>.txt
Example pipeline:
plink2 --bfile America.QuechuaCandelaria_3.txt_GENO --recode vcf --out America.QuechuaCandelaria_3_GENO
cat America.QuechuaCandelaria_3_GENO.vcf | bcftools view -c 1 -Ou | bcftools +fill-tags -Ou -- -S vcf_groups.txt -t AF | bcftools query -H -f "%CHROM %POS %ID %AF_QuechuaCandelaria_3 %AF_Mediterranean %AF_NativeAmerican %AF_NorthEastAsian %AF_NorthernEuropean %AF_Oceanian %AF_SouthAfrican %AF_SouthEastAsian %AF_SouthWestAsian %AF_SubsaharanAfrican\n" > "QuechuaCandelaria_3.GA002786.txt"
python3 src/process_individuals.py --mode fb --window-len 200 "QuechuaCandelaria_3.GA002786.txt"
for vcf file with around 120k SNPs.
mode | exec time, min | ? |
---|---|---|
fb | 20 | ? |
bayes | 1 | ? |
softmax | 0.5 | ? |