combine multiple donor vcf #33

AnjaliC4 · 2021-11-27T00:43:28Z

Hello,

I have 2 pooled snATAC donors and their individual bulk RNA-seq data. To demultiplex, my strategy was to call common variant SNPs for the two individuals separately using CellSNP, followed by using their cellsnp.cells.vcf as donor-vcf to demultiplex pooled samples. However, I am unsure how to combine the two donor vcf files generated from cellsnp to provide them as --donor vcf files for vireo.

What Ive tried:
cellsnp-lite $DIR_snRNA_BAM/$value_*.bam -I $value -R genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz --minMAF 0.1 --minCOUNT 100 -p 32 --cellTAG None --UMItag None --genotype --gzip --minMAPQ 30 --inclFLAG 2
bctools index Sample1.cells.vcf.gz
bcftools index Sample2.cells.vcf.gz
bcftools merge - Sample1.cells.vcf.gz Sample2.cells.vcf.gz -O combined.vcf -Oz.

Does this make sense?

Also, how can I match the demultiplexed donors to their original donor.vcf file?
And, If the information for only one of the donors (out of 2) is available - do you suggest using --forcelearnGT ?

Thanks

huangyh09 · 2021-11-29T14:25:49Z

Hi,

Thanks for the question. bcftools merge should work. Alternatively, you can re-run cellsnp-lite, but use two bam files together as file1.bam,file2.bam. See examples by just treating them as two smart-seq files: https://cellsnp-lite.readthedocs.io/en/latest/manual.html#mode-1b-well-based-single-cells-or-bulk

In your case, as you have the genotype for each donor, you can specify it in the vireo command line by adding -d $DONOR_GT_FILE, then you will have the donor id. If only one donor is available, maybe better to run without donor VCF but check to similarity between donors to further confirm it, see example:
https://github.com/single-cell-genetics/vireo/blob/master/examples/donor_match.ipynb

Yuanhua

AnjaliC4 · 2021-11-29T16:33:46Z

Hi Yuanhua, thanks for your response.
To clarify, I do not have the genotype of each donor. I have pooled donors for scATAC-seq and individual scRNA-seq. I want to use CellSNP to call SNPs on scRNA (by treating them as bulk) and use those to deconvolute pooled scATAC.
I restricted the CellSNP run to 1000 genome variants in scATAC and scRNA - which unfortuantely could not deconvolve many cells.
What do you think may be the best strategy here?

Thanks very much.

huangyh09 · 2021-11-30T05:56:01Z

Thanks for clarifying. It sounds fine to genotype donors by using scRNA, and you can still use multiple scRNA bam files as bulk samples to get a combined donor VCF file.

Your second question might be more worrying, as the covered SNPs in scATAC may not be sufficient if the sequencing coverage is not enough. Usually, people find the scACAT coverage is okay for demultiplexing. In your case, maybe you could perform the de novo deconvolution without donor VCF, so that you can use most SNPs in your scATAC, then compare it back to scRNA based donors, with the notebook mentioned above.

Y

AnjaliC4 · 2021-12-01T04:41:48Z

Thanks Yuanhua,

Could you also please suggest to me how to subset the 1000genome vcf file you provide to remove chrX from it. I tried using vcftools --not-chr (not sure if this worked well)

Thanks very much.

AnjaliC4 · 2021-12-02T16:57:53Z

Hi, I actually encountered another problem. I am trying to merge a few GT_donors.vireo.vcf.gz files for some analysis. For that, I tried using bcftools reheader to rename samples in this vcf (which works fine). However, bcftools merge needs indexed file, and bcftools index - it throws error. I tried bcftools sort to position-sort the vcf -which does not work either>
Here is what I did:
bcftools reheader -s name.txt GT_donors.vireo.vcf.gz > $filename.vireo.vcf.gz ##to provide meaningful names
bcftools index vireo.vcf.gz
E::hts_idx_push] Unsorted positions on sequence #1: 13400717 followed by 13392592
index: failed to create index
bcftools sort vireo.vcf.gz > vireo.sort.vcf.gz
Writing to /tmp/bcftools-sort.LB7lyY
[W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse_info] INFO 'AD' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'DP' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'OTH' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'GT' at 1:791559 is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'AD' at 1:791559 is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'DP' at 1:791559 is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'PL' at 1:791559 is not defined in the header, assuming Type=String
Error encountered while parsing the input at 1:791559
Cleaning

How can I properly merge multiple GT_donors.vireo.vcf.gz files?

huangyh09 · 2021-12-05T00:55:36Z

Thanks for sharing this. Unfortunately, the potential incompatibility is still an outstanding issue and may need further tests in future releases. For the moment, you may try reheader it first and then try sort and/or index.

YH

AnjaliC4 · 2021-12-05T07:54:32Z

Hello,
I did what you suggested and added header information from cellSNP.cells.vcf to GT_donor.vcf and changed sample name. Then I tried merging mutliple GT_donor.vcf files using bcftools merge -Oz to produce multi-sample vcf. This is giving me the following error: (even though all the GT vcf files produced were from cellSNP at 1000genome common variants)
The REF prefixes differ: T vs G (1,1)
Failed to merge alleles at 7:341966

The file eventually merges anyway somehow, however upon using vireoSNP.vcf.match_VCF_samples(), it gives this error:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.7.9/lib/python3.7/gzip.py in read(self, size)
491 break
492 if buf == b"":
--> 493 raise EOFError("Compressed file ended before the "
494 "end-of-stream marker was reached")

I tried sorting the merged file with bcftools sort. That gave me following error:

Writing to /tmp/bcftools-sort.woMldC
[E::bgzf_read_block] Failed to read BGZF block data at offset 20966172 expected 6154 bytes; hread returned 5330
[E::vcf_parse_format] Number of columns at 2:47025126 does not match the number of samples (37 vs 84)
Error encountered while parsing the input
Cleaning

I think the file does not get created properly.
There should be an easy way to merge these files. My end goal is to align multiple GT_donor.files to another vcf file of genotypes. If there is no easy solution to my above problem, is it possible to add multiple GT_donor vcf files to vireoSNP.vcf.match_VCF_samples()

Please help. Thanks.

huangyh09 · 2021-12-05T13:31:57Z

It looks the ref alleles are different from different samples. The ref allele from cellsnp-vireo pipeline should give the same as 1000genome. Not sure what ref allele in your other samples.

Otherwise, you may consider bcftools +fixref to synchronise them.

hxj5 mentioned this issue Aug 15, 2023

scRNA-seq from different donors as a genotype-vcf input for vireo single-cell-genetics/cellsnp-lite#100

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

combine multiple donor vcf #33

combine multiple donor vcf #33

AnjaliC4 commented Nov 27, 2021 •

edited

Loading

huangyh09 commented Nov 29, 2021

AnjaliC4 commented Nov 29, 2021 •

edited

Loading

huangyh09 commented Nov 30, 2021

AnjaliC4 commented Dec 1, 2021 •

edited

Loading

AnjaliC4 commented Dec 2, 2021

huangyh09 commented Dec 5, 2021

AnjaliC4 commented Dec 5, 2021 •

edited

Loading

huangyh09 commented Dec 5, 2021

combine multiple donor vcf #33

combine multiple donor vcf #33

Comments

AnjaliC4 commented Nov 27, 2021 • edited Loading

huangyh09 commented Nov 29, 2021

AnjaliC4 commented Nov 29, 2021 • edited Loading

huangyh09 commented Nov 30, 2021

AnjaliC4 commented Dec 1, 2021 • edited Loading

AnjaliC4 commented Dec 2, 2021

huangyh09 commented Dec 5, 2021

AnjaliC4 commented Dec 5, 2021 • edited Loading

huangyh09 commented Dec 5, 2021

AnjaliC4 commented Nov 27, 2021 •

edited

Loading

AnjaliC4 commented Nov 29, 2021 •

edited

Loading

AnjaliC4 commented Dec 1, 2021 •

edited

Loading

AnjaliC4 commented Dec 5, 2021 •

edited

Loading