Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

combine multiple donor vcf #33

Open
AnjaliC4 opened this issue Nov 27, 2021 · 8 comments
Open

combine multiple donor vcf #33

AnjaliC4 opened this issue Nov 27, 2021 · 8 comments

Comments

@AnjaliC4
Copy link

AnjaliC4 commented Nov 27, 2021

Hello,

I have 2 pooled snATAC donors and their individual bulk RNA-seq data. To demultiplex, my strategy was to call common variant SNPs for the two individuals separately using CellSNP, followed by using their cellsnp.cells.vcf as donor-vcf to demultiplex pooled samples. However, I am unsure how to combine the two donor vcf files generated from cellsnp to provide them as --donor vcf files for vireo.

What Ive tried:
cellsnp-lite $DIR_snRNA_BAM/$value_*.bam -I $value -R genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz --minMAF 0.1 --minCOUNT 100 -p 32 --cellTAG None --UMItag None --genotype --gzip --minMAPQ 30 --inclFLAG 2
bctools index Sample1.cells.vcf.gz
bcftools index Sample2.cells.vcf.gz
bcftools merge - Sample1.cells.vcf.gz Sample2.cells.vcf.gz -O combined.vcf -Oz.

Does this make sense?

Also, how can I match the demultiplexed donors to their original donor.vcf file?
And, If the information for only one of the donors (out of 2) is available - do you suggest using --forcelearnGT ?

Thanks

@huangyh09
Copy link
Collaborator

Hi,

Thanks for the question. bcftools merge should work. Alternatively, you can re-run cellsnp-lite, but use two bam files together as file1.bam,file2.bam. See examples by just treating them as two smart-seq files: https://cellsnp-lite.readthedocs.io/en/latest/manual.html#mode-1b-well-based-single-cells-or-bulk

In your case, as you have the genotype for each donor, you can specify it in the vireo command line by adding -d $DONOR_GT_FILE, then you will have the donor id. If only one donor is available, maybe better to run without donor VCF but check to similarity between donors to further confirm it, see example:
https://github.com/single-cell-genetics/vireo/blob/master/examples/donor_match.ipynb

Yuanhua

@AnjaliC4
Copy link
Author

AnjaliC4 commented Nov 29, 2021

Hi Yuanhua, thanks for your response.
To clarify, I do not have the genotype of each donor. I have pooled donors for scATAC-seq and individual scRNA-seq. I want to use CellSNP to call SNPs on scRNA (by treating them as bulk) and use those to deconvolute pooled scATAC.
I restricted the CellSNP run to 1000 genome variants in scATAC and scRNA - which unfortuantely could not deconvolve many cells.
What do you think may be the best strategy here?

Thanks very much.

@huangyh09
Copy link
Collaborator

Thanks for clarifying. It sounds fine to genotype donors by using scRNA, and you can still use multiple scRNA bam files as bulk samples to get a combined donor VCF file.

Your second question might be more worrying, as the covered SNPs in scATAC may not be sufficient if the sequencing coverage is not enough. Usually, people find the scACAT coverage is okay for demultiplexing. In your case, maybe you could perform the de novo deconvolution without donor VCF, so that you can use most SNPs in your scATAC, then compare it back to scRNA based donors, with the notebook mentioned above.

Y

@AnjaliC4
Copy link
Author

AnjaliC4 commented Dec 1, 2021

Thanks Yuanhua,

Could you also please suggest to me how to subset the 1000genome vcf file you provide to remove chrX from it. I tried using vcftools --not-chr (not sure if this worked well)

Thanks very much.

@AnjaliC4
Copy link
Author

AnjaliC4 commented Dec 2, 2021

Hi, I actually encountered another problem. I am trying to merge a few GT_donors.vireo.vcf.gz files for some analysis. For that, I tried using bcftools reheader to rename samples in this vcf (which works fine). However, bcftools merge needs indexed file, and bcftools index - it throws error. I tried bcftools sort to position-sort the vcf -which does not work either>
Here is what I did:
bcftools reheader -s name.txt GT_donors.vireo.vcf.gz > $filename.vireo.vcf.gz ##to provide meaningful names
bcftools index vireo.vcf.gz
E::hts_idx_push] Unsorted positions on sequence #1: 13400717 followed by 13392592
index: failed to create index
bcftools sort vireo.vcf.gz > vireo.sort.vcf.gz
Writing to /tmp/bcftools-sort.LB7lyY
[W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse_info] INFO 'AD' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'DP' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'OTH' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'GT' at 1:791559 is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'AD' at 1:791559 is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'DP' at 1:791559 is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'PL' at 1:791559 is not defined in the header, assuming Type=String
Error encountered while parsing the input at 1:791559
Cleaning

How can I properly merge multiple GT_donors.vireo.vcf.gz files?

@huangyh09
Copy link
Collaborator

Thanks for sharing this. Unfortunately, the potential incompatibility is still an outstanding issue and may need further tests in future releases. For the moment, you may try reheader it first and then try sort and/or index.

YH

@AnjaliC4
Copy link
Author

AnjaliC4 commented Dec 5, 2021

Hello,
I did what you suggested and added header information from cellSNP.cells.vcf to GT_donor.vcf and changed sample name. Then I tried merging mutliple GT_donor.vcf files using bcftools merge -Oz to produce multi-sample vcf. This is giving me the following error: (even though all the GT vcf files produced were from cellSNP at 1000genome common variants)
The REF prefixes differ: T vs G (1,1)
Failed to merge alleles at 7:341966

The file eventually merges anyway somehow, however upon using vireoSNP.vcf.match_VCF_samples(), it gives this error:
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.7.9/lib/python3.7/gzip.py in read(self, size)
491 break
492 if buf == b"":
--> 493 raise EOFError("Compressed file ended before the "
494 "end-of-stream marker was reached")

I tried sorting the merged file with bcftools sort. That gave me following error:

Writing to /tmp/bcftools-sort.woMldC
[E::bgzf_read_block] Failed to read BGZF block data at offset 20966172 expected 6154 bytes; hread returned 5330
[E::vcf_parse_format] Number of columns at 2:47025126 does not match the number of samples (37 vs 84)
Error encountered while parsing the input
Cleaning

I think the file does not get created properly.
There should be an easy way to merge these files. My end goal is to align multiple GT_donor.files to another vcf file of genotypes. If there is no easy solution to my above problem, is it possible to add multiple GT_donor vcf files to vireoSNP.vcf.match_VCF_samples()

Please help. Thanks.

@huangyh09
Copy link
Collaborator

It looks the ref alleles are different from different samples. The ref allele from cellsnp-vireo pipeline should give the same as 1000genome. Not sure what ref allele in your other samples.

Otherwise, you may consider bcftools +fixref to synchronise them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants