Error with Salmon build: It removes identical transcript sequences #214
Comments
Hi Kristoffer, The duplicate transcript issue is a frustrating one. It came to our attention when we noticed that ensembl often annotated transcripts on patch / haplotype contigs that were identical and unlikely to be different from more "canonical" transcripts in any way. Further, these transcripts are indistinguishable from the quantification perspective. That being said, the removal of sequence duplicate transcripts is optional in Salmon. If you pass Best, |
That is frustrating. But I have to agree the haplotype problem is probably the larger of the two... |
Yea. Both are frustrating, which is why we spam warning messages to the console when we remove duplicates. Sorry if this default behavior caused you any trouble, but hopefully it's easy to recover these quants without rerunning anything using the map of collapsed transcripts. |
Actually it's fairly easy with GENCODE annotation as they do not have haplotypes in the general annotation - so we can just use the |
Hi, |
Hi @rbenel, Can you post here the output of salmon's indexing phase? Does it mention discarding anything? Presumably, we can just do a set difference on the isoform sets to see what's happening. --Rob |
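The set difference suggested above could be sketched roughly as follows. This is only an illustration, not part of salmon: it assumes the annotation is a GTF with `transcript_id "..."` attributes and that the quantification output is a tab-separated `quant.sf` with a `Name` column. The function names are hypothetical, and trimming names at the first `|` is an assumption made for GENCODE-style headers.

```python
import csv
import re

# Matches the transcript_id attribute in a GTF attribute column.
_TX_ID = re.compile(r'transcript_id "([^"]+)"')

def gtf_transcript_ids(gtf_lines):
    """Collect every transcript_id mentioned in the GTF lines."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        match = _TX_ID.search(line)
        if match:
            ids.add(match.group(1))
    return ids

def quant_transcript_ids(quant_lines):
    """Collect transcript names from the 'Name' column of a quant.sf table."""
    reader = csv.DictReader(quant_lines, delimiter="\t")
    # Assumption: GENCODE-style names look like 'ID|gene|...'; keep only the ID.
    return {row["Name"].split("|", 1)[0] for row in reader}

def missing_from_quant(gtf_lines, quant_lines):
    """Transcripts annotated in the GTF but absent from quant.sf."""
    return sorted(gtf_transcript_ids(gtf_lines) - quant_transcript_ids(quant_lines))
```

With real files, both arguments can simply be open file handles, since each function only iterates over lines.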
Hi @rob-p,
I reproduced the warnings from the initial run w/o the
|
Could you also specify exactly which of the GENCODE files you are using? |
Yes, it is in the previous post.. https://www.gencodegenes.org/releases/current.html -> PRI. |
Could you post one of your output quant.sf files? I can investigate. |
Hi, Here is a link to Dropbox: https://www.dropbox.com/s/herbw9te1g9sgv2/quant.sf?dl=0 |
Hi @rbenel, This is quite interesting. So I downloaded both the Gencode transcriptome (all transcript sequences) and the annotation you point out (PRI --- comprehensive gene annotation). There are a few transcripts present in the latter but not the former:
Specifically, these are not dropped by salmon. They are not in the input reference transcriptome file. So it looks like Gencode includes these in the GTF, but not in the transcriptome fasta. I looked at the first few, and nothing immediately jumps out as to why Gencode would have dropped them from the fasta file. Do these transcript names have any special significance to you? If you really want to include them, one option would be to use the GTF + the genome, and a tool like |
GENCODE provides 1 FASTA file called "Transcript sequences", which "only" contains the "CHR" (chromosomal) regions. GENCODE provides many GTF files (specifically 9). The GTF file corresponding to the FASTA file mentioned above is the "Comprehensive gene annotation" from the "CHR" regions (aka chromosomal regions), which is the first on their list. You have downloaded the "Pri" (third entry), which covers the normal chromosomes (Chr) as well as scaffolds, which explains the 68 extra transcripts. Specifically, the scaffolds included in "Pri" but not in "Chr" are:
So the solution is as @rob-p suggested:
Cheers |
Thank you both! I need to look into those transcripts, to see if anything looks important. |
Hi all, [Step 1 of 4] : counting k-mers So, using gffread, I created a transcripts.fa file. Using this new transcripts.fa, I ran the above-mentioned salmon index command with decoys again, but the warning message showed up again: [Step 1 of 4] : counting k-mers My concern is whether this would cause problems for the rest of the downstream analysis. Thanks, |
Hi @Tima-Ze, This should not cause any trouble with downstream analysis. The indexing procedure is simply informing you that these transcripts (about which you are being warned) are shorter than the seed length used for alignment. This means that it simply won't be possible for fragments to align to these transcripts, and so they will always have a 0 abundance in the resulting |
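To see which transcripts are affected before indexing, one could list the FASTA records that are shorter than the k-mer size. The sketch below is a minimal, hypothetical helper, not salmon's own check: `k=31` is an assumed value (use whatever `-k` the index was built with), and whether the boundary is strict or inclusive in salmon itself is also an assumption here.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def shorter_than_k(lines, k=31):
    """Return (name, length) for every record shorter than k (assumed strict)."""
    return [(name, len(seq)) for name, seq in read_fasta(lines) if len(seq) < k]
```

Passing an open handle on `transcripts.fa` as `lines` would give the names and lengths of the records that cannot receive any mapped fragments.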
We just discovered that Salmon build removes/collapses identical transcripts. It is very problematic that Salmon does this, as many genes are duplicated throughout the genome. When the duplicates are collapsed during index building, one of them is arbitrarily selected (the others are removed), meaning all downstream analysis will assume all expression originates from one genomic region instead of many.
In the most recent Gencode mouse release this problem affects 1563 sequences annotated as 13812 and covers all transcript types (incl. 840 protein coding, although the major ones are lincRNA (n=3658) and snoRNAs (n=2622)).
We strongly believe that if one wants to analyse these duplicated regions jointly, this should be done just like one would sum all transcripts from a particular gene to get the gene expression.
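As a rough illustration of the two ideas above, the following sketch groups transcripts whose sequences are identical and then sums a per-transcript value over each group, analogous to summing transcripts into a gene. It is not Salmon's own duplicate detection: the function names are hypothetical, and treating sequences case-insensitively is an assumption.

```python
from collections import defaultdict

def duplicate_groups(records):
    """Group transcript names that share exactly the same sequence.

    `records` is an iterable of (name, sequence) pairs.
    """
    by_seq = defaultdict(list)
    for name, seq in records:
        by_seq[seq.upper()].append(name)  # assumption: case is ignored
    # Only sequences seen under more than one name are duplicates.
    return [names for names in by_seq.values() if len(names) > 1]

def sum_by_group(groups, values):
    """Sum a per-transcript value (e.g. read counts) over each group."""
    return {tuple(group): sum(values.get(name, 0.0) for name in group)
            for group in groups}
```

Keeping the groups explicit means the joint analysis is a choice made downstream rather than something fixed at index time.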