Documentation update: GTF/GFF and biotype annotation #1086
Comments
In addition to the points raised above, it would be great to have some guidance on the version of the genome/annotation that the pipeline expects/recommends. This is especially true for GENCODE, which provides complete and primary assembly genomes, and reference chromosome, primary assembly and complete annotations, each with a basic and comprehensive version. I tried to work out what is pulled from iGenomes if you just provide a genome name but didn't have much luck.

The STAR documentation recommends the primary assembly genome and the "most comprehensive" annotation (presumably for the primary assembly). This is in contrast to the GENCODE webpage, which suggests the basic version should be used by most people.

For Salmon, it is less clear, but most examples seem to use the GENCODE transcripts FASTA, which only covers the reference chromosomes. ENSEMBL provides a similar cDNA FASTA file, but I'm not sure exactly what is included there and it seems to contain ~50,000 fewer transcripts. Kallisto seems to recommend the ENSEMBL cDNA FASTA, or the primary assembly genome with a GTF when using kallisto | bustools.

Following both of these recommendations could result in some differences between STAR-Salmon and pseudoalignment Salmon when both a GTF and a transcripts FASTA are provided (depending on how files are passed around), as one would use the GTF for the primary assembly while the other would use a transcripts FASTA covering just the reference chromosomes. There seem to be only about 60 extra transcripts on the additional parts of the primary assembly though, so it is not a big difference, and it could be avoided by providing just the GTF.

Sorry if that's too much information. I'd gone down a bit of a rabbit hole with this and thought it would be good to write it down somewhere for future reference. |
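For anyone wanting to check the GTF versus transcripts FASTA difference described above on their own files, here is a minimal sketch in Python. It is not part of the pipeline; the function names are hypothetical, and it assumes that the transcript ID is the first token of each FASTA header (up to the first `|` or whitespace) and that the GTF stores it in a `transcript_id "..."` attribute. Some annotations keep the version in a separate attribute while the FASTA header carries it as a `.N` suffix, so the optional `strip_version` flag drops such suffixes before comparing.

```python
import re

# Captures the value of a non-empty transcript_id attribute in a GTF line.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def normalise(tid, strip_version=False):
    """Optionally drop a trailing '.N' version suffix from a transcript ID."""
    return re.sub(r"\.\d+$", "", tid) if strip_version else tid


def gtf_transcript_ids(lines, strip_version=False):
    """Collect transcript IDs from GTF lines, skipping comment lines."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue
        match = TRANSCRIPT_ID.search(line)
        if match:
            ids.add(normalise(match.group(1), strip_version))
    return ids


def fasta_transcript_ids(lines, strip_version=False):
    """Collect transcript IDs from FASTA header lines (first token only)."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            first = re.split(r"[|\s]", line[1:].strip(), maxsplit=1)[0]
            ids.add(normalise(first, strip_version))
    return ids


def compare_transcript_sets(gtf_lines, fasta_lines, strip_version=False):
    """Return (IDs only in the GTF, IDs only in the FASTA)."""
    gtf_ids = gtf_transcript_ids(gtf_lines, strip_version)
    fasta_ids = fasta_transcript_ids(fasta_lines, strip_version)
    return gtf_ids - fasta_ids, fasta_ids - gtf_ids
```

Both arguments can be open file handles, for example `compare_transcript_sets(open("annotation.gtf"), open("transcripts.fa"))` with placeholder file names; for compressed files `gzip.open(path, "rt")` can be used instead of `open`.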
No need to apologize for too much information! It is fantastic that you put in the effort to research and document this. It would be even better to write it down directly in the pipeline's documentation, though. How would you feel about adding it there? |
I'm happy to try and contribute some text but maybe it would be good for people more familiar with the pipeline than me to confirm what is recommended. This is what I think so far:
General advice
GENCODE
ENSEMBL
Both
This is based on human and is probably similar for mouse, but I'm less sure about other species. If there is a general consensus that these are the recommendations, I'm happy to write this up properly and add a section to the docs. |
Thank you for taking this on! I agree with all of your suggestions. A couple of additional points:
|
nice work all- would be lovely to make all this explicit in the docs |
I opened a draft PR which tries to incorporate what was said here. I figured it would be easier to give more specific comments where there is some actual text to look at. More comments/contributions welcome! |
I think this issue can be closed, since the documentation has been improved significantly in the meantime. Thanks! |
Description of feature
One of the issues that repeatedly causes confusion amongst users of the rnaseq pipeline is which reference transcriptome annotation to provide as GTF or GFF3 (the latter is then converted to GTF by the pipeline) and where best to obtain it.
Since the proposal for a new pipeline / subworkflow for reference bundle preparation is somewhat stalled, I believe some updates to the pipeline's documentation can't hurt. Specifically, we need:
An explanation of the `Error: no valid ID found for GFF record` error that occurs if a GTF file is provided that contains `gene` entries with empty `transcript_id ""` fields, like those recently distributed by RefSeq and Ensembl. It should be mentioned that such files must be preprocessed by deleting the respective entries with `grep -v 'transcript_id ""' original.gtf > filtered.gtf` to work with the pipeline.

Since this is a documentation-only task, I believe it is well suited for the Hackathon.
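For users without `grep` at hand, or who want to know how many records were removed, the preprocessing step mentioned above could also be sketched in Python. This is an illustration only, not something the pipeline ships; the function names and file names are hypothetical. Unlike the plain `grep -v`, it always keeps `#` comment/header lines.

```python
EMPTY_ID = 'transcript_id ""'  # same literal pattern as in the grep command


def keep_line(line):
    """Return True for comment lines and for records with a non-empty transcript_id."""
    return line.startswith("#") or EMPTY_ID not in line


def filter_empty_transcript_ids(lines):
    """Yield only the GTF lines that should be kept."""
    for line in lines:
        if keep_line(line):
            yield line


def filter_gtf(in_path, out_path):
    """Write a filtered copy of in_path to out_path; return the number of dropped records."""
    dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if keep_line(line):
                dst.write(line)
            else:
                dropped += 1
    return dropped
```

Calling `filter_gtf("original.gtf", "filtered.gtf")` writes the filtered file and returns the count of removed records, which can be a useful sanity check before passing the result to the pipeline.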
Thanks!