Document FeatureCounts requirements to GTF more appropriately #144

Closed
apeltzer opened this issue Dec 14, 2018 · 13 comments

@apeltzer
Member

We recently had a project with a non-standard organism, where we had to download the genome and GFF3 from NCBI instead of using the ENSEMBL ones. As a result, featureCounts was unable to produce appropriate counts because, for example, the gene_id attribute was missing from that GTF/GFF.

Proper format is:

https://github.com/nf-core/test-datasets/blob/rnaseq/reference/genes.gtf

@ggabernet will post an example of a GFF that didn't work well. I will then take care of writing down some docs on how to make sure the GFF/GTF works fine for an analysis...

@apeltzer apeltzer self-assigned this Dec 14, 2018
@ggabernet
Member

Here is an example of part of a gtf that did not work for us:

ref_Amel_HAv3.1_top_level_head20.txt

@apeltzer
Member Author

There are multiple possibilities:

a.) Give users a way to edit the options passed to featureCounts directly
b.) Document that gene_id and gene_biotype must always be present in the GFF/GTF

a.) would also require us to adapt the featureCounts merging processes in general, e.g. by passing this option to the merge_featureCounts process. That might not be straightforward, but it would be possible.
b.) is easy to do, and we could even add a quick check at the start of the pipeline that the required attributes exist in the provided GTF/GFF. That would at least cause an early stop with a more meaningful error message :-)
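
A minimal sketch of what such an early check for option b.) could look like, assuming a tab-separated GTF/GFF with the attributes in column 9. The function names `attribute_keys` and `missing_attributes` are hypothetical and not part of the pipeline, and the simple parsing will not cope with every GFF dialect:

```python
import re


def attribute_keys(field):
    """Return the attribute names found in column 9 of a GTF/GFF line.

    Tries GTF style first (key "value";) and falls back to
    GFF3 style (key=value;) if no quoted pairs are found.
    """
    keys = set(re.findall(r'(\w+)\s+"[^"]*"', field))
    if not keys:
        keys = {p.split("=", 1)[0].strip() for p in field.split(";") if "=" in p}
    return keys


def missing_attributes(lines, required=("gene_id", "gene_biotype"), feature_type="exon"):
    """Return the required attributes that are absent from at least one
    record of the given feature type (assumption: counting is done on exons)."""
    missing = set()
    seen = 0
    for line in lines:
        # Skip comments, directives and blank lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        seen += 1
        keys = attribute_keys(cols[8])
        missing.update(a for a in required if a not in keys)
    if seen == 0:
        raise ValueError(f"no '{feature_type}' records found in annotation")
    return sorted(missing)
```

Called on an open annotation file, e.g. `missing_attributes(open("genes.gtf"))`, a non-empty result could be turned into an early exit with a message naming the missing attributes.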

@ewels
Member

ewels commented Dec 14, 2018

iGenomes also has NCBI and UCSC references, they're just not listed in the iGenomes config. We should probably add these. I think that they're normalised for a lot of stuff like this.

@apeltzer
Member Author

Once there are some opinions (@ewels, looking at you :-P), I'll have a go!

@ewels
Member

ewels commented Dec 14, 2018

Can you check with for example:

s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf

See if you get the same problem?

Use https://ewels.github.io/AWS-iGenomes/ to get all required s3 URLs.

@ggabernet
Member

ggabernet commented Dec 14, 2018

iGenomes doesn't have the species I'm working with unfortunately :(
For one of the species I found an ENSEMBL version now though, so if this one runs through then the problem really was the GTF format. For some other species I don't even have an ENSEMBL version.

@ewels
Member

ewels commented Dec 14, 2018

Ah sorry, I was speed reading and didn't pick up on the non-model organism bit. GFF is a horrible format for exactly this reason, it's not really a specified format.

The good news is that now I'm sat down and reading properly, I realise that this is a problem that we already came across ages ago and built in a feature to handle. So your fix is already part of the pipeline! It's even got documentation: https://github.com/nf-core/rnaseq/blob/master/docs/usage.md#featurecounts-extra-gene-names

I guess that supplying the option --fcExtraAttributes gene when running the pipeline will fix the issue.

@ewels
Member

ewels commented Dec 14, 2018

ps. This is where it's used:

rnaseq/main.nf

Line 923 in e837637

def extraAttributes = params.fcExtraAttributes ? "--extraAttributes ${params.fcExtraAttributes}" : ''

From the SubRead documentation:

--extraAttributes
Extract extra attribute types from the provided GTF annotation and include them in the counting output. These attribute types will not be used to group features. If more than one attribute type is provided they should be separated by comma (in Rsubread featureCounts its value is a character vector).

@ggabernet
Member

ggabernet commented Dec 17, 2018

Hi Phil, thank you for your answers. This did point towards the solution, though not completely. Because of the different annotation in the GTF file, I also had to change the featureCounts call to use -g Parent.

https://github.com/ggabernet/rnaseq/blob/57c4475415b38994b50c6630f856d67f39605b57/main.nf#L932

Would it be a possibility to provide a parameter that allows changing the term for this call, in a similar way as for extra attributes?
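
To illustrate the request, here is a rough sketch of how the grouping attribute could be exposed alongside the extra attributes when assembling the featureCounts command. The function `featurecounts_args` and its parameter names are hypothetical and do not reflect the pipeline's actual code; `-a`, `-g`, `-o` and `--extraAttributes` are the featureCounts options already discussed in this thread or in the Subread documentation:

```python
def featurecounts_args(gtf, bam, out, group_attribute="gene_id", extra_attributes=None):
    """Build a featureCounts argument list with a configurable -g attribute.

    group_attribute defaults to gene_id, which is featureCounts' own default
    for -g; for an NCBI-style annotation it could be set to e.g. "Parent".
    """
    args = ["featureCounts", "-a", gtf, "-g", group_attribute, "-o", out]
    if extra_attributes:
        # Multiple extra attribute types are passed as one comma-separated value
        args += ["--extraAttributes", ",".join(extra_attributes)]
    args.append(bam)
    return args
```

For the annotation above, this would be called as `featurecounts_args("genes.gtf", "sample.bam", "counts.txt", group_attribute="Parent", extra_attributes=["gene"])`.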

@ewels
Member

ewels commented Dec 17, 2018

Absolutely - we can set this to biotype by default but use a params variable to make it customisable. @apeltzer, are you able to PR this?

@apeltzer
Member Author

Yup, will do.

@apeltzer
Member Author

This is now possible in the dev branch of the pipeline :-)

@ggabernet
Member

Perfect, thank you!
