Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensembl GRCm39 v109 GTF / ...cds.all.fa Parse Issue #18

Open
MikeWLloyd opened this issue May 1, 2023 · 0 comments
Open

Ensembl GRCm39 v109 GTF / ...cds.all.fa Parse Issue #18

MikeWLloyd opened this issue May 1, 2023 · 0 comments

Comments

@MikeWLloyd
Copy link

I am trying to use your tool with GRCm39. I have been able to successfully build with m38, and GRCh38. However, for GRCm39 using ensembl v109, an error is thrown:

Command and error:

singularity run how_are_we_stranded_here_1.0.1--pyhfa5458b_0.sif check_strandedness --gtf /<REDACTED_PATH>/ensembl/v109/Mus_musculus.GRCm39.109.gtf -r1 RNASEQ.R1.FASTQ.gz_filtered -r2 RNAEQ.R1.FASTQ.gz_filtered -fa  /<REDACTED_PATH>/ensembl/GRCm39/Mus_musculus.GRCm39.cds.all.fa -p
Results stored in: stranded_test_RNASEQ.R1.FASTQ_filtered_trimmed
converting gtf to bed
running command: gtf2bed --gtf  /<REDACTED_PATH>/ensembl/v109/Mus_musculus.GRCm39.109.gtf --bed RNASEQ.R1.FASTQ_filtered/Mus_musculus.GRCm39.109.bed
Checking if fasta headers and bed file transcript_ids match...
Can't find transcript ids from /<REDACTED_PATH>/ensembl/GRCm39/Mus_musculus.GRCm39.cds.all.fa in stranded_test_RNASEQ.R1.FASTQ_filtered/Mus_musculus.GRCm39.109.bed
Trying to converting fasta header format to match transcript ids to the BED file...
running command: sed 's/[|]/ /g' /<REDACTED_PATH>/ensembl/GRCm39/Mus_musculus.GRCm39.cds.all.fa > stranded_test_RNASEQ.R1.FASTQ_filtered/transcripts.fa
Can't find any of the first 10 BED transcript_ids in fasta file... Check that these match

Here is the GTF, FASTA direct from Ensembl, and converted bed:

head of GTF:

$ head ensembl/v109/Mus_musculus.GRCm39.109.gtf
#!genome-build GRCm39
#!genome-version GRCm39
#!genome-date 2020-06
#!genome-build-accession GCA_000001635.9
#!genebuild-last-updated 2022-11
1	havana	gene	108344807	108347562	.	+	.	gene_id "ENSMUSG00000104478"; gene_version "2"; gene_name "Gm38212"; gene_source "havana"; gene_biotype "TEC";
1	havana	transcript	108344807	108347562	.	+	.	gene_id "ENSMUSG00000104478"; gene_version "2"; transcript_id "ENSMUST00000194081"; transcript_version "2"; gene_name "Gm38212"; gene_source "havana"; gene_biotype "TEC"; transcript_name "Gm38212-201"; transcript_source "havana"; transcript_biotype "TEC"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA (assigned to previous version 1)";
1	havana	exon	108344807	108347562	.	+	.	gene_id "ENSMUSG00000104478"; gene_version "2"; transcript_id "ENSMUST00000194081"; transcript_version "2"; exon_number "1"; gene_name "Gm38212"; gene_source "havana"; gene_biotype "TEC"; transcript_name "Gm38212-201"; transcript_source "havana"; transcript_biotype "TEC"; exon_id "ENSMUSE00001337335"; exon_version "2"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA (assigned to previous version 1)";
1	havana	gene	6980784	6981446	.	+	.	gene_id "ENSMUSG00000104385"; gene_version "2"; gene_name "Gm7449"; gene_source "havana"; gene_biotype "processed_pseudogene";
1	havana	transcript	6980784	6981446	.	+	.	gene_id "ENSMUSG00000104385"; gene_version "2"; transcript_id "ENSMUST00000194393"; transcript_version "2"; gene_name "Gm7449"; gene_source "havana"; gene_biotype "processed_pseudogene"; transcript_name "Gm7449-201"; transcript_source "havana"; transcript_biotype "processed_pseudogene"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA (assigned to previous version 1)";

head of CDS FASTA from ensembl

$ head ensembl/GRCm39/Mus_musculus.GRCm39.cds.all.fa
>ENSMUST00000178537.2 cds chromosome:GRCm39:6:41510135:41510146:1 gene:ENSMUSG00000095668.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trbd1 description:T cell receptor beta, D region 1 [Source:MGI Symbol;Acc:MGI:4439571]
GGGACAGGGGGC
>ENSMUST00000178862.2 cds chromosome:GRCm39:6:41519097:41519110:1 gene:ENSMUSG00000094569.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trbd2 description:T cell receptor beta, D region 2 [Source:MGI Symbol;Acc:MGI:4439727]
GGGACTGGGGGGGC
>ENSMUST00000196221.2 cds chromosome:GRCm39:14:54350925:54350933:1 gene:ENSMUSG00000096749.3 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trdd1 description:T cell receptor delta diversity 1 [Source:MGI Symbol;Acc:MGI:4439547]
ATGGCATAT
>ENSMUST00000177564.2 cds chromosome:GRCm39:14:54359683:54359698:1 gene:ENSMUSG00000096176.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trdd2 description:T cell receptor delta diversity 2 [Source:MGI Symbol;Acc:MGI:4439546]
ATCGGAGGGATACGAG
>ENSMUST00000179520.2 cds chromosome:GRCm39:12:113394148:113394158:-1 gene:ENSMUSG00000094028.2 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd4-1 description:immunoglobulin heavy diversity 4-1 [Source:MGI Symbol;Acc:MGI:4439801]
CTAACTGGGAC

head of converted BED.

$ head Mus_musculus.GRCm39.109.bed
1	108344806	108347562	ENSMUST00000194081	0	+	108344806	108347562	0	1	2756,	0,
1	6980783	6981446	ENSMUST00000194393	0	+	6980783	6981446	0	1	663,	0,
1	75368774	75373007	ENSMUST00000132100	0	-	75368774	75373007	0	2	315,157,	0,4076,
1	108540066	108540244	ENSMUST00000185509	0	-	108540066	108540244	0	1	178,	0,
1	6986782	6993812	ENSMUST00000194605	0	+	6986782	6993812	0	2	437,189,	0,6841,
1	6999982	7000012	ENSMUST00000191703	0	+	6999982	7000012	0	1	30,	0,
1	108697864	108699733	ENSMUST00000191467	0	+	108697864	108699733	0	1	1869,	0,
1	43781120	43783055	ENSMUST00000186289	0	-	43781120	43783055	0	2	146,70,	0,1865,
1	43782743	43783012	ENSMUST00000185910	0	-	43782743	43783012	0	1	269,	0,
1	7068440	7068765	ENSMUST00000193423	0	+	7068440	7068765	0	1	325,	0,

Looking at this, it seems that Ensembl no long converts all transcripts to the cds.all.fa file. I can remake this file for all transcripts in the GTF, but it would be great if the tool worked with Ensembl resources. Is it possible to make the parse more flexible for cases like this?

@MikeWLloyd MikeWLloyd changed the title Ensembl v109 Ensembl GRCm39 v109 GTF / ...cds.all.fa Parse Issue May 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant