Compatible file for NCBI submission? #69

Closed
michoug opened this issue Aug 12, 2021 · 7 comments

@michoug

michoug commented Aug 12, 2021

Hi,
I'm in the process of submitting genomes annotated with Bakta to NCBI. While checking the annotation for quality and errors (via table2asn and the GFF3 file, see https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run), I encountered several issues.
I understand if this is not something the tool will be made compatible with, as it can be quite tricky, but here is a list of some of the issues I ran into that could be addressed in the GFF file:

  • add a gene line
contig_1	Prodigal	CDS	3	179	.	-	0	ID=DOCECA_00005;locus_tag=DOCECA_00005;product=hypothetical protein
contig_1	Bakta	gene	3	179	.	-	0	ID=DOCECA_00005;locus_tag=DOCECA_00005
  • remove commas in the "product=" category with the exception of EC numbers

Commas that are intended to be part of a name should be encoded (%2C) according to the GFF3 specifications. However, literal commas should only be included when they are part of enzymatic names. Semi-colons generally should not be included in product names.

Best
Greg
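
The first point above (a gene line for every CDS) could be sketched roughly as follows. This is a minimal illustration, not Bakta's implementation: the function name `add_gene_feature`, the `_gene` ID suffix and the added `Parent` attribute are assumptions made for this example. The example lines above reuse the CDS ID and phase on the gene line; the sketch instead gives the gene its own ID and writes `.` in the phase column, since the GFF3 specification defines phase for CDS features only.

```python
def add_gene_feature(cds_line, source="Bakta"):
    """Return (gene_line, cds_line) for one 9-column GFF3 CDS line.

    Assumptions (not taken from Bakta): the gene ID is the CDS ID plus
    "_gene", and the CDS is linked to it through a Parent attribute.
    """
    cols = cds_line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "CDS":
        raise ValueError("expected a 9-column GFF3 CDS line")

    # Column 9 is a ';'-separated list of key=value pairs (order is kept).
    attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
    gene_id = attrs["ID"] + "_gene"  # hypothetical naming scheme

    # The gene line copies the coordinates and strand of the CDS.
    gene_cols = cols[:8] + [f"ID={gene_id};locus_tag={attrs['locus_tag']}"]
    gene_cols[1] = source
    gene_cols[2] = "gene"
    gene_cols[7] = "."  # phase only applies to CDS features

    # Rebuild the CDS attributes with Parent placed right after ID.
    cds_attrs = {}
    for key, value in attrs.items():
        cds_attrs[key] = value
        if key == "ID":
            cds_attrs["Parent"] = gene_id
    cds_cols = cols[:8] + [";".join(f"{k}={v}" for k, v in cds_attrs.items())]

    return "\t".join(gene_cols), "\t".join(cds_cols)


if __name__ == "__main__":
    example = (
        "contig_1\tProdigal\tCDS\t3\t179\t.\t-\t0\t"
        "ID=DOCECA_00005;locus_tag=DOCECA_00005;product=hypothetical protein"
    )
    for line in add_gene_feature(example):
        print(line)
```

Whether table2asn needs the explicit `Parent` link or is satisfied with matching locus tags is not stated in this thread, so treat that part as optional.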

@michoug michoug added the enhancement New feature or request label Aug 12, 2021
@oschwengers oschwengers added this to the v1.1 milestone Aug 12, 2021
@oschwengers
Owner

Hi @michoug, thanks a lot for reporting this. So far we've only tested submissions to ENA. Of course, we're keen to make NCBI submissions as smooth as possible, too.

I'll encode the products as requested in the GFF3 specifications.
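
For illustration, the encoding step could look like the sketch below. It is an assumption-laden example rather than the code that went into Bakta: the helper name `encode_attribute_value` is hypothetical, and the set of escaped characters follows the GFF3 rule that `;`, `=`, `&`, `,`, `%`, tab, newline, carriage return and other control characters are percent-encoded inside column 9 values.

```python
from urllib.parse import unquote

# Characters with a reserved meaning in GFF3 column 9.
RESERVED = {
    ";": "%3B",
    "=": "%3D",
    "&": "%26",
    ",": "%2C",
    "%": "%25",
}


def encode_attribute_value(value):
    """Percent-encode the reserved characters of a GFF3 attribute value.

    Each character is mapped exactly once, so the '%' introduced by one
    replacement is never encoded a second time.
    """
    out = []
    for ch in value:
        if ch in RESERVED:
            out.append(RESERVED[ch])
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            # Tab, newline and other control characters.
            out.append(f"%{ord(ch):02X}")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    product = "Sodium/proton antiporter, CPA1 family"
    encoded = encode_attribute_value(product)
    print(f"product={encoded}")
    # Decoding with the standard library restores the original text.
    print(unquote(encoded) == product)
```

Spaces are left as they are, since the GFF3 rules do not require escaping them in column 9.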

For the 1st and 3rd points, I think it would be best to add a --compliant option, in line with the Prokka option, to explicitly activate this behavior, since it might not be desired in other situations.

Is this a complete list of all the issues you encountered?
Also, could you provide an example of the commands you used to generate the submission files? This could help other users go through the process. Maybe I'll add a section to the readme as well.

oschwengers added 6 commits that referenced this issue Aug 12, 2021
@oschwengers oschwengers self-assigned this Aug 12, 2021
@michoug
Author

michoug commented Aug 13, 2021

Hi,
Thanks for the super-fast response. The issues highlighted here are the main ones (e.g. those flagged FATAL). There are others that depend more on the product names (see the attached list for one genome).
Issues.txt

Here is the command that I used to generate submission files:

  • First, you need a template submission file
  • Then, the software table2asn
  • The command (on Linux) was:
    table2asn_GFF.Linux -M n -J -c w -t template.sbt -l paired-ends -j "[organism=Pseudomonas sp][strain=E102] [gcode=11]" -i E102_bakta/E102.fna -f E102_bakta/E102.gff3 -o E102_bakta/E102.sqn -Z

Here is the link to the documentation: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run

@oschwengers
Owner

Thanks for the detailed information - that helps a lot.
I've already addressed the missing gene lines and the product encoding.

However, fixing the Dbxrefs and the fatal product descriptions might take somewhat longer. I've put this on the list for the upcoming 1.1 version, which will hopefully be released in the coming weeks.

@oschwengers
Owner

Interesting side effect: adhering to the GFF3 comma encoding convention (%2C) leads to FATAL: SUSPECT_PRODUCT_NAMES: 62 features contain '%'. Any idea how that could be bypassed? Or is this something that should perhaps be reported upstream to be fixed in the table2asn_GFF tool?

@oschwengers
Owner

oschwengers commented Aug 20, 2021

Hi @michoug, I've added a couple of fixes and improvements for GFF3-based GenBank submissions via table2asn_GFF.
All of the points you've raised above should be addressed and all issues should be solved. If this is not the case, please do not hesitate to reach out and re-open this issue.

I'll release v1.1.0 containing these improvements soon - most certainly next week.

Please let me know if there are any further issues - I'm looking forward to your feedback.
Thanks again for reporting and
best regards!

oschwengers added a commit that referenced this issue Aug 23, 2021
oschwengers added a commit that referenced this issue Aug 25, 2021
@michoug
Author

michoug commented Aug 30, 2021

Hi,
Congrats on all the fast work. I have a few other "issues" that might eventually be addressed, even though I'm well aware that this process can be a bottomless pit and is quite tricky to automate...

SUSPECT_PRODUCT_NAMES: 8 features May contain plural
E141.sqn:CDS	Urea carboxylase without Allophanate hydrolase 2 domains	lcl|contig_1:c493999-492260	GKKCFE_02155
E141.sqn:CDS	Phosphotransferase system, HPr-related proteins	lcl|contig_1:c658214-657810	GKKCFE_02965
E141.sqn:CDS	Hemolysins-related protein containing CBS domains	lcl|contig_1:c830356-829115	GKKCFE_03775
E141.sqn:CDS	Phage tail assembly chaperone proteins, E, or 41 or 14	lcl|contig_1:952781-953356	GKKCFE_04360
E141.sqn:CDS	Peptidoglycan/LPS O-acetylase OafA/YrhL, contains acyltransferase and SGNH-hydrolase domains	lcl|contig_1:c1007564-1006416	GKKCFE_04650
E141.sqn:CDS	Diguanylate cyclase with PAS/PAC and GAF sensors	lcl|contig_1:1171567-1172943	GKKCFE_05445


SUSPECT_PRODUCT_NAMES: 31 features contain 'unknown'
E141.sqn:CDS	Family of unknown function (DUF6124)	lcl|contig_1:c109230-108889	GKKCFE_00500
E141.sqn:CDS	Family of unknown function (DUF6124)	lcl|contig_1:254342-254698	GKKCFE_01120
E141.sqn:CDS	Family of unknown function (DUF6124)	lcl|contig_1:580095-580460	GKKCFE_02580

SUSPECT_PRODUCT_NAMES: 34 features contains three or more numbers together that may be identifiers more appropriate in note
E141.sqn:CDS	Uvs098	lcl|contig_1:252015-252467	GKKCFE_01095
E141.sqn:CDS	UPF0313 protein PSPTO_4928	lcl|contig_1:302226-304526	GKKCFE_01330
E141.sqn:CDS	L-pipecolate oxidase (1537)	lcl|contig_1:320431-321714	GKKCFE_01405
E141.sqn:CDS	HI0933-like protein	lcl|contig_1:c490707-489466	GKKCFE_02145
E141.sqn:CDS	Putative hydro-lyase B723_09185	lcl|contig_1:c496428-495622	GKKCFE_02165
E141.sqn:CDS	UPF0114 protein C7528_102400	lcl|contig_1:554275-554763	GKKCFE_02435
E141.sqn:CDS	UPF0225 protein CD58_06560	lcl|contig_1:c1018229-1017732	GKKCFE_04695
E141.sqn:CDS	UPF0276 protein SAMN03159293_01947	lcl|contig_1:c1039843-1038974	GKKCFE_04820


SUSPECT_PRODUCT_NAMES: 188 features contain underscore
E141.sqn:CDS	GBBH-like_N domain-containing protein	lcl|contig_1:c125879-125502	GKKCFE_00600
E141.sqn:CDS	FAD_binding_3 domain-containing protein	lcl|contig_1:c168453-167206	GKKCFE_00760
E141.sqn:CDS	ABC_trans_aux domain-containing protein	lcl|contig_1:261845-262549	GKKCFE_01150
E141.sqn:CDS	MotA_ExbB domain-containing protein	lcl|contig_1:272991-273842	GKKCFE_01195
E141.sqn:CDS	UPF0313 protein PSPTO_4928	lcl|contig_1:302226-304526	GKKCFE_01330
E141.sqn:CDS	Peripla_BP_6 domain-containing protein	lcl|contig_1:322080-323216	GKKCFE_01410
E141.sqn:CDS	Znf/thioredoxin_put domain-containing protein	lcl|contig_1:c389672-388437	GKKCFE_01700
E141.sqn:CDS	Cupin_3 domain-containing protein	lcl|contig_1:c469729-469385	GKKCFE_02035
E141.sqn:CDS	ZT_dimer domain-containing protein	lcl|contig_1:c476803-475910	GKKCFE_02080

SUSPECT_PRODUCT_NAMES: 1 feature contains '(TC'
E141.sqn:CDS	Sodium/proton antiporter, CPA1 family (TC 2A36)	lcl|contig_1:c3087019-3085778	GKKCFE_13940

SUSPECT_PRODUCT_NAMES: 1 feature contains 'FOG'
E141.sqn:CDS	FOG: TPR repeat, SEL1 subfamily	lcl|contig_1:c4136402-4136001	GKKCFE_18665

FATAL: SUSPECT_PRODUCT_NAMES: 1 feature contains '?'
E141.sqn:CDS	ABC transporter, substrate-binding protein (Cluster 15, trp?)	lcl|contig_1:4026495-4027427	GKKCFE_18180

FATAL: SUSPECT_PRODUCT_NAMES: 2 features contain '@'
E141.sqn:CDS	Deblocking aminopeptidase @ Cyanophycinase 2	lcl|contig_1:c1423448-1422258	GKKCFE_06635
E141.sqn:CDS	Maleylacetoacetate isomerase @ Glutathione S-transferase, zeta	lcl|contig_1:c4755920-4755285	GKKCFE_21485

SUSPECT_PRODUCT_NAMES: Use short product name instead of descriptive phrase
SUSPECT_PRODUCT_NAMES: 1 feature ends with 'activity'
E141.sqn:CDS	HD-like signal output (HDOD) domain, no enzymatic activity	lcl|contig_1:5955114-5956325	GKKCFE_27025

SUSPECT_PRODUCT_NAMES: 4 features Is longer than 100 characters. Remove descriptive phrases or synonyms from product names. Keep valid long product names, eg long enzyme names
E141.sqn:CDS	Multicopper oxidase with three cupredoxin domains (Includes cell division protein FtsP and spore coat protein CotA)	lcl|contig_1:819899-821275	GKKCFE_03735
E141.sqn:CDS	GTP pyrophosphokinase, (P)ppGpp synthetase I / Guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase	lcl|contig_1:c4197461-4195215	GKKCFE_18975
E141.sqn:CDS	Glyoxylate reductase / Glyoxylate reductase / Hydroxypyruvate reductase 2-ketoaldonate reductase, broad specificity	lcl|contig_1:4747621-4748592	GKKCFE_21440
E141.sqn:CDS	Glycine betaine/carnitine/choline ABC transporter, periplasmic glycine betaine/carnitine/choline-binding protein	lcl|contig_1:4859680-4860582	GKKCFE_21975

SUSPECT_PRODUCT_NAMES: 1 feature contains 'possibly'
E141.sqn:CDS	Membrane protein TerC, possibly involved in tellurium resistance	lcl|contig_1:c5854787-5854020	GKKCFE_26505

SUSPECT_PRODUCT_NAMES: 3 features contain 'gene'
E141.sqn:CDS	Yibq gene product, putative divergent polysaccharide deacetylase	lcl|contig_1:c43395-42619	GKKCFE_00250
E141.sqn:CDS	ABC transporter in pyoverdin gene cluster, ATP-binding component	lcl|contig_1:3868307-3869059	GKKCFE_17350
E141.sqn:CDS	YebG, DNA damage-inducible gene in SOS regulon, expressed in stationary phase	lcl|contig_1:4752788-4753048	GKKCFE_21470

BAD_GENE_NAME: 6 genes contain suspect phrase or characters
E141.sqn:Gene	5_ureB_sRNA	lcl|contig_1:346126-346411	GKKCFE_01530
E141.sqn:Gene	epd,gap,gapA	lcl|contig_1:c1070158-1069157	GKKCFE_04950
E141.sqn:Gene	Bacteria_small_SRP	lcl|contig_1:1650897-1650993	GKKCFE_07575
E141.sqn:Gene	RNaseP_bact_a	lcl|contig_1:c4698601-4698249	GKKCFE_21195
E141.sqn:Gene	epd,gap,gapA	lcl|contig_1:c5325075-5324020	GKKCFE_24085
E141.sqn:Gene	Pseudomon-1	lcl|contig_1:5829418-5829534	GKKCFE_26390
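
A report like the one above is easier to triage once the category headers are tallied. The sketch below is one possible way to do that, assuming the header lines keep the shape seen here (`[FATAL: ]TEST_NAME: <count> feature(s)/gene(s) <message>`); the function name `summarize` and the sort order are choices made for this example, not part of table2asn.

```python
import re

# Matches headers such as
#   "FATAL: SUSPECT_PRODUCT_NAMES: 2 features contain '@'"
#   "BAD_GENE_NAME: 6 genes contain suspect phrase or characters"
# Per-feature detail lines and headers without a count do not match.
HEADER = re.compile(
    r"^(?P<fatal>FATAL: )?(?P<test>[A-Z_]+): "
    r"(?P<count>\d+) (?:features?|genes?) (?P<msg>.+)$"
)


def summarize(report_text):
    """Return (test, message, count, is_fatal) tuples, FATAL entries first,
    then by descending count."""
    rows = []
    for line in report_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            rows.append(
                (
                    match["test"],
                    match["msg"],
                    int(match["count"]),
                    match["fatal"] is not None,
                )
            )
    return sorted(rows, key=lambda row: (not row[3], -row[2]))


if __name__ == "__main__":
    sample = "\n".join(
        [
            "SUSPECT_PRODUCT_NAMES: 188 features contain underscore",
            "FATAL: SUSPECT_PRODUCT_NAMES: 1 feature contains '?'",
            "BAD_GENE_NAME: 6 genes contain suspect phrase or characters",
        ]
    )
    for test, msg, count, fatal in summarize(sample):
        flag = "FATAL" if fatal else "     "
        print(f"{flag} {count:>4}  {test}: {msg}")
```

Sorting FATAL entries to the top reflects the point made earlier in the thread that those are the ones blocking a submission.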

@oschwengers
Owner

oschwengers commented Aug 31, 2021

Hi,
I've tried to address as many SUSPECT_PRODUCT_NAMES as possible:

  • contains '?'
  • contains '@'
  • contains 'FOG'
  • contains underscore -> underscores in domain names
  • contains 'unknown' -> product replaced by "DUF....-containing protein"
  • contains three or more numbers together.... -> product replaced by "UPF....-containing protein"

These are the low-hanging fruit. The remaining issues are far more complex to fix, if they can be handled automatically at all.
I'll try to add more "fix & replace" rules from time to time, and I'm open to all sorts of ideas, suggestions and improvements from the community!
Thanks for all the reports! I'll release a patch version soon.
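
For readers wondering what such rules might look like, here is a rough sketch covering the six cases listed above. It is not the code that went into Bakta. The DUF and UPF rewrites follow the wording in this comment, while the other replacements (dropping `?`, turning ` @ ` into ` / `, removing a leading `FOG:`, replacing underscores with hyphens in domain names) are assumptions chosen for illustration, as is the function name `clean_product`.

```python
import re

DUF = re.compile(r"\bDUF\d+\b")
UPF = re.compile(r"\bUPF\d+\b")
DOMAIN_SUFFIX = " domain-containing protein"


def clean_product(product):
    """Apply a few illustrative clean-up rules to a product name."""
    # 'unknown' with a DUF identifier -> "DUFnnnn-containing protein"
    duf = DUF.search(product)
    if duf and "unknown" in product.lower():
        return f"{duf.group()}-containing protein"

    # UPF family plus a locus identifier -> "UPFnnnn-containing protein"
    upf = UPF.search(product)
    if upf:
        return f"{upf.group()}-containing protein"

    # Hypothetical handling of the remaining flagged patterns.
    product = re.sub(r"^FOG:\s*", "", product)
    product = product.replace("?", "")
    product = product.replace(" @ ", " / ")

    # Underscores are only rewritten inside the domain name itself.
    if product.endswith(DOMAIN_SUFFIX):
        domain = product[: -len(DOMAIN_SUFFIX)]
        product = domain.replace("_", "-") + DOMAIN_SUFFIX

    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s{2,}", " ", product).strip()


if __name__ == "__main__":
    for name in (
        "Family of unknown function (DUF6124)",
        "UPF0313 protein PSPTO_4928",
        "FOG: TPR repeat, SEL1 subfamily",
        "Cupin_3 domain-containing protein",
    ):
        print(f"{name!r} -> {clean_product(name)!r}")
```

Keeping the rules in one ordered function makes it clear which rewrite wins when a name matches several patterns, for example a UPF name that also contains an underscore.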
