This document is intended for INSDC submitters and describes how to prepare submission files using DFAST. Please check the latest submission guideline at the INSDC partners.
For beginners not familiar with the data submission procedure, we recommend using our DFAST web service, which offers you a graphical user interface to create submission files to DDBJ.
The genome submission to the INSDC can be devided into two categories, WGS mainly for a draft genome and non-WGS mainly for a complete genome. DFAST can generate submission files for both types of categories.
WGS genomes have three types of submission format depending on the presence of sequencing gap.
- Gapless contigs
A standard format of WGS submissoin - Gapped sequences (scaffolds)
A newer style for gapped-sequence submission. Basically, it is same as above excepting that sequencing gaps (runs of Ns) must be annotated using assembly_gap features. - Gapless contigs and AGP file
An older style for gapped-sequence submissoin. AGP file describes the ordering and orientation of contigs within scaffold.
Note that DFAST does not support this type of submission as DFAST does not generate an AGP file.
In a non-WGS genome, each chromosome is in a single sequence. There still can be gaps within the sequences, and plasmids still can be in multiple pieces. Each sequence in the genome must be assigned to a chromosome or a plasmid. DFAST requires additional parameters for non-WGS genomes. See later [section](#For a complete genome).
In an INSDC flat file, each sequence has at least, and normally only, one source feature that spans the entire sequence and describes the biological source of the sequence.
The source must have "organism" and "mol_type" as mandatory qualifiers,
and the value of "mol_type" is always "genomic DNA" for genome sequence data.
For microbial genome submission, it is recommended to use "strain" qualifier to describe the strain from which the sequence derived.
In addition, qualifiers such as "serovar", "type_material", and "culture_collection"
should be used if necessary.
These source modifiers can be specified by GENOME_SOURCE_INFORMATION
in the configuration file,
which includes following attributes. Instead of specifying them in the configuration file, you can set them by providing command line options with the same name as source attributes, i.e. use --organism
to override the attribute organism
in the config.
- organism
Scientific name for the organism from which the genome was obtained (mandatory for INSDC submission). The value specified in the config file can be overriden by the command line option--organism
- strain
Strain name (recommended for microbial genome submission) The value specified in the config file can be overriden by the command line option--strain
The following 3 attributes are only valid for a comeplete (non-WGS) genome. Values corresponding to each sequence must be separated with commas (,) or semicolons (;), and the number of the values must match the number of sequences.
-
seq_names
The sequence names to be used in a INSDC flat file. e.g."Chromosome, pXXX, pYYY"
The sequence name must not include any blanks. The plasmid name should start with lower case 'p' unless it is not known. In that case use 'unnamed', or ‘unnamed1’ & ‘unnamed2’ for distinct unnamed plasmids. If you setuse_original_name = True
, the leading non-space letters in the query FASTA file will be taken as a sequence name, and the value specified byseq_names
will be ignored. -
seq_topologies
Linear or circular. If not specified, the sequence will be treated as a linear sequence. Usecircular
orc
to denote a circular sequence. For example,"c,c,"
means the first and second replicons are circular, and the third one is linear (="circular,circular,linear"
). -
seq_types
Chromosome or plasmid assignment. If not specified, the sequence will be treated as a chromosome. Useplasmid
orp
to assign as a plasmid. For example,",p,p"
means the first replicon is a chromosome, and the second and third replicons are plasmids (="chromosome,plasmid,plasmid"
). If a sequence is assigned as a plasmid, a "plasmid" qualifier with its plasmid name will appear in the source feature.
-
additional_modifiers
(optional)
If you need additional source modifiers, you can specify them following the format like"isolation_source=plant; culture_collection=JCM:XXXX; type_material=type strain of XXXX; note=ABC; note=DEF"
. The values specified by this attribute will be assigned to all the entries in the genome.[Acceptable modifiers]
bio_material, collected_by, collection_date, country, culture_collection, ecotype, identified_by, isolation_source, host, note, serotype, serovar, sub_species, sub_strain, type_material, varietySee also Source Modifier List at the NCBI web site.
Locus tags are identifiers that are systematically assigned to every gene in a genome.
A locus tag is represented by the combination of a prefix (locus_tag_prefix) and a tag value,
separated by an under score "_", eg A1C_00001.
Locus_tag_prefix can contain only alpha-numeric characters of at least 3 characters long,
and it should be registered at the INSDC in prior to genome submission.
Usually, they are in a sequential order on the genome,
and you can also leave gaps so that you can later add a new gene between gaps.
For more information, please read the guideline.
You can adjust how locus tags should be assigned using the following parameters in LOCUS_TAG_SETTINGS
.
locus_tag_prefix
: locus_tag_prefixstep
: The gap value between locus tags.use_separate_tags
: If set toTrue
, locus_tags are assigned separately by feature type.symbols
: Default setting = {"CDS": "", "rRNA": "r", "tRNA": "t", "tmRNA": "tm"}, This has two functions. First, genes whose feature type is included in the setting are assigned with locus_tag. Locus tags are not assigned to features not in the setting, such as assembly_gap and repeat_region. Second, ifuse_separate_tags
is setTrue
, the symbol specified by the setting will be added to tag values. e.g. ABC_r0001 for rRNA, ABC_t0020 for tRNA.
DFAST generates an annotation file (tab-separated) and a sequence file (FASTA-formatted), which will be required to submit the genome through DDBJ Mass Submission System (MSS).
-
Before submission, you need to register BioProject and BioSample via the DDBJ submission portal D-way. If necessary, raw sequence data should be deposited in Sequence Read Archive (SRA). A unique Locus Tag Prefix is required for submitting an annotated genome, which can be registered during the BioProject or BioSample registration process.
-
Run DFAST to create submission files. To create valid files, make sure source modifiers and locus tags are specified properly. Then, fill in the COMMON entry part of the annotation file with metadata such as contact information, references, and accession numbers of related databases (BioProject, BioSample, SRA).
-
DDBJ offers data validation tools to submitters. Use the tools to validate your submission files before sending them to DDBJ.
By default, the COMMON entry section of an annotation file is left blank, so that users can fill with the information later. Alternatively, if you provide a text file describing metadata to metadata_file
in the config, DFAST can generate a fully qualified DDBJ submission file including metadata.
Alternatively, you can do it with the --metadata_file
option.
- metadata_file
A tab-separated text file with a key and a value in each line. It is recommended to use a sample metadata file ($DFAST_APP_ROOT/example/sample.metadata.txt
) as a template.
See also$DFAST_APP_ROOT/example/description.metadata.txt
for more information.
- Minimal example
This shows a minimal example to create DDBJ data submission files for a draft genome.
All the metadata are specified in Execute this in$DFAST_APP_ROOT
.
dfast --genome example/test.genome.fna --metadata example/sample.metadata.draft.txt --locus_tag_prefix LOOC260
- More complicated example
The below is an example of a command to generate compliant submission files for a complete genome consisting of one chromosome and 2 plasmids. Sequence name, topology, and type for each sequence should be specified using the
--seq_names
,--seq_topologies
,--seq_types
options.
dfast -g example/sample.lactobacillus.fna --complete t --organism "Lactobacillus hokkaidonensis" \
--strain LOOC260 --seq_names "Chromosome,pXXXX,pYYYY" --seq_topologies c,c,l --seq_types c,p,p \
--additional_modifiers "culture_collection=JCM:18461; isolation_source=silage; note=You can add a comment; note=You can add another comment; collection_date=2017-06-26" --metadata_file example/sample.metadata.complete.txt \
--locus_tag_prefix LH260 --step 10 --use_separate_tags t --out LHLOOC
Instead of using --organism
, --strain
, --additional_modifiers
options, they can be specified in the metadata file as shown in the example for a draft genome.
GenBank submission is no longer supported. Use NCBI PGAP.
- Register the genome project with the BioProject and the BioSample databases.
- Create a submission template file here.
- Run DFAST to create two input files for the tbl2asn program, a feature table file (.tbl) and a sequence file (.fsa).
Set
center_name
(or--center_name
) to specify your genome center tags, or you can change them later. - Run tbl2asn to generate a .sqn file for submission to GenBank.
tbl2asn -a s -t template.sbt -i xxxxx.fsa -w xxxxx.cmt -V vb -Z discrepancy.txt
# xxxxx.cmt is an optional structured-comment file
- Upload the files from NCBI Submission Portal.
For more detail, please see the Submission Guideine, instruction manual for tbl2asn and the Genbank Submissions Handbook.
The below is an example of a command to generate GenBank submission files for a draft genome of E. coli.
dfast -g your_genome.fa --organism "Escherichia coli" --strain Sakai \
--additional_modifiers "serovar=O157:H7; sub_strain=RIMD 0509952; country=Japan:Osaka, Sakai" \
--locus_tag_prefix ECS --center_name NIG