nex2tbl

nex2tbl is an R tool that helps with submitting protein-coding DNA sequences to GenBank. Such sequences are commonly submitted through the BankIt portal, which asks for a Feature Table File (*.tbl file) when the user uploads multiple records. Preparing the tbl file by hand can be laborious, especially if the sequences include multiple introns or start at different codon positions. nex2tbl takes aligned sequences and creates a minimal tbl file with 2 feature keys (gene and CDS) and 5 qualifiers (gene, product, codon_start, transl_table, and partial aka </>), which together are enough for GenBank to correctly translate the DNA into amino acids.

Usage

  • Make sure that the ape and plyr packages are installed in your R environment.

  • Load the script.

source("https://raw.githubusercontent.com/Mycology-Microbiology-Center/nex2tbl/main/nex2tbl.R")
  • Specify input and output file names, as well as user-defined variables. Example:
nex2tbl(
  INPUT_NEX = "exons-introns_CODON_START-2_RPB1.nex",
  OUTPUT_TBL = "exons-introns_CODON_START-2_RPB1.tbl",
  GENE = "rpb1",
  PRODUCT = "RNA polymerase II largest subunit",
  CODON_START = 2,
  TRANSL_TABLE = 1,
  FULL_GENE = FALSE
)
  • Run the code, and the resulting tbl file will appear in your working directory.
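
Before sourcing the script, it can help to confirm that both dependencies are available. The following is a minimal sketch in base R; `missing_packages` is a hypothetical helper name used here for illustration and is not part of nex2tbl.

```r
# Hypothetical helper (not part of nex2tbl): return the names of packages
# that cannot be loaded from the current R library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("ape", "plyr"))
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "),
          ". Install them with install.packages().")
}
```

The check only reports what is missing and leaves the installation itself to the user.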

Documentation

The input is an alignment of the sequences to be submitted, all from one gene, in NEXUS format (*.nex, example). Intron positions should be specified at the end of the file as column spans in a single charset called intron, like this:

BEGIN SETS;
charset intron = 202-256 394-451;
END;
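
To illustrate how such a charset maps to exon and intron regions, here is a minimal sketch in base R. The helper names `parse_intron_charset` and `exon_spans` and the alignment length of 600 columns are assumptions made for this example, and the code is not taken from nex2tbl itself. It works in alignment columns; positions in a tbl file refer to each individual sequence, so a further per-sequence conversion would still be needed.

```r
# Hypothetical helper: parse a "charset intron = a-b c-d;" line into a
# data frame of intron start and end columns.
parse_intron_charset <- function(line) {
  spans <- regmatches(line, gregexpr("[0-9]+-[0-9]+", line))[[1]]
  if (length(spans) == 0) {
    return(data.frame(start = integer(0), end = integer(0)))
  }
  bounds <- do.call(rbind, lapply(strsplit(spans, "-"), as.integer))
  data.frame(start = bounds[, 1], end = bounds[, 2])
}

# Exons are taken to be the alignment columns not covered by any intron.
exon_spans <- function(introns, aln_length) {
  is_exon <- rep(TRUE, aln_length)
  for (i in seq_len(nrow(introns))) {
    is_exon[introns$start[i]:introns$end[i]] <- FALSE
  }
  runs <- rle(is_exon)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  data.frame(start = starts[runs$values], end = ends[runs$values])
}

introns <- parse_intron_charset("charset intron = 202-256 394-451;")
exon_spans(introns, aln_length = 600)  # 600 is an assumed alignment length
```

With these inputs the exons cover columns 1-201, 257-393, and 452-600.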

In addition, the user must specify the following variables:

  • GENE - gene name, e.g., "rpb1".
  • PRODUCT - name of the produced protein, e.g., "RNA polymerase II largest subunit".
  • CODON_START - the offset of the first complete codon of the coding region in the alignment. It is counted from the first column of the first exon (which is not necessarily at the beginning of the alignment!) and can only be 1, 2, or 3. In the example below, the first complete codon (TCC, in green) starts in the 3rd column of the first exon, so CODON_START is 3. To set this variable, the user must know the reading frame of the alignment beforehand.

start_codon_example

  • TRANSL_TABLE - the genetic code table to use; the default is 1, the universal genetic code table.
  • FULL_GENE - FALSE or TRUE, depending on whether the sequence covers the whole coding region of the protein. Usually it does not, and then the locations of the first and last regions (assumed to be incomplete) are marked with < and > before the numbers. If TRUE, GenBank expects CODON_START to be 1.
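
The constraints on CODON_START and FULL_GENE described above can be checked before calling the tool. This is a minimal sketch; `check_nex2tbl_args` is a hypothetical function that is not part of nex2tbl, and it encodes only the rules stated in this section.

```r
# Hypothetical pre-flight check (not part of nex2tbl) for the rules above.
check_nex2tbl_args <- function(CODON_START, FULL_GENE) {
  if (length(CODON_START) != 1 || !(CODON_START %in% 1:3)) {
    stop("CODON_START must be 1, 2, or 3")
  }
  if (!is.logical(FULL_GENE) || length(FULL_GENE) != 1 || is.na(FULL_GENE)) {
    stop("FULL_GENE must be TRUE or FALSE")
  }
  if (FULL_GENE && CODON_START != 1) {
    # The section states that GenBank expects CODON_START = 1 for full genes.
    warning("FULL_GENE = TRUE: GenBank expects CODON_START to be 1")
  }
  invisible(TRUE)
}

check_nex2tbl_args(CODON_START = 2, FULL_GENE = FALSE)
```

The full-gene case raises a warning and not an error, because the text above only says what GenBank expects.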

Output example for a single sequence

>Features seq4
<1	>2119	gene
			gene	rpb2
<1	74	CDS
128	1087
1144	>2119
			product	RNA polymerase II second largest subunit
			codon_start	3
			transl_table	1
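
The layout of this record can be reproduced with a few lines of base R. The sketch below is only an illustration of the tab-delimited format shown above, not the nex2tbl implementation; `format_tbl_record` is a hypothetical name, and the exon coordinates are copied from the example.

```r
# Hypothetical formatter (not the nex2tbl implementation): build the lines of
# one feature-table record from the exon coordinates of a single sequence.
format_tbl_record <- function(seq_id, starts, ends, gene, product,
                              codon_start, transl_table = 1, partial = TRUE) {
  n <- length(starts)
  # Convert via integer so that large positions are not printed as 1e+05.
  s <- as.character(as.integer(starts))
  e <- as.character(as.integer(ends))
  if (partial) {
    # Mark the first start and the last end as incomplete.
    s[1] <- paste0("<", s[1])
    e[n] <- paste0(">", e[n])
  }
  cds <- paste(s, e, sep = "\t")
  cds[1] <- paste(cds[1], "CDS", sep = "\t")
  qualifier <- function(key, value) paste0("\t\t\t", key, "\t", value)
  c(
    paste(">Features", seq_id),
    paste(s[1], e[n], "gene", sep = "\t"),
    qualifier("gene", gene),
    cds,
    qualifier("product", product),
    qualifier("codon_start", codon_start),
    qualifier("transl_table", transl_table)
  )
}

record <- format_tbl_record(
  seq_id = "seq4",
  starts = c(1, 128, 1144),
  ends = c(74, 1087, 2119),
  gene = "rpb2",
  product = "RNA polymerase II second largest subunit",
  codon_start = 3
)
writeLines(record)
```

Setting `partial = FALSE` drops the < and > markers, which corresponds to the FULL_GENE = TRUE case.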

Notes

  • In exons, the length of each gap must be a multiple of three (e.g. ---), or else the reading frame will be broken and the output will be wrong.
  • Intron-only sequences are not supported - if they are present in the alignment, warnings will be shown and such sequences will be omitted from the tbl file.
  • If charsets are not specified, the whole alignment will be treated as a single exon.
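
The first note can be checked before running the tool. The sketch below uses a hypothetical helper, `bad_gap_lengths`, which is not part of nex2tbl and assumes the exon part of one aligned sequence is available as a single string. It flags every gap run, including leading and trailing ones; whether those matter for a given alignment is left to the user.

```r
# Hypothetical check (not part of nex2tbl): return the lengths of gap runs in
# an exon sequence that are not a multiple of three.
bad_gap_lengths <- function(exon_seq) {
  chars <- strsplit(exon_seq, "")[[1]]
  runs <- rle(chars == "-")
  gap_lengths <- runs$lengths[runs$values]
  gap_lengths[gap_lengths %% 3 != 0]
}

bad_gap_lengths("ATG---TCC")   # no offending gaps
bad_gap_lengths("ATG--TCCA-")  # gaps of length 2 and 1 would break the frame
```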

Credits

  • Code: Vladimir Mikryukov
  • Idea: Anton Savchenko and Iryna Yatsiuk