Further information on r2d annotate
R2Dtool annotates positions in isoform space (transcriptomic coordinates) with transcript-specific metatranscript coordinates and with absolute and relative distances to annotated transcript landmarks. This page describes the algorithm behind r2d annotate.
The algorithm takes transcriptomic coordinates and uses transcript annotation data to provide detailed information about the position, including gene details, transcript properties, and splice site distances.
- GTF file containing transcript annotations
- Input file with transcriptomic coordinates
- Output file path (optional, defaults to stdout)
- Flags for header presence and version information in transcript IDs
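The version flag matters because a GTF file and an input table do not always agree on whether a transcript ID carries a version suffix. The following is a minimal sketch of one way to normalise IDs, assuming Ensembl-style IDs in which the version is the numeric part after the final dot. The function name normalise_transcript_id and its strip_version parameter are hypothetical and are not taken from the R2Dtool source.

```rust
/// Return `id` without a trailing ".<digits>" version when `strip_version`
/// is true; otherwise return it unchanged.
/// Hypothetical helper for illustration, not the R2Dtool API.
fn normalise_transcript_id(id: &str, strip_version: bool) -> &str {
    if !strip_version {
        return id;
    }
    match id.rsplit_once('.') {
        // Only treat the suffix as a version if it is non-empty and all digits.
        Some((base, suffix))
            if !base.is_empty()
                && !suffix.is_empty()
                && suffix.bytes().all(|b| b.is_ascii_digit()) =>
        {
            base
        }
        // No dot, or a non-numeric suffix: leave the ID as it is.
        _ => id,
    }
}
```

With stripping enabled, ENST00000456328.2 and ENST00000456328 map to the same key, so a lookup succeeds even if only one of the two files is versioned.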
A tab-separated file containing the original input data along with additional annotation columns:
- gene_id
- gene_name
- transcript_biotype
- tx_len (transcript length)
- cds_start
- cds_end
- tx_end
- transcript_metacoordinate
- abs_cds_start
- abs_cds_end
- up_junc_dist (upstream junction distance)
- down_junc_dist (downstream junction distance)
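When the input has a header, the new column names have to be appended to it in the same order as the values. A minimal sketch of that step, assuming a tab-separated header line; the constant ANNOTATION_COLUMNS and the function extend_header are hypothetical names, and only the column names themselves come from the list above.

```rust
/// Annotation column names, in the order listed above.
const ANNOTATION_COLUMNS: [&str; 12] = [
    "gene_id",
    "gene_name",
    "transcript_biotype",
    "tx_len",
    "cds_start",
    "cds_end",
    "tx_end",
    "transcript_metacoordinate",
    "abs_cds_start",
    "abs_cds_end",
    "up_junc_dist",
    "down_junc_dist",
];

/// Append the annotation column names to a tab-separated header line.
/// Hypothetical helper for illustration, not the R2Dtool API.
fn extend_header(header: &str) -> String {
    // Drop only the line terminator so that empty trailing columns survive.
    let mut out = String::from(header.trim_end_matches(|c| c == '\n' || c == '\r'));
    for col in ANNOTATION_COLUMNS.iter() {
        out.push('\t');
        out.push_str(col);
    }
    out
}
```

A two-column header therefore becomes a 14-column header, with the original columns left untouched at the front.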
1. Parse GTF file to create transcript annotations
2. Generate splice site information for all transcripts
3. Process input file line by line:
   a. Extract transcript ID and coordinate
   b. Retrieve transcript information
   c. Calculate various metrics (CDS positions, metacoordinates, etc.)
   d. Determine distances to nearest splice sites
   e. Write annotated information to output
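Steps 2 and 3d can be sketched as follows. This is a minimal illustration under stated assumptions, not the R2Dtool implementation: it assumes that junction positions in transcript space are the cumulative exon lengths, that the upstream junction is the nearest one at or before the position, and that a missing junction is represented as None (to be written as "NA"). The function names junction_positions and junction_distances are hypothetical.

```rust
/// Junction positions in transcript coordinates: the cumulative exon lengths,
/// excluding the transcript end. Exon lengths are given in 5'->3' order.
fn junction_positions(exon_lengths: &[u64]) -> Vec<u64> {
    let mut junctions = Vec::new();
    let mut offset = 0;
    // Skip the last exon: its end is the transcript end, not a junction.
    for len in exon_lengths.iter().take(exon_lengths.len().saturating_sub(1)) {
        offset += len;
        junctions.push(offset);
    }
    junctions
}

/// Distances from `tx_coord` to the nearest junction at or before it and to
/// the nearest junction after it. `junctions` must be sorted ascending.
fn junction_distances(tx_coord: u64, junctions: &[u64]) -> (Option<u64>, Option<u64>) {
    // Index of the first junction strictly greater than tx_coord.
    let idx = junctions.partition_point(|&j| j <= tx_coord);
    let upstream = idx.checked_sub(1).map(|i| tx_coord - junctions[i]);
    let downstream = junctions.get(idx).map(|&j| j - tx_coord);
    (upstream, downstream)
}
```

For instance, hypothetical exons of 850, 450 and 700 nt give junctions at 850 and 1300, so position 1000 lies 150 nt past the upstream junction and 300 nt before the downstream one. A single-exon transcript has no junctions, and both distances are None.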
Algorithm: run_annotate
Input:
  - matches: command-line argument matches
  - has_header: boolean indicating if input file has a header
  - has_version: boolean indicating if transcript IDs include version numbers
Output:
  - Annotated file with additional transcript information
1. Extract gtf_file, input_file, and output_file from matches
2. annotations ← read_annotation_file(gtf_file, true, has_version)
3. splice_sites ← generate_splice_sites(annotations)
4. Open input_file for reading
5. Open output_file (or stdout) for writing
6. If has_header:
     Read and process header, adding new column names
7. For each line in input_file:
   a. Split line into fields
   b. transcript_id ← Extract from fields[0] (remove version if necessary)
   c. tx_coord ← Parse fields[1] as integer
   d. If transcript_id exists in annotations:
      i. Retrieve transcript information
      ii. Calculate tx_len, cds_start, cds_end, tx_end
      iii. (rel_pos, abs_cds_start, abs_cds_end) ← calculate_meta_coordinates(tx_coord, utr5_len, cds_len, utr3_len)
      iv. (up_junc_dist, down_junc_dist) ← splice_site_distances(tx_coord, splice_sites[transcript_id])
      v. Construct output line with all calculated values
   e. Else:
        Construct output line with "NA" for all additional fields
   f. Write constructed line to output_file
8. Close input and output files
9. Remove temporary splice sites file
Return: Success or error status
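The lookup-or-"NA" branch in step 7 can be sketched in Rust as below. This is a simplified illustration rather than the code in annotate.rs: the TranscriptInfo struct, its field names, and the annotate_line function are hypothetical, and only the position-independent columns (gene details, tx_len, cds_start, cds_end, tx_end) are emitted to keep the sketch short. It assumes tx_end equals tx_len, as in the example row on this page.

```rust
use std::collections::HashMap;

/// Hypothetical per-transcript record; field names are illustrative only.
struct TranscriptInfo {
    gene_id: String,
    gene_name: String,
    biotype: String,
    utr5_len: u64,
    cds_len: u64,
    utr3_len: u64,
}

/// Number of annotation columns emitted by this sketch.
const N_COLS: usize = 7;

/// Annotate one tab-separated input line whose first column is a transcript ID.
fn annotate_line(line: &str, annotations: &HashMap<String, TranscriptInfo>) -> String {
    let transcript_id = line.split('\t').next().unwrap_or("");
    let extra: Vec<String> = match annotations.get(transcript_id) {
        Some(tx) => {
            // Landmarks in transcript coordinates, derived from region lengths.
            let cds_start = tx.utr5_len;
            let cds_end = cds_start + tx.cds_len;
            let tx_len = cds_end + tx.utr3_len;
            vec![
                tx.gene_id.clone(),
                tx.gene_name.clone(),
                tx.biotype.clone(),
                tx_len.to_string(),
                cds_start.to_string(),
                cds_end.to_string(),
                tx_len.to_string(), // tx_end, assumed equal to tx_len
            ]
        }
        // Unknown transcript: keep the row, fill every new column with "NA".
        None => vec!["NA".to_string(); N_COLS],
    };
    format!("{}\t{}", line, extra.join("\t"))
}
```

Emitting a fixed number of columns in both branches keeps every output row the same width, so downstream tools can parse the table without special-casing unannotated transcripts.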
- The algorithm handles both versioned and non-versioned transcript IDs.
- It calculates transcript metacoordinates to provide context within the transcript structure.
- Splice site distances are computed to give information about proximity to exon-intron boundaries.
- The implementation uses parallel processing (Rayon) for generating splice site information, which improves performance for large datasets.
- Missing or incalculable values are reported as "NA"; for example, every annotation column is "NA" when a transcript ID is not found in the annotations.
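The metacoordinate calculation mentioned in the notes can be sketched as follows. The scaling is an assumption, not taken from this page: positions in the 5' UTR map to 0-1, positions in the CDS to 1-2, and positions in the 3' UTR to 2-3, each scaled by the length of its region. The sign convention for abs_cds_start and abs_cds_end (position minus landmark) follows the example row on this page. The function name meta_coordinates and the use of None for an undefined metacoordinate are hypothetical.

```rust
/// Return (metacoordinate, abs_cds_start, abs_cds_end) for a transcript
/// position. Assumed scaling: [0,1) 5' UTR, [1,2) CDS, [2,3] 3' UTR.
/// The metacoordinate is None when the region containing the position has
/// zero length, e.g. for a transcript without an annotated CDS.
fn meta_coordinates(
    tx_coord: u64,
    utr5_len: u64,
    cds_len: u64,
    utr3_len: u64,
) -> (Option<f64>, i64, i64) {
    let cds_start = utr5_len;
    let cds_end = utr5_len + cds_len;

    // Fraction of the way through a region, or None for an empty region.
    let frac = |offset: u64, len: u64| -> Option<f64> {
        if len == 0 {
            None
        } else {
            Some(offset as f64 / len as f64)
        }
    };

    let meta = if tx_coord < cds_start {
        frac(tx_coord, utr5_len)
    } else if tx_coord < cds_end {
        frac(tx_coord - cds_start, cds_len).map(|f| 1.0 + f)
    } else {
        frac(tx_coord - cds_end, utr3_len).map(|f| 2.0 + f)
    };

    // Signed distances: negative means the position lies before the landmark.
    let abs_cds_start = tx_coord as i64 - cds_start as i64;
    let abs_cds_end = tx_coord as i64 - cds_end as i64;
    (meta, abs_cds_start, abs_cds_end)
}
```

Under these assumptions, a position halfway through a 1300 nt CDS that follows a 500 nt 5' UTR (position 1150) gets a metacoordinate of 1.5, lies 650 nt after the CDS start, and 650 nt before the CDS end. The sketch does not check that the position falls within the transcript.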
Input:
- Transcript ID: ENST00000456328.2
- Position: 1000
Process:
- Look up ENST00000456328.2 in annotations
- Calculate transcript metrics (length, CDS positions, etc.)
- Determine metacoordinate for position 1000
- Find nearest upstream and downstream splice sites
- Compile all information into output line
Output:
"ENST00000456328.2 1000 ... [original columns] ... ENSG00000123456 GENE1 protein_coding 2000 500 1800 2000 1.5 500 -800 150 300"
Complete source code for R2Dtool annotate is available at https://github.com/comprna/R2Dtool/blob/main/src/annotate.rs