MSTRG Tag, what does it refer to? #95
Comments
Gene numbers (the first number following MSTRG) are assigned incrementally, in each sample, in the order in which their transcripts are generated. They generally follow the location ordering of the "bundles" (clumps of overlapping read alignments, or transcripts in the case of …). But these are technical details; it's just a way of ensuring that each gene/locus has a unique identifier, so from a practical perspective they are no better than if they were randomly assigned. They are obviously different between samples, so they should not be used to identify genes across samples. Use the genomic location instead, and/or attributes like ref_gene_id if available.
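As a rough illustration of the advice above, the sketch below looks genes up by their `ref_gene_id` attribute rather than by MSTRG number. It assumes tab-separated GTF transcript lines that carry `gene_id` and `ref_gene_id` attributes; the helper names, the coordinates, and the `ref_gene_id` value `UTP6` are made up for the example and are not taken from real output.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))

def ref_genes_by_mstrg(gtf_lines):
    """Map each gene_id (e.g. MSTRG.n) to the set of ref_gene_id values seen for it."""
    mapping = {}
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        if "ref_gene_id" in attrs:  # novel loci carry no reference gene ID
            mapping.setdefault(attrs["gene_id"], set()).add(attrs["ref_gene_id"])
    return mapping

# Hypothetical records: same reference gene, different MSTRG numbers per run.
dataset1 = ['chr17\tStringTie\ttranscript\t100\t900\t1000\t-\t.\t'
            'gene_id "MSTRG.43179"; transcript_id "MSTRG.43179.1"; ref_gene_id "UTP6";']
dataset2 = ['chr17\tStringTie\ttranscript\t100\t900\t1000\t-\t.\t'
            'gene_id "MSTRG.42854"; transcript_id "MSTRG.42854.3"; ref_gene_id "UTP6";']

genes1 = ref_genes_by_mstrg(dataset1)
genes2 = ref_genes_by_mstrg(dataset2)

# The MSTRG keys differ, but the reference gene IDs can be intersected.
shared = set().union(*genes1.values()) & set().union(*genes2.values())
```

Here `genes1` and `genes2` have no MSTRG key in common, yet `shared` still contains the reference gene, which is why the reference attribute (or the genomic location) is the safer key for matching across runs.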
Yes, after the gene names are assigned as described above, all the overlapping transcripts are numbered incrementally (the 2nd number). The naming convention for transcripts is: MSTRG.gene#.transcript#
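For illustration only, a minimal sketch of splitting an ID that follows the `MSTRG.gene#.transcript#` convention into its gene part and transcript number. The function name is hypothetical, and the sketch assumes MSTRG-style IDs; reference transcript IDs may follow other conventions.

```python
def split_mstrg_transcript_id(transcript_id):
    """Split 'MSTRG.<gene#>.<transcript#>' into (gene_id, transcript_number)."""
    gene_id, _, transcript_number = transcript_id.rpartition(".")
    if not gene_id.startswith("MSTRG.") or not transcript_number.isdigit():
        raise ValueError(f"not an MSTRG-style transcript ID: {transcript_id!r}")
    return gene_id, int(transcript_number)

gene_a, number_a = split_mstrg_transcript_id("MSTRG.42854.3")
gene_b, number_b = split_mstrg_transcript_id("MSTRG.42854.6")
```

Both IDs yield the same gene part (`MSTRG.42854`) with different transcript numbers, i.e. two distinct transcripts assigned to one locus.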
Not sure what you mean, but obviously some of the assembled transcripts may have been assembled only partially (due to low expression, for example), so they can indeed also be just "fragments" of real transcripts for that gene. It seems like you are trying to compare results from two separate …
Thank you so much.
So if they are "fragments" of real transcripts for that gene, then why are there "fragment" counts for each of the "fragmented" transcripts in each sample? For instance, MSTRG.42854.3 has a count of 10 in Sample 1; does that mean that, for "fragment" MSTRG.42854.3, sample 1 has three of such "fragments"? And what are the criteria for grouping these fragments to form these counts?
I am not sure whether I can merge the two sets of data: one is unpaired RNA-Seq data, the other is paired-end RNA-Seq data. Data1 is a set of samples run in 2015 and data2 was run in 2016. That's why they were run separately; only once a correlation analysis is done will I consider whether they can be merged for further analysis.
The "fragments" in the counts (coverage estimates) are not the same thing as the "fragmented transcripts" that we discussed here (I guess "partially assembled transcripts" would have been a better term, to avoid such confusion). Please read about the definition of "fragment" and FPKM in the RNA-Seq assembly papers. The StringTie paper and previous papers also explain the various methods of estimating coverage for transcripts, etc. I'd suggest that the "issues" section of GitHub is not the proper forum for posting (or asking for) tutorials or introductory materials on topics which are much better covered in the published papers or other online reference materials.
Dear developer,
I was trying to match biological replicates from two different experiments of the same kind using the gene count output from prepDE.py, and realised that a lot of genes were lost during the matching process (in which I matched gene IDs as well as gene names), which makes my data look really weird because of the loss of information. For instance:
In data set 1: gene UTP6 was tagged with MSTRG.43179
In data set 2: gene UTP6 was tagged with MSTRG.42854
I am not sure how the software assigns the MSTRG tag numbers to each isoform of a gene, i.e. do you use a random series of numbers for every gene's isoforms, assigned randomly during each StringTie run? If so, and a gene has only one isoform, could the MSTRG tag for gene i in data set A actually refer to the same isoform of gene i in data set B, even though their MSTRG tags are different?
Also frankly, I do not really understand the difference between the two files: gene count and transcript count output from prepDE.py (or actually any other software which produces gene count and transcript count files).
Say in gene count file in a dataset for gene UTP6, its associated MSTRG tag is MSTRG.42854,
whereas in transcript file, its associated tags are:
MSTRG.42854.3
MSTRG.42854.6
To this point, as far as I understand, the gene count file consolidates all transcripts with the same MSTRG number (without the last number after the second dot) to form the gene count for that MSTRG number.
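The consolidation described above can be sketched as follows. This assumes that a gene-level count is simply the sum of the counts of the transcripts sharing the same MSTRG gene number, which is the reading given in this post rather than something confirmed here from the prepDE.py source. The count of 10 for MSTRG.42854.3 comes from this thread; the count for MSTRG.42854.6 and the function name are made up for the example.

```python
from collections import defaultdict

def gene_counts_from_transcripts(transcript_counts):
    """Sum transcript-level counts per gene, keyed by the ID without its last number."""
    totals = defaultdict(int)
    for transcript_id, count in transcript_counts.items():
        # Drop the trailing '.<transcript#>' to recover 'MSTRG.<gene#>'.
        gene_id = transcript_id.rsplit(".", 1)[0]
        totals[gene_id] += count
    return dict(totals)

# One sample; the second value is hypothetical.
sample1_transcripts = {"MSTRG.42854.3": 10, "MSTRG.42854.6": 4}
sample1_genes = gene_counts_from_transcripts(sample1_transcripts)
```

In practice, grouping by the `gene_id` attribute from the GTF is likely safer than parsing the transcript ID, since reference transcript IDs need not follow the MSTRG pattern.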
But what I am confused about is whether, in the transcript file, MSTRG.42854**.3** and MSTRG.42854**.6** are two different transcripts of the same gene, or segments of the same transcript for this particular gene (so that the counts associated with them in the transcript file are actually counts of segments).
Also, I just realised that the same MSTRG tag in different data sets (not run together) is associated with a different gene name (with reference to the merged GTF output file from StringTie). Not only are the names different, they are different genes tagged with the same MSTRG ID. For instance, I have here:
C1D gene tagged with MSTRG.54609 in dataset 1 and,
GTF3C2 gene also tagged with MSTRG.54609 but in dataset 2.
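A quick way to check how widespread such collisions are is sketched below. It assumes each data set has already been reduced to a plain mapping from MSTRG ID to gene name; the function name and the dictionary layout are hypothetical, while the two entries are the ones quoted above.

```python
def conflicting_ids(names_a, names_b):
    """Return IDs present in both mappings whose gene names disagree."""
    return {
        mstrg_id: (names_a[mstrg_id], names_b[mstrg_id])
        for mstrg_id in names_a.keys() & names_b.keys()
        if names_a[mstrg_id] != names_b[mstrg_id]
    }

dataset1_names = {"MSTRG.54609": "C1D"}
dataset2_names = {"MSTRG.54609": "GTF3C2"}

conflicts = conflicting_ids(dataset1_names, dataset2_names)
```

Any entry in `conflicts` is an ID that cannot be used to join the two count tables.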
I hope you could address my concerns asap, thank you so much.