MSTRG Tag, what does it refer to? #95

Closed
eudoraleer opened this issue Feb 9, 2017 · 3 comments

Comments


eudoraleer commented Feb 9, 2017

Dear developer,

I was trying to match biological replicates from two different experiments of the same kind using the gene count output from prepDE.py, and realised that a lot of genes were lost during the matching process (in which I matched gene IDs as well as gene names). This makes my data look really weird because of the loss of information. For instance:

In data set 1: gene UTP6 was tagged with MSTRG.43179
In data set 2: gene UTP6 was tagged with MSTRG.42854

I am not sure how the software assigns the MSTRG tag numbers to each isoform of a gene, i.e. do you use a random series of numbers to associate to every gene's isoforms and assign randomly during each run in Stringtie? If so, and a gene has only one isoform, could the MSTRG tag for gene i in data set A actually refer to the same isoform of gene i in data set B even though the MSTRG tags are different?

Also frankly, I do not really understand the difference between the two files: gene count and transcript count output from prepDE.py (or actually any other software which produces gene count and transcript count files).

Say in gene count file in a dataset for gene UTP6, its associated MSTRG tag is MSTRG.42854,
whereas in transcript file, its associated tags are:

MSTRG.42854.3
MSTRG.42854.6

As far as I understand, the gene count file consolidates all transcripts with the same MSTRG number (ignoring the final number after the second dot) to form the gene count for that MSTRG number.

But what I am confused about is this: in the transcript file, does it mean that MSTRG.42854.3 and MSTRG.42854.6 are two different transcripts of the same gene? Or are they segments of the same transcript for this particular gene (so that the counts associated with them in the transcript file are actually counts of segments)?

Also, I just realised that for the same MSTRG tag in different data sets (not run together), the associated gene name (according to the merged GTF output file from StringTie) is different. The same MSTRG ID is tagged to entirely different genes. For instance, I have here:

C1D gene tagged with MSTRG.54609 in dataset 1 and,
GTF3C2 gene also tagged with MSTRG.54609 but in dataset 2.

I hope you could address my concerns asap, thank you so much.

eudoraleer changed the title from "MSTRG Tag, what does it refers to?" to "MSTRG Tag, what does it refer to?" on Feb 9, 2017
Owner

gpertea commented Feb 9, 2017

> do you use a random series of numbers to associate to every gene's isoforms and assign randomly during each run in Stringtie?

Gene numbers (the first number following MSTRG) are assigned incrementally, in each sample, in the order that their transcripts are generated. They generally follow the location ordering of the "bundles" (clumps of overlapping read alignments, or of transcripts in the case of --merge). This ordering breaks, however, during multi-threaded processing: one thread can finish before others that are still processing a previous "bundle", so whichever thread finishes first will "grab" the next available gene#, and so on.

But these are technical details, it's just a way of ensuring that each gene/locus has a unique identifier -- so from a practical perspective, they are no better than if they were randomly assigned. They are obviously different between samples so they should not be used as a way to identify genes across samples -- use the genomic location instead, and/or attributes like ref_gene_id if available.
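As a rough illustration of the ref_gene_id suggestion above, the following sketch collects the reference gene IDs attached to each MSTRG gene in a GTF and intersects them across two data sets. It assumes the standard 9-column, tab-separated GTF layout with `key "value";` attributes. The function names, the coordinates, and the `REF_UTP6` identifier are invented for the example; only the two MSTRG numbers come from this thread.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gene_id "MSTRG.43179";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def reference_ids_by_gene(gtf_lines):
    """Map each gene_id to the set of ref_gene_id values on its transcripts.

    Genes with no ref_gene_id (novel loci) map to an empty set.
    """
    mapping = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        refs = mapping[gene_id]  # creates an empty set on first sight
        if "ref_gene_id" in attrs:
            refs.add(attrs["ref_gene_id"])
    return dict(mapping)


# Invented records: coordinates and REF_UTP6 are placeholders, not real data.
dataset_1 = [
    'chr17\tStringTie\ttranscript\t1000\t5000\t1000\t+\t.\t'
    'gene_id "MSTRG.43179"; transcript_id "MSTRG.43179.1"; ref_gene_id "REF_UTP6";',
]
dataset_2 = [
    'chr17\tStringTie\ttranscript\t1200\t5200\t1000\t+\t.\t'
    'gene_id "MSTRG.42854"; transcript_id "MSTRG.42854.3"; ref_gene_id "REF_UTP6";',
    'chr2\tStringTie\ttranscript\t100\t900\t1000\t-\t.\t'
    'gene_id "MSTRG.99"; transcript_id "MSTRG.99.1";',
]

refs_1 = reference_ids_by_gene(dataset_1)
refs_2 = reference_ids_by_gene(dataset_2)

# Reference IDs seen in both data sets, regardless of the MSTRG numbers.
shared = set().union(*refs_1.values()) & set().union(*refs_2.values())
print(shared)
```

Novel loci end up with an empty set here, so they cannot be matched this way and would need a location-based comparison instead.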

> does it mean that MSTRG.42854.3 and MSTRG.42854.6 are two different transcripts of the same gene?

Yes. After the gene numbers are assigned as described above, all the overlapping transcripts are numbered incrementally (the 2nd number). The naming convention for transcripts is: MSTRG.gene#.transcript#
(while for genes it is just MSTRG.gene#).
Again, these numbers only make sense within a single sample, since the gene# in one sample has nothing to do with the same gene# in another sample (and the transcripts for each gene might have been assembled differently, etc.).
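The naming convention can be made concrete with a small sketch that splits a transcript ID into its gene ID and transcript number, then groups transcripts by gene. The helper names are hypothetical, and the pattern only covers IDs of the MSTRG.gene#.transcript# form; reference transcript IDs carried over from an annotation are skipped.

```python
import re
from collections import defaultdict

# MSTRG.<gene#>.<transcript#>, e.g. MSTRG.42854.3
MSTRG_TX_RE = re.compile(r"^(MSTRG\.\d+)\.(\d+)$")


def split_transcript_id(transcript_id):
    """Return (gene_id, transcript_number), or None if the ID is not MSTRG-style."""
    match = MSTRG_TX_RE.match(transcript_id)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def group_by_gene(transcript_ids):
    """Group transcript numbers under their MSTRG gene ID."""
    groups = defaultdict(list)
    for transcript_id in transcript_ids:
        parsed = split_transcript_id(transcript_id)
        if parsed is not None:
            gene_id, transcript_number = parsed
            groups[gene_id].append(transcript_number)
    return dict(groups)


groups = group_by_gene(["MSTRG.42854.3", "MSTRG.42854.6"])
print(groups)  # {'MSTRG.42854': [3, 6]}
```

Both IDs from the question fall under the single gene MSTRG.42854, as two separate transcripts of that locus.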

> Or they are the segments of the same transcript for this particular gene?

Not sure what you mean, but obviously some of the assembled transcripts may have been assembled only partially (due to low expression, for example), so they can indeed also be just "fragments" of real transcripts for that gene.

It seems like you are trying to compare results from two separate stringtie --merge runs -- which is a bit unusual. The recommended DE analysis pipeline was supposed to use a single, common super-set of transcripts assembled across all samples/experiments (so stringtie --merge should've been run with all the outputs from all the samples/experiments at once).
If, however, for some reason you are forced to make this late comparison between counts generated with different reference transcript sets (which might only be possible at gene level, hoping that the gene boundaries haven't changed between the two experiments), you cannot rely on MSTRG.gene# identifiers. Instead, I'd suggest converting those gene IDs into locations on the genome (or into some common reference annotation gene IDs/symbols, though these will not be available for "novel" genes).
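One possible way to convert gene IDs into genomic locations is sketched below: each gene's locus is taken as the span of its transcript records in the GTF, and genes from two data sets are paired when they sit on the same chromosome and strand and overlap sufficiently. This is only an illustration under stated assumptions. The function names, the `min_fraction` threshold of 0.5, and all coordinates are invented, and the pairwise loop is quadratic, so a real comparison would likely want an interval index.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_loci(gtf_lines):
    """Map gene_id -> (chrom, strand, start, end) spanning all its transcripts."""
    loci = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        match = GENE_ID_RE.search(cols[8])
        if match is None:
            continue
        gene_id = match.group(1)
        start, end = int(cols[3]), int(cols[4])
        if gene_id in loci:
            _, _, old_start, old_end = loci[gene_id]
            start, end = min(start, old_start), max(end, old_end)
        loci[gene_id] = (cols[0], cols[6], start, end)
    return loci


def match_by_overlap(loci_a, loci_b, min_fraction=0.5):
    """Pair gene IDs whose loci share chromosome and strand and overlap by at
    least min_fraction of the shorter locus (an arbitrary threshold)."""
    pairs = []
    for id_a, (chrom_a, strand_a, start_a, end_a) in loci_a.items():
        for id_b, (chrom_b, strand_b, start_b, end_b) in loci_b.items():
            if (chrom_a, strand_a) != (chrom_b, strand_b):
                continue
            overlap = min(end_a, end_b) - max(start_a, start_b) + 1
            if overlap <= 0:
                continue
            shorter = min(end_a - start_a, end_b - start_b) + 1
            if overlap / shorter >= min_fraction:
                pairs.append((id_a, id_b))
    return pairs


# Invented records: only the MSTRG numbers come from the thread.
merged_1 = [
    'chr17\tStringTie\ttranscript\t1000\t5000\t1000\t+\t.\t'
    'gene_id "MSTRG.43179"; transcript_id "MSTRG.43179.1";',
]
merged_2 = [
    'chr17\tStringTie\ttranscript\t1200\t5200\t1000\t+\t.\t'
    'gene_id "MSTRG.42854"; transcript_id "MSTRG.42854.3";',
    'chr17\tStringTie\ttranscript\t900\t4800\t1000\t+\t.\t'
    'gene_id "MSTRG.42854"; transcript_id "MSTRG.42854.6";',
    'chr2\tStringTie\ttranscript\t100\t200\t1000\t-\t.\t'
    'gene_id "MSTRG.54609"; transcript_id "MSTRG.54609.1";',
]

loci_1 = gene_loci(merged_1)
loci_2 = gene_loci(merged_2)
pairs = match_by_overlap(loci_1, loci_2)
print(pairs)  # [('MSTRG.43179', 'MSTRG.42854')]
```

As noted above, this only makes sense at gene level and only if the gene boundaries are reasonably stable between the two transcript sets.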

@eudoraleer
Author

Thank you so much.

> Not sure what you mean but obviously some of the assembled transcripts may have been assembled only partially (due to low expression, for example) so they can indeed also be just "fragments" of real transcripts for that gene.

So if they are "fragments" of real transcripts for that gene, then why are there "fragment" counts for each of the "fragmented" transcripts in each sample? For instance, MSTRG.42854.3 has a count of 10 in Sample 1; does that mean that, for "fragment" MSTRG.42854.3, Sample 1 has three such "fragments"? And what are the criteria for grouping these fragments to form the counts?

> It seems like you are trying to compare results from two separate stringtie --merge runs -- which is a bit unusual. The recommended DE analysis pipeline was supposed to use a single, common super-set of transcripts assembled across all samples/experiments (so stringtie --merge should've been run with all the outputs from all the samples/experiments at once).
> If however for some reasons you are forced to make this late comparison between counts generated with different reference transcript sets (which might only be possible at gene level, hoping that the gene boundaries haven't changed between the two experiments), you cannot rely on MSTRG.gene# identifiers but instead I'd suggest converting those gene IDs into locations on the genome (or some common reference annotation gene IDs/symbols, though such will not be available for "novel" genes).

I am not sure if I could merge the two sets of data: one is unpaired RNA-Seq data and the other is paired-end RNA-Seq data. Data 1 is a set of samples run in 2015 and data 2 was run in 2016. That's why they were processed separately; only once the correlation analysis is done will we consider whether they can be merged for further analysis.

Owner

gpertea commented Feb 9, 2017

The "fragments" in the counts (coverage estimates) are not the same thing as the "fragmented transcripts" that we discussed here (I guess "partially assembled transcripts" would've been a better term, to avoid such confusion).

Please read about the definition of "fragment" and FPKM in the RNA-Seq assembly papers. The StringTie paper and previous papers also explain the various methods of estimating coverage for transcripts etc. I'd suggest that the "issues" section of GitHub is not the proper forum for posting (or asking for) tutorials or introductory materials on topics which are much better covered in the published papers or other online reference materials.
