Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

To identify differential alternative splicing between two groups #320

Open
jingxian555 opened this issue Sep 10, 2023 · 14 comments
Open

To identify differential alternative splicing between two groups #320

jingxian555 opened this issue Sep 10, 2023 · 14 comments

Comments

@jingxian555
Copy link

Thanks for your tools! There are 2 sample groups with 3 BAM files per group,but I am counfused with the results, how to filter for differential alternative splicing between the experimental group and the control group? I'm looking at this result file [AS_Event].MATS.JC.txt, can I directly filter differential alternative splicing based on FDR? Thanks!

The executed script is as follows:
python /home/sunpf/my_data/software/miniconda3/envs/new_rmats/rMATS/rmats.py --b1 u8.txt --b2 WT.txt --gtf Arabidopsis_thaliana.TAIR10.57.gtf -t paired --readLength 150 --novelSS --od out --tmp tmp

@EricKutschera
Copy link
Contributor

Yes, you can filter on FDR to find significant differential alternative splicing events between the two groups. This post mentions some potential filters for other columns in the output #183 (comment)

@jingxian555
Copy link
Author

Thank you for your suggestions. Here is the code of python for advanced filtering, Is my understanding correct? And are these filtering criteria necessary? After filtering based on these criteria, there are only 30 significant differential alternative splicing events remaining. Can I just filter based on FDR < 0.05 to find significant differential alternative splicing events between the two groups? Thanks!

fout = open('significant.SE.MATS.JC.txt', 'w')
with open('SE.MATS.JC.txt','r') as fin:
    line = fin.readline()
    for line in fin:
        line = line.strip()
        arr = line.split('\t')
        IJC_SAMPLE_1 = arr[12]
        SJC_SAMPLE_1 = arr[13]
        IJC_SAMPLE_2 = arr[14]
        SJC_SAMPLE_2 = arr[15]
        FDR = arr[-4]
        IncLevel1 = arr[-3]
        IncLevel2 = arr[-2]

        IJC_SAMPLE_1_l = [int(IJC1) for IJC1 in IJC_SAMPLE_1.split(',')]
        IJC_SAMPLE_1_avg = sum(IJC_SAMPLE_1_l)/len(IJC_SAMPLE_1_l)

        SJC_SAMPLE_1_l = [int(SJC1) for SJC1 in SJC_SAMPLE_1.split(',')]
        SJC_SAMPLE_1_avg = sum(SJC_SAMPLE_1_l)/len(SJC_SAMPLE_1_l)

        IJC_SAMPLE_2_l = [int(IJC2) for IJC2 in IJC_SAMPLE_2.split(',')]
        IJC_SAMPLE_2_avg = sum(IJC_SAMPLE_2_l)/len(IJC_SAMPLE_2_l)

        SJC_SAMPLE_2_l = [int(SJC2) for SJC2 in SJC_SAMPLE_2.split(',')]
        SJC_SAMPLE_2_avg = sum(SJC_SAMPLE_2_l)/len(SJC_SAMPLE_2_l)

        IncLevel1_l = [float(I1) for I1 in IncLevel1.split(',') if I1 != 'NA']
        IncLevel1_avg = sum(IncLevel1_l)/len(IncLevel1_l)
        IncLevel1_range = max(IncLevel1_l) - min(IncLevel1_l)

        IncLevel2_l = [float(I2) for I2 in IncLevel2.split(',') if I2 != 'NA']
        IncLevel2_avg = sum(IncLevel2_l)/len(IncLevel2_l)
        IncLevel2_range = max(IncLevel2_l) - min(IncLevel2_l)

        if (IJC_SAMPLE_1_avg + SJC_SAMPLE_1_avg >= 10) and (IJC_SAMPLE_2_avg + SJC_SAMPLE_2_avg >= 10) and (float(FDR) < 0.05) and (IncLevel1_range > 0.05) and (0.05 < IncLevel1_avg < 0.95) and (IncLevel2_range > 0.05) and (0.05 < IncLevel2_avg < 0.95):
            fout.write(line + '\n')

fout.close()

@EricKutschera
Copy link
Contributor

Yes, you can filter just based on FDR

For the code, I think the checks for the inclusion levels mentioned in that other post were intended to be done with all samples. Instead of requiring the PSI value to differ within a sample group like IncLevel1_range > 0.05, I think the original intention was to check that the PSI value had some variation when looking at all samples

@jingxian555
Copy link
Author

I apologize, but I'm having difficulty understanding your request. Can you just change the code? Thank you so much!

@jingxian555
Copy link
Author

I saw in the article that evaluate splicing defects by comparing the percentage spliced-in (PSI) index of AS events (P < 0.05 and delta PSI > 3%) , Is delta PSI the same as abs(IncLevelDifference)?Thank you~

@EricKutschera
Copy link
Contributor

I'm not sure what article you're referring to, but IncLevelDifference is a difference in PSI values. Taking the absolute value of IncLevelDifference and calling that delta PSI is reasonable

For the code change

fout = open('significant.SE.MATS.JC.txt', 'w')
with open('SE.MATS.JC.txt','r') as fin:
    line = fin.readline()
    for line in fin:
        line = line.strip()
        arr = line.split('\t')
        IJC_SAMPLE_1 = arr[12]
        SJC_SAMPLE_1 = arr[13]
        IJC_SAMPLE_2 = arr[14]
        SJC_SAMPLE_2 = arr[15]
        FDR = arr[-4]
        IncLevel1 = arr[-3]
        IncLevel2 = arr[-2]

        IJC_SAMPLE_1_l = [int(IJC1) for IJC1 in IJC_SAMPLE_1.split(',')]
        IJC_SAMPLE_1_avg = sum(IJC_SAMPLE_1_l)/len(IJC_SAMPLE_1_l)

        SJC_SAMPLE_1_l = [int(SJC1) for SJC1 in SJC_SAMPLE_1.split(',')]
        SJC_SAMPLE_1_avg = sum(SJC_SAMPLE_1_l)/len(SJC_SAMPLE_1_l)

        IJC_SAMPLE_2_l = [int(IJC2) for IJC2 in IJC_SAMPLE_2.split(',')]
        IJC_SAMPLE_2_avg = sum(IJC_SAMPLE_2_l)/len(IJC_SAMPLE_2_l)

        SJC_SAMPLE_2_l = [int(SJC2) for SJC2 in SJC_SAMPLE_2.split(',')]
        SJC_SAMPLE_2_avg = sum(SJC_SAMPLE_2_l)/len(SJC_SAMPLE_2_l)

        IncLevel1_l = [float(I1) for I1 in IncLevel1.split(',') if I1 != 'NA']
        IncLevel2_l = [float(I2) for I2 in IncLevel2.split(',') if I2 != 'NA']

        all_IncLevel = IncLevel1_l + IncLevel2_l
        IncLevel_avg = sum(all_IncLevel)/len(all_IncLevel)
        IncLevel_range = max(all_IncLevel) - min(all_IncLevel)

        if (IJC_SAMPLE_1_avg + SJC_SAMPLE_1_avg >= 10) and (IJC_SAMPLE_2_avg + SJC_SAMPLE_2_avg >= 10) and (float(FDR) < 0.05) and (IncLevel_range > 0.05) and (0.05 < IncLevel_avg < 0.95):
            fout.write(line + '\n')

fout.close()

@jingxian555
Copy link
Author

Thank you for your previous advice; it was very helpful.
Now, I have a new question. Is the '--b1' parameter for the BAM file of the experimental group? For example, in this "SE.MATS.JC.txt" file, does IncLevelDifference > 0 indicate a higher probability of SE events occurring in group b1 compared to b2? Thank you!

The executed script is as follows:
python /home/sunpf/my_data/software/miniconda3/envs/new_rmats/rMATS/rmats.py --b1 u8.txt --b2 WT.txt --gtf Arabidopsis_thaliana.TAIR10.57.gtf -t paired --readLength 150 --novelSS --od out --tmp tmp

@EricKutschera
Copy link
Contributor

From the README: https://github.com/Xinglab/rmats-turbo/tree/v4.2.0#output

IncLevelDifference: average(IncLevel1) - average(IncLevel2)

IncLevelDifference > 0 indicates that IncLevel1 > IncLevel2. Since IncLevel1 corresponds to --b1 and IncLevel2 to --b2, that would mean a higher inclusion level in --b1 as compared to --b2. For SE events a higher inclusion level means that the exon is included more often

Talking about events occurring in one group or the other can be confusing as mentioned in this post: Xinglab/rmats2sashimiplot#68 (comment)

@jingxian555
Copy link
Author

Thank you. Is rMATS used for alternative splicing analysis with uniquely mapped reads in BAM files?

@EricKutschera
Copy link
Contributor

Yes, rMATS only uses uniquely mapped reads. rMATS will write a section to stdout saying how many reads were filtered out for different reasons. In that section NOT_NH_1 means the read was filtered out because it was not uniquely mapped. Here's a related post: #293

@jingxian555
Copy link
Author

Thank you!

  1. The gene AT5G50100 has only one transcript with 8 exonic regions. Why is there a discrepancy with the coordinate information in RI.MATS.JC.txt?
    exonic regions:
AT5G50100

RI.MATS.JC.txt:
PSI

  1. Below is the image generated by rmats2sashimiplot, where the numbers on the image represent the read count on the junctions. Can you provide more detailed explanations?
AS

@EricKutschera
Copy link
Contributor

The rmats command you posted before included --novelSS. With that option rmats can detect events with splice sites that are not in the --gtf: #277 (comment)

rmats2sashimiplot doesn't show counts for novel junctions: https://github.com/Xinglab/rmats2sashimiplot/blob/v3.0.0/src/MISO/misopy/sashimi_plot/plot_utils/plot_gene.py#L148
This post discusses differences in read counts between rmats and rmats2sashimiplot
Xinglab/rmats2sashimiplot#33 (comment)

@jingxian555
Copy link
Author

Thanks! I removed the --novelSS from the rmats command and filtered significant RI using these filters:

  1. Average PSI (IncLevel) within 0.05 and 0.95
  2. Average total read count (inclusion count + skipping count) >= 10
  3. max(PSI) – min(PSI) > 0.05
  4. FDR < 0.05
  5. abs(IncLevelDifference) > 0.2

The significant RI events I found all follow a pattern. For example, in the image below, the red boxes indicate significant RI. However, these significant RI regions belong to exonic regions in another transcript. That is, there are overlapping regions of exons and introns in different transcripts. Is this due to differences in exonic regions causing differences in intron retention?
RI

The executed script is as follows:
python /home/sunpf/my_data/software/miniconda3/envs/new_rmats/rMATS/rmats.py --b1 u8.txt --b2 WT.txt --gtf Arabidopsis_thaliana.TAIR10.57.gtf -t paired --readLength 150 --od out --tmp tmp

@EricKutschera
Copy link
Contributor

rMATS relies on the --gtf for RI events. Unless --novelSS is used, all of the RI events reported by rMATS should have the intron region overlap an annotated transcript in an exon region. This post has some details: #17

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants