Recommended filtering of rMATS results #440

Open
ankebusch opened this issue Oct 17, 2024 · 2 comments

@ankebusch

Hi,

I'm trying to understand which filters I could or should apply to the rMATS results. Reading your recent paper in Nature Protocols and previously discussed issues (e.g. #320, #183), I understand that I should filter based on FDR, abs(deltaPSI), average PSIs and average coverage, e.g. as mentioned in your paper:

  1. average read count >=10 in both sample groups
  2. filter out events with average PSI value <0.05 or >0.95 in both sample groups
  3. FDR <= 0.01
  4. abs(deltaPSI) >= 0.05

In order to find a good and maybe universal set of filters, I have the following questions:

  • Does the statistical model that rMATS applies account for the uncertainty from low-coverage events when calculating p-values?
  • Should the coverage filter (1. above) be chosen depending on the experiment or kept at 10 for most datasets?
  • Filter 2 above, as currently implemented in rmats_filtering.py:

    ...
    and min(x.averagePsiSample1, x.averagePsiSample2) <= maxPSI
    and max(x.averagePsiSample1, x.averagePsiSample2) >= minPSI
    ...

    ensures that the average PSIs of the two groups are not both <0.05 or both >0.95. Is filter 2 really needed when filter 4 is applied? Would you instead recommend filtering as suggested in #183, by 0.05 <= average PSI of all samples in the comparison <= 0.95?

Thanks a lot for your help and best,
Anke.

@EricKutschera
Contributor

Yes, the coverage affects the p-value calculation. Here's some output from the statistical model showing the p-value changing as the coverage changes (the PSI value stays the same):

ID	IJC_SAMPLE_1	SJC_SAMPLE_1	IJC_SAMPLE_2	SJC_SAMPLE_2	IncFormLen	SkipFormLen	PValue
0	1,1,1	2,2,2	1,1,1	1,1,1	150	100	0.550549161109
1	10,10,10	20,20,20	10,10,10	10,10,10	150	100	0.045763908041
2	100,100,100	200,200,200	100,100,100	100,100,100	150	100	2.77461456366e-07
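
To check that the PSI really is the same in all three rows, here is a minimal sketch that recomputes it from the counts using the length-normalized inclusion level, PSI = (I/IncFormLen) / (I/IncFormLen + S/SkipFormLen). The function name `psi` is made up for illustration, and only one replicate per row is used because all replicates in the table are identical; this is not code from rMATS itself.

```python
def psi(inc_count, skip_count, inc_len, skip_len):
    """Length-normalized PSI from inclusion and skipping junction counts."""
    inc = inc_count / inc_len
    skip = skip_count / skip_len
    return inc / (inc + skip)


# One replicate per row of the table above:
# (IJC_SAMPLE_1, SJC_SAMPLE_1, IJC_SAMPLE_2, SJC_SAMPLE_2)
rows = [
    (1, 2, 1, 1),
    (10, 20, 10, 10),
    (100, 200, 100, 100),
]
INC_LEN, SKIP_LEN = 150, 100  # IncFormLen and SkipFormLen from the table

psi_pairs = [
    (psi(i1, s1, INC_LEN, SKIP_LEN), psi(i2, s2, INC_LEN, SKIP_LEN))
    for i1, s1, i2, s2 in rows
]

for p1, p2 in psi_pairs:
    print(f"PSI1={p1:.2f}  PSI2={p2:.2f}  deltaPSI={p1 - p2:.2f}")
```

Every row gives PSI 0.25 in sample 1 and 0.40 in sample 2, so only the read depth differs between the three p-values.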

I think the coverage filter of 10 reads is reasonable for most datasets. It should avoid the issue that this post shows where the p-value can change a lot for a small change in read counts when the total read count for that event is low: https://groups.google.com/g/rmats-user-group/c/2PJ6DWFu1m8/m/0J0eY3XlAAAJ

I agree that filter 2 doesn't remove anything that wouldn't already be removed by filter 4, so I think it would be best to just remove filter 2. The filter 0.05 <= average PSI of all samples in the comparison <= 0.95 comes from a situation where there were many samples but they weren't divided into two groups. If you have two groups, then a filter on deltaPSI seems good enough.
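
A rough sketch of what the remaining filters (1, 3 and 4) could look like, plus a brute-force check that filter 2 is implied by filter 4. It assumes rows are dicts keyed by the usual rMATS output columns (`IJC_SAMPLE_1`, `SJC_SAMPLE_1`, `IJC_SAMPLE_2`, `SJC_SAMPLE_2`, `IncLevel1`, `IncLevel2`, `FDR`), and that "average read count" means the mean of inclusion plus skipping counts per replicate. The helper names and the example FDR values are made up for illustration, and rmats_filtering.py may define things differently.

```python
def mean(values):
    return sum(values) / len(values)


def parse_list(field):
    """Parse a comma-separated rMATS field, skipping 'NA' entries."""
    return [float(v) for v in field.split(",") if v != "NA"]


def avg_coverage(ijc, sjc):
    # Assumption: coverage per replicate = inclusion + skipping counts.
    totals = [i + s for i, s in zip(parse_list(ijc), parse_list(sjc))]
    return mean(totals)


def passes(row, min_cov=10, max_fdr=0.01, min_dpsi=0.05):
    """Filters 1, 3 and 4 from the list in the question (filter 2 dropped)."""
    cov1 = avg_coverage(row["IJC_SAMPLE_1"], row["SJC_SAMPLE_1"])
    cov2 = avg_coverage(row["IJC_SAMPLE_2"], row["SJC_SAMPLE_2"])
    psi1 = parse_list(row["IncLevel1"])
    psi2 = parse_list(row["IncLevel2"])
    if not psi1 or not psi2:
        return False  # no usable PSI values in one of the groups
    dpsi = mean(psi1) - mean(psi2)
    return (
        min(cov1, cov2) >= min_cov
        and float(row["FDR"]) <= max_fdr
        and abs(dpsi) >= min_dpsi
    )


def passes_filter2(psi1, psi2, min_psi=0.05, max_psi=0.95):
    """Filter 2 as written in the snippet quoted in the question."""
    return min(psi1, psi2) <= max_psi and max(psi1, psi2) >= min_psi


# Check on a 1% grid (integer percentages avoid float rounding):
# every pair with abs(deltaPSI) >= 0.05 also passes filter 2.
filter2_redundant = all(
    passes_filter2(a / 100, b / 100)
    for a in range(101)
    for b in range(101)
    if abs(a - b) >= 5
)

# Hypothetical rows; the FDR values are invented for the example.
low_cov = {"IJC_SAMPLE_1": "1,1,1", "SJC_SAMPLE_1": "2,2,2",
           "IJC_SAMPLE_2": "1,1,1", "SJC_SAMPLE_2": "1,1,1",
           "IncLevel1": "0.25,0.25,0.25", "IncLevel2": "0.4,0.4,0.4",
           "FDR": "0.001"}
high_cov = dict(low_cov,
                IJC_SAMPLE_1="100,100,100", SJC_SAMPLE_1="200,200,200",
                IJC_SAMPLE_2="100,100,100", SJC_SAMPLE_2="100,100,100")

print(passes(low_cov), passes(high_cov), filter2_redundant)
```

The grid check reflects the argument that two average PSIs which are both above 0.95 (or both below 0.05) can differ by less than 0.05 at most, so such events already fail the deltaPSI filter.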

@ankebusch
Author

Hi Eric,

Thanks a lot for your explanations.

Best,
Anke.
