Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparison of rMATs results and BAM visualization in IGV #183

Open
huangwb8 opened this issue Feb 25, 2022 · 9 comments
Open

Comparison of rMATs results and BAM visualization in IGV #183

huangwb8 opened this issue Feb 25, 2022 · 9 comments

Comments

@huangwb8
Copy link

I want to validate rMATs results in IGV with BAM files. Here is an example of SE.MATS.JC.txt:

GeneID | geneSymbol | chr | strand | exonStart_0base | exonEnd | upstreamES | upstreamEE | downstreamES | downstreamEE | IJC_SAMPLE_1 | SJC_SAMPLE_1 | IJC_SAMPLE_2 | SJC_SAMPLE_2 | IncFormLen | SkipFormLen | PValue | FDR | IncLevel1 | IncLevel2 | IncLevelDifference
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
ENSG00000108848 | LUC7L3 | chr17 | + | 50746537 | 50746706 | 50745905 | 50746003 | 50750500 | 50756219 | 4,0,0 | 186,281,217 | 3,1,2 | 0,0,0 | 298 | 149 | 7.5816E-12 | 3.17887E-09 | 0.011,0.0,0.0 | 1.0,1.0,1.0 | -0.996
ENSG00000108848 | LUC7L3 | chr17 | + | 50746565 | 50746706 | 50745905 | 50746003 | 50750500 | 50756219 | 0,480,0 | 186,281,217 | 441,338,1 | 0,0,0 | 290 | 149 | 0 | 0 | 0.0,0.467,0.0 | 1.0,1.0,1.0 | -0.844
ENSG00000108848 | LUC7L3 | chr17 | + | 50746547 | 50746706 | 50745905 | 50746003 | 50750500 | 50756219 | 94,98,69 | 186,281,217 | 75,79,83 | 0,0,0 | 298 | 149 | 0 | 0 | 0.202,0.148,0.137 | 1.0,1.0,1.0 | -0.838
ENSG00000108848 | LUC7L3 | chr17 | + | 50746515 | 50746706 | 50745905 | 50746003 | 50750500 | 50756219 | 106,118,70 | 186,281,217 | 102,86,76 | 0,0,0 | 298 | 149 | 0 | 0 | 0.222,0.174,0.139 | 1.0,1.0,1.0 | -0.822
ENSG00000108848 | LUC7L3 | chr17 | + | 50746544 | 50746706 | 50745905 | 50746003 | 50750500 | 50756219 | 379,368,335 | 186,281,217 | 412,315,327 | 0,0,0 | 298 | 149 | 0 | 0 | 0.505,0.396,0.436 | 1.0,1.0,1.0 | -0.554
ENSG00000108848 | LUC7L3 | chr17 | + | 50746673 | 50746702 | 50745905 | 50746003 | 50750500 | 50750641 | 135,714,401,223 | 186,281,217 | 140,311,861,251 | 0,0,0 | 178 | 149 | 0 | 0 | 0.859,0.811,0.825 | 1.0,1.0,1.0 | -0.168
ENSG00000108848 | LUC7L3 | chr17 | + | 50746562 | 50746702 | 50745905 | 50746003 | 50750500 | 50750641 | 145,391,539,812,875 | 186,281,217 | 143,701,293,912,611 | 0,0,0 | 289 | 149 | 0 | 0 | 0.976,0.966,0.968 | 1.0,1.0,1.0 | -0.03
ENSG00000108848 | LUC7L3 | chr17 | + | 50746565 | 50746702 | 50745905 | 50746003 | 50750500 | 50750641 | 143,441,587,412,662 | 186,281,217 | 148,191,329,112,433 | 0,0,0 | 286 | 149 | 0 | 0 | 0.976,0.967,0.968 | 1.0,1.0,1.0 | -0.03
ENSG00000108848 | LUC7L3 | chr17 | + | 50746457 | 50746702 | 50745905 | 50746003 | 50750500 | 50750641 | 156,931,645,913,899 | 186,281,217 | 154,491,388,613,628 | 0,0,0 | 298 | 149 | 0 | 0 | 0.977,0.967,0.97 | 1.0,1.0,1.0 | -0.029
ENSG00000108848 | LUC7L3 | chr17 | + | 50746515 | 50746702 | 50745905 | 50746003 | 50750500 | 50750641 | 157,991,657,713,967 | 186,281,217 | 155,511,397,213,703 | 0,0,0 | 298 | 149 | 0 | 0 | 0.977,0.967,0.97 | 1.0,1.0,1.0 | -0.029
ENSG00000108848 | LUC7L3 | chr17 | + | 50746532 | 50746702 | 50745905 | 50746003 | 50750500 | 50750641 | 156,931,645,913,898 | 186,281,217 | 154,511,388,613,628 | 0,0,0 | 298 | 149 | 0 | 0 | 0.977,0.967,0.97 | 1.0,1.0,1.0 | -0.029
ENSG00000108848 | LUC7L3 | chr17 | + | 50746534 | 50746702 | 50745905 | 50746003 | 50750500 | 50750641 | 156,931,646,113,897 | 186,281,217 | 154,521,388,613,630 | 0,0,0 | 298 | 149 | 0 | 0 | 0.977,0.967,0.97 | 1.0,1.0,1.0 | -0.029
ENSG00000108848 | LUC7L3 | chr17 | + | 50746537 | 50746702 | 50745905 | 50746003 | 50750500 | 50750641 | 156,971,645,913,897 | 186,281,217 | 154,521,388,713,629 | 0,0,0 | 298 | 149 | 0 | 0 | 0.977,0.967,0.97 | 1.0,1.0,1.0 | -0.029
ENSG00000108848 | LUC7L3 | chr17 | + | 50746540 | 50746702 | 50745905 | 50746003 | 50750500 | 50750641 | 156,931,645,913,897 | 186,281,217 | 154,491,388,813,628 | 0,0,0 | 298 | 149 | 0 | 0 | 0.977,0.967,0.97 | 1.0,1.0,1.0 | -0.029
ENSG00000108848 | LUC7L3 | chr17 | + | 50746547 | 50746702 | 50745905 | 50746003 | 50750500 | 50750641 | 157,871,655,713,966 | 186,281,217 | 155,241,396,513,710 | 0,0,0 | 298 | 149 | 0 | 0 | 0.977,0.967,0.97 | 1.0,1.0,1.0 | -0.029
ENSG00000108848 | LUC7L3 | chr17 | + | 50746544 | 50746702 | 50745905 | 50746003 | 50750500 | 50750641 | 160,721,682,714,232 | 186,281,217 | 158,611,420,113,954 | 0,0,0 | 298 | 149 | 0 | 0 | 0.977,0.968,0.97 | 1.0,1.0,1.0 | -0.028
ENSG00000108848 | LUC7L3 | chr17 | + | 50737286 | 50737479 | 50736959 | 50737026 | 50738136 | 50738199 | 668,882,943 | 13,11,5 | 876,553,791 | 1,0,2 | 298 | 149 | 0.001876626 | 0.043623311 | 0.963,0.976,0.99 | 0.998,1.0,0.995 | -0.021
ENSG00000108848 | LUC7L3 | chr17 | + | 50746541 | 50746706 | 50745719 | 50746003 | 50750500 | 50756219 | 316,063,302,328,791 | 186,281,217 | 313,882,713,328,969 | 0,0,0 | 298 | 149 | 0.000102592 | 0.004900245 | 0.988,0.983,0.985 | 1.0,1.0,1.0 | -0.015
ENSG00000108848 | LUC7L3 | chr17 | + | 50751206 | 50751362 | 50750746 | 50750852 | 50752199 | 50752550 | 16,7,15 | 289,244,227 | 11,12,3 | 461,595,398 | 298 | 149 | 0.000376926 | 0.013227071 | 0.027,0.014,0.032 | 0.012,0.01,0.004 | 0.016
ENSG00000108848 | LUC7L3 | chr17 | + | 50751226 | 50751362 | 50750746 | 50750852 | 50752199 | 50752550 | 14,7,15 | 289,244,227 | 11,10,3 | 461,595,398 | 285 | 149 | 0.000498415 | 0.016345559 | 0.025,0.015,0.033 | 0.012,0.009,0.004 | 0.016
ENSG00000108848 | LUC7L3 | chr17 | + | 50751296 | 50751375 | 50750746 | 50750852 | 50752199 | 50752709 | 35,23,13 | 289,244,227 | 20,28,16 | 461,595,398 | 228 | 149 | 0.000179583 | 0.007499167 | 0.073,0.058,0.036 | 0.028,0.03,0.026 | 0.028
ENSG00000108848 | LUC7L3 | chr17 | + | 50744651 | 50744699 | 50743671 | 50743810 | 50745719 | 50746003 | 8,248,001,025 | 18,23,34 | 838,563,779 | 37,36,39 | 197 | 149 | 0.000381976 | 0.013349 | 0.972,0.963,0.958 | 0.945,0.922,0.938 | 0.029
ENSG00000108848 | LUC7L3 | chr17 | + | 50738140 | 50738199 | 50737286 | 50737479 | 50740305 | 50740345 | 20,14,21 | 122,162,137 | 14,6,8 | 209,135,176 | 208 | 149 | 0.000235282 | 0.009226337 | 0.105,0.058,0.099 | 0.046,0.031,0.032 | 0.051
ENSG00000108848 | LUC7L3 | chr17 | + | 50738111 | 50738199 | 50737286 | 50737479 | 50740305 | 50740345 | 24,20,30 | 122,162,137 | 20,8,13 | 209,135,176 | 237 | 149 | 0.000155351 | 0.006719808 | 0.11,0.072,0.121 | 0.057,0.036,0.044 | 0.055
ENSG00000108848 | LUC7L3 | chr17 | + | 50738122 | 50738199 | 50737286 | 50737479 | 50740305 | 50740345 | 24,19,29 | 122,162,137 | 15,8,13 | 209,135,176 | 226 | 149 | 3.56967E-05 | 0.002154182 | 0.115,0.072,0.122 | 0.045,0.038,0.046 | 0.06
ENSG00000108848 | LUC7L3 | chr17 | + | 50738078 | 50738199 | 50737286 | 50737479 | 50740305 | 50740345 | 25,21,37 | 122,162,137 | 17,8,13 | 209,135,176 | 270 | 149 | 2.57854E-06 | 0.000252306 | 0.102,0.067,0.13 | 0.043,0.032,0.039 | 0.062
ENSG00000108848 | LUC7L3 | chr17 | + | 50751296 | 50751362 | 50750746 | 50750852 | 50752199 | 50752585 | 102,78,63 | 289,244,227 | 81,85,60 | 461,595,398 | 215 | 149 | 5.88369E-08 | 1.03062E-05 | 0.197,0.181,0.161 | 0.109,0.09,0.095 | 0.082
ENSG00000108848 | LUC7L3 | chr17 | + | 50738136 | 50738199 | 50737286 | 50737479 | 50740305 | 50740345 | 128,212,177 | 122,162,137 | 131,78,101 | 209,135,176 | 212 | 149 | 1.28104E-08 | 2.70397E-06 | 0.424,0.479,0.476 | 0.306,0.289,0.287 | 0.166

It seem that some SE events of the LUC7L3 gene differ from samples. Next, I visualized 17:50,743,291-50,753,867 in IGV. Here is the plot:

image-20220225121559001

It appears that there are no difference between these two samples.

My rMATS code is just like:

nohup python rmats.py \
    --b1 ${path_rmats_res}/Sample_1.txt \
    --b2 ${path_rmats_res}/Sample_2.txt \
    --gtf ${path_gtf} \
    -t paired \
    --libType fr-secondstrand --readLength 150 \
    --variable-read-length \
    --novelSS --mil 50 --mel 500 \
    --allow-clipping \
    --nthread 20 \
    --od ${path_rmats_res}/1vs2 \
    --tmp ${path_rmats_res}/1vs2/tmp \
    > ${path_rmats_log}/rmats_1vs2.log 2>&1 &

Information of genome & gtf: ENSEMBL; Homo_sapiens.GRCh38; release104

Any suggestions?

@EricKutschera
Copy link
Contributor

The SE.MATS.JC.txt output you posted has a lot of events in that region and it looks like the events share a lot of the same coordinates. Some pairs of events only differ by a few nt. Differences that small are hard to see in the IGV screenshot you posted

@huangwb8
Copy link
Author

@EricKutschera

Thanks! Here is the results of rMATS in my project: https://github.com/huangwb8/test_file/tree/master/rmats/DM.vs.DB

I'm looking for some obviously different events like this:

image-20220227072403828
Apparently, P value is not a good parameter to find out this kind.

Any sugguestions?

@EricKutschera
Copy link
Contributor

Filtering on the PValue or FDR column is a good start. You could also try filtering on the count columns (IJC_SAMPLE_1, SJC_SAMPLE_1, IJC_SAMPLE_2, SJC_SAMPLE_2) and the inclusion level columns (IncLevel1, IncLevel2, IncLevelDifference)

This paper (https://doi.org/10.1016/j.ajhg.2020.06.002) uses these filters which you could use as a starting point:

  • Average PSI (IncLevel) within 0.05 and 0.95
  • Average total read count (inclusion count + skipping count) >= 10
  • max(PSI) – min(PSI) > 0.05

To get events similar to the plots you posted you may want a filter like: abs(IncLevelDifference) > 0.2 to filter for events with a large change

@huangwb8
Copy link
Author

huangwb8 commented Mar 1, 2022

@EricKutschera Thanks for your nice advices! Please do not close this issue, and I would update my test in the future.

@huangwb8
Copy link
Author

huangwb8 commented Mar 1, 2022

@EricKutschera Hello!

Here I give an example, which should be a very significant event according to your advices:

ID...1          GeneID geneSymbol   chr strand exonStart_0base
203105 ENSG00000135486    HNRNPA1 chr12      +        54282826
203023 ENSG00000135486    HNRNPA1 chr12      +        54282799

exonEnd upstreamES upstreamEE downstreamES downstreamEE ID...12
54283231   54282572   54282662     54283826     54283967  203105
54283231   54282572   54282662     54283826     54283967  203023

IJC_SAMPLE_1 SJC_SAMPLE_1   IJC_SAMPLE_2    SJC_SAMPLE_2    IncFormLen
1,0,0        9,15,12        131,105,110     11,22,14        298
14,11,16     9,15,12        155,121,119     11,22,14        298

SkipFormLen       PValue          FDR          IncLevel1
149               0.000000e+00  0.000000e+00   0.053,0.0,0.0
149               4.774123e-10  1.408861e-07   0.437,0.268,0.4

IncLevel2            IncLevelDifference
0.856,0.705,0.797    -0.768
0.876,0.733,0.81     -0.438

When I try to visualize it in IGV with 12:54,282,572-54,283,967, It looks like:
image-20220301184757849

with chr12:54282826-54283231, It looks like:
image-20220301184848698

with chr12:54282572-54282662, it looks like:
image-20220301185020994

with chr12:54283826-54283967, it looks like:
image-20220301185051868

Is it something wrong?

@EricKutschera
Copy link
Contributor

I looked at the SE.MATS.JC.txt file from https://github.com/huangwb8/test_file/tree/master/rmats/DM.vs.DB and looked at events for ENSG00000135486. I couldn't find the exact events from your last post, but there are a lot of events for that gene where the individual replicates have inclusion and/or skipping counts > 10,000. I'm not sure exactly what the y-axis value means in the IGV screenshot, but that is also > 10,000. Since the events you posted only have 10s or 100s of junction reads compared to the other nearby events with 10,000s of junction reads, the events are not noticeable in IGV. It's not clear if those are real events or just noise. You could try filtering for events with higher junction counts

@huangwb8
Copy link
Author

huangwb8 commented Mar 4, 2022

@EricKutschera
Thanks for nice reply! I'm sorry but the result comes from the SE.MATS.JC.txt of https://github.com/huangwb8/test_file/tree/master/rmats/DM.vs.DC. Here is the code of R for advanced filtering. I'm not sure how to code with max(PSI) – min(PSI) > 0.05. Is it abs(IncLevelDifference)>0.05? Any suggestions?

library(dplyr)

x <- read.table('SE.MATS.JC.txt',sep = '\t',header = T)

# function: Average total read count
aver_total_rc <- function(vt){
  # vt='9142,7523,7819'
  return(mean(as.numeric(unlist(strsplit(vt,split = ',')))))
}

# Average total read count
x$atrc_1 <- apply(x,1,function(v)aver_total_rc(v["IJC_SAMPLE_1"])) + apply(x,1,function(v)aver_total_rc(v["SJC_SAMPLE_1"]))
x$atrc_2 <- apply(x,1,function(v)aver_total_rc(v["IJC_SAMPLE_2"])) + apply(x,1,function(v)aver_total_rc(v["SJC_SAMPLE_2"]))

# Difference of read counts
x$diff_IJC <- apply(x,1,function(v)aver_total_rc(v["IJC_SAMPLE_1"])) - apply(x,1,function(v)aver_total_rc(v["IJC_SAMPLE_2"]))
x$diff_SJC <- apply(x,1,function(v)aver_total_rc(v["SJC_SAMPLE_1"])) - apply(x,1,function(v)aver_total_rc(v["SJC_SAMPLE_2"]))

# Filter
x2 <- dplyr::filter(
  x,
  FDR < 0.1,
  IncLevel1>0.05 & IncLevel1 < 0.95,
  IncLevel2>0.05 & IncLevel2 < 0.95,
  atrc_1 >=10 & atrc_2 >=10,
  # abs(IncLevelDifference)>0.05,
  abs(diff_IJC)>=10000|abs(diff_SJC)>=10000,
  !chr %in% c('chrX', 'chrY')
) %>% arrange(desc(abs(IncLevelDifference)))

@EricKutschera
Copy link
Contributor

IncLevelDifference is average(IncLevel1) - average(IncLevel2). You can get the max and min PSI values directly from IncLevel1 and IncLevel2. Here is some code based on the code you posted. I'm not sure my code is super efficient, but it seems to work

numeric_from_comma_string <- function(comma_string) {
    return(as.numeric(unlist(strsplit(comma_string, split=','))))
}

min_psi_from_row <- function(row) {
    sample_1_psi_values <- numeric_from_comma_string(row["IncLevel1"])
    sample_2_psi_values <- numeric_from_comma_string(row["IncLevel2"])
    min_sample_1 <- min(sample_1_psi_values)
    min_sample_2 <- min(sample_2_psi_values)
    return(min(min_sample_1, min_sample_2))
}

max_psi_from_row <- function(row) {
    sample_1_psi_values <- numeric_from_comma_string(row["IncLevel1"])
    sample_2_psi_values <- numeric_from_comma_string(row["IncLevel2"])
    max_sample_1 <- max(sample_1_psi_values)
    max_sample_2 <- max(sample_2_psi_values)
    return(max(max_sample_1, max_sample_2))
}

x$min_psi <- apply(x, 1, min_psi_from_row)
x$max_psi <- apply(x, 1, max_psi_from_row)
x$psi_diff <- x$max_psi - x$min_psi

@huangwb8
Copy link
Author

huangwb8 commented Mar 5, 2022

@EricKutschera Your code could run~Thanks!

I finally found the root reason of this annoying situation. I'm new of IGV and I use it in a wrong way to my destination.

Here I give an example:

Before Group autoscale, IGV scales based on one track itself and there is no difference between groups:

before_group_scale

After Group autoscale, IGV scales based on all tracks and the difference is obvious and expected:

after_group_scale

I think the latter is what I want. Exactly, it's still quite difficult to find out appearant events like :

test_file

Maybe there is no such obvious events between groups. Or I should try more strick conditons of the rMATs results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants