Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix critical issue with mergeFeatures #478

Merged
merged 6 commits into from
May 23, 2024

Conversation

Daenarys8
Copy link
Collaborator

Ping #419 @antagomir @TuomasBorman

  • The default value of rank in mergeFeaturesByRank should be NULL
  • Add na.rm parameter to mergeFeatures
  • Change the default to onRankOnly=TRUE in mergeFeaturesByRank
  • Respect the rowData order

TODO

  • check other functions if they have similar sorting issues
  • Add fix on reviews from this PR

@antagomir
Copy link
Member

up

@antagomir
Copy link
Member

up - @Daenarys8

R/agglomerate.R Outdated Show resolved Hide resolved
R/merge.R Outdated Show resolved Hide resolved
@TuomasBorman
Copy link
Contributor

This does not yet take into account the row order. In mergeRows, the order of data is in alphabetical order. In agglomerateByRanks the order follows the order of rowData. As discussed, the order based on rowData might be the best, but the implementation is harder without big advantages.

So I suggest that we modify agglomerateByRank so that it outputs the data in alphabetical order. I think that is the most straightforward and requires only one additional line in the end of the function.

Copy link
Contributor

@TuomasBorman TuomasBorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments

@Daenarys8
Copy link
Collaborator Author

alphabetical order

by single line, are we referring to?
x <- sort(x)

@Daenarys8 Daenarys8 force-pushed the criticalmerge branch 6 times, most recently from d205d74 to 297aa81 Compare May 6, 2024 16:20
R/agglomerate.R Outdated Show resolved Hide resolved
@TuomasBorman
Copy link
Contributor

To recap:

1

Behavior of functions that are using agglomerateByRank internally is unexpected. Why? Because if user does not provide rank, the data is agglomerated to the highest rank by default. However, that should not be the case. For example, agglomerateByPrevalence should not do rank-agglomeration by default.

I think our approach is now incorrect. Instead of changing the default value of rank in agglomerateByRank to NULL, we should modify agglomerateByPrevalence.

-->

In agglomererateByPrevalence, catch rank parameter from ..., If rank is specified, call agglomerateByRank.

mergeRows and mergeCols drop off those instances that have NA values in grouping variable. That is because grouping vector is converted to factor. To include instances with NA values as "NA group", NA values could be converted to "NA". -->

this behavior should be controlled somehow, na.rm parameter could be perfect for that, but we should check if na.rm is used somewhere else in the function (fed with ... to some other function)

The default value of onRankOnly in agglomerateByRank should be TRUE by default.

In the end of the agglomerateByRank function, order the data alphabetically based on rownames.

@TuomasBorman
Copy link
Contributor

Sorry for hassle. I think most of the points are not yet implemented.

R/merge.R Outdated Show resolved Hide resolved
R/merge.R Outdated Show resolved Hide resolved
@Daenarys8 Daenarys8 force-pushed the criticalmerge branch 3 times, most recently from 789082f to c3f267c Compare May 8, 2024 16:57
@Daenarys8 Daenarys8 requested a review from TuomasBorman May 8, 2024 17:34
R/estimateDiversity.R Outdated Show resolved Hide resolved
R/merge.R Outdated Show resolved Hide resolved
R/merge.R Outdated Show resolved Hide resolved
R/merge.R Outdated Show resolved Hide resolved
R/merge.R Outdated Show resolved Hide resolved
R/merge.R Outdated Show resolved Hide resolved
R/merge.R Outdated Show resolved Hide resolved
R/merge.R Outdated Show resolved Hide resolved
R/merge.R Outdated Show resolved Hide resolved
@Daenarys8 Daenarys8 force-pushed the criticalmerge branch 2 times, most recently from c9cc064 to faa51b2 Compare May 21, 2024 02:53
@TuomasBorman
Copy link
Contributor

Can you run examples of Leo that initiated this #419 and print the output to here?

@TuomasBorman
Copy link
Contributor

Also bump the versions and add NEWS

@Daenarys8
Copy link
Collaborator Author

To confirm fix, we will recall the example @antagomir used to raise the issue #419 .
Let's first prepare example data.

library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, assay.type="counts", method="relabundance")

Here we merge features by prevalence, and return the result at the family level.

nrow(mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
[1] 21

Here we merge features first by Family level grouping, then by prevalence.

altExp(tse, "Family") <- mergeFeaturesByRank(tse, rank="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family"), assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
[1] 21

Same happens when we treat Family as a group, rather thank rank:

altExp(tse, "Family2") <- mergeFeatures(tse, f="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family2"), assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
[1] 21

Finally, checking the prevalences manually yields the same numbers.

sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
[1] 20
sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
[1] 20

@Daenarys8 Daenarys8 force-pushed the criticalmerge branch 2 times, most recently from 996dbc7 to 0493a7f Compare May 21, 2024 06:57
@antagomir
Copy link
Member

Thanks!

The last two examples yield a different number (20) - shouldn't it be the same?

Can we have these in unit tests, to make sure it will remain stable also in future releases?

@TuomasBorman
Copy link
Contributor

I believe mergeFeaturesByPrevalence() creates a group called others. "others" have all the features under the thresholds. This might explain the behavior.

Can you @Daenarys8 check and create unit tests. They should explain this behavior. For instance if you compare these, you could add comment "other group was removed. It is added by agglomerateByPrevalence function to collect features under threshold." or something like that so that we know in the future what is happening and that it is desired behavior

@antagomir
Copy link
Member

The code should also show how altExp(tse, "Family") was created. There are different options (onRankOnly etc).

@Daenarys8
Copy link
Collaborator Author

   data(GlobalPatterns, package="mia")
    tse <- GlobalPatterns
    tse <- transformAssay(tse, assay.type="counts", method="relabundance")

Other group not present

    altExp(tse, "Family") <- agglomerateByRank(tse, rank="Family")
    altExp(tse, "Family1") <- agglomerateByRank(tse, rank="Family", onRankOnly = TRUE)
    altExp(tse, "Family2") <- agglomerateByRank(tse, rank="Family", onRankOnly = FALSE)
    altExp(tse, "Family3") <- agglomerateByRank(tse, rank="Family", onRankOnly = TRUE, na.rm = TRUE)
    altExp(tse, "Family4") <- agglomerateByRank(tse, rank="Family", onRankOnly = TRUE, na.rm = FALSE)
    altExp(tse, "Family5") <- agglomerateByVariable(tse, f="Family", MARGIN = 'row')

In the following, other group is added by agglomerateByPrevalence function to collect features under threshold

    actual <- agglomerateByPrevalence(tse, rank="Family", assay.type="relabundance", 
                                         detection  = 0.5/100, prevalence  = 20/100)
    actual0 <- agglomerateByPrevalence(altExp(tse, "Family"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)
    actual1 <- agglomerateByPrevalence(altExp(tse, "Family1"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)
    actual2 <- agglomerateByPrevalence(altExp(tse, "Family2"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)
    actual3 <- agglomerateByPrevalence(altExp(tse, "Family3"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)
    actual4 <- agglomerateByPrevalence(altExp(tse, "Family4"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)
    actual5 <- agglomerateByPrevalence(altExp(tse, "Family5"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)
> nrow(actual)
[1] 21
> nrow(actual1)
[1] 21
> nrow(actual2)
[1] 27

actual2 create groups based on the full taxonomic hierarchy up to family level while the previous creates the factor group with groups at the family level.

> nrow(actual3)
[1] 20
> nrow(actual4)
[1] 21
> nrow(actual5)
[1] 21
> 
> sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
[1] 20
> sum(rowMeans(assay(altExp(tse, "Family1"), "relabundance") > 0.5/100) > 0.2)
[1] 20
> sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
[1] 26
> sum(rowMeans(assay(altExp(tse, "Family3"), "relabundance") > 0.5/100) > 0.2)
[1] 19
> sum(rowMeans(assay(altExp(tse, "Family4"), "relabundance") > 0.5/100) > 0.2)
[1] 20
> sum(rowMeans(assay(altExp(tse, "Family5"), "relabundance") > 0.5/100) > 0.2)
[1] 20
> 

@Daenarys8 Daenarys8 force-pushed the criticalmerge branch 2 times, most recently from e5b3461 to abc285c Compare May 22, 2024 10:16
Daenarys8 added 5 commits May 23, 2024 06:39
Signed-off-by: Daena Rys <[email protected]>
Signed-off-by: Daena Rys <[email protected]>
Signed-off-by: Daena Rys <[email protected]>
Signed-off-by: Daena Rys <[email protected]>
Signed-off-by: Daena Rys <[email protected]>
@TuomasBorman TuomasBorman merged commit b627edc into microbiome:devel May 23, 2024
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants