
Add binning capability to AUPRO #1145

Merged

Conversation

Yann-CV
Contributor

@Yann-CV Yann-CV commented Jun 20, 2023

Description

Metric computation can often be made more memory efficient and faster by using predefined thresholds provided as input. This binned computation is a bit less accurate, but it can be really useful when evaluating very big datasets.

The current implementation of AUPRO does not allow binned computation. Here is a proposal, assuming that the target can only contain {0, 1}:

  • Add an optional thresholds attribute.
  • Set the task argument of the ROC call used in compute_pro to "binary" (otherwise thresholds is not usable) and pass the thresholds as an argument (a minimal sketch follows this list).
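
For illustration, a minimal sketch of what the binned ROC call could look like (the helper name and wiring are assumptions for this description, not the exact anomalib code):

import torch
from torchmetrics.functional.classification import binary_roc

def binned_roc(preds: torch.Tensor, target: torch.Tensor, thresholds=None):
    """Sketch: ROC for a {0, 1} target; thresholds=None keeps the exact computation."""
    # thresholds can be an int (number of bins), a 1-D tensor of values, or None
    fpr, tpr, thr = binary_roc(
        preds=preds.flatten(),
        target=target.flatten(),
        thresholds=thresholds,
    )
    return fpr, tpr, thr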

Note:

  • Unlike TorchMetrics' ROC or PrecisionRecallCurve, the predictions in AUPRO are not scaled between 0 and 1 when at least one prediction is greater than 1. This can create misunderstanding/misuse when the user is used to providing thresholds between 0 and 1 to match the TorchMetrics strategy. In my opinion, this should be updated to be consistent with TorchMetrics.

  • I may be wrong, but it seems that the anomalib AUPR and AUROC cannot take advantage of the binned version of TorchMetrics because of the backward-compatibility design in TorchMetrics. Indeed, when we specify the task type, a new object is created which is no longer the anomalib object inheriting from the TorchMetrics parent class. In my opinion, adding binning capability to both of these metrics would also require a separate pull request. On my side, I moved toward plain torchmetrics calls!

Changes

  • Bug fix (non-breaking change which fixes an issue)
  • Refactor (non-breaking change which refactors the code base)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist

  • My code follows the pre-commit style and check guidelines of this project.
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing tests pass locally with my changes
  • I have added a summary of my changes to the CHANGELOG (not for minor changes, docs and tests).

@samet-akcay
Contributor

@jpcbertoldo, can you review this PR please?

Contributor

@jpcbertoldo jpcbertoldo left a comment

Nice proposal, I think this is very useful : )

Although, I'm not sure I understood this:

Unlike TorchMetrics' ROC or PrecisionRecallCurve, the predictions in AUPRO are not scaled between 0 and 1 when at least one prediction is greater than 1. This can create misunderstanding/misuse when the user is used to providing thresholds between 0 and 1 to match the TorchMetrics strategy. In my opinion, this should be updated to be consistent with TorchMetrics.

What I understood looking at torchmetrics' code is that if any value in preds is > 1 or < 0, then preds = preds.sigmoid() is applied, because preds is assumed to be a logit, which is not the case for all AD models. However, thresholds is not transformed.

So if we don't do the normalization to 0-1 on our side, the new argument is only useful for cases where the model (indeed) gives logits or probabilities, and the user can only provide probabilities. Distance-based models (for example) wouldn't gain any advantage -- how could you set a meaningful range of distance thresholds when the distances themselves get transformed by .sigmoid()?

Perhaps it'd be nice to cover this? Or at least make this clear in the doc.
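
A small check of this behavior (assuming my reading of the torchmetrics code above is correct; not part of the PR):

import torch
from torchmetrics.functional.classification import binary_roc

target = torch.randint(0, 2, (1000,))
distances = torch.rand(1000) * 50.0  # distance-like scores, far outside [0, 1]

# Because some values are > 1, binary_roc treats `distances` as logits and applies
# sigmoid internally, while the user-provided thresholds are left untouched -- so
# thresholds expressed in the original distance scale (0..50) would be meaningless here.
fpr, tpr, thr = binary_roc(distances, target, thresholds=torch.linspace(0, 1, 11))
print(thr)  # stays in [0, 1]; compared against sigmoid(distances), not the raw distances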

@djdameln
Contributor

djdameln commented Jun 27, 2023

Unlike TorchMetrics' ROC or PrecisionRecallCurve, the predictions in AUPRO are not scaled between 0 and 1 when at least one prediction is greater than 1. This can create misunderstanding/misuse when the user is used to providing thresholds between 0 and 1 to match the TorchMetrics strategy. In my opinion, this should be updated to be consistent with TorchMetrics.

This is a tricky topic. Many anomaly models produce distance-based scores which are not confined to a [0, 1] range. Enforcing these predictions to fall within this range by using a sigmoid leads to several problems such as a reduced resolution in the threshold computation (due to the squeezing effect of the sigmoid). In the more extreme case we might actually not be able to obtain a meaningful metric computation due to numerical limitations. Because of this, we asked the TorchMetrics developers to make the sigmoid operation optional in the PrecisionRecallCurve metric. A more detailed explanation of the problems related to the sigmoid mapping can be found in this issue.

So, since most anomaly models use unbounded, distance-based anomaly scores, it is not a good idea to apply the sigmoid function in our performance metrics. At the same time, this makes it difficult to apply your proposed binned AUPRO metric in practice. We don't know the upper and lower bounds of the model predictions beforehand, which makes it challenging to choose an adequate range of thresholds when instantiating the metric class.

A possibility could be to infer the range of threshold values from the stored predictions at the beginning of the compute sequence. In this case we could ask the user to specify the number of bins instead of the actual threshold values.

Contributor

@djdameln djdameln left a comment

Thanks for your PR! I have some concerns about the usability of this feature, given that anomaly models generally do not generate predictions in the 0-1 range. See my other comment for more details.

@jpcbertoldo
Contributor

A possibility could be to infer the range of threshold values from the stored predictions at the beginning of the compute sequence. In this case we could ask the user to specify the number of bins instead of the actual threshold values.

I think that a min-max normalization on anomalib's side (so that torchmetrics' .sigmoid() is bypassed) could be a simple and already useful solution for many use cases.

Perhaps a warning for extreme cases (e.g. very far away outliers) would be necessary, but the warning logic may become a point of complication.

If the goal is to make it binnable, I think we should make sure to apply the same normalization to the thresholds (when the values are given), because otherwise the user has to provide thresholds in the normalized range. Or we could just enable the option with thresholds: int | None.

@github-actions github-actions bot added the Docs label Jun 28, 2023
@ORippler
Contributor

I think that a min-max normalization on anomalib's side (so that torchmetrics' .sigmoid() is bypassed) could be a simple and already useful solution for many use cases.

If we have defective data available we should be able to do min-max normalization. The issue is just that for very big datasets, you no longer have the capacity to store everything in RAM, so we would potentially need to do two passes over the data (or apply some parametric logit function/some safety margins around the scales).
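
For what it's worth, a sketch of how a single pass could work by tracking a running min/max in update() (purely illustrative, not something this PR implements):

import torch
from torchmetrics import Metric

class ScoreRange(Metric):
    """Illustrative only: accumulate the observed score range across batches."""

    def __init__(self, num_bins: int = 100) -> None:
        super().__init__()
        self.num_bins = num_bins
        self.add_state("score_min", default=torch.tensor(float("inf")), dist_reduce_fx="min")
        self.add_state("score_max", default=torch.tensor(float("-inf")), dist_reduce_fx="max")

    def update(self, preds: torch.Tensor) -> None:
        # keep only two scalars per device instead of storing every prediction
        self.score_min = torch.minimum(self.score_min, preds.min())
        self.score_max = torch.maximum(self.score_max, preds.max())

    def compute(self) -> torch.Tensor:
        # evenly spaced thresholds over the observed range
        return torch.linspace(self.score_min.item(), self.score_max.item(), self.num_bins)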

@Yann-CV
Contributor Author

Yann-CV commented Jun 29, 2023

If we have defective data available we should be able to do min-max normalization. The issue is just that for very big datasets, you no longer have the capacity to store everything in RAM, so we would potentially need to do two passes over the data (or apply some parametric logit function/some safety margins around the scales).

Unlike torchmetrics' PrecisionRecallCurve, for instance, the update of this AUPRO metric only appends to self.target and self.preds instead of updating a running state, so I guess the RAM issue will happen anyway.

That being said, I would say that a mix of @djdameln's and @jpcbertoldo's proposals could work:

  • The input to set up threshold binning is only an optional integer, thresholds_count: int | None (by the way, my current implementation is wrong in allowing an integer or a list without creating the tensor).
  • We add a compute_threshold function which is called during compute and can apply min-max normalization (see the sketch after the snippet below).
  • We add documentation explaining the difference from TorchMetrics.
thresholds = compute_threshold(self.target, self.preds, self.thresholds_count)
fpr, tpr = self._compute(thresholds)
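
An illustrative body for compute_threshold could look like this (just a sketch of the idea, not the merged implementation):

from __future__ import annotations

import torch

def compute_threshold(target: torch.Tensor, preds: torch.Tensor, thresholds_count: int | None) -> torch.Tensor | None:
    """Sketch: build a binned threshold grid from the stored predictions."""
    # target is part of the proposed signature but unused in this simple sketch
    if thresholds_count is None:
        return None  # fall back to the exact, unbinned computation
    # min-max over the stored predictions defines the binned range
    return torch.linspace(preds.min().item(), preds.max().item(), thresholds_count)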

@djdameln @jpcbertoldo Does this solution look good to you?

@Yann-CV
Contributor Author

Yann-CV commented Jun 29, 2023

Thanks @djdameln for the explanation.

I guess disabling the formatting (the sigmoid mapping) will indeed be the solution and will make it easier and more consistent to move the anomalib AUPR and AUROC to the new TorchMetrics versions!

@jpcbertoldo
Contributor

If we have defective data available we should be able to do min-max normalization. The issue is just that for very big datasets, you no longer have the capacity to store everything in RAM, so we would potentially need to do two passes over the data (or apply some parametric logit function/some safety margins around the scales).

Not sure I understood your point @ORippler, but if the only option for the thresholds parameter is an int, then indeed there must be two passes (is that what you're saying?). I think min-max is a simpler solution than a parametric logit (is there another advantage?).

@djdameln @jpcbertoldo Does this solution look good to you?

Thanks for the changes @Yann-CV . I think this solution is fine.

However, as you said, this won't change the memory consumption, because in either case (thresholds: int and thresholds: None) update() just stores the targets/preds.
So, the whole point is to speed up the computation, right?
I wonder if the slowness of AUPRO actually comes from the CCA (if that's the case then the thresholds wouldn't help).
Do you know / have you tested that?

Either way, I think adding this option is valid, but we should aim to properly take advantage of the thresholds: int use case to also reduce the RAM consumption (like PrecisionRecallCurve in torchmetrics does).
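
To make the idea concrete, here is a sketch of how a binned update() could accumulate per-threshold counts instead of storing every prediction, similarly to torchmetrics' binned PrecisionRecallCurve (the names are illustrative, and this ignores the connected-component bookkeeping that AUPRO additionally needs):

import torch

def binned_update(preds: torch.Tensor, target: torch.Tensor,
                  thresholds: torch.Tensor, tps: torch.Tensor, fps: torch.Tensor) -> None:
    """Accumulate true/false positive counts per threshold for one batch (in place)."""
    preds = preds.flatten()
    target = target.flatten().bool()
    above = preds.unsqueeze(0) >= thresholds.unsqueeze(1)  # (num_thresholds, num_pixels)
    tps += (above & target).sum(dim=1)
    fps += (above & ~target).sum(dim=1)

With this, compute() only needs the per-threshold counters to build the curve, so memory no longer grows with the dataset size.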

Contributor

@djdameln djdameln left a comment

@Yann-CV I'm happy with this new approach. Thanks again for your efforts!

Contributor

@jpcbertoldo jpcbertoldo left a comment

nice job @Yann-CV : )

@samet-akcay
Contributor

@Yann-CV, code quality checks are failing. Can you please run one of the two approaches, whichever you prefer?

  1. Option 1: pre-commit run --all-files
  2. Option 2: tox -e pre-commit

@Yann-CV
Contributor Author

Yann-CV commented Jul 7, 2023

So, the whole point is to speed up the computation, right?

The advantage of the binning is to keep a constant size for the output and to reduce the number of operations during the metric computation. This makes it more suitable for big datasets.

Unfortunately and frustratingly, with the current implementation it seems to slow down the computation.

If you want to see for yourself, I wrote this script:

import time
from contextlib import contextmanager

import numpy as np
import torch
from torchmetrics.functional.classification import binary_roc

DEVICE = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

SHAPE = (1, 300, 300)

target = np.zeros(SHAPE, dtype=int)
target[:, 50:70, 50:70] = 1
target = torch.from_numpy(target).to(DEVICE, non_blocking=True)

predictions = np.random.uniform(0, 1, size=SHAPE)
predictions = torch.from_numpy(predictions).to(DEVICE, non_blocking=True)

@contextmanager
def observe_execution_time(description):
    start = time.monotonic()
    yield
    delta = time.monotonic() - start
    print(f"{description}: took {delta} seconds")


def tensor_in_memory(tensor):
    return tensor.nelement() * tensor.element_size()


def result_memory_print(fpr, tpr, thresholds):
    print(
        f"fpr: {tensor_in_memory(fpr)} - "
        f"tpr: {tensor_in_memory(tpr)} - "
        f"thresholds: {tensor_in_memory(thresholds)}"
    )


def compute_binary_roc(num_thresholds=None):
    fpr, tpr, thresholds = binary_roc(
        target=target,
        preds=predictions,
        thresholds=num_thresholds,
    )

    return fpr, tpr, thresholds


print(f"\n\nCompute with shape: {SHAPE}")

with observe_execution_time("not binned"):
    fpr, tpr, thresholds = compute_binary_roc()
result_memory_print(fpr, tpr, thresholds)

with observe_execution_time("binned"):
    fpr, tpr, thresholds = compute_binary_roc(num_thresholds=100)
result_memory_print(fpr, tpr, thresholds)


SHAPE = (10, 300, 300)

target = np.zeros(SHAPE, dtype=int)
target[:, 50:70, 50:70] = 1
target = torch.from_numpy(target).to(DEVICE, non_blocking=True)

predictions = np.random.uniform(0, 1, size=SHAPE)
predictions = torch.from_numpy(predictions).to(DEVICE, non_blocking=True)

print(f"\n\nCompute with shape: {SHAPE}")

with observe_execution_time("not binned"):
    fpr, tpr, thresholds = compute_binary_roc()
result_memory_print(fpr, tpr, thresholds)

with observe_execution_time("binned"):
    fpr, tpr, thresholds = compute_binary_roc(num_thresholds=100)
result_memory_print(fpr, tpr, thresholds)

And the output on my computer is:

Compute with shape: (1, 300, 300)
not binned: took 0.0025854739997157594 seconds
fpr: 360004 - tpr: 360004 - thresholds: 720008
binned: took 0.006139285000244854 seconds
fpr: 400 - tpr: 400 - thresholds: 400


Compute with shape: (10, 300, 300)
not binned: took 0.008684517000801861 seconds
fpr: 3600004 - tpr: 3600004 - thresholds: 7200008
binned: took 0.05397904500023287 seconds
fpr: 400 - tpr: 400 - thresholds: 400

@Yann-CV
Contributor Author

Yann-CV commented Jul 7, 2023

I wonder if the slowness of AUPRO actually comes from the CCA (if that's the case then the thresholds wouldn't help).
Do you know / have you tested that?

@jpcbertoldo I made a small test and obtained these results:

  • without binning: the CCA is the most expensive step, but in the same order of magnitude as the PRO computation
  • with binning: the PRO computation becomes the most expensive one (~10x)
################# aupro binned #################
cca: took 0.4697692850004387 seconds
pro: took 3.937452400999973 seconds
aupro: took 0.00041078799949900713 seconds
binned: took 4.408177850998982 seconds
################# aupro not binned #################
cca: took 0.4510504099998798 seconds
pro: took 0.24317637300009665 seconds
aupro: took 0.0012773449998348951 seconds
not-binned: took 0.6958046370000375 seconds

@Yann-CV
Contributor Author

Yann-CV commented Jul 7, 2023

@djdameln @jpcbertoldo thanks a lot for the review ;-)

Comment on lines +126 to +132
fpr: Tensor = binary_roc(
    preds=preds,
    target=target,
    thresholds=thresholds,
)[0]  # only need fpr
Contributor

Can you use the roc function here instead? The reason is that binary_roc maps the predictions to the [0, 1] range using sigmoid, which is exactly what we're trying to avoid. We asked the TorchMetrics developers to make the sigmoid mapping optional, but until then it would be better to use the legacy roc function, which does not remap the predictions.

Contributor

@djdameln in fact roc() will just call binary_roc() (the former is just a wrapper that dispatches on the task: str argument):

https://github.com/Lightning-AI/torchmetrics/blob/2a055f5594a624685e26ba64bf20ab0d12225c86/src/torchmetrics/functional/classification/roc.py#L595

    if task is not None:
        if task == "binary":
            return binary_roc(preds, target, thresholds, ignore_index, validate_args)
        if task == "multiclass":
            assert isinstance(num_classes, int)
            return multiclass_roc(preds, target, num_classes, thresholds, ignore_index, validate_args)
        if task == "multilabel":
            assert isinstance(num_labels, int)
            return multilabel_roc(preds, target, num_labels, thresholds, ignore_index, validate_args)
        raise ValueError(
            f"Expected argument `task` to either be `'binary'`, `'multiclass'` or `'multilabel'` but got {task}"
        )
    else:
        rank_zero_warn(
            "From v0.10 an `'binary_*'`, `'multiclass_*'`, `'multilabel_*'` version now exist of each classification"
            " metric. Moving forward we recommend using these versions. This base metric will still work as it did"
            " prior to v0.10 until v0.11. From v0.11 the `task` argument introduced in this metric will be required"
            " and the general order of arguments may change, such that this metric will just function as an single"
            " entrypoint to calling the three specialized versions.",
            DeprecationWarning,
        )
    preds, target, num_classes, pos_label = _roc_update(preds, target, num_classes, pos_label)
    return _roc_compute(preds, target, num_classes, pos_label, sample_weights)

Contributor Author

Indeed, the roc call with thresholds:

  • requires the task to be defined, otherwise the thresholds are just silently ignored.
  • automatically calls binary_roc in the end.

The roc function will soon be deprecated, so I would suggest keeping it like this.

Also @djdameln we added a comment just above to mention this issue.

@Yann-CV
Contributor Author

Yann-CV commented Jul 24, 2023

@jpcbertoldo @djdameln is anything else needed from me to merge this pull request?

@jpcbertoldo
Contributor

@jpcbertoldo @djdameln is anything else needed from me to merge this pull request?

Sorry for the late reply.

I don't see anything else.

@djdameln?

@samet-akcay samet-akcay removed the Docs label Aug 4, 2023
@Yann-CV
Contributor Author

Yann-CV commented Aug 16, 2023

@djdameln any last opinion here?

@github-actions github-actions bot added the Docs label Aug 16, 2023
@Yann-CV
Contributor Author

Yann-CV commented Aug 21, 2023

@samet-akcay the code quality check failed here after the merge of main:
An unexpected error has occurred: CalledProcessError: command: ('/runner/_work/anomalib/anomalib/.tox/pre-commit/bin/python', '-mnodeenv', '--prebuilt', '--clean-src', '/home/runner/.cache/pre-commit/repo17zd1x1h/node_env-system', '-n', 'system')

Let me know if I can help on this ;-)

@samet-akcay
Contributor

@Yann-CV, no worries, this PR has been outstanding for too long. We'll sort out the rest! Thanks a lot for your contribution and patience!

@samet-akcay samet-akcay merged commit 7feee1e into openvinotoolkit:main Aug 21, 2023
@Yann-CV Yann-CV deleted the aupro/binning-from-thresholds branch August 22, 2023 07:09
@ashwinvaidya17 ashwinvaidya17 mentioned this pull request Aug 28, 2023