
Introduction of ATOL in finite vectorisation method #348

Merged: 20 commits merged into GUDHI:master on Jul 23, 2020

Conversation

martinroyer (Collaborator)

No description provided.

@mglisse (Member) left a comment:

Thanks for the PR!
Could you include a minimal (or more extensive 😉) test, so we have a chance to notice early if we accidentally break the code completely at some point?

@martinroyer (Collaborator, Author) left a comment:

I added a minimal working example in the class documentation.

Is there a test file for representation methods that I cannot find?

@martinroyer requested a review from mglisse on June 9, 2020 09:21
@mglisse (Member) commented Jun 9, 2020:

Is there a test file for representation methods that I cannot find?

Currently there is only src/python/test/test_representations.py. You can add to it or create a separate file as you like.

For your example in the documentation, if you want, you can use syntax from https://www.sphinx-doc.org/en/master/usage/extensions/doctest.html so that your example code is actually run while building the documentation, and the output is checked against what you have written. For instance, src/python/doc/wasserstein_distance_user.rst uses testcode/testoutput.
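(For reference, a minimal sketch of what such a testcode/testoutput block can look like; the example content here is illustrative and not taken from this PR:)

.. testcode::

    import numpy as np
    print(np.arange(3).sum())

.. testoutput::

    3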

@martinroyer (Collaborator, Author) commented:

For your example in the documentation, if you want, you can use syntax from https://www.sphinx-doc.org/en/master/usage/extensions/doctest.html so that your example code is actually run while building the documentation, and the output is checked against what you have written. For instance, src/python/doc/wasserstein_distance_user.rst uses testcode/testoutput.

The current format already allows for that, i.e. the test I wrote in the doc is already being run during the build (hence my problem above this comment).

@mglisse (Member) commented Jun 9, 2020:

The current format already allows for that, i.e. the test I wrote in the doc is already being run during the build

Oops, I missed that. Nice.

(hence my problem above this comment).

Er, which problem?

@mglisse (Member) left a comment:

I notice you seem to be using the web interface to modify the code. Do you find it convenient? I usually find it easier to commit and push to my branch without messing with a browser.

@mglisse (Member) commented Jun 9, 2020:

Your name should probably appear in a more visible place than just one comment, maybe Vincent will have suggestions about that.

@martinroyer (Collaborator, Author) commented:

Er, which problem?

A problem with randomness that is not fixed by fixing the seed (the seed I used at home for KMeans led to a different center order than the one on the server).
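(As an aside, and not something done in this PR: one way to make such a check insensitive to the order in which KMeans returns its centers is to compare sorted copies of the rows. A minimal sketch, with purely hypothetical center values:)

import numpy as np

# hypothetical centers returned in two different orders by two sklearn versions
centers_v1 = np.array([[2.0, 0.5], [1.0, 3.0]])
centers_v2 = np.array([[1.0, 3.0], [2.0, 0.5]])

def sort_rows(a):
    # order rows lexicographically so the comparison ignores cluster ordering
    return a[np.lexsort(a.T[::-1])]

assert np.allclose(sort_rows(centers_v1), sort_rows(centers_v2))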

"""
if sample_weight is None:
sample_weight = [self.weighting_method(measure) for measure in X]
return np.stack([self(measure, sample_weight=weight) for measure, weight in zip(X, sample_weight)])
@martinroyer (Collaborator, Author) commented:

ATOL does not necessarily apply to diagrams; it can be fed simpler objects that have near-zero production cost.
You're right about parallelizing the transform: our research shows ATOL can be very useful even after a rough calibration, so it is very likely that there are situations where fit will necessarily be quick and dirty, but there is then a lot of transforming to be done.
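(As an aside, not part of this PR: a rough sketch of how the transform could be parallelised over measures, assuming joblib is available, that atol_vectoriser is an already-fitted Atol instance, and that X is a list of measures:)

import numpy as np
from joblib import Parallel, delayed

# atol_vectoriser and X are assumed to exist already (see the documentation example);
# vectorise each measure in a separate worker, then stack into one embedding matrix
vectors = Parallel(n_jobs=-1)(delayed(atol_vectoriser)(measure) for measure in X)
embedding = np.stack(vectors)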

self.centers = self.quantiser.cluster_centers_
dist_centers = pairwise.pairwise_distances(self.centers)
np.fill_diagonal(dist_centers, np.inf)
self.inertias = np.min(dist_centers, axis=0)/2
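# (Editorial note, not part of the PR diff:) dist_centers[i, j] is the pairwise distance
# between centers i and j; the diagonal is set to +inf so that a center is not counted as
# its own nearest neighbour, and each inertia is half the distance to the nearest other
# center. With a single center, the minimum is taken over a column containing only +inf.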
@mglisse (Member) commented:

I guess this isn't meant to be used with a single center.

@martinroyer (Collaborator, Author) commented:

Actually there is no reason why it should not be, and it is interesting.

Comment on lines 675 to 677
dist_centers = pairwise.pairwise_distances(self.centers)
np.fill_diagonal(dist_centers, np.inf)
self.inertias = np.min(dist_centers, axis=0)/2
@martinroyer (Collaborator, Author) commented Jun 11, 2020:

Suggested change (original):

dist_centers = pairwise.pairwise_distances(self.centers)
np.fill_diagonal(dist_centers, np.inf)
self.inertias = np.min(dist_centers, axis=0)/2

Suggested change (replacement):

if self.quantiser.n_clusters == 1:
    dist_centers = pairwise.pairwise_distances(measures_concat)
    np.fill_diagonal(dist_centers, 0)
    self.inertias = np.max(dist_centers)/2
else:
    dist_centers = pairwise.pairwise_distances(self.centers)
    np.fill_diagonal(dist_centers, np.inf)
    self.inertias = np.min(dist_centers, axis=0)/2

Comment on lines 675 to 683
if self.quantiser.n_clusters == 1:
    dist_centers = pairwise.pairwise_distances(measures_concat)
    np.fill_diagonal(dist_centers, 0)
    self.inertias = np.max(dist_centers)/2
else:
    dist_centers = pairwise.pairwise_distances(self.centers)
    np.fill_diagonal(dist_centers, np.inf)
    self.inertias = np.min(dist_centers, axis=0)/2
return self
@martinroyer (Collaborator, Author) commented:

It can happen that two quantiser centers are identical, which leads to 0s in self.inertias; that is a problem I could address here...?

@mglisse (Member) commented:

That sounds like a good idea. Are they exactly identical, or can they be very close but slightly apart? Can there be an arbitrary number of identical centers? Do you already know how you want to handle this problem?

@martinroyer (Collaborator, Author) commented:

It happens that they can be very close but slightly apart, or that they are exactly identical. And there can be an arbitrary number of identical centers as far as I can tell. I am unsure as to how to handle this.

Options I can think of include:
1- removing identical centers: this actually sounds rather solid, but we lose control of the number of centers and it does not feel natural;
2- recasting the resulting inertias that are too close to zero, since those are what trigger errors when computing the contrast functions over measures. I actually like the simplicity of this: in a way it says that the quantised mean measure is supported on points that can have multiplicity higher than 1 (a rough sketch of this option follows after the list);
3- ensuring, ahead of quantising, that the calibrating measure points are sufficiently distinct (using some sort of tree-trimming), so that the space quantising picks sufficiently different cluster centers; even though this could work well in practice, it is rather shaky, as it assumes the quantiser will not mess up and adds an extra, arbitrary trimming step that could obfuscate the signal in the end.
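(A rough illustration of option 2- only, not code from this PR; the floor value eps is an arbitrary choice:)

import numpy as np

inertias = np.array([0.0, 0.0, 0.7, 0.5])  # hypothetical inertias with duplicated centers
eps = 1e-8                                 # arbitrary numerical floor
positive = inertias[inertias > eps]
# recast near-zero inertias to the smallest strictly positive one (or to eps if none exists)
floor = positive.min() if positive.size else eps
inertias = np.where(inertias > eps, inertias, floor)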

@mglisse (Member) commented:

I don't understand what 2- means, but anyway, I am also ok with merging the current version and improving it once you have settled on a way to handle it, if the problem doesn't occur too often.

@martinroyer (Collaborator, Author) commented:

Yes, it is rare and should be even rarer now; I'm ok with merging too.

@mglisse (Member) commented:

Let's wait and see if Vincent has any comments, he can handle the merging then.

@martinroyer (Collaborator, Author) commented:

I have a hard time understanding how this PR breaks those tests...

@mglisse (Member) commented Jun 19, 2020:

I have a hard time understanding how this PR breaks those tests...

The failure was due to a change in the continuous integration we use; Vincent fixed it, so it should work just fine on your next push.

@VincentRouvreau (Contributor) left a comment:

In src/python/test/test_representations.py (or a new test_atol.py), I would test your example:

import numpy as np
from sklearn.cluster import KMeans
from gudhi.representations.vector_methods import Atol

def test_atol():
    a = np.array([[1, 2, 4], [1, 4, 0], [1, 0, 4]])
    b = np.array([[4, 2, 0], [4, 4, 0], [4, 0, 2]])
    c = np.array([[3, 2, -1], [1, 2, -1]])

    atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006))
    assert np.allclose(atol_vectoriser.fit(X=[a, b, c]).centers,
                       np.array([[2., 0.66666667, 3.33333333], [2.6, 2.8, -0.4]]))
    assert np.allclose(atol_vectoriser(a),
                       np.array([1.18168665, 0.42375966]))
    assert np.allclose(atol_vectoriser(c),
                       np.array([0.02062512, 1.25157463]))
    assert np.allclose(atol_vectoriser.transform(X=[a, b, c]),
                       np.array([[1.18168665, 0.42375966], [0.29861028, 1.06330156], [0.02062512, 1.25157463]]))

@VincentRouvreau (Contributor) commented:

In src/python/doc/representations_sum.inc, you should add your name:

:Author: Mathieu Carrière, Martin Royer

(be careful with the table format, whose lines end with a |)
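(For illustration only; the column widths and surrounding rows of the real .inc file will differ:)

| :Author: Mathieu Carrière, Martin Royer |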

@mglisse (Member) commented Jul 1, 2020:

In src/python/test/test_representations.py (or a new test_atol.py), I would test your example:

I think it is already tested by sphinx?

@VincentRouvreau (Contributor) commented:

I think it is already tested by sphinx?

Here it is just some text rendering. If you want sphinx to test it, you need to use .. testcode:: and .. testoutput:: syntax.

@mglisse (Member) commented Jul 1, 2020:

Here it is just some text rendering. If you want sphinx to test it, you need to use .. testcode:: and .. testoutput:: syntax.

According to https://www.sphinx-doc.org/en/master/usage/extensions/doctest.html, it looks like this syntax also enables testing, and indeed it was causing trouble because Martin had a different version of sklearn at home that was producing a different result.

@VincentRouvreau (Contributor) commented:

According to https://www.sphinx-doc.org/en/master/usage/extensions/doctest.html, it looks like this syntax also enables testing, and indeed it was causing trouble because Martin had a different version of sklearn at home that was producing a different result.

Oh you are right, sphinx_py_test fails. I was not aware of this syntax, forget my remark.

@VincentRouvreau (Contributor) commented Jul 1, 2020:

I wrote this dummy test:

import numpy as np
from sklearn.cluster import KMeans
from gudhi.representations.vector_methods import Atol

def test_dummy_atol():
    a = np.array([[1, 2, 4], [1, 4, 0], [1, 0, 4]])
    b = np.array([[4, 2, 0], [4, 4, 0], [4, 0, 2]])
    c = np.array([[3, 2, -1], [1, 2, -1]])

    for weighting_method in ["cloud", "iidproba"]:
        for contrast in ["gaussian", "laplacian", "indicator"]:
            atol_vectoriser = Atol(quantiser=KMeans(n_clusters=1, random_state=202006),
                                   weighting_method=weighting_method, contrast=contrast)
            atol_vectoriser.fit(X=[a, b, c])  # fit before vectorising
            atol_vectoriser(a)
            atol_vectoriser.transform(X=[a, b, c])

As Python is not compiled, it would be nice to have 100% of code coverage.
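(One possible way to run such a test locally, assuming pytest is used as the runner for these test files and that the pytest-cov plugin is installed for the coverage measurement:)

python -m pytest src/python/test/test_representations.py
python -m pytest --cov=gudhi.representations src/python/test/test_representations.py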

@martinroyer (Collaborator, Author) commented:

I wrote this dummy test: [...] As Python is not compiled, it would be nice to have 100% of code coverage.

Thank you!

@VincentRouvreau merged commit 7bd8c85 into GUDHI:master on Jul 23, 2020
@VincentRouvreau added the 3.3.0 (GUDHI version 3.3.0) label on Jul 23, 2020
@martinroyer deleted the atol branch on June 28, 2024 09:08