Introduction of ATOL in finite vectorisation method #348
Conversation
Thanks for the PR!
Could you include some minimal (or more extensive 😉) test, so we have a chance to notice early if we accidentally completely break the code at some point?
I added a minimal working example in the class documentation.
Is there a test file for representation methods that I cannot find?
Currently there is only src/python/test/test_representations.py. You can add to it or create a separate file as you like. For your example in the documentation, if you want, you can use syntax from https://www.sphinx-doc.org/en/master/usage/extensions/doctest.html so that your example code is actually run while building the documentation, and the output is checked against what you have written. For instance, src/python/doc/wasserstein_distance_user.rst uses testcode/testoutput. |
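As an aside, here is a minimal sketch of the kind of doctest-style example being discussed (the `add` function is purely illustrative, not part of GUDHI): Sphinx's doctest extension, like Python's built-in doctest module, executes the `>>>` lines and compares the printed output against what is written below them.

```python
# Minimal doctest-style sketch (illustrative function, not GUDHI code).
# Sphinx's doctest extension and Python's own doctest module both run
# the ">>>" lines and check the output shown underneath.
def add(a, b):
    """Return the sum of a and b.

    >>> add(2, 3)
    5
    """
    return a + b

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # fails loudly if the documented output drifts
```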
The current format already allows for that, i.e. the test I wrote in the doc is already being run during the build (hence my problem above this comment).
Oops, I missed that. Nice.
Er, which problem?
I notice you seem to be using the web interface to modify the code. Do you find it convenient? I usually find it easier to commit and push to my branch without messing with a browser.
Your name should probably appear in a more visible place than just one comment; maybe Vincent will have suggestions about that.
A problem with randomness that is not fixed by fixing the seed (the seed I used at home for KMeans led to a different center order than the one on the server).
""" | ||
if sample_weight is None: | ||
sample_weight = [self.weighting_method(measure) for measure in X] | ||
return np.stack([self(measure, sample_weight=weight) for measure, weight in zip(X, sample_weight)]) |
ATOL does not necessarily apply to diagrams, it could be simpler objects that have near zero production cost.
You're right about parallelizing the transform: our research shows ATOL can be very useful even after a rough calibration, so it's very likely that there are situations where fit will necessarily be quick and dirty, but then there is a lot of transforming to be done.
(thank you Marc!)
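For illustration, a rough sketch of what parallelising the per-measure transform could look like, using only the standard library. The `toy_vectorise` function is a hypothetical stand-in for the per-measure call, not GUDHI's actual Atol code; the point is only that when fit is cheap and there are many measures, the independent per-measure calls can be mapped over a pool.

```python
# Sketch: parallelising a per-measure transform with the standard library.
# `toy_vectorise` is a hypothetical stand-in for Atol.__call__.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

centers = np.array([[0.0, 0.0], [4.0, 4.0]])

def toy_vectorise(measure):
    # distance of each point to each center, min-pooled per center
    d = np.linalg.norm(measure[:, None, :] - centers[None, :, :], axis=-1)
    return d.min(axis=0)

measures = [np.random.default_rng(i).normal(size=(50, 2)) for i in range(8)]

# Each measure is processed independently, so the map parallelises trivially.
with ThreadPoolExecutor() as pool:
    out = np.stack(list(pool.map(toy_vectorise, measures)))

print(out.shape)  # (8, 2)
```

For CPU-bound numpy work a ProcessPoolExecutor or joblib's Parallel would be the more usual choice; the thread pool keeps the sketch dependency-free.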
self.centers = self.quantiser.cluster_centers_
dist_centers = pairwise.pairwise_distances(self.centers)
np.fill_diagonal(dist_centers, np.inf)
self.inertias = np.min(dist_centers, axis=0)/2
I guess this isn't meant to be used with a single center.
Actually there is no reason why it should not, and it is interesting.
dist_centers = pairwise.pairwise_distances(self.centers)
np.fill_diagonal(dist_centers, np.inf)
self.inertias = np.min(dist_centers, axis=0)/2
Suggested change:

- dist_centers = pairwise.pairwise_distances(self.centers)
- np.fill_diagonal(dist_centers, np.inf)
- self.inertias = np.min(dist_centers, axis=0)/2
+ if self.quantiser.n_clusters == 1:
+     dist_centers = pairwise.pairwise_distances(measures_concat)
+     np.fill_diagonal(dist_centers, 0)
+     self.inertias = np.max(dist_centers)/2
+ else:
+     dist_centers = pairwise.pairwise_distances(self.centers)
+     np.fill_diagonal(dist_centers, np.inf)
+     self.inertias = np.min(dist_centers, axis=0)/2
if self.quantiser.n_clusters == 1:
    dist_centers = pairwise.pairwise_distances(measures_concat)
    np.fill_diagonal(dist_centers, 0)
    self.inertias = np.max(dist_centers)/2
else:
    dist_centers = pairwise.pairwise_distances(self.centers)
    np.fill_diagonal(dist_centers, np.inf)
    self.inertias = np.min(dist_centers, axis=0)/2
return self
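The logic of this branch can be reproduced in a numpy-only sketch (the helper names here are invented for the example; the real code uses sklearn's `pairwise.pairwise_distances`): with a single center, the radius comes from the spread of the calibration points, otherwise each center's radius is half the distance to its nearest other center.

```python
import numpy as np

def pairwise_dists(pts):
    # Euclidean distance matrix; numpy-only stand-in for
    # sklearn.metrics.pairwise.pairwise_distances.
    diff = pts[:, None, :] - pts[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def inertias_from_centers(centers, measures_concat):
    # Mirrors the branch above: single-center radius from data spread,
    # otherwise half the distance to the nearest other center.
    if len(centers) == 1:
        d = pairwise_dists(measures_concat)
        return np.array([np.max(d) / 2])
    d = pairwise_dists(centers)
    np.fill_diagonal(d, np.inf)
    return np.min(d, axis=0) / 2

centers = np.array([[0.0, 0.0], [3.0, 4.0]])
points = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
print(inertias_from_centers(centers, points))  # both centers: 5/2 = 2.5
```

The two centers are at distance 5, so each gets inertia 2.5; with a single center, the farthest pair of calibration points (distance 5) gives the same value here.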
It happens that two quantiser centers are identical, which leads to 0s in self.inertias; that is a problem I could address here...?
That sounds like a good idea. Are they exactly identical, or can they be very close but slightly apart? Can there be an arbitrary number of identical centers? Do you already know how you want to handle this problem?
It happens that they can be very close but slightly apart, or that they are exactly identical. And there can be an arbitrary number of identical centers as far as I can tell. I am unsure as to how to handle this.
Options that I can think of include:
1- removing identical centers: it actually sounds rather solid, but we lose control of the number of centers and it does not feel natural.
2- recasting the resulting inertias that are too close to zero (these are what prompt errors when computing the contrast functions over measures). I actually like the simplicity of this; in a way it says that the quantised mean measure is supported on points that can have multiplicity higher than 1.
3- ensuring, ahead of quantising, that the calibration measure points are sufficiently distinct (using some sort of tree-trimming), so that the quantiser will pick sufficiently different cluster centers. Even though this could work well in practice, it is rather shaky: it assumes the quantiser will not mess up, and it adds an extra, arbitrary trimming step that could obfuscate the signal in the end.
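A minimal sketch of what option 2- could look like, assuming a simple floor on the inertias (the function name and the eps value are arbitrary illustrative choices, not GUDHI code):

```python
import numpy as np

def clamp_inertias(inertias, eps=1e-8):
    # Recast inertias that are too close to zero so the contrast
    # functions stay well defined when two centers (nearly) coincide.
    # eps is an arbitrary illustrative floor.
    return np.maximum(inertias, eps)

print(clamp_inertias(np.array([0.0, 1e-12, 0.5])))
```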
I don't understand what 2- means, but anyway, I am also ok with merging the current version and improving it once you have settled on a way to handle it, if the problem doesn't occur too often.
Yes it is rare and should be even rarer now, I'm ok with merging too.
Let's wait and see if Vincent has any comments, he can handle the merging then.
I have a hard time understanding how this PR breaks those tests...
The failure was due to a change in the continuous integration we use; Vincent fixed it, it should work just fine on your next push.
In src/python/test/test_representations.py (or a new test_atol.py), I would test your example:

import numpy as np
from sklearn.cluster import KMeans
from gudhi.representations.vector_methods import Atol

def test_atol():
    a = np.array([[1, 2, 4], [1, 4, 0], [1, 0, 4]])
    b = np.array([[4, 2, 0], [4, 4, 0], [4, 0, 2]])
    c = np.array([[3, 2, -1], [1, 2, -1]])
    atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006))
    assert np.allclose(atol_vectoriser.fit(X=[a, b, c]).centers,
                       np.array([[2., 0.66666667, 3.33333333], [2.6, 2.8, -0.4]]))
    assert np.allclose(atol_vectoriser(a), np.array([1.18168665, 0.42375966]))
    assert np.allclose(atol_vectoriser(c), np.array([0.02062512, 1.25157463]))
    assert np.allclose(atol_vectoriser.transform(X=[a, b, c]),
                       np.array([[1.18168665, 0.42375966], [0.29861028, 1.06330156], [0.02062512, 1.25157463]]))
In src/python/doc/representations_sum.inc, you add your name: :Author: Mathieu Carrière, Martin Royer (be careful with the table format: each line ends with a |).
I think it is already tested by sphinx?
Here it is just some text rendering. If you want sphinx to test it, you need to use
Regarding https://www.sphinx-doc.org/en/master/usage/extensions/doctest.html: it looks like this syntax also enables testing, and indeed it was causing trouble because Martin had a different version of sklearn at home and it was producing a different result.
Oh you are right, sphinx_py_test fails. I was not aware of this syntax, forget my remark.
I wrote this dummy test:

def test_dummy_atol():
    a = np.array([[1, 2, 4], [1, 4, 0], [1, 0, 4]])
    b = np.array([[4, 2, 0], [4, 4, 0], [4, 0, 2]])
    c = np.array([[3, 2, -1], [1, 2, -1]])
    for weighting_method in ["cloud", "iidproba"]:
        for contrast in ["gaussian", "laplacian", "indicator"]:
            atol_vectoriser = Atol(quantiser=KMeans(n_clusters=1, random_state=202006),
                                   weighting_method=weighting_method, contrast=contrast)
            atol_vectoriser(a)
            atol_vectoriser.transform(X=[a, b, c])

As Python is not compiled, it would be nice to have 100% code coverage.
Thank you!