-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switching spectral initialization to sklean.manifold.SpectralEmbeddings #223
Comments
@dkobak don't want to hijack this issue so feel free to contact me directly, but I am curious:
|
Hi James, thanks for the interest. This is a tiny graph with very pronounced geometric structure: https://sparse.tamu.edu/HB/can_96. The structure is a ring: https://sparse-files-images.engr.tamu.edu/HB/can_96_graph.gif Here is the entire code to reproduce the issue (download and unpack the Matrix Market data from the link above): %matplotlib notebook
import numpy as np
import pylab as plt
from scipy.io import mmread
adjacency = mmread('/home/dmitry/Downloads/can_96.mtx').toarray()
affinity = adjacency / np.sum(adjacency, axis=1, keepdims=True)
affinity = (affinity + affinity.T) / 2
affinity /= np.sum(affinity)
from openTSNE.initialization import spectral
Z1 = spectral(affinity, tol=0)
from sklearn.manifold import SpectralEmbedding
Z2 = SpectralEmbedding(affinity='precomputed').fit_transform(affinity)
plt.figure(figsize=(8, 4), layout='constrained')
plt.subplot(121)
plt.title('openTSNE')
plt.scatter(Z1[:,0], Z1[:,1])
plt.subplot(122)
plt.title('sklearn')
plt.scatter(Z2[:,0], Z2[:,1])
plt.savefig('spectral.png', dpi=200) SpectralEmbedding is run with default parameters so I assume it's using |
Wow, right after posting the above comment, I found out that the openTSNE's result only arises due to particular v0 = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
eigvals, eigvecs = sp.linalg.eigsh(
A, M=D, k=k, tol=tol, maxiter=max_iter, which="LM", v0=v0
) If I simply remove this Sklearn uses this: https://github.com/scikit-learn/scikit-learn/blob/dc580a8ef5ee2a8aea80498388690e2213118efd/sklearn/utils/_arpack.py#L4 v0 = random_state.uniform(-1, 1, size) so openTSNE should probably do the same. Or we simply call |
More details on this: D = np.diag(np.ravel(np.sum(affinity, axis=1)))
from scipy.sparse.linalg import eigsh
v0 = np.ones(affinity.shape[0]) / np.sqrt(affinity.shape[0])
eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM", v0=v0)
print(eigvals)
eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM")
print(eigvals) results in
|
Last comment, promise: using %time eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM")
%time eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM", sigma=1.0) So if we keep our own implementation, then using |
Thanks for all those details Dmitry. I had a feeling it would be something weird with a very structured graph. It is indeed skipping an eigenvector for some choices of In this case, the result seems to be exceptionally sensitive to I have never had good luck with setting |
Yes it does! Weird. Sklearn does mention here https://github.com/scikit-learn/scikit-learn/blob/dc580a8ef5ee2a8aea80498388690e2213118efd/sklearn/utils/_arpack.py#L7 that the scale of initialization may be important... However, at this point I do not trust this approach and would rather have
Just tried now, by generating the standard t-SNE affinity matrix for MNIST (so 90 neighbors for each point). eigsh(affinity, M=D, k=3, tol=0, which="LM")
eigsh(affinity, M=D, k=3, tol=0, which="LM", sigma=1.0)
SpectralEmbedding(affinity='precomputed').fit_transform(affinity) The first call took 11 s. The other two did indeed hang forever and I stopped both of them after a few minutes. However, using Pyamg solver SpectralEmbedding(affinity='precomputed', eigen_solver='amg').fit_transform(affinity) also took 11 s. This may be due to looser tolerance though, as the docs mention that Just noticed that setting the tolerance for the ARPACK solver in SpectralEmbedding is only possible in version 1.2. Okay, so if we don't want to depend on sklearn 1.2, then it seems that maybe the best course of action is simply to kick out |
On a technical basis, I suspect in most real-world cases, the current choice of |
Do you have any evidence that the current choice of |
With |
It seems to me that if sklearn doesn't have these pitfalls, and produces the same results as our implementation, it would make more sense to just switch over to their implementation. I don't really know any of the details of the numerical solvers and it could be a real hassle to maintain if we find similar bugs to this in the future. What do you think? |
That was my original thinking here, however as @jlmelville pointed out, sklearn implementation can be VERY SLOW unless one either sets some non-zero So the question is if you want to depend on either sklearn 1.2 or pyamg. I also like the simplicity of simply running |
Okay, that's fine by me. I'm not going to pretend I know what this does, but I'm sure you've tested it out thoroughly and I'll leave it to your judgement :) Maybe it would be best to open up a PR with this change so we can get this merged ASAP. |
It is part of #225. But just to clarify, the only edit is that it replaces
with
before running
So the only edit is in |
A student in our lab is currently looking into spectral initialization, and she found out that
openTSNE.init.spectral(tol=0)
in some cases does not agree tosklearn.manifold.SpectralEmbedding(affinity='precomputed')
. In some cases it does agree perfectly or near-perfectly, but we have an example when the result is very different, andSpectralEmbedding
gives what seems like a more meaningful result.I looked at the math, and it seems that they should conceptually be computing the same thing (
SpectralEmbedding
finds eigenvectors of the L_sym, whereasinit.spectral
finds generalized eigenvectors or W and D, but that should be equivalent, as per https://jlmelville.github.io/smallvis/spectral.html @jlmelville). We don't know what the difference is due to. It may be numerical.However, conceptually, it seems sensible if openTSNE would simply outsource the computation to sklearn.
A related issue is that
init.spectral
is not reproducible and gives different results with each run. Apparently the way we initializev0
makes ARPACK to still have some randomness. Sklearn gets around this by initializingv0
differently. I guess openTSNE should do the same -- but of course if we end up simply callingSpectralEmbedding
then it won't matter.The text was updated successfully, but these errors were encountered: