
MAINT: Support NumPy 2 and build Python 3.12 wheels #644

Merged
merged 5 commits into from
Jul 5, 2024

Conversation

@seberg (Contributor) commented Jul 3, 2024

This adds NumPy 2 support and, for good sport, Python 3.12 wheels as well. I didn't bother building 3.8 wheels (NumPy 2 doesn't support 3.8), since scikit-learn doesn't support it anymore either and the old wheels remain usable.

Tests pass locally, and I presume CI will run them with NumPy 2 now anyway (not that there are many tests).

@seberg (Contributor, Author) commented Jul 3, 2024

There seems to be a further test error:

```
================================================================================== FAILURES ===================================================================================
________________________________________________________________________ test_hdbscan_distance_matrix _________________________________________________________________________

    def test_hdbscan_distance_matrix():
        D = distance.squareform(distance.pdist(X))
        D /= np.max(D)

        labels, p, persist, ctree, ltree, mtree = hdbscan(D, metric="precomputed")
        # number of clusters, ignoring noise if present
        n_clusters_1 = len(set(labels)) - int(-1 in labels)  # ignore noise
        assert n_clusters_1 == n_clusters

        labels = HDBSCAN(metric="precomputed").fit(D).labels_
        n_clusters_2 = len(set(labels)) - int(-1 in labels)
        assert n_clusters_2 == n_clusters

        validity = validity_index(D, labels, metric="precomputed", d=2)
>       assert validity >= 0.6
E       assert 0.5441246857483076 >= 0.6

hdbscan/tests/test_hdbscan.py:165: AssertionError
=========================================================================== short test summary info ===========================================================================
FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_distance_matrix - assert 0.5441246857483076 >= 0.6
============================================================================== 1 failed in 0.73s ==============================================================================
```

That test error seems unrelated to NumPy 2 (I downgraded NumPy again to check), but I am not sure what it is related to, since downgrading scipy/scikit-learn didn't seem to change much. Some roundabout issue with Cython?

@seberg (Contributor, Author) commented Jul 3, 2024

Ah, the:

```
E   TypeError: 'numpy.float64' object cannot be interpreted as an integer
```

is specific to certain versions of Cython, I think.

```python
node_list = sorted(stability.keys(), reverse=True)
```

in get_clusters contains floats for the node IDs. That seems strange, but if it needs to keep working, the easiest fix is maybe to just add:

```python
node_list = [int(n) for n in node_list]
```

since I suspect integer node IDs were always a requirement.
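A minimal, self-contained sketch of the failure mode and the workaround; the stability dict and its values here are made up for illustration:

```python
import numpy as np

# Hypothetical stability mapping: node IDs end up as numpy float64
# because they were stored alongside float values in one array.
stability = {np.float64(5.0): 0.8, np.float64(3.0): 1.2, np.float64(4.0): 0.4}

node_list = sorted(stability.keys(), reverse=True)
# Each entry is np.float64; some Cython/NumPy combinations reject a
# float64 where a C integer is expected, raising:
#   TypeError: 'numpy.float64' object cannot be interpreted as an integer

# The workaround applied in this PR: cast explicitly, since the node
# IDs are always integral values anyway.
node_list = [int(n) for n in node_list]
print(node_list)  # [5, 4, 3]
```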

seberg added a commit to seberg/cuml that referenced this pull request Jul 3, 2024
This applies some smaller NumPy 2 related fixes.  With the (in progress)
cupy 13.2 fixups, the single GPU test suite seems to be doing fine
(not quite finished; I may push more commits, but can also open a new PR).

The one thing I noticed that is a bit annoying is that hdbscan has not
yet been released for NumPy 2. Is that actually still required, since I think
sklearn has its own version now?
(I don't expect this to be a problem for long, but there is at least one odd test
failure trying to make hdbscan work in scikit-learn-contrib/hdbscan#644.)
```python
else:
    node_list = sorted(stability.keys(), reverse=True)[:-1]
node_list = [int(n) for n in node_list]
```

@seberg (Contributor, Author) commented on this change:

I don't like it, but it works. Cython and NumPy have a weird interaction here, allowing conversion from float or not depending on both the Cython and NumPy versions.

If you want to convert from float, you could also just cdef the variable to long or int; IIRC that should always work, and I don't think you care about more than 2**31 clusters....

A collaborator replied:

I do agree that there are some weird interactions here. It comes down to the fact that at some point the values are contained in a single numpy array of type float64 (since other values in the array are floats), and this was "fine", but it is tripping things up now in the Cython/NumPy interaction. I think this workaround looks fine under the circumstances.

Before:

```python
edge_selection = np.prod(np.in1d(min_span_tree.T[:2], vertices).reshape(
    (min_span_tree.shape[0], 2), order='F'), axis=1).astype(bool)
```

After:

```python
edge_selection = np.prod(
    np.isin(min_span_tree.T[:2], vertices), axis=0).astype(bool)
```

@seberg (Contributor, Author) commented on this change:

Seems correct, but may warrant a look. Nothing else makes sense to me anyway; the order='F' was breaking things when switching to isin, which doesn't flatten the result (it keeps the shape of its first argument).

The other places work with 1-D arrays as far as I can tell, so this distinction doesn't matter there.

A collaborator replied:

Yes, this should be correct. isin is a much nicer way to do this, but it didn't exist back when this code was first written.
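To sanity-check the equivalence of the two forms, a small standalone example; the tree and vertex values below are made up, with edges stored as (source, target, weight) rows:

```python
import numpy as np

# Hypothetical minimum spanning tree: rows of (source, target, weight).
min_span_tree = np.array([
    [0.0, 1.0, 0.5],
    [1.0, 2.0, 0.7],
    [2.0, 3.0, 0.9],
])
vertices = np.array([0, 1, 2])

# Old form: np.in1d (deprecated in NumPy 2.0) flattens its first argument,
# so the result had to be reshaped back with order='F'.
old = np.prod(np.in1d(min_span_tree.T[:2], vertices).reshape(
    (min_span_tree.shape[0], 2), order='F'), axis=1).astype(bool)

# New form: np.isin keeps the (2, n_edges) shape of its first argument,
# so a product over axis 0 tests both endpoints directly.
new = np.prod(
    np.isin(min_span_tree.T[:2], vertices), axis=0).astype(bool)

# An edge is selected only when both of its endpoints are in `vertices`.
print(new)  # [ True  True False]
```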

@lmcinnes (Collaborator) commented Jul 4, 2024

Thanks for taking a look at this. There are definitely some difficulties; good chunks of the codebase are pretty old at this point and don't use modern idioms that would make things easier. If you can change the testing grid to drop Python 3.7 and 3.8 and add 3.12 (so all the tests come out green), that would be appreciated.

@seberg (Contributor, Author) commented Jul 5, 2024

Done; I guess it'll probably pass. The fact that floats are passed around as cluster IDs probably lets a few things slip through.

It might be that the integer conversion doesn't matter anymore so long as you build with NumPy 2, but then I am not sure it builds right with NumPy 1.x and Cython 3; the integer conversion problem may be cropping up in other places as well.

Not sure I can help with auditing that, if that is the case. The hacky way to solve it would be to steal something like this from the NumPy definitions (or just decide that int is sufficient anyway):

```cython
cdef extern from "numpy/arrayobject.h":
    # It would be nice to use size_t and ssize_t, but ssize_t has special
    # implicit conversion rules, so just use "long".
    # Note: The actual type only matters for Cython promotion, so long
    #       is closer than int, but could lead to incorrect promotion.
    #       (Nothing to worry about, and always the status quo.)
    ctypedef signed long npy_intp
```

(The good way is probably a proper audit to actually use integers throughout.)

@lmcinnes (Collaborator) commented Jul 5, 2024

Thanks for this; it is honestly greatly appreciated!

@lmcinnes merged commit c201b2e into scikit-learn-contrib:master on Jul 5, 2024 (1 check passed).
@seberg deleted the numpy2 branch on July 5, 2024 at 08:39.
@seberg (Contributor, Author) commented Jul 5, 2024

Thanks! It would be cool if you could do a release soon (and remember that if there is an issue with integers and Cython, that is plausible for Cython 3 + old NumPy 1.x).
Although, it sounds like I may want to suggest looking at fast_hdbscan as well, if that is more actively maintained.

seberg added a commit to seberg/cuml that referenced this pull request Jul 18, 2024
trxcllnt added a commit to trxcllnt/hdbscan that referenced this pull request Jul 22, 2024
rapids-bot bot pushed a commit to rapidsai/cuml that referenced this pull request Jul 28, 2024
This applies some smaller NumPy 2 related fixes.  With (in progress) cupy 13.2 fixups, the single gpu test suite seems to be doing mostly fine.  There is a single test remaining:
```
test_simpl_set.py::test_simplicial_set_embedding
```
is failing with:
```
(Pdb) cp.asarray(cu_embedding)
array([[23067.518, 23067.518],
       [17334.559, 17334.559],
       [22713.598, 22713.598],
       ...,
       [23238.438, 23238.438],
       [25416.912, 25416.912],
       [19748.943, 19748.943]], dtype=float32)
```
being completely different from the reference:
```
array([[5.330462 , 4.3419437],
       [4.1822557, 5.6225405],
       [5.200859 , 4.530094 ],
       ...,
       [4.852359 , 5.0026293],
       [5.361374 , 4.1475334],
       [4.0259256, 5.7187223]], dtype=float32)
```
And I am not sure why that might be; I will prod it a bit more, but it may need someone who knows the methods to have a look.

One wrinkle is that hdbscan is not yet released for NumPy 2; I guess it is still required even though sklearn has its own version?
(Probably not a big issue, but my fixups in scikit-learn-contrib/hdbscan#644 ran into some issue even though it doesn't seem NumPy 2 related.)

xref: rapidsai/build-planning#38

Authors:
  - Sebastian Berg (https://github.com/seberg)
  - https://github.com/jakirkham
  - Dante Gama Dessavre (https://github.com/dantegd)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #5954