
MAINT: Support NumPy 2 and build Python 3.12 wheels #644

Merged
merged 5 commits into from
Jul 5, 2024

Conversation

@seberg (Contributor) commented Jul 3, 2024

This adds NumPy 2 support and, for good sport, Python 3.12 wheels as well. I didn't bother building 3.8 wheels (NumPy 2 doesn't support 3.8), since scikit-learn doesn't support it anymore either and the old wheels remain usable.

Tests pass locally, and I presume CI will run them with NumPy 2 now anyway (not that there are many tests).

@seberg (Contributor, Author) commented Jul 3, 2024

There seems to be a further test error:

```
================================================================================== FAILURES ===================================================================================
________________________________________________________________________ test_hdbscan_distance_matrix _________________________________________________________________________

    def test_hdbscan_distance_matrix():
        D = distance.squareform(distance.pdist(X))
        D /= np.max(D)

        labels, p, persist, ctree, ltree, mtree = hdbscan(D, metric="precomputed")
        # number of clusters, ignoring noise if present
        n_clusters_1 = len(set(labels)) - int(-1 in labels)  # ignore noise
        assert n_clusters_1 == n_clusters

        labels = HDBSCAN(metric="precomputed").fit(D).labels_
        n_clusters_2 = len(set(labels)) - int(-1 in labels)
        assert n_clusters_2 == n_clusters

        validity = validity_index(D, labels, metric="precomputed", d=2)
>       assert validity >= 0.6
E       assert 0.5441246857483076 >= 0.6

hdbscan/tests/test_hdbscan.py:165: AssertionError
=========================================================================== short test summary info ===========================================================================
FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_distance_matrix - assert 0.5441246857483076 >= 0.6
============================================================================== 1 failed in 0.73s ==============================================================================
```

That test error seems unrelated to NumPy 2 (I downgraded NumPy again to check), but I am not sure what it is related to, since downgrading scipy/scikit-learn didn't seem to change much. Some roundabout issue with Cython?

@seberg (Contributor, Author) commented Jul 3, 2024

Ah, the:

```
E   TypeError: 'numpy.float64' object cannot be interpreted as an integer
```

is specific to certain versions of Cython, I think.

```python
node_list = sorted(stability.keys(), reverse=True)
```

in get_clusters contains floats for the node IDs. That seems strange, but if it needs to keep working, the easiest fix is maybe to just add:

```python
node_list = [int(n) for n in node_list]
```

since I suspect integer node IDs were always a requirement.
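A minimal, self-contained sketch of the failure mode and the workaround; the stability dict and its values here are made up for illustration:

```python
import numpy as np

# Hypothetical stability mapping: node IDs end up as numpy float64
# because they were stored alongside float values in one array.
stability = {np.float64(5.0): 0.8, np.float64(3.0): 1.2, np.float64(4.0): 0.4}

node_list = sorted(stability.keys(), reverse=True)
# Each entry is np.float64; some Cython/NumPy combinations reject a
# float64 where a C integer is expected, raising:
#   TypeError: 'numpy.float64' object cannot be interpreted as an integer

# The workaround applied in this PR: cast explicitly, since the node
# IDs are always integral values anyway.
node_list = [int(n) for n in node_list]
print(node_list)  # [5, 4, 3]
```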

seberg added a commit to seberg/cuml that referenced this pull request Jul 3, 2024
This applies some smaller NumPy 2 related fixes.  With the (in progress)
cupy 13.2 fixups, the single GPU test suite seems to be doing fine
(not quite finished; I may push more commits, but can also open a new PR).

The one thing I noticed that is a bit annoying is that hdbscan has not
yet been released for NumPy 2. Is that actually still required, since I think
sklearn has its own version now?
(I don't expect this to be a problem for long, but there is at least one odd test
failure trying to make hdbscan work in scikit-learn-contrib/hdbscan#644.)
```python
else:
    node_list = sorted(stability.keys(), reverse=True)[:-1]
node_list = [int(n) for n in node_list]
```

@seberg (Contributor, Author) commented on this change:

I don't like it, but it works. Cython and NumPy have a weird interaction here, allowing conversion from float or not depending on both the Cython and NumPy versions.

If you want to convert from float, you could also just cdef the variable to long or int; IIRC that should always work, and I don't think you care about more than 2**31 clusters....

A collaborator replied:

I do agree that there are some weird interactions here. It comes down to the fact that at some point the values are contained in a single numpy array of type float64 (since other values in the array are floats), and this was "fine", but it is tripping things up now in the Cython/NumPy interaction. I think this workaround looks fine under the circumstances.

Before:

```python
edge_selection = np.prod(np.in1d(min_span_tree.T[:2], vertices).reshape(
    (min_span_tree.shape[0], 2), order='F'), axis=1).astype(bool)
```

After:

```python
edge_selection = np.prod(
    np.isin(min_span_tree.T[:2], vertices), axis=0).astype(bool)
```

@seberg (Contributor, Author) commented on this change:

Seems correct, but may warrant a look. Nothing else makes sense to me anyway; the order='F' was breaking things when switching to isin, which doesn't flatten the result (it keeps the shape of its first argument).

The other places work with 1-D arrays as far as I can tell, so this distinction doesn't matter there.

A collaborator replied:

Yes, this should be correct. isin is a much nicer way to do this, but it didn't exist back when this code was first written.
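To sanity-check the equivalence of the two forms, a small standalone example; the tree and vertex values below are made up, with edges stored as (source, target, weight) rows:

```python
import numpy as np

# Hypothetical minimum spanning tree: rows of (source, target, weight).
min_span_tree = np.array([
    [0.0, 1.0, 0.5],
    [1.0, 2.0, 0.7],
    [2.0, 3.0, 0.9],
])
vertices = np.array([0, 1, 2])

# Old form: np.in1d (deprecated in NumPy 2.0) flattens its first argument,
# so the result had to be reshaped back with order='F'.
old = np.prod(np.in1d(min_span_tree.T[:2], vertices).reshape(
    (min_span_tree.shape[0], 2), order='F'), axis=1).astype(bool)

# New form: np.isin keeps the (2, n_edges) shape of its first argument,
# so a product over axis 0 tests both endpoints directly.
new = np.prod(
    np.isin(min_span_tree.T[:2], vertices), axis=0).astype(bool)

# An edge is selected only when both of its endpoints are in `vertices`.
print(new)  # [ True  True False]
```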

@lmcinnes (Collaborator) commented Jul 4, 2024

Thanks for taking a look at this. There are definitely some difficulties; good chunks of the codebase are pretty old at this point and don't use modern idioms that would make things easier. If you can change the testing grid to drop Python 3.7 and 3.8 and add 3.12 (so all the tests come out green), that would be appreciated.

@seberg (Contributor, Author) commented Jul 5, 2024

Done; I guess it'll probably pass. The fact that floats are passed around as cluster IDs probably lets a few things slip through.

It might be that the integer conversion doesn't matter anymore so long as you build with NumPy 2, but then I am not sure it builds right with NumPy 1.x and Cython 3; the integer conversion problem may be cropping up in other places as well.

Not sure I can help with auditing that, if that is the case. The hacky way to solve it would be to steal something like this from the NumPy definitions (or just decide that int is sufficient anyway):

```cython
cdef extern from "numpy/arrayobject.h":
    # It would be nice to use size_t and ssize_t, but ssize_t has special
    # implicit conversion rules, so just use "long".
    # Note: The actual type only matters for Cython promotion, so long
    #       is closer than int, but could lead to incorrect promotion.
    #       (Nothing to worry about, and always the status quo.)
    ctypedef signed long npy_intp
```

(The good way is probably a proper audit to actually use integers throughout.)

@lmcinnes (Collaborator) commented Jul 5, 2024

Thanks for this; it is honestly greatly appreciated!

@lmcinnes merged commit c201b2e into scikit-learn-contrib:master on Jul 5, 2024 (1 check passed).
@seberg deleted the numpy2 branch on July 5, 2024 at 08:39.
@seberg (Contributor, Author) commented Jul 5, 2024

Thanks! It would be cool if you could do a release soon (and remember that if there is an issue with integers and Cython, that is plausible for Cython 3 + old NumPy 1.x).
Although, it sounds like I may want to suggest looking at fast_hdbscan as well, if that is more actively maintained.

seberg added a commit to seberg/cuml that referenced this pull request Jul 18, 2024
trxcllnt added a commit to trxcllnt/hdbscan that referenced this pull request Jul 22, 2024
rapids-bot bot pushed a commit to rapidsai/cuml that referenced this pull request Jul 28, 2024
This applies some smaller NumPy 2 related fixes.  With (in progress) cupy 13.2 fixups, the single gpu test suite seems to be doing mostly fine.  There is a single test remaining:
```
test_simpl_set.py::test_simplicial_set_embedding
```
is failing with:
```
(Pdb) cp.asarray(cu_embedding)
array([[23067.518, 23067.518],
       [17334.559, 17334.559],
       [22713.598, 22713.598],
       ...,
       [23238.438, 23238.438],
       [25416.912, 25416.912],
       [19748.943, 19748.943]], dtype=float32)
```
being completely different from the reference:
```
array([[5.330462 , 4.3419437],
       [4.1822557, 5.6225405],
       [5.200859 , 4.530094 ],
       ...,
       [4.852359 , 5.0026293],
       [5.361374 , 4.1475334],
       [4.0259256, 5.7187223]], dtype=float32)
```
And I am not sure why that might be; I will prod it a bit more, but it may need someone who knows the methods to have a look.

One wrinkle is that hdbscan is not yet released for NumPy 2; I guess it is still required even though sklearn has its own version?
(Probably not a big issue, but my fixups in scikit-learn-contrib/hdbscan#644 ran into some issue even though it doesn't seem NumPy 2 related.)

xref: rapidsai/build-planning#38

Authors:
  - Sebastian Berg (https://github.com/seberg)
  - https://github.com/jakirkham
  - Dante Gama Dessavre (https://github.com/dantegd)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #5954