post: Performance improvements blog posts #191
`_posts/sklearn-perf-context.md`
Title: Performance and scikit-learn (1/4)
Date: 2021-12-16
Category: scikit-learn
Slug: sklearn-perf-context
Lang: en
Authors: Julien Jerphanion
Summary: Context: the current state of scikit-learn performance
Status: Published

## High-level overview of the scikit-learn dependencies

scikit-learn is mainly written in Python and is built on top of
some core libraries of the scientific Python ecosystem.

This ecosystem allows _high expressiveness_ and
_interactivity_: one can perform complex operations in a few
lines of code and get the results straight away.

It has also allowed setting up simple conventions, which make
the algorithms in the code base easy for new contributors
to understand and improve.

Finally, most of the complex operations can be delegated
to well-tested third-party libraries. For instance, calls
to functions implemented in
[`numpy.linalg`](https://numpy.org/doc/stable/reference/routines.linalg.html),
[`scipy.linalg`](https://docs.scipy.org/doc/scipy/reference/linalg.html), and
[`scipy.sparse.linalg`](https://docs.scipy.org/doc/scipy/reference/sparse.linalg.html)
are delegated to
[BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms),
[LAPACK](https://www.netlib.org/lapack/),
and [ARPACK](https://www.caam.rice.edu/software/ARPACK/) interfaces.
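This delegation can be made visible with a small sketch: scipy exposes the underlying BLAS routines directly, so we can compare a high-level matrix product against an explicit call to `dgemm` (the exact BLAS routine dispatched in practice depends on the dtypes and on the installed BLAS implementation):

```python
import numpy as np
from scipy.linalg.blas import dgemm

rng = np.random.default_rng(0)
A = rng.random((50, 30))
B = rng.random((30, 40))

# High-level call: numpy delegates the matrix product to BLAS.
C_numpy = A @ B

# The same BLAS routine (double-precision GEMM), called explicitly.
C_blas = dgemm(alpha=1.0, a=A, b=B)

assert np.allclose(C_numpy, C_blas)
```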

## Main reasons for limited performance

The PyData stack is simple but is not tailored for optimal performance
for several reasons.

### CPython internals

CPython -- the main implementation of Python -- is slow.

First, CPython has _an interpreter_: there is a cost in converting Python
instructions into an intermediate representation --
the _bytecode_ -- and in executing the instructions by interpreting
that bytecode.

Secondly, nearly every value in CPython is _boxed_ into a `PyObject`
-- implemented as a C struct. As such, simple operations
(like adding two floats) come with a non-negligible dispatch
overhead, as the interpreter has to check types that are unknown
in advance.
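The boxing overhead is easy to observe: a raw C `double` occupies 8 bytes, while a boxed CPython `float` is a full `PyObject` carrying a reference count and a type pointer on top of the payload.

```python
import sys
import numpy as np

# Size of a boxed CPython float (a full PyObject).
boxed = sys.getsizeof(1.0)

# Size of one unboxed float64 element inside a numpy array.
unboxed = np.dtype(np.float64).itemsize

assert unboxed == 8
assert boxed > unboxed  # typically 24 bytes on a 64-bit CPython
```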

Thirdly, for memory management, CPython relies on a global
mutex on its interpreter called the _Global Interpreter Lock_[ref]For more information about the GIL, see
[this reference from the Python Wiki](https://wiki.python.org/moin/GlobalInterpreterLock).[/ref].
This mechanism simplifies the interpreter's internals, but it restricts
computations in most cases to sequential execution in a single thread,
removing the benefit of using threads.
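A minimal sketch of this restriction: two threads running a CPU-bound pure-Python loop both complete correctly, but because the GIL serializes bytecode execution, the wall-clock time is roughly the same as running the two calls sequentially.

```python
import threading

# CPU-bound work in pure Python: sum the integers below n.
def work(n, out, i):
    s = 0
    for k in range(n):
        s += k
    out[i] = s

N = 200_000
out = [0, 0]
threads = [threading.Thread(target=work, args=(N, out, i)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both threads produce the right result, but their bytecode was
# interleaved on a single core because of the GIL.
assert out[0] == out[1] == N * (N - 1) // 2
```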

### Memory-hierarchy suboptimal implementations

`numpy` supports high-level operations, but this convenience comes at
the cost of intermediate, dynamically-allocated arrays.

Moreover, this pattern is inefficient from a memory perspective:
during execution, blocks of data are moved back and forth
between the RAM and the different CPU caches several times, making
suboptimal use of the caches.

For instance, consider this minimalistic toy example:
```python
import numpy as np

A = np.random.rand(100_000_000)
B = np.random.rand(100_000_000)

X = np.exp(A + B)
```

The following is performed:

- a first temporary array is allocated for `A + B`
- a second array is allocated to store `np.exp(A + B)` and
the first temporary array is discarded

This temporary allocation makes the implementation suboptimal
as memory allocation on the heap is slow.

Furthermore, high-level operations on `X` come with more data
moves between the RAM and the CPU than needed to compute the
elements of `X` and hardly make use of the memory hierarchy
and the size of the caches.

### No "bare-metal" data-structures

The Python ecosystem comes with a few high-level containers
such as numpy arrays, and pandas DataFrames.

Contrary to other languages' standard libraries (like C++'s),
no "bare-metal" data structures -- such as heaps, or
memory-contiguous resizable buffers (as implemented in C++ by
[`std::priority_queue`](https://en.cppreference.com/w/cpp/container/priority_queue)
and [`std::vector`](https://en.cppreference.com/w/cpp/container/vector)) --
are available to implement some algorithms efficiently,
from both a computational-complexity and a technical perspective.
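Python's standard library does ship a binary heap ([`heapq`](https://docs.python.org/3/library/heapq.html)), but it operates on lists of boxed `PyObject`s rather than on a contiguous buffer of raw scalars, so every push and pop pays interpreter overhead -- a sketch of a $k$-smallest query with it:

```python
import heapq
import random

random.seed(0)
values = [random.random() for _ in range(10_000)]

# heapq works on a plain Python list of boxed floats: correct, but
# each comparison goes through PyObject dispatch.
k = 5
k_smallest = heapq.nsmallest(k, values)

assert k_smallest == sorted(values)[:k]
```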

## Cython: combining the conciseness of Python and the speed of C

In brief, Cython allows transpiling a superset of Python to C code, and allows using code written in C or C++, which makes it possible to bypass some of CPython's internals. Moreover, Cython allows using [OpenMP](https://www.openmp.org/specifications/), an API that provides lower-level parallelism primitives for implementations written in C or Fortran[ref]For more information on Cython, see [its documentation](https://cython.readthedocs.io/en/latest/).[/ref].

In most cases, the features provided by Cython are sufficient to reach optimal implementations for many scientific algorithms for which static task scheduling -- at the C level, via OpenMP -- is the most natural and efficient approach.
Moreover, its syntax makes the language expressive enough to get nearly optimal performance while keeping the instructions short and concise -- a real advantage for developers coming from Python who are looking for performance, and a relief for C or C++ developers[ref]Compared to C or C++, Cython is also less verbose and integrates with Python build systems more easily.[/ref].

As such, many algorithms in `scikit-learn` are implemented in Cython for performance, some of which use OpenMP when possible. This is for instance the case of `KMeans`, which was initially written in Python using numpy and was rewritten in Cython by Jérémie du Boisberranger, improving the execution time by a factor of 5 for this algorithm[ref]For more information about `KMeans`, see the original contribution,
[`scikit-learn#11950`](https://github.com/scikit-learn/scikit-learn/pull/11950), and [this blog
post](https://scikit-learn.fondation-inria.fr/implementing-a-faster-kmeans-in-scikit-learn-0-23-2/).[/ref].
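As an illustration only -- this is a hypothetical sketch, not code from scikit-learn -- a Cython kernel parallelized with OpenMP via `prange` might look like this (it releases the GIL and, when compiled with OpenMP enabled, schedules the outer loop across threads):

```cython
import numpy as np
from cython.parallel import prange

def row_sums(double[:, ::1] X):
    cdef Py_ssize_t i, j
    cdef Py_ssize_t n = X.shape[0]
    cdef Py_ssize_t d = X.shape[1]
    out_arr = np.zeros(n)
    cdef double[::1] out = out_arr
    # The outer loop runs without the GIL and is statically scheduled
    # across OpenMP threads at the C level.
    for i in prange(n, nogil=True):
        for j in range(d):
            out[i] += X[i, j]
    return out_arr
```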

In the following posts, the case of $k$-nearest neighbors search -- the base routine
for `KNeighborsClassifier`, `KNeighborsRegressor`, and other scikit-learn interfaces -- is covered,
and a new Cython implementation is proposed.

---

## Notes

`_posts/sklearn-perf-knn.md`
Title: Performance and scikit-learn (2/4)
Date: 2021-12-17
Category: scikit-learn
Slug: sklearn-perf-knn
Lang: en
Authors: Julien Jerphanion
Summary: Hardware scalability issue: the k-neighbors search example
Status: Published

## $k$-nearest neighbors search in scikit-learn

$k$-nearest neighbors search is at the base of many implementations used within scikit-learn.

For instance, it is used in Affinity Propagation, BIRCH, Mean Shift, OPTICS,
Spectral Clustering, TSNE, `KNeighborsRegressor`, and `KNeighborsClassifier`.

Whilst many libraries implement approximate versions of $k$-nearest neighbors search to speed up
execution time[ref]Approximate nearest neighbors search algorithms come in many different
flavors, and there's even [a benchmark suite comparing them!](https://ann-benchmarks.com/)[/ref], scikit-learn's implementation aims at returning the exact $k$-nearest neighbors.


## Computing chunks of the distance matrix

The general steps for $k$-nearest neighbors search are:

- Compute $\mathbf{D}_d(\mathbf{X}, \mathbf{Y})$, the distance matrix between the vectors of two
arrays $\mathbf{X}$ and $\mathbf{Y}$.
- Reduce rows of $\mathbf{D}_d(\mathbf{X}, \mathbf{Y})$ appropriately for the given algorithm:
for instance, the adapted reduction for $k$-nn search is to return, for each row, the indices of the $k$ smallest values.
In what follows, we call this reduction $\texttt{argkmin}$.
- Perform extra operations on results of the reductions (here sort values).
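The steps above can be sketched in plain numpy (a toy version, not scikit-learn's actual code):

```python
import numpy as np

def argkmin(X, Y, k):
    # Step 1: squared euclidean distance matrix between rows of X and Y.
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    # Step 2: reduce each row to the indices of its k smallest values.
    idx = np.argpartition(D, k, axis=1)[:, :k]
    # Step 3: extra operation -- sort those k neighbors by distance.
    order = np.take_along_axis(D, idx, axis=1).argsort(axis=1)
    return np.take_along_axis(idx, order, axis=1)

rng = np.random.default_rng(0)
X = rng.random((5, 3))
Y = rng.random((20, 3))
neighbors = argkmin(X, Y, k=3)
assert neighbors.shape == (5, 3)
```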

Generally, one does not compute $\mathbf{D}_d(\mathbf{X}, \mathbf{Y})$ entirely because its
space complexity is $\Theta(n_x \times n_y)$. In practice,
$\mathbf{D}_d(\mathbf{X}, \mathbf{Y})$ does not fit in RAM for datasets of realistic size.

Thus, in practice, one computes chunks of this distance matrix and reduces them directly.
This is what was performed as of scikit-learn 1.0[ref][`KNearestNeighbors.kneighbors`](https://github.com/scikit-learn/scikit-learn/blob/c762c407873b8d6417b1c2ff78d19d82550e48d3/sklearn/neighbors/_base.py#L650) is the interface to look for.[/ref].
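A chunked variant of the toy implementation -- again a sketch, not scikit-learn's code -- only ever materializes a `chunk_size` × $n_y$ block of distances at a time:

```python
import numpy as np

def kneighbors_chunked(X, Y, k, chunk_size=100):
    n_x = X.shape[0]
    result = np.empty((n_x, k), dtype=np.intp)
    # Only a chunk_size x n_y block of distances exists at any time,
    # instead of the full n_x x n_y matrix.
    for start in range(0, n_x, chunk_size):
        stop = min(start + chunk_size, n_x)
        D_chunk = ((X[start:stop, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        idx = np.argpartition(D_chunk, k, axis=1)[:, :k]
        order = np.take_along_axis(D_chunk, idx, axis=1).argsort(axis=1)
        result[start:stop] = np.take_along_axis(idx, order, axis=1)
    return result

rng = np.random.default_rng(42)
X = rng.random((250, 4))
Y = rng.random((500, 4))
nn = kneighbors_chunked(X, Y, k=2)
assert nn.shape == (250, 2)
```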


## Current issues

The current implementation relies on a general parallelization scheme using higher-level functions with
[`joblib.Parallel`](https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).

Technically, this is not the most efficient approach: working at this level with views on numpy arrays moves
large chunks of data back and forth several times between the RAM and the CPUs' caches, hardly
makes use of the caches, and allocates temporary results.

Moreover, the cost of manipulating those chunks of data in the CPython interpreter causes non-negligible
overhead, because they are associated with Python objects bound to the
Global Interpreter Lock for reference counting.

Hence, this does not allow getting proper _hardware scalability_: an ideal parallel
implementation of an algorithm would run close to $n$ times faster when running on
$n$ threads on $n$ CPU cores compared to sequentially on $1$ thread.

For instance, the current implementation of $k$-nearest neighbors search of scikit-learn
cannot efficiently leverage all the available CPUs on a machine -- as shown by the figure below.

![Scalability of `kneighbors` as of scikit-learn 1.0](https://user-images.githubusercontent.com/13029839/155144242-6c041729-154b-47aa-9069-3a7d26deef5a.svg)

When using $8$ threads, it only runs about twice as fast as the sequential implementation,
and adding more threads and CPUs beyond $8$ does not improve performance.

In the next blog post, we go over the design of a new implementation that improves the scalability of $k$-nn search.

---

## Notes