
Add CSD #125

Open
7 tasks
hameerabbasi opened this issue Mar 4, 2018 · 33 comments
Labels
discussion enhancement Indicates new feature requests

Comments

@hameerabbasi
Collaborator

hameerabbasi commented Mar 4, 2018

CSD stands for Compressed Sparse Dimensions. It is a multidimensional generalization of CSR, CSC and COO.

CSD has a few "compressed" dimensions and a few "uncompressed" dimensions. The "compressed" dimensions are linearized and go into indptr similar to CSR/CSC. The "uncompressed" dimensions (corresponding to indices in CSR/CSC, but there can be multiple) will be stored in coords.

The plan is to:

  • Write CSD.
  • Test it.
  • Make COO and others inherit from it.

Task checklist:

Implement CSD (parametrize tests over compressed axes):

  • Element-wise
  • Reductions
  • Slicing
  • Change compressed dimensions

Post-Implementation (parametrize tests over types):

  • Make COO inherit from CSD
  • Make CSR inherit from CSD
  • Make CSC inherit from CSD

Compressed Sparse Dimensions — Format Specification

Compressed Sparse Dimensions (CSD) is a multidimensional sparse array storage format.

Let’s say that the number of dimensions is ndim and compressed_axes represents the axes/dimensions that are compressed, as a tuple.

It contains three main arrays, one of them two-dimensional:

  • data (one-dimensional, shape (nnz,))
  • coords (two-dimensional, shape (ndim - len(compressed_axes), nnz))
  • indptr (one-dimensional, shape (reduce(operator.mul, (shape[a] for a in compressed_axes), 1) + 1,))
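As a concrete sketch of the shapes involved, consider a hypothetical (4, 5, 6) array with axes 0 and 1 compressed:

```python
import operator
from functools import reduce

import numpy as np

# Hypothetical example: a (4, 5, 6) array with axes 0 and 1 compressed.
shape = (4, 5, 6)
ndim = len(shape)
compressed_axes = (0, 1)

arr = np.zeros(shape)
arr[0, 0, 1] = 1.0
arr[2, 3, 4] = 2.0
arr[3, 1, 0] = 3.0
nnz = np.count_nonzero(arr)

data_shape = (nnz,)                                # (3,)
coords_shape = (ndim - len(compressed_axes), nnz)  # (1, 3)
# One pointer per linearized compressed index, plus one: 4 * 5 + 1 = 21.
indptr_shape = (reduce(operator.mul, (shape[a] for a in compressed_axes), 1) + 1,)
```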

data/coords/indptr from an ndarray

To relate this to a normal NumPy array, we can construct data and coords directly. Let’s say arr is the NumPy array to convert.

non_compressed_axes = tuple(a for a in range(ndim) if a not in set(compressed_axes))
coo_coords = np.asarray(np.nonzero(arr))  # shape (ndim, nnz)
data = arr[tuple(coo_coords)]
axes_order = compressed_axes + non_compressed_axes
if axes_order != tuple(range(ndim)):
    # Re-sort the entries so that the compressed axes vary slowest.
    lin_coo_coords = linear_loc(coo_coords[np.asarray(axes_order)], tuple(arr.shape[a] for a in axes_order))
    idx = np.argsort(lin_coo_coords)
    data = data[idx]
    coo_coords = coo_coords[:, idx]

coords = coo_coords[np.asarray(non_compressed_axes)].astype(np.min_scalar_type(max(arr.shape[a] for a in non_compressed_axes) - 1))
compressed_shape = tuple(arr.shape[a] for a in compressed_axes)
compressed_linear_shape = reduce(operator.mul, compressed_shape, 1)
compressed_linearized_coords = linear_loc(coo_coords[np.asarray(compressed_axes)], compressed_shape)
indptr = np.empty(compressed_linear_shape + 1, dtype=np.min_scalar_type(len(data)))  # entries run up to nnz
indptr[0] = 0
np.cumsum(np.bincount(compressed_linearized_coords, minlength=compressed_linear_shape), out=indptr[1:])
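The steps above can be collected into a runnable sketch. This assumes linear_loc means row-major linearization (equivalent to np.ravel_multi_index), and the function name dense_to_csd is illustrative; dtype minimization is omitted for brevity:

```python
import operator
from functools import reduce

import numpy as np


def linear_loc(coords, shape):
    # Assumed semantics: row-major linearization of multi-indices,
    # equivalent to np.ravel_multi_index.
    return np.ravel_multi_index(tuple(coords), shape)


def dense_to_csd(arr, compressed_axes):
    """Build the (data, coords, indptr) triple for a dense array (sketch)."""
    ndim = arr.ndim
    non_compressed_axes = tuple(a for a in range(ndim) if a not in set(compressed_axes))
    coo_coords = np.asarray(np.nonzero(arr))  # shape (ndim, nnz)
    data = arr[tuple(coo_coords)]

    axes_order = compressed_axes + non_compressed_axes
    if axes_order != tuple(range(ndim)):
        # Re-sort the entries so the compressed axes vary slowest.
        lin = linear_loc(coo_coords[np.asarray(axes_order)],
                         tuple(arr.shape[a] for a in axes_order))
        idx = np.argsort(lin)
        data = data[idx]
        coo_coords = coo_coords[:, idx]

    coords = coo_coords[np.asarray(non_compressed_axes)]
    compressed_shape = tuple(arr.shape[a] for a in compressed_axes)
    compressed_linear_shape = reduce(operator.mul, compressed_shape, 1)
    compressed_linearized_coords = linear_loc(
        coo_coords[np.asarray(compressed_axes)], compressed_shape)

    # Counts per linearized compressed index, cumulated into pointers.
    indptr = np.zeros(compressed_linear_shape + 1, dtype=np.intp)
    np.cumsum(np.bincount(compressed_linearized_coords,
                          minlength=compressed_linear_shape),
              out=indptr[1:])
    return data, coords, indptr
```

For a 2-D array with compressed_axes = (0,), this reproduces the familiar CSR data/indices/indptr triple; with compressed_axes = (1,), the CSC one.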

Interpreting the arrays

For CSR, the interpretation is simple: indices represents the columns, data the respective data. indptr[i]:indptr[i+1] represents the slice corresponding to row i for which columns and data are valid.
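For instance, for a hypothetical 3×3 matrix in CSR form:

```python
import numpy as np

# CSR encoding of [[10,  0, 20],
#                  [ 0,  0,  0],
#                  [ 0, 30,  0]]
data = np.array([10, 20, 30])
indices = np.array([0, 2, 1])    # column of each stored value
indptr = np.array([0, 2, 2, 3])  # row i owns the slice indptr[i]:indptr[i + 1]

i = 0
cols = indices[indptr[i]:indptr[i + 1]]  # columns of row 0
vals = data[indptr[i]:indptr[i + 1]]     # values of row 0
```

An empty row simply yields an empty slice: indptr[1] == indptr[2] above, so row 1 has no stored entries.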

We can extend this definition to CSD as well. coords represent the non-compressed axes, and data the respective data. indptr can be interpreted the same way, replacing the word “row” with “linearized compressed axes index”, but it turns out that there is a simpler way to work with things.

Alternative representation for indptr

Consider the following two arrays produced from indptr:

starts = indptr[:-1].reshape(tuple(shape[a] for a in compressed_axes))
stops = indptr[1:].reshape(tuple(shape[a] for a in compressed_axes))

Note that these are views; no copies are made. This gives a simpler and more intuitive interpretation of indptr: for every possible index into the compressed axes, indexing into starts gives the start of the slice for that index, and stops gives its stop.

The “multiple-COO” view

For every value in starts/stops, coords[:, start_val:stop_val] and data[start_val:stop_val] can be seen as a mini-COO array, and can thus have any COO-related algorithms applied to it.
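As an illustration, here is a hypothetical (2, 2, 3) array with compressed_axes = (0, 1), its CSD arrays written out by hand:

```python
import numpy as np

# Hand-written CSD of a hypothetical (2, 2, 3) array with nonzeros:
#   dense[0, 1, 2] = 1, dense[1, 0, 0] = 2, dense[1, 0, 1] = 3
shape = (2, 2, 3)
compressed_axes = (0, 1)
data = np.array([1, 2, 3])
coords = np.array([[2, 0, 1]])      # the remaining axis (axis 2)
indptr = np.array([0, 0, 1, 3, 3])  # length 2 * 2 + 1

starts = indptr[:-1].reshape(tuple(shape[a] for a in compressed_axes))
stops = indptr[1:].reshape(tuple(shape[a] for a in compressed_axes))

# The entries with compressed index (1, 0) form a mini-COO array:
s, e = starts[1, 0], stops[1, 0]
mini_coords, mini_data = coords[:, s:e], data[s:e]
```

Here mini_coords/mini_data describe the one-dimensional slice dense[1, 0, :] as a stand-alone COO array.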

@hameerabbasi hameerabbasi added enhancement Indicates new feature requests discussion labels Mar 4, 2018
@hameerabbasi hameerabbasi added this to the future milestone Mar 21, 2018
@mrocklin
Contributor

I mentioned this effort to some Scikit-Learn developers (notably @ogrisel) and they seemed interested in this. I think that in particular they want an object that is compatible with the CSR and CSC scipy objects in that it has the data, indptr, and indices attributes (these are necessary for use within their codebase) and that also satisfies the numpy.ndarray interface well enough to be used within dask array.

@hameerabbasi do you have thoughts on how best to accomplish this other than what is stated above? Do you have any plans to work on this near-term? (no pressure, just curious)

@hameerabbasi
Collaborator Author

hameerabbasi commented Apr 18, 2018

Immediately, I plan to work on #114 and #124, as they're relatively easy to do. But since this is relatively important, yes, I plan to work on it soon-ish (right after that, ideally).

As indices in the case of N-D will need to store more than one dimension, I was thinking of naming it coords and making it 2-D (although it's essentially the same).

I've thought a bit about it for the past week or so, but I can't seem to find a better way to proceed than what's already in the OP. I'd like to unify implementations as much as possible, if the performance cost isn't too high. Of course I could hack together 2-D limited CSR/CSC fairly quickly, but I want to keep things general as much as possible, and do things once (and properly).

@stefanv also seemed interested in joining the development team and getting this into SciPy (dependency or merging) as soon as possible; I'm assuming CSD is essential to that.

On another note, fairly often, I feel my thought processes being limited by what Numba can currently do, and I have to adapt algorithms to match. Examples: no support for nested lists (pushed back to Numba 0.39 from 0.38; I have a strong suspicion this will be needed for CSD) or dict (currently tagged 1.0).

@hameerabbasi
Collaborator Author

The reason I feel it is this way is simple: In CSD, for every value of indptr, we get a new COO object in essence. So Numba-fication will become important even where it wasn't before.

@stefanv

stefanv commented Apr 18, 2018

(Aside: my interest is in getting SciPy to support sparse arrays, so that packages like this one can depend on that structure internally, and so that we can eventually remove the Matrix class from NumPy).

Can you clarify in the description what "CSD" stands for?

@hameerabbasi
Collaborator Author

Thanks, updated the issue with what it stands for and a short description.

@hameerabbasi
Collaborator Author

Just a bit of background: A lot of machine learning algorithms use CSR/CSC as their primary input, as do BLAS/LAPACK/ARPACK etc. If we want to truly deprecate scipy.sparse.spmatrix and therefore np.matrix, we'll need this.

Overall, CSD (which CSR/CSC/COO will be thin inheritances of) and DOK will cover at least 80% of all use-cases (DOK for mutable uses and CSD for immutable ones).

The remaining 20% will need block formats BSD/BDOK (not a particularly hard extension) and LIL/DIA (which can also be in block format).

Details of most formats are written up in this SPEP.

@stefanv

stefanv commented Apr 18, 2018

About the CSD format, is this described in any publication, or was this designed here? Has it been compared to alternative storage options, such as ECCS (https://ieeexplore.ieee.org/abstract/document/1252859/)? Does it outperform coordinate storage, combined with a good query strategy?

@hameerabbasi
Collaborator Author

I'll confess I haven't done a literature review on this before starting down this path (other than Wikipedia). But with a short look, I kind of got the gist of how ECCS will work.

Coordinate storage and the query strategy will both be superior in ECCS. The downside is that it doesn't generalize CSR/CSC/COO, and most current algorithms (machine learning/LAPACK/ARPACK/BLAS) are implemented and designed around CSR/CSC, so implementing those makes more sense practically.

I suppose CSD is a generalization of CRS/CCS in literature, except that you can compress any combination of axes rather than just rows/columns after linearizing it.

Of course, later, we could also add these more sophisticated storage methods, but given the libraries that already exist around CSR/CSC, that may be unwise at this stage.

Another option would be to implement GCRS/GCCS, but that will present issues in several indexing/reduction tasks unless we convert the indices in real-time all the time.

@mrocklin
Contributor

I think that @stefanv's question is good at highlighting that perhaps it would be good to ask around a bit about possible sparse storage structures, and learn about their pros and cons, before investing a lot of time into a new approach.

@mrocklin
Contributor

On another note, fairly often, I feel my thought processes being limited by what Numba can currently do, and I have to adapt algorithms to match. Examples: no support for nested lists (pushed back to Numba 0.39 from 0.38; I have a strong suspicion this will be needed for CSD) or dict (currently tagged 1.0).

If you're interested in walking back from the Numba decision that's fine with me. We should probably discuss that in another issue though.

It might be useful if you included more detail about what you're thinking both in terms of the CSD data structure and key algorithms that might be useful for it and how they would work. Perhaps others here could recommend alternatives.

@stefanv

stefanv commented Apr 19, 2018

@hameerabbasi
Collaborator Author

hameerabbasi commented Apr 19, 2018

CRS/CCS described in both papers is basically what I'm calling CSD, with the catch that you can compress any dimension(s) of your choosing.

The second one (GCRS/GCCS) seems like what I describe as CSD, but with linearized indices, which improves compression efficiency at the cost of frequent conversion between the linear and full coordinate forms (think of np.unravel_index), practically speaking; it also compresses only one axis rather than multiple axes, which will hinder some applications such as tensordot.
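The conversion cost in question is the round trip between the two coordinate forms, e.g.:

```python
import numpy as np

shape = (4, 5, 6)
# Full (COO-style) coordinates of three nonzeros, one array per axis.
multi = (np.array([0, 2, 3]), np.array([1, 3, 4]), np.array([5, 0, 2]))

lin = np.ravel_multi_index(multi, shape)  # linearize for GCRS/GCCS-style storage
back = np.unravel_index(lin, shape)       # recover full coordinates when needed
```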

There is a trade-off here between needing conversions for various purposes and compression efficiency. I'm willing to implement CSD the "GCRS/GCCS" way (linearized indices) as well. It would suit almost all purposes just as easily, including generalizing to CSR/CSC with ease (but not COO).

However, it will make certain algorithms more complex. Think of binary search over a single axis (needed for indexing): It'll need a custom implementation that converts and then converts back to be O(log n).

The ECRS/ECCS schemes seem nice on paper, but algorithms haven't been written for them. The same goes for the EKMR from which they're derived: no one implements them, practically speaking.

Of course, it can always be done later, but (currently) the most practical way forward is with (linearized or non-linearized) CSD.

@hameerabbasi
Collaborator Author

I'm blocked on this by numba/numba#2560 or #126.

@mrocklin
Contributor

mrocklin commented Apr 23, 2018 via email

@hameerabbasi
Collaborator Author

I've updated the issue description with a full format specification for CSD. In the coming days I'll add algorithms for reduction, element-wise, etc.

@oleksandr-pavlyk

You may find the article by Tamara Kolda, "Tensor Decompositions and Applications", SIAM Review, Vol. 51, No. 3, pp. 455–500, of interest.

@ogrisel
Member

ogrisel commented Jul 14, 2018

I think that @cournape had also done a literature review for scipy on this topic in the past. Maybe he could also share his input on what are the suitable generalizations of compressed sparse matrix representations for n-d sparse arrays.

@oleksandr-pavlyk

Compressed sparse fiber is another recent format to consider: http://glaros.dtc.umn.edu/gkhome/node/1177

@ShadenSmith
Contributor

Hi there, CSF author here. There's a newer CSF paper that makes the actual data structure a little more obvious (Figure 2).

CSF is a good format for performance and parallelism, but the implementation is more involved than COO or similar. I'm not sure if taco would be easy to interface with from python, but it's great for generating performant sparse tensor kernels. It uses CSF under the hood (and I believe other sparse formats as of recently).

@hameerabbasi
Collaborator Author

Unfortunately requiring a compiler in this library is a bit too involved, especially as we're deeply tied to NumPy machinery and compiling for every ufunc is out of scope. However, implementing the format itself is definitely something I'm looking forward to. If you have resources about how to do arbitrary computations such as indexing/slicing, element-wise operations, reductions that'd be great.

@ShadenSmith
Contributor

If you have worked with CSR before for matrices, you can think of CSF as a higher-order interpretation of that. Essentially, one rowptr points into the next rowptr to encode the sparsity pattern. The fun part of CSF is that it actually reduces the number of memory accesses and flops for lots of sparse tensor kernels.
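A minimal sketch of that nesting for a hypothetical 3-way tensor (the array names and layout here are illustrative, not taken from SPLATT or the CSF paper):

```python
import numpy as np

# Hypothetical CSF-like encoding of a 3-way tensor with nonzeros
# (0,0,1)=1.0, (0,2,0)=2.0, (1,1,1)=3.0, (1,1,2)=4.0.
fids0 = np.array([0, 1])        # distinct mode-0 indices
fptr0 = np.array([0, 2, 3])     # each fiber points into the mode-1 level
fids1 = np.array([0, 2, 1])     # mode-1 indices, grouped per mode-0 fiber
fptr1 = np.array([0, 1, 2, 4])  # each fiber points into the mode-2 level
fids2 = np.array([1, 0, 1, 2])  # mode-2 indices
vals = np.array([1.0, 2.0, 3.0, 4.0])

# Walk the pointer tree back down to COO-style (i, j, k, value) entries.
entries = []
for a, i in enumerate(fids0):
    for b in range(fptr0[a], fptr0[a + 1]):
        j = fids1[b]
        for c in range(fptr1[b], fptr1[b + 1]):
            entries.append((i, j, fids2[c], vals[c]))
```

The saving comes from sharing: the index i = 1 is stored once even though it covers two nonzeros, and the same applies recursively at every level.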

CSF is a mode-oriented structure. Similar to CSR, "row-wise" operations are more efficient than "column-wise" ones.

The best example I have is for MTTKRP, which is like a higher-order sparse matrix-vector multiplication. The nfactors there in the code is how many vectors you are multiplying. That specific function is hardcoded for three-way tensors...you'll also find the n-d case implemented in the same file (it uses a stack instead of hardcoded nested loops).

@daletovar
Contributor

I recently started working on this. It's basically just GCRS/GCCS, but I opted to represent the rows of the underlying matrix as the first several axes and used the remaining axes to represent the columns. This is in line with how NumPy reshapes arrays, so it made the most sense. I like this format a lot for a few reasons: 1) adding additional axes only minimally affects the compression ratio, 2) indexing (in principle) should be faster, especially for 2-D arrays, and 3) a 2-D array is just a CSR/CSC matrix with the data, indices, and indptr attributes. Consequently, you get scipy/scipy#8162 for free. These arrays should work well with anything that expects either a scipy.sparse matrix or an np.ndarray. Good candidates for all of this are scikit-learn, dask, and xarray. I only just started the project, but I like the idea of adding this to pydata/sparse when it's ready.
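The row/column split described here can be sketched with NumPy's own raveling (the axis split and array names below are chosen for illustration):

```python
import numpy as np

shape = (2, 3, 4, 5)
n_row_axes = 2  # first axes become the rows of the underlying 2-D matrix
row_shape, col_shape = shape[:n_row_axes], shape[n_row_axes:]

idx = (1, 2, 3, 4)
row = np.ravel_multi_index(idx[:n_row_axes], row_shape)
col = np.ravel_multi_index(idx[n_row_axes:], col_shape)

# Consistent with NumPy reshape semantics:
arr = np.arange(np.prod(shape)).reshape(shape)
assert arr.reshape(np.prod(row_shape), np.prod(col_shape))[row, col] == arr[idx]
```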

@hameerabbasi
Collaborator Author

Hi @daletovar! I'm glad; I lazed a bit too long on this, but I'm happy someone is really working on it.

We don't need a complete implementation, so feel free to work within PyData/Sparse; but I know some things like coverage and documentation will slow you down, so it may be better to work on a fork.

Feel free to reach out to me personally via email if you’d like to coordinate and concentrate efforts. I’d love to work with you on this! From your last commits, you seem to have a great handle on PyData/Sparse!

@daletovar
Contributor

Thanks @hameerabbasi, I'd love to coordinate efforts on this.

@hameerabbasi
Collaborator Author

Feel free to reach out to me at habbasi [at] quansight [dot] com. 😄 Let's set up a meeting!

@daletovar
Contributor

Perfect, that sounds great!

@daletovar daletovar mentioned this issue Jun 18, 2019
@ivirshup
Contributor

I'm really looking forward to this! Is there more work that needs to be done before the GCXS format can be in a release?

@daletovar
Contributor

@ivirshup I'm not sure exactly when, but there are some things I'd like to add (faster slicing, stack, concatenate) plus some documentation. I don't think it'll be too long though.

@ivirshup
Contributor

Sounds good! Would you mind if I took a crack at implementing any of those in the meantime?

@daletovar
Contributor

That would be fantastic! I was going to add stack and concatenate in the next couple of days. Faster slicing would be a huge help. We also eventually need to add elementwise ops, reductions, dot, tensordot, and matmul. Any contributions on any of that would be greatly appreciated.

@luk-f-a

luk-f-a commented Jun 27, 2020

Hi! Was this closed by #258?

@hameerabbasi
Collaborator Author

It still isn't complete, nor is it public API, so I haven't closed it yet.

@BorisTheBrave

I found the idea of CSF intriguing, but poorly explained in general, so I have made my own attempt in a blog post. I'd love a review from @ShadenSmith.
I haven't compared CSF to CSD or GCRS, though.

@hameerabbasi hameerabbasi removed this from the future milestone Jan 8, 2024