
Create MutableMapping for automatic compression #3656

Open
mrocklin opened this issue Mar 28, 2020 · 8 comments · May be fixed by #3702

@mrocklin
Member

In some workloads with highly compressible data we would like to trade off some computation time for more in-memory storage automatically. Dask workers store data in a MutableMapping (the superclass of dict). So in principle all we would need to do is make a MutableMapping subclass that overrides the getitem and setitem methods to compress and decompress data on demand.

This would be an interesting task for someone who wants to help Dask and learn some of its internals but doesn't know a lot just yet. I'm marking this as a good first issue. This is an interesting and useful task that doesn't require deep knowledge of Dask internals.

Here is a conceptual prototype of such a MutableMapping. This is completely untested, but maybe gives a sense of how I think about this problem. It's probably not ideal though so I would encourage others to come up with their own design.

import collections.abc
from typing import Callable, Dict, Tuple

class TypeCompression(collections.abc.MutableMapping):
    def __init__(
        self,
        types: Dict[type, Tuple[Callable, Callable]],
        storage=dict,
    ):
        self.types = types  # map value type -> (compress, decompress)
        self.storage = collections.defaultdict(storage)  # one inner mapping per type

    def __setitem__(self, key, value):
        typ = type(value)
        if typ in self.types:
            compress, decompress = self.types[typ]
            value = compress(value)
        self.storage[typ][key] = value

    def __getitem__(self, key):
        for typ, d in self.storage.items():
            if key in d:
                value = d[key]
                break
        else:
            raise KeyError(key)

        if typ in self.types:
            compress, decompress = self.types[typ]
            value = decompress(value)

        return value

    def __delitem__(self, key):
        for d in self.storage.values():
            if key in d:
                del d[key]
                return
        raise KeyError(key)

    def __iter__(self):
        for d in self.storage.values():
            yield from d

    def __len__(self):
        return sum(map(len, self.storage.values()))
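To make the registry idea concrete, here is a self-contained sketch of how the `types` argument could be built and used on its own. The registry contents are purely illustrative (zlib standing in for whatever compressor one would actually pick):

```python
import zlib

# Hypothetical registry: map a value type to a (compress, decompress) pair.
types = {
    bytes: (zlib.compress, zlib.decompress),
}

compress, decompress = types[bytes]
payload = b"dask" * 25_000           # a large, highly compressible value
stored = compress(payload)

assert decompress(stored) == payload  # round-trips losslessly
assert len(stored) < len(payload)     # stored form is much smaller
```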

This came up in #3624 . cc @madsbk and @jakirkham from that PR. cc also @eric-czech who was maybe curious about automatic compression/decompression.

People looking at compression might want to look at and use Dask's serialization and compression machinery in distributed.protocol (maybe start by looking at the dumps, serialize, and maybe_compress functions).
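For readers who can't dig into distributed.protocol right away, the spirit of maybe_compress (compress only when the payload is big enough and actually shrinks) can be sketched with the stdlib alone. The function name, thresholds, and use of zlib below are all made up for illustration, not Dask's actual implementation:

```python
import zlib

def maybe_compress_sketch(payload: bytes, min_size: int = 10_000, min_ratio: float = 0.9):
    """Return (compression_name_or_None, bytes), compressing only when it pays off."""
    if len(payload) < min_size:
        return None, payload             # too small to be worth the CPU time
    compressed = zlib.compress(payload)
    if len(compressed) > min_ratio * len(payload):
        return None, payload             # barely shrinks; keep the original
    return "zlib", compressed

scheme, out = maybe_compress_sketch(b"ab" * 50_000)   # repetitive, compresses well
assert scheme == "zlib" and len(out) < 100_000

scheme, out = maybe_compress_sketch(b"tiny")          # below the size threshold
assert scheme is None and out == b"tiny"
```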

@mrocklin mrocklin added the good first issue Clearly described and easy to accomplish. Good for beginners to the project. label Mar 28, 2020
@mrocklin
Member Author

Also cc @prasunanand and @andersy005 who have both asked about good first issues in the past. I think that this would be fun.

@prasunanand
Contributor

@mrocklin I would love to work on it. :)

@mrocklin
Member Author

mrocklin commented Mar 28, 2020 via email

@prasunanand
Contributor

prasunanand commented Apr 9, 2020

Hi, I need a little help.

Do I need to modify the logic in dumps, loads ( link ) ?

Does types in TypeCompress refer to int, double, etc. or to snappy, blosc, lz4 etc. ? If it refers to int, double, etc where are the corresponding compressors and decompressors ?

cc @jrbourbeau

@mrocklin
Member Author

mrocklin commented Apr 9, 2020

Does types in TypeCompress refer to int, double, etc. or to snappy, blosc, lz4 etc. ?

You don't have to use the structure I started with. I encourage you to think about this on your own and how you would design it. If you blindly follow my design you probably won't develop a high-level understanding of the problem. What I put up there was just an idea, and not a fully formed one; whoever solves this task will need to think about the problem much more deeply than I did.

@prasunanand prasunanand linked a pull request Apr 12, 2020 that will close this issue
@mrocklin
Member Author

To add some more context here, this is an object that would replace the MutableMapping currently used in Worker.data. It would expect to receive any user generated Python object as a value. We would want to take those values, serialize, and maybe compress them when we put them into the underlying mapping.

So for example we would want something like the following to work:

import numpy as np

x = np.ones(1000000)  # a large but compressible piece of data

d = MyMapping()
d["x"] = x      # put a Python object into d
out = d["x"]    # get the object back out

assert str(out) == str(x)  # the two objects should match

# assuming here that the underlying bytes are stored in something like a
# `.storage` attribute, but this isn't required; we check that the amount
# of actual data stored is small
assert sum(map(len, d.storage.values())) < x.nbytes

In Dask, one would test this by putting it into a Worker:

import pytest
import dask.array as da
from distributed import Client, Scheduler, Worker

@pytest.mark.asyncio
async def test_compression():
    async with Scheduler() as s:
        async with Worker(s.address, data=MyMapping) as worker:
            async with Client(s.address, asynchronous=True) as c:
                x = da.ones((10000, 10000))
                y = await x.persist()  # put data in memory
                y = await (x + x.T).mean().persist()  # do some work
                assert sum(map(len, worker.data.storage.values())) < x.nbytes

(None of the code here was tested, and may have bugs. I wouldn't trust it too much)

@fjetter
Member

fjetter commented Apr 14, 2020

The mutable mapping looks like something which would be well suited for https://github.com/dask/zict. The integration into distributed would then look similar to how spill-to-disk is implemented at the moment.
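The zict suggestion fits naturally because zict provides mappings that apply a dump function on write and a load function on read. Here is a stdlib-only sketch of that wrapping pattern; the class below is hypothetical and only mirrors the idea, it is not zict's actual code:

```python
import collections.abc
import zlib

class FuncMapping(collections.abc.MutableMapping):
    """Apply `dump` on the way in and `load` on the way out,
    in the style of zict's function-wrapping mapping."""

    def __init__(self, dump, load, d):
        self.dump, self.load, self.d = dump, load, d

    def __setitem__(self, key, value):
        self.d[key] = self.dump(value)

    def __getitem__(self, key):
        return self.load(self.d[key])

    def __delitem__(self, key):
        del self.d[key]

    def __iter__(self):
        return iter(self.d)

    def __len__(self):
        return len(self.d)

m = FuncMapping(zlib.compress, zlib.decompress, {})
m["x"] = b"spam" * 10_000
assert m["x"] == b"spam" * 10_000        # values round-trip
assert len(m.d["x"]) < 40_000            # stored bytes are compressed
```

Spill-to-disk composes the same way: swap the inner dict for an on-disk mapping and the dump/load pair for serialization functions.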

@jakirkham
Member

The PR ( #3702 ) seems to be going in the right direction. It is probably the best place to move this forward at the moment.
