Create MutableMapping for automatic compression #3656
Also cc @prasunanand and @andersy005, who have both asked about good first issues in the past. I think that this would be fun.
@mrocklin I would love to work on it. :)
Great!
Hi, I need a little help. Do I need to modify the logic in …? Does …? cc @jrbourbeau
You don't have to use the structure I started with. I encourage you to think about this on your own and how you would design it. If you blindly follow my design you probably won't develop a high-level understanding of the problem. What I put up there was just an idea, not a fully formed one; whoever solves this task will need to think about the problem much more than I did.
To add some more context here, this is an object that would replace the MutableMapping currently used in `Worker.data`. So for example we would want something like the following to work:

```python
x = np.ones(1000000)  # a large but compressible piece of data

d = MyMapping()
d["x"] = x    # put Python object into d
out = d["x"]  # get the object back out
assert str(out) == str(x)  # the two objects should match

# assuming here that the underlying bytes are stored in something like
# a `.storage` attribute, but this isn't required
# we check that the amount of actual data stored is small
assert sum(map(len, d.storage.values())) < x.nbytes
```

In Dask one would test this by putting it into a Worker:

```python
@pytest.mark.asyncio
async def test_compression():
    async with Scheduler() as s:
        async with Worker(s.address, data=MyMapping) as worker:
            async with Client(s.address, asynchronous=True) as c:
                x = da.ones((10000, 10000))
                y = await x.persist()  # put data in memory
                y = await (x + x.T).mean().persist()  # do some work
                assert sum(map(len, worker.data.storage.values())) < x.nbytes
```

(None of the code here was tested and may have bugs. I wouldn't trust it too much.)
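For concreteness, here is a minimal sketch of what such a `MyMapping` could look like, built on `zlib` and `pickle` from the standard library (Dask's own serialization and compression machinery would likely be a better fit in practice). The class name and the `.storage` attribute simply follow the conventions assumed in the example above:

```python
import pickle
import zlib
from collections.abc import MutableMapping


class MyMapping(MutableMapping):
    """Store values compressed, decompress them on access."""

    def __init__(self):
        self.storage = {}  # maps key -> compressed bytes

    def __setitem__(self, key, value):
        blob = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
        self.storage[key] = zlib.compress(blob)

    def __getitem__(self, key):
        return pickle.loads(zlib.decompress(self.storage[key]))

    def __delitem__(self, key):
        del self.storage[key]

    def __iter__(self):
        return iter(self.storage)

    def __len__(self):
        return len(self.storage)
```

Compression here happens eagerly on every `__setitem__`, trading CPU time for memory as described in the issue; a real implementation might compress only values above a size threshold.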
The mutable mapping looks like something that would be well suited for https://github.com/dask/zict. The integration into distributed would then look similar to how spill-to-disk is implemented at the moment.
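As a rough illustration of that suggestion, `zict.Func` wraps any mapping with a dump/load pair, much like the existing spill-to-disk setup composes zict mappings. The `zlib` and `pickle` choices below are just placeholders for whatever compression scheme is ultimately used:

```python
import pickle
import zlib

import zict


def dump(value):
    # serialize and compress on the way in
    return zlib.compress(pickle.dumps(value))


def load(blob):
    # decompress and deserialize on the way out
    return pickle.loads(zlib.decompress(blob))


# a MutableMapping that transparently compresses everything stored in it
compressed = zict.Func(dump, load, dict())

compressed["x"] = list(range(100_000))
assert compressed["x"][:3] == [0, 1, 2]
```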
PR #3702 seems to be going in the right direction. Probably the best place to move this forward at the moment.
In some workloads with highly compressible data we would like to trade off some computation time for more in-memory storage automatically. Dask workers store data in a MutableMapping (the superclass of `dict`). So in principle all we would need to do is make a MutableMapping subclass that overrides the `__getitem__` and `__setitem__` methods to compress and decompress data on demand.

This would be an interesting task for someone who wants to help Dask and learn some internals but doesn't know a lot just yet. I'm marking this as a good first issue. It's an interesting and useful task that doesn't require deep incidental Dask knowledge.

Here is a conceptual prototype of such a `MutableMapping`. This is completely untested, but maybe gives a sense of how I think about this problem. It's probably not ideal, though, so I would encourage others to come up with their own design.

This came up in #3624. cc @madsbk and @jakirkham from that PR. cc also @eric-czech, who was maybe curious about automatic compression/decompression.
People looking at compression might want to look at and use Dask's serialization and compression machinery in `distributed.protocol` (maybe start by looking at the `dumps`, `serialize`, and `maybe_compress` functions).
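For anyone exploring that route, here is a hedged sketch of how `serialize`/`deserialize` from `distributed.protocol` could back a compressing mapping. The helper names are made up for illustration, and `maybe_compress` (in `distributed.protocol.compression`) could replace the plain `zlib` calls, though its exact signature should be checked against the installed version:

```python
import zlib

import numpy as np
from distributed.protocol import deserialize, serialize


def compress_value(value):
    """Serialize a Python object with Dask's machinery, then compress each frame."""
    header, frames = serialize(value)
    return header, [zlib.compress(bytes(frame)) for frame in frames]


def decompress_value(header, frames):
    """Reverse of compress_value: decompress the frames, then deserialize."""
    return deserialize(header, [zlib.decompress(frame) for frame in frames])


x = np.ones(1_000_000)
header, frames = compress_value(x)
assert sum(map(len, frames)) < x.nbytes  # an array of ones compresses very well
assert (decompress_value(header, frames) == x).all()
```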