
[BUG] Severe performance degradation of SliceSampler for large buffers #2669

Closed
nicklashansen opened this issue Dec 19, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@nicklashansen

nicklashansen commented Dec 19, 2024

Describe the bug

Sampling speed with the replay buffer's SliceSampler is heavily dependent on buffer capacity: increasing the capacity renders the replay buffer practically unusable. I have compiled some timings of sampling speed for the TD-MPC2 codebase with varying numbers of stored transitions and total buffer capacities. All timings are for low-dimensional observations/actions and a relatively small batch size of 256 sequences, each of length 4. The issue seems directly related to the SliceSampler class, since I did not encounter it before SliceSampler was introduced.

Precise sample() timings:

5M transitions, 6M capacity: 0.0048s
5M transitions, 600M capacity: 0.0361s
346M transitions, 600M capacity: 1.6531s

To Reproduce

The replay buffer implementation is available here and mostly relies on torchrl features.

I'm also happy to create a minimal example that does not depend on the TD-MPC2 codebase if you think that would be helpful.
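
For illustration, a rough sketch of what such a standalone reproduction could look like (sizes scaled down here; this assumes torchrl's ReplayBuffer/LazyTensorStorage/SliceSampler API and identifies trajectories via an integer "episode" key, which may differ from the exact TD-MPC2 setup):

import time

import torch
from tensordict import TensorDict
from torchrl.data import LazyTensorStorage, ReplayBuffer
from torchrl.data.replay_buffers.samplers import SliceSampler

CAPACITY = 1_000_000        # scale toward 600M to reproduce the slowdown
NUM_TRANSITIONS = 500_000   # transitions actually stored
EP_LEN = 500                # assumed episode length for the fake data

# Sample 256 sequences of length 4, identifying trajectories by episode ID.
sampler = SliceSampler(slice_len=4, traj_key="episode")
rb = ReplayBuffer(
    storage=LazyTensorStorage(CAPACITY),
    sampler=sampler,
    batch_size=256 * 4,
)

data = TensorDict(
    {
        "obs": torch.randn(NUM_TRANSITIONS, 24),
        "action": torch.randn(NUM_TRANSITIONS, 6),
        "reward": torch.randn(NUM_TRANSITIONS),
        "episode": torch.arange(NUM_TRANSITIONS, dtype=torch.int64) // EP_LEN,
    },
    batch_size=[NUM_TRANSITIONS],
)
rb.extend(data)

start = time.perf_counter()
rb.sample()
print(f"sample() took {time.perf_counter() - start:.4f}s")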

System info

I use the conda environment available here: https://github.com/nicklashansen/tdmpc2/blob/main/docker/environment.yaml

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)
@nicklashansen nicklashansen added the bug Something isn't working label Dec 19, 2024
@vmoens
Contributor

vmoens commented Dec 19, 2024

Oh yeah that's pretty bad, let me see what I can do

Quick question: where do you store your data? By that I mean the trajectory count / done states.
I suspect that storing them on CUDA and compiling should help (?)

@vmoens
Contributor

vmoens commented Dec 19, 2024

Found it
https://github.com/nicklashansen/tdmpc2/blob/df8a465c8e137c652a142f6ad6cdf540d3a6a39a/tdmpc2/common/buffer.py#L61

@nicklashansen
Author

That would be appreciated! Worst case, a possible workaround would be to maintain two replay buffer implementations and select the more appropriate one based on capacity, but I would be interested in better solutions.

@nicklashansen
Author

Data is stored in RAM for these larger buffers, since they are in the 50-100 GB range.

@vmoens
Contributor

vmoens commented Dec 19, 2024

Ok I'll create a sandbox to check this carefully and post it here for reference

@vmoens
Contributor

vmoens commented Dec 19, 2024

https://gist.github.com/vmoens/6a860ba376ce99737dfdf5637c7eaee7

Can you let me know how to edit the fake data to make it more suited?

RE caching: caching is useful if you're sampling more than once after every extension. This can drastically speed things up because you don't need to recompute the indices of the trajectories.
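
As a concrete sketch (assuming the cache_values flag of SliceSampler; check the docstring for the exact name/semantics):

from torchrl.data.replay_buffers.samplers import SliceSampler

# Cache the computed trajectory boundaries between sample() calls.
# This only pays off when several batches are sampled per extend(),
# since extending the buffer invalidates the cached indices.
sampler = SliceSampler(
    slice_len=4,
    traj_key="episode",
    cache_values=True,  # assumed flag name; see the SliceSampler docs
)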

@vmoens
Contributor

vmoens commented Dec 19, 2024

Throwing another datapoint here:
On my cluster, running the code with the end signal (the done state) instead of the trajectory indicator is about 2x faster (500 ms vs. 1 s for a 600M-capacity buffer filled to 50%), presumably because trajectories can be identified with booleans rather than integers.
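
In code terms, the two variants compared here are roughly (assuming SliceSampler's traj_key/end_key arguments):

from torchrl.data.replay_buffers.samplers import SliceSampler

# Variant A: trajectories identified by an integer episode ID stored per transition.
sampler_from_ids = SliceSampler(slice_len=4, traj_key="episode")

# Variant B: trajectories identified by a boolean end-of-trajectory flag;
# this was the ~2x faster option in the timing above.
sampler_from_done = SliceSampler(slice_len=4, end_key="done")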

@nicklashansen
Author

Oh, I was not aware of the caching argument. I tried it just now with the 346M-transition, 600M-capacity buffer, and enabling caching speeds up sampling by 500x on my machine. That's really good to know!

Running the code with the end signal (the done state) instead of the trajectory indicator is about 2x faster (500 ms vs. 1 s for a 600M-capacity buffer filled to 50%), presumably because trajectories can be identified with booleans rather than integers.

That makes a lot of sense to me. I can try rewriting my code a bit to operate on a done signal rather than an episode ID.
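
For example, a minimal sketch of deriving a done flag from the episode IDs (assuming episodes are stored contiguously, back-to-back):

import torch

def episode_ids_to_done(episode: torch.Tensor) -> torch.Tensor:
    """Turn contiguous integer episode IDs into a boolean end-of-episode flag."""
    done = torch.zeros_like(episode, dtype=torch.bool)
    done[:-1] = episode[1:] != episode[:-1]  # flag the last step of each episode
    done[-1] = True                          # the final stored step closes a trajectory
    return done

# episodes [0, 0, 0, 1, 1, 2] -> done [False, False, True, False, True, True]
done = episode_ids_to_done(torch.tensor([0, 0, 0, 1, 1, 2]))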

Thanks a lot for your help!

@nicklashansen
Author

Can you let me know how to edit the fake data to make it more suited?

Regarding this: the data used in offline TD-MPC2 training has the following structure:

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([345690000, 6]), device=cpu, dtype=torch.float32, is_shared=False),
        episode: Tensor(shape=torch.Size([345690000]), device=cpu, dtype=torch.int64, is_shared=False),
        obs: Tensor(shape=torch.Size([345690000, 24]), device=cpu, dtype=torch.float32, is_shared=False),
        reward: Tensor(shape=torch.Size([345690000]), device=cpu, dtype=torch.float32, is_shared=False),
        task: Tensor(shape=torch.Size([345690000]), device=cpu, dtype=torch.int32, is_shared=False)},
    batch_size=torch.Size([345690000]),
    device=cpu,
    is_shared=False)
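
For the gist, fake data with the same layout could be generated roughly like this (sizes reduced; the fixed episode length and zero task IDs are just placeholders):

import torch
from tensordict import TensorDict

N = 1_000_000   # scaled down from ~345.69M transitions
EP_LEN = 500    # placeholder episode length

fake_data = TensorDict(
    {
        "action": torch.randn(N, 6),
        "episode": torch.arange(N, dtype=torch.int64) // EP_LEN,
        "obs": torch.randn(N, 24),
        "reward": torch.randn(N),
        "task": torch.zeros(N, dtype=torch.int32),
    },
    batch_size=[N],
)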

@vmoens
Contributor

vmoens commented Dec 19, 2024

Gotcha

I will also land #2672 and #2671 once the tests pass (and once I document the features a bit more!).

Together they give me a speedup of about 2-3x when the cache is disabled.

@nicklashansen
Author

That's very impressive! Let me know when the new features are ready and I'll be more than happy to give them a try with the tdmpc2 repo.

@vmoens
Contributor

vmoens commented Dec 20, 2024

Closing this issue thanks to #2670, #2671 and #2672

@vmoens vmoens closed this as completed Dec 20, 2024