
Same order of training samples with NoDuplicatesBatchSampler #3069

Closed
antigregory opened this issue Nov 18, 2024 · 4 comments · Fixed by #3073
Labels
bug Something isn't working

Comments

@antigregory

It seems that NoDuplicatesBatchSampler produces the same set of batches in the same order regardless of the epoch index.
Indeed, in this piece of code, the order of the indices in remaining_indices does not depend on the random permutation torch.randperm(len(self.dataset), generator=self.generator), because wrapping the permutation in a set discards its order.

Moreover, the seed in line 185 does not change from one epoch to another (the set_epoch method does not seem to be used...)
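
A minimal standalone sketch of the symptom (not the library code itself): wrapping the shuffled indices in a set throws the permutation order away, so every epoch starts from the same sequence.

import torch

# Minimal repro sketch (not the library code): the permutation order is lost
# as soon as the indices are wrapped in a set; for small consecutive ints,
# set iteration comes back effectively sorted, so every epoch looks the same.
generator = torch.Generator().manual_seed(42)
perm = torch.randperm(10, generator=generator).tolist()
print(perm)             # a shuffled order, e.g. [2, 7, 4, ...]
print(list(set(perm)))  # [0, 1, 2, ..., 9] -- the shuffle has no effect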

tomaarsen added the bug (Something isn't working) label on Nov 18, 2024
@tomaarsen
Collaborator

Hello!

set_epoch is used via the transformers Trainer, which calls set_epoch on the accelerate-wrapped DataLoader, which should propagate it down into the sampler. However, it seems that accelerate only propagates it into the sampler, not the batch sampler: https://github.com/huggingface/accelerate/blob/8ade23cc6aec7c3bd3d80fef6378cafaade75bbe/src/accelerate/data_loader.py#L591-L592
Perhaps this warrants a feature request/pull request on accelerate @muellerzr

As for the set - well spotted. I was under the false impression that insertion order was preserved, much like for dict instances. I'd like to avoid converting remaining_indices into a list, as removing elements from a list is expensive. I'm open to suggestions here.
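
To make the cost concern concrete, a hypothetical illustration (not the library code): removing an arbitrary value from a list is linear in its length, while discarding from a set is constant time on average but ignores the shuffle order.

# Hypothetical illustration of the cost trade-off discussed above.
remaining_list = list(range(10))
remaining_list.remove(7)        # O(n): scans the list and shifts the tail

remaining_set = set(range(10))
remaining_set.discard(7)        # O(1) on average, but iteration ignores the shuffle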

- Tom Aarsen

@antigregory
Author

It might not be the most elegant solution, but maybe remaining_indices could be an OrderedDict?
remaining_indices = OrderedDict({k: None for k in torch.randperm(len(self.dataset), generator=self.generator).tolist()})

It probably won't be as fast as the set (because the OrderedDict needs to maintain its order after each deletion), but deleting an element from an OrderedDict is still constant time on average.
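
A self-contained sketch of this idea (hypothetical, not the merged fix): the shuffled indices become OrderedDict keys, so the shuffle order survives and deletion stays constant time on average.

from collections import OrderedDict

import torch

# Hypothetical sketch of the OrderedDict suggestion (not the merged fix):
# keys preserve the shuffled order, and deleting a key is O(1) on average.
generator = torch.Generator().manual_seed(42)
remaining_indices = OrderedDict(
    (k, None) for k in torch.randperm(10, generator=generator).tolist()
)
batch = list(remaining_indices)[:4]   # next indices, still in shuffled order
for idx in batch:
    del remaining_indices[idx]        # constant-time removal on average
print(list(remaining_indices))        # the surviving indices keep their order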

@tomaarsen
Collaborator

I was also considering a dict. Because we're at Python 3.7+ now, I think we can just use a normal dict:

the insertion-order preservation nature of dict objects has been declared to be an official part of the Python language spec.

from https://docs.python.org/3/whatsnew/3.7.html

If the performance hit is not too large, then this is an acceptable solution I think.
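
For illustration, a hypothetical sketch of the plain-dict variant (not necessarily the code that was merged):

import torch

# Hypothetical sketch of the plain-dict variant (Python 3.7+ guarantees that
# dict preserves insertion order); not necessarily the merged code.
generator = torch.Generator().manual_seed(42)
remaining_indices = dict.fromkeys(torch.randperm(10, generator=generator).tolist())
first = next(iter(remaining_indices))  # indices come back in shuffled order
del remaining_indices[first]           # O(1) average-case deletion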

I'll also look more into fixing the set_epoch issue in accelerate.

- Tom Aarsen

@antigregory
Author

Yes, I agree, a normal dictionary can also be considered.

The behavior might be slightly less predictable, though, because the declared "insertion-order preservation" does not necessarily guarantee that the order is preserved after some elements are deleted.
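
A quick check of that concern (observed CPython behavior, not a language-spec guarantee): the surviving keys do keep their relative order after a deletion.

# Observed CPython behavior, not a spec guarantee:
# remaining keys keep their relative insertion order after a deletion.
d = dict.fromkeys([5, 2, 9, 1, 7])
del d[9]
print(list(d))  # [5, 2, 1, 7]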
