[dataset] add shuffle at shards tar/raw file level #2424

Merged
merged 3 commits into from
Mar 20, 2024


kakashidan
Contributor

No description provided.

@xingchensong xingchensong requested a review from Mddct March 19, 2024 13:12
@Mddct
Collaborator

Mddct commented Mar 19, 2024

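The raw and shard source datasets need a shuffle parameter. They did not shuffle before this change, so without a parameter to control it the unit tests will not pass.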

@kakashidan
Contributor Author

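The raw and shard source datasets need a shuffle parameter. They did not shuffle before this change, so without a parameter to control it the unit tests will not pass.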

OK.

@kakashidan
Contributor Author

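The raw and shard source datasets need a shuffle parameter. They did not shuffle before this change, so without a parameter to control it the unit tests will not pass.

I added two parameters. list_shuffle controls shuffling at the tar or raw list level (as distinct from sample-level shuffling), and list_shuffle_size controls the shuffle buffer size, which defaults to 10000. If several data.list files are concatenated, the shuffle size should be large enough that the data is randomized across all of them as far as possible.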
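To illustrate why the buffer size matters when several lists are concatenated, here is a minimal plain-Python sketch of a bounded-buffer shuffle. It is not the WeNet or torch datapipe implementation; `buffered_shuffle`, `count_b_in_first_half`, and the file names are hypothetical and exist only for this example.

```python
import random
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int,
                     seed: int = 0) -> Iterator[T]:
    """Yield items in pseudo-random order using a bounded buffer.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Move a randomly chosen element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def count_b_in_first_half(order: Sequence[str]) -> int:
    """Count entries from the second list among the first 100 outputs."""
    return sum(name.startswith("b_") for name in order[:100])


# Two hypothetical data.list files concatenated back to back.
list_a = [f"a_{i}.tar" for i in range(100)]
list_b = [f"b_{i}.tar" for i in range(100)]
combined = list_a + list_b

# A small buffer can only mix entries that are close together in the input.
small = list(buffered_shuffle(combined, buffer_size=10))
# A buffer as large as the whole list mixes both lists together.
large = list(buffered_shuffle(combined, buffer_size=len(combined)))

print("b entries in first half, buffer=10: ", count_b_in_first_half(small))
print("b entries in first half, buffer=200:", count_b_in_first_half(large))
```

With a buffer of 10, at most 9 entries from the second list can appear among the first 100 outputs, so the two lists stay largely separated. With a buffer that covers the whole list, the output is a shuffle of all entries.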

@Mddct
Collaborator

Mddct commented Mar 19, 2024

The default value could simply be sys.max.

@Mddct Mddct merged commit 605384a into wenet-e2e:main Mar 20, 2024
6 checks passed
Comment on lines 398 to +402
         self.dp = TextLineDataPipe(filenames).repeat(cycle).prefetch(
-            prefetch).shard(partition)
+            prefetch)
+        if shuffle:
+            self.dp = self.dp.shuffle(buffer_size=shuffle_size)
+        self.dp = self.dp.shard(partition)
Member

Shouldn't this shuffle come before the prefetch? @Mddct
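For reference, the part of the ordering that the hunk does fix (shuffle the file list first, then shard it) can be sketched in plain Python. This is only a sketch under stated assumptions: it uses round-robin sharding and the same seed on every worker, neither of which is confirmed by this thread, and `shuffle_then_shard` is a hypothetical name rather than a WeNet function. It does not model prefetch.

```python
import random
from typing import List, Sequence


def shuffle_then_shard(filenames: Sequence[str], num_partitions: int,
                       partition_index: int, shuffle: bool = True,
                       seed: int = 0) -> List[str]:
    """Shuffle the whole list, then keep one partition of it.

    Hypothetical helper; assumes round-robin sharding and a shared seed.
    """
    order = list(filenames)
    if shuffle:
        # The same seed on every worker gives the same permutation,
        # so the partitions below stay disjoint.
        random.Random(seed).shuffle(order)
    # Round-robin: worker i keeps items i, i + n, i + 2n, ...
    return order[partition_index::num_partitions]


files = [f"shard_{i:03d}.tar" for i in range(8)]
parts = [shuffle_then_shard(files, 2, p) for p in range(2)]
for index, part in enumerate(parts):
    print(index, part)
```

Because every worker shuffles the full list identically before taking its slice, the partitions together cover each file exactly once. If each worker used a different seed, that guarantee would no longer hold in this sketch.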

@kakashidan kakashidan deleted the fix-first_stage_shuffle branch March 23, 2024 14:34