benchmark regressions #9914
For plots: these were never in the sub-second range.
Makes sense, so we only need to investigate
Added
It looks like the issue for both fetch and push is with index collection - specifically, the problem is that sqltrie traverse is slow. This is a local fetch of the cats_dogs dataset (25k files) in main; using 3.10.0 (without the index collection), the fetch only takes 10.3s with the profiler enabled:
Caching the
So fixing this will require revisiting how the sqltrie traversal queries work. Or maybe we need to consider only using sqltrie for dump/load and then doing all of our index operations in memory. cc @efiop
Actually, it looks like we can probably just get away with building an in-memory pygtrie for sqltrie views, since views are the only place where we heavily use traverse right now.
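A minimal sketch of that direction, assuming the view can be dumped once as (key, value) pairs; `materialize_view` and `view_items` are illustrative names, not sqltrie's actual API:

```python
import pygtrie

def materialize_view(view_items):
    # Build an in-memory pygtrie from a one-time dump of the sqltrie view,
    # so repeated traversals never touch SQLite again.
    trie = pygtrie.Trie()
    for key, value in view_items:  # key is a tuple of path parts, e.g. ("data", "cats", "1.jpg")
        trie[key] = value
    return trie

# Traversal then happens entirely in memory:
# for key, value in materialize_view(view_items).iteritems(prefix=("data",)):
#     ...
```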
I think closing was an accident @efiop?
Yeah, sorry about that. We only have
Does iterative/sqltrie#24 fix all pull regressions? Also, I opened iterative/dvc-bench#478 since I noticed the benchmarks are failing.
Yes, all the pull regressions should be resolved after the sqltrie changes.
Switched pull -> fetch in #10117 and there is no huge regression for local:
but there is for checkout, which we already knew.
Ok, so for the record: with small files (esp. the ones in mnist) the overhead of callback creation in https://github.com/iterative/dvc-objects/blob/231e5273cf929cea2424c9d6e885f72bb5d90750/src/dvc_objects/fs/utils.py#L167 is just too wasteful and unnecessary. There are a few other things we can do there as well (like using a file-sized buffer for small files, using a larger buffer for large files, or even just using one of shutil's copy functions, which can leverage things like sendfile if needed), but even that alone speeds up link creation for me by something like 50x. reflink/hardlink/symlink bypass that path and thus don't suffer from the overhead, so our goal here is for
EDIT: Merged iterative/dvc-objects#229 and triggered https://github.com/iterative/dvc-bench/actions/runs/7040080174 , let's see...
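A rough sketch of that idea, assuming an fsspec-style callback with `set_size`/`relative_update`; the threshold and function name below are illustrative, not the actual dvc-objects code:

```python
import os
import shutil

SMALL_COPY_THRESHOLD = 8 * 1024 * 1024  # illustrative cutoff, not the real value

def copy_file(src, dst, callback=None):
    size = os.path.getsize(src)
    if callback is None or size < SMALL_COPY_THRESHOLD:
        # shutil.copyfile takes fast platform paths (sendfile on Linux,
        # fcopyfile on macOS) and skips per-chunk callback overhead entirely.
        shutil.copyfile(src, dst)
        return
    # Large files: stream with a big buffer and report progress as we go.
    callback.set_size(size)
    buf_size = 4 * 1024 * 1024
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        while chunk := fsrc.read(buf_size):
            fdst.write(chunk)
            callback.relative_update(len(chunk))
```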
There is no reason to create a progress bar overhead for anything that takes < 2sec to copy on a modern machine. Related iterative/dvc#9914
What changed between 2.58.2 and 3.0.0? The callback was still there before that, no?
This again wastes a lot of time, as `set_size` might trigger a pbar refresh. Related #229 Related iterative/dvc#9914
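For illustration, one way to keep repeated `set_size` calls cheap, assuming a tqdm-backed bar; this wrapper is hypothetical, not the dvc-objects implementation:

```python
from tqdm import tqdm

class QuietSizeCallback:
    """Only refresh the bar when the reported size actually changes,
    so repeated set_size() calls do not force a redraw every time."""

    def __init__(self, bar: tqdm):
        self.bar = bar
        self._size = None

    def set_size(self, size):
        if size != self._size:  # skip the redraw when nothing changed
            self._size = size
            self.bar.total = size
            self.bar.refresh()

    def relative_update(self, inc=1):
        self.bar.update(inc)
```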
The main thing that changed is #9444. Before it, we were processing files one-by-one and passing them to
This is a bit early, but I've sent a few more fixes and now, at least on my Linux machine, it looks like we are not only back to 2.58.2 but actually faster too (~30sec vs ~37sec for mnist checkout). Waiting for proper confirmation in dvc-bench https://github.com/iterative/dvc-bench/actions/runs/7066555049/job/19239201268
Great work, @efiop. The only thing that I want to mention is that iterative/dvc-objects#231 is going to make it slower for users with large files, so we should benchmark on that kind of dataset to be sure. ThreadPoolExecutor has high overhead for small files, but that overhead is noise for larger files, which take a lot of time anyway. Parallelizing also helps SSDs, since they have a lot of bandwidth, and HDDs in planning their shortest route.
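To make the trade-off concrete, a sketch of size-based dispatch; the cutoff and function names are assumptions for illustration, not what dvc-objects actually does:

```python
import os
from concurrent.futures import ThreadPoolExecutor

SMALL_FILE_CUTOFF = 1024 * 1024  # 1 MiB, purely illustrative

def transfer_all(pairs, transfer_one, max_workers=8):
    """pairs: (src, dst) tuples; transfer_one copies/links a single file."""
    small, large = [], []
    for src, dst in pairs:
        (small if os.path.getsize(src) < SMALL_FILE_CUTOFF else large).append((src, dst))

    # Small files: thread-pool task overhead dominates, so run them inline.
    for src, dst in small:
        transfer_one(src, dst)

    # Large files: parallelism helps saturate SSD bandwidth (and lets an HDD
    # scheduler reorder requests), so keep the executor for these.
    if large:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(lambda p: transfer_one(*p), large))
```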
Maybe it's time to write checkout in rust? 😉
Can we test on some dataset like https://www.kaggle.com/datasets/thedevastator/open-orca-augmented-flan-dataset? We don't need to put it into our regular benchmarks, but it would be good to test it out this time to see how these changes perform on a large file.
Sorry for the delay, I was AFK. Indeed, we should benchmark other datasets as well, no doubt about that. And we should perform well in both cases; mnist is a very good edge case for that. We can pick smarter strategies based on what we are dealing with. OpenOrca seems pretty good for testing a dataset that has one big-ish file, so we should definitely add it to the dvc-bench datasets so that we have it around at our disposal; I'll upload it later today 👍 . I was also looking into some famous/well-known video/sound/other catalogue-like datasets to test many (but not too many, maybe up to 10-100) big-ish files (so typical classification datasets won't do), to go a bit beyond that. Those videos have to be non-copyrighted ones, so youtube-based ones won't do. If you have ideas - let me know.
Thanks for adding @efiop! To be clear, I didn't mean we have to add it to dvc-bench, although I'm fine with doing so. Sometimes just doing a one-off test is fine if we have something particular to test.
Reminder: we have a gentree command that can randomly generate a dataset.
dvc-data gentree data 10 1Gb --depth 1
1.0G data
- ├── dir00
86M │ └── 0.txt
- ├── dir01
111M │ └── 0.txt
- ├── dir02
79M │ └── 0.txt
- ├── dir03
129M │ └── 0.txt
- ├── dir04
100M │ └── 0.txt
- ├── dir05
95M │ └── 0.txt
- ├── dir06
159M │ └── 0.txt
- ├── dir07
124M │ └── 0.txt
- ├── dir08
95M │ └── 0.txt
77M └── 0.txt
@dberenbaum It is very nice to have it in the toolset generally, but it also allows us to trigger
For the record, @skshetry and I did some optimizations on reflink creation as part of https://github.com/iterative/dql/pull/1007; it is approx 2-3 times faster now than it was before. One thing that is obvious from the benchmarks is that fetch/push-noop are kinda slow, but we are not caching the compacted/deduplicated list of files we get from going through all revisions just yet, so I'll get back to that after it is done.
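For context on what a reflink is: on Linux it is a single ioctl asking the filesystem to share extents copy-on-write. A bare-bones, Linux-only sketch with no fallback, not the dvc-objects implementation:

```python
import fcntl

FICLONE = 0x40049409  # ioctl number for a CoW clone on btrfs/XFS

def reflink(src: str, dst: str) -> None:
    # Clone src into dst without copying any data blocks; raises OSError
    # (e.g. EOPNOTSUPP/EXDEV) on filesystems that cannot do CoW clones.
    with open(src, "rb") as s, open(dst, "wb") as d:
        fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
```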
For the record, some benchmarks with different datasets:
- mnist (70K+ ~300B images): https://bench.dvc.org/schedule/7147613411_2.html
- xray (6K ~30KB images): https://bench.dvc.org/workflow_dispatch/7151367965_3.html
- used-cargs (1.4G single file)
Worth noting that these are not "real" benchmarks for older versions though, since those may be benefitting from updates to dvc-objects and other libs, right?
So overall, it looks like a great improvement, but we are trading off some performance on big files to get improvements on many small files. Is that right?
Any updates here? What about s3? Any idea why it doesn't show the same improvements as other clouds in the benchmarks?
Sorry, I was convinced I responded before, but probably forgot to post.
Yes, mostly recent versions, as very old ones will have more strict requirements.
Marginally in
No updates, haven't looked into this since last time 🙁 s3 is likely because of chunking configuration (similar to the improvements that @pmrowla made in adlfs in the past), but I haven't dug into it yet.
@pmrowla Any thoughts on the s3 performance? I'd like to look into that either as part of this issue or a new one. Is there anything else left before closing this issue?
@dberenbaum this can be closed, we have a separate issue for the s3 performance: #9893
Investigate regressions in https://github.com/iterative/dvc-bench for:
- plots
- exp show