Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

benchmark regressions #9914

Closed
4 of 5 tasks
dberenbaum opened this issue Sep 5, 2023 · 37 comments · Fixed by iterative/sqltrie#24
Closed
4 of 5 tasks

benchmark regressions #9914

dberenbaum opened this issue Sep 5, 2023 · 37 comments · Fixed by iterative/sqltrie#24
Assignees
Labels
p1-important Important, aka current backlog of things to do

Comments

@dberenbaum
Copy link
Collaborator

dberenbaum commented Sep 5, 2023

Investigate regressions in https://github.com/iterative/dvc-bench for:

  • push (sometime between 3.10 and 3.20)
  • plots
  • exp show
  • pull (after 3.20)
  • checkout
@dberenbaum dberenbaum added this to DVC Sep 5, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Sep 5, 2023
@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Sep 5, 2023
@dberenbaum dberenbaum moved this from Backlog to Todo in DVC Sep 5, 2023
@skshetry
Copy link
Member

skshetry commented Sep 5, 2023

For exp show and plots, it looks like the repository used for benchmarking is example-get-started which is incompatible with 2.x due to artifacts in dvc.yaml and md5/hash in dvc.lock changes, so they are failing.

Plots were never in sub-second range.

@dberenbaum
Copy link
Collaborator Author

Makes sense, so we only need to investigate push

@dberenbaum dberenbaum moved this from Todo to In Progress in DVC Sep 12, 2023
@dberenbaum dberenbaum moved this from In Progress to Todo in DVC Sep 12, 2023
@dberenbaum dberenbaum moved this from Todo to Backlog in DVC Sep 19, 2023
@dberenbaum
Copy link
Collaborator Author

Added pull based on issues since 3.20

@dberenbaum dberenbaum moved this from Backlog to Todo in DVC Nov 14, 2023
@pmrowla
Copy link
Contributor

pmrowla commented Nov 15, 2023

It looks like the issue for both fetch and push is with index collection - specifically the problem is that sqltrie traverse is slow.

This is a local fetch of the cats_dogs dataset (25k files) in main, using 3.10.0 (without the index collection) the fetch only takes 10.3s with the profiler enabled:
Screenshot 2023-11-15 at 5 46 47 PM

Caching the sqltrie._traverse() results in memory cuts the fetch time in half, but the real issue is that the underlying script to generate the temporary steps tables for traversal has to be called once per file, and that temp table generation is the main performance blocker. (the caching cuts the runtime exactly in half because we are calling traverse twice per file)

So fixing this will require revisiting how the sqltrie traversal queries work. Or maybe we need to consider only using sqltrie for dump/load and then do all of our index operations in memory. cc @efiop

@pmrowla pmrowla moved this from Todo to In Progress in DVC Nov 15, 2023
@pmrowla
Copy link
Contributor

pmrowla commented Nov 15, 2023

Actually it looks like we can probably just get away with building an in-memory pygtrie for sqltrie views, since views are the only place we heavily use traverse right now

@dberenbaum
Copy link
Collaborator Author

I think closing was an accident @efiop?

@dberenbaum dberenbaum reopened this Nov 27, 2023
@github-project-automation github-project-automation bot moved this from Done to Todo in DVC Nov 27, 2023
@efiop
Copy link
Contributor

efiop commented Nov 27, 2023

Yeah, sorry about that.

We only have push left to check it looks like.

@dberenbaum
Copy link
Collaborator Author

Does iterative/sqltrie#24 fix all pull regressions? Also I opened iterative/dvc-bench#478 since I notice the benchmarks are failing.

@pmrowla
Copy link
Contributor

pmrowla commented Nov 28, 2023

Yes, all the pull regressions should be resolved after the sqltrie changes.

@pmrowla
Copy link
Contributor

pmrowla commented Nov 28, 2023

bench.dvc.org is up to date now, pull regressions have been resolved but it looks like push has still regressed since 3.10

pull
push

@pmrowla pmrowla moved this from Todo to In Progress in DVC Nov 28, 2023
@skshetry
Copy link
Member

pull still seems to have regressed, no? The index migration for pull/fetch/checkout happened in #9424 in 3.0.

@efiop
Copy link
Contributor

efiop commented Nov 28, 2023

Switched pull -> fetch in #10117 and there is no huge regression for local:

2023-11-28T18:04:54.4979752Z ---------------------------------------------------------------------------- benchmark 'test_fetch-fetch': 5 tests ----------------------------------------------------------------------------
2023-11-28T18:04:54.4982155Z Name (time in s)                  Min                 Max                Mean            StdDev              Median               IQR            Outliers     OPS            Rounds  Iterations
2023-11-28T18:04:54.4984185Z -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2023-11-28T18:04:54.4986347Z test_fetch-fetch[2.58.2]     112.4179 (1.0)      112.4179 (1.0)      112.4179 (1.0)      0.0000 (1.0)      112.4179 (1.0)      0.0000 (1.0)           0;0  0.0089 (1.0)           1           1
2023-11-28T18:04:54.4988991Z test_fetch-fetch[3.0.0]      118.5058 (1.05)     118.5058 (1.05)     118.5058 (1.05)     0.0000 (1.0)      118.5058 (1.05)     0.0000 (1.0)           0;0  0.0084 (0.95)          1           1
2023-11-28T18:04:54.4991617Z test_fetch-fetch[main]       118.6735 (1.06)     118.6735 (1.06)     118.6735 (1.06)     0.0000 (1.0)      118.6735 (1.06)     0.0000 (1.0)           0;0  0.0084 (0.95)          1           1
2023-11-28T18:04:54.4994162Z test_fetch-fetch[3.10.0]     119.2171 (1.06)     119.2171 (1.06)     119.2171 (1.06)     0.0000 (1.0)      119.2171 (1.06)     0.0000 (1.0)           0;0  0.0084 (0.94)          1           1
2023-11-28T18:04:54.4996773Z test_fetch-fetch[3.20.0]     119.2340 (1.06)     119.2340 (1.06)     119.2340 (1.06)     0.0000 (1.0)      119.2340 (1.06)     0.0000 (1.0)           0;0  0.0084 (0.94)          1           1
2023-11-28T18:04:54.4998835Z -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2023-11-28T18:04:54.4999696Z 

but there is for checkout, which we already knew.

@efiop
Copy link
Contributor

efiop commented Nov 29, 2023

Ok, so for the record: with small files (esp the ones in mnist) the overhead of callback creation in https://github.com/iterative/dvc-objects/blob/231e5273cf929cea2424c9d6e885f72bb5d90750/src/dvc_objects/fs/utils.py#L167 is just too wasteful and unnecessary. There are a few things that we can do there as well (like using file-sized buffer for small files, using larger buffer for large files, or even just using shutils copysomething that can leverage things like sendfile if necessary), but even that alone speeds up link creation for me by like 50x times. reflink/hardlink/symlink bypass that thus not suffering from the overhead, so our goal here is for copy to take about the same time as other links when dealing with small files. The whole copyfile callback situation is rather messy, somewhat keen on revisiting it a bit more. I'll send some patches in the morning.

EDIT: Merged iterative/dvc-objects#229 and triggered https://github.com/iterative/dvc-bench/actions/runs/7040080174 , let's see...

@efiop efiop self-assigned this Nov 29, 2023
efiop added a commit to efiop/dvc-objects that referenced this issue Nov 30, 2023
There is no reason to create a progress bar overhead for anything that
takes < 2sec to copy on a modern machine.

Related iterative/dvc#9914
efiop added a commit to iterative/dvc-objects that referenced this issue Nov 30, 2023
There is no reason to create a progress bar overhead for anything that
takes < 2sec to copy on a modern machine.

Related iterative/dvc#9914
@skshetry
Copy link
Member

What changed between 2.58.2 and 3.0.0? The callback was still there before that, no?

efiop added a commit to efiop/dvc-objects that referenced this issue Dec 1, 2023
This again wastes a lot of time, as `set_size` might trigger
a pbar refresh.

Related iterative#229
Related iterative/dvc#9914
efiop added a commit to iterative/dvc-objects that referenced this issue Dec 1, 2023
This again wastes a lot of time, as `set_size` might trigger
a pbar refresh.

Related #229
Related iterative/dvc#9914
@efiop
Copy link
Contributor

efiop commented Dec 2, 2023

The main thing that changed is #9444 . Before it we were processing files one-by-one and passing them to generic.transfer(), while after it, among other things, we are passing all files that we need to create to generic.transfer() which makes it use a bit more overhead in different places down the line.

This is a bit early, but I've sent a few more fixes and now, at least on my linux machine, it looks like we are not only back to 2.58.2, but actually faster too (~30sec vs ~37sec for mnist checkout). Waiting for proper confirmation in dvc-bench https://github.com/iterative/dvc-bench/actions/runs/7066555049/job/19239201268

@efiop
Copy link
Contributor

efiop commented Dec 2, 2023

Ok, so looks promising so far.

Screenshot 2023-12-02 at 03 11 27

3.20 also improved because it also installed new dvc-objects version. So indeed, checkout-copy is now faster than it was in 2.x -

But because we've unified all transfer-like logic in the past year, we also get these improvements in other commands that use this same code. E.g. fetch from local remote is now 2.5 times faster than it was.

Screenshot 2023-12-02 at 03 13 21

and push to local remote is now ~2 times faster than before and over 4 times faster than in 2.x

Screenshot 2023-12-02 at 03 16 20

and dvc get-url

Screenshot 2023-12-02 at 03 18 42

and dvc import-url

Screenshot 2023-12-02 at 03 19 17

and a few more. We'll see on on bench.dvc.org once ready.

PS: Hopefully not a fluke...

@skshetry
Copy link
Member

skshetry commented Dec 2, 2023

Great work, @efiop. The only thing that I want to mention is that, iterative/dvc-objects#231 is going to make it slower for users with large files. So we should benchmark on that kind of datasets to be sure. ThreadPoolExecutor has high overhead for small files, but that is noise for larger files which take a lot of time. Parallellizing also help SSDs as they have a lot of bandwidth, and HDDs in planning in their shortest route.

@skshetry
Copy link
Member

skshetry commented Dec 2, 2023

Maybe it's time to write checkout in rust? 😉

@dberenbaum
Copy link
Collaborator Author

Can we test on some dataset like https://www.kaggle.com/datasets/thedevastator/open-orca-augmented-flan-dataset? We don't need to put them into our regular benchmarks but would be good to test it out this time to see how these changes perform on a large file.

@efiop
Copy link
Contributor

efiop commented Dec 3, 2023

Sorry for the delay, was AFK.

Indeed, we should benchmark other datasets as well, no doubt about that. And we should perform well in both cases, mnist is a very good edge case for that. We can pick smarter strategies based on what we are dealing with.

OpenOrca seems pretty good for testing a dataset that has 1 file that is big-ish, we should definitely add it to the dvc-bench datasets so that we have it around at our disposal, I'll upload it later today 👍 . I was also looking into some famous/well-known video/sound/other catalogue-looking datasets to test many (but not too many, maybe up to 10-100) of big-ish files (so typical classification datasets won't do), to test a bit beyond that. Those videos have to be non-copyright ones, so youtube-based ones won't do. If you have ideas - let me know.

@dberenbaum
Copy link
Collaborator Author

Thanks for adding @efiop! To be clear, I didn't mean we have to add it to dvc-bench, although I'm fine with doing so. Sometimes just doing a one-off test is fine if we have something particular to test.

@skshetry
Copy link
Member

skshetry commented Dec 4, 2023

Reminder: we have a gentree command that can randomly generate a dataset.

dvc-data gentree data 10 1Gb --depth 1
1.0G data
   - ├── dir00
 86M │  └── 0.txt
   - ├── dir01
111M │  └── 0.txt
   - ├── dir02
 79M │  └── 0.txt
   - ├── dir03
129M │  └── 0.txt
   - ├── dir04
100M │  └── 0.txt
   - ├── dir05
 95M │  └── 0.txt
   - ├── dir06
159M │  └── 0.txt
   - ├── dir07
124M │  └── 0.txt
   - ├── dir08
 95M │  └── 0.txt
 77M └── 0.txt

@efiop
Copy link
Contributor

efiop commented Dec 4, 2023

@dberenbaum It is very nice to have it in the toolset generally, but it also allows us to trigger dispatch runs in CI. I've also added a smaller one used-cars dataset and it is even faster to run than mnist (i mean total time) https://github.com/iterative/dvc-bench/actions/runs/7087497084 , so we can probably run something chunkier on a regular basis, just need to figure out how/where to post. I'll also add a dataset like the one I was talking about above later.

@efiop
Copy link
Contributor

efiop commented Dec 6, 2023

For the record, @skshetry and I did some optimizations on reflink creation as a part of https://github.com/iterative/dql/pull/1007, approx 2-3 times faster now that it was before.

One thing that is obvious from benchmarks is that fetch/push-noop are kinda slow, but we are not caching compacted/deduplicated list of files we get from going through all revisions just yet, so I would get back to that after it is done.

@efiop
Copy link
Contributor

efiop commented Dec 10, 2023

For the record, some benchmarks with different datasets:

mnist (70K+ ~300B images)

https://bench.dvc.org/schedule/7147613411_2.html

xray (6K ~30KB images)

https://bench.dvc.org/workflow_dispatch/7151367965_3.html

used-cargs (1.4G single file)

https://bench.dvc.org/workflow_dispatch/7146419624_1.html

@dberenbaum
Copy link
Collaborator Author

Worth noting that these are not "real" benchmarks for older versions though since those may be benefitting from updates to dvc-objects and other libs, right?

@dberenbaum
Copy link
Collaborator Author

So overall, it looks like it's a great improvement but we are trading off some performance on big files to get improvements on many small files. Is that right?

@dberenbaum
Copy link
Collaborator Author

Any updates here? What about s3? Any idea why it doesn't show the same improvements as other clouds in the benchmarks?

@efiop
Copy link
Contributor

efiop commented Jan 2, 2024

Sorry, I was convinced I responded before, but probably forgot to post.

Worth noting that these are not "real" benchmarks for older versions though since those may be benefitting from updates to dvc-objects and other libs, right?

Yes, mostly recent versions as very old ones will have more strict requirements.

So overall, it looks like it's a great improvement but we are trading off some performance on big files to get improvements on many small files. Is that right?

Marginally in used-cars, but there is no benchmark with many big files which should be more affected.

Any updates here? What about s3? Any idea why it doesn't show the same improvements as other clouds in the benchmarks?

No updates, haven't looked into this since the last time 🙁 s3 is likely because of chunking configuration (similar to improvements that @pmrowla did in adlfs in the past), but haven't looked into it.

@dberenbaum
Copy link
Collaborator Author

@pmrowla Any thoughts on the s3 performance? I'd like to look into that either as part of this issue or a new one.

Is there anything else left before closing this issue?

@pmrowla
Copy link
Contributor

pmrowla commented Jan 4, 2024

@dberenbaum this can be closed, we have a separate issue for the s3 performance #9893

@pmrowla pmrowla closed this as completed Jan 4, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in DVC Jan 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p1-important Important, aka current backlog of things to do
Projects
No open projects
Archived in project
4 participants