perf: notebook serialization for large notebooks blocks cell execution #214040

Closed
amunger opened this issue May 31, 2024 · 16 comments

amunger commented May 31, 2024

Steps to Reproduce:

  1. enable auto-save
  2. create a large notebook: run the code below in a cell to create a chart with a lot of data and copy that cell ~50 times
  3. save to disk
  4. create another cell with just print(1) and run it a few times
    🐛 sometimes cell execution takes several seconds

Serialization still blocks the extension host for the entire save, which means cell execution cannot happen until it finishes.

code for large outputs:

import plotly.express as px
import random

# Build ~100k points so the rendered chart produces a very large inline output.
x = []
y = []
for i in range(100000):
    random_number = random.randint(1, 10000)
    x.append(i + random_number)
    y.append(i + random_number)

# Keyword arguments avoid plotly express treating the first positional argument
# as a data frame.
fig = px.scatter(x=x, y=y)

fig.show()

amunger commented Jun 3, 2024

I experimented with an incremental serialization that would allow for more responsiveness (draft PR): basically serializing the code cells one at a time and checking for cancellation in between. The results were not very promising - serialization takes around 4x as long and there are still some pretty significant hiccups in execution time.

@rebornix - do you have any ideas for improvement here? Perhaps something like a dedicated worker thread on the EH?
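For context, a minimal sketch of the incremental idea described above, in extension-host TypeScript - the cell shape, function name, and indentation level are assumptions, not the draft PR's actual code:

interface CancellationToken { readonly isCancellationRequested: boolean; }
interface NotebookCellLike { toJSON(): object; }  // assumed cell shape

// Serialize cells one at a time, yielding to the event loop and checking for
// cancellation between cells so a pending save can be abandoned when the
// document changes again.
async function serializeIncrementally(
    cells: NotebookCellLike[],
    token: CancellationToken,
): Promise<string | undefined> {
    const parts: string[] = [];
    for (const cell of cells) {
        if (token.isCancellationRequested) {
            return undefined;  // abandon this save
        }
        parts.push(JSON.stringify(cell.toJSON(), undefined, 4));
        // Yield so queued work (e.g. kernel messages) can run between cells.
        await new Promise<void>(resolve => setImmediate(resolve));
    }
    return `[${parts.join(',')}]`;
}

The repeated per-cell JSON.stringify calls plus the extra event-loop turns add total work even though the main thread gets chances to breathe, which is consistent with the ~4x slowdown reported above.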

@arman258's comment was marked as off-topic.


amunger commented Jun 4, 2024

@arman258 I created another issue for that since it doesn't have anything to do with serializing large notebooks.

@rebornix added the under-discussion, notebook-perf, and notebook-serialization labels on Jun 13, 2024

nojaf commented Aug 7, 2024

Hi @amunger, we are facing this issue as well and are interested in helping out.
I've pushed my sample project to https://github.com/nojaf/python-notebook-serialization; I can reproduce this both in WSL and directly on Windows.

When I open my debug VSCode instance, I can't really mimic the exact behaviour as I'm missing some extensions in that instance:

[screenshot: the debug instance is missing some extensions]

Is it possible to install extensions in the local VSCode instance? Does that require something special?

In my code, I added some extra logging to capture timings:

[screenshots: serialization timing logs]

Although I suspect the problem is worse when the graphs are actually rendered.

Any pointers on what might be interesting to investigate next?


amunger commented Aug 7, 2024

@nojaf - yes, you can download the extension's VSIX from the marketplace website and use "Install from VSIX..." in the Extensions sidebar menu.

I don't think the output actually needs to be displayed, though - the performance hit is all in serializing the data, which happens whether or not it renders.


nojaf commented Aug 12, 2024

@amunger thanks for the pointer.

Thinking out loud, would JSON.stringify(sorted) (without indentation) be an option to speed up the serialization?
Locally that turns 499.814ms into 193.844ms for me.
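A rough way to reproduce that comparison (hypothetical notebook-shaped object; absolute numbers will vary by machine):

// Micro-benchmark: indented vs. compact JSON.stringify on a notebook-sized
// object graph. `notebookJson` is a stand-in for the sorted structure above.
const notebookJson = {
    cells: Array.from({ length: 50 }, (_, i) => ({
        cell_type: 'code',
        source: [`print(${i})`],
        outputs: [{ data: { 'application/json': { x: Array(10_000).fill(i) } } }],
    })),
};

console.time('stringify indented');
JSON.stringify(notebookJson, undefined, 4);
console.timeEnd('stringify indented');

console.time('stringify compact');
JSON.stringify(notebookJson);
console.timeEnd('stringify compact');

Note that compact output also changes the whitespace of every saved .ipynb, which is the version-control concern raised in the next comment.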


amunger commented Aug 12, 2024

If it makes that much of a difference, then it's worth considering, but it would cause a lot of headaches for people who use version control for their notebooks, since all the whitespace would suddenly change. We would probably need to put it behind a setting if we went that direction.


nojaf commented Aug 14, 2024

@amunger, I'm considering adding a separate timeout setting for files.autoSaveDelay. By increasing this to, say, 1 minute, the serialization process wouldn't interfere with cell execution. For users with large notebooks, being able to configure this might be a practical workaround. What do you think?


amunger commented Aug 14, 2024

Do you mean a setting that is specific to notebooks? files.autoSaveDelay already exists and could be used for this purpose, but it would affect all files.
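For reference, the existing global knobs look roughly like this in settings.json (values illustrative; they apply to every file type, which is the drawback mentioned above):

// settings.json - global auto-save configuration (not notebook-specific)
{
    "files.autoSave": "afterDelay",
    "files.autoSaveDelay": 60000  // in milliseconds; ~1 minute, as suggested above
}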


nojaf commented Aug 16, 2024

> Do you mean a setting that is specific to notebooks? files.autoSaveDelay already exists and could be used for this purpose, but it would affect all files.

Yes, exactly.


nojaf commented Aug 20, 2024

@amunger, I hate to trouble you again, but is this additional setting worth pursuing? I'm ready to proceed with a PR, but I'd appreciate some confirmation that it's a reasonable addition.


amunger commented Aug 20, 2024

Looking at where that solution would be implemented, I don't think we can cleanly split the delay value by which editor has the file open, and it's probably not worth adding that complexity to the core saving mechanism for this issue.


nojaf commented Aug 20, 2024

@amunger, OK, that sounds fair. Any other avenues to explore here? This is a real problem at our company, and we actively want to contribute to fixing it.

@Naomi-msft

@nojaf same issue experienced here. The notebook is considerably large and auto-save is on. Executing a simple print("hello world") takes 20 to 60 seconds; it executes successfully, but the execution time displayed is 0 seconds. Is there any update on workarounds or a potential fix?


nojaf commented Sep 6, 2024

I'm trying something behind a feature flag in #226632
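The PR's implementation isn't shown in this thread; purely as an illustration of the "dedicated worker thread" idea floated earlier, offloading the stringify work in Node could look like this (names and structure are assumptions, not the code from #226632):

import { Worker } from 'node:worker_threads';

// Run JSON.stringify off the extension host's main thread so execution
// messages aren't blocked by a large save. Note that structured-cloning the
// notebook data into the worker has its own cost for very large outputs.
function stringifyInWorker(data: unknown): Promise<string> {
    return new Promise((resolve, reject) => {
        const worker = new Worker(
            `const { parentPort, workerData } = require('node:worker_threads');
             parentPort.postMessage(JSON.stringify(workerData, undefined, 4));`,
            { eval: true, workerData: data },
        );
        worker.once('message', resolve);
        worker.once('error', reject);
    });
}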


amunger commented Sep 24, 2024

I tested the latest with #226632 and the latency is greatly improved. Thanks for getting that change going, @nojaf.

Use the setting "ipynb.experimental.serialization": true to free up the extension host for execution.
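For reference, that flag goes in settings.json (User or Workspace):

// settings.json - opt in to the experimental serialization path from #226632
{
    "ipynb.experimental.serialization": true
}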

@amunger closed this as completed on Sep 24, 2024
@amunger added the on-testplan label and removed the under-discussion label on Sep 24, 2024
@amunger added this to the September 2024 milestone on Sep 24, 2024
The vs-code-engineering bot locked and limited conversation to collaborators on Nov 8, 2024