
High memory usage and latency when a program produces output too quickly #525

Closed
kxt opened this issue May 19, 2021 · 9 comments · Fixed by #536

Comments

kxt (Contributor) commented May 19, 2021

version 0.12.0

Reproduction

Run `time cat 1_000_000_lines.txt` and compare the time it reports with the time measured by a wall clock. The amount of time `time` reports is significantly less than what is actually observed. Having a slow computer and using a non-optimized (non-release) build makes the effect more pronounced.

Root cause

A program's output is processed by the Pty thread. This thread reads the output and propagates it to the Screen thread for rendering through an unbounded channel called `to_screen`. If the program produces output faster than Screen can render it, `to_screen`'s queue grows indefinitely. This results in growing memory usage and input latency.
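For illustration, a minimal sketch of this failure mode (not Zellij's actual code; crossbeam's unbounded channel is used here only because it exposes `len()`, the behaviour with std's mpsc is the same):

```rust
// Illustration only, not Zellij's code: a fast producer on an unbounded
// channel outruns a slow consumer, so the queue (and memory) keeps growing.
use crossbeam_channel::unbounded;
use std::{thread, time::Duration};

fn main() {
    let (to_screen, from_pty) = unbounded::<String>();

    // "Pty" thread: pushes output as fast as it can.
    thread::spawn(move || {
        for i in 0..1_000_000 {
            let _ = to_screen.send(format!("line {}", i));
        }
    });

    // "Screen" thread (here: main): renders much more slowly.
    for _ in 0..5 {
        thread::sleep(Duration::from_millis(100));
        let _ = from_pty.try_recv();
        // The backlog keeps growing as long as the producer is ahead.
        println!("queued instructions: {}", from_pty.len());
    }
}
```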

An alternative explanation (kudos to @aeinstein): running nearly at the speed of light, cat is experiencing time dilation compared to a stationary observer.

As PR #523 increases the throughput of the Pty thread, it exacerbates this issue.

kxt (Contributor, Author) commented May 19, 2021

Currently there is a workaround in Pty that calls task::sleep to slow it down.

However, as there is no right amount of sleep that works for every machine, the proper solution would be to have back pressure between Pty and Screen: if Screen can't keep up, Pty stops reading from the program until the queue clears up. This would need a bounded queue between Pty and Screen. However, as Pty is async, we can't simply replace the queue with an mpsc SyncSender, as its send is blocking. This means an async queue is needed, like async-std's bounded channel.
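A minimal sketch of how that back pressure would behave with async-std's bounded channel (illustration only; the capacity, message type, and task structure here are made up):

```rust
// Illustration only: with a bounded async channel, `send` awaits until
// there is room, so the producer can never run ahead of the consumer by
// more than the channel's capacity.
use async_std::{channel::bounded, task};
use std::time::Duration;

fn main() {
    task::block_on(async {
        let (to_screen, from_pty) = bounded::<String>(50);

        let pty = task::spawn(async move {
            for i in 0..1_000 {
                // Waits (asynchronously) whenever Screen falls behind.
                to_screen.send(format!("line {}", i)).await.unwrap();
            }
        });

        let screen = task::spawn(async move {
            while let Ok(_line) = from_pty.recv().await {
                task::sleep(Duration::from_millis(1)).await; // simulated slow render
            }
        });

        pty.await;
        screen.await;
    });
}
```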

kxt (Contributor, Author) commented May 20, 2021

I made a proof-of-concept using a bounded to_screen queue, and I ran into a deadlock:

  • pty thread processes a lot of input, filling up the to_screen queue as soon as the screen thread pops something from it.
  • screen thread pops a ScreenInstruction::Render from to_screen and processes it with Screen::render, which calls Tab::render, which calls Pane::render. The pane happens to be a PluginPane, and its render opens a channel to the plugin, sends it a PluginInstruction::Render, and waits for the plugin's response through that channel, blocking the screen thread in the meantime.
  • However, the wasm thread that would process PluginInstruction::Render is blocked: it still hasn't finished processing a previous PluginInstruction::Update message. PluginInstruction::Update involves sending a ScreenInstruction::Render to the to_screen queue, which is already filled by the pty thread, so this send blocks.

TL;DR: pty thread has filled to_screen, wasm thread is blocked by the filled to_screen, and screen thread (which should clear up to_screen) is blocked by the blocked wasm thread.

So simply changing to_screen to a bounded queue is not a solution.

kxt (Contributor, Author) commented May 21, 2021

I found a workable way to add back pressure between pty and screen threads without triggering the mentioned deadlock.

The idea is to have a dedicated bounded channel between pty and screen, so we gain back pressure, while keeping the unbounded channel between every other component and screen, to avoid deadlocks.

This required a change from mpsc to crossbeam, as mpsc lacks a way to select between multiple Receivers.
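Roughly, the idea looks like this (a simplified sketch, not the poc patch itself; the instruction type and the capacity are made up):

```rust
// Simplified sketch: the high-volume pty path gets a bounded channel for
// back pressure, everything else keeps an unbounded channel so it can
// never deadlock on a full queue, and the screen loop selects over both.
use crossbeam_channel::{bounded, select, unbounded};
use std::thread;

enum ScreenInstruction {
    Render(String),
}

fn main() {
    let (pty_tx, pty_rx) = bounded::<ScreenInstruction>(50);
    let (other_tx, other_rx) = unbounded::<ScreenInstruction>();

    // "Pty" thread: blocks on send once the bounded queue is full.
    thread::spawn(move || {
        for i in 0..10 {
            pty_tx.send(ScreenInstruction::Render(format!("pty {}", i))).unwrap();
        }
    });

    // Any other component (plugins, input, ...) sends on the unbounded side.
    other_tx.send(ScreenInstruction::Render("plugin".into())).unwrap();

    // "Screen" thread (here: main): handles whichever channel has data.
    let mut handled = 0;
    while handled < 11 {
        select! {
            recv(pty_rx) -> msg => {
                if let Ok(ScreenInstruction::Render(s)) = msg {
                    println!("render: {}", s);
                    handled += 1;
                }
            }
            recv(other_rx) -> msg => {
                if let Ok(ScreenInstruction::Render(s)) = msg {
                    println!("render: {}", s);
                    handled += 1;
                }
            }
        }
    }
}
```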

@imsnif If this is an acceptable approach (poc patch here) I'll whip up a PR-able implementation.

imsnif (Member) commented May 22, 2021

@kxt - at a brief glance this looks like a good direction (though I haven't run it or looked too deeply at it yet). I'm good with moving forward with a PR so we can look deeper. I might want to bring in @kunalmohan to also take a look (if he has the time) since he's done a lot of work in this area.

kunalmohan (Member) commented

@kxt great work on #523! The throughput has increased by a LOT!
Regarding this issue: I personally didn't notice it in release builds (on main as well as in #523), but it is quite evident in debug builds. Your proposal of using crossbeam channels with the select feature looks like a very good solution. Since it might be some time before the rendering throughput is increased, I think we should go ahead with this. So go ahead and draft a PR and we'll discuss this further :)

kxt (Contributor, Author) commented May 25, 2021

I've created two PRs. The first one contains some refactors preparing for the fix, including switching from mpsc to crossbeam; it is ready for review and merge.

The second PR, still a draft, contains the actual fix and seems to work OK. However, some of the test cases fail unless sleeps are introduced at random points, which is not great. I wasn't able to figure out why they are necessary.

I've noticed that many test cases already contain sleeps, and there is also a generous amount of sleeping in read_from_stdin, which makes running the test cases slow.

Can someone help me figure out what's going on with the test cases? @kunalmohan ?

imsnif (Member) commented May 26, 2021

Hey @kxt - I added a review in the first PR. About the tests in the draft PR (which I think you haven't created yet? unless I missed it?): our testing infrastructure is one of the most broken parts of the code base. It was one of the first pieces of Zellij and is very long overdue for an overhaul.

Since it works by comparing snapshots of the app state (essentially what's on screen in certain situations), and the app includes some asynchronous elements, instead of creating proper synchronization we just kept adding sleeps to make it work. I am not proud of this, but I am honest about it. :) I plan on devoting attention to it very soon, after all the other urgent stuff.

So I'm afraid the best answer I can give you right now is to add sleeps where necessary to wait for data to come from the fake pty before sending the next command. (Also, there might still be some atomicity issues left where stdin is concerned, though I think we got most of those already.)

If you want further specific help with this, let me know.

kxt (Contributor, Author) commented May 26, 2021

Sorry, I did not link the draft PR properly; it's in my fork of zellij right now.

I figured that it's because the large amount of cross-thread communication makes it hard to tell whether something has already happened or not.

If you say sleeps are ok as of now, then I'll open a proper PR with those two patches.

Do you have any concrete plans on the test overhaul?

imsnif (Member) commented May 26, 2021

If you say sleeps are ok as of now, then I'll open a proper PR with those two patches.

Sounds good.

Do you have any concrete plans on the test overhaul?

An idea we've been throwing around in the team is to move whatever makes sense from the integration tests into unit tests, and turn the rest into proper end-to-end tests. Our idea is to spin up a docker container and communicate with it through ssh, thus simulating the full app cycle. This still doesn't solve the synchronization problem, but I have a feeling that's the sort of issue you solve as you work on it.

If you have any other ideas, they would of course be most welcome.

kxt added a commit to kxt/zellij that referenced this issue May 27, 2021
Pty reads a command's output and feeds it to Screen using an unbounded
queue.

However, if the command produces output faster than what `Screen` can
render, `Pty` still pushes it on the queue, causing it to grow
indefinitely, resulting in high memory usage and latency.

This patch fixes this by using a bounded queue between Pty and Screen,
so if Screen can't keep up with Pty, the queue will fill up, exerting
back pressure on Pty, making it read the command's output only as fast
as Screen renders it.

The unbounded queue is kept between Screen and producers other than Pty
to avoid a deadlock such as this scenario:
* pty thread filling up screen queue as soon as screen thread pops
  something from it
* wasm thread is processing a Render instruction, blocking on the screen
  queue
* screen thread is trying to render a plugin pane. It attempts to send a
  Render instruction to the blocked wasm thread running the plugin.

This patch also adds a generous amount of sleeps to the integration
tests as having backpressure changes the timing of how instructions are
processed.

Fixes zellij-org#525.
imsnif pushed a commit that referenced this issue May 27, 2021
* refactor(fakes): clean up add_terminal_input

* refactor(fakes): append whole buf to output_buffer in FakeStdoutWriter::write

* refactor(fakes): append whole buf to output_buffer in FakeInputOutput::write_to_tty_stdin

* fix(fakes): allow partial reads in read_from_tty_stdout

This patch fixes two bugs in read_from_tty_stdout (a sketch of the
corrected read logic follows after this commit message):
* if there was a partial read (i.e. `bytes.read_position` is not 0 but
less than `bytes.content.len()`), subsequent calls would fill `buf`
starting at index `bytes.read_position` instead of 0, leaving the range
0..`bytes.read_position` untouched.
* if `buf` was smaller than `bytes.content.len()`, a panic would occur.

* refactor(channels): use crossbeam instead of mpsc

This patch replaces mpsc with crossbeam channels because crossbeam
supports selecting on multiple channels which will be necessary in a
subsequent patch.

* refactor(threadbus): allow multiple receivers in Bus

This patch changes Bus to use multiple receivers. Method `recv` returns
data from all of them. This will be used in a subsequent patch for
receiving from bounded and unbounded queues at the same time.

* refactor(channels): remove SenderType enum

This enum has only one variant, so the entire enum can be replaced with
the innards of said variant.

* refactor(channels): remove Send+Sync trait implementations

The implementation of these traits is not necessary, as
SenderWithContext is automatically Send and Sync for every T and
ErrorContext that's Send and Sync.
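A sketch of the corrected partial-read logic described in the fix(fakes) item above (illustration only; `FakeBytes` and its fields mirror the names in the commit message, not the actual FakeInputOutput code):

```rust
// Illustration only, not the actual FakeInputOutput code; the field names
// mirror the commit message above.
struct FakeBytes {
    content: Vec<u8>,
    read_position: usize,
}

fn read_from_tty_stdout(bytes: &mut FakeBytes, buf: &mut [u8]) -> usize {
    let remaining = &bytes.content[bytes.read_position..];
    // Always copy into buf[0..] and never more than buf can hold, so a
    // partial read neither leaves the front of `buf` stale nor panics
    // when `buf` is shorter than the remaining content.
    let n = remaining.len().min(buf.len());
    buf[..n].copy_from_slice(&remaining[..n]);
    bytes.read_position += n;
    n
}

fn main() {
    let mut bytes = FakeBytes { content: b"hello world".to_vec(), read_position: 0 };
    let mut buf = [0u8; 4];
    loop {
        let n = read_from_tty_stdout(&mut bytes, &mut buf);
        if n == 0 {
            break;
        }
        println!("read chunk: {:?}", &buf[..n]);
    }
}
```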