
High memory usage and latency when a program produces output too quickly #525

Closed
kxt opened this issue May 19, 2021 · 9 comments · Fixed by #536

Comments

kxt (Contributor) commented May 19, 2021

version 0.12.0

Reproduction

Run `time cat 1_000_000_lines.txt` and compare the time it reports with the time measured by a wall clock. The amount of time `time` reports is significantly less than what is actually observed. Having a slow computer and using a non-optimized (non-release) build makes the effect more pronounced.

Root cause

A program's output is processed by the Pty thread. This thread reads the output and propagates it to the Screen thread for rendering through an unbounded channel called `to_screen`. If the program produces output faster than Screen can render it, `to_screen`'s queue grows indefinitely. This results in growing memory usage and input latency.
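For illustration, a minimal sketch of this failure mode (not Zellij's actual code; crossbeam's unbounded channel is used here only because it exposes `len()`, the behaviour with std's mpsc is the same):

```rust
// Illustration only, not Zellij's code: a fast producer on an unbounded
// channel outruns a slow consumer, so the queue (and memory) keeps growing.
use crossbeam_channel::unbounded;
use std::{thread, time::Duration};

fn main() {
    let (to_screen, from_pty) = unbounded::<String>();

    // "Pty" thread: pushes output as fast as it can.
    thread::spawn(move || {
        for i in 0..1_000_000 {
            let _ = to_screen.send(format!("line {}", i));
        }
    });

    // "Screen" thread (here: main): renders much more slowly.
    for _ in 0..5 {
        thread::sleep(Duration::from_millis(100));
        let _ = from_pty.try_recv();
        // The backlog keeps growing as long as the producer is ahead.
        println!("queued instructions: {}", from_pty.len());
    }
}
```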

An alternative explanation (kudos to @aeinstein): running nearly at the speed of light, cat is experiencing time dilation compared to a stationary observer.

As PR #523 increases the throughput of the Pty thread, it exacerbates this issue.

kxt (Contributor, Author) commented May 19, 2021

Currently there is a workaround in Pty that calls task::sleep to slow it down.

However, as there is no right amount of sleep that works for every machine, the proper solution would be to have back pressure between Pty and Screen: if Screen can't keep up, Pty stops reading from the program until the queue clears up. This would need a bounded queue between Pty and Screen. However, as Pty is async, we can't simply replace the queue with an mpsc SyncSender, as its send is blocking. This means an async queue is needed, like async-std's bounded channel.
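A minimal sketch of how that back pressure would behave with async-std's bounded channel (illustration only; the capacity, message type, and task structure here are made up):

```rust
// Illustration only: with a bounded async channel, `send` awaits until
// there is room, so the producer can never run ahead of the consumer by
// more than the channel's capacity.
use async_std::{channel::bounded, task};
use std::time::Duration;

fn main() {
    task::block_on(async {
        let (to_screen, from_pty) = bounded::<String>(50);

        let pty = task::spawn(async move {
            for i in 0..1_000 {
                // Waits (asynchronously) whenever Screen falls behind.
                to_screen.send(format!("line {}", i)).await.unwrap();
            }
        });

        let screen = task::spawn(async move {
            while let Ok(_line) = from_pty.recv().await {
                task::sleep(Duration::from_millis(1)).await; // simulated slow render
            }
        });

        pty.await;
        screen.await;
    });
}
```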

kxt (Contributor, Author) commented May 20, 2021

I made a proof-of-concept using a bounded to_screen queue, and I ran into a deadlock:

  • pty thread processes a lot of input, filling up the to_screen queue as soon as the screen thread pops something from it.
  • screen thread pops a ScreenInstruction::Render from to_screen and processes it with Screen::render, which calls Tab::render, which calls Pane::render. The pane happens to be a PluginPane, and its render opens a channel to the plugin, sends it a PluginInstruction::Render, and waits for the plugin's response through that channel, blocking the screen thread in the meantime.
  • However, the wasm thread that would process PluginInstruction::Render is blocked: it still hasn't finished processing a previous PluginInstruction::Update message. PluginInstruction::Update involves sending a ScreenInstruction::Render to the to_screen queue, which is already filled by the pty thread, so this send blocks.

TL;DR: pty thread has filled to_screen, wasm thread is blocked by the filled to_screen, and screen thread (which should clear up to_screen) is blocked by the blocked wasm thread.

So simply changing to_screen to a bounded queue is not a solution.

kxt (Contributor, Author) commented May 21, 2021

I found a workable way to add back pressure between pty and screen threads without triggering the mentioned deadlock.

The idea is to have a dedicated bounded channel between pty and screen, so we gain back pressure, while keeping the unbounded channel between every other component and screen, to avoid deadlocks.

This required a change from mpsc to crossbeam, as mpsc lacks a way to select between multiple Receivers.
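Roughly, the idea looks like this (a simplified sketch, not the poc patch itself; the instruction type and the capacity are made up):

```rust
// Simplified sketch: the high-volume pty path gets a bounded channel for
// back pressure, everything else keeps an unbounded channel so it can
// never deadlock on a full queue, and the screen loop selects over both.
use crossbeam_channel::{bounded, select, unbounded};
use std::thread;

enum ScreenInstruction {
    Render(String),
}

fn main() {
    let (pty_tx, pty_rx) = bounded::<ScreenInstruction>(50);
    let (other_tx, other_rx) = unbounded::<ScreenInstruction>();

    // "Pty" thread: blocks on send once the bounded queue is full.
    thread::spawn(move || {
        for i in 0..10 {
            pty_tx.send(ScreenInstruction::Render(format!("pty {}", i))).unwrap();
        }
    });

    // Any other component (plugins, input, ...) sends on the unbounded side.
    other_tx.send(ScreenInstruction::Render("plugin".into())).unwrap();

    // "Screen" thread (here: main): handles whichever channel has data.
    let mut handled = 0;
    while handled < 11 {
        select! {
            recv(pty_rx) -> msg => {
                if let Ok(ScreenInstruction::Render(s)) = msg {
                    println!("render: {}", s);
                    handled += 1;
                }
            }
            recv(other_rx) -> msg => {
                if let Ok(ScreenInstruction::Render(s)) = msg {
                    println!("render: {}", s);
                    handled += 1;
                }
            }
        }
    }
}
```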

@imsnif If this is an acceptable approach (poc patch here) I'll whip up a PR-able implementation.

imsnif (Member) commented May 22, 2021

@kxt - at a brief glance this looks like a good direction (though I haven't run it or looked too deeply at it yet). I'm good with moving forward with a PR so we can look deeper. I might want to bring in @kunalmohan to also take a look (if he has the time) since he's done a lot of work in this area.

kunalmohan (Member) commented

@kxt great work on #523! The throughput has increased by a LOT!
Regarding this issue: I personally didn't notice it in release builds (on main as well as in #523), but it is quite evident in debug builds. Your proposal of using crossbeam channels with the select feature looks like a very good solution. Since it might be some time before the rendering throughput is increased, I think we should go ahead with this. So go ahead and draft a PR and we'll discuss this further :)

kxt (Contributor, Author) commented May 25, 2021

I've created two PRs. The first one contains some refactors preparing for the fix, including switching from mpsc to crossbeam; it is ready for review and merge.

The second PR, still a draft, contains the actual fix and seems to work OK. However, some of the test cases fail unless sleeps are introduced at random points, which is not great. I wasn't able to figure out why they are necessary.

I've noticed that many test cases already contain sleeps, and there is also a generous amount of sleeping in read_from_stdin, which makes running the test cases slow.

Can someone help me figure out what's going on with the test cases? @kunalmohan ?

imsnif (Member) commented May 26, 2021

Hey @kxt - I added a review in the first PR. About the tests in the draft PR (which I think you haven't created yet? unless I missed it?): our testing infrastructure is one of the most broken parts of the code base. It was one of the first pieces of Zellij and is very long overdue for an overhaul.

Since it works by comparing snapshots of the app state (essentially what's on screen in certain situations), and the app includes some asynchronous elements, instead of creating proper synchronization we just kept adding sleeps to make it work. I am not proud of this, but I am honest about it. :) I plan on devoting attention to it very soon, after all the other urgent stuff.

So I'm afraid the best answer I can give you right now is to add sleeps where necessary to wait for data to come from the fake pty before sending the next command. (Also, there might still be some atomicity issues left where stdin is concerned, though I think we got most of those already.)

If you want further specific help with this, let me know.

kxt (Contributor, Author) commented May 26, 2021

Sorry, I did not link the draft PR properly; it's in my fork of zellij right now.

I figured that it's because the large amount of cross-thread communication makes it hard to tell whether something has already happened or not.

If you say sleeps are ok as of now, then I'll open a proper PR with those two patches.

Do you have any concrete plans on the test overhaul?

imsnif (Member) commented May 26, 2021

If you say sleeps are ok as of now, then I'll open a proper PR with those two patches.

Sounds good.

Do you have any concrete plans on the test overhaul?

An idea we've been throwing around in the team is to move whatever makes sense from the integration tests into unit tests, and turn the rest into proper end-to-end tests. Our idea is to spin up a docker container and communicate with it through ssh, thus simulating the full app cycle. This still doesn't solve the synchronization problem, but I have a feeling that's the sort of issue you solve as you work on it.

If you have any other ideas, they would of course be most welcome.

kxt added a commit to kxt/zellij that referenced this issue May 27, 2021
Pty reads a command's output and feeds it to Screen using an unbounded
queue.

However, if the command produces output faster than what `Screen` can
render, `Pty` still pushes it on the queue, causing it to grow
indefinitely, resulting in high memory usage and latency.

This patch fixes this by using a bounded queue between Pty and Screen,
so if Screen can't keep up with Pty, the queue will fill up, exerting
back pressure on Pty, making it read the command's output only as fast
as Screen renders it.

The unbounded queue is kept between Screen and producers other than Pty
to avoid a deadlock such as this scenario:
* pty thread filling up screen queue as soon as screen thread pops
  something from it
* wasm thread is processing a Render instruction, blocking on the screen
  queue
* screen thread is trying to render a plugin pane. It attempts to send a
  Render instruction to the blocked wasm thread running the plugin.

This patch also adds a generous amount of sleeps to the integration
tests as having backpressure changes the timing of how instructions are
processed.

Fixes zellij-org#525.
imsnif pushed a commit that referenced this issue May 27, 2021
* refactor(fakes): clean up add_terminal_input

* refactor(fakes): append whole buf to output_buffer in FakeStdoutWriter::write

* refactor(fakes): append whole buf to output_buffer in FakeInputOutput::write_to_tty_stdin

* fix(fakes): allow partial reads in read_from_tty_stdout

This patch fixes two bugs in read_from_tty_stdout (a sketch of the
corrected read logic follows after this commit message):
* if there was a partial read (i.e. `bytes.read_position` is not 0 but
less than `bytes.content.len()`), subsequent calls would fill `buf`
starting at index `bytes.read_position` instead of 0, leaving the range
0..`bytes.read_position` untouched.
* if `buf` was smaller than `bytes.content.len()`, a panic would occur.

* refactor(channels): use crossbeam instead of mpsc

This patch replaces mpsc with crossbeam channels because crossbeam
supports selecting on multiple channels which will be necessary in a
subsequent patch.

* refactor(threadbus): allow multiple receivers in Bus

This patch changes Bus to use multiple receivers. Method `recv` returns
data from all of them. This will be used in a subsequent patch for
receiving from bounded and unbounded queues at the same time.

* refactor(channels): remove SenderType enum

This enum has only one variant, so the entire enum can be replaced with
the innards of said variant.

* refactor(channels): remove Send+Sync trait implementations

The implementation of these traits is not necessary, as
SenderWithContext is automatically Send and Sync for every T and
ErrorContext that's Send and Sync.
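A sketch of the corrected partial-read logic described in the fix(fakes) item above (illustration only; `FakeBytes` and its fields mirror the names in the commit message, not the actual FakeInputOutput code):

```rust
// Illustration only, not the actual FakeInputOutput code; the field names
// mirror the commit message above.
struct FakeBytes {
    content: Vec<u8>,
    read_position: usize,
}

fn read_from_tty_stdout(bytes: &mut FakeBytes, buf: &mut [u8]) -> usize {
    let remaining = &bytes.content[bytes.read_position..];
    // Always copy into buf[0..] and never more than buf can hold, so a
    // partial read neither leaves the front of `buf` stale nor panics
    // when `buf` is shorter than the remaining content.
    let n = remaining.len().min(buf.len());
    buf[..n].copy_from_slice(&remaining[..n]);
    bytes.read_position += n;
    n
}

fn main() {
    let mut bytes = FakeBytes { content: b"hello world".to_vec(), read_position: 0 };
    let mut buf = [0u8; 4];
    loop {
        let n = read_from_tty_stdout(&mut bytes, &mut buf);
        if n == 0 {
            break;
        }
        println!("read chunk: {:?}", &buf[..n]);
    }
}
```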