Bump default Mplex split_send_size to 64Kbyte #802

Closed

Conversation

@dvdplm (Contributor) commented Dec 20, 2018

The default value of 1Kbyte is rather small and may impact throughput adversely. This PR proposes 64Kbyte. For comparison, the Yamux default receive_window size is 256Kbyte.
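
For concreteness, a minimal sketch of the kind of change proposed; the constant name and its location inside libp2p-mplex are assumptions for illustration, not copied from this PR's diff:

```rust
// Hypothetical name for the default: the size at which outgoing data is
// split into individual mplex frames before being written to the connection.
const DEFAULT_SPLIT_SEND_SIZE: usize = 64 * 1024; // proposed; previously 1024
```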

@ghost assigned dvdplm Dec 20, 2018
@ghost added the in progress label Dec 20, 2018
@tomaka (Member) commented Dec 20, 2018

The idea behind splitting packets into small sizes is to potentially improve interleaving of substreams. In practice, though, interleaving will most likely not happen because this is a half-baked change.

However I'd honestly prefer to figure out why mplex is so slow first, before merging any PR that will make the slowness disappear.

@dvdplm (Contributor, Author) commented Dec 20, 2018

will most likely not happen because this is a half-baked change.

Can you elaborate on this? Not sure what you mean.

However I'd honestly prefer to figure out why mplex is so slow first

I'd like to dig deeper too, but I think the test I've used so far is lacking and we need some proper benchmarks. Note that the mplex slowness in reading data stands out in debug mode; in release mode it looks fine, and reads and writes are within the same order of magnitude of each other. To me it is hard to justify spending lots of time on a performance problem that only shows up in debug mode.

@tomaka (Member) commented Dec 20, 2018

Can you elaborate on this? Not sure what you mean.

Right now, if you send one packet of 1MB on substream A and one packet of 1MB on substream B, the multiplexer will send the entire 1MB of substream A followed by the 1MB packet of substream B.

By splitting packets into smaller packets of 1kB, the idea was to send the first kilobyte of substream A, followed by the first kilobyte of substream B, followed by the second kilobyte of substream A, the second kilobyte of substream B, and so on.

However, we only do the splitting, and nothing actually interleaves the packets because that part hasn't been implemented. In other words, we will send 1024 packets of substream A followed by 1024 packets of substream B.
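
A round-robin sketch of the interleaving described above, purely for illustration (this is not the libp2p-mplex implementation; today only the splitting happens, and the frames of one substream are still sent back to back):

```rust
// Round-robin interleaving of split frames from two substreams.
// Illustrative only; not taken from libp2p-mplex.
fn interleave<'a>(a: &'a [u8], b: &'a [u8], split: usize) -> Vec<(char, &'a [u8])> {
    let mut frames = Vec::new();
    let mut chunks_a = a.chunks(split);
    let mut chunks_b = b.chunks(split);
    loop {
        match (chunks_a.next(), chunks_b.next()) {
            (None, None) => break,
            (ca, cb) => {
                // Alternate: one frame of A, then one frame of B, while either remains.
                if let Some(c) = ca { frames.push(('A', c)); }
                if let Some(c) = cb { frames.push(('B', c)); }
            }
        }
    }
    frames
}
```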

To me it is hard to justify spending lots of time on a performance problem that only shows up in debug mode.

I don't remember the figures you gave me, but it was something like 15ms to transfer 1MB in release mode. This is way too much in my opinion.
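
As a rough sanity check on that figure, a back-of-envelope sketch using the numbers quoted above (about 15 ms for 1MB) and the current 1Kbyte split; the per-frame cost is just a division, not a measurement:

```rust
fn main() {
    // Figure quoted above: roughly 15 ms to transfer 1MB in release mode.
    let transfer_ms = 15.0_f64;
    let payload_bytes: usize = 1024 * 1024;
    let split_send_size: usize = 1024;

    // With a 1Kbyte split, a 1MB write turns into this many mplex frames.
    let frames = (payload_bytes + split_send_size - 1) / split_send_size;
    println!("frames sent: {}", frames); // 1024

    // Implied average cost per frame, if framing dominated the elapsed time.
    println!("~{:.1} µs per frame", transfer_ms * 1000.0 / frames as f64); // ~14.6
}
```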

@dvdplm (Contributor, Author) commented Dec 21, 2018

I don't remember the figures you gave me, but it was something like 15ms to transfer 1MB in release mode. This is way too much in my opinion.

Ok, that is good info – I honestly have no strong opinion on what "fast" or "slow" is.

Here are some new numbers after yesterday's debugging and fixing (using the as-yet-unreleased yamux), obtained using the elapsed crate's measure_time() around the reader/writer Future (a sketch of this measurement appears at the end of this comment). Here I send 7Mbyte (compiled in release mode):

Mplex, 1024byte chunks:

[test, writer] Running the writer future took 556.02 ms
[test, reader] Running the reader future took 563.98 ms

Mplex, 8Kbyte chunks:

[test, writer] Running the writer future took 80.27 ms
[test, reader] Running the reader future took 84.02 ms

Mplex, 64Kbyte chunks:

[test, writer] Running the writer future took 18.29 ms
[test, reader] Running the reader future took 18.68 ms

Mplex, 256Kbyte chunks:

[test, writer] Running the writer future took 10.75 ms
[test, reader] Running the reader future took 10.83 ms

Yamux, receive_window at 1Mbyte:

[test, writer] Running the writer future took 40.91 ms
[test, reader] Running the reader future took 44.38 ms

Yamux, receive_window at 256Kbyte (default):

[test, writer] Running the writer future took 45.53 ms
[test, reader] Running the reader future took 45.59 ms

It's interesting to see that the read/write asymmetry disappears in release mode; it is also somewhat surprising to me that Yamux seems to have a higher overhead.
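
For reference, a minimal sketch of the measurement described above with the elapsed crate's measure_time(); the future inside the closure is a trivial stand-in, since the real test harness is not part of this thread:

```rust
extern crate elapsed;
extern crate futures;

use futures::{future, Future};

fn main() {
    // The closure is a stand-in for driving the actual mplex writer future
    // to completion (futures 0.1-style blocking wait()).
    let (elapsed, _result) = elapsed::measure_time(|| {
        future::ok::<(), ()>(()).wait()
    });
    // `elapsed` Displays as e.g. "18.29 ms".
    println!("[test, writer] Running the writer future took {}", elapsed);
}
```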

@tomaka (Member) commented Dec 22, 2018

I don't think you're using secio, but another "obvious" problem that should be looked at is that secio does an encryption and HMAC round for every single chunk of data.

I'll investigate these performance issues when I'm back from vacation, if nobody does so before then.

@tomaka (Member) commented Dec 28, 2018

I added some benchmarks to mplex on my side.
On my machine, sending one kB of data takes around 160µs, sending one MB takes around 1.3ms, and sending two MB takes around 2.2ms.

Most notably, changing the split_send_size (even to 1MB) doesn't change anything.

@dvdplm (Contributor, Author) commented Dec 28, 2018

I added some benchmarks to mplex on my side.

Can you share the benchmark somewhere so I can check on my side too?

@dvdplm (Contributor, Author) commented Dec 28, 2018

I ran tomaka/benches-mplex on my side:

running 4 tests
test connect_and_send_hello  ... bench:     330,046 ns/iter (+/- 558,511)
test connect_and_send_one_kb ... bench:     321,593 ns/iter (+/- 217,815)
test connect_and_send_one_mb ... bench:  11,599,293 ns/iter (+/- 4,913,540)
test connect_and_send_two_mb ... bench:  46,662,473 ns/iter (+/- 8,849,963)

Does that jibe with what you're seeing? Do you also get the same kind of wild variation on your machine?

Increasing the split_send_size to 64Kbyte, I get this:

running 4 tests
test connect_and_send_hello  ... bench:   2,091,373 ns/iter (+/- 1,790,809)
test connect_and_send_one_kb ... bench:   2,552,111 ns/iter (+/- 2,601,211)
test connect_and_send_one_mb ... bench:   5,915,998 ns/iter (+/- 3,047,250)
test connect_and_send_two_mb ... bench:   7,749,807 ns/iter (+/- 1,694,218)

Again the variability is decidedly weird: for the smaller payloads I get numbers anywhere between 300,000 ns and 2 million, but the speed-up for larger payloads is consistent. Not what you're seeing, I take it?

If I switch to the multi-threaded tokio Runtime, still with 64Kbyte chunks, I get better values still and far less erratic error ranges (a sketch of the runtime swap follows these numbers):

test connect_and_send_hello  ... bench:     460,012 ns/iter (+/- 192,313)
test connect_and_send_one_kb ... bench:     440,164 ns/iter (+/- 693,928)
test connect_and_send_one_mb ... bench:   1,520,596 ns/iter (+/- 480,530)
test connect_and_send_two_mb ... bench:   3,481,443 ns/iter (+/- 1,380,928)
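
A hedged sketch of the runtime swap mentioned above, using the tokio 0.1-era APIs as I understand them; the benchmarked future is a trivial stand-in:

```rust
extern crate futures;
extern crate tokio;

use futures::future;

fn main() {
    // Single-threaded, current-thread executor.
    let mut single = tokio::runtime::current_thread::Runtime::new().unwrap();
    single.block_on(future::ok::<(), ()>(())).unwrap();

    // Multi-threaded, work-stealing executor (the switch described above).
    let mut multi = tokio::runtime::Runtime::new().unwrap();
    multi.block_on(future::ok::<(), ()>(())).unwrap();
}
```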

@romanb (Contributor) commented Nov 12, 2020

Superseded by #1834.

@romanb closed this Nov 12, 2020