Introduce a circular buffer that always returns continuous chunks #105

sergiu128 · 2023-09-10T18:44:18Z

Note: the same syscalls that make the mirrored buffer possible also make Aeron possible. There's no other magic involved. Both Aeron and sonic strive to use /dev/shm.

note: work in progress. I'm planning to testdrive this on edx in 1 weekish on a machine with plenty of RAM to spare

Why do we need this?

To avoid memory allocations and copies in tcp codecs, such as websocket, http or any other exchange specific protocol. What we mean by tcp codec is best understood through an example.

Say a computer wants to communicate with us reliably. This computer sends us bytes through TCP. Now TCP only deals with reading/writing bytes, so we need to agree on a protocol with the computer to interpret those bytes. Moreover, TCP is a stream transport. A single tcp read might return 1 or 1000 bytes. We don't know ahead of time. This is in contrast with packet based transports such as UDP, where each network read will return us a single packet. A packet is at most 64KB, so we know ahead of time how much memory to allocate to accommodate any packet.

The protocol we agree on, layered on top of TCP, is simple:

- Each message has a variable length. It can be 1 byte or 1GB.
- Since the length of each message is variable, we need some information on how big the message is.
- This information is encoded in a fixed-size header of 4 bytes.
- In short, each message follows 4 bytes that carry its size: |header|variable payload|

Given the above, a single read from the network could give us the following bytes:

|2|00|4|0000|8|00001111|7|00

In the above example (each header is shown as the length it encodes) we have 3 complete messages of lengths 2, 4 and 8 and an incomplete message of length 7, of which we have only read the first 2 bytes. A further tcp read call will probably return the remaining 5 bytes of the 4th message along with some more (possibly incomplete) messages.
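As a rough illustration of the framing described above, here is a minimal Go sketch that splits one read into complete messages and reports how many trailing bytes belong to an incomplete one. The names `splitFrames` and `frame` are hypothetical, and the big-endian encoding of the 4-byte header is an assumption, since the description does not specify a byte order.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const headerLen = 4 // fixed-size header carrying the payload length

// splitFrames walks buf and returns the payload of every complete message
// plus the number of trailing bytes that belong to an incomplete message.
// Payloads alias buf; nothing is copied.
func splitFrames(buf []byte) (payloads [][]byte, leftover int) {
	for len(buf) >= headerLen {
		// Assumption: the header is a big-endian uint32.
		n := int(binary.BigEndian.Uint32(buf[:headerLen]))
		if len(buf)-headerLen < n {
			break // payload not fully received yet
		}
		payloads = append(payloads, buf[headerLen:headerLen+n])
		buf = buf[headerLen+n:]
	}
	return payloads, len(buf)
}

// frame prepends the 4-byte length header to payload.
func frame(payload []byte) []byte {
	out := make([]byte, headerLen, headerLen+len(payload))
	binary.BigEndian.PutUint32(out, uint32(len(payload)))
	return append(out, payload...)
}

func main() {
	// Rebuild the example: three complete messages and a partial fourth.
	var stream []byte
	stream = append(stream, frame(make([]byte, 2))...)
	stream = append(stream, frame(make([]byte, 4))...)
	stream = append(stream, frame(make([]byte, 8))...)
	partial := frame(make([]byte, 7))
	stream = append(stream, partial[:headerLen+2]...) // header + first 2 payload bytes

	payloads, leftover := splitFrames(stream)
	fmt.Println(len(payloads), leftover)
}
```

Unlike the simplified diagram, the leftover count here includes the 4 header bytes of the incomplete message, since those have to be kept around as well.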

Now we look at how to interpret these messages. In the above example:

- We can process the first 3 messages of lengths 2, 4 and 8.
- We can't process the 4th message yet as it is incomplete. We need to read from the network again.
- The read syscall expects a slice of bytes. Say we initially allocated a slice of 16 bytes. So far we have used 2+4+8+2 (incomplete message) = 16 bytes, counting payloads only. That means we have no space left in the current buffer. We can either:
  - allocate a bigger buffer and copy only the 2 bytes of the 4th message into it, or
  - copy the 2 bytes of the 4th message to the beginning of the current buffer, overwriting what's there. This leaves us with 14 bytes to read into.
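The second option, copying the leftover bytes to the front of the buffer, could look roughly like the following. This is only a sketch: `compact` is a hypothetical helper, and the offsets mirror the simplified 16-byte example in which headers are not counted.

```go
package main

import "fmt"

// compact moves the unread bytes buf[start:end] to the front of buf and
// returns the new end offset. buf[newEnd:] is then free for the next read.
// The built-in copy handles overlapping source and destination.
func compact(buf []byte, start, end int) int {
	return copy(buf, buf[start:end])
}

func main() {
	buf := make([]byte, 16)
	// Pretend bytes 14 and 15 hold the first two bytes of the 4th message.
	buf[14], buf[15] = 0xAA, 0xBB

	end := compact(buf, 14, 16)
	// Bytes kept at the front, and space left to read into.
	fmt.Println(end, len(buf[end:]))
}
```

The copy is cheap here, but its cost grows with the size of the incomplete message, which is the cost the mirrored buffer is meant to remove.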

But we don't want to allocate. That's expensive and unpredictable. We also don't want to copy. That's also expensive, although a bit more predictable. What if we could use a circular buffer instead?

Now, we can't use a normal circular buffer because each network call expects a contiguous slice of bytes. A circular buffer might wrap, returning two slices to read into, which is incompatible with the read/write syscalls. We also can't use a bip_buffer as TCP is stream-based, not packet-based.
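To make the wrapping problem concrete, here is a small sketch of a plain circular buffer whose free space can come back as two slices. The `ring` type and its `free` method are illustrative only and are not part of this PR.

```go
package main

import "fmt"

// ring is a plain (non-mirrored) circular buffer. All names are illustrative.
type ring struct {
	data []byte
	head int // index of the oldest unread byte
	used int // number of unread bytes
}

// free returns the writable space as up to two slices. The second slice is
// non-empty whenever the free region wraps past the end of data.
func (r *ring) free() (first, second []byte) {
	if r.used == len(r.data) {
		return nil, nil // buffer is full
	}
	tail := (r.head + r.used) % len(r.data)
	if tail < r.head {
		// Unread bytes wrap around, so the free region is contiguous.
		return r.data[tail:r.head], nil
	}
	// Free region runs to the end of data and continues at the start.
	return r.data[tail:], r.data[:r.head]
}

func main() {
	r := &ring{data: make([]byte, 16), head: 6, used: 6}
	a, b := r.free()
	// 10 bytes are free, but a single read can only be handed one slice.
	fmt.Println(len(a), len(b))
}
```

A single read call can fill only one of the two slices, so using all the free space takes either a second syscall or a vectored read.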

Given the above, we introduce a mirrored_buffer: a circular buffer that can always return a contiguous slice of bytes. This fully avoids memory allocations and copies for TCP based codecs.

Benchmarks

See BenchmarkMirroredBuffer.

BenchmarkMirroredBuffer/byte_buffer_131.1_kB_1
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_1-16                 625738              1909 ns/op              0 B/op          0 allocs/op
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_2
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_2-16                 424297              2855 ns/op              0 B/op          0 allocs/op
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_4
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_4-16                 248610              4671 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_8
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_8-16                 176844              6702 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_16
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_16-16                105042             10808 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_32
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_32-16                 57811             20493 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_64
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_64-16                 29202             38521 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_128
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_128-16                15518             75744 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_256
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_256-16                 8073            145317 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_512
BenchmarkMirroredBuffer/byte_buffer_131.1_kB_512-16                 4088            287882 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_1
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_1-16             567561              1858 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_2
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_2-16             614644              1919 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_4
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_4-16             626976              1850 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_8
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_8-16             575004              1874 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_16
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_16-16            640305              1913 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_32
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_32-16            625038              1903 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_64
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_64-16            619766              1921 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_128
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_128-16           625246              1860 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_256
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_256-16           622999              1871 ns/op               0 B/op          0 allocs/op
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_512
BenchmarkMirroredBuffer/mirrored_buffer_131.1_kB_512-16           617131              1870 ns/op               0 B/op          0 allocs/op

Docs:

Appendix

Besides allocating and copying, we can go a 3rd, extremely inefficient and mostly unpredictable route: invoke the read syscall for each header + message. For the above example, this results in 8 syscalls:

for each message, read the 4 bytes, parse it into an integer n and then read the payload of n bytes.
Syscalls are expensive and should be minimized if not totally avoided for latency critical software, such as trading systems. That's why stuff like io_uring or direct-memory-access into network cards exists.
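The per-message approach can be sketched as follows, with a counting wrapper standing in for the read syscall. Everything here is hypothetical scaffolding (`countingReader`, `readMessage`, `countReads`), and the big-endian header is again an assumption.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// countingReader counts Read calls, standing in for read syscalls.
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(p []byte) (int, error) {
	c.calls++
	return c.r.Read(p)
}

// readMessage issues one read for the header and one for the payload.
func readMessage(r io.Reader) ([]byte, error) {
	var header [4]byte
	if _, err := io.ReadFull(r, header[:]); err != nil {
		return nil, err
	}
	// Assumption: the header is a big-endian uint32.
	payload := make([]byte, binary.BigEndian.Uint32(header[:]))
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	return payload, nil
}

// countReads frames one message per length and returns how many Read calls
// it takes to read them all back one message at a time.
func countReads(lengths ...int) (int, error) {
	var stream bytes.Buffer
	for _, n := range lengths {
		var header [4]byte
		binary.BigEndian.PutUint32(header[:], uint32(n))
		stream.Write(header[:])
		stream.Write(make([]byte, n))
	}
	cr := &countingReader{r: &stream}
	for range lengths {
		if _, err := readMessage(cr); err != nil {
			return cr.calls, err
		}
	}
	return cr.calls, nil
}

func main() {
	calls, err := countReads(2, 4, 8, 7)
	fmt.Println(calls, err)
}
```

With an in-memory stream each `io.ReadFull` needs exactly one `Read`, giving two reads per message. On a real socket a read may return fewer bytes than requested, so the count can be higher. The sketch also allocates a slice per payload, which is a further cost on top of the syscalls.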

DIY

#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>

const char* name = "/mirrored_buffer_test";

int main() {
        int size = sysconf(_SC_PAGE_SIZE);
        if (size == -1) {
                perror("sysconf");
        }

        printf("page_size=%d\n", size);

        void* base_addr = mmap(NULL, 2 * size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base_addr == MAP_FAILED) {
                perror("mmap");
        }
        printf("base_addr=%p\n", base_addr);

        int fd = shm_open(name, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
        if (fd < 0) {
                perror("shm_open");
        }

        if (shm_unlink(name)) {
                perror("shmunlink");
        }

        if (ftruncate(fd, size)) {
                perror("ftruncate");
        }

        char* first_addr = (char*)base_addr;
        char* second_addr = first_addr + size;
        void* addr;

        printf("first_addr=%p\n", (void*)first_addr);
        printf("second_addr=%p\n", (void*)second_addr);

        addr = mmap((void*)first_addr, size, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
                perror("first mmap");
        }
        if ((char*)addr != first_addr) {
                exit(EXIT_FAILURE);
        }

        addr = mmap((void*)second_addr, size, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
                perror("second mmap");
        }
        if ((char*)addr != second_addr) {
                exit(EXIT_FAILURE);
        }

        if (close(fd)) {
                perror("close");
        }

        // Write some bytes in the first half of the first mapping.
        // All these writes will be visible through the second mapping.
        char* p;
        p = first_addr;
        for (size_t i = 0; i < size / 2; i++) {
                *p = 1;
                p++;
        }

        p = first_addr;
        for (size_t i = 0; i < size; i++) {
                printf("%d", *p);
                p++;
        }
        printf("\n\n");
        p = second_addr;
        for (size_t i = 0; i < size; i++) {
                printf("%d", *p);
                p++;
        }
        printf("\n");

        // Spin forever so the mappings can be inspected with pmap.
        for (;;) {
        }
}

Output:

base_addr=0x7fd41a422000
first_addr=0x7fd41a422000
second_addr=0x7fd41a423000

sudo pmap -x:

3817536:   ./a
Address           Kbytes     RSS   Dirty Mode  Mapping
000055987b965000       4       4       0 r---- a
000055987b966000       4       4       0 r-x-- a
000055987b967000       4       4       0 r---- a
000055987b968000       4       4       4 r---- a
000055987b969000       4       4       4 rw--- a
00007fdb8c600000     160     160       0 r---- libc.so.6
00007fdb8c628000    1620     736       0 r-x-- libc.so.6
00007fdb8c7bd000     352      64       0 r---- libc.so.6
00007fdb8c815000      16      16      16 r---- libc.so.6
00007fdb8c819000       8       8       8 rw--- libc.so.6
00007fdb8c81b000      52      16      16 rw---   [ anon ]
00007fdb8c858000      12       8       8 rw---   [ anon ]
00007fdb8c85f000       4       4       4 rw-s- mirrored_buffer_test (deleted) --> THIS IS THE FIRST MAPPING
00007fdb8c860000       4       0       0 rw-s- mirrored_buffer_test (deleted) --> THIS IS THE SECOND MAPPING
00007fdb8c861000       8       4       4 rw---   [ anon ]
00007fdb8c863000       8       8       0 r---- ld-linux-x86-64.so.2
00007fdb8c865000     168     168       0 r-x-- ld-linux-x86-64.so.2
00007fdb8c88f000      44      44       0 r---- ld-linux-x86-64.so.2
00007fdb8c89b000       8       8       8 r---- ld-linux-x86-64.so.2
00007fdb8c89d000       8       8       8 rw--- ld-linux-x86-64.so.2
00007ffc833f1000     132      12      12 rw---   [ stack ]
00007ffc834c5000      16       0       0 r----   [ anon ]
00007ffc834c9000       8       4       0 r-x--   [ anon ]
ffffffffff600000       4       0       0 --x--   [ anon ]
---------------- ------- ------- -------
total kB            2652    1288      92

If we just do the first mapping (MAP_ANONYMOUS | MAP_PRIVATE with PROT_NONE) then:

1348683:   ./a
Address           Kbytes     RSS   Dirty Mode  Mapping
0000558fb9458000       4       4       4 r---- a
0000558fb9459000       4       4       4 r-x-- a
0000558fb945a000       4       4       4 r---- a
0000558fb945b000       4       4       4 r---- a
0000558fb945c000       4       4       4 rw--- a
0000558fb996a000     132       4       4 rw---   [ anon ]
00007f9207400000     160     160       0 r---- libc.so.6
00007f9207428000    1620     992       0 r-x-- libc.so.6
00007f92075bd000     352      64       0 r---- libc.so.6
00007f9207615000      16      16      16 r---- libc.so.6
00007f9207619000       8       8       8 rw--- libc.so.6
00007f920761b000      52      20      20 rw---   [ anon ]
00007f9207810000      12       8       8 rw---   [ anon ]
00007f9207817000       8       0       0 -----   [ anon ] --> THIS IS THE ANONYMOUS MAPPING
00007f9207819000       8       4       4 rw---   [ anon ]
00007f920781b000       8       8       0 r---- ld-linux-x86-64.so.2
00007f920781d000     168     168       0 r-x-- ld-linux-x86-64.so.2
00007f9207847000      44      44       0 r---- ld-linux-x86-64.so.2
00007f9207853000       8       8       8 r---- ld-linux-x86-64.so.2
00007f9207855000       8       8       8 rw--- ld-linux-x86-64.so.2
00007ffcda74a000     132      12      12 rw---   [ stack ]
00007ffcda781000      16       0       0 r----   [ anon ]
00007ffcda785000       8       4       0 r-x--   [ anon ]
ffffffffff600000       4       0       0 --x--   [ anon ]
---------------- ------- ------- -------
total kB            2784    1548     108

So we can see that the two fd mappings of the shm handle replace the single anonymous one.

grddev

Looks good.

In terms of implementation, the main comment is probably that I don't understand the Head method.

I'm also having some trouble understanding the tests, as they lack names and/or comments to clarify what they are actually testing. With the lack of explicit assertions, it is also somewhat difficult to discern test setup from test execution.

grddev

Nice! I left some style comments, but they are not important.

sergiu128 linked an issue Sep 10, 2023 that may be closed by this pull request

tcp: introduce and use a mirrored buffer #104

Closed

sergiu128 marked this pull request as draft September 10, 2023 18:44

github-advanced-security bot found potential problems Sep 10, 2023

github-advanced-security bot found potential problems Sep 11, 2023

sergiu128 force-pushed the 104-tcp-introduce-and-use-a-mirrored-buffer branch from 3674675 to a7e0909 Compare September 17, 2023 12:32

github-advanced-security bot found potential problems Sep 17, 2023

github-advanced-security bot found potential problems Sep 18, 2023

github-advanced-security bot found potential problems Sep 18, 2023

sergiu128 requested review from grddev, MateuszPulka and avgoustinos-kadis September 18, 2023 13:52

sergiu128 marked this pull request as ready for review September 18, 2023 13:52

sergiu128 changed the title ~~Introduce a circular buffer that always returns contiguous chunks~~ Introduce a circular buffer that always returns continuous chunks Sep 18, 2023

sergiu128 self-assigned this Sep 18, 2023

grddev reviewed Sep 19, 2023

talostrading deleted a comment from grddev Sep 20, 2023

github-advanced-security bot found potential problems Sep 20, 2023

sergiu128 force-pushed the 104-tcp-introduce-and-use-a-mirrored-buffer branch from bb42867 to bf52226 Compare September 21, 2023 16:52

sergiu128 added 12 commits September 21, 2023 18:52

Add mirrored buffer constructor

e354b90

Add MirroredBuffer.Claim

365645d

No pointer arithmetic between uintptr

499e1c2

Just to be safe. I don't think the second chunk will get GCed either way as we hold the baseAddr already. But still.

Add MirroredBuffer Commit and Consume

91b24a0

Add MirroredBuffer.Full

db10202

Ensure mirrored buffer is not GCed

bade776

Add mirrored buffer Destroy()

2236f96

Add mirrored buffer vs byte buffer benchmarks

439f63a

Cleanup benchmarks

7549004

Go mod tidy

e78f2f5

Update benchmarks

43bd6b2

Document mirrored buffer's mmaps

c0967f7

sergiu128 and others added 22 commits September 21, 2023 18:52

Make mirrored buffer work on macos

9f7b175

Ensure mirrored buffer's file does not exist

bc81bcb

Provide the filesystem name for a mirrored buffer

eab5982

Bound mirrored buffer max size in tests

73e5331

Bound mirrored buffer benchmark to 2MB

acf19eb

Note regarding raw mmaps

6816a11

Document MAP_PRIVATE and MAP_SHARED behaviour

93ba7d2

Test mapping a huge mirrored buffer

8da5177

Notes on what to do next

6b80314

Docs

7fa5099

MirroredBuffer: no need for baseAddr

d198586

MirroredBuffer: audit unsafe calls

71d3f4c

Update bytes/mirrored_buffer.go

6b96419

Co-authored-by: Gustav <[email protected]>

MirroredBuffer: get rid of mmapRaw

a9426f7

And inline the definition in the mirrored buffer's constructor instead.

MirroredBuffer: correctly destroy on failed ctor

da2a693

MirroredBuffer: close fd if constructor fails

fde1cbb

Cleanup documentation and tests

4036e6e

MirroredBuffer: update benchmark

380546f

MirroredBuffer: no branching when modulo size

32ee87a

MirroredBuffer: created file is temporary

2f32e6f

This makes it such that we can instantiate MirroredBuffers concurrently.

MirroredBuffer: cleanup size tests

ba96918

MirroredBuffer: cleanup tests

bf52226

sergiu128 requested a review from grddev September 22, 2023 08:08

grddev previously approved these changes Sep 22, 2023

sergiu128 dismissed grddev’s stale review via 21aea21 September 22, 2023 14:00

MirroredBuffer: cleanup some more

21aea21

sergiu128 requested a review from grddev September 22, 2023 14:21

grddev approved these changes Sep 25, 2023

sergiu128 merged commit 21f29b6 into master Sep 25, 2023

sergiu128 deleted the 104-tcp-introduce-and-use-a-mirrored-buffer branch September 25, 2023 13:08
