I just learned about memory alignment from Tristan Hume's "Production Twitter on One Machine? 100Gbps NICs and NVMe are fast" and this @mitchellh thread, and I wanted to experiment to see how much of a difference it makes.
I tested:
- Iterating over an array of struct elements spanning 3 cache lines, explicitly tagged with `repr(align(64))`
- Iterating over an array of struct elements spanning 3 cache lines, not tagged with `repr(align(64))`
- Iterating over an array of 7-byte struct elements
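Concretely, the three cases look something like this. This is a sketch, not the actual benchmark code: the field layouts are my guesses, and I'm reading "spanning 3 cache lines" as 192 bytes on a machine with 64-byte cache lines.

```rust
use std::mem;

// Hypothetical stand-ins for the benchmarked structs; the real fields
// may differ, but the sizes and alignments are what matter here.

// 3 cache lines (192 bytes), explicitly aligned to a 64-byte cache line.
#[repr(align(64))]
struct Explicit {
    data: [u64; 24], // 24 * 8 = 192 bytes
}

// Same payload without the attribute; the u64 fields already give it
// a size that is a multiple of 64, so elements never straddle lines.
struct Implicit {
    data: [u64; 24],
}

// 7 bytes: neither size nor alignment lines up with cache lines, so
// consecutive elements regularly straddle cache-line boundaries.
struct NoAlignment {
    data: [u8; 7],
}

fn main() {
    assert_eq!(mem::size_of::<Explicit>(), 192);
    assert_eq!(mem::align_of::<Explicit>(), 64);
    assert_eq!(mem::size_of::<Implicit>(), 192);
    assert_eq!(mem::size_of::<NoAlignment>(), 7);
    assert_eq!(mem::align_of::<NoAlignment>(), 1);
}
```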
Results, measured with `hyperfine`:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
explicit | 3.2 ± 0.3 | 2.9 | 4.5 | 1.02 ± 0.10 |
implicit | 3.1 ± 0.1 | 2.9 | 4.5 | 1.00 |
no-alignment | 30.2 ± 0.3 | 29.8 | 32.8 | 9.68 ± 0.43 |
It turns out there's no real point in explicitly tagging structs whose size is already known to be a multiple of the cache line size. I'll have to learn more about when explicit alignment is actually useful. But it's interesting/gratifying to see that the no-alignment case is significantly slower.
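The measured loop was essentially of this shape. This is a hypothetical reconstruction: `do_work` stands in for whatever per-element operation the real benchmark ran, and the element count is made up.

```rust
// Hypothetical reconstruction of the benchmark shape: iterate a large
// Vec and do some per-element work; only the element type varies
// between the explicit / implicit / no-alignment runs.
#[repr(align(64))]
#[derive(Clone, Copy)]
struct Explicit {
    data: [u64; 24],
}

// Stand-in for the real per-element CPU work.
fn do_work(e: &Explicit) -> u64 {
    e.data.iter().sum()
}

fn main() {
    let v = vec![Explicit { data: [1; 24] }; 1_000];
    let total: u64 = v.iter().map(do_work).sum();
    // black_box keeps the compiler from optimizing the loop away.
    assert_eq!(std::hint::black_box(total), 24_000);
}
```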
I did a second round of experiments, similar to the one above, but instead of testing only the two struct sizes I tried every struct size from 1 byte to 128 bytes. Here are the results:
struct size (bytes) | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
1 | 287.7 ± 10.6 | 265.8 | 305.0 | 31.30 ± 2.66 |
2 | 184.7 ± 3.3 | 179.8 | 192.5 | 20.09 ± 1.58 |
3 | 229.2 ± 1.0 | 227.7 | 232.3 | 24.93 ± 1.91 |
4 | 198.6 ± 0.3 | 198.1 | 199.6 | 21.61 ± 1.66 |
5 | 180.7 ± 0.2 | 180.4 | 181.0 | 19.66 ± 1.51 |
6 | 168.4 ± 0.2 | 167.9 | 169.0 | 18.32 ± 1.40 |
7 | 159.9 ± 0.3 | 159.3 | 160.9 | 17.40 ± 1.33 |
8 | 45.3 ± 0.6 | 44.8 | 49.1 | 4.93 ± 0.38 |
9 | 52.1 ± 0.2 | 51.7 | 52.6 | 5.67 ± 0.43 |
10 | 57.7 ± 0.2 | 57.4 | 58.2 | 6.28 ± 0.48 |
11 | 82.1 ± 0.2 | 81.7 | 82.5 | 8.93 ± 0.68 |
12 | 84.4 ± 0.2 | 84.1 | 84.7 | 9.18 ± 0.70 |
13 | 86.8 ± 1.4 | 86.2 | 94.7 | 9.45 ± 0.74 |
14 | 88.1 ± 0.2 | 87.6 | 89.0 | 9.58 ± 0.73 |
15 | 89.5 ± 0.3 | 89.1 | 90.3 | 9.74 ± 0.75 |
16 | 29.5 ± 0.2 | 29.2 | 30.1 | 3.20 ± 0.25 |
17 | 29.6 ± 0.3 | 28.9 | 30.7 | 3.22 ± 0.25 |
18 | 33.0 ± 0.2 | 32.7 | 33.9 | 3.60 ± 0.28 |
19 | 48.3 ± 0.3 | 48.0 | 49.8 | 5.26 ± 0.40 |
20 | 51.4 ± 0.7 | 51.0 | 55.3 | 5.59 ± 0.43 |
21 | 54.2 ± 0.2 | 53.8 | 55.0 | 5.90 ± 0.45 |
22 | 56.8 ± 0.2 | 56.3 | 57.8 | 6.18 ± 0.47 |
23 | 59.0 ± 0.2 | 58.7 | 59.5 | 6.41 ± 0.49 |
24 | 33.8 ± 0.2 | 33.4 | 34.9 | 3.68 ± 0.28 |
25 | 32.5 ± 0.2 | 32.1 | 33.1 | 3.53 ± 0.27 |
26 | 35.5 ± 0.2 | 35.2 | 36.7 | 3.86 ± 0.30 |
27 | 46.4 ± 0.5 | 46.0 | 49.2 | 5.05 ± 0.39 |
28 | 48.6 ± 0.2 | 48.3 | 49.9 | 5.29 ± 0.41 |
29 | 50.5 ± 0.1 | 50.3 | 50.8 | 5.50 ± 0.42 |
30 | 52.7 ± 0.2 | 52.4 | 53.6 | 5.73 ± 0.44 |
31 | 54.4 ± 0.2 | 54.2 | 55.5 | 5.92 ± 0.45 |
32 | 29.0 ± 0.3 | 28.6 | 30.3 | 3.16 ± 0.24 |
33 | 28.2 ± 0.2 | 27.9 | 28.6 | 3.07 ± 0.24 |
34 | 30.6 ± 0.2 | 30.3 | 31.3 | 3.33 ± 0.26 |
35 | 39.2 ± 0.3 | 38.8 | 40.2 | 4.26 ± 0.33 |
36 | 41.1 ± 0.2 | 40.7 | 42.3 | 4.47 ± 0.34 |
37 | 43.0 ± 0.3 | 42.6 | 43.8 | 4.68 ± 0.36 |
38 | 44.6 ± 0.2 | 44.4 | 45.0 | 4.86 ± 0.37 |
39 | 46.4 ± 0.4 | 46.1 | 48.3 | 5.05 ± 0.39 |
40 | 26.2 ± 0.3 | 25.8 | 28.0 | 2.85 ± 0.22 |
41 | 25.6 ± 0.2 | 25.2 | 26.2 | 2.78 ± 0.21 |
42 | 27.5 ± 0.2 | 27.2 | 27.9 | 3.00 ± 0.23 |
43 | 34.5 ± 0.3 | 34.2 | 36.4 | 3.76 ± 0.29 |
44 | 36.3 ± 0.2 | 35.8 | 37.3 | 3.95 ± 0.30 |
45 | 37.8 ± 0.2 | 37.6 | 38.7 | 4.11 ± 0.32 |
46 | 39.5 ± 0.2 | 39.1 | 40.3 | 4.29 ± 0.33 |
47 | 41.1 ± 0.6 | 40.6 | 44.7 | 4.47 ± 0.35 |
48 | 24.3 ± 0.4 | 24.0 | 26.7 | 2.65 ± 0.21 |
49 | 23.9 ± 0.2 | 23.6 | 24.5 | 2.60 ± 0.20 |
50 | 25.6 ± 0.3 | 25.3 | 28.8 | 2.79 ± 0.22 |
51 | 31.5 ± 0.2 | 31.1 | 32.5 | 3.43 ± 0.26 |
52 | 33.0 ± 0.2 | 32.6 | 33.8 | 3.59 ± 0.28 |
53 | 34.5 ± 0.2 | 34.1 | 35.7 | 3.75 ± 0.29 |
54 | 35.8 ± 0.3 | 35.4 | 37.0 | 3.89 ± 0.30 |
55 | 37.2 ± 0.2 | 36.8 | 38.0 | 4.04 ± 0.31 |
56 | 22.8 ± 0.1 | 22.6 | 23.4 | 2.48 ± 0.19 |
57 | 22.5 ± 0.2 | 22.2 | 23.2 | 2.44 ± 0.19 |
58 | 24.0 ± 0.2 | 23.7 | 24.9 | 2.61 ± 0.20 |
59 | 29.2 ± 0.4 | 28.8 | 31.3 | 3.18 ± 0.25 |
60 | 30.5 ± 0.2 | 30.1 | 31.1 | 3.31 ± 0.25 |
61 | 31.8 ± 0.2 | 31.4 | 32.7 | 3.46 ± 0.27 |
62 | 33.0 ± 0.2 | 32.7 | 33.6 | 3.59 ± 0.28 |
63 | 34.2 ± 0.2 | 33.9 | 34.8 | 3.72 ± 0.29 |
64 | 9.5 ± 0.2 | 9.1 | 10.4 | 1.04 ± 0.08 |
65 | 11.3 ± 0.2 | 10.8 | 11.8 | 1.23 ± 0.10 |
66 | 12.2 ± 0.4 | 11.6 | 14.1 | 1.33 ± 0.11 |
67 | 16.5 ± 0.7 | 15.8 | 19.9 | 1.79 ± 0.16 |
68 | 17.5 ± 0.4 | 17.1 | 19.7 | 1.91 ± 0.15 |
69 | 19.3 ± 1.0 | 18.4 | 24.1 | 2.10 ± 0.19 |
70 | 20.3 ± 0.5 | 19.8 | 23.3 | 2.21 ± 0.18 |
71 | 21.4 ± 0.3 | 21.0 | 24.2 | 2.33 ± 0.18 |
72 | 10.4 ± 0.5 | 9.8 | 13.5 | 1.14 ± 0.10 |
73 | 11.0 ± 0.2 | 10.4 | 11.5 | 1.19 ± 0.09 |
74 | 11.3 ± 0.5 | 10.7 | 16.4 | 1.22 ± 0.11 |
75 | 14.4 ± 0.1 | 14.2 | 15.0 | 1.57 ± 0.12 |
76 | 15.7 ± 0.2 | 15.4 | 16.6 | 1.70 ± 0.13 |
77 | 16.9 ± 0.2 | 16.6 | 18.2 | 1.84 ± 0.14 |
78 | 18.3 ± 0.4 | 17.8 | 21.3 | 1.99 ± 0.16 |
79 | 19.4 ± 0.5 | 19.0 | 21.9 | 2.11 ± 0.17 |
80 | 10.1 ± 0.3 | 9.6 | 12.2 | 1.10 ± 0.09 |
81 | 10.8 ± 0.2 | 10.4 | 12.6 | 1.18 ± 0.09 |
82 | 11.4 ± 0.2 | 10.9 | 12.3 | 1.24 ± 0.10 |
83 | 13.2 ± 0.2 | 12.9 | 14.2 | 1.44 ± 0.11 |
84 | 14.3 ± 0.1 | 14.1 | 15.0 | 1.56 ± 0.12 |
85 | 15.4 ± 0.2 | 15.1 | 16.2 | 1.67 ± 0.13 |
86 | 16.6 ± 0.4 | 16.3 | 19.4 | 1.80 ± 0.14 |
87 | 17.6 ± 0.2 | 17.2 | 18.5 | 1.91 ± 0.15 |
88 | 11.2 ± 0.2 | 10.9 | 12.1 | 1.22 ± 0.10 |
89 | 11.3 ± 0.2 | 10.9 | 11.9 | 1.22 ± 0.10 |
90 | 12.2 ± 0.2 | 11.9 | 13.4 | 1.33 ± 0.10 |
91 | 15.5 ± 0.1 | 15.3 | 16.0 | 1.69 ± 0.13 |
92 | 16.6 ± 0.2 | 16.3 | 17.5 | 1.81 ± 0.14 |
93 | 17.5 ± 0.2 | 17.2 | 18.0 | 1.91 ± 0.15 |
94 | 18.6 ± 0.2 | 18.3 | 19.5 | 2.03 ± 0.16 |
95 | 19.5 ± 0.2 | 19.3 | 20.1 | 2.12 ± 0.16 |
96 | 11.4 ± 0.1 | 11.2 | 11.7 | 1.24 ± 0.10 |
97 | 11.5 ± 0.2 | 11.2 | 12.8 | 1.26 ± 0.10 |
98 | 12.4 ± 0.2 | 12.1 | 13.6 | 1.35 ± 0.11 |
99 | 16.0 ± 0.8 | 15.3 | 18.4 | 1.74 ± 0.16 |
100 | 17.0 ± 0.5 | 16.3 | 18.6 | 1.85 ± 0.15 |
101 | 17.7 ± 0.7 | 17.1 | 20.5 | 1.93 ± 0.17 |
102 | 19.0 ± 0.7 | 18.1 | 21.9 | 2.06 ± 0.17 |
103 | 19.6 ± 0.5 | 19.0 | 22.3 | 2.13 ± 0.17 |
104 | 11.6 ± 0.2 | 11.4 | 12.7 | 1.26 ± 0.10 |
105 | 12.1 ± 0.5 | 11.5 | 14.2 | 1.31 ± 0.11 |
106 | 13.5 ± 0.9 | 12.4 | 17.3 | 1.46 ± 0.15 |
107 | 16.3 ± 1.0 | 15.3 | 20.1 | 1.78 ± 0.17 |
108 | 17.2 ± 1.1 | 16.1 | 20.4 | 1.87 ± 0.19 |
109 | 17.3 ± 0.5 | 16.9 | 19.9 | 1.89 ± 0.15 |
110 | 18.1 ± 0.2 | 17.8 | 19.2 | 1.97 ± 0.15 |
111 | 19.0 ± 0.6 | 18.6 | 22.0 | 2.07 ± 0.17 |
112 | 12.0 ± 0.4 | 11.6 | 14.0 | 1.30 ± 0.11 |
113 | 11.9 ± 0.2 | 11.6 | 12.4 | 1.30 ± 0.10 |
114 | 13.1 ± 0.8 | 12.4 | 16.5 | 1.42 ± 0.14 |
115 | 15.4 ± 0.3 | 15.1 | 17.4 | 1.68 ± 0.13 |
116 | 16.4 ± 0.5 | 16.0 | 18.9 | 1.78 ± 0.15 |
117 | 17.7 ± 1.0 | 16.8 | 20.5 | 1.93 ± 0.19 |
118 | 17.8 ± 0.3 | 17.5 | 19.3 | 1.94 ± 0.15 |
119 | 18.6 ± 0.2 | 18.3 | 19.4 | 2.02 ± 0.16 |
120 | 12.1 ± 0.1 | 11.8 | 12.7 | 1.31 ± 0.10 |
121 | 12.1 ± 0.3 | 11.7 | 14.6 | 1.32 ± 0.11 |
122 | 13.0 ± 0.5 | 12.6 | 16.1 | 1.41 ± 0.12 |
123 | 15.5 ± 0.5 | 15.1 | 18.4 | 1.69 ± 0.14 |
124 | 16.1 ± 0.1 | 15.8 | 16.4 | 1.75 ± 0.13 |
125 | 16.9 ± 0.2 | 16.6 | 18.4 | 1.84 ± 0.14 |
126 | 18.9 ± 1.1 | 17.4 | 23.6 | 2.05 ± 0.19 |
127 | 19.5 ± 1.0 | 18.1 | 22.3 | 2.12 ± 0.20 |
128 | 9.2 ± 0.7 | 8.3 | 11.5 | 1.00 |
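One way a 1-to-128-byte sweep like this could be generated is with const generics. This is my guess at the shape, not the actual benchmark code:

```rust
use std::mem;

// A struct whose size is chosen at compile time via a const generic.
struct Element<const N: usize> {
    bytes: [u8; N],
}

fn main() {
    assert_eq!(mem::size_of::<Element<7>>(), 7);
    assert_eq!(mem::size_of::<Element<64>>(), 64);
    // The alignment stays 1, so nothing forces elements onto cache-line
    // boundaries; only sizes that are multiples of 64 line up naturally.
    assert_eq!(mem::align_of::<Element<128>>(), 1);
}
```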
After running this, though, I'm not sure the experiment actually isolates what I'm trying to test. When the struct size is 1 byte, we are allocating a vector of 1_572_864 bytes, which means the CPU-intensive operation run on each element is called far more times, and that alone could explain the increased time.

That said, the best performance comes when the array elements are 64 or 128 bytes wide, which corresponds to an integer multiple of a cache line, so there is something here. I just need to do a better job of isolating the thing I'm trying to measure. I'm not sure how to fix that yet, so I guess I need to read up more about cache lines. Until next time!
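The confound above is just arithmetic, assuming the total buffer size was held fixed (as the 1_572_864-byte figure suggests): the element count, and hence the number of calls to the per-element work, scales inversely with struct size.

```rust
// With a fixed total buffer, smaller structs mean more elements and
// therefore more calls to the per-element operation.
fn element_count(total_bytes: usize, struct_size: usize) -> usize {
    total_bytes / struct_size
}

fn main() {
    assert_eq!(element_count(1_572_864, 1), 1_572_864); // 1-byte structs
    assert_eq!(element_count(1_572_864, 64), 24_576);   // 64x fewer calls
}
```

So the 1-byte case does 64 times as much per-element work as the 64-byte case before cache effects even enter the picture.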