Optimizing performance
or: How I Learned to Stop Worrying and Love the GO
10 Jul 2015
Tags: denvergophers, sendgrid, gophercon2015
Kane Kim, Vasko Zdravevski, Isaac Saldana
SendGrid
https://github.com/Kane-Sendgrid/gopher-meetup-talk
@KaneKK
* Performance Tweaks - Lessons Learned
- SendGrid's servers process 19B+ emails/month
- 137 servers around the world running Python/Twisted to handle that load
- Rewrite in optimized Go: server count reduced from 137 -> fewer than 10
* Topics Covered
- Methodologies
- Benchmarks and results
- Tips and Tricks
* Goals
- Optimize for the hardware (network throughput, CPU, disk, memory)

    1Gb NIC, 24 cores: ~$500
    10Gb NIC, 40 cores: ~$800

- Can't utilize the bandwidth/cores = losing money
- Reach 500MB/s (4Gbps) disk writes per server
* Bottlenecks
- Garbage collection: prevents utilizing the available CPU
- io.Copy has a hidden copy and allocation if WriteTo/ReadFrom methods are missing (see the sketch below)
- Disk writes: can't enable TRIM, page size alignment
- Limiting the number of cgo calls/syscalls
- stdlib inefficiencies: http (parsing multipart forms), net/smtp.Client
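A minimal sketch of the io.Copy point, using a hypothetical message type: without a WriteTo method, io.Copy falls back to allocating an intermediate 32KB buffer; with one, the bytes go straight to the destination.

    package main

    import (
        "bytes"
        "io"
        "os"
    )

    // message is a hypothetical payload wrapper; Read makes it an io.Reader.
    type message struct{ r *bytes.Reader }

    func (m *message) Read(p []byte) (int, error) { return m.r.Read(p) }

    // WriteTo lets io.Copy hand the bytes straight to the destination,
    // skipping io.Copy's hidden buffer allocation and extra copy.
    func (m *message) WriteTo(w io.Writer) (int64, error) { return m.r.WriteTo(w) }

    func main() {
        m := &message{r: bytes.NewReader([]byte("hello\n"))}
        io.Copy(os.Stdout, m) // detects io.WriterTo and calls m.WriteTo
    }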
* Environment test
.play -numbers code/envs.go
* Garbage collection: workarounds
- Limit pointers
- Reuse objects
- Add WriteTo/ReadFrom methods
- Slice/map preallocation
- Smaller datatypes
- Flat heap structure
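A minimal sketch of two of these workarounds, slice/map preallocation and object reuse (all names hypothetical; the reused buffer is not goroutine-safe - use sync.Pool for concurrent callers):

    package main

    import "fmt"

    // scratch is a hypothetical buffer reused across calls instead of
    // being allocated per call.
    var scratch = make([]byte, 0, 64*1024)

    func countLines(lines []string) map[string]int {
        counts := make(map[string]int, len(lines)) // preallocate capacity
        buf := scratch[:0]                         // reuse the backing array
        for _, l := range lines {
            buf = append(buf, l...) // no allocation until 64KB is exceeded
            counts[l]++
        }
        return counts
    }

    func main() {
        fmt.Println(countLines([]string{"a", "b", "a"}))
    }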
* Slices:
.play -numbers code/slices/slice.go /BEGIN/,/END/
* Slices: no GC
.play -numbers code/slices/slice_nogc.go /BEGIN/,/END/
* Pipes Example 1
.play -numbers code/pipe1/pipe1.go /BEGIN/,/END/
* Pipes Example 2 setup
.code -numbers code/pipe2/pipe2.go /SETUP_START/,/SETUP_END/
* Pipes Example 2
.play -numbers code/pipe2/pipe2.go /BEGIN/,/END/
* Buffer Setup
.code -numbers code/freelist/freelist.go /SS/,/SE/
* Buffer Benchmark
.play -numbers code/freelist/freelist.go /START/,/END/
* Maps in 1.4 have a bug (issue 9477)
.play -numbers code/maps/map.go /START/,/END/
* Debugging GC problems
- GOGC setting: measure performance with GC disabled for potentially problematic code
- GODEBUG=gctrace=1 - traces garbage collection
- -gcflags=-m - analyzes heap escapes
- pprof
* GOGC
- sets the percentage of live-heap growth required to trigger GC
- higher percentage == GC runs less often, bigger heap, chance of longer STW pauses
- Go 1.5 has an improved GC mechanism: the STW phase is shorter, but the cost of the MARK phase is higher
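The same knob is available at runtime via runtime/debug; a minimal sketch:

    package main

    import (
        "fmt"
        "runtime/debug"
    )

    func main() {
        // Same effect as the GOGC env var: trigger GC only after the live
        // heap has grown 400% since the last collection (default is 100).
        old := debug.SetGCPercent(400)
        fmt.Println("previous GOGC:", old)

        // debug.SetGCPercent(-1) disables GC entirely - useful for
        // measuring suspect code with GC off, as suggested earlier.
    }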
* Reading output of GODEBUG=gctrace=1 (v1.4)
    gc4(6): 1+4+29449+34 us, 0 -> 36 MB, 418 (645-227) objects, 5 goroutines, 45/21/0 sweeps,
    14(162) handoff, 1(1) steal, 180/54/1250 yields

- gc4(6): 4 - number of the GC run since start, 6 - number of threads doing GC
- 1+4+29449+34 us: durations of the GC phases STW/SWEEP/MARK/FINISH; the stop-the-world and mark phases matter most
- 0 -> 36 MB: 0 - live heap after the previous GC, 36 - live heap before this GC
- 418 (645-227) objects: 645 allocated minus 227 freed (totals since start) = 418 objects currently in the heap
* pprof
- run net/http/pprof in production (negligible overhead)
- run tests with: -cpuprofile=test.prof
- go tool pprof --cum test.binary test.prof

    (pprof) top30
    3.33s of 3.37s total (98.81%)
    Dropped 5 nodes (cum <= 0.02s)
          flat  flat%   sum%    cum   cum%
             0     0%     0%  2.97s 88.13%  System
         2.43s 72.11% 72.11%  2.43s 72.11%  runtime.mach_semaphore_wait
         0.48s 14.24% 86.35%  0.48s 14.24%  MCentral_Grow
             0     0% 86.35%  0.33s  9.79%  GC
         0.26s  7.72% 94.07%  0.26s  7.72%  runtime.usleep
             0     0% 94.07%  0.07s  2.08%  github.com/sendgrid/go-labs/structures.TestMap
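Wiring up the production profiler from the first bullet takes one blank import; a minimal sketch:

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/ handlers on DefaultServeMux
    )

    func main() {
        // Profile with: go tool pprof http://localhost:6060/debug/pprof/profile
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }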
* Disk IO
- Writes are aligned on page size
- If possible, never write less than a page of data at a time
- Page size differs between SSD models; 16KB is currently the maximum
- Wrap file handles with buffers aligned to the page size
- SSD garbage collection: enable TRIM (unfortunately, most RAID controllers don't support it)
* Buffered disk writes and freelist pool
    // BufioWriterPool recycles page-aligned bufio.Writers via sync.Pool.
    type BufioWriterPool struct {
        pool sync.Pool
    }

    func NewBufioWriterPool() *BufioWriterPool {
        p := BufioWriterPool{}
        p.pool = sync.Pool{
            New: func() interface{} {
                // 16KB matches the largest SSD page size.
                return bufio.NewWriterSize(nil, 16*1024)
            },
        }
        return &p
    }

    // Get takes a writer from the pool and points it at w.
    func (b *BufioWriterPool) Get(w io.Writer) *bufio.Writer {
        writer := b.pool.Get().(*bufio.Writer)
        writer.Reset(w)
        return writer
    }

    // Put returns the writer to the pool for reuse.
    func (b *BufioWriterPool) Put(w *bufio.Writer) {
        b.pool.Put(w)
    }
* Wrap files with buffers
    f, err := fileCreate(m.FileName)
    if err != nil {
        return err
    }
    fileBuffer := fileBuffers.Get(f)
    defer fileBuffers.Put(fileBuffer)
    defer fileBuffer.Flush() // runs before Put (defers are LIFO)
* Linux disk caching
- sysctl -a | grep dirty
- by default on CentOS: vm.dirty_background_ratio = 10, vm.dirty_ratio = 20
- decreasing the cache: lower latency, but possibly lower overall throughput
- increasing the cache: higher throughput, but risk of long pauses and a higher risk of losing data
* CGO/syscalls
- every cgo call/syscall may potentially cause a new OS thread to be created
- keep the number of syscalls/cgo calls limited, or crash with "can't create thread"
- the default process limit is usually 1024 (ulimit -u)
- Go's OS thread limit is hardcoded to 10000
- openssl, cgo, syslog - too many syscalls
* Limit syscalls
    // threadLimit is a counting semaphore capping in-flight blocking calls.
    var threadLimit = make(chan struct{}, 1000)

    // acquireThread blocks once 1000 calls are in flight.
    func acquireThread() {
        threadLimit <- struct{}{}
    }

    func releaseThread() {
        <-threadLimit
    }

- slow, use with caution where necessary
- batch cgo calls/syscalls
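A minimal sketch of guarding a blocking call with this semaphore; net.LookupHost (assumes an import of "net") stands in for any syscall- or cgo-heavy call:

    func lookupHostLimited(host string) ([]string, error) {
        acquireThread()
        defer releaseThread()
        // With the cgo resolver, each in-flight lookup can pin an OS thread;
        // the channel above caps that at 1000.
        return net.LookupHost(host)
    }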
* stdlib
- net/smtp.Client: dotWriter is slow, iterates one character at a time, doesn't reuse buffers
- rewrote it to read whole lines and use buffer pools: 2x speedup
- TCPConn has sendfile optimization
    func (c *TCPConn) ReadFrom(r io.Reader) (int64, error) {
        if n, err, handled := sendFile(c.fd, r); handled {
            return n, err
        }
        return genericReadFrom(c, r)
    }
- removed dotWriter completely and write to disk in wire format - sendfile can be used now! +20% speedup
* stdlib
- http request parsing: doesn't reuse buffers and converts bytes to strings - more garbage, excessive copying
    func slicebytetostring(b []byte) string {
        if raceenabled && len(b) > 0 {
            racereadrangepc(unsafe.Pointer(&b[0]),
                uintptr(len(b)),
                getcallerpc(unsafe.Pointer(&b)),
                funcPC(slicebytetostring))
        }
        s, c := rawstring(len(b))
        copy(c, b)
        return s
    }
* obvious things to avoid
- string concatenation (use buffers instead; see the sketch below)
- string-to-bytes and bytes-to-string conversions (stay in bytes)
- defers in one-line functions (micro-optimization)
- connections without timeouts (net/smtp.Client)
- non-buffered IO
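For instance, a minimal sketch contrasting the first point: repeated string concatenation vs. a single bytes.Buffer.

    package main

    import (
        "bytes"
        "fmt"
    )

    func joinSlow(parts []string) string {
        s := ""
        for _, p := range parts {
            s += p // allocates a new string every iteration
        }
        return s
    }

    func joinFast(parts []string) string {
        var buf bytes.Buffer // one growing buffer
        for _, p := range parts {
            buf.WriteString(p)
        }
        return buf.String() // a single final copy
    }

    func main() {
        parts := []string{"a", "b", "c"}
        fmt.Println(joinSlow(parts) == joinFast(parts)) // true
    }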
* SendGrid is HIRING!!!
.link https://sendgrid.com/careers
4 different offices:
- Denver, Colorado
- Boulder, Colorado
- Orange County, California
- Redwood City, California
Hiring for all levels and positions
*Come* *work* *with* *us!*