server+util: Limit max request sizes, prealloc request buffers #6868

Conversation

@philipaconrad philipaconrad commented Jul 10, 2024

Why are the changes in this PR needed?

In a recent bugfix PR, I discovered some performance issues around how request bodies are handled in OPA, particularly when the request is gzipped. This PR is a follow-up to that one, and changes how OPA reads request bodies.

The main performance benefits of this PR come from reducing the workload of the Go garbage collector and from removing redundant Read operations on the request body.

What are the changes in this PR?

This PR preallocates the entire buffer for a request body (and for its decompressed payload) when possible. For large requests, this reduces max RSS by around 5-10% and GC overhead by 33-66%, without penalizing small (<1 KB) requests.
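
As a rough illustration of the preallocation idea (a sketch of the general technique, not the exact helper added in this PR): when Content-Length is known, the whole buffer can be reserved in one allocation instead of letting io.ReadAll grow and re-copy it.

package main

import (
    "bytes"
    "io"
    "net/http"
)

// readBodyPrealloc is a hypothetical helper for illustration: it reads the
// full request body, reserving the buffer up front when the request
// advertises its size.
func readBodyPrealloc(r *http.Request) ([]byte, error) {
    if r.ContentLength < 0 {
        // Unknown length (e.g. chunked encoding): fall back to normal growth.
        return io.ReadAll(r.Body)
    }
    buf := bytes.NewBuffer(make([]byte, 0, int(r.ContentLength)))
    if _, err := io.Copy(buf, r.Body); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}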

The changes are structured as follows:

  • docs/: Updated for the new config options.
  • plugins/server/decoding: New always-on plugin, modeled after the server/encoding plugin.
    • Includes new handler type for the OPA HTTP server.
  • server/server.go: Integrates the new request handler + uses the request body reading function.
  • util/read_gzip_body.go: Allocates whole buffers up front now, and checks gzip blob sizes before decompressing.

Other small changes were needed in a variety of spots to account for new method signatures, and so on.

New configuration options: server.decoding

This PR introduces two new server config fields under server.decoding. Here's an example configuration file:

# Only allow up to 128 MB request bodies, and up to 256 MB decompressed gzip payloads.
server:
  decoding:
    max_length: 134217728
    gzip:
      max_length: 268435456

These fields keep in line with the naming scheme from server.encoding, but I'm open to other ideas! 😅

Defending against huge requests

On main currently, an attacker can tie up a huge amount of resources for an OPA instance with a large, malicious request. Gzipped payloads amplify the resources that can ultimately be tied up. (Example: a 1 GB JSON object with repetitive values can compress down to 5 MB or smaller with gzip.)

In this PR, buffers are allocated all at once, which is hazardous on its own: an attacker could submit a huge request and cause OPA to hit an out-of-memory (OOM) error just from allocating the buffers for that request.

In the case of gzip payloads, the raw payload size may be small, but with a forged gzip size trailer field, the attack could cause the buffer allocation logic to believe that the gzip blob will expand to up to 4 GB in size when decompressed! This makes preallocating the decompression buffer a dangerous thing to do without some sort of validation beforehand.

In this PR, the proposed solution is validating the size of incoming request bodies against server-configured limits before allocating any buffers. That forces attackers in both scenarios to have to work a bit harder to do anything nefarious.
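
To make that concrete, here's a minimal sketch of that kind of guard (the actual DecodingLimits handler in this PR has more plumbing; maxBodyBytes is an assumed parameter, not the real config wiring): reject requests whose declared Content-Length already exceeds the limit, and cap the reader for everything else so an oversized body that lied about its length still cannot be read past the limit.

package main

import "net/http"

// limitRequestBodies is a simplified stand-in for a request-size-limiting
// handler, shown for illustration only.
func limitRequestBodies(next http.Handler, maxBodyBytes int64) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Cheap rejection: the client already declared an oversized body,
        // so no buffer ever gets allocated for it.
        if r.ContentLength > maxBodyBytes {
            http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
            return
        }
        // Defense in depth: cap reads even if Content-Length was absent or wrong.
        r.Body = http.MaxBytesReader(w, r.Body, maxBodyBytes)
        next.ServeHTTP(w, r)
    })
}

A nice property of MaxBytesReader is that once the limit is exceeded it returns an error and asks the ResponseWriter to close the connection, so a misbehaving client can't keep streaming data at the server.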

Where are the new validation checks?

The defenses against huge requests happen in two places:

  • For large request payloads, these are checked in the new DecodingLimits handler in the OPA HTTP server. This allows us to reject those requests before they can significantly tie up an OPA server's resources.
  • For large gzip payloads, these are checked in the util.ReadMaybeCompressedBody function (used whenever we want to read the request body contents), which checks the gzip blob's 4-byte size trailer field against the server's configured size limits.
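
For reference, a minimal sketch of that trailer check: per RFC 1952, the last four bytes of a gzip stream hold the uncompressed size modulo 2^32 as a little-endian uint32. The helper name and error text below are illustrative, not the PR's exact code.

package main

import (
    "encoding/binary"
    "fmt"
)

// checkGzipSizeTrailer validates the size a gzip blob claims it will
// decompress to, before any decompression buffer is allocated. The trailer
// can be forged (and is only the size modulo 2^32), which is exactly why it
// is compared against a configured limit rather than trusted outright.
func checkGzipSizeTrailer(blob []byte, maxDecompressedBytes uint32) error {
    if len(blob) < 4 {
        return fmt.Errorf("gzip payload too short to contain a size trailer")
    }
    declared := binary.LittleEndian.Uint32(blob[len(blob)-4:])
    if declared > maxDecompressedBytes {
        return fmt.Errorf("gzip payload declares %d decompressed bytes, limit is %d",
            declared, maxDecompressedBytes)
    }
    return nil
}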

Notes to assist PR review:

I've moved benchmark sets into comments in the PR discussion thread, since this top PR description was getting a bit crowded. 😅

Scripts used:

summarize-gctrace.awk
# AWK script for summarizing GODEBUG=gctrace=1 logs.
# Known to work for Go 1.22, but the log format is subject to change.
# The format is the following (pulled from the Go `runtime` package docs):
#   gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # MB stacks, #MB globals, # P
#                ^$5             ^$10                                                               ^$28
# Example:
#   gc 11 @3.169s 1%: 0.025+509+0.017 ms clock, 0.10+0/261/0.81+0.068 ms cpu, 3585->3585->2561 MB, 3585 MB goal, 0 MB stacks, 0 MB globals, 4 P

BEGIN {
    FS = "[\t +/]+"
}

# Some simple assertions to prevent matching non-gctrace logs.
$1 != "gc" { non_matching_rows++; next; }
$9 == "cpu" { non_matching_rows++; next; }

# The main pattern-action block:
{
    num_threads = $28

    clock_stw_sweep_term += $5
    clock_conc_mark_scan += $6
    clock_stw_mark_term += $7

    cur_clock = $5 + $6 + $7
    total_clock += cur_clock
    max_pause_clock = (max_pause_clock > cur_clock) ? max_pause_clock : cur_clock

    cpu_ms_stw_sweep_term += $10
    cpu_ms_assist_gc += $11
    cpu_ms_background_gc += $12
    cpu_ms_idle_gc += $13
    cpu_ms_stw_mark_term += $14
    total_cpu += $11 + $12 + $13
}

END {
    print  "--- Wall clock time summary -------------------"
    printf "  STW sweep termination ......... %10.3f ms\n", clock_stw_sweep_term
    printf "  Concurrent mark and scan ...... %10.3f ms\n", clock_conc_mark_scan
    printf "  STW mark termination .......... %10.3f ms\n", clock_stw_mark_term
    printf "                          Total = %10.3f ms (%.3f s)\n", total_clock, total_clock / 1000
    printf "               Longest GC Pause = %10.3f ms\n", max_pause_clock
    print  ""
    print  "--- CPU time breakdown ------------------------"
    printf "  STW sweep termination ......... %10.3f ms\n", cpu_ms_stw_sweep_term
    printf "    Mark/Scan assist time ....... %10.3f ms\n", cpu_ms_assist_gc
    printf "    Mark/Scan background GC time  %10.3f ms\n", cpu_ms_background_gc
    printf "    Mark/Scan idle GC time ...... %10.3f ms\n", cpu_ms_idle_gc
    printf "  STW mark termination .......... %10.3f ms\n", cpu_ms_stw_mark_term
    printf "                          Total = %10.3f ms (%.3f s) (Does not include STW phases)\n", total_cpu, total_cpu / 1000
    print  ""
    print  "Num threads:", num_threads
    print  "Non-matching rows:", non_matching_rows
}
geninput.py
N = 1024 * 1024
oneKBString = "A" * 1024

if __name__ == "__main__":
    print('{"input":{')
    for i in range(0, N):
        print('"{}": "{}",'.format(i, oneKBString))
    print('"{}": "{}"'.format(N, oneKBString))
    print("}}")

Further comments:

Remaining TODO items:

  • Add draft release notes
  • Add more tests?
  • Add Go benchmark?

Things I'd love reviewer comments/opinions on:

  • How big should our default max request size be? How about the max gzip payload size?
  • Could the error messages be improved?
  • Could the config keys have better names?

@philipaconrad philipaconrad added the optimization, performance, and go (Pull requests that update Go code) labels on Jul 10, 2024
@philipaconrad philipaconrad self-assigned this Jul 10, 2024

netlify bot commented Jul 10, 2024

Deploy Preview for openpolicyagent ready!

Name Link
🔨 Latest commit db57672
🔍 Latest deploy log https://app.netlify.com/sites/openpolicyagent/deploys/669e86d42e52d7000883fef4
😎 Deploy Preview https://deploy-preview-6868--openpolicyagent.netlify.app


philipaconrad commented Jul 11, 2024

1 GB gzipped input POST benchmark

Summary:

  • Around a 100 MB reduction in max RSS.
  • Allocs graph indicates a near 50% reduction in allocated memory, but much of that is almost certainly virtual memory that is not showing up as part of the RSS.
  • Around a 66% reduction in GC time.
  • Fewer GC collections (main: 31x, PR: 8x)

CLI command for request: python3 geninput.py | gzip | curl --verbose -X POST -H "Content-Encoding: gzip" --data-binary @- http://127.0.0.1:8181/v1/data

Results from main
gotraceui rendering from main

Purple indicates GC activity.


GC trace from main
$ GOMAXPROCS=4 GODEBUG=gctrace=1 /usr/bin/time -v ./opa-main run -s -c config.yaml 2> >(tee >(awk -f summarize-gctrace.awk >&2) >&2)
gc 1 @0.109s 2%: 0.093+3.5+1.4 ms clock, 0.37+3.1/0.94/3.9+5.9 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 2 @0.123s 2%: 0.028+2.5+0.054 ms clock, 0.11+0.19/1.5/1.7+0.21 ms cpu, 3->4->3 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"addrs":[":8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server. OPA is running on a public (0.0.0.0) network interface. Unless you intend to expose OPA outside of the host, binding to the localhost interface (--addr localhost:8181) is recommended. See https://www.openpolicyagent.org/docs/latest/security/#interface-binding","time":"2024-07-11T01:15:29-04:00"}
gc 3 @0.353s 1%: 0.030+1.7+0.019 ms clock, 0.12+0.13/1.5/3.0+0.076 ms cpu, 7->7->4 MB, 7 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"client_addr":"127.0.0.1:53020","level":"info","msg":"Received request.","req_id":1,"req_method":"POST","req_path":"/v1/data","time":"2024-07-11T01:15:42-04:00"}
gc 4 @13.343s 0%: 0.034+0.99+0.040 ms clock, 0.13+0.29/0.46/0.96+0.16 ms cpu, 8->9->5 MB, 9 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 5 @13.346s 0%: 0.026+0.94+0.053 ms clock, 0.10+0/0.90/1.1+0.21 ms cpu, 9->9->5 MB, 11 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 6 @13.348s 0%: 0.019+1.6+0.013 ms clock, 0.077+0/0.94/2.2+0.053 ms cpu, 11->11->6 MB, 11 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 7 @13.351s 0%: 0.029+3.1+0.013 ms clock, 0.11+0/0.98/3.3+0.055 ms cpu, 14->14->8 MB, 14 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 8 @13.355s 0%: 0.025+3.9+0.005 ms clock, 0.10+0/3.8/1.9+0.023 ms cpu, 19->19->10 MB, 19 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 9 @13.362s 0%: 0.034+6.1+0.016 ms clock, 0.13+0/0.92/6.6+0.066 ms cpu, 28->28->14 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 10 @13.370s 0%: 0.024+9.3+0.015 ms clock, 0.097+0/0.89/9.9+0.060 ms cpu, 42->42->19 MB, 42 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 11 @13.381s 0%: 0.028+2.4+0.012 ms clock, 0.11+0/2.3/2.0+0.051 ms cpu, 39->39->23 MB, 40 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 12 @13.385s 0%: 0.021+2.3+0.015 ms clock, 0.086+0/0.93/3.1+0.062 ms cpu, 48->48->28 MB, 48 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 13 @13.391s 0%: 0.029+16+0.014 ms clock, 0.11+0/1.0/17+0.057 ms cpu, 59->59->34 MB, 59 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 14 @13.411s 0%: 0.039+3.9+0.012 ms clock, 0.15+0/0.89/4.5+0.049 ms cpu, 73->73->42 MB, 73 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 15 @13.419s 0%: 0.021+10+0.013 ms clock, 0.084+0/10/1.1+0.055 ms cpu, 90->90->52 MB, 90 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 16 @13.433s 0%: 0.015+32+0.012 ms clock, 0.062+0/0.77/33+0.050 ms cpu, 112->112->64 MB, 112 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 17 @13.470s 0%: 0.017+8.5+0.005 ms clock, 0.069+0/0.82/9.0+0.020 ms cpu, 139->139->79 MB, 139 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 18 @13.485s 0%: 0.028+17+0.013 ms clock, 0.11+0/0.87/18+0.053 ms cpu, 173->173->98 MB, 173 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 19 @13.512s 0%: 0.030+62+0.012 ms clock, 0.12+0/0.91/62+0.048 ms cpu, 215->215->121 MB, 215 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 20 @13.586s 0%: 0.026+26+0.014 ms clock, 0.10+0/0.94/27+0.058 ms cpu, 268->268->151 MB, 268 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 21 @13.625s 0%: 0.063+39+0.013 ms clock, 0.25+0/39/1.7+0.054 ms cpu, 334->334->187 MB, 334 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 22 @13.679s 0%: 0.037+122+0.012 ms clock, 0.15+0/0.86/123+0.049 ms cpu, 417->417->233 MB, 417 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 23 @13.819s 0%: 0.028+30+0.013 ms clock, 0.11+0/1.5/32+0.055 ms cpu, 520->520->291 MB, 520 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 24 @13.872s 0%: 0.033+70+0.016 ms clock, 0.13+0/70/1.3+0.064 ms cpu, 649->649->362 MB, 649 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 25 @13.972s 0%: 0.032+236+0.052 ms clock, 0.13+0/0.90/237+0.20 ms cpu, 810->811->452 MB, 811 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 26 @14.242s 0%: 0.029+48+0.056 ms clock, 0.11+0/1.7/49+0.22 ms cpu, 1012->1012->564 MB, 1012 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 27 @14.333s 0%: 0.029+128+0.012 ms clock, 0.11+0/0.79/129+0.049 ms cpu, 1264->1264->704 MB, 1264 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 28 @14.513s 0%: 0.027+443+0.011 ms clock, 0.10+0/0.76/443+0.045 ms cpu, 1580->1580->879 MB, 1580 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 29 @15.026s 0%: 0.036+91+0.055 ms clock, 0.14+0/91/3.1+0.22 ms cpu, 1973->1973->1098 MB, 1974 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 30 @17.211s 0%: 0.12+1.2+0.13 ms clock, 0.50+0.15/0.93/1.0+0.54 ms cpu, 3146->3146->2634 MB, 3146 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 31 @20.638s 0%: 0.029+69+0.052 ms clock, 0.11+0.12/68/109+0.21 ms cpu, 5265->5286->3708 MB, 5269 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"client_addr":"127.0.0.1:53020","level":"info","msg":"Sent response.","req_id":1,"req_method":"POST","req_path":"/v1/data","resp_bytes":14,"resp_duration":9793.283196,"resp_status":200,"time":"2024-07-11T01:15:52-04:00"}
{"level":"info","msg":"Shutting down...","time":"2024-07-11T01:16:02-04:00"}
{"level":"info","msg":"Server shutdown.","time":"2024-07-11T01:16:02-04:00"}
	Command being timed: "./opa-main run -s -c config.yaml"
	User time (seconds): 10.15
	System time (seconds): 1.80
	Percent of CPU this job got: 34%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:34.27
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 4446840
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 337
	Minor (reclaiming a frame) page faults: 1125954
	Voluntary context switches: 6202
	Involuntary context switches: 159
	Swaps: 0
	File system inputs: 52520
	File system outputs: 2968
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
--- Wall clock time summary -------------------
  STW sweep termination .........      1.059 ms
  Concurrent mark and scan ......   1490.930 ms
  STW mark termination ..........      2.172 ms
                          Total =   1494.161 ms (1.494 s)
               Longest GC Pause =    443.038 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........      4.185 ms
    Mark/Scan assist time .......      3.980 ms
    Mark/Scan background GC time     307.700 ms
    Mark/Scan idle GC time ......   1340.360 ms
  STW mark termination ..........      9.022 ms
                          Total =   1652.040 ms (1.652 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 28
Allocs graph for main

allocs-main

Heap graph for main

heap-main

Results from this PR
gotraceui rendering from this PR

Purple indicates GC activity.


GC trace from this PR
$ GOMAXPROCS=4 GODEBUG=gctrace=1 /usr/bin/time -v ./opa-pr run -s -c config.yaml 2> >(tee >(awk -f summarize-gctrace.awk >&2) >&2)
gc 1 @0.010s 1%: 0.020+0.65+0.014 ms clock, 0.080+0.24/0.52/0.90+0.058 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"addrs":[":8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server. OPA is running on a public (0.0.0.0) network interface. Unless you intend to expose OPA outside of the host, binding to the localhost interface (--addr localhost:8181) is recommended. See https://www.openpolicyagent.org/docs/latest/security/#interface-binding","time":"2024-07-11T01:19:31-04:00"}
gc 2 @0.396s 0%: 0.046+1.4+0.026 ms clock, 0.18+0.38/1.2/2.3+0.10 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 3 @0.409s 0%: 0.041+0.91+0.006 ms clock, 0.16+0.18/0.73/1.8+0.025 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 4 @0.417s 0%: 0.044+1.2+0.024 ms clock, 0.17+0.13/1.0/1.9+0.098 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"client_addr":"127.0.0.1:36074","level":"info","msg":"Received request.","req_id":1,"req_method":"POST","req_path":"/v1/data","time":"2024-07-11T01:19:41-04:00"}
gc 5 @10.248s 0%: 0.032+0.57+0.003 ms clock, 0.13+0.11/0.54/1.2+0.014 ms cpu, 7->7->7 MB, 7 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 6 @10.251s 0%: 0.023+0.60+0.047 ms clock, 0.093+0.096/0.51/0.99+0.18 ms cpu, 1045->1045->1045 MB, 1045 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 7 @13.034s 0%: 0.065+528+0.014 ms clock, 0.26+0/268/1.5+0.056 ms cpu, 3093->3093->2575 MB, 3093 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 8 @16.677s 0%: 0.038+54+0.011 ms clock, 0.15+0.080/53/105+0.044 ms cpu, 5148->5158->3600 MB, 5152 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"client_addr":"127.0.0.1:36074","level":"info","msg":"Sent response.","req_id":1,"req_method":"POST","req_path":"/v1/data","resp_bytes":14,"resp_duration":9027.930343,"resp_status":200,"time":"2024-07-11T01:19:50-04:00"}
{"level":"info","msg":"Shutting down...","time":"2024-07-11T01:19:55-04:00"}
{"level":"info","msg":"Server shutdown.","time":"2024-07-11T01:19:55-04:00"}
	Command being timed: "./opa-pr run -s -c config.yaml"
	User time (seconds): 8.36
	System time (seconds): 1.41
	Percent of CPU this job got: 39%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 4316048
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1075120
	Voluntary context switches: 3728
	Involuntary context switches: 63
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
--- Wall clock time summary -------------------
  STW sweep termination .........      0.309 ms
  Concurrent mark and scan ......    587.330 ms
  STW mark termination ..........      0.145 ms
                          Total =    587.784 ms (0.588 s)
               Longest GC Pause =    528.079 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........      1.223 ms
    Mark/Scan assist time .......      1.216 ms
    Mark/Scan background GC time     325.500 ms
    Mark/Scan idle GC time ......    115.590 ms
  STW mark termination ..........      0.575 ms
                          Total =    442.306 ms (0.442 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 28
Allocs graph from this PR

allocs-pr

Heap graph from this PR

heap-pr


philipaconrad commented Jul 11, 2024

ℹ️ I discovered after putting up the draft PR that using a sync.Pool for the gzip.Reader instances provides a surprising performance boost (~9% RPS improvement!), mostly by avoiding needless allocations + GC later. (Benchmarks will need to be updated for this. 😅 )
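
For context, the usual shape of that optimization looks roughly like this (a sketch of the pattern, not the PR's exact code): keep *gzip.Reader values in a sync.Pool and Reset them per request, so the reader's internal decompression state is reused instead of reallocated and garbage-collected.

package main

import (
    "bytes"
    "compress/gzip"
    "io"
    "sync"
)

// gzipReaderPool reuses gzip.Reader instances across requests.
var gzipReaderPool = sync.Pool{
    New: func() any { return new(gzip.Reader) },
}

// decompressWithPool decompresses a gzip blob using a pooled reader.
func decompressWithPool(blob []byte) ([]byte, error) {
    zr := gzipReaderPool.Get().(*gzip.Reader)
    defer gzipReaderPool.Put(zr)

    // Reset points the reader at the new input without reallocating its
    // internal state.
    if err := zr.Reset(bytes.NewReader(blob)); err != nil {
        return nil, err
    }
    defer zr.Close()
    return io.ReadAll(zr)
}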

@philipaconrad philipaconrad marked this pull request as ready for review July 15, 2024 18:02
@philipaconrad philipaconrad force-pushed the philip/server-decoding-gzip branch from 88acc1a to 2f79bcb on July 18, 2024 15:08
@ashutosh-narkar ashutosh-narkar (Member) left a comment

This looks great @philipaconrad! Thanks for the detailed issue description. Few initial thoughts 👇

docs/content/configuration.md (resolved)
server/server.go (resolved)
return r, nil, err
}
rawBody, err = io.ReadAll(plaintextBody)
rawBody, err = util.ReadMaybeCompressedBody(r)
Member:

This would be backwards-incompatible. It would be good to avoid this.

Member:

Probably we can define a new method.

@philipaconrad philipaconrad (Contributor Author) Jul 19, 2024

@ashutosh-narkar Could you elaborate on the backwards-incompatible bits?

As far as I can tell, the updated util.ReadMaybeCompressedBody method produces the same end result types as before, and only generates new errors if the gzip payload is larger than the configured size limit. 🤔

@ashutosh-narkar ashutosh-narkar (Member) left a comment

The changes look fine to me. Few more comments and we can then get this in. Thanks!

server/server.go (outdated, resolved)
return r, nil, err
}
rawBody, err = io.ReadAll(plaintextBody)
rawBody, err = util.ReadMaybeCompressedBody(r)
Member:

Probably we can define a new method.

server/server_test.go (outdated, resolved)
util/decoding/context.go (outdated, resolved)
@philipaconrad philipaconrad (Contributor Author)

Locust Max RPS benchmarks (plain, gzip)

Summary:

  • When hammered with small requests, the PR increases Max RPS by around 9% for gzipped requests.
  • For gzipped requests, there's also a noticeable reduction in GC traffic.

CLI command for each benchmark:

  • ./benchmark.sh opa-main main: Runs a Locust benchmark against OPA built from main, then re-runs with gzipped requests.
  • ./benchmark.sh opa-pr pr: Runs a Locust benchmark against OPA built from this PR, then re-runs with gzipped requests.

Benchmarking scripts:


Results from main
  • Avg Req/s: 7728.26
GC trace from main
--- Wall clock time summary -------------------
  STW sweep termination .........    121.978 ms
  Concurrent mark and scan ......   1341.990 ms
  STW mark termination ..........     21.128 ms
                          Total =   1485.096 ms (1.485 s)
               Longest GC Pause =      7.058 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........    249.780 ms
    Mark/Scan assist time .......    118.346 ms
    Mark/Scan background GC time     460.340 ms
    Mark/Scan idle GC time ......     58.684 ms
  STW mark termination ..........     34.603 ms
                          Total =    637.370 ms (0.637 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 310148
Allocs graph for main

allocs-main

Heap graph for main

heap-main

Results from this PR
  • Avg Req/s: 7913.60
GC trace from this PR
--- Wall clock time summary -------------------
  STW sweep termination .........    111.746 ms
  Concurrent mark and scan ......   1293.190 ms
  STW mark termination ..........     22.368 ms
                          Total =   1427.304 ms (1.427 s)
               Longest GC Pause =      6.691 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........    216.280 ms
    Mark/Scan assist time .......    126.283 ms
    Mark/Scan background GC time     451.660 ms
    Mark/Scan idle GC time ......     51.504 ms
  STW mark termination ..........     28.746 ms
                          Total =    629.447 ms (0.629 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 313388
Allocs graph from this PR

allocs-pr

Heap graph from this PR

heap-pr


Results from main (gzip)
  • Avg Req/s: 6337.26
GC trace from main (gzip)
--- Wall clock time summary -------------------
  STW sweep termination .........    228.992 ms
  Concurrent mark and scan ......   2347.050 ms
  STW mark termination ..........     38.788 ms
                          Total =   2614.830 ms (2.615 s)
               Longest GC Pause =      9.204 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........    458.590 ms
    Mark/Scan assist time .......    242.800 ms
    Mark/Scan background GC time     853.370 ms
    Mark/Scan idle GC time ......     60.655 ms
  STW mark termination ..........     58.624 ms
                          Total =   1156.825 ms (1.157 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 254518
Allocs graph for main (gzip)

allocs-main-gzip

Heap graph for main (gzip)

heap-main-gzip

Results from this PR (gzip)
  • Avg Req/s: 6961.82
GC trace from this PR (gzip)
--- Wall clock time summary -------------------
  STW sweep termination .........    139.031 ms
  Concurrent mark and scan ......   1314.770 ms
  STW mark termination ..........     24.905 ms
                          Total =   1478.706 ms (1.479 s)
               Longest GC Pause =      9.905 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........    270.248 ms
    Mark/Scan assist time .......    101.513 ms
    Mark/Scan background GC time     429.460 ms
    Mark/Scan idle GC time ......     40.449 ms
  STW mark termination ..........     46.631 ms
                          Total =    571.422 ms (0.571 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 279382
Allocs graph from this PR (gzip)

allocs-pr-gzip

Heap graph from this PR (gzip)

heap-pr-gzip


philipaconrad commented Jul 19, 2024

ℹ️ @ashutosh-narkar I'm going to squash down my changes and rebase off main, in preparation for merging.

If you have suggestions/edits for the Changelog notes, I'm happy to amend them as needed! 🙂

@philipaconrad philipaconrad force-pushed the philip/server-decoding-gzip branch 2 times, most recently from 618906d to 91d12c5 on July 19, 2024 22:19
@philipaconrad philipaconrad force-pushed the philip/server-decoding-gzip branch from 91d12c5 to db57672 on July 22, 2024 16:20

philipaconrad commented Jul 22, 2024

ℹ️ Rebasing through the GitHub UI...
Just saw the feedback on removing an unused package function. I'll push up that edit shortly.

This commit introduces a few major changes:
 - (Breaking change) Limits now exist for maximum request body sizes.
 - Buffers are preallocated for reading request bodies.
 - Buffers are preallocated for decompressing request bodies.
 - Gzip decoder instances are reused in a `sync.Pool` across requests.

The effect on garbage collection is dramatically fewer GC pauses, giving
a roughly 9% RPS improvement in load tests with gzipped request bodies.
For larger request sizes, the number of GC pauses is dramatically
reduced, although the peak pause time may increase by a few percent.

Implementation notes:
 - The DecodingLimits handler enforces the max request body size both
   through a Content-Length check, and a MaxBytesReader wrapper around
   the payload.
 - The DecodingLimits handler passes the gzip payload size limit down
   using a context key.

Signed-off-by: Philip Conrad <[email protected]>
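
The context-key plumbing mentioned in the implementation notes follows the standard net/http pattern; a hedged sketch (type and function names here are illustrative, not the identifiers used in the PR):

package main

import (
    "context"
    "net/http"
)

// gzipLimitKey is an unexported key type, so no other package can collide
// with this context value.
type gzipLimitKey struct{}

// withGzipLimit attaches the configured gzip decompression limit to the
// request, so the body-reading helper can enforce it later.
func withGzipLimit(r *http.Request, limit int64) *http.Request {
    return r.WithContext(context.WithValue(r.Context(), gzipLimitKey{}, limit))
}

// gzipLimitFrom retrieves the limit; ok is false when no limit was set.
func gzipLimitFrom(ctx context.Context) (limit int64, ok bool) {
    limit, ok = ctx.Value(gzipLimitKey{}).(int64)
    return limit, ok
}
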
@philipaconrad philipaconrad force-pushed the philip/server-decoding-gzip branch from 01cecba to 6aa1f51 on July 22, 2024 16:50
@philipaconrad philipaconrad merged commit c5706ee into open-policy-agent:main Jul 22, 2024
28 checks passed
philipaconrad added a commit to philipaconrad/opa that referenced this pull request Jul 31, 2024
This commit fixes a request handling bug introduced in open-policy-agent#6868, which
caused OPA to treat all incoming chunked requests as if they had
zero-length request bodies.

The fix detects cases where the request body size is unknown in the
DecodingLimits handler, and propagates a request context key down to
the `util.ReadMaybeCompressedBody` function, allowing it to correctly
select between using the original `io.ReadAll` style for chunked
requests, or the newer preallocated buffers approach (for requests of
known size).

This change has a small, but barely visible performance impact for large
requests (<5% increase in GC pauses for a 1GB request JSON blob), and
minimal, if any, effect on RPS under load.

Fixes: open-policy-agent#6904

Signed-off-by: Philip Conrad <[email protected]>
philipaconrad added a commit to philipaconrad/opa that referenced this pull request Aug 1, 2024
philipaconrad added a commit to philipaconrad/opa that referenced this pull request Aug 1, 2024
philipaconrad added a commit that referenced this pull request Aug 1, 2024
ashutosh-narkar pushed a commit to ashutosh-narkar/opa that referenced this pull request Aug 1, 2024
…ent#6906)
(cherry picked from commit ee9ab0b)
ashutosh-narkar pushed a commit that referenced this pull request Aug 5, 2024
(cherry picked from commit ee9ab0b)