server+util: Limit max request sizes, prealloc request buffers #6868

Conversation

@philipaconrad philipaconrad commented Jul 10, 2024

Why are the changes in this PR needed?

In a recent bugfix PR, I discovered some performance issues around how request bodies are handled in OPA, particularly when the request is gzipped. This PR is a follow-up to that one, and changes how OPA reads request bodies.

The main performance benefits of this PR come from reducing the workload of the Go garbage collector and from removing redundant Read operations on the request body.

What are the changes in this PR?

This PR preallocates the entire buffer for a request body (and for its decompressed payload) when possible. For large requests, this reduces max RSS by around 5-10% and GC overhead by 33-66%, without penalizing small (<1 KB) requests.
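
As a rough illustration of the preallocation idea (a sketch of the general technique, not the exact helper added in this PR): when Content-Length is known, the whole buffer can be reserved in one allocation instead of letting io.ReadAll grow and re-copy it.

package main

import (
    "bytes"
    "io"
    "net/http"
)

// readBodyPrealloc is a hypothetical helper for illustration: it reads the
// full request body, reserving the buffer up front when the request
// advertises its size.
func readBodyPrealloc(r *http.Request) ([]byte, error) {
    if r.ContentLength < 0 {
        // Unknown length (e.g. chunked encoding): fall back to normal growth.
        return io.ReadAll(r.Body)
    }
    buf := bytes.NewBuffer(make([]byte, 0, int(r.ContentLength)))
    if _, err := io.Copy(buf, r.Body); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}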

The changes are structured as follows:

  • docs/: Updated for the new config options.
  • plugins/server/decoding: New always-on plugin, modeled after the server/encoding plugin.
    • Includes new handler type for the OPA HTTP server.
  • server/server.go: Integrates the new request handler + uses the request body reading function.
  • util/read_gzip_body.go: Allocates whole buffers up front now, and checks gzip blob sizes before decompressing.

Other small changes were needed in a variety of spots to account for new method signatures, and so on.

New configuration options: server.decoding

This PR introduces two new server config fields under server.decoding. Here's an example configuration file:

# Only allow up to 128 MB request bodies, and up to 256 MB decompressed gzip payloads.
server:
  decoding:
    max_length: 134217728
    gzip:
      max_length: 268435456

These fields keep in line with the naming scheme from server.encoding, but I'm open to other ideas! 😅

Defending against huge requests

On main currently, an attacker can tie up a huge amount of resources for an OPA instance with a large, malicious request. Gzipped payloads amplify the resources that can ultimately be tied up. (Example: a 1 GB JSON object with repetitive values can compress down to 5 MB or smaller with gzip.)

In this PR, buffers are allocated all at once, which is hazardous on its own: an attacker could submit a huge request and cause OPA to hit an out-of-memory (OOM) error just from allocating the buffers for that request.

In the case of gzip payloads, the raw payload size may be small, but with a forged gzip size trailer field, the attack could cause the buffer allocation logic to believe that the gzip blob will expand to up to 4 GB in size when decompressed! This makes preallocating the decompression buffer a dangerous thing to do without some sort of validation beforehand.

In this PR, the proposed solution is validating the size of incoming request bodies against server-configured limits before allocating any buffers. That forces attackers in both scenarios to have to work a bit harder to do anything nefarious.
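
To make that concrete, here's a minimal sketch of that kind of guard (the actual DecodingLimits handler in this PR has more plumbing; maxBodyBytes is an assumed parameter, not the real config wiring): reject requests whose declared Content-Length already exceeds the limit, and cap the reader for everything else so an oversized body that lied about its length still cannot be read past the limit.

package main

import "net/http"

// limitRequestBodies is a simplified stand-in for a request-size-limiting
// handler, shown for illustration only.
func limitRequestBodies(next http.Handler, maxBodyBytes int64) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Cheap rejection: the client already declared an oversized body,
        // so no buffer ever gets allocated for it.
        if r.ContentLength > maxBodyBytes {
            http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
            return
        }
        // Defense in depth: cap reads even if Content-Length was absent or wrong.
        r.Body = http.MaxBytesReader(w, r.Body, maxBodyBytes)
        next.ServeHTTP(w, r)
    })
}

A nice property of MaxBytesReader is that once the limit is exceeded it returns an error and asks the ResponseWriter to close the connection, so a misbehaving client can't keep streaming data at the server.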

Where are the new validation checks?

The defenses against huge requests happen in two places:

  • For large request payloads, these are checked in the new DecodingLimits handler in the OPA HTTP server. This allows us to reject those requests before they can significantly tie up an OPA server's resources.
  • For large gzip payloads, these are checked in the util.ReadMaybeCompressedBody function (used whenever we want to read the request body contents), which checks the gzip blob's 4-byte size trailer field against the server's configured size limits.
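
For reference, a minimal sketch of that trailer check: per RFC 1952, the last four bytes of a gzip stream hold the uncompressed size modulo 2^32 as a little-endian uint32. The helper name and error text below are illustrative, not the PR's exact code.

package main

import (
    "encoding/binary"
    "fmt"
)

// checkGzipSizeTrailer validates the size a gzip blob claims it will
// decompress to, before any decompression buffer is allocated. The trailer
// can be forged (and is only the size modulo 2^32), which is exactly why it
// is compared against a configured limit rather than trusted outright.
func checkGzipSizeTrailer(blob []byte, maxDecompressedBytes uint32) error {
    if len(blob) < 4 {
        return fmt.Errorf("gzip payload too short to contain a size trailer")
    }
    declared := binary.LittleEndian.Uint32(blob[len(blob)-4:])
    if declared > maxDecompressedBytes {
        return fmt.Errorf("gzip payload declares %d decompressed bytes, limit is %d",
            declared, maxDecompressedBytes)
    }
    return nil
}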

Notes to assist PR review:

I've moved benchmark sets into comments in the PR discussion thread, since this top PR description was getting a bit crowded. 😅

Scripts used:

summarize-gctrace.awk
# AWK script for summarizing GODEBUG=gctrace=1 logs.
# Known to work for Go 1.22, but the log format is subject to change.
# The format is the following (pulled from the Go `runtime` package docs):
#   gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # MB stacks, #MB globals, # P
#                ^$5             ^$10                                                               ^$28
# Example:
#   gc 11 @3.169s 1%: 0.025+509+0.017 ms clock, 0.10+0/261/0.81+0.068 ms cpu, 3585->3585->2561 MB, 3585 MB goal, 0 MB stacks, 0 MB globals, 4 P

BEGIN {
    FS = "[\t +/]+"
}

# Some simple assertions to prevent matching non-gctrace logs.
$1 != "gc" { non_matching_rows++; next; }
$9 == "cpu" { non_matching_rows++; next; }

# The main pattern-action block:
{
    num_threads = $28

    clock_stw_sweep_term += $5
    clock_conc_mark_scan += $6
    clock_stw_mark_term += $7

    cur_clock = $5 + $6 + $7
    total_clock += cur_clock
    max_pause_clock = (max_pause_clock > cur_clock) ? max_pause_clock : cur_clock

    cpu_ms_stw_sweep_term += $10
    cpu_ms_assist_gc += $11
    cpu_ms_background_gc += $12
    cpu_ms_idle_gc += $13
    cpu_ms_stw_mark_term += $14
    total_cpu += $11 + $12 + $13
}

END {
    print  "--- Wall clock time summary -------------------"
    printf "  STW sweep termination ......... %10.3f ms\n", clock_stw_sweep_term
    printf "  Concurrent mark and scan ...... %10.3f ms\n", clock_conc_mark_scan
    printf "  STW mark termination .......... %10.3f ms\n", clock_stw_mark_term
    printf "                          Total = %10.3f ms (%.3f s)\n", total_clock, total_clock / 1000
    printf "               Longest GC Pause = %10.3f ms\n", max_pause_clock
    print  ""
    print  "--- CPU time breakdown ------------------------"
    printf "  STW sweep termination ......... %10.3f ms\n", cpu_ms_stw_sweep_term
    printf "    Mark/Scan assist time ....... %10.3f ms\n", cpu_ms_assist_gc
    printf "    Mark/Scan background GC time  %10.3f ms\n", cpu_ms_background_gc
    printf "    Mark/Scan idle GC time ...... %10.3f ms\n", cpu_ms_idle_gc
    printf "  STW mark termination .......... %10.3f ms\n", cpu_ms_stw_mark_term
    printf "                          Total = %10.3f ms (%.3f s) (Does not include STW phases)\n", total_cpu, total_cpu / 1000
    print  ""
    print  "Num threads:", num_threads
    print  "Non-matching rows:", non_matching_rows
}
geninput.py
N = 1024 * 1024
oneKBString = "A" * 1024

if __name__ == "__main__":
    print('{"input":{')
    for i in range(0, N):
        print('"{}": "{}",'.format(i, oneKBString))
    print('"{}": "{}"'.format(N, oneKBString))
    print("}}")

Further comments:

Remaining TODO items:

  • Add draft release notes
  • Add more tests?
  • Add Go benchmark?

Things I'd love reviewer comments/opinions on:

  • How big should our default max request size be? How about the max gzip payload size?
  • Could the error messages be improved?
  • Could the config keys have better names?

@philipaconrad philipaconrad added the optimization, performance, and go (Pull requests that update Go code) labels on Jul 10, 2024
@philipaconrad philipaconrad self-assigned this Jul 10, 2024

netlify bot commented Jul 10, 2024

Deploy Preview for openpolicyagent ready!

Name Link
🔨 Latest commit db57672
🔍 Latest deploy log https://app.netlify.com/sites/openpolicyagent/deploys/669e86d42e52d7000883fef4
😎 Deploy Preview https://deploy-preview-6868--openpolicyagent.netlify.app


philipaconrad commented Jul 11, 2024

1 GB gzipped input POST benchmark

Summary:

  • Around a 100 MB reduction in max RSS.
  • Allocs graph indicates a near 50% reduction in allocated memory, but much of that is almost certainly virtual memory that is not showing up as part of the RSS.
  • Around a 66% reduction in GC time.
  • Fewer GC collections (main: 31x, PR: 8x)

CLI command for request: python3 geninput.py | gzip | curl --verbose -X POST -H "Content-Encoding: gzip" --data-binary @- http://127.0.0.1:8181/v1/data

Results from main
gotraceui rendering from main

Purple indicates GC activity.


GC trace from main
$ GOMAXPROCS=4 GODEBUG=gctrace=1 /usr/bin/time -v ./opa-main run -s -c config.yaml 2> >(tee >(awk -f summarize-gctrace.awk >&2) >&2)
gc 1 @0.109s 2%: 0.093+3.5+1.4 ms clock, 0.37+3.1/0.94/3.9+5.9 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 2 @0.123s 2%: 0.028+2.5+0.054 ms clock, 0.11+0.19/1.5/1.7+0.21 ms cpu, 3->4->3 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"addrs":[":8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server. OPA is running on a public (0.0.0.0) network interface. Unless you intend to expose OPA outside of the host, binding to the localhost interface (--addr localhost:8181) is recommended. See https://www.openpolicyagent.org/docs/latest/security/#interface-binding","time":"2024-07-11T01:15:29-04:00"}
gc 3 @0.353s 1%: 0.030+1.7+0.019 ms clock, 0.12+0.13/1.5/3.0+0.076 ms cpu, 7->7->4 MB, 7 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"client_addr":"127.0.0.1:53020","level":"info","msg":"Received request.","req_id":1,"req_method":"POST","req_path":"/v1/data","time":"2024-07-11T01:15:42-04:00"}
gc 4 @13.343s 0%: 0.034+0.99+0.040 ms clock, 0.13+0.29/0.46/0.96+0.16 ms cpu, 8->9->5 MB, 9 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 5 @13.346s 0%: 0.026+0.94+0.053 ms clock, 0.10+0/0.90/1.1+0.21 ms cpu, 9->9->5 MB, 11 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 6 @13.348s 0%: 0.019+1.6+0.013 ms clock, 0.077+0/0.94/2.2+0.053 ms cpu, 11->11->6 MB, 11 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 7 @13.351s 0%: 0.029+3.1+0.013 ms clock, 0.11+0/0.98/3.3+0.055 ms cpu, 14->14->8 MB, 14 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 8 @13.355s 0%: 0.025+3.9+0.005 ms clock, 0.10+0/3.8/1.9+0.023 ms cpu, 19->19->10 MB, 19 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 9 @13.362s 0%: 0.034+6.1+0.016 ms clock, 0.13+0/0.92/6.6+0.066 ms cpu, 28->28->14 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 10 @13.370s 0%: 0.024+9.3+0.015 ms clock, 0.097+0/0.89/9.9+0.060 ms cpu, 42->42->19 MB, 42 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 11 @13.381s 0%: 0.028+2.4+0.012 ms clock, 0.11+0/2.3/2.0+0.051 ms cpu, 39->39->23 MB, 40 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 12 @13.385s 0%: 0.021+2.3+0.015 ms clock, 0.086+0/0.93/3.1+0.062 ms cpu, 48->48->28 MB, 48 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 13 @13.391s 0%: 0.029+16+0.014 ms clock, 0.11+0/1.0/17+0.057 ms cpu, 59->59->34 MB, 59 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 14 @13.411s 0%: 0.039+3.9+0.012 ms clock, 0.15+0/0.89/4.5+0.049 ms cpu, 73->73->42 MB, 73 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 15 @13.419s 0%: 0.021+10+0.013 ms clock, 0.084+0/10/1.1+0.055 ms cpu, 90->90->52 MB, 90 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 16 @13.433s 0%: 0.015+32+0.012 ms clock, 0.062+0/0.77/33+0.050 ms cpu, 112->112->64 MB, 112 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 17 @13.470s 0%: 0.017+8.5+0.005 ms clock, 0.069+0/0.82/9.0+0.020 ms cpu, 139->139->79 MB, 139 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 18 @13.485s 0%: 0.028+17+0.013 ms clock, 0.11+0/0.87/18+0.053 ms cpu, 173->173->98 MB, 173 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 19 @13.512s 0%: 0.030+62+0.012 ms clock, 0.12+0/0.91/62+0.048 ms cpu, 215->215->121 MB, 215 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 20 @13.586s 0%: 0.026+26+0.014 ms clock, 0.10+0/0.94/27+0.058 ms cpu, 268->268->151 MB, 268 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 21 @13.625s 0%: 0.063+39+0.013 ms clock, 0.25+0/39/1.7+0.054 ms cpu, 334->334->187 MB, 334 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 22 @13.679s 0%: 0.037+122+0.012 ms clock, 0.15+0/0.86/123+0.049 ms cpu, 417->417->233 MB, 417 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 23 @13.819s 0%: 0.028+30+0.013 ms clock, 0.11+0/1.5/32+0.055 ms cpu, 520->520->291 MB, 520 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 24 @13.872s 0%: 0.033+70+0.016 ms clock, 0.13+0/70/1.3+0.064 ms cpu, 649->649->362 MB, 649 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 25 @13.972s 0%: 0.032+236+0.052 ms clock, 0.13+0/0.90/237+0.20 ms cpu, 810->811->452 MB, 811 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 26 @14.242s 0%: 0.029+48+0.056 ms clock, 0.11+0/1.7/49+0.22 ms cpu, 1012->1012->564 MB, 1012 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 27 @14.333s 0%: 0.029+128+0.012 ms clock, 0.11+0/0.79/129+0.049 ms cpu, 1264->1264->704 MB, 1264 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 28 @14.513s 0%: 0.027+443+0.011 ms clock, 0.10+0/0.76/443+0.045 ms cpu, 1580->1580->879 MB, 1580 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 29 @15.026s 0%: 0.036+91+0.055 ms clock, 0.14+0/91/3.1+0.22 ms cpu, 1973->1973->1098 MB, 1974 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 30 @17.211s 0%: 0.12+1.2+0.13 ms clock, 0.50+0.15/0.93/1.0+0.54 ms cpu, 3146->3146->2634 MB, 3146 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 31 @20.638s 0%: 0.029+69+0.052 ms clock, 0.11+0.12/68/109+0.21 ms cpu, 5265->5286->3708 MB, 5269 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"client_addr":"127.0.0.1:53020","level":"info","msg":"Sent response.","req_id":1,"req_method":"POST","req_path":"/v1/data","resp_bytes":14,"resp_duration":9793.283196,"resp_status":200,"time":"2024-07-11T01:15:52-04:00"}
{"level":"info","msg":"Shutting down...","time":"2024-07-11T01:16:02-04:00"}
{"level":"info","msg":"Server shutdown.","time":"2024-07-11T01:16:02-04:00"}
	Command being timed: "./opa-main run -s -c config.yaml"
	User time (seconds): 10.15
	System time (seconds): 1.80
	Percent of CPU this job got: 34%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:34.27
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 4446840
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 337
	Minor (reclaiming a frame) page faults: 1125954
	Voluntary context switches: 6202
	Involuntary context switches: 159
	Swaps: 0
	File system inputs: 52520
	File system outputs: 2968
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
--- Wall clock time summary -------------------
  STW sweep termination .........      1.059 ms
  Concurrent mark and scan ......   1490.930 ms
  STW mark termination ..........      2.172 ms
                          Total =   1494.161 ms (1.494 s)
               Longest GC Pause =    443.038 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........      4.185 ms
    Mark/Scan assist time .......      3.980 ms
    Mark/Scan background GC time     307.700 ms
    Mark/Scan idle GC time ......   1340.360 ms
  STW mark termination ..........      9.022 ms
                          Total =   1652.040 ms (1.652 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 28
Allocs graph for main

allocs-main

Heap graph for main

heap-main

Results from this PR
gotraceui rendering from this PR

Purple indicates GC activity.


GC trace from this PR
$ GOMAXPROCS=4 GODEBUG=gctrace=1 /usr/bin/time -v ./opa-pr run -s -c config.yaml 2> >(tee >(awk -f summarize-gctrace.awk >&2) >&2)
gc 1 @0.010s 1%: 0.020+0.65+0.014 ms clock, 0.080+0.24/0.52/0.90+0.058 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"addrs":[":8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server. OPA is running on a public (0.0.0.0) network interface. Unless you intend to expose OPA outside of the host, binding to the localhost interface (--addr localhost:8181) is recommended. See https://www.openpolicyagent.org/docs/latest/security/#interface-binding","time":"2024-07-11T01:19:31-04:00"}
gc 2 @0.396s 0%: 0.046+1.4+0.026 ms clock, 0.18+0.38/1.2/2.3+0.10 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 3 @0.409s 0%: 0.041+0.91+0.006 ms clock, 0.16+0.18/0.73/1.8+0.025 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 4 @0.417s 0%: 0.044+1.2+0.024 ms clock, 0.17+0.13/1.0/1.9+0.098 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"client_addr":"127.0.0.1:36074","level":"info","msg":"Received request.","req_id":1,"req_method":"POST","req_path":"/v1/data","time":"2024-07-11T01:19:41-04:00"}
gc 5 @10.248s 0%: 0.032+0.57+0.003 ms clock, 0.13+0.11/0.54/1.2+0.014 ms cpu, 7->7->7 MB, 7 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 6 @10.251s 0%: 0.023+0.60+0.047 ms clock, 0.093+0.096/0.51/0.99+0.18 ms cpu, 1045->1045->1045 MB, 1045 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 7 @13.034s 0%: 0.065+528+0.014 ms clock, 0.26+0/268/1.5+0.056 ms cpu, 3093->3093->2575 MB, 3093 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 8 @16.677s 0%: 0.038+54+0.011 ms clock, 0.15+0.080/53/105+0.044 ms cpu, 5148->5158->3600 MB, 5152 MB goal, 0 MB stacks, 0 MB globals, 4 P
{"client_addr":"127.0.0.1:36074","level":"info","msg":"Sent response.","req_id":1,"req_method":"POST","req_path":"/v1/data","resp_bytes":14,"resp_duration":9027.930343,"resp_status":200,"time":"2024-07-11T01:19:50-04:00"}
{"level":"info","msg":"Shutting down...","time":"2024-07-11T01:19:55-04:00"}
{"level":"info","msg":"Server shutdown.","time":"2024-07-11T01:19:55-04:00"}
	Command being timed: "./opa-pr run -s -c config.yaml"
	User time (seconds): 8.36
	System time (seconds): 1.41
	Percent of CPU this job got: 39%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 4316048
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1075120
	Voluntary context switches: 3728
	Involuntary context switches: 63
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
--- Wall clock time summary -------------------
  STW sweep termination .........      0.309 ms
  Concurrent mark and scan ......    587.330 ms
  STW mark termination ..........      0.145 ms
                          Total =    587.784 ms (0.588 s)
               Longest GC Pause =    528.079 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........      1.223 ms
    Mark/Scan assist time .......      1.216 ms
    Mark/Scan background GC time     325.500 ms
    Mark/Scan idle GC time ......    115.590 ms
  STW mark termination ..........      0.575 ms
                          Total =    442.306 ms (0.442 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 28
Allocs graph from this PR

allocs-pr

Heap graph from this PR

heap-pr


philipaconrad commented Jul 11, 2024

ℹ️ I discovered after putting up the draft PR that using a sync.Pool for the gzip.Reader instances provides a surprising performance boost (~9% RPS improvement!), mostly by avoiding needless allocations + GC later. (Benchmarks will need to be updated for this. 😅 )
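
For context, the usual shape of that optimization looks roughly like this (a sketch of the pattern, not the PR's exact code): keep *gzip.Reader values in a sync.Pool and Reset them per request, so the reader's internal decompression state is reused instead of reallocated and garbage-collected.

package main

import (
    "bytes"
    "compress/gzip"
    "io"
    "sync"
)

// gzipReaderPool reuses gzip.Reader instances across requests.
var gzipReaderPool = sync.Pool{
    New: func() any { return new(gzip.Reader) },
}

// decompressWithPool decompresses a gzip blob using a pooled reader.
func decompressWithPool(blob []byte) ([]byte, error) {
    zr := gzipReaderPool.Get().(*gzip.Reader)
    defer gzipReaderPool.Put(zr)

    // Reset points the reader at the new input without reallocating its
    // internal state.
    if err := zr.Reset(bytes.NewReader(blob)); err != nil {
        return nil, err
    }
    defer zr.Close()
    return io.ReadAll(zr)
}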

@philipaconrad philipaconrad marked this pull request as ready for review July 15, 2024 18:02
@philipaconrad philipaconrad force-pushed the philip/server-decoding-gzip branch from 88acc1a to 2f79bcb on July 18, 2024 15:08
@ashutosh-narkar ashutosh-narkar (Member) left a comment

This looks great @philipaconrad! Thanks for the detailed issue description. Few initial thoughts 👇

docs/content/configuration.md (resolved)
server/server.go (resolved)
return r, nil, err
}
rawBody, err = io.ReadAll(plaintextBody)
rawBody, err = util.ReadMaybeCompressedBody(r)
Member:

This would be backwards-incompatible. It would be good to avoid this.

Member:

Probably we can define a new method.

@philipaconrad philipaconrad (Contributor Author) Jul 19, 2024

@ashutosh-narkar Could you elaborate on the backwards-incompatible bits?

As far as I can tell, the updated util.ReadMaybeCompressedBody method produces the same end result types as before, and only generates new errors if the gzip payload is larger than the configured size limit. 🤔

@ashutosh-narkar ashutosh-narkar (Member) left a comment

The changes look fine to me. Few more comments and we can then get this in. Thanks!

server/server.go (outdated, resolved)
return r, nil, err
}
rawBody, err = io.ReadAll(plaintextBody)
rawBody, err = util.ReadMaybeCompressedBody(r)
Member:

Probably we can define a new method.

server/server_test.go (outdated, resolved)
util/decoding/context.go (outdated, resolved)
@philipaconrad philipaconrad (Contributor Author)

Locust Max RPS benchmarks (plain, gzip)

Summary:

  • When hammered with small requests, the PR increases Max RPS by around 9% for gzipped requests.
  • For gzipped requests, there's also a noticeable reduction in GC traffic.

CLI command for each benchmark:

  • ./benchmark.sh opa-main main: Runs a Locust benchmark against OPA built from main, then re-runs with gzipped requests.
  • ./benchmark.sh opa-pr pr: Runs a Locust benchmark against OPA built from this PR, then re-runs with gzipped requests.

Benchmarking scripts:


Results from main
  • Avg Req/s: 7728.26
GC trace from main
--- Wall clock time summary -------------------
  STW sweep termination .........    121.978 ms
  Concurrent mark and scan ......   1341.990 ms
  STW mark termination ..........     21.128 ms
                          Total =   1485.096 ms (1.485 s)
               Longest GC Pause =      7.058 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........    249.780 ms
    Mark/Scan assist time .......    118.346 ms
    Mark/Scan background GC time     460.340 ms
    Mark/Scan idle GC time ......     58.684 ms
  STW mark termination ..........     34.603 ms
                          Total =    637.370 ms (0.637 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 310148
Allocs graph for main

allocs-main

Heap graph for main

heap-main

Results from this PR
  • Avg Req/s: 7913.60
GC trace from this PR
--- Wall clock time summary -------------------
  STW sweep termination .........    111.746 ms
  Concurrent mark and scan ......   1293.190 ms
  STW mark termination ..........     22.368 ms
                          Total =   1427.304 ms (1.427 s)
               Longest GC Pause =      6.691 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........    216.280 ms
    Mark/Scan assist time .......    126.283 ms
    Mark/Scan background GC time     451.660 ms
    Mark/Scan idle GC time ......     51.504 ms
  STW mark termination ..........     28.746 ms
                          Total =    629.447 ms (0.629 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 313388
Allocs graph from this PR

allocs-pr

Heap graph from this PR

heap-pr


Results from main (gzip)
  • Avg Req/s: 6337.26
GC trace from main (gzip)
--- Wall clock time summary -------------------
  STW sweep termination .........    228.992 ms
  Concurrent mark and scan ......   2347.050 ms
  STW mark termination ..........     38.788 ms
                          Total =   2614.830 ms (2.615 s)
               Longest GC Pause =      9.204 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........    458.590 ms
    Mark/Scan assist time .......    242.800 ms
    Mark/Scan background GC time     853.370 ms
    Mark/Scan idle GC time ......     60.655 ms
  STW mark termination ..........     58.624 ms
                          Total =   1156.825 ms (1.157 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 254518
Allocs graph for main (gzip)

allocs-main-gzip

Heap graph for main (gzip)

heap-main-gzip

Results from this PR (gzip)
  • Avg Req/s: 6961.82
GC trace from this PR (gzip)
--- Wall clock time summary -------------------
  STW sweep termination .........    139.031 ms
  Concurrent mark and scan ......   1314.770 ms
  STW mark termination ..........     24.905 ms
                          Total =   1478.706 ms (1.479 s)
               Longest GC Pause =      9.905 ms

--- CPU time breakdown ------------------------
  STW sweep termination .........    270.248 ms
    Mark/Scan assist time .......    101.513 ms
    Mark/Scan background GC time     429.460 ms
    Mark/Scan idle GC time ......     40.449 ms
  STW mark termination ..........     46.631 ms
                          Total =    571.422 ms (0.571 s) (Does not include STW phases)

Num threads: 4
Non-matching rows: 279382
Allocs graph from this PR (gzip)

allocs-pr-gzip

Heap graph from this PR (gzip)

heap-pr-gzip


philipaconrad commented Jul 19, 2024

ℹ️ @ashutosh-narkar I'm going to squash down my changes and rebase off main, in preparation for merging.

If you have suggestions/edits for the Changelog notes, I'm happy to amend them as needed! 🙂

@philipaconrad philipaconrad force-pushed the philip/server-decoding-gzip branch 2 times, most recently from 618906d to 91d12c5 on July 19, 2024 22:19
@philipaconrad philipaconrad force-pushed the philip/server-decoding-gzip branch from 91d12c5 to db57672 on July 22, 2024 16:20

philipaconrad commented Jul 22, 2024

ℹ️ Rebasing through the GitHub UI...
Just saw the feedback on removing an unused package function. I'll push up that edit shortly.

This commit introduces a few major changes:
 - (Breaking change) Limits now exist for maximum request body sizes.
 - Buffers are preallocated for reading request bodies.
 - Buffers are preallocated for decompressing request bodies.
 - Gzip decoder instances are reused in a `sync.Pool` across requests.

The effect on garbage collection is dramatically fewer GC pauses, giving
a roughly 9% RPS improvement in load tests with gzipped request bodies.
For larger request sizes, the number of GC pauses is dramatically
reduced, although the peak pause time may increase by a few percent.

Implementation notes:
 - The DecodingLimits handler enforces the max request body size both
   through a Content-Length check, and a MaxBytesReader wrapper around
   the payload.
 - The DecodingLimits handler passes the gzip payload size limit down
   using a context key.

Signed-off-by: Philip Conrad <[email protected]>
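
The context-key plumbing mentioned in the implementation notes follows the standard net/http pattern; a hedged sketch (type and function names here are illustrative, not the identifiers used in the PR):

package main

import (
    "context"
    "net/http"
)

// gzipLimitKey is an unexported key type, so no other package can collide
// with this context value.
type gzipLimitKey struct{}

// withGzipLimit attaches the configured gzip decompression limit to the
// request, so the body-reading helper can enforce it later.
func withGzipLimit(r *http.Request, limit int64) *http.Request {
    return r.WithContext(context.WithValue(r.Context(), gzipLimitKey{}, limit))
}

// gzipLimitFrom retrieves the limit; ok is false when no limit was set.
func gzipLimitFrom(ctx context.Context) (limit int64, ok bool) {
    limit, ok = ctx.Value(gzipLimitKey{}).(int64)
    return limit, ok
}
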
@philipaconrad philipaconrad force-pushed the philip/server-decoding-gzip branch from 01cecba to 6aa1f51 on July 22, 2024 16:50
@philipaconrad philipaconrad merged commit c5706ee into open-policy-agent:main Jul 22, 2024
28 checks passed
philipaconrad added a commit to philipaconrad/opa that referenced this pull request Jul 31, 2024
This commit fixes a request handling bug introduced in open-policy-agent#6868, which
caused OPA to treat all incoming chunked requests as if they had
zero-length request bodies.

The fix detects cases where the request body size is unknown in the
DecodingLimits handler, and propagates a request context key down to
the `util.ReadMaybeCompressedBody` function, allowing it to correctly
select between using the original `io.ReadAll` style for chunked
requests, or the newer preallocated buffers approach (for requests of
known size).

This change has a small, but barely visible performance impact for large
requests (<5% increase in GC pauses for a 1GB request JSON blob), and
minimal, if any, effect on RPS under load.

Fixes: open-policy-agent#6904

Signed-off-by: Philip Conrad <[email protected]>
philipaconrad added a commit to philipaconrad/opa that referenced this pull request Aug 1, 2024
philipaconrad added a commit to philipaconrad/opa that referenced this pull request Aug 1, 2024
philipaconrad added a commit that referenced this pull request Aug 1, 2024
ashutosh-narkar pushed a commit to ashutosh-narkar/opa that referenced this pull request Aug 1, 2024
…ent#6906)
(cherry picked from commit ee9ab0b)
ashutosh-narkar pushed a commit that referenced this pull request Aug 5, 2024
(cherry picked from commit ee9ab0b)