
1MB payload latency on localhost #5676

Closed
tony-clarke-amdocs opened this issue Sep 29, 2022 · 12 comments

Comments

@tony-clarke-amdocs

tony-clarke-amdocs commented Sep 29, 2022

GRPC v1.48

Running the grpc benchmark on my laptop with: `./run_bench.sh -rpc_type unary -req 1000000 -resp 1000000 -r 1`

I get results as follows:

================================================================================
r_1_c_1_req_1000000_resp_1000000_unary_1664417694
qps: 506.4
Latency: (50/90/99 %ile): 1.972018ms/2.313624ms/2.685183ms
Client CPU utilization: 0s
Client CPU profile: /tmp/client_r_1_c_1_req_1000000_resp_1000000_unary_1664417694.cpu
Client Mem Profile: /tmp/client_r_1_c_1_req_1000000_resp_1000000_unary_1664417694.mem
Server CPU utilization: 0s
Server CPU profile: /tmp/Server_r_1_c_1_req_1000000_resp_1000000_unary_1664417694.cpu
Server Mem Profile: /tmp/Server_r_1_c_1_req_1000000_resp_1000000_unary_1664417694.mem

The time taken seems a little more than I had anticipated (I was hoping to be well under 1ms).

Trying to understand where the time is spent, I had the following suspects:

  1. User space -> kernel space -> user space for the network call. But using bufconn (see the sketch just after this list) seems to suggest that this is not an issue.
  2. The proto codec writing the proto to the wire format and back. However, timing this separately, it only seems to account for ~10% of the time.
  3. HTTP/2 overhead. Reading this issue and running the sample app here seem to show that HTTP/2 is much slower than HTTP/1.1 (at least in this use case).
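For reference, a minimal sketch of the bufconn wiring behind suspect 1: an in-memory listener, so RPCs never cross the kernel network stack. Service registration is elided and the names are illustrative, not the benchmark's own code.

```go
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/test/bufconn"
)

func main() {
	// In-memory listener: connections never touch the kernel network stack.
	lis := bufconn.Listen(1 << 20)

	s := grpc.NewServer()
	// pb.RegisterGreeterServer(s, &greeterServer{}) // hypothetical: register your service here
	go s.Serve(lis)
	defer s.Stop()

	// Dial through the bufconn listener instead of TCP; the dialer ignores the target string.
	conn, err := grpc.Dial("bufnet",
		grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
			return lis.DialContext(ctx)
		}),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// Time RPCs issued on conn exactly as in the TCP run.
}
```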

What do folks think?

@dfawley
Member

dfawley commented Oct 4, 2022

It's been a long time since we've focused on performance, and I'm not sure what kinds of numbers to expect for a benchmark with those parameters. What is showing up in the client & server CPU profiles?

@tony-clarke-amdocs
Author

Attached are the client and server PDF profiles. Hopefully something jumps out as being suspicious to someone.
server.pdf
client.pdf

@dfawley
Member

dfawley commented Oct 4, 2022

Thanks for the profiles. Nothing really stands out to me there.

What are you testing for exactly in your scenario? Typically, if you're only doing one RPC at a time (-r 1), you would be using a small payload to measure latency. If you are testing for throughput, you would use a large payload (e.g. 1MB, like your run), but with many RPCs concurrently. If you are testing for QPS, you would use many RPCs concurrently but small payloads.
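For illustration, the "throughput" shape described here could be measured with a client loop like the rough sketch below. It borrows the grpc-go helloworld example stubs as a stand-in service; the server address and the worker/RPC counts are assumptions.

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	pb "google.golang.org/grpc/examples/helloworld/helloworld"
)

// measureThroughput issues many concurrent ~1MB RPCs and reports aggregate QPS.
func measureThroughput(ctx context.Context, client pb.GreeterClient, workers, perWorker int) {
	start := time.Now()

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker builds its own ~1MB request message.
			req := &pb.HelloRequest{Name: string(make([]byte, 1<<20))}
			for i := 0; i < perWorker; i++ {
				if _, err := client.SayHello(ctx, req); err != nil {
					log.Printf("rpc error: %v", err)
				}
			}
		}()
	}
	wg.Wait()

	elapsed := time.Since(start)
	total := workers * perWorker
	log.Printf("%d RPCs in %v (%.1f QPS)", total, elapsed, float64(total)/elapsed.Seconds())
}

func main() {
	// Assumes a Greeter server is listening on localhost:50051.
	conn, err := grpc.Dial("localhost:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	measureThroughput(context.Background(), pb.NewGreeterClient(conn), 50, 100)
}
```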

@tony-clarke-amdocs
Author

> What are you testing for exactly in your scenario?

We are trying to understand the latency as it relates to payload size between two processes running on the same host. We want to add a proxy that talks gRPC to the application (like a sidecar) but runs on the same host; currently, though, the extra latency is a show stopper. I am trying to understand whether the times I am seeing are reasonable or whether I have just misconfigured something.

| Latency (ns) by payload size (bytes) | 0 | 1000 | 10000 | 100000 | 1000000 |
| --- | --- | --- | --- | --- | --- |
| TCP gRPC | 119083 | 114288 | 122452 | 290807 | 1930755 |
| UDS gRPC | 97265 | 101345 | 128182 | 291941 | 1981637 |

Times are in nanoseconds. The benchmark proto definition is very simple, so the cost can't be in the marshaling.
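As an aside, a minimal sketch of how an application (e.g. the sidecar client) might dial gRPC over a Unix domain socket, corresponding to the UDS row above; the socket path is an assumed placeholder.

```go
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	const sock = "/tmp/sidecar.sock" // hypothetical socket path

	// Dial gRPC over a Unix domain socket instead of TCP; the custom dialer
	// ignores the target string and connects to the socket directly.
	conn, err := grpc.Dial("unix-socket",
		grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", sock)
		}),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// Use conn with the generated client stub exactly as with TCP.
}
```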

@dfawley
Member

dfawley commented Oct 4, 2022

> The benchmark proto definition is very simple, so the cost can't be in the marshaling.

Maybe not the runtime CPU cost of marshaling, but it could be the cost of the allocations. Our benchmarks will be allocating 3MB per request and 4MB per response:

- Request (3MB): 1. marshaling the request message (client; the request proto message is reused), 2. reading the received request (server), 3. unmarshaling the request (server).
- Response (4MB): 1. creating the response message (server), 2. marshaling the response message (server), 3. reading the response (client), 4. unmarshaling the response (client).

You're achieving 500 QPS * 7MB/Q, which is 3.5GB/sec in allocations. That actually seems pretty reasonable to me, but I'm not sure.
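One way to confirm whether allocation volume of that order really dominates is to dump an allocation profile after a run and inspect it with `go tool pprof -alloc_space`; a minimal sketch using runtime/pprof, with an arbitrary output path.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// writeAllocProfile dumps cumulative allocation data for later inspection with
// `go tool pprof -alloc_space <binary> <path>`.
func writeAllocProfile(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatalf("create profile: %v", err)
	}
	defer f.Close()

	runtime.GC() // flush the most recent allocation statistics into the profile
	if err := pprof.Lookup("allocs").WriteTo(f, 0); err != nil {
		log.Fatalf("write profile: %v", err)
	}
}

func main() {
	// ... run the RPC workload here ...
	writeAllocProfile("/tmp/grpc_allocs.pb.gz") // arbitrary output path
}
```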

@dfawley
Member

dfawley commented Oct 4, 2022

If your real-world use case doesn't involve sending 1MB messages at 500QPS then you might slow down the rate of RPCs and get a more realistic measurement of latency.
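Concretely, that amounts to timing each RPC on its own with a pause between calls; a rough sketch, reusing the same imports and helloworld stand-in stubs as the throughput sketch earlier in the thread.

```go
// measurePacedLatency issues one RPC at a time, pausing between calls, and
// logs per-RPC latency rather than aggregate QPS.
func measurePacedLatency(ctx context.Context, client pb.GreeterClient, n int, pause time.Duration) {
	req := &pb.HelloRequest{Name: string(make([]byte, 1<<20))} // ~1MB payload
	for i := 0; i < n; i++ {
		start := time.Now()
		if _, err := client.SayHello(ctx, req); err != nil {
			log.Printf("rpc error: %v", err)
			continue
		}
		log.Printf("rpc %d latency: %v", i, time.Since(start))
		time.Sleep(pause) // e.g. 100ms, so calls don't queue behind one another
	}
}
```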

@github-actions

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Oct 10, 2022
@tony-clarke-amdocs
Author

I slowed the rate of RPCs down, putting various delays in between, but it didn't really seem to make any difference. I get similar performance when I try out grpc-java, so I don't think it's anything specific to the Go implementation, but rather a limitation of gRPC with large payloads. Does anyone have ideas about what else can be done to speed this up (client and server both on localhost)?

@github-actions github-actions bot removed the stale label Oct 11, 2022
@github-actions

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@dfawley
Member

dfawley commented Oct 25, 2022

Sorry, I meant to update here before it was auto-closed:

I think for this, you ultimately will need something like #906. We've had other interest in this recently from some folks who might be able to do the implementation work and also implement a shared memory transport, so it's possible this could happen in the next few months.

@vominhtrius

> Sorry, I meant to update here before it was auto-closed:
>
> I think for this, you ultimately will need something like #906. We've had other interest in this recently from some folks who might be able to do the implementation work and also implement a shared memory transport, so it's possible this could happen in the next few months.

Hi @dfawley, do you have any updates on "implement a shared memory transport" for grpc-go?

@dfawley
Member

dfawley commented Mar 9, 2023

I recently heard that a proof of concept for an in-memory transport performed well, but I don't think it's close to landing as a PR any time soon.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 6, 2023