[native] Share HTTP connection pools between HTTP clients #21367

Merged (1 commit into prestodb:master, Nov 14, 2023)

Conversation

@Yuhta (Contributor) commented Nov 10, 2023

Currently each exchange source owns one unique SessionPool, so a connection is only reused for one specific point-to-point communication. This creates many connections between the same upstream and downstream nodes (for different tasks), decreasing performance, increasing the chance of network errors, and wasting precious ephemeral TCP ports (only about 32K are available on one host) on the client side.

Fix this by pooling the connections. This decreases the open socket count from 52K to 15K at peak and keeps socket usage consistently lower at all times (see the attached image).
[Attached image: open socket count over time, before and after the change.]

Also set a connect timeout and increase the max error duration so that connection errors can be retried.

@Yuhta requested a review from a team as a code owner on November 10, 2023 19:59
@Yuhta marked this pull request as draft on November 10, 2023 20:07
@Yuhta force-pushed the connpool branch 6 times, most recently from acb97d5 to be6a1fa, on November 11, 2023 03:26
@Yuhta marked this pull request as ready for review on November 12, 2023 16:04
@spershin (Contributor) left a comment

Thanks for working on this!

@@ -185,6 +185,7 @@ class HttpClientFactory {
   }

   ~HttpClientFactory() {
+    eventBase_->runInEventBaseThread([pools = std::move(sessionPools_)] {});
Contributor:

Is this a way to destroy "sessionPools_" in the event base thread?
Why do we need this?

@Yuhta (Author):

Because it is pinned to one event base and the destructor needs to release some resources associated with it. This is documented in the class comment.
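
To make the pattern concrete, here is a minimal sketch (not the PR code; PinnedResource is a hypothetical stand-in for an object such as a proxygen::SessionPool): moving the object into a lambda posted with runInEventBaseThread means the lambda, and therefore its capture, is destroyed on the event base thread once it has run.

// Minimal sketch, not from the PR.  PinnedResource is a hypothetical stand-in
// for an object whose destructor must run on the event base it is pinned to.
#include <folly/io/async/EventBase.h>
#include <memory>

struct PinnedResource {
  ~PinnedResource() {
    // Releases state that must only be touched on the owning event base.
  }
};

void destroyOnEventBase(
    folly::EventBase* evb,
    std::unique_ptr<PinnedResource> resource) {
  // The lambda body is intentionally empty: its only job is to carry the
  // moved-in capture to the event base thread, where the lambda (and the
  // captured unique_ptr) is destroyed after it runs.
  evb->runInEventBaseThread([r = std::move(resource)] {});
}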

Comment on lines 187 to 189:

     STR_PROP(kExchangeMaxErrorDuration, "3m"),
     STR_PROP(kExchangeRequestTimeout, "10s"),
-    STR_PROP(kExchangeConnectTimeout, "0s"),
+    STR_PROP(kExchangeConnectTimeout, "20s"),
Contributor:

We should make these changes in a separate PR, in case something goes wrong, so we don't have to roll back the whole PR.

@Yuhta (Author):

I separated it out here: #21379
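
For background on how these two settings interact, a rough sketch under stated assumptions (hypothetical names, not Presto's actual retry code): each connection attempt is bounded by the connect timeout, and failed attempts keep being retried until the overall max error duration is exhausted.

// Hypothetical sketch, not the PR implementation: retry a connect-style
// operation with a bounded per-attempt timeout until an overall error
// deadline (the "max error duration") is exceeded.
#include <chrono>
#include <functional>
#include <thread>

bool retryUntilDeadline(
    const std::function<bool(std::chrono::seconds)>& tryConnect,
    std::chrono::seconds connectTimeout,    // e.g. 20s per attempt
    std::chrono::minutes maxErrorDuration,  // e.g. 3m overall
    std::chrono::seconds backoff = std::chrono::seconds(1)) {
  const auto deadline = std::chrono::steady_clock::now() + maxErrorDuration;
  while (true) {
    if (tryConnect(connectTimeout)) {
      return true;  // connected within the per-attempt timeout
    }
    if (std::chrono::steady_clock::now() + backoff >= deadline) {
      return false;  // retry budget exhausted
    }
    std::this_thread::sleep_for(backoff);
  }
}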

@@ -100,13 +105,43 @@ class PrestoExchangeSource : public velox::exec::ExchangeSource {
      uint32_t maxBytes,
      uint32_t maxWaitSeconds) override;

  static std::shared_ptr<ExchangeSource> create(
  // HTTP connection pool for a specific endpoint with its associated event
  // base.  All the operations on the SessionPool must be performed on the
Contributor:

nit: double space.

@Yuhta (Author), Nov 14, 2023:

I think double space after a full sentence is the rule (especially for fixed-width font)?

Contributor:

Never heard about it.
Space is always single.

@Yuhta (Author), Nov 14, 2023:

Searched online and it appears to be an old convention that has now become a religious war :(
Double space lets me use M-e and M-a to jump between sentences in Emacs, so I prefer it for pragmatic reasons.

Comment on lines 485 to 488:

  for (auto& [_, pool] : pools) {
    pool.eventBase->runInEventBaseThread(
        [sessionPool = std::move(pool.sessionPool)] {});
  }
Contributor:

  1. Why do we need to clear the sessionPool objects on the event base thread?

  2. Does this 'runInEventBaseThread' run synchronously?

@Yuhta (Author):

  1. It's required by the design: the SessionPool is pinned to its event base, so it must be destroyed there.
  2. No, it runs asynchronously (see the note below).
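
For reference, a minimal sketch of the difference (doWork is a hypothetical placeholder): folly's EventBase offers both the fire-and-forget call used here and a blocking ...AndWait variant.

// Minimal sketch, not from the PR: contrasting the asynchronous call used in
// the quoted code with folly's blocking variant.  doWork() is hypothetical.
#include <folly/io/async/EventBase.h>

void doWork();

void example(folly::EventBase* evb) {
  // Asynchronous: enqueue the callback and return immediately; it runs later
  // on the event base thread.
  evb->runInEventBaseThread([] { doWork(); });

  // Blocking: returns only after the callback has finished on the event base
  // thread (meant to be called from a different thread).
  evb->runInEventBaseThreadAndWait([] { doWork(); });
}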

Comment on lines 116 to 134:

  class ConnectionPools {
   public:
    ~ConnectionPools() {
      destroy();
    }

    const ConnectionPool& get(
        const proxygen::Endpoint& endpoint,
        folly::IOThreadPoolExecutor* ioExecutor);
    void destroy();

   private:
    folly::Synchronized<folly::F14FastMap<
        proxygen::Endpoint,
        ConnectionPool,
        proxygen::EndpointHash,
        proxygen::EndpointEqual>>
        pools_;
  };
Contributor:

It feels very weird that ConnectionPools is declared inside the PrestoExchangeSource class, because it is actually owned by someone else. It confused me for a while.
Should we consider declaring ConnectionPools in the global namespace?

@Yuhta (Author):

I put it here to make it clear that these connection pools are used by the exchange source (not the HTTP server or anything else).

Contributor:

@Yuhta
We could add a comment for this class.
Usually classes are declared within other classes when objects of the inner class are created/managed/used by the outer class, which is not our case.
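
As a reading aid for this thread, an illustrative sketch of the per-endpoint entry (inferred from the quoted destroy loop, not copied from the PR; the unique_ptr ownership and include paths are assumptions):

// Illustrative sketch only.  Field names follow the usage in the quoted
// destroy loop (pool.eventBase, pool.sessionPool); holding the SessionPool
// through a unique_ptr and the exact include paths are assumptions.
#include <memory>

#include <folly/io/async/EventBase.h>
#include <proxygen/lib/http/connpool/SessionPool.h>

struct ConnectionPool {
  // Event base this pool is pinned to: every operation on sessionPool,
  // including its destruction, must run on this thread.
  folly::EventBase* eventBase{nullptr};
  std::unique_ptr<proxygen::SessionPool> sessionPool;
};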

@spershin (Contributor) left a comment

@Yuhta
A few general questions:

  1. With this change, would the connection pools for particular endpoints stay around forever?
  2. Limiting the number of sessions per endpoint to 10 across all in-flight exchange sources - can that cause issues? Before this change it was 10 * the number of exchange sources.

@Yuhta (Author) commented Nov 14, 2023
  1. Yes, but they are just empty pools with no sockets and little memory usage when not in use. This won't be an issue in most cases, because the number of hosts in a data center does not exceed 100K.
  2. No, that is the limit on the number of cached connections. More than 10 can exist at the same time; only 10 are kept alive after the current transactions end (see the toy sketch below).
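
To illustrate the distinction, a toy sketch (hypothetical, not proxygen's SessionPool implementation): the cap limits how many idle connections are kept cached for reuse, not how many connections can be open concurrently.

// Toy sketch, hypothetical: a cache that keeps at most maxCached idle
// connections alive.  Callers open new connections freely when the cache is
// empty, so the concurrent connection count is not limited by maxCached.
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Connection {};  // stand-in for a live HTTP connection

class IdleConnectionCache {
 public:
  explicit IdleConnectionCache(size_t maxCached) : maxCached_(maxCached) {}

  // Called when a transaction finishes: keep the connection alive only if we
  // are under the cache limit, otherwise let it close.
  void release(std::unique_ptr<Connection> conn) {
    if (idle_.size() < maxCached_) {
      idle_.push_back(std::move(conn));  // kept alive for reuse
    }
    // else: conn is destroyed here and the connection is closed.
  }

  // Reuse a cached connection if one is available; returns nullptr when the
  // caller should open a fresh connection instead.
  std::unique_ptr<Connection> acquire() {
    if (idle_.empty()) {
      return nullptr;
    }
    auto conn = std::move(idle_.back());
    idle_.pop_back();
    return conn;
  }

 private:
  const size_t maxCached_;
  std::vector<std::unique_ptr<Connection>> idle_;
};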

@Yuhta force-pushed the connpool branch 3 times, most recently from ee84be5 to cbb3c97, on November 14, 2023 18:09
Commit message:

Currently each exchange source owns one unique `SessionPool`, so a connection
is only reused for one specific point-to-point communication.  This creates
many connections between the same upstream and downstream nodes, decreasing
performance, increasing the chance of network errors, and wasting precious
ephemeral TCP ports (only about 32K are available on one host) on the client
side.

Fix this by pooling the connections, decreasing the open socket count from 52K
to 15K at peak, with consistently lower socket usage at all times (see the
attached image).

Also set a connect timeout and increase the max error duration so that
connection errors can be retried.
@spershin merged commit 279cd59 into prestodb:master on Nov 14, 2023
51 checks passed
@wanglinsong mentioned this pull request on Dec 8, 2023