Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add transport socket wrapper to collect OS-level TCP stats and export as envoy metrics #17682

Merged
merged 34 commits into from
Oct 25, 2021

Conversation

ggreenway
Copy link
Contributor

@ggreenway ggreenway commented Aug 11, 2021

Commit Message: This adds a new network filter for collecting linux-specific TCP stats and publishing them as envoy metrics, with a configurable stat_prefix so that stats can be attributed to a specific listener or cluster.
Commit Message: This adds a new transport socket wrapper for collecting OS-level TCP stats and publishing them as envoy metrics, in either the cluster or listener stats namespace (depending on upstream or downstream use). This allows attributing the TCP stats to specific listeners and clusters.
Additional Description:
Risk Level: Low
Testing: Added unit and integration tests; manually tested to make sure results look sane.
Docs Changes: Added docs for new stats and transport socket
Release Notes: Added
Platform Specific Features: Currently only supports Linux; other platform support can be added as needed
[Optional Runtime guard:]
Fixes #17617
[Optional Deprecated:]
[Optional API Considerations:]

This is useful for things like percents, which should be in the range 0.0-1.0.

Signed-off-by: Greg Greenway <[email protected]>
Signed-off-by: Greg Greenway <[email protected]>
Signed-off-by: Greg Greenway <[email protected]>
@repokitteh-read-only
Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #17682 was opened by ggreenway.

see: more, trace.

@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
envoyproxy/api-shepherds assignee is @mattklein123
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #17682 was opened by ggreenway.

see: more, trace.

@ggreenway
Copy link
Contributor Author

cc @jmarantz. I found that integer histograms made it ugly to deal with a percent metric. I had to make it in the range 0-10000 to get enough precision in the 0-5% range, where retransmissions are most of the time, which requires the receiver of the metric to know what the range is, and multiply/divide by a known factor to put it into human-readable range. Do you think adding a floating-point histogram type is the right way to solve this, or is there some better way to approach this. Also, if we go this route, should I split out the histogram changes into a separate PR?

@ggreenway
Copy link
Contributor Author

cc @mattklein123. WDYT of the changes I had to make on Connection to make this work? I'm not sure I love exposing getsockopt on every Connection, but it's const (logically) so hopefully shouldn't cause any issues?

* amount of boilerplate.
*/
template <class ConfigProto, class ProtocolOptionsProto = ConfigProto>
class UpstreamFactoryBase : public Server::Configuration::NamedUpstreamNetworkFilterConfigFactory {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the first upstream network filter; adding the boiler-plate-reducer, similar to the one for downstream network filters.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I dropped a few high level comments for thinking.

/wait

option (udpa.annotations.file_status).package_version_status = ACTIVE;

// [#protodoc-title: Linux Network Stats]
// Echo :ref:`configuration overview <config_network_filters_linuxnetworkstats>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

@@ -117,6 +117,8 @@ class Histogram : public Metric {
* Records an unsigned value in the unit specified during the construction.
*/
virtual void recordValue(uint64_t value) PURE;

virtual void recordFloatValue(double value) PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc comments please

/**
* Calls `getsockopt` on the underlying socket.
*/
virtual Api::SysCallIntResult getSocketOption(int level, int optname, void* optval,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't love this either. A few options for thought:

  1. At minimum, I would make this a const method, even if it requires const_cast internally to implement, as I agree this is logically a const method.
  2. One other option would to actually return some type of generic connection info struct. Internally this would be mapped from the linux TCP_INFO struct, and in all other cases just return an empty absl::optional<>. If you went down this path, you could call the filter connection_info and just have it be documented that it currently only works on Linux. I like this option as eventually I could see this getting implemented on Windows and it would avoid a bunch of churn.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the not-too-distant future, I want to implement support in envoy for MPTCP (which is currently being added to newer linux kernels). It has a similar API for getsockopt(fd, MPTCP_INFO, ...), which could use the same API if we leave it like this.

I agree with making it const if we use this api.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is still to provide a generic info API that maps between Linux specific data to a generic struct, and just leave it not implemented on non-Linux.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea that just occurred to me: maybe this should be a transport_socket instead of a network filter. It could use PassthroughSocket, wrap another "real" transport socket, and I think it gets all the correct calls (such as connection being closed before it is closed). And it has direct access to the ioHandle() which already exposes getOption() which wraps getsockopt(). That would avoid any changes to the connection interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that would work also. As I said I still think some level of abstraction here would not be very difficult and make it easy to support on other OS later, but ultimately that's up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at it briefly, looking at the stats supported by windows. They are similar, but different enough that it would be hard to abstract. For instance, linux tracks retransmissions by segments, and windows tracks them by bytes. Mac does both. Linux gives a metric for the minimum RTT over some undefined period of time (some complicated window algorithm). Mac gives the average RTT. Windows gives the minRTT over the lifetime of the connection.

So similar information can be derived, but I'm worried that it wouldn't be close enough to call them the same stat name, because a time series containing multiple platforms wouldn't make sense then. For all of these metrics, they're being measured in an OS-dependent way, so I'm worried that trying to make this cross-platform just ends up with weirdness in the platform differences.

On the other hand, they might be close enough to be useful in the same way. The metrics are at least advertised to be the same in some cases.

Either way, I think doing a transport socket causes less ugliness on the Connection interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to make it something abstracted, should it include quic? All the same metrics roughly apply to quic as well logically (though I haven't looked at what metrics quiche gives us access to).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think it would be optimal to include QUIC. I guess my advise to abstract only as much as you need for your current case and we can go from there?

@jmarantz
Copy link
Contributor

RE histograms: if we can get enough dynamic range without changing the basic data type that seems better. I'm not sure what multipliers you are thinking of, but it certainly seems reasonable to add new Unit choices if that helps.

+1 for coming up with some platform-agnostic abstraction for the sys-call results.

@ggreenway
Copy link
Contributor Author

RE histograms: if we can get enough dynamic range without changing the basic data type that seems better. I'm not sure what multipliers you are thinking of, but it certainly seems reasonable to add new Unit choices if that helps.

+1 for coming up with some platform-agnostic abstraction for the sys-call results.

Can definitely get enough dynamic range from uint64. The problem is pushing the value out in meaingful units. If I multiple the 0-1 value by 10,000, I think that's enough dynamic range. But then I need a way when publishing to divide by 10,000 and convert to float, to publish a sane range.

I guess one way of doing that would be to add a new unit of Percent_10000, and when publishing a Percent_10000, convert to float and divide by 10000 at that time to publish in the correct range. Would that be more desirable? Does any of that code look at the units when writing out the underlying stat format (statsd, prometheus, etc)?

@ggreenway
Copy link
Contributor Author

@jmarantz one more question: should we publish a percent metric as 0-1, or 0-100? I don't have a strong preference. I searched and it looks like there is only 1 other metric that is a percent (percent of timeout budget used by http request), and that publishes 0-100. Prometheus docs say to use 0-1 for percents, but those docs say to use base units for everything (like seconds instead of milliseconds), and we don't follow that guidance hardly anywhere.

I don't have any preference on 0-1 vs 0-100, as long as there is enough significant digits in whatever we choose to get the data we want.

But I am against publishing a percent as 0-10000 in order to get the precision. That's making consuming the metric more difficult than it should be.

@jmarantz
Copy link
Contributor

I don't know that much about the various stats sinks deal with units, but I could do a survey and render an opinion.

I am not altogether against changing the rep to use a float or double but feel like that might have unintended consequence we might not be sufficiently testing for.

FWIW the term "basis points" is used in finance to denote 100ths of a percent: https://www.investopedia.com/terms/b/basispoint.asp -- would using that somewhat standard terminology help?

@ggreenway
Copy link
Contributor Author

I don't know that much about the various stats sinks deal with units, but I could do a survey and render an opinion.

Prometheus natively supports fractional values for most things, and this is encouraged (eg everything is supposed to be in seconds, even if the expected values are much less than 1 second).

Statsd seems to encode just enough information for the database to know how to store them, and has some strange conventions, but the docs seem to say that fractional values are supported. I think it's generally up to the receiver to know what the units are.

No idea for the others.

I am not altogether against changing the rep to use a float or double but feel like that might have unintended consequence we might not be sufficiently testing for.

I certainly don't want to change existing stats to use floating point numbers; it is substantially slower to deal with them.

I think it may make the most sense to encode some units in the metric, have the raw stored value be fixed-point, and convert to something sensible for the stat-sink in the output path.

There will be some complication in that: for prometheus, we support configurable histogram buckets. We use those to call hist_approx_count_below(). We'll need to scale the buckets here too; not unmanageable, but not free.

FWIW the term "basis points" is used in finance to denote 100ths of a percent: https://www.investopedia.com/terms/b/basispoint.asp -- would using that somewhat standard terminology help?

I'm familiar with the term; not sure how familiar others would be.

@davinci26
Copy link
Member

Re: cross-platform things

  1. Config/AP: I would rather if the config was cross platform and it was network_stats instead of linux_network_stats because that way a service mesh provider such Istio or OSM can add this filter to their config and then Envoy will know which one to use.

  2. Implementation/Wise: Can we have an internal struct (basically expand) EnvoyTcpInfo and each platform implements this class and internally we abstract the platform details. So we can define another function os_sys_calls or in IoHandle which just returns EnvoyTcpInfo.

@ggreenway
Copy link
Contributor Author

Re: cross-platform things

  1. Config/AP: I would rather if the config was cross platform and it was network_stats instead of linux_network_stats because that way a service mesh provider such Istio or OSM can add this filter to their config and then Envoy will know which one to use.
  2. Implementation/Wise: Can we have an internal struct (basically expand) EnvoyTcpInfo and each platform implements this class and internally we abstract the platform details. So we can define another function os_sys_calls or in IoHandle which just returns EnvoyTcpInfo.

Kind of. The problem is when the different platforms don't measure the same thing. If a platform doesn't measure something at all, we can just omit the stat.

But what if two platforms measure similar things, but they're not quite the same? Or they're trying to measure the same thing, but with different methodology that gives meaningfully different results? Do we create different stat names for the two platforms, with very similar names, or just use the same name?

@davinci26
Copy link
Member

I think that putting a prefix with the platform name in the stats might be useful regardless. I think we can add the prefix for all the stats.

I dont have any customer data at this point but I could see how someone who has a mixed OS cluster might be interested in TCP perf metric per platform.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 12, 2021
@github-actions
Copy link

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Sep 20, 2021
@ggreenway
Copy link
Contributor Author

I'll get back to this eventually; got busy with other things for awhile.

@@ -104,8 +104,16 @@ class Histogram : public Metric {
Bytes,
Microseconds,
Milliseconds,
Percent, // A percent value stored as fixed-point, where the stored value is divided by
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmarantz wdyt about this for a fixed-point scaled percent representation for histograms? Changes can be viewed isolated here: 6b1f8aa

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we go with this can we split this out into an independent change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. But I wanted to iterate on these together at first to make sure the stats API we add is right for the use case.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool. LGTM with just a couple of small questions.

/wait-any


// Period to update stats while the connection is open. If unset, updates only happen when the
// connection is closed. Stats are always updated one final time when the connection is closed.
google.protobuf.Duration update_period = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should have some minimum sensible time here if this is set? wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should be a minimum of 1ms, because we can't set timers for less than that. Other than that, if someone wants to burn CPU by querying this almost constantly, and getting really high granularity, I'm fine with letting them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK maybe add an annotation for this then? Otherwise LGTM.

/wait

void closeSocket(Network::ConnectionEvent event) override;

private:
absl::optional<struct tcp_info> querySocketInfo();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something but how does this compile? Doesn't this require knowing the size of tcp_info? I'm concerned that somehow this is using the wrong version with the missing fields? Otherwise is the forward decl needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. It may be that the compiler doesn't care if there's no caller of the function. And since it is private, the only caller is in tcp_stats.cc, which has the definition before the function is called. I don't know. C++ is weird and complicated.

Signed-off-by: Greg Greenway <[email protected]>
Signed-off-by: Greg Greenway <[email protected]>
mattklein123
mattklein123 previously approved these changes Oct 22, 2021
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

mattklein123
mattklein123 previously approved these changes Oct 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add way to get more detailed OS level network stats as Envoy metrics
4 participants