Add helper functions for metric conversion [awsecscontainermetricsreceiver] #1089
Codecov Report
@@ Coverage Diff @@
## master #1089 +/- ##
==========================================
+ Coverage 88.84% 89.00% +0.15%
==========================================
Files 251 254 +3
Lines 11979 12161 +182
==========================================
+ Hits 10643 10824 +181
+ Misses 992 991 -1
- Partials 344 346 +2
)

// GenerateDummyMetrics generates two dummy metrics
func GenerateDummyMetrics() consumerdata.MetricsData {
Can this method be moved to a test file? Same with createGaugeIntMetric below.
That's a good idea. Planning to remove it entirely in our next PR, when we will have our original metrics generation code. Keeping it for now as it's being used by previous code. Added a TODO note on top of it.
ContainerPrefix = "container."
ResourceAttributeServiceNameKey = "service.name"
ResourceAttributeServiceNameValue = "awsecscontainermetrics"
MetricResourceType = "aoc.ecs"
Just curious, what does aoc stand for?
AWS Observability Collector -> Amazon distribution of OpenTelemetry.
FWIW, resource type doesn't exist in the OTel protocol; it is only there right now for metrics because this code still seems to use OpenCensus. So this value will generally be dropped.
Noted. Even if it gets utilized we want to use "aoc.ecs" to differentiate our OT metrics from ECS backend metrics.
I point it out because it will go away - so if you do have an expectation of having it, it won't be there :P But I think if we put the receiver name in something like telemetry.sdk then the information should still be preserved, in a way that matches in some sense what our apps send.
containerMetrics.MemoryReserved = *containerMetadata.Limits.Memory
containerMetrics.CPUReserved = *containerMetadata.Limits.CPU

taskMemLimit += containerMetrics.MemoryReserved
Looks like these values aren't being used anywhere?
Removed.
}

func (acc *metricDataAccumulator) accumulate(
	startTime *timestamp.Timestamp,
From usages of the accumulate method, I think the first parameter being passed in is the timestamp of a reporting interval? This should instead be the time when a cumulative metric was reset to 0. See the comment here for details.
Yeah. I looked into the dockerstatsreceiver and I think it's not setting the value. We do have the PullStartedAt timestamp and I think we can utilize it. However, this needs more thought. For now, I'm not setting this value, to get the default behavior. Will send a separate PR for this.
resourceAttributes[ResourceAttributeServiceNameKey] = ResourceAttributeServiceNameValue

r := &resourcepb.Resource{
	Type: MetricResourceType,
Looks like this same type is being used for all resources from which metrics are being collected, both containers and tasks. Container metrics should be associated with a container resource and similarly a task resource for task metrics.
As @anuraaga mentioned, this will be dropped eventually. However, if it does get utilized, we prefer to use aoc.ecs for all metrics we are receiving from this receiver.
taskMemLimit += containerMetrics.MemoryReserved
taskCPULimit += containerMetrics.CPUReserved

labelKeys, labelValues := containerLabelKeysAndValues(containerMetadata)
Labels/Attributes that describe a resource (container/task) should be collected as attributes on the resource object. Same for labels collected by taskLabelKeysAndValues.
Hi @asuresh4, I am a little bit confused here. Can you explain a bit more? These labels describe the properties that differentiate the metrics. In that case, when should I set metric labels, and to what? Also, these labels are supposed to be converted into metric dimensions.
The kubelet_stats receiver should be a good example to understand the concept. It collects metrics from different types of resources (containers, pods, nodes). The properties of resources are added as labels on the resource. An exporter exporting the metric would treat labels from the resource and the metric as dimensions.
Thanks. Added those as Resource Attributes.
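To make the resource-versus-metric-label split from this thread concrete, here is a minimal sketch. It deliberately uses hypothetical stand-in types (Resource, Metric) rather than the OpenCensus or pdata structs, and the attribute names and the rule that metric labels win on a key collision are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a hypothetical stand-in for the entity (container or task)
// that the metrics describe. Its properties live here, not on each metric.
type Resource struct {
	Attributes map[string]string
}

// Metric is a hypothetical stand-in for a single data point.
type Metric struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// dimensions merges resource attributes and metric labels, the way an
// exporter might when it builds dimensions. Assumption: metric labels
// override resource attributes with the same key.
func dimensions(r Resource, m Metric) map[string]string {
	out := make(map[string]string, len(r.Attributes)+len(m.Labels))
	for k, v := range r.Attributes {
		out[k] = v
	}
	for k, v := range m.Labels {
		out[k] = v
	}
	return out
}

func main() {
	container := Resource{Attributes: map[string]string{
		"container.name": "web",        // illustrative attribute
		"ecs.task.id":    "example-id", // illustrative attribute
	}}
	m := Metric{Name: "container.memory.usage", Value: 42}

	dims := dimensions(container, m)
	keys := make([]string, 0, len(dims))
	for k := range dims {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, dims[k])
	}
}
```

With this split, every metric collected for the same container shares one resource, and the exporter still sees the same set of dimensions.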
github.com/stretchr/testify v1.6.1
go.opentelemetry.io/collector v0.10.1-0.20200917170114-639b9a80ed46
go.uber.org/zap v1.16.0
google.golang.org/protobuf v1.25.0
Looks like these are from importing github.com/golang/protobuf/ptypes/timestamp and google.golang.org/protobuf/types/known/timestamppb. Do you need both? Could probably just use timestamppb in both places.
Updated.
Left some flyby comments, but basically LGTM with @asuresh4's comments, thanks.
"time"

metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
resourcepb "github.com/census-instrumentation/opencensus-proto/gen-go/resource/v1"
@asuresh4 @bogdandrutu Are metrics receivers still using opencensus proto?
No, we changed most of the core to use the otlp and internal structs. Completely recommend for new components to avoid oc.
@hossain-rayhan you need to start using pdata.Metrics
Hi @bogdandrutu, before sending the data to the next consumer I am using internaldata.OCToMetrics(md) to convert our metrics to pdata.Metrics. Isn't that enough, as with other receivers in the repo, or should we strictly get rid of it now?
That is a temporary solution to make progress and not have to change all components at once. We decided to use that for some old components that we did not have time to change.
@hossain-rayhan Yeah so basically you should only be converting at the last moment when passing down, but here in this sort of receiver-specific logic we want to be using pdata, the OTel format. Or we just have to rewrite it right away. We're also having data-model issues because of using the old format (Resource type for example) and we want to make sure the model is right.
Thanks @bogdandrutu and @anuraaga. I understand we need to use pdata to convert everything to OTel format eventually. I was planning to move forward with this to meet our internal deadline (9/30/2020). We can send a different PR after October 15th I guess. How do you guys feel about it?
Issue created: #1122
TaskPrefix = "ecs.task."
ContainerPrefix = "container."
ResourceAttributeServiceNameKey = "service.name"
ResourceAttributeServiceNameValue = "awsecscontainermetrics"
The service name corresponds with an application, not a backend, so for example AuthService, SearchFrontend, etc. We could fill this in with the ECS service name, or otherwise we shouldn't fill it since this isn't the correct semantics.
Yeah, got it. But for all of our AWS receivers, we are using the receiver name as service.name, because this field will be utilized by our CW EMFExporter to generate different rules for different receivers, especially for Container Insights.
What do you mean by AWS receivers? I think the only one we have is xray, which doesn't do this, and definitely shouldn't since we need to make sure the app's service name is used.
I'm not sure what you mean exactly by the rules, but anyways we can't just fill a semantic convention attribute with something that doesn't follow the spec. If anything, the telemetry.sdk matches closer to what this sort of receiver is doing. @bogdandrutu @tigrannajaryan any suggestion on that?
Also @hossain-rayhan it's important to take a step back and remember what this receiver is here for - it's to translate the container metrics data into the OpenTelemetry format / specification. This is because this data seems useful to users regardless of whether they use cloudwatch or not. While we may need some, but hopefully not much, consideration for specific vendors like cloudwatch, that's not the intent here. If you haven't yet, you should go through in detail at least the Resource and Metrics semantic conventions of the OTel spec before proceeding and make sure you are aligned with them: https://github.com/open-telemetry/opentelemetry-specification/tree/master/specification/resource. That doesn't mean we want to block data that's needed, but it's important to follow the spec as much as possible.
This receiver generates ECS container metrics itself; it does not receive any metrics from outside the OTel Collector. For the metrics generated inside the receiver, the idea is to put the receiver name in the service.name attribute on these metrics. It's similar to how the Prometheus receiver uses job_name as service.name for the metrics it scrapes.
I'm not sure what you mean exactly by the rules, but anyways we can't just fill a semantic convention attribute with something that doesn't follow the spec. If anything, the telemetry.sdk matches closer to what this sort of receiver is doing. @bogdandrutu @tigrannajaryan any suggestion on that?
+1. We should not use "service.name" for receiver name. That is not the purpose of "service.name". "service.name" is supposed to describe the source that emits the metrics. Collector is just collecting the metric, it is an intermediary, it is not the source. Nor is "telemetry.sdk" intended for that.
The source that emits the metrics is the container here. If we know the name of the service that runs in the container we should set that. If we don't know we should not record it at all.
I do not know why we want to record the receiver name, perhaps you can clarify the use case. This can then be added as a semantic convention for OpenTelemetry as a whole or just for the Collector and will possibly end up in the "otel" namespace.
@hossain-rayhan If you can give some more detail about this usage that would be great. I think filling in wrong information is blocking this PR, so the easiest way to proceed would be to just remove setting the service name for now and we can figure out a way to handle what you need in a separate PR.
Thanks @anuraaga . I am removing it for now as this should not block the receiver. This is more related to the exporter logic as it's being utilized for supporting special customer use cases. If needed I can support it in a separate PR after further discussion.
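The outcome of this thread (set service.name only when the name of the service running in the container is actually known, and otherwise omit it) can be sketched as below. The resourceAttributes function and its parameters are hypothetical and only illustrate the rule; they are not the code from this PR.

```go
package main

import "fmt"

// resourceAttributes builds attributes for a container resource.
// Hypothetical helper: serviceName is whatever the caller knows about the
// application running in the container, or "" when it is unknown.
func resourceAttributes(containerName, serviceName string) map[string]string {
	attrs := map[string]string{
		"container.name": containerName,
	}
	// Only record service.name when it describes the actual source of the
	// metrics. The receiver name is never used as a fallback.
	if serviceName != "" {
		attrs["service.name"] = serviceName
	}
	return attrs
}

func main() {
	fmt.Println(resourceAttributes("web", ""))
	fmt.Println(resourceAttributes("web", "SearchFrontend"))
}
```

Leaving the key absent, rather than filling it with a placeholder, keeps the attribute meaningful for consumers that rely on the semantic convention.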
"google.golang.org/protobuf/types/known/timestamppb" | ||
)

func convertToOTMetrics(prefix string, m ECSMetrics, labelKeys []*metricspb.LabelKey, labelValues []*metricspb.LabelValue, timestamp *timestamppb.Timestamp) []*metricspb.Metric {
- func convertToOTMetrics(prefix string, m ECSMetrics, labelKeys []*metricspb.LabelKey, labelValues []*metricspb.LabelValue, timestamp *timestamppb.Timestamp) []*metricspb.Metric {
+ func convertToOCMetrics(prefix string, m ECSMetrics, labelKeys []*metricspb.LabelKey, labelValues []*metricspb.LabelValue, timestamp *timestamppb.Timestamp) []*metricspb.Metric {
We're actually converting to opencensus metrics here
Updated.
TaskPrefix = "ecs.task."
ContainerPrefix = "container."
ResourceAttributeServiceNameKey = "service.name"
How about this? conventions.AttributeServiceName
BytesInMiB = 1024 * 1024

TaskPrefix = "ecs.task."
ContainerPrefix = "container."
Why do we need to prefix the metrics with container? If they have container label, they're container metrics right?
I'm not sure if this is part of the OTel convention, but given that other receivers follow this approach, I think we should do the same here for consistency.
Yeah, I found some other receivers doing the same, like kubeletstatsreceiver and dockerstatsreceiver.
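As a small illustration of the prefixing discussed above, the sketch below reuses the TaskPrefix, ContainerPrefix and BytesInMiB constants from the diff. The helpers metricName and bytesToMiB and the metric name memory.usage are hypothetical; the PR's own conversion code may be structured differently.

```go
package main

import "fmt"

// Constants as they appear in the diff under review.
const (
	BytesInMiB      = 1024 * 1024
	TaskPrefix      = "ecs.task."
	ContainerPrefix = "container."
)

// metricName is a hypothetical helper that scopes a metric name to the
// kind of resource it was collected from.
func metricName(prefix, name string) string {
	return prefix + name
}

// bytesToMiB is a hypothetical helper that converts a byte count to whole
// mebibytes, truncating any remainder.
func bytesToMiB(b uint64) uint64 {
	return b / BytesInMiB
}

func main() {
	fmt.Println(metricName(ContainerPrefix, "memory.usage"))
	fmt.Println(metricName(TaskPrefix, "memory.usage"))
	fmt.Println(bytesToMiB(512 * BytesInMiB))
}
```

The same base name yields distinct container-level and task-level metric names, which matches the convention the reviewers point to in kubeletstatsreceiver and dockerstatsreceiver.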
Thanks!
Hi @bogdandrutu @tigrannajaryan @asuresh4 can we get this merged?

taskMetrics := ECSMetrics{}
timestamp := timestampProto(time.Now())
taskResources := taskResources(metadata)
nit: taskResource would be more accurate. Same with containerResources (-> containerResource).
Updated.
acc.accumulate(
	taskResources,
	convertToOCMetrics(TaskPrefix, taskMetrics, nil, nil, timestamp),
If the 3rd and 4th parameters to this method are always nil, I would remove those parameters.
I also thought about it while writing this piece of code. Here, I kept the skeleton ready so the same method can be utilized to set metric labels. In our next PRs, we can just pass the LabelKeys and LabelValues and we are done. If we really don't utilize them, I will remove them.
👍 SGTM
LGTM, apart from a couple of minor comments.
I'll merge once the last comments from @asuresh4 are addressed.
cf75a60 to 91dac3d
Description:
This change adds helper functions for converting ECS resources metrics to OT metrics.
Link to tracking Issue:
#457
Testing:
Unit test added.
Documentation:
README.md