
configure_azure_monitor() takes abnormally long time #34902

Open · greatvovan opened this issue Mar 23, 2024 · 5 comments

Labels: Client (This issue points to a problem in the data-plane of the library), customer-reported (Issues that are reported by GitHub users external to the Azure organization), feature-request (This issue requires a new behavior in the product in order to be resolved), Monitor - Exporter (Monitor OpenTelemetry Exporter), needs-team-attention (Workflow: This issue needs attention from Azure service team or SDK team), question (The issue doesn't require a change to the product in order to be resolved; most issues start as that), Service Attention (Workflow: This issue is the responsibility of the Azure service team)

Comments

greatvovan commented Mar 23, 2024

  • Package Name: azure-monitor-opentelemetry
  • Package Version: 1.3.0
  • Operating System: MacOS
  • Python Version: 3.10.13

Describe the bug

With the default configuration (i.e. no configuration passed at all), configure_azure_monitor() takes an abnormally long time to execute: ~10 seconds.

To Reproduce

long.py:

import time
from azure.monitor.opentelemetry import configure_azure_monitor

start = time.monotonic()
configure_azure_monitor()
print(time.monotonic() - start)

APPLICATIONINSIGHTS_CONNECTION_STRING="..." python long.py

Expected behavior

Reasonable time to configure (< 1 s).

Additional context

After running it under a debugger, I found two main places in the code contributing to the delay; both are related to checking whether the process is running on an Azure VM.

  1. Resource detection.

Location: https://github.com/open-telemetry/opentelemetry-python-contrib/blob/37aba928d45713842941c7efc992726a79ea7d8a/resource/opentelemetry-resource-detector-azure/src/opentelemetry/resource/detector/azure/vm.py#L77

The way the code gets there (screenshot of the call stack omitted):

Then, in https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/resource/opentelemetry-resource-detector-azure/src/opentelemetry/resource/detector/azure/vm.py (screenshot omitted).

  2. Statsbeat metrics

Location:

request_url = "{0}?{1}&{2}".format(
_AIMS_URI, _AIMS_API_VERSION, _AIMS_FORMAT)
response = requests.get(
request_url, headers={"MetaData": "True"}, timeout=5.0)

Call stack (screenshot omitted).

In both cases the delay comes from requests to the same endpoint (though with different API versions):

http://169.254.169.254/metadata/instance/compute

The first call site uses a request timeout of 4 seconds and the second uses 5 seconds, which together account for almost the entire startup delay.
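For illustration (this is not the library's code), a minimal sketch of why each probe blocks for roughly its full timeout on a non-Azure machine: 169.254.169.254 is a link-local address that only answers inside Azure, so the connection attempt typically hangs until the timeout fires (on some networks it may instead fail immediately):

# Minimal sketch: timing a single probe of the Azure Instance Metadata Service
# endpoint from outside Azure. The exact headers/API version used by the
# library differ; the point is only the blocking behaviour of the timeout.
import time
import requests

IMDS_URL = "http://169.254.169.254/metadata/instance/compute"

start = time.monotonic()
try:
    requests.get(IMDS_URL, headers={"Metadata": "true"}, timeout=4.0)
except requests.exceptions.RequestException:
    pass  # expected off Azure: connect timeout or connection error
print(f"probe took {time.monotonic() - start:.1f} s")  # up to ~4 s off Azure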

Workarounds

  1. Exclude the Azure resource detectors by setting the OTEL_EXPERIMENTAL_RESOURCE_DETECTORS=otel environment variable. If it is not set, the library applies a default value that includes the App Service and Azure VM detectors.
  2. Disable sending statsbeat by setting APPLICATIONINSIGHTS_STATSBEAT_DISABLED_ALL=TRUE.

The above tweaks bring the configuration time down to ~0.8 s (and with OTEL_PYTHON_DISABLED_INSTRUMENTATIONS set to azure_sdk,django,fastapi,flask,psycopg2,requests,urllib,urllib3 it completes in under 30 ms). A combined sketch of these settings is shown below.
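For reference, a minimal sketch of applying both workarounds (plus the optional instrumentation tweak) from Python, assuming the library reads these environment variables when configure_azure_monitor() runs; exporting them in the shell before starting the process works equally well:

import os
import time

# Workaround 1: keep only the generic OTel resource detector.
os.environ.setdefault("OTEL_EXPERIMENTAL_RESOURCE_DETECTORS", "otel")
# Workaround 2: disable statsbeat entirely.
os.environ.setdefault("APPLICATIONINSIGHTS_STATSBEAT_DISABLED_ALL", "TRUE")
# Optional (the <30 ms case): also skip the bundled instrumentations.
# os.environ.setdefault(
#     "OTEL_PYTHON_DISABLED_INSTRUMENTATIONS",
#     "azure_sdk,django,fastapi,flask,psycopg2,requests,urllib,urllib3",
# )

from azure.monitor.opentelemetry import configure_azure_monitor

start = time.monotonic()
configure_azure_monitor()
print(time.monotonic() - start)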

It took me hours to find the above options for fixing the startup time without touching the code. I think the library needs to be made friendlier to running in non-Azure environments.

github-actions bot added the Client, customer-reported, Monitor - Exporter, needs-team-attention, question, and Service Attention labels on Mar 23, 2024

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jeremydvoss @lzchen.

lzchen added the feature-request label on Mar 26, 2024
lzchen (Member) commented Mar 26, 2024

@greatvovan

Thanks a bunch for taking a look into this and providing insights. The Azure Monitor distro is not designed primarily with performance in mind. It is a one-stop-shop telemetry solution that bundles all the telemetry pillars and a select number of default instrumentations together, so we focus more on simplicity and functionality than on performance. That being said, we deliberately did not expose OTEL_EXPERIMENTAL_RESOURCE_DETECTORS or APPLICATIONINSIGHTS_STATSBEAT_DISABLED_ALL, because those are meant either to keep their default values or to stay unset, trading some background CPU processing for valuable diagnostic information. Also, you may have gotten your config time down to ~0.8 s by setting OTEL_PYTHON_DISABLED_INSTRUMENTATIONS, but what is the purpose of a distro that does not automatically collect anything, even if it starts up quickly?

I think we need to be more specific about the problem statement: what problem are we trying to solve, and for which cohort of customers? I think it should be reworded as: customers who are not in Azure environments should not have to wait on the Azure metadata service that you have pointed out, which causes the application to stall for almost 10 s on startup. We should probably find a way to run this check in parallel, or in the background at runtime.
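To make the "run it in the background" idea concrete, here is a rough sketch (not the actual fix; probe_azure_metadata and start_background_probe are hypothetical names) of moving the blocking IMDS call off the startup path onto a daemon thread:

# Rough sketch of deferring the blocking metadata probe to a daemon thread so
# configure_azure_monitor() could return without waiting for it.
import threading
import requests

_vm_metadata = None  # filled in later by the background probe, or left as None

def probe_azure_metadata():
    # Hypothetical stand-in for the real detector/statsbeat IMDS call
    # (the real code also passes an api-version and format query string).
    try:
        resp = requests.get(
            "http://169.254.169.254/metadata/instance/compute",
            headers={"Metadata": "true"},
            timeout=4.0,
        )
        return resp.json()
    except requests.exceptions.RequestException:
        return None  # not on an Azure VM, or IMDS unreachable

def start_background_probe():
    def _worker():
        global _vm_metadata
        _vm_metadata = probe_azure_metadata()
    t = threading.Thread(target=_worker, daemon=True, name="azure-imds-probe")
    t.start()
    return t  # startup continues immediately; callers may join() later if needed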

greatvovan (Author) commented Mar 26, 2024

@lzchen thank you for your response.

I agree, there is no point in disabling all of that in production; after all, startup time is usually not that critical there (though that is debatable in many cases). These are rather dirty workarounds to remove annoying delays on a developer's machine, and possibly in other non-cloud deployments. Note that disabling the azure_vm resource detector seemingly has zero impact on functionality when not running on an Azure VM, yet still saves ~5 s of startup time.

Regarding statsbeat, the problem seems to be not in statsbeat itself but in the way it is initialized: it again tries to reach that metadata endpoint, after which it works without delays.

As for the cohort of customers, I think it is first of all Azure application developers, because before instrumentation makes its way to the cloud it has to be coded on the developer's machine (and this is my case). Having a 10 s delay on startup is quite annoying, to be honest. Secondly, another category could be customers who run their software on-prem but have other ties to Azure and want to send logs to the cloud. Probably a single parameter disabling all Azure-bound checks would be enough.

lzchen (Member) commented Mar 27, 2024

@greatvovan

I agree those are great points. We will investigate a way to optimize this so users will have a better experience.

lzchen (Member) commented Apr 2, 2024

#35039 and open-telemetry/opentelemetry-python-contrib#2387 reduce the timeout to 200 ms. This should alleviate some of the startup delay in cases where users are not on an Azure VM.
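For context, a back-of-the-envelope estimate of what the timeout change buys on a non-Azure machine, assuming the two probes still run serially and both are capped at 200 ms as the two PRs suggest:

# Rough worst-case blocking time before and after the timeout change.
old_worst_case = 4.0 + 5.0   # detector timeout + statsbeat timeout, seconds
new_worst_case = 0.2 + 0.2   # both probes capped at 200 ms
print(old_worst_case, "->", new_worst_case)  # 9.0 -> 0.4 seconds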
