Fix resource id in virtual machine scale sets with azure_monitor output #5821

Merged
merged 3 commits into from
May 20, 2019

Conversation

danielnelson
Contributor

Use alternate resource-id for virtual machine scale sets.

closes #5819

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

@danielnelson danielnelson added the fix pr to fix corresponding bug label May 8, 2019
@danielnelson
Contributor Author

@johncrim Do you have any time available to help test? We should verify this fix on both a scaleset and a non-scaleset virtual machine. If you could do either of these, it would be much appreciated.

@johncrim

johncrim commented May 9, 2019 via email

@danielnelson
Contributor Author

Great, testing the .deb should definitely be sufficient on this issue.

@johncrim

@danielnelson : I'm still getting the same error as before on servers in the VM ScaleSet. My hunch is that the logic to detect whether the VM is in a scaleset or a standalone VM isn't working. I'll review the changes and try to troubleshoot a bit more to see if I can help.

vm000:~$ systemctl status telegraf --all -n 20
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/lib/systemd/system/telegraf.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2019-05-14 16:17:17 UTC; 1min 43s ago
     Docs: https://github.com/influxdata/telegraf
 Main PID: 72690 (telegraf)
    Tasks: 10
   Memory: 22.4M
      CPU: 223ms
   CGroup: /system.slice/telegraf.service
           └─72690 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

May 14 16:17:17 vm000 systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
May 14 16:17:17 vm000 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
May 14 16:17:17 vm000 telegraf[72690]: 2019-05-14T16:17:17Z I! Starting Telegraf
May 14 16:17:17 vm000 telegraf[72690]: 2019-05-14T16:17:17Z I! Loaded inputs: cpu diskio mem net
May 14 16:17:17 vm000 telegraf[72690]: 2019-05-14T16:17:17Z I! Loaded aggregators:
May 14 16:17:17 vm000 telegraf[72690]: 2019-05-14T16:17:17Z I! Loaded processors:
May 14 16:17:17 vm000 telegraf[72690]: 2019-05-14T16:17:17Z I! Loaded outputs: azure_monitor
May 14 16:17:17 vm000 telegraf[72690]: 2019-05-14T16:17:17Z I! Tags enabled: host=vm000
May 14 16:17:17 vm000 telegraf[72690]: 2019-05-14T16:17:17Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"vm000", Flush Interval:10s
May 14 16:18:00 vm000 telegraf[72690]: 2019-05-14T16:18:00Z E! [agent] Error writing to output [azure_monitor]: failed to write batch: [404] 404 Not Found
May 14 16:18:10 vm000 telegraf[72690]: 2019-05-14T16:18:10Z E! [agent] Error writing to output [azure_monitor]: failed to write batch: [404] 404 Not Found
May 14 16:18:20 vm000 telegraf[72690]: 2019-05-14T16:18:20Z E! [agent] Error writing to output [azure_monitor]: failed to write batch: [404] 404 Not Found
May 14 16:18:30 vm000 telegraf[72690]: 2019-05-14T16:18:30Z E! [agent] Error writing to output [azure_monitor]: failed to write batch: [404] 404 Not Found

It still works normally on the standalone VM.

@johncrim

A little more diagnostic info:

Querying the metadata service on a VM in the scaleset:

vm000:~$ curl -H Metadata:true "http://169.254.169.254/metadata/instance?api-version=2017-12-01"
{"compute":{"location":"westus2","name":"vm0_0","offer":"UbuntuServer","osType":"Linux","placementGroupId":"<guid>","platformFaultDomain":"0","platformUpdateDomain":"0","publisher":"Canonical","resourceGroupName":"rg","sku":"16.04-LTS","subscriptionId":"<guid>","tags":"","version":"16.04.201904240","vmId":"<guid>","vmScaleSetName":"vm0","vmSize":"Standard_B2s","zone":""},"network":...}

Querying the metadata service on a standalone VM:

jcdev:~$ curl -H Metadata:true "http://169.254.169.254/metadata/instance?api-version=2017-12-01"
{"compute":{"location":"westus2","name":"jcdev","offer":"UbuntuServer","osType":"Linux","placementGroupId":"","platformFaultDomain":"0","platformUpdateDomain":"0","publisher":"Canonical","resourceGroupName":"rg","sku":"16.04-LTS","subscriptionId":"<guid>","tags":"","version":"16.04.201904240","vmId":"<guid>","vmScaleSetName":"","vmSize":"Standard_B1ms","zone":""},"network":...}

In both cases I edited the response to remove any potentially sensitive info.
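The two responses differ in `vmScaleSetName`: it is non-empty on the scaleset VM and empty on the standalone VM. As a minimal sketch (not Telegraf's actual code), that field can be decoded and checked as follows; the type `instanceMetadata` and the helpers `parseMetadata` and `inScaleSet` are hypothetical names chosen for illustration, and only the JSON keys come from the output above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// instanceMetadata mirrors only the fields of the metadata response that
// matter for this issue. The Go names are illustrative assumptions; the
// JSON keys are taken from the curl output quoted in this thread.
type instanceMetadata struct {
	Compute struct {
		Name              string `json:"name"`
		ResourceGroupName string `json:"resourceGroupName"`
		SubscriptionID    string `json:"subscriptionId"`
		VMScaleSetName    string `json:"vmScaleSetName"`
	} `json:"compute"`
}

// parseMetadata decodes a metadata service response body.
func parseMetadata(body []byte) (instanceMetadata, error) {
	var m instanceMetadata
	err := json.Unmarshal(body, &m)
	return m, err
}

// inScaleSet reports whether the VM belongs to a scaleset, which the
// metadata service signals with a non-empty vmScaleSetName.
func (m instanceMetadata) inScaleSet() bool {
	return m.Compute.VMScaleSetName != ""
}

func main() {
	// Trimmed-down versions of the two responses shown above.
	bodies := [][]byte{
		[]byte(`{"compute":{"name":"vm0_0","resourceGroupName":"rg","vmScaleSetName":"vm0"}}`),
		[]byte(`{"compute":{"name":"jcdev","resourceGroupName":"rg","vmScaleSetName":""}}`),
	}
	for _, body := range bodies {
		m, err := parseMetadata(body)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%s: in scaleset = %t\n", m.Compute.Name, m.inScaleSet())
	}
}
```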

@johncrim

@danielnelson : I'm pretty confident that I've identified the bug: The last segment of the resourceId on a VM ScaleSet needs to be the VM ScaleSet name. The current code is using the computer name.

For example, with the metadata shown above, the resource ID is currently:

/subscriptions/<guid>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachineScaleSets/vm0_0

And it should be:

/subscriptions/<guid>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachineScaleSets/vm0

Note that it would be a bit easier to troubleshoot if the URL were logged when a 404 occurs. I don't know if that idea violates any security standards in the telegraf code base, but it certainly would have saved me a bunch of time.

@danielnelson
Contributor Author

Thanks for the testing. I believe I have fixed the issue in these new builds; can you give them a try?

If you still have problems, run Telegraf with --debug set (either on the cli or in the agent configuration) and it should print the computed resource URL on startup.

@johncrim

@danielnelson - Thank you for the fix. Unfortunately, it's still not working on the VM in a scaleset. With --debug set, the logs show:

May 16 16:30:26 <vm hostname> systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z I! Starting Telegraf
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z I! Loaded inputs: mem net cpu diskio
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z I! Loaded aggregators:
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z I! Loaded processors:
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z I! Loaded outputs: azure_monitor
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z I! Tags enabled: host=<vm hostname>
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"<vm hostname>", Flush Interval:10s
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z D! [agent] Connecting outputs
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z D! [agent] Attempting connection to output: azure_monitor
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z D! Writing to Azure Monitor URL: https://westus2.monitoring.azure.com/subscriptions/<guid>/resourceGroups/<resourcegroup>/providers/Micro
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z D! [agent] Successfully connected to output: azure_monitor
May 16 16:30:27 <vm hostname> telegraf[4064]: 2019-05-16T16:30:27Z D! [agent] Starting service inputs
May 16 16:30:40 <vm hostname> telegraf[4064]: 2019-05-16T16:30:40Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:30:50 <vm hostname> telegraf[4064]: 2019-05-16T16:30:50Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:31:00 <vm hostname> telegraf[4064]: 2019-05-16T16:31:00Z D! [outputs.azure_monitor] buffer fullness: 239 / 10000 metrics.
May 16 16:31:00 <vm hostname> telegraf[4064]: 2019-05-16T16:31:00Z E! [agent] Error writing to output [azure_monitor]: failed to write batch: [404] 404 Not Found
May 16 16:31:10 <vm hostname> telegraf[4064]: 2019-05-16T16:31:10Z D! [outputs.azure_monitor] buffer fullness: 239 / 10000 metrics.
May 16 16:31:10 <vm hostname> telegraf[4064]: 2019-05-16T16:31:10Z E! [agent] Error writing to output [azure_monitor]: failed to write batch: [404] 404 Not Found
May 16 16:31:20 <vm hostname> telegraf[4064]: 2019-05-16T16:31:20Z D! [outputs.azure_monitor] buffer fullness: 239 / 10000 metrics.
May 16 16:31:20 <vm hostname> telegraf[4064]: 2019-05-16T16:31:20Z E! [agent] Error writing to output [azure_monitor]: failed to write batch: [404] 404 Not Found
May 16 16:31:30 <vm hostname> telegraf[4064]: 2019-05-16T16:31:30Z D! [outputs.azure_monitor] buffer fullness: 239 / 10000 metrics.
May 16 16:31:30 <vm hostname> telegraf[4064]: 2019-05-16T16:31:30Z E! [agent] Error writing to output [azure_monitor]: failed to write batch: [404] 404 Not Found
May 16 16:31:40 <vm hostname> telegraf[4064]: 2019-05-16T16:31:40Z D! [outputs.azure_monitor] buffer fullness: 239 / 10000 metrics.
May 16 16:31:40 <vm hostname> telegraf[4064]: 2019-05-16T16:31:40Z E! [agent] Error writing to output [azure_monitor]: failed to write batch: [404] 404 Not Found

If I manually set the resourceId in /etc/telegraf/telegraf.conf to:

resource_id = "/subscriptions/<guid>/resourceGroups/<resourcegroup>/providers/Microsoft.Compute/virtualMachineScaleSets/<vm scaleset name>"

Then, as before, metric reporting works as expected:

May 16 16:35:20 <vm hostname> systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z I! Starting Telegraf
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z I! Loaded inputs: cpu diskio mem net
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z I! Loaded aggregators:
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z I! Loaded processors:
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z I! Loaded outputs: azure_monitor
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z I! Tags enabled: host=<vm hostname>
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"<vm hostname>", Flush Interval:10s
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z D! [agent] Connecting outputs
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z D! [agent] Attempting connection to output: azure_monitor
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z D! Writing to Azure Monitor URL: https://westus2.monitoring.azure.com/subscriptions/<guid>/resourceGroups/<resourcegroup>/providers/Micro
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z D! [agent] Successfully connected to output: azure_monitor
May 16 16:35:20 <vm hostname> telegraf[7938]: 2019-05-16T16:35:20Z D! [agent] Starting service inputs
May 16 16:35:40 <vm hostname> telegraf[7938]: 2019-05-16T16:35:40Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:35:50 <vm hostname> telegraf[7938]: 2019-05-16T16:35:50Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:36:00 <vm hostname> telegraf[7938]: 2019-05-16T16:36:00Z D! [outputs.azure_monitor] wrote batch of 239 metrics in 428.641657ms
May 16 16:36:00 <vm hostname> telegraf[7938]: 2019-05-16T16:36:00Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:36:10 <vm hostname> telegraf[7938]: 2019-05-16T16:36:10Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:36:20 <vm hostname> telegraf[7938]: 2019-05-16T16:36:20Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:36:30 <vm hostname> telegraf[7938]: 2019-05-16T16:36:30Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:36:40 <vm hostname> telegraf[7938]: 2019-05-16T16:36:40Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:36:50 <vm hostname> telegraf[7938]: 2019-05-16T16:36:50Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:37:00 <vm hostname> telegraf[7938]: 2019-05-16T16:37:00Z D! [outputs.azure_monitor] wrote batch of 239 metrics in 460.359917ms
May 16 16:37:00 <vm hostname> telegraf[7938]: 2019-05-16T16:37:00Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:37:10 <vm hostname> telegraf[7938]: 2019-05-16T16:37:10Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:37:20 <vm hostname> telegraf[7938]: 2019-05-16T16:37:20Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:37:30 <vm hostname> telegraf[7938]: 2019-05-16T16:37:30Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:37:40 <vm hostname> telegraf[7938]: 2019-05-16T16:37:40Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:37:50 <vm hostname> telegraf[7938]: 2019-05-16T16:37:50Z D! [outputs.azure_monitor] buffer fullness: 0 / 10000 metrics.
May 16 16:38:00 <vm hostname> telegraf[7938]: 2019-05-16T16:38:00Z D! [outputs.azure_monitor] wrote batch of 239 metrics in 502.457162ms

Would it be possible to debug log the evaluated resource ID? I'll take a look at your changes again (I'm not a Go developer, but it's pretty easy to read).

@johncrim

@danielnelson : This looks like the problem:

	if m.Compute.VMScaleSetName == "" {
		return fmt.Sprintf(
			resourceIDScaleSetTemplate,
			m.Compute.SubscriptionID,
			m.Compute.ResourceGroupName,
			m.Compute.VMScaleSetName,
		)
	} else {
		return fmt.Sprintf(
			resourceIDTemplate,
			m.Compute.SubscriptionID,
			m.Compute.ResourceGroupName,
			m.Compute.Name,
		)
	}

The if/else bodies are switched. If the VMScaleSetName is empty, use the VM template.

if m.Compute.VMScaleSetName == "" {
template = resourceIDTemplate
return fmt.Sprintf(


if m.Compute.VMScaleSetName != ""
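With the condition inverted as suggested, the selection logic might look like the sketch below. The template names come from the snippet quoted above, and the scaleset path matches the resource ID shown earlier in this thread. The `virtualMachines` path, the `compute` struct, and the `resourceID` function are assumptions made for illustration, not the merged code.

```go
package main

import "fmt"

const (
	// Assumed value: the standard Azure resource path for a standalone VM.
	resourceIDTemplate = "/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachines/%s"
	// Matches the scaleset resource ID format shown earlier in this thread.
	resourceIDScaleSetTemplate = "/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachineScaleSets/%s"
)

// compute is a hypothetical stand-in for the metadata fields used above.
type compute struct {
	SubscriptionID    string
	ResourceGroupName string
	Name              string
	VMScaleSetName    string
}

// resourceID picks the scaleset template only when a scaleset name is
// present, and ends the ID with the scaleset name rather than the VM name.
func resourceID(c compute) string {
	if c.VMScaleSetName != "" {
		return fmt.Sprintf(resourceIDScaleSetTemplate,
			c.SubscriptionID, c.ResourceGroupName, c.VMScaleSetName)
	}
	return fmt.Sprintf(resourceIDTemplate,
		c.SubscriptionID, c.ResourceGroupName, c.Name)
}

func main() {
	// Placeholder subscription ID "sub"; names follow the examples above.
	fmt.Println(resourceID(compute{"sub", "rg", "vm0_0", "vm0"}))
	fmt.Println(resourceID(compute{"sub", "rg", "jcdev", ""}))
}
```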

@danielnelson
Contributor Author

Thanks. The resource ID is essentially the Azure Monitor URL in the debug output, but I think it would make sense to print it explicitly. I'll try to circle back on this later today, but here are the builds with the fixed logic:

@johncrim

Thanks @danielnelson. This .deb build works as expected on both types of Azure VM resources. I think you're good to go.

@danielnelson danielnelson added this to the 1.11.0 milestone May 20, 2019
@danielnelson danielnelson merged commit ad877fd into master May 20, 2019
@danielnelson danielnelson deleted the azure-scaleset-resource-id branch May 20, 2019 21:32
@danielnelson
Contributor Author

Good news, thanks again for the testing

hwaastad pushed a commit to hwaastad/telegraf that referenced this pull request Jun 13, 2019
bitcharmer pushed a commit to bitcharmer/telegraf that referenced this pull request Oct 18, 2019
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020
@anildesai61

anildesai61 commented Nov 14, 2023

Hello @johncrim @danielnelson

I have an Ubuntu 18.04 LTS VMSS on Azure. Below are the steps I followed:

apt install telegraf -y
telegraf --input-filter cpu:mem:diskio:net --output-filter azure_monitor config > /etc/telegraf/telegraf.conf
systemctl restart telegraf

I added this line to the /etc/telegraf/telegraf.conf file:
resoure_id = "/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachineScaleSets/%s"
However, the Telegraf metrics are still not visible under the Metrics option of the VMSS. Can you please help?

telegraf status

telegraf.service - Telegraf
Loaded: loaded (/lib/systemd/system/telegraf.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-11-14 18:01:09 UTC; 2min 6s ago
Docs: https://github.com/influxdata/telegraf
Main PID: 3505 (telegraf)
Tasks: 7 (limit: 4915)
CGroup: /system.slice/telegraf.service
└─3505 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

Nov 14 18:01:09 waf-teleg000000 telegraf[3505]: 2023-11-14T18:01:09Z I! Tags enabled: host=waf-teleg000000
Nov 14 18:01:09 waf-teleg000000 systemd[1]: Started Telegraf.
Nov 14 18:01:09 waf-teleg000000 telegraf[3505]: 2023-11-14T18:01:09Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"waf-teleg000000", Flush Interval:10s
Nov 14 18:02:09 waf-teleg000000 telegraf[3505]: 2023-11-14T18:02:09Z E! [agent] Error writing to outputs.azure_monitor: unable to fetch authentication credentials: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://centralindia.monitoring.azure.com/subscriptions/xxxx-xx-xx-xx-xxx-xxx-xxx-xxx-xxxx/resourceGroups/TELEGRAPH/providers/Microsoft.Compute/virtualMachineScaleSets/waf-telegraph1/metrics: StatusCode=400 -- Original Error: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"invalid_request","error_description":"Identity not found"} Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fmonitoring.azure.com%2F

Telegraf config file.

#Send aggregate metrics to Azure Monitor
[[outputs.azure_monitor]]
#Timeout for HTTP writes.
timeout = "20s"

#Set the namespace prefix, defaults to "Telegraf/".
namespace_prefix = "Telegraf/Apache"

#Azure Monitor doesn't have a string value type, so convert string
#fields to dimensions (a.k.a. tags) if enabled. Azure Monitor allows
#a maximum of 10 dimensions so Telegraf will only send the first 10
#alphanumeric dimensions.
strings_as_dimensions = false

#Both region and resource_id must be set or be available via the
#Instance Metadata service on Azure Virtual Machines.
#Azure Region to publish metrics against
region = "centralindia"

#The Azure Resource ID against which metric will be logged, e.g.
resource_id = "/subscriptions/xxx-xx-x-xxxxx-xxx-xx/resourceGroups/TELEGRAPH/providers/Microsoft.Compute/virtualMachineScaleSets/waf-telegraph1"

Please share an example Telegraf config file for a VMSS. I want to scale up the VMSS based on Apache request metrics reported by Telegraf.

@powersj
Contributor

powersj commented Nov 14, 2023

@anildesai61 please stop putting comments on closed PRs and issues. If you want support or help, please use Slack or the community forums.

@anildesai61

@anildesai61 please stop putting comments on closed PRs and issues. If you want support or help, please use Slack or the community forums.

Sure, I will open a new PR.

Labels
fix pr to fix corresponding bug
Development

Successfully merging this pull request may close these issues.

azure_monitor support for VM scale sets
5 participants