Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inventory performance improvements and fixes fetching interfaces and services #202

Merged
merged 16 commits into from
May 19, 2020

Conversation

DouglasHeriot
Copy link
Contributor

@DouglasHeriot DouglasHeriot commented May 11, 2020

Fixes #142 #143
Implements #199 #200
(Replaces closed pull request #196)

Refactors how interfaces, ip addresses and services are fetched.

Overview

Introduces two new options:

  • fetch_all - when fetching interfaces, services, and IPs should all be fetched, or should the plugin filter on device/vm ids
  • max_uri_length - maximum length of a GET request. Affects how many &device_id=1&device_id=2 etc. can fit into a single request.

The previous flow:

  • Fetch "lookups" (sites, platforms, device types, etc.)
  • Fetch hosts
  • When iterating through hosts, fetch interfaces, ips, and services for each host

New flow:

  • Fetch hosts
  • Fetch lookups
  • Fetch interfaces and services. If fetch_all is True (default), get all. Otherwise only request for specific hosts
  • If fetch_all is True, get all IPs. Otherwise, if any of a host's interfaces have count_ipaddresses > 0, then fetch IPs for those hosts
  • When extracting groups and host vars, simply find in lookup tables instead of performing more HTTP requests.

Details

Issue #142 is solved by always querying for VMs and devices by ID and not name. Names are never passed in URLs anymore so there'll be no issues with spaces, or duplicate names. I also decided #199 to add a host var is_virtual to clearly declare whether a host is a device or vm. This is used in logic about whether a host id should be passed into a device or vm NetBox API.

There's 2 different ways I've optimised fetching interfaces/ips/services, controlled by the fetch_all option as different query_filter options will determine which method will have the fastest result.

  • Previously, there would be up to 3 requests per device. With 1000 devices, that's 3000 extra HTTP requests. Quickly get slow and puts unreasonable load on the NetBox server.
  • fetch_all: True - all interfaces, IPs that are assigned_to_interface=true, and all services are requested from NetBox. This will give the fastest result when you are requesting all hosts (ie. there are no query_filters)
  • fetch_all: False - interfaces and services are fetched only for specific hosts, appending &device_id=X or &virtual_machine_id=X to the requests. The optimisation here is that you can pass multiple IDs to a single request to reduce HTTP and server-side overhead. The new max_uri_length setting determines how many of these IDs will be passed into a single request. For reference, different versions of Internet Explorer is limited to somewhere around 1-2k. Chrome appears to have no limit. Apache server defaults to 8190, AWS ALB is limited to 8K. In reality I found I was getting a HTTP 400 error if I went much above 4K, so I decided 4000 was a sensible default. Users may lower this limit if their particular server won't accept these long requests.

Another change is the handling of virtual chassis #200. Previously child devices were returned as separate hosts. As the whole point of a virtual chassis is to represent devices that share a control plane, for the purpose of Ansible users you should not ever need to connect to child devices as individual hosts. A pull request ansible/ansible#60642 by @Yannis100 proposed removing child devices and I agree it makes sense.

I made that change as part of this pull request as it affects the logic of looking up a device's interfaces. Interfaces that belong to child devices will appear as part of the master device as in the Netbox UI and APIs.

Performance tests #143

Our corporate NetBox has grown since I last did these tests - now have 2515 devices, 3333 IPs, and 12202 interfaces.

All tests have interfaces: True and services: True

Settings Time HTTP Requests
Current 0.2.0 release 10m45s 5139
New, fetch_all: True (default) 1m11s 30
New, fetch_all: False 1m46s 48

So this is roughly a 10x improvement! fetch_all:True is the fastest in the default case.

With query_filters

It's also interesting to test the case with some query_filters defined. We usually use device_query_filters: - has_primary_ip: 'true' as our NetBox has many passive devices (eg. patch panels) that Ansible does not care about. In this case:

Settings Time HTTP Requests
Current 0.2.0 release 3m19s 2051
New, fetch_all: True (default) 50s 28
New, fetch_all: False 38s 30

It's interesting in this case that fetch_all: False is the fastest, and only request 2 extra HTTP requests. The requests return quicker as there's much less data.

Other minor changes

  • Added to the test data - an interface on the virtual chassis child device, some IPs, and some services.
  • Updated the inventory json compare script to remove timestamps and urls from the json stored in the repo to avoid that changing every commit.

netbox-community/netbox#4313 removed id__in as you can acheive the same by listing id multiple times.
… interfaces/services.

A side effect is it resolves netbox-community#142 fetching services for VMs

Includes better support virtual chasis - only take the master device and not the children. Some of this from ansible/ansible#60642

See the TODO comments for work still to be done before being ready to merge.
* Allow tuning for largest size permitted by your webserver, for optimum performance reducing number of required HTTP requests.

* Handle exceptions from within threads and raise on the main thread after being joined. This ensures that any HTTP errors from refresh_ methods will stop execution of the rest of the plugin. For example, you'll notice if you receive a HTTP 414 URI Too Long.

Added unit tests for these things.
Also added example of useful device_query_filters
I think I've improved this enough to add my name to it.
I found in my install I was getting HTTP 400 errors with 8000 length. 4000 length works. May depend on web server, CDN, etc.
@DouglasHeriot
Copy link
Contributor Author

Looks like adding the extra interfaces to the test data broke the integration tests and IDs have changed. Tomorrow I think I'll fix this by updating netbox-deploy.py to add the new interfaces at the end instead of breaking existing tests.

@DouglasHeriot DouglasHeriot force-pushed the performance branch 2 times, most recently from e6465d7 to 3df3078 Compare May 12, 2020 05:07
Netbox 2.6 seems to return these in a different order than 2.7 and 2.8. Order shouldn't be important.
@DouglasHeriot
Copy link
Contributor Author

I've almost finished fixing the tests. The issue I've run into is that Netbox 2.6 doesn't support querying for multiple device ids at the same time (eg. http://localhost:32768/api/ipam/ip-addresses/?limit=0&assigned_to_interface=true&device_id=1&device_id=4 returns only the IPs for device id 4, and not for 1)

This can be fixed by setting the new max_uri_length option to 0 to always use the shortest URIs possible (ie. query for 1 device at a time)

I'm not sure whether we want to:

  • Modify tests to set that, and communicate to users in documentation that when using fetch_all:False on Netbox 2.6 you also have to set max_uri_length:0. But in that case you'll probably find fetch True is faster anyway.
  • Have the inventory plugin detect the current Netbox version based on /api/docs/?format=openapi. Benefit of this would be some constants like ALLOWED_DEVICE_QUERY_PARAMETERS could be filled in automatically instead of having to be kept up to date. I think I might go down this path.

@DouglasHeriot DouglasHeriot marked this pull request as draft May 12, 2020 09:25
* Gets the API version, to allow working around netbox-community/netbox#3507 in tests
* While I'm at it, it also allows fetching allowed_device_query_parameters dynamically instead of hard-coding the list.
…_uri_length: 0

When fetching interfaces of virtual-chassis child devices in a separate request to the master, the interfaces will be returned twice.

Converted some lists to dicts to efficiently check if the interface ID already exists.

Made same change for services and IP Addresses, in case they have similar issues.
@DouglasHeriot DouglasHeriot marked this pull request as ready for review May 12, 2020 14:08
@DouglasHeriot
Copy link
Contributor Author

Ok this is ready for review now.

When tracking down the issue with the tests on NetBox 2.6, I discovered that netbox-community/netbox#3507 is the cause and was only fixed in v2.7.5.
I can workaround this in 2.6 by looking at api_version, but not in v2.7.0-v2.7.4 as the api_version is just 2.7. Users of those will have to manually set max_uri_length to 0, but it's probably faster to keep fetch_all: True instead.

I also came across an edge case I hadn't handled - when fetch_all was False and interfaces were requested in separate requests they were duplicated. I adjusted the test inventory yml files to include checking this case (max_uri_length: 0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inventory bug: services queried by device name doesn't work for VMs, or names with spaces
2 participants