Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SqlServer: Massive performance degradation in PerformanceCounters Query #7236

Closed
leepfrog-ger opened this issue Mar 26, 2020 · 7 comments · Fixed by #7242
Closed

SqlServer: Massive performance degradation in PerformanceCounters Query #7236

leepfrog-ger opened this issue Mar 26, 2020 · 7 comments · Fixed by #7242
Assignees
Labels
area/sqlserver bug unexpected problem or unintended behavior regression something that used to work, but is now broken
Milestone

Comments

@leepfrog-ger
Copy link

leepfrog-ger commented Mar 26, 2020

Relevant telegraf.conf:

[[inputs.sqlserver]]
  ## Specify instances to monitor with a list of connection strings.
  ## All connection parameters are optional.
  ## By default, the host is localhost, listening on default port, TCP 1433.
  ##   for Windows, the user is the currently running AD user (SSO).
  ##   See https://github.com/denisenkom/go-mssqldb for detailed connection
  ##   parameters.
  servers = [
    "Server=REDACTED"
  ]

  ## Optional parameter, setting this to 2 will use a new version
  ## of the collection queries that break compatibility with the original
  ## dashboards.
  query_version = 2

System info:

  • Telegraf v1.14.0-rc1
  • SQL Server 2014/2016 on Windows Server

Steps to reproduce:

  1. have a relatively busy SQL instance (e.g. 500 databases and 40 resource pools/groups)
  2. Use SQL profiler observe Execution times of PerformanceCounters queries

Expected behavior:

Similar performance to the telegraf version we used previously (from around mid 2019): in the ballpark on 500-800ms query duration for a busy server

Actual behavior:

Query durations between 60-90 seconds on the same busy server.

Additional info:

Running the old/new versions of the PerformanceCounters query manually shows the same behaviour: old query sub-second duration, new query 60+ seconds.

I've done some troubleshooting and it seems that the issue is due to a removed OPTION (RECOMPILE) query hint in the following commit: 1df88dd#diff-5359fe28fcc15fe8b504f2d8033ce59dL662

Adding it back leads to similar execution times than before. Some quick research leads me to believe that the execution plan is optimized very poorly without that option as the row estimates are vastly off.

I wonder why it was removed and if there is any reason to not add it back.

@danielnelson
Copy link
Contributor

@denzilribeiro Can you take a look? Is it possible statistics need rebuilt?

@danielnelson danielnelson added area/sqlserver bug unexpected problem or unintended behavior regression something that used to work, but is now broken labels Mar 26, 2020
@denzilribeiro
Copy link
Contributor

What version are you seeing the degradation against? A recompile each time does eat CPU - seems like you got stuck with a bad plan.

@danielnelson danielnelson added this to the 1.14.1 milestone Mar 26, 2020
@leepfrog-ger
Copy link
Author

From what I read you cannot have statistics for table variables (which this script uses). Therefore the query optimizer always assumes an estimated row count of "1" for building the plan. I am not sure why adding the recompile option changes things then, but the difference is massive as described above.

This is happening with the most recent version and can be reproduced over different SQL servers. Impact seems to increase significantly when having many databases/resource groups, maybe therefore it was not discovered earlier.

@denzilribeiro
Copy link
Contributor

denzilribeiro commented Mar 26, 2020

In compat 150 some of that is changed ( later/latest versions and compat) part of reason also wasn't caught - and yes realized there was a table variable now. All queries had option recompile and was overkill, when removing it an masse, didn't remember this one had an Table variable sorry.

Table variable deferred compilation

I will submit a PR tomorrow to fix.
If not on compat 150 in the interim this could help: https://www.mssqltips.com/sqlservertip/5662/table-variable-deferred-compilation-in-sql-server/

@denzilribeiro
Copy link
Contributor

If you could actual send me the Plan ( run the query and get an actual plan) would be great

@leepfrog-ger
Copy link
Author

Thanks for the explanation, makes sense. I think for now I'll be sticking with my custom patched telegraf binary as I dont have SQL 2019 running yet.

I've just invited you to a private repo that contains an execution plan from an affected system. That particular system has ~150 databases, ~20 resource pools, ~50 workload groups. The PerformanceCounter query returns around 2300 rows and takes 10 seconds query time without option(recompile).

@leepfrog-ger
Copy link
Author

I can confirm this is now working as intended with the most recent release (1.14.2)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/sqlserver bug unexpected problem or unintended behavior regression something that used to work, but is now broken
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants