Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

double close in refreshScalers() #6273

Closed
latk opened this issue Oct 25, 2024 · 1 comment · Fixed by #6309
Closed

double close in refreshScalers() #6273

latk opened this issue Oct 25, 2024 · 1 comment · Fixed by #6309
Labels
bug Something isn't working

Comments

@latk
Copy link

latk commented Oct 25, 2024

Report

I am seeing spurious panics where Keda tries to close an already-closed scaler. This points towards slightly incorrect handling of failing scalers in refreshScaler().

Expected Behavior

Scalers have a clear lifecycle and are always closed exactly once. If scalers fail, they are restarted safely.

Actual Behavior

Scalers can be double-closed, causing a panic.

I stumbled across this problem while investigating a slightly different error, that some scalers seem to occasionally stop producing new metrics. This seems to be correlated with transient scaler errors. I believe – but don't have evidence for this – that this relates to the same refreshScaler() code path, just with a single-close instead of a double-close?

Steps to Reproduce the Problem

  1. Use the Kafka scaler
  2. Be unlucky and experience transient errors with the Kafka broker, causing Keda to go into the refreshScaler() code path multiple times for the same scaler instance.

Logs from KEDA operator

 panic: close of closed channel

goroutine 388 [running]:
github.com/Shopify/sarama.(*client).Close(0xc0009050e0)
	/workspace/vendor/github.com/Shopify/sarama/client.go:271 +0x66
github.com/Shopify/sarama.(*clusterAdmin).Close(0xc00146c1d0?)
	/workspace/vendor/github.com/Shopify/sarama/admin.go:188 +0x22
github.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).Close(0xc000c20018?, {0x0?, 0xc002ee6cc0?})
	/workspace/pkg/scalers/kafka_scaler.go:561 +0x2d
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).refreshScaler(0xc000415780, {0x474cfc8?, 0xc002114690?}, 0x0)
	/workspace/pkg/scaling/cache/scalers_cache.go:227 +0x324
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler(0xc000415780, {0x474cfc8, 0xc002114690}, 0x0, {0xc0028e7600, 0x20})
	/workspace/pkg/scaling/cache/scalers_cache.go:135 +0x1bf
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState(0x0?, {0x474cfc8, 0xc002114690}, 0xc00127ce00)
	/workspace/pkg/scaling/scale_handler.go:572 +0xb4a
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers(0xc0006e61c0, {0x474cfc8, 0xc002114690}, {0x3e7de80?, 0xc00127ce00?}, {0x4736b50, 0xc00293a510})
	/workspace/pkg/scaling/scale_handler.go:236 +0x271
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop(0x0?, {0x474cfc8, 0xc002114690}, 0xc001df9b80, {0x3e7de80, 0xc00127ce00}, {0x4736b50, 0xc00293a510}, 0x0?)
	/workspace/pkg/scaling/scale_handler.go:175 +0x405

Unfortunately the available log seems to be misconfigured, and I can't identify the error messages that led up to this failure.

KEDA Version

2.11.1

Kubernetes Version

None

Platform

None

Scaler Details

Kafka

Anything else?

While the error has been observed in Keda 2.11.1, the relevant code hasn't changed since then. Here is the relevant part of the refreshScaler() function, last changed with PR #4093:

sb := c.Scalers[id]
defer sb.Scaler.Close(ctx)
ns, sConfig, err := sb.Factory()
if err != nil {
return nil, err
}
if id < 0 || id >= len(c.Scalers) {
return nil, fmt.Errorf("scaler with id %d not found, len = %d, cache has been probably already invalidated", id, len(c.Scalers))
}
c.Scalers[id] = ScalerBuilder{
Scaler: ns,
ScalerConfig: *sConfig,
Factory: sb.Factory,
}

One interpretation: resource handling problem.

There are two relevant resources:

  • the old sb.Scaler, which is always closed via the defer
  • the new scaler ns, which is never closed

This is fine along the happy path: the scaler-builder is replaced with the new scaler, and the old scaler is closed.

However, things break if sb.Factory() returns an error, or if the bounds check fails:

  • old scaler is not replaced and still closed, leaving the closed scaler in the cache
  • new scaler fails to be closed

The next time metrics are fetched from the (now closed) scaler, there will be errors, and we will enter the refreshScaler() function again, and will close the old scaler for the second time.

I'm not experienced enough in Go to know how to fix this. I assume that the sb.Scaler.Close() should be moved until after registering the new ScalerBuilder, essentially reverting that part of #4093. And there should probably be a ns.Close() when returning an error from the bounds check. I am also confused by the existence of the bounds check, as a similar check is done a few lines before the quoted snippet. The second bounds check – and with it the diverging control flow – would probably be unnecessary if mutating the sb object in-place. Sketch:

sb := c.Scalers[id]
oldScaler := sb.Scaler

ns, sConfig, err := sb.Factory() 
if err != nil { 
    return nil, err 
}

sb.Scaler = oldScaler
sb.ScalerConfig = sConfig

oldScaler.Close()

Another interpretation: concurrency problem.

Reading through the Sarama client.Close() code, it is only possible for this channel-close to panic if the Close() method is invoked multiple times concurrently. The remainder of the Close() method is protected by a lock. This in turn could happen if Keda can invoke the same Scaler.Close() concurrently.

The bounds check that was added in #4093 seems to try to defend against concurrent modification of the ScalersCache, but it's not entirely safe because the check + update is not atomic. The bounds check just makes it less likely that concurrent modification can crash the program.

An actual fix would likely require introducing a lock for ScalersCache modifications, in particular for the ScalersCache.refreshScaler() and ScalersCache.Close() operations. It could also help to make the ScalersCache.Scalers field private.

@latk latk added the bug Something isn't working label Oct 25, 2024
@JorTurFer
Copy link
Member

JorTurFer commented Nov 3, 2024

Thanks for reporting this issue and also thanks for the research 🙇
I've checked sarama client's code and the only scenario that I can see is the concurrent one because as you said, the sarama client protects itself:
image

I think that the problem could be a concurrent retry. Nevertheless, I think that defer sb.Scaler.Close(ctx) has to be moved too to not keep closed scalers in the cache

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Ready To Ship
Development

Successfully merging a pull request may close this issue.

2 participants