Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UNVERIFIED: fix nil panic due to nil msBuilder #6184

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Groxx
Copy link
Member

@Groxx Groxx commented Jul 22, 2024

admin kafka rereplicate is hitting nil panics on some workflows, and this appears to be the source: a nil mutablestate builder.

I've added it here and am running tests and it may be the fix, but I am NOT yet sure:

  • why it was not added in the past
  • why we did not notice until recently
  • why this cyclic builder reference exists in the first place
  • if this is the correct builder to use (I merely chose it because it was the closest one)

An example of the panic:

github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error
	external/com_github_uber_cadence/common/log/loggerimpl/logger.go:150
github.com/uber/cadence/common/log.CapturePanic
	external/com_github_uber_cadence/common/log/panic.go:46
github.com/uber/cadence/service/history/handler.(*handlerImpl).ReplicateEventsV2.func1
	external/com_github_uber_cadence/service/history/handler/handler.go:1439
runtime.gopanic
	GOROOT/src/runtime/panic.go:770
github.com/uber/cadence/service/history/ndc.(*historyReplicatorImpl).applyEvents.func1
	external/com_github_uber_cadence/service/history/ndc/history_replicator.go:218
runtime.gopanic
	GOROOT/src/runtime/panic.go:770
github.com/uber/cadence/service/history/ndc.(*transactionManagerForNewWorkflowImpl).executeTransaction.func1
	external/com_github_uber_cadence/service/history/ndc/new_workflow_transaction_manager.go:330
runtime.gopanic
	GOROOT/src/runtime/panic.go:770
runtime.panicmem
	GOROOT/src/runtime/panic.go:261
runtime.sigpanic
	GOROOT/src/runtime/signal_unix.go:881
github.com/uber/cadence/service/history/execution.(*HistoryBuilder).AddWorkflowExecutionTerminatedEvent
	external/com_github_uber_cadence/service/history/execution/history_builder.go:345
github.com/uber/cadence/service/history/execution.(*mutableStateBuilder).AddWorkflowExecutionTerminatedEvent
	external/com_github_uber_cadence/service/history/execution/mutable_state_builder.go:3313
github.com/uber/cadence/service/history/execution.(*workflowImpl).terminateWorkflow
	external/com_github_uber_cadence/service/history/execution/workflow.go:266
github.com/uber/cadence/service/history/execution.(*workflowImpl).SuppressBy
	external/com_github_uber_cadence/service/history/execution/workflow.go:192
github.com/uber/cadence/service/history/ndc.(*transactionManagerForNewWorkflowImpl).createAsZombie
	external/com_github_uber_cadence/service/history/ndc/new_workflow_transaction_manager.go:219
github.com/uber/cadence/service/history/ndc.(*transactionManagerForNewWorkflowImpl).executeTransaction
	external/com_github_uber_cadence/service/history/ndc/new_workflow_transaction_manager.go:346
github.com/uber/cadence/service/history/ndc.(*transactionManagerForNewWorkflowImpl).dispatchForNewWorkflow
	external/com_github_uber_cadence/service/history/ndc/new_workflow_transaction_manager.go:114
github.com/uber/cadence/service/history/ndc.(*transactionManagerImpl).createWorkflow
	external/com_github_uber_cadence/service/history/ndc/transaction_manager.go:200
github.com/uber/cadence/service/history/ndc.(*historyReplicatorImpl).applyStartEvents
	external/com_github_uber_cadence/service/history/ndc/history_replicator.go:304
github.com/uber/cadence/service/history/ndc.(*historyReplicatorImpl).applyEvents
	external/com_github_uber_cadence/service/history/ndc/history_replicator.go:226
github.com/uber/cadence/service/history/ndc.(*historyReplicatorImpl).ApplyEvents
	external/com_github_uber_cadence/service/history/ndc/history_replicator.go:197
github.com/uber/cadence/service/history/engine/engineimpl.(*historyEngineImpl).ReplicateEventsV2
	external/com_github_uber_cadence/service/history/engine/engineimpl/historyEngine.go:2838
github.com/uber/cadence/service/history/handler.(*handlerImpl).ReplicateEventsV2
	external/com_github_uber_cadence/service/history/handler/handler.go:1466
github.com/uber/cadence/service/history/wrappers/ratelimited.(*historyHandler).ReplicateEventsV2
	external/com_github_uber_cadence/service/history/wrappers/ratelimited/handler_generated.go:203
github.com/uber/cadence/service/history/wrappers/grpc.GRPCHandler.ReplicateEventsV2
	external/com_github_uber_cadence/service/history/wrappers/grpc/grpc_handler_generated.go:171
github.com/uber/cadence/.gen/proto/history/v1.(*_HistoryAPIYARPCHandler).ReplicateEventsV2
	external/com_github_uber_cadence/.gen/proto/history/v1/service.pb.yarpc.go:1595
go.uber.org/yarpc/encoding/protobuf.(*unaryHandler).Handle
	external/org_uber_go_yarpc/encoding/protobuf/inbound.go:56
... many yarpc middlewares later ...
google.golang.org/grpc.(*Server).processStreamingRPC
	external/org_golang_google_grpc/server.go:1550
google.golang.org/grpc.(*Server).handleStream
	external/org_golang_google_grpc/server.go:1636
google.golang.org/grpc.(*Server).serveStreams.func1.2

This appears to be a nil mutable state builder at service/history/execution/history_builder.go:345, which looks like it can only come from HistoryBuilder instances created by NewHistoryBuilderFromEvents because it doesn't set that field.

I have not yet figured out why that function doesn't set that field though.

Copy link

codecov bot commented Jul 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.78%. Comparing base (33d93e0) to head (8385c45).
Report is 94 commits behind head on master.

Additional details and impacted files
Files with missing lines Coverage Δ
service/history/execution/history_builder.go 94.66% <100.00%> (+0.01%) ⬆️
service/history/execution/mutable_state_builder.go 61.67% <100.00%> (ø)
service/history/execution/state_builder.go 66.87% <100.00%> (ø)

... and 12 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 33d93e0...8385c45. Read the comment docs.

Copy link
Contributor

@3vilhamster 3vilhamster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. Was there a specific workflow that caused this, or was it a general panic somewhere?
After this fix, could you verify if the problem has been resolved?
The interconnection between hBuilder and mutable state looks like a weird architecture, and maybe requires a rework, but the change is fine as a temporary solution.

@davidporter-id-au
Copy link
Member

I don't have enough context to know the right place either, but if it builds and runs, probably ok then

@Groxx
Copy link
Member Author

Groxx commented Jul 30, 2024

Was there a specific workflow that caused this, or was it a general panic somewhere?

There's some kind of workflow that causes a consistent panic here, I've seen it a couple times with different workflows (I even have an older task from several months ago to look into it, and it just happened again). I suspect it has to do with it ending on a terminated event, but that's all I've got at the moment - a nil panic here, on a thing that seems like it should probably never be nil (but clearly is sometimes).

After this fix, could you verify if the problem has been resolved?

I haven't been able to reproduce it, so unfortunately no. But the code paths involved in setting (or not setting) this field are really only 2, so I'm pretty confident it fixes this panic.
Whether that fixes the core "this workflow fails to admin kafka rereplicate" or just causes the error to appear somewhere else: unfortunately no idea at the moment. The fact that this field is seemingly intentionally not populated, for years, has me questioning even the basics about what its intent is, so this first draft is largely to fish for easy answers from others before exploring at random.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants