Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlstats/insights: fix mem leaks on session close #128400

Merged
merged 4 commits into from
Aug 17, 2024

Conversation

xinhaoz
Copy link
Member

@xinhaoz xinhaoz commented Aug 6, 2024

insights: ensure releasing to Insight pool clears slice

Ensure that when we return objects into the Insight pool
that we release the statmeents in the slice. This fixes
an issue in the logic that returned this object to the pool
which did not nil the Statements slice, making it possible
for these slices to grow in capacity in the node's lifetime,
holding onto garbage Statement objects.

This is a lead up to #128199 which will likely remove
this pool and reuse the existing statmentBuf pool.

Epic: none
Release note: None

insights: add testing knobs

Add insights specific testing knobs. We'll add some
knobs in later commts.

Epic: none
Release note: None

sqlstats/insights: free memory allocated per session on session close

The insights locking registry buffers statement insight objects
by session id until receiving a transaction insight, when the
buffer is emptied. These buffers can leak if the session is closed
midway through a transaction since the registry will never
receive a transaction event for the session. This commit ensures
we clear any memory allocated in insights for a session by
sending an event to clear the container's session entry, if it
exists, on session close.

A testing knob was added, OnCloseSession, which is called
when the locking registry clears a session.

Epic: none
Fixes: #128213

Release note (bug fix): Fixes a memory leak where statement
insight objects may leak if the session is closed without
the transaction finishing.

insights: move insights testing knobs from sqlstats

Move some insights testing knobs that were on the
sqlstats testing knobs to insights pkg.

Epic: none

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@xinhaoz
Copy link
Member Author

xinhaoz commented Aug 6, 2024

I think we might want to add a memory monitor to the insights system but that work seems a bit more involved (gotta inject some contexts into some funcs) so I think I'd rather get this in first for backportability. LMK what y'all think though.

@xinhaoz xinhaoz force-pushed the insights-mem-leak branch from 8365aab to 179d1cb Compare August 6, 2024 19:41
@xinhaoz xinhaoz marked this pull request as ready for review August 7, 2024 14:14
@xinhaoz xinhaoz requested review from a team as code owners August 7, 2024 14:14
@xinhaoz xinhaoz requested review from kyle-a-wong and dhartunian and removed request for a team August 7, 2024 14:14
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, 3 of 11 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kyle-a-wong and @xinhaoz)


-- commits line 7 at r1:
This wording is a bit vague regarding impact. Can you provide a bit more clarity on why this pool is safe to remove? Is it possible that an idiosyncratic workload could generate an insights object for every execution? For instance, if a lot of failures accumulate, do we generate an insight for every single one?

There's a prior issue here that led to some memory pooling changes: #89918
Can we run the same benchmark and confirm that we're not making this worse?


-- commits line 14 at r1:
nit: release note


-- commits line 36 at r2:
nit: space


pkg/sql/sqlstats/insights/ingester.go line 148 at r2 (raw file):

		} else if e.transaction != nil {
			i.registry.ObserveTransaction(e.sessionID, e.transaction)
		} else if e.sessionID != (clusterunique.ID{}) {

There should be a comment here clarifying that this is a "synthetic" event generated in ClearSession, otherwise it's confusing as to why this clause can be triggered.


pkg/sql/sqlstats/insights/provider.go line 50 at r2 (raw file):

// ActiveSessions returns the IDs of all sessions that are currently being observed.
// Used for testing.
func (p *Provider) ActiveSessions() []clusterunique.ID {

nit: I would rename this to include a ForTesting suffix so that it's not misused in the future.


pkg/sql/sqlstats/insights/registry.go line 31 at r2 (raw file):

		// statements can be accessed concurrently in testing.
		statements map[clusterunique.ID]*statementBuf

I'm not sure if the test is adequate motivation to introduce a mutex here. One of the features of the lockingRegistry seems to be that it's accessed in a single async task and invoked in batches. This change makes that nature confusing.

Is there a way that the tests can be modified to avoid this?

@xinhaoz
Copy link
Member Author

xinhaoz commented Aug 9, 2024

pkg/sql/sqlstats/insights/registry.go line 31 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I'm not sure if the test is adequate motivation to introduce a mutex here. One of the features of the lockingRegistry seems to be that it's accessed in a single async task and invoked in batches. This change makes that nature confusing.

Is there a way that the tests can be modified to avoid this?

One alternative I was thinking about is to introduce a testing knob callback that we can call when the clear session event is processed. We can manage the sync structure in the test, then and not in the locking registry directly. WDYT?

@xinhaoz
Copy link
Member Author

xinhaoz commented Aug 9, 2024

-- commits line 7 at r1:

Previously, dhartunian (David Hartunian) wrote…

This wording is a bit vague regarding impact. Can you provide a bit more clarity on why this pool is safe to remove? Is it possible that an idiosyncratic workload could generate an insights object for every execution? For instance, if a lot of failures accumulate, do we generate an insight for every single one?

There's a prior issue here that led to some memory pooling changes: #89918
Can we run the same benchmark and confirm that we're not making this worse?

The fix for that issue was applied when we generated insights objects at the statement level (it wrapped each stmt insight object). Now we generate an insight object per txn. Initial reasoning for deleting this was that we don't pool the allocation for Statement and Transaction objects, which are created more or just as frequently as this object is. The PR addressing the initial memory issue added pooling in some other areas so it could be that this specific one was not as impactful. Maybe it does not hurt to keep it around for the problematic workloads you described? I think we should be pooling the Statement and Transaction objects in the insights package if we are doing it here for this, thoughts?

Copy link
Member Author

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @kyle-a-wong)


-- commits line 7 at r1:

Previously, xinhaoz (Xin Hao Zhang) wrote…

The fix for that issue was applied when we generated insights objects at the statement level (it wrapped each stmt insight object). Now we generate an insight object per txn. Initial reasoning for deleting this was that we don't pool the allocation for Statement and Transaction objects, which are created more or just as frequently as this object is. The PR addressing the initial memory issue added pooling in some other areas so it could be that this specific one was not as impactful. Maybe it does not hurt to keep it around for the problematic workloads you described? I think we should be pooling the Statement and Transaction objects in the insights package if we are doing it here for this, thoughts?

Thought about a little more and ran some ycsbA to look at the heap. I think we should pursue pooling the insight Statement and Transaction objects. Going to do that in a separate PR and keep this PR ot just clearing the object when we return it to the pool, without adding or removing things :)


pkg/sql/sqlstats/insights/registry.go line 31 at r2 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

One alternative I was thinking about is to introduce a testing knob callback that we can call when the clear session event is processed. We can manage the sync structure in the test, then and not in the locking registry directly. WDYT?

Going to pursue this.

@xinhaoz xinhaoz force-pushed the insights-mem-leak branch from 179d1cb to c469bc9 Compare August 9, 2024 20:45
@xinhaoz xinhaoz requested a review from a team as a code owner August 9, 2024 20:45
Copy link
Member Author

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @kyle-a-wong)


-- commits line 14 at r1:

Previously, dhartunian (David Hartunian) wrote…

nit: release note

Done


-- commits line 36 at r2:

Previously, dhartunian (David Hartunian) wrote…

nit: space

Done.


pkg/sql/sqlstats/insights/ingester.go line 148 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

There should be a comment here clarifying that this is a "synthetic" event generated in ClearSession, otherwise it's confusing as to why this clause can be triggered.

Done.

@xinhaoz xinhaoz force-pushed the insights-mem-leak branch 5 times, most recently from f0739a6 to 578c250 Compare August 12, 2024 15:00
@xinhaoz xinhaoz requested a review from dhartunian August 12, 2024 19:28
@xinhaoz xinhaoz force-pushed the insights-mem-leak branch from 578c250 to 7ac19b6 Compare August 12, 2024 19:35
Ensure that when we return objects into the Insight pool
that we release the statmeents in the slice. This fixes
an issue in the logic that returned this object to the pool
which did not nil the `Statements` slice, making it possible
for these slices to grow in capacity in the node's lifetime,
holding onto garbage Statement objects.

This is a lead up to cockroachdb#128199 which will likely remove
this pool and reuse the existing statmentBuf pool.

Epic: none
Release note: None
@xinhaoz xinhaoz force-pushed the insights-mem-leak branch from 7ac19b6 to b727ee1 Compare August 13, 2024 15:05
Add insights specific testing knobs. We'll add some
knobs in later commts.

Epic: none
Release note: None
The insights locking registry buffers statement insight objects
by session id until receiving a transaction insight, when the
buffer is emptied. These buffers can leak if the session is closed
midway through a transaction since the registry will never
receive a transaction event for the session. This commit ensures
we clear any memory allocated in insights for a session by
sending an event to clear the container's session entry, if it
exists, on session close.

A testing knob was added, OnCloseSession, which is called
when the locking registry clears a session.

Epic: none
Fixes: cockroachdb#128213

Release note (bug fix): Fixes a memory leak where statement
insight objects may leak if the session is closed without
the transaction finishing.
@xinhaoz xinhaoz force-pushed the insights-mem-leak branch from b727ee1 to d8c9e2d Compare August 13, 2024 23:00
Move some insights testing knobs that were on the
sqlstats testing knobs to insights pkg.

Epic: none

Release note: None
@xinhaoz xinhaoz force-pushed the insights-mem-leak branch from d8c9e2d to 2da6863 Compare August 14, 2024 16:09
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work. thanks for the separate commits. :lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kyle-a-wong)

@xinhaoz
Copy link
Member Author

xinhaoz commented Aug 17, 2024

TFTR!
bors r+

@craig craig bot merged commit 1993fc0 into cockroachdb:master Aug 17, 2024
23 checks passed
michae2 pushed a commit to michae2/cockroach that referenced this pull request Aug 19, 2024
128400: sqlstats/insights: fix mem leaks on session close r=xinhaoz a=xinhaoz

insights: ensure releasing to Insight pool clears slice

Ensure that when we return objects into the Insight pool
that we release the statmeents in the slice. This fixes
an issue in the logic that returned this object to the pool
which did not nil the `Statements` slice, making it possible
for these slices to grow in capacity in the node's lifetime,
holding onto garbage Statement objects.

This is a lead up to cockroachdb#128199 which will likely remove
this pool and reuse the existing statmentBuf pool.

Epic: none
Release note: None

insights: add testing knobs

Add insights specific testing knobs. We'll add some
knobs in later commts.

Epic: none
Release note: None

sqlstats/insights: free memory allocated per session on session close

The insights locking registry buffers statement insight objects
by session id until receiving a transaction insight, when the
buffer is emptied. These buffers can leak if the session is closed
midway through a transaction since the registry will never
receive a transaction event for the session. This commit ensures
we clear any memory allocated in insights for a session by
sending an event to clear the container's session entry, if it
exists, on session close.

A testing knob was added, OnCloseSession, which is called
when the locking registry clears a session.

Epic: none
Fixes: cockroachdb#128213

Release note (bug fix): Fixes a memory leak where statement
insight objects may leak if the session is closed without
the transaction finishing.

insights: move insights testing knobs from sqlstats

Move some insights testing knobs that were on the
sqlstats testing knobs to insights pkg.

Epic: none

Release note: None

Co-authored-by: Xin Hao Zhang <[email protected]>
xinhaoz added a commit to xinhaoz/cockroach that referenced this pull request Sep 11, 2024
In cockroachdb#128400 we nil'd the entries in the Insight.Statements slice on
`releaseInsight`, which is called when evicting insights from the cache.
This caused a race leading to a nil pointer dereference:
The ListExecutionInsights status server grpc was making  shallow
copies of the insights objects from the cache. After releasing the
RLock we marshal the shallow copies into the grpc response. If
insights are simultaneously getting evicted during that process,
they clear the shared `Statements` slice leading to a nil pointer
dereference in the marhsaling code.

We address this by creating a new slice for each insight when creating
the grpc response. Additionally, Insights retrieved from the pool are
modified so that they reuse the Statements slice that was prepped
during `releaseInsight`, something it was not previously doing.

Epic: none
Fixes: cockroachdb#130290
Fixes: cockroachdb#129936

Release note: None
@xinhaoz xinhaoz deleted the insights-mem-leak branch January 10, 2025 05:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

insights: ensure per-session statementBufs are cleared on session close
3 participants