DATAS tuning fixes #105545
Conversation
Tagging subscribers to this area: @dotnet/gc
Force-pushed from 85c1d9c to f6734fc. Commits:

- …f runs; fix to get STRESS_DYNAMIC_HEAP_COUNT compiled
- … recommission; fix for gen1 hitting assert (free_list_space_decrease <= dd_fragmentation (dd))
- …tribute_free and trigger an initial gen2 if needed

Force-pushed from f6734fc to f4fe2d4.
Results from selected aspnet benchmarks - in general I'm seeing good results: the fixes reduce volatility and avoid OOM in certain scenarios. Some benchmarks use more memory but with increased RPS, mostly because we now update the stable size more accurately. Unfortunately this does make PlaintextMvc's RPS lower, albeit with lower memory usage. The continuous gen1 problems due to the incorrect unusable frag are fixed, which is definitely an improvement, but there are other problems that make the baseline's RPS better by chance.

The baseline did a gen1 right after the HC change because the skip ratio is 0 for the new heaps, and that made us do a gen1; this gen1 increased the gen2 size from ~1mb to 3.4mb. The fix did a BGC with a gen0 GC and didn't increase the gen2 obj size. After this the baseline kept doing gen1s due to the unusable frag and kept promoting, while the fix kept not promoting, so its gen2 size stayed small. The unusable frag problem is fixed, but the promotion decision is still a problem, especially in the baseline. We promote because we kept hitting the 1st condition in the promotion check, but since we have >10 heaps, some heap's gen2 is simply very small, so of course the threshold can be more than old_gen_size, like this (68864 is …). In the fix we don't hit this because the gen1 size is big enough, and we don't consume gen1 budget since we don't promote. So this is a bad cycle - the fix should be doing gen1 GCs, but right now it isn't because we never detect that we have too many cards coming from gen1 to gen0. I will be working on these fixes when I'm back from OOF to make them into RC2.
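To make the shape of that promotion check concrete, here is a minimal sketch; the names `should_promote_gen1`, `old_gen_size`, and `threshold` are illustrative placeholders, not the actual gc.cpp identifiers:

```cpp
#include <cstddef>

// Illustrative sketch only - not the actual gc.cpp code.
// With DATAS spreading the heap across many (>10) heaps, a single
// heap's gen2 can be tiny, so the threshold in the first condition
// easily exceeds old_gen_size and we promote on almost every gen1 GC.
static bool should_promote_gen1 (size_t old_gen_size, size_t threshold)
{
    if (threshold > old_gen_size)   // the "1st condition" mentioned above
        return true;                // promote gen1 survivors into gen2
    // ... other promotion conditions elided ...
    return false;
}
```

Promoting on every gen1 then consumes gen1 budget and grows gen2, which is the bad cycle described above; a heap that never promotes (the fix) stays out of it entirely.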
LGTM based on the extensive testing conducted!
This includes the following tuning fixes:

- distribute_free_regions/age_free_regions and setting BCD - what happened was that each time we needed regions for alloc we had to get new ones, because BCD can be >> BCS, and since we don't get regions from the decommit list, we keep accumulating more (see the first sketch after this list). For this I also cleaned up decommit_ephemeral_segment_pages and related methods and only have them defined when needed.
- a fix for hitting assert (free_list_space_decrease <= dd_fragmentation (dd)); in merge_fl_from_other_heaps, because frag simply hasn't been updated after a BGC (see the second sketch after this list). This can also be reproed by checking for frag at the beginning of a GC.
- some instrumentation changes
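As a rough picture of the region-accumulation pattern in the first item, here is a minimal sketch; all names (`region_lists`, `get_region_for_alloc`, `allocate_new_region_from_os`) are hypothetical stand-ins, and the real logic lives in distribute_free_regions/age_free_regions in gc.cpp:

```cpp
#include <vector>
#include <cstdlib>

// Hypothetical stand-ins - not the actual GC data structures.
struct region_lists
{
    std::vector<void*> free_list;      // regions available for allocation
    std::vector<void*> decommit_list;  // regions parked for decommit
};

static void* allocate_new_region_from_os ()
{
    return std::malloc (4 * 1024 * 1024);  // pretend a region is 4 MB
}

// Bug pattern described above: with BCD >> BCS, regions end up parked
// on the decommit list, and since allocation never takes regions back
// from that list, every GC grabs fresh regions from the OS and the
// decommit list keeps accumulating.
static void* get_region_for_alloc (region_lists& lists)
{
    if (!lists.free_list.empty ())
    {
        void* r = lists.free_list.back ();
        lists.free_list.pop_back ();
        return r;
    }
    // Note: decommit_list is never consulted here.
    return allocate_new_region_from_os ();
}
```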
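And a hedged sketch of the assert in the second item; only the assert expression is taken from the PR, while the surrounding function and parameter names are paraphrased placeholders:

```cpp
#include <cassert>
#include <cstddef>

// Paraphrased sketch; only the assert expression comes from the PR.
// If dd_fragmentation is read before it has been refreshed after a
// BGC, a legitimate free_list_space_decrease can appear to exceed the
// recorded fragmentation and trip this assert - which is also why the
// issue repros when frag is checked at the beginning of a GC.
static void check_fl_space (size_t free_list_space_decrease,
                            size_t dd_fragmentation_value)
{
    assert (free_list_space_decrease <= dd_fragmentation_value);
}
```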