Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove refresh from server path and refresh from network call #16657

Conversation

tyler-cai-microsoft
Copy link
Contributor

@tyler-cai-microsoft tyler-cai-microsoft commented Aug 1, 2023

AB#5118

Remove "refresh from server" logic. We are making this PR into next instead of main as we want to wait a bit before we commit to removing "refresh from server" logic. Instead of refreshing we are closing the container and restarting. More details here: #15140

This requires updating the summarizerNode logic and containerRuntime logic. Work to move when we decide to close the container sits in AB#5152

@github-actions github-actions bot added area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: next PRs targeted against next branch labels Aug 1, 2023
@msfluid-bot
Copy link
Collaborator

msfluid-bot commented Aug 2, 2023

@fluid-example/bundle-size-tests: -6.58 KB
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 441.03 KB 437.75 KB -3.29 KB
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 239.79 KB 236.5 KB -3.29 KB
loader.js 147.26 KB 147.26 KB No change
map.js 46.36 KB 46.36 KB No change
matrix.js 139.17 KB 139.17 KB No change
odspDriver.js 88.98 KB 88.98 KB No change
odspPrefetchSnapshot.js 41.67 KB 41.67 KB No change
sharedString.js 155.79 KB 155.79 KB No change
sharedTree2.js 234.64 KB 234.64 KB No change
Total Size 1.64 MB 1.63 MB -6.58 KB

Baseline commit: 6076896

Generated by 🚫 dangerJS against 0c15d64

@tyler-cai-microsoft tyler-cai-microsoft marked this pull request as ready for review August 2, 2023 18:08
@tyler-cai-microsoft tyler-cai-microsoft requested review from a team as code owners August 2, 2023 18:08
@markfields
Copy link
Member

Can we discuss the strategy of merging into next? It's not free - we have other Summarizer refactoring work planned for the semester, and if we have to deal with merging with this every time, it will be a real drag.

Alternatives - Just wait until we're confident, or if we're eager to start coding now, put it in a side branch until we're ready - that way the burden of merging with main doesn't come as a tax to the whole team as we work in this code.

@tyler-cai-microsoft tyler-cai-microsoft changed the base branch from next to main August 4, 2023 18:27
@tyler-cai-microsoft
Copy link
Contributor Author

Can we discuss the strategy of merging into next? It's not free - we have other Summarizer refactoring work planned for the semester, and if we have to deal with merging with this every time, it will be a real drag.

Alternatives - Just wait until we're confident, or if we're eager to start coding now, put it in a side branch until we're ready - that way the burden of merging with main doesn't come as a tax to the whole team as we work in this code.

Changed to main. Let's wait until 6.0 is cut and merge the changes into 6.1

@github-actions github-actions bot added base: main PRs targeted against main branch and removed base: next PRs targeted against next branch labels Aug 28, 2023
* Updates GC state from the given snapshot if GC is enabled and the snapshot is newer than the one this node
* is tracking.
*/
private async refreshGCStateFromSnapshot(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's so good to see this gone :-)

settings["Fluid.ContainerRuntime.Test.CloseSummarizerDelayOverrideMs"] = 100;
});

itExpects(
"Closes the summarizing client instead of refreshing",
[
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the order of events (and event name) change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the retry scenario, we call fetchSnapshotFromStorageAndClose, and we don't need to call refreshLatestSummary, because we are closing the container as we fetch the snapshot before passing it into refreshLatestSummary.

In the latest ack scenario, we "refresh" the snapshot and then I moved the fetchSnapshotFromStorageAndClose to the containerRuntime layer so it closes after "refresh"

);

if (result.latestSummaryUpdated && !result.wasSummaryTracked) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a comment here explaning why in this case fetch snapshot is called.

Also nit: The fetchLatestSnapshot() const is no longer required. You can directly call fetchSnapshotFromStorageAndClose here.

Copy link
Member

@markfields markfields Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I had left a comment about this too but VS Code ate it. There's lots of logic in there that doesn't necessarily make sense anymore, would be good to clean up here or in a follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most troubling is waitForDeltaManagerToCatchup which I'm pretty sure will hang if we're not already caught up (It should throw IMO since the Container is disposed). I wonder how often this is happening already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That wait function is also called in the refreshLatestAck codepath in submitSummary and this concern should be addressed there too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. It's a little weird that there is code running after the container runtime is explicitly disposed. For example, when refreshLatestSummaryAckFromServer is called from submitSummary, it would dispose the container runtime but would continue summarization and eventually fail when checkContinue is called later (and the failure reason would be different I believe). We should short circuit this whole thing and return a failure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe that be cleaned up as part of AB#5417

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaning up / short-circuiting later sounds fine, as long as we're sure we don't have a potential deadlock - I think we do. That needs to be sorted out immediately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to AB#5417

Good point. It's a little weird that there is code running after the container runtime is explicitly disposed. For example, when refreshLatestSummaryAckFromServer is called from submitSummary, it would dispose the container runtime but would continue summarization and eventually fail when checkContinue is called later (and the failure reason would be different I believe). We should short circuit this whole thing and return a failure.

import { ITelemetryBaseLogger } from "@fluidframework/core-interfaces";
import {
cloneGCData,
getGCDataFromSnapshot,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove this from gc/index.ts, maybe others here too, take a look

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Copy link
Member

@markfields markfields left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good! Fine to do more cleanup in later PRs. My one concern I want to see addressed (or we can discuss why not now) is the remaining call to waitForDeltaManagerToCatchup, in submitSummary. I'm concerned it could deadlock when the container is disposed.

@tyler-cai-microsoft tyler-cai-microsoft merged commit 21d00fb into microsoft:main Aug 30, 2023
{
eventName: "ClosingSummarizerOnSummaryStale",
codePath,
message: "Stopping fetch from storage",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this field for? It's redundant with eventName, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it will prevent the error's message from being copied over to the event, not that we typically look for that, but JFYI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I realized this is just a code move, not actually new here, but something minor to consider in your next PRs)

readAndParseBlob,
);

// Note that we did not track this summary, but that the latest summary was updated.
Copy link
Member

@markfields markfields Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where we are updating latest summary. If it's true that we're not, then let's update this and the doc comment above accordingly -- not to mention the naming in the return type! Makes sense not to since we're about to close.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe even rename this function to something like findPendingSummaryMatchingSummaryAck

agarwal-navin added a commit that referenced this pull request Sep 1, 2023
## Description
#16657 removed the logic to refresh a summary's state by downloading a
snapshot. With it gone, summarizer node (and rest of the system) only
updates state from a summary that it was tracking. This PR simplifies
the logic in `SummarizerNode::refreshLatestSummary` to account for this.
It also renames the properties in the returned object to give the caller
more context on what happened and what action it can take.

Additionally, it removes logging the `PendingSummaryNotFound` telemetry.
It is currently logged every time the summarizer loaded for the summary
that is loaded from. The reason is that it processes the ack for the
summary is loaded from and since the summary was not generated by this
instance of the summarizer, this event is logged. Instead, added a
property `pendingSummaryTracked` to the refreshLatestSummary_end event.


[AB#4435](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/4435)
tyler-cai-microsoft added a commit that referenced this pull request Oct 17, 2023
[AB#5146](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5146)

Originally this was added to differentiate between the Refresh from
Server code and the Close Code which has already been removed:
#16657, as we always
close.

The information this event is already contained in both
`RefreshLatestSummaryFromServerFetch_end` and during `Summarize_cancel`.

Remove ClosingSummarizerOnSummaryStale telemetry as it is duplicate to
both `RefreshLatestSummaryFromServerFetch_end` and
`RefreshLatestSummaryAckFetch_end` telemetry
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants