Create framework for safe rollout of back-compatible runtime document schema changes #20174

vladsud · 2024-03-18T15:40:38Z

Required pre-reading

https://github.com/microsoft/FluidFramework/blob/main/packages/dds/SchemaVersioning.md
Discussion in https://dev.azure.com/fluidframework/internal/_workitems/edit/5699

Problem statement

We lack a safe way of deploying new runtime capabilities. Our clients will make mistakes when deciding when it’s safe to enable new capabilities like op compression / op grouping. When mistakes happen, we want them to be very easy to identify.
- Based on our own experience with Microsoft applications, we know that is not always the case today. For example, assert 0x162 was previously hit in production, and a lot of smart engineers were not able to track it down. But as part of testing #20109, I hit it right away when I made a mistake in config and let an old runtime version collaborate with a new runtime version that had the new features enabled.

Proposed solution (implementation)

The proposed solution adds a document schema controller that manages part of the document schema: the part that the FluidFramework runtime controls (summaries & op format).

All new capabilities (like op compression) will follow this lifecycle:

1. An ability to write in the new schema is added and shipped dark.
2. The new capability will not be used by the runtime (even in the presence of # 3 below) until the document schema reflects that the capability is required to interpret the document.
3. An application, when it is safe for the application, will instruct the runtime (through feature gates or a code change) to start leveraging the new capability.
4. When the runtime detects a mismatch between the capabilities required by the document schema and the feature set requested of the runtime, it will do the following:
   1. If it’s a subtraction (i.e. the document schema says compression is required, but options/configuration instruct not to use compression), the feature will be disabled in the current session. This allows the application to quickly pull back capabilities that are causing trouble in production. At the same time, the document schema will still require the capability, as some parts of the document might depend on it (i.e. old application versions will still not be able to open such documents).
   2. If it’s an addition (i.e. the application wants to use compression, but it’s not referenced in the document schema), the document schema will be changed to list the new capability (through ops, communicating to all current and future clients). Only after that has happened can the new capability be used.
      1. Depending on the capability, it might take effect in future sessions only.
5. All existing document format features should consult the controller's current session settings to determine whether a feature is enabled. They should not consult anything else, including options, feature gates, etc. Those are inputs into the controller's calculus only.
6. A “legacy” mode is added for all the applications that were shipping preliminary / internal 2.0 bits to production, and thus were defining document schema implicitly (by using new capabilities). This mode will not generate ops (as part of the # 4.2 flow), as such ops would break unsuspecting older clients.
   1. Once these changes ship and saturate to a reasonable level, such applications are advised to graduate from legacy mode.
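
The subtraction / addition handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FeatureSet`, `ReconcileResult` and `reconcile` are hypothetical names, features are modelled as simple booleans, and legacy mode is left out.

```typescript
// Hypothetical shapes for illustration only; not the actual FluidFramework types.
type FeatureSet = Record<string, boolean>;

interface ReconcileResult {
	// Features the current session is allowed to use.
	sessionFeatures: FeatureSet;
	// Schema to propose through an op, or undefined when no change is needed.
	proposedSchema: FeatureSet | undefined;
}

function reconcile(documentSchema: FeatureSet, desired: FeatureSet): ReconcileResult {
	const sessionFeatures: FeatureSet = {};
	const proposed: FeatureSet = { ...documentSchema };
	let needsChange = false;

	for (const name of Object.keys({ ...documentSchema, ...desired })) {
		const inDocument = documentSchema[name] === true;
		const wanted = desired[name] === true;

		// A feature is used in this session only if the document schema already
		// lists it AND the application asks for it. This covers subtraction:
		// the document still requires the feature, but the session stops using it.
		sessionFeatures[name] = inDocument && wanted;

		// Addition: the application wants a feature the document does not list yet.
		// The schema must change first; the feature stays off until then.
		if (wanted && !inDocument) {
			proposed[name] = true;
			needsChange = true;
		}
	}

	return { sessionFeatures, proposedSchema: needsChange ? proposed : undefined };
}
```

In this sketch, a subtraction never produces a proposal, so the document schema keeps requiring the capability, while an addition yields a proposed schema that would still have to be accepted through ops before the feature turns on.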

Also see "Implementation details" section at the bottom for deeper details.

Important outcome:
When / if applications enable new capabilities too early (i.e. before reasonable saturation occurs), old clients will fail in a predictable way (when they see the document schema change op), making it much easier for application developers (and FF developers) to diagnose such issues and react to them faster.
This mostly helps future 2.0 -> 3.0 schema changes though. 1.3 -> 2.0 has only a partial solution (1.3 will fail to process document schema change ops, but if such ops are summarized, FF 1.3 could continue to limp along).

Implementation details / discussion

As of now, I use “regular” runtime ops to communicate document schema changes. This could be changed in the future to some other mechanism (like quorum). The document schema has a version (the version defines expectations around the structure that represents the document schema, not the capabilities of the document schema). The version can be bumped in the future to define a different operation mode. Like any other change, it would need to ship dark and saturate before it can engage, but old clients that do not understand the new schema will fail and will not cause eventual consistency issues.

Schemas are changed in a compare-and-swap manner. A client that wants to change the schema sends a message that essentially says “change the schema if the document schema is A (before change)”. If some other client changed the schema right before it, then the operation will fail (be treated as a noop).
At the moment, clients do not retry. Expectation is that further sessions will do full recall, and if needed, will propose a new schema.

Schema is serialized in a summary and changes to schema follow eventual consistency rules (i.e. schema can be reliably calculated at any sequence number in lifetime of the document based on ops, or summaries + ops).
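
A rough sketch of the compare-and-swap rule and of why replaying ops gives every client the same schema. All names here (`DocSchema`, `SchemaChangeOp`, `applySchemaChange`, `schemaAfter`) are hypothetical, and the `revision` counter is only an assumed stand-in for "the schema the sender saw before the change".

```typescript
// Hypothetical shapes for illustration only; not the actual FluidFramework types.
interface DocSchema {
	revision: number; // assumed stand-in for identifying "schema A"
	features: Record<string, boolean>;
}

interface SchemaChangeOp {
	expectedRevision: number; // the schema the sender observed before proposing
	features: Record<string, boolean>;
}

// Compare-and-swap: the change applies only if the schema is still the one
// the sender saw. Otherwise the op is treated as a noop.
function applySchemaChange(current: DocSchema, op: SchemaChangeOp): DocSchema {
	if (op.expectedRevision !== current.revision) {
		return current;
	}
	return { revision: current.revision + 1, features: { ...op.features } };
}

// The schema at any point is a pure function of the summarized schema plus the
// sequenced ops that follow it, so all clients compute the same result.
function schemaAfter(summarySchema: DocSchema, ops: readonly SchemaChangeOp[]): DocSchema {
	return ops.reduce((schema, op) => applySchemaChange(schema, op), summarySchema);
}
```

If two clients race with proposals based on the same starting schema, whichever op is sequenced first wins and the second becomes a noop, which matches the no-retry behavior described above.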

Abe27342 · 2024-03-18T16:32:29Z

packages/runtime/container-runtime/src/summary/documentSchema.ts

+	}
+
+	const msg = "Document can't be opened with current version of the code";
+	if (documentSchema.version !== currentDocumentVersionSchema) {


How are you imagining this bit evolves over time? The handling here is a little surprising to me: I would have thought this would be an area where we add back-compat code for older versions and only throw on 'new' versions we don't understand, e.g. something like `if (!supportedDocSchemaVersions.has(documentSchema.version)) { throw }` (or even a `<=` if we went with a number over a string). Of course that approach is equivalent for now since we only support one version.

Based on interface contracts, I would have expected us to generally support

We can make it complex when we need it. For now, code understands only 1.0, and has no clue what 2.0 means, so it validates it directly.
Or are you asking - should we support some future where 2.0 is back-compatible with 1.0?
It's hard for me to imagine what 2.0 would mean / shape it takes, so I'm not sure I want to build something more complicated today to support it (like a property that tells that even though it's 2.0, 1.0 should proceed with it as if it was 1.0)
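
For comparison, the two checks discussed in this thread could look roughly like this. Only the identifiers `currentDocumentVersionSchema`, `supportedDocSchemaVersions` and the error message come from the diff and the comment above; the `"1.0"` values and both function names are assumptions made for the sketch.

```typescript
// Values are assumed for illustration; the real constants live in documentSchema.ts.
const currentDocumentVersionSchema = "1.0";
const supportedDocSchemaVersions: ReadonlySet<string> = new Set(["1.0"]);
const msg = "Document can't be opened with current version of the code";

// Approach taken in the PR: only the exact current version is accepted.
function checkExactVersion(version: string): void {
	if (version !== currentDocumentVersionSchema) {
		throw new Error(msg);
	}
}

// Alternative raised in review: accept any version from a known set,
// and throw only on versions this code does not understand.
function checkSupportedVersion(version: string): void {
	if (!supportedDocSchemaVersions.has(version)) {
		throw new Error(msg);
	}
}
```

While only one version exists, both functions accept and reject the same inputs; they would diverge only once a second supported version is added to the set.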

msfluid-bot · 2024-03-18T22:11:49Z

Warnings

⚠️ Bundle size regression detected -- please investigate before merging!

⯅ @fluid-example/bundle-size-tests: +21.36 KB

| Metric Name | Baseline Size | Compare Size | Size Diff |
| --- | --- | --- | --- |
| aqueduct.js | 514.76 KB | 520.09 KB | ⯅ +5.33 KB |
| azureClient.js | 605.87 KB | 611.22 KB | ⯅ +5.35 KB |
| connectionState.js | 680 Bytes | 680 Bytes | ■ No change |
| containerRuntime.js | 249.19 KB | 254.53 KB | ⯅ +5.33 KB |
| fluidFramework.js | 340.94 KB | 340.94 KB | ■ No change |
| loader.js | 127.97 KB | 127.97 KB | ■ No change |
| map.js | 41.35 KB | 41.35 KB | ■ No change |
| matrix.js | 143.61 KB | 143.61 KB | ■ No change |
| odspClient.js | 574.33 KB | 579.68 KB | ⯅ +5.35 KB |
| odspDriver.js | 97.49 KB | 97.49 KB | ■ No change |
| odspPrefetchSnapshot.js | 41.91 KB | 41.91 KB | ■ No change |
| sharedString.js | 161.38 KB | 161.38 KB | ■ No change |
| sharedTree.js | 331.08 KB | 331.08 KB | ■ No change |
| Total Size | 3.3 MB | 3.32 MB | ⯅ +21.36 KB |

Baseline commit: fb8fb7d

Generated by 🚫 dangerJS against a209315

andre4i

LGTM 👍 from my side

This is mostly a back-up mechanism to kick out old runtimes that are not compatible with the latest runtime. It's a backup in the sense that we should prefer an explicit mechanism: adding a new property to the list of features supported / required by the runtime. However, we could find bugs much later in time, after multiple runtime versions have shipped claiming they understand new capabilities. In simple cases we can simply rename / add a new property to signify that only the latest runtime version actually understands it. But in more nuanced cases (where some old runtime versions are broken, and some old runtime versions are not), that mechanism will be insufficient.

I prefer here to check the exact version and, if needed, list 10 versions, as opposed to doing math on them (i.e. something like >=). The latter is a bit risky as we keep changing our version schema encoding, and I'm not confident we can predict how these kinds of math comparisons will work out. Plus, I do not want to take a dependency on the semver package.

The amount of new code here is relatively small, and it's dormant until we start using it at some point in the future. Most changes are UTs.

This work builds on previous work:
- #20174
- #20297
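
A small sketch of the exact-match idea, assuming the document carries a list of runtime versions that must not open it. `isRuntimeDisallowed` and the version strings are hypothetical and only illustrate why an explicit list is preferred over ordering comparisons.

```typescript
// Hypothetical helper; the real mechanism and property names may differ.
function isRuntimeDisallowed(
	runtimeVersion: string,
	disallowedVersions: readonly string[],
): boolean {
	// Exact string match only: no parsing, no ordering, no semver dependency.
	return disallowedVersions.some((version) => version === runtimeVersion);
}

// Illustrative version strings (assumed format).
const disallowedVersions = ["2.0.0-rc.1.0.0", "2.0.0-rc.2.0.0"];

const oldRuntimeRejected = isRuntimeDisallowed("2.0.0-rc.1.0.0", disallowedVersions);
const newRuntimeRejected = isRuntimeDisallowed("2.0.0-rc.3.0.0", disallowedVersions);

// Why "math" on versions is risky: naive string ordering puts rc.10 before rc.9.
const naiveOrderingMisleads = "2.0.0-rc.10.0.0" < "2.0.0-rc.9.0.0";
```

Listing versions explicitly costs a few extra strings but stays correct even if the version encoding changes again.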

This work builds on previous work:
- #19859
- #20174

While prior work gives us the opportunity to generate short IDs (this is used, for example, to generate short IDs for data stores and DDSs), the current state of ID compressor settings does not allow us to change them through the lifetime of the container. I.e. if a container is created with the ID compressor off, then it stays off for the duration of the container's lifetime. This is not great, and ideally, we want (eventually) all files to use the ID compressor, at least for generation of short IDs.

This work makes that possible, allowing the off -> delayed mode transition. It stops short of enabling the off | delayed -> on transition, as currently this is not possible with the async nature of ID compressor loading and the synchronous op processing pipeline. It's possible we can make such a transition work with some delay, but more work is required to make it happen.
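
The allowed transitions can be summarized with a small sketch. `IdCompressorSetting` and `canChangeIdCompressorMode` are hypothetical names, and "off" is written as an explicit string here for readability. Transitions the text does not mention (for example, any downgrade) are conservatively rejected, which is an assumption of this sketch.

```typescript
// Hypothetical type for illustration; not the actual runtime option type.
type IdCompressorSetting = "off" | "delayed" | "on";

function canChangeIdCompressorMode(
	from: IdCompressorSetting,
	to: IdCompressorSetting,
): boolean {
	if (from === to) {
		return true; // nothing to change
	}
	// The only transition described as supported is off -> delayed.
	// off | delayed -> on is explicitly not supported yet.
	return from === "off" && to === "delayed";
}
```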

vladsud added 12 commits March 14, 2024 21:48

Initial implementation

dc1e3ef

Move stuff around

29e8495

Refactoring, skeleton for sending ops

4cdb5fc

Implementation of schema ops & summaries

3df6d1d

Documentation, proper error

4453b4a

Properly merge schemas

e1cca0c

Reduce number of changes (relative to main)

7dd41d8

UTs

fb48421

AzureClient 1.3 -> 2.0 proper support

d9056fb

Fix UTs

670d98d

Merge remote-tracking branch 'origin' into RuntimeCompatibility

2bc072d

Merge conflicts

3016bef

vladsud requested review from anthony-murphy, Abe27342, markfields, agarwal-navin, taylorsw04 and andre4i March 18, 2024 15:40

vladsud added 2 commits March 18, 2024 13:25

Documentation changes

4853c4f

Snapshot tests

ec2d835

github-actions bot added the area: dds Issues related to distributed data structures label Mar 18, 2024

Abe27342 reviewed Mar 18, 2024

PR feedback

e9e86e4

More UTs, prettier

08186ed

regenerate api.md

50ce162

vladsud mentioned this pull request Mar 19, 2024

Enabling testing of 2.0 file format features #20109

Merged

Merge branch 'main' into RuntimeCompatibility

2d3f4a2

andre4i reviewed Mar 19, 2024

vladsud added 2 commits March 20, 2024 00:28

PR feedback

ec89995

prettier

1be1f42

vladsud marked this pull request as ready for review March 20, 2024 04:30

andre4i approved these changes Mar 20, 2024

vladsud added 9 commits March 21, 2024 08:22

Merge branch 'main' of https://github.com/Microsoft/FluidFramework in…

bee6949

…to RuntimeCompatibility

Fix defaults

2e5335c

Undo AzureClient changes and leave it up to further PRs.

3398a9e

Merge branch 'main' of https://github.com/Microsoft/FluidFramework in…

c94133f

…to RuntimeCompatibility

PR feedback

dd9b212

Add end-to-end tests and fix bugs

1ab510e

biome

477377b

fix UT for t9s and add more coverage

1de728a

Merge branch 'main' of https://github.com/Microsoft/FluidFramework in…

a209315

…to RuntimeCompatibility

msfluid-bot added the size regression Significant bundle size regression (>5 KB) label Mar 23, 2024

vladsud changed the title ~~Make 2.0 compatible with 1.3 and create framework for safe rolllout of back-compatible runtime document schema changes~~ Create framework for safe rollout of back-compatible runtime document schema changes Mar 24, 2024

vladsud merged commit 929609b into microsoft:main Mar 24, 2024
44 checks passed

vladsud deleted the RuntimeCompatibility branch March 24, 2024 02:53

vladsud mentioned this pull request Mar 28, 2024

Implement explicit runtime version rejection mechanism. #20375

Merged

vladsud mentioned this pull request Apr 8, 2024

Allow ID compressor to be enabled in existing files. #20531

Merged