Create framework for safe rollout of back-compatible runtime document schema changes #20174

Merged
merged 29 commits into from
Mar 24, 2024

Conversation

vladsud

@vladsud vladsud commented Mar 18, 2024

Required pre-reading

https://github.com/microsoft/FluidFramework/blob/main/packages/dds/SchemaVersioning.md
Discussion in https://dev.azure.com/fluidframework/internal/_workitems/edit/5699

Problem statement

We lack a safe way to deploy new runtime capabilities. Our clients will make mistakes when deciding when it is safe to enable new capabilities like op compression / op grouping. When mistakes happen, we want them to be very easy to identify.
- Based on our own experience with Microsoft applications, we know this is not always the case. For example, assert 0x162 was previously hit in production, and a lot of smart engineers were not able to track it down. But while testing #20109, I hit it right away when I made a mistake in the config and let an old runtime version collaborate with a new runtime version that had the new features enabled.

Proposed solution (implementation)

The proposed solution adds a document schema controller that manages part of the document schema: the part that the FluidFramework runtime controls (summaries & op format).

All new capabilities (like op compression) will follow this lifecycle:

  1. An ability to write in the new schema is added and shipped dark.
  2. The new capability will not be used by the runtime (even in the presence of step 3 below) until the document schema reflects that the capability is required to interpret the document.
  3. An application, when it is safe for the application, will instruct the runtime (through feature gates or a code change) to start leveraging the new capability.
  4. When the runtime detects a mismatch between the capabilities required by the document schema and the feature set requested from the runtime, it will do the following:
    1. If it’s a subtraction (i.e. the document schema says compression is required, but options/configuration instruct the runtime not to use compression), the feature will be disabled in the current session. This allows the application to quickly pull back capabilities that are causing trouble in production. At the same time, the document schema will still require the capability, as some parts of the document might depend on it (i.e. old application versions will continue to be unable to open such documents).
    2. If it’s an addition (i.e. the application wants to use compression, but it’s not referenced in the document schema), the document schema will be changed to list the new capability (through ops, communicating to all current and future clients). Only after that has happened can the new capability be used.
      1. Depending on the capability, it might take effect only in future sessions.
  5. All existing document format features should consult the controller's current session settings to determine whether a feature is enabled. They should not consult anything else, including options, feature gates, etc. Those are inputs into the controller's calculus only.
  6. A “legacy” mode is added for all the applications that were shipping preliminary / internal 2.0 bits to production, and thus were defining the document schema implicitly (by using new capabilities). This mode will not generate ops (as part of the step 4.2 flow), as such ops would break unsuspecting older clients.
    1. Once these changes ship and saturate to a reasonable level, such applications are advised to graduate from legacy mode.
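
The subtraction / addition handling in step 4 can be sketched roughly as follows. This is only an illustration under assumed names: `Capability`, `Reconciliation` and `reconcile` are hypothetical and are not the types or functions added by this PR, and legacy mode is left out.

```typescript
// Hypothetical types for illustration; not the FluidFramework API.
type Capability = "opCompression" | "opGrouping";

interface Reconciliation {
	/** Capabilities this session is allowed to use right now. */
	sessionCapabilities: Capability[];
	/** Schema to propose through an op, or undefined if nothing has to be added. */
	proposedSchema: Capability[] | undefined;
}

function reconcile(
	documentSchema: readonly Capability[],
	desired: readonly Capability[],
): Reconciliation {
	// Subtraction: a capability listed in the document but not desired is
	// switched off for this session only; the document schema keeps listing it.
	const sessionCapabilities = documentSchema.filter((c) => desired.includes(c));

	// Addition: a desired capability that the document does not list cannot be
	// used yet; it first has to be proposed as a document schema change.
	const missing = desired.filter((c) => !documentSchema.includes(c));
	const proposedSchema =
		missing.length > 0 ? [...documentSchema, ...missing] : undefined;

	return { sessionCapabilities, proposedSchema };
}
```

In this sketch a feature would read only `sessionCapabilities` (step 5), while `proposedSchema` is what would be sent as an op before any added capability is used.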

Also see "Implementation details" section at the bottom for deeper details.

Important outcome:
When / if applications enable new capabilities too early (i.e. before reasonable saturation occurs), old clients will fail (when they see the document schema change op) in a predictable way, making it much easier for application developers (and FF developers) to diagnose such issues and react to them faster.
This mostly helps future 2.0 -> 3.0 schema changes, though. 1.3 -> 2.0 has only a partial solution (1.3 will fail to process document schema change ops, but if such ops are summarized, FF 1.3 could continue to limp along).

Implementation details / discussion

As of now, I use “regular” runtime ops to communicate document schema changes. This could be changed in the future to some other mechanism (like quorum). The document schema has a version (the version defines expectations around the structure that represents the document schema, not the capabilities of the document schema). The version can be bumped in the future to define a different operation mode. Like any other change, it would need to ship dark and saturate before it can engage, but old clients that do not understand the new schema will fail and will not cause eventual consistency issues.

Schemas are changed in a Compare-And-Swap manner. A client that wants to change the schema sends a message that essentially says “change the schema if the document schema is A (before change)”. If some other client changed the schema right before it, then the operation will fail (be treated as a noop).
At the moment, clients do not retry. The expectation is that future sessions will do full recall and, if needed, will propose a new schema.

Schema is serialized in a summary and changes to schema follow eventual consistency rules (i.e. schema can be reliably calculated at any sequence number in lifetime of the document based on ops, or summaries + ops).
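
A minimal sketch of the Compare-And-Swap rule and of recomputing the schema from a summary plus ops, under assumed shapes. `SchemaState`, `SchemaChangeOp`, `applySchemaChange` and `replaySchema` are hypothetical names, and using a revision number to stand in for "schema A" is an assumption made for the illustration, not a description of the actual op format.

```typescript
// Hypothetical shapes for illustration; the real op format is not shown here.
interface SchemaState {
	/** Sequence number of the op that produced this schema (0 for a new document). */
	revision: number;
	capabilities: string[];
}

interface SchemaChangeOp {
	/** Sequence number assigned to this op by the ordering service. */
	sequenceNumber: number;
	/** Revision of the schema the sender saw when it made the proposal ("schema A"). */
	expectedRevision: number;
	capabilities: string[];
}

function applySchemaChange(current: SchemaState, op: SchemaChangeOp): SchemaState {
	// Compare: if another client changed the schema first, this op is a no-op.
	if (op.expectedRevision !== current.revision) {
		return current;
	}
	// Swap: the proposal was made against the current schema, so it wins.
	return { revision: op.sequenceNumber, capabilities: [...op.capabilities] };
}

// Every client folds the same ordered ops over the same starting state, so the
// result is identical whether it starts from an empty schema or from a summary.
function replaySchema(start: SchemaState, ops: readonly SchemaChangeOp[]): SchemaState {
	return ops.reduce(applySchemaChange, start);
}
```

Because a losing proposal is simply ignored and never retried in this sketch, the outcome depends only on op order, which is what lets a summary taken at any sequence number be combined with later ops.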

@github-actions github-actions bot added area: framework Framework is a tag for issues involving the developer framework. Eg Aqueduct area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc public api change Changes to a public API base: main PRs targeted against main branch labels Mar 18, 2024
@github-actions github-actions bot added the area: dds Issues related to distributed data structures label Mar 18, 2024
}

const msg = "Document can't be opened with current version of the code";
if (documentSchema.version !== currentDocumentVersionSchema) {
Contributor


How are you imagining this bit evolves over time? The handling here is a little surprising to me: I would have thought this would be an area where we add back-compat code for older versions and only throw on 'new' versions we don't understand, e.g. something like `if (!supportedDocSchemaVersions.has(documentSchema.version)) { throw }` (or even a `<=` if we went with a number over a string). Of course that approach is equivalent for now since we only support one version.

Based on interface contracts, I would have expected us to generally support

Contributor Author


We can make it complex when we need it. For now, code understands only 1.0, and has no clue what 2.0 means, so it validates it directly.
Or are you asking - should we support some future where 2.0 is back-compatible with 1.0?
It's hard for me to imagine what 2.0 would mean / shape it takes, so I'm not sure I want to build something more complicated today to support it (like a property that tells that even though it's 2.0, 1.0 should proceed with it as if it was 1.0)
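
For readers following the thread, the two options under discussion can be sketched side by side. The constant values and function names below are assumptions for illustration only; the error message is the one from the diff context above.

```typescript
// Hypothetical values; the description only says the code understands 1.0.
const currentDocumentSchemaVersion = "1.0";
const supportedDocSchemaVersions: readonly string[] = ["1.0"];

const msg = "Document can't be opened with current version of the code";

// Option taken in the PR: accept exactly the one version the code understands.
function assertExactVersion(version: string): void {
	if (version !== currentDocumentSchemaVersion) {
		throw new Error(msg);
	}
}

// Option suggested in review: accept any version on an allow-list, so that
// back-compat handling for older versions can be added later.
function assertSupportedVersion(version: string): void {
	if (!supportedDocSchemaVersions.includes(version)) {
		throw new Error(msg);
	}
}
```

With a single supported version the two checks accept and reject the same inputs, which matches the reviewer's remark that the approaches are equivalent for now.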

@msfluid-bot
Collaborator

msfluid-bot commented Mar 18, 2024

Warnings
⚠️ Bundle size regression detected -- please investigate before merging!
@fluid-example/bundle-size-tests: +21.36 KB
| Metric Name | Baseline Size | Compare Size | Size Diff |
| --- | --- | --- | --- |
| aqueduct.js | 514.76 KB | 520.09 KB | +5.33 KB |
| azureClient.js | 605.87 KB | 611.22 KB | +5.35 KB |
| connectionState.js | 680 Bytes | 680 Bytes | No change |
| containerRuntime.js | 249.19 KB | 254.53 KB | +5.33 KB |
| fluidFramework.js | 340.94 KB | 340.94 KB | No change |
| loader.js | 127.97 KB | 127.97 KB | No change |
| map.js | 41.35 KB | 41.35 KB | No change |
| matrix.js | 143.61 KB | 143.61 KB | No change |
| odspClient.js | 574.33 KB | 579.68 KB | +5.35 KB |
| odspDriver.js | 97.49 KB | 97.49 KB | No change |
| odspPrefetchSnapshot.js | 41.91 KB | 41.91 KB | No change |
| sharedString.js | 161.38 KB | 161.38 KB | No change |
| sharedTree.js | 331.08 KB | 331.08 KB | No change |
| Total Size | 3.3 MB | 3.32 MB | +21.36 KB |

Baseline commit: fb8fb7d

Generated by 🚫 dangerJS against a209315

@vladsud vladsud marked this pull request as ready for review March 20, 2024 04:30
Contributor

@andre4i andre4i left a comment


LGTM 👍 from my side

@msfluid-bot msfluid-bot added the size regression Significant bundle size regression (>5 KB) label Mar 23, 2024
@vladsud vladsud changed the title Make 2.0 compatible with 1.3 and create framework for safe rolllout of back-compatible runtime document schema changes Create framework for safe rollout of back-compatible runtime document schema changes Mar 24, 2024
@vladsud vladsud merged commit 929609b into microsoft:main Mar 24, 2024
44 checks passed
@vladsud vladsud deleted the RuntimeCompatibility branch March 24, 2024 02:53
tianzhu007 added a commit that referenced this pull request Mar 25, 2024
* main: (36 commits)
  feat(tree): create refreshers during delta visit (#20303)
  Lint against import of @fluidframework/datastore in e2e tests (#20307)
  server: cover edge cases for scrubbed checkpoint users (#20259)
  refactor: Update dev dep on package 'start-server-and-test' (#20298)
  ci: Move templates out of the 1ES folder (#20056)
  Added unit tests to check usage of IRedisClientConnectionManager for Historian and Gitrest (#20306)
  build(test-snapshots): use node16 module resolution (#20233)
  Forbid import of @fluidframework/aqueduct in e2e tests (#20261)
  fix(tree): Make failure to provide id-compressor a usage error (#20282)
  fix(api-markdown-documenter): Reduce package version to correct next version (#20302)
  Added customization for gitrest and historian (#20243)
  fix(build-tools): mixed internal range detection (#18828)
  Removing 'paused session' path from SessionResult Metric (#20294)
  fix(fluid-build): limit Biome config tracking to repo (#20296)
  refactor: Update webpack-dev-server dependency (#20278)
  Create framework for safe rollout of back-compatible runtime document schema changes (#20174)
  Test enabling IdCompressor in RC2 (#20256)
  refactor(tree): Extract leaf schemas into their own module (#20289)
  build(client,build-tools): Upgrade biome to 1.6.2 (#20285)
  feat(build-cli): Add `modify fluid-imports` command (#20006)
  ...
vladsud added a commit that referenced this pull request Mar 30, 2024
This is mostly a backup mechanism to kick out old runtimes that are not compatible with the latest runtime.
It is a backup in the sense that we should prefer an explicit mechanism: adding a new property to the list of features supported / required by the runtime.

However, we could find bugs much later in time, where multiple runtime versions shipped claiming they understand new capabilities. In simple cases we can simply rename / add new property to signify that only latest runtime version actually understands it. But in more nuanced cases (where some old runtime versions are broken, and some old runtime versions are not), this mechanism will be insufficient.

Here I prefer to check the exact version and, if needed, list 10 versions, as opposed to doing math on them (i.e. something like >=). The latter is a bit risky, as we keep changing our version schema encoding, and I'm not confident we can predict how these kinds of math comparisons will work out. Plus, I do not want to take a dependency on the semver package.
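
The exact-match approach described above could look roughly like the sketch below. The function name, the shape of the list, and the sample version strings are all assumptions for illustration; the commit message does not show the actual property or encoding.

```typescript
// Hypothetical helper: the document carries a list of runtime versions that
// must not open it, and a runtime checks its own version against that list.
function canRuntimeOpenDocument(
	runtimeVersion: string,
	disallowedVersions: readonly string[],
): boolean {
	// Exact string comparison only: no ">=" math and no semver parsing, so the
	// check keeps working even if the version encoding changes over time.
	return !disallowedVersions.includes(runtimeVersion);
}

// Usage with made-up version strings:
const disallowed = ["2.0.0-example.1", "2.0.0-example.2"];
const blocked = !canRuntimeOpenDocument("2.0.0-example.1", disallowed);
const allowed = canRuntimeOpenDocument("2.0.0-example.3", disallowed);
```

The cost of this design is that every broken version has to be listed explicitly, which is the trade-off accepted above ("if needed, list 10 versions").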

The amount of new code here is relatively small, and it's dormant until we start using it in some future. Most changes are UTs.

This work builds on previous work:
- #20174
- #20297
vladsud added a commit that referenced this pull request Apr 9, 2024
This work builds on previous work:
- #19859
- #20174

While prior work gives us the opportunity to generate short IDs (this is used, for example, to generate short IDs for data stores and DDSs), the current state of ID compressor settings does not allow us to change them through the lifetime of the container.
I.e. if a container is created with the ID compressor off, then it stays off for the duration of the container's lifetime.

This is not great, and ideally, we want (eventually) all files to use ID compressor, at least for generation of short IDs.
This work makes it possible, allowing off -> delayed mode transition.

It stops short of enabling the off | delayed -> on transition, as this is currently not possible given the async nature of ID compressor loading and the synchronous op processing pipeline. It's possible we can make such a transition work with some delay, but more work is required to make it happen.
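
The transition rules described in this commit message can be summarized in a small sketch. The `"off" | "delayed" | "on"` type and the function name are hypothetical stand-ins, and treating every transition not mentioned above (for example on -> off) as unsupported is an assumption of the sketch, not something the message states.

```typescript
// Hypothetical mode type for illustration; not the runtime's actual option type.
type IdCompressorMode = "off" | "delayed" | "on";

function isSupportedTransition(from: IdCompressorMode, to: IdCompressorMode): boolean {
	// Staying in the same mode is always fine.
	if (from === to) {
		return true;
	}
	// The one transition this work enables: off -> delayed.
	// off | delayed -> on is explicitly not supported yet; all other changes
	// are treated as unsupported here by assumption.
	return from === "off" && to === "delayed";
}
```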
Labels
area: dds Issues related to distributed data structures area: framework Framework is a tag for issues involving the developer framework. Eg Aqueduct area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch public api change Changes to a public API size regression Significant bundle size regression (>5 KB)
4 participants