-
Notifications
You must be signed in to change notification settings - Fork 535
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create framework for safe rollout of back-compatible runtime document schema changes #20174
Conversation
packages/runtime/container-runtime/src/summary/documentSchema.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/documentSchema.ts
Outdated
Show resolved
Hide resolved
} | ||
|
||
const msg = "Document can't be opened with current version of the code"; | ||
if (documentSchema.version !== currentDocumentVersionSchema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are you imagining this bit evolves over time? The handling here is a little surprising to me: I would have thought this would be an area where we add back-compat code for older versions and only throw on 'new' versions we don't understand, e.g. something like if (!supportedDocSchemaVersions.has(documentSchema.version)) { throw }
(or even a <=
if we went with a number over a string). Of course that approach is equivalent for now since we only support one version.
Based on interface contracts, I would have expected us to generally support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make it complex when we need it. For now, code understands only 1.0, and has no clue what 2.0 means, so it validates it directly.
Or are you asking - should we support some future where 2.0 is back-compatible with 1.0?
It's hard for me to imagine what 2.0 would mean / shape it takes, so I'm not sure I want to build something more complicated today to support it (like a property that tells that even though it's 2.0, 1.0 should proceed with it as if it was 1.0)
packages/runtime/container-runtime/src/summary/documentSchema.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/documentSchema.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/documentSchema.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/documentSchema.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/documentSchema.ts
Outdated
Show resolved
Hide resolved
⯅ @fluid-example/bundle-size-tests: +21.36 KB
Baseline commit: fb8fb7d |
packages/runtime/container-runtime/src/summary/documentSchema.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/documentSchema.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/documentSchema.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/documentSchema.ts
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍 from my side
…to RuntimeCompatibility
…to RuntimeCompatibility
…to RuntimeCompatibility
* main: (36 commits) feat(tree): create refreshers during delta visit (#20303) Lint against import of @fluidframework/datastore in e2e tests (#20307) server: cover edge cases for scrubbed checkpoint users (#20259) refactor: Update dev dep on package 'start-server-and-test' (#20298) ci: Move templates out of the 1ES folder (#20056) Added unit tests to check usage of IRedisClientConnectionManager for Historian and Gitrest (#20306) build(test-snapshots): use node16 module resolution (#20233) Forbid import of @fluidframework/aqueduct in e2e tests (#20261) fix(tree): Make failure to provide id-compressor a usage error (#20282) fix(api-markdown-documenter): Reduce package version to correct next version (#20302) Added customization for gitrest and historian (#20243) fix(build-tools): mixed internal range detection (#18828) Removing 'paused session' path from SessionResult Metric (#20294) fix(fluid-build): limit Biome config tracking to repo (#20296) refactor: Update webpack-dev-server dependency (#20278) Create framework for safe rollout of back-compatible runtime document schema changes (#20174) Test enabling IdCompressor in RC2 (#20256) refactor(tree): Extract leaf schemas into their own module (#20289) build(client,build-tools): Upgrade biome to 1.6.2 (#20285) feat(build-cli): Add `modify fluid-imports` command (#20006) ...
This is mostly a back-up mechanism to kick out old runtimes that are not compatible with the latest runtime. It's backup in a sense that we should prefer an explicit mechanism by adding new property to a list of features supported / required by runtime. However, we could find bugs much later in time, where multiple runtime versions shipped claiming they understand new capabilities. In simple cases we can simply rename / add new property to signify that only latest runtime version actually understands it. But in more nuanced cases (where some old runtime versions are broken, and some old runtime versions are not), this mechanism will be insufficient. I prefer here to check exact version, and if needed, list 10 versions, as opposed to do math on them (i.e. do something like >=). The latter is a bit risky as we keep changing our version schema encoding, and I'm not confident we can predict how these kinds of math comparisons will work out. Plush do not want to take a dependency on semver package. The amount of new code here is relatively small, and it's dormant until we start using it in some future. Most changes are UTs. This work builds in previous work: - #20174 - #20297
This work builds on previous work: - #19859 - #20174 While prior work gives us opportunity to generate short IDs (this is used, for example, to generate short IDs for data stores and DDSs), the current state of ID compressor settings does not allow us to change them through lifetime of the container. I.e. if container is created with ID compressor off, then it stays off for duration of container lifetime. This is not great, and ideally, we want (eventually) all files to use ID compressor, at least for generation of short IDs. This work makes it possible, allowing off -> delayed mode transition. It stops short from enabling off | delayed -> on transition, as currently this is not possible with async nature of ID compressor loading and synchronous op processing pipeline. It's possible we can make such transition work with some delay, but more work is required to make it happen.
Required pre-reading
https://github.com/microsoft/FluidFramework/blob/main/packages/dds/SchemaVersioning.md
Discussion in https://dev.azure.com/fluidframework/internal/_workitems/edit/5699
Problem statement
We are lacking safe way of deploying new runtime capabilities. Our clients will make mistakes with making decisions on when it’s safe to enable new capabilities like op compression / op grouping. When mistakes happen, we want that to be very easy to identify.
- Based on our own experience with Microsoft applications, we know it’s not always the case. For example, prior assert 0x162 was hit in production and a lot of smart engineers were not able to track it down. But as part of testing #20109, I hit it right away when I made a mistake in config and let old runtime version collaborate with new runtime version that had the new features enabled.
Proposed solution (implementation)
Proposed solution adds document schema controller that manages part of the document schema - the part that FluidFramework runtime controls (summaries & op format).
All new capabilities (like op compression) will follow the following lifetime cycle:
Also see "Implementation details" section at the bottom for deeper details.
Important outcome:
When / If applications enable new capabilities and it happens too early (i.e. before reasonable saturation occurs), old clients will fail (when they see document schema change op) in predictable way, allowing application developers (and FF developers) much easier to diagnose such issues and react to them faster.
This mostly helps 2.0 -> 3.0 future changes in schema though. 1.3-> 2.0 has only partial solution (1.3 will fail to process document schema change ops, but if such ops are summarized, FF 1.3 could continue to limp along).
Implementation details / discussion
As of now, I use “regular” runtime ops to communicate document schema changes. This could be changed in the future to some other mechanism (like quorum). Document schema has a version (version defines expectations around structure that represents document schema, not capabilities of document schema). Version can be bumped in the future to define different operation mode. As any other change, it would need to ship dark and saturate before it can engage, but old clients that do not understand new schema will fail and will not cause eventual consistency issues.
Schemas are changed in Compare-And-Swap manner. Client who wants to change scheme sends a message that essentially says “chance a schema if document schema is A (before change)”. If some other client changed schema right before it, then such operation will fail (be treated as noop).
At the moment, clients do not retry. Expectation is that further sessions will do full recall, and if needed, will propose a new schema.
Schema is serialized in a summary and changes to schema follow eventual consistency rules (i.e. schema can be reliably calculated at any sequence number in lifetime of the document based on ops, or summaries + ops).