Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: provided user was not an azure user fixes #20150

Merged
merged 11 commits into from
Mar 19, 2024

Conversation

znewton
Copy link
Contributor

@znewton znewton commented Mar 15, 2024

Description

This PR begins the process for fixing the Provided user was not an "AzureUser" bug, which was caused by #19000. However, we cannot just revert #19000 because it was implemented to resolve privacy and data residency concerns.

image

ADO 7422

Included here is part 1 & 2 of the code changes needed to resolve the incident. Configs added are pretty self-explanatory:

  • scrubUserDataInGlobalCheckpoints: this is going to stay enabled
  • scrubUserDataInLocalCheckpoints: this is going to eventually be disabled once work lands to enable a 24hr TTL in local db
  • scrubUserDataInSummaries: this is going to stay disabled, but including for completeness

Otherwise, the main change is to validate the quorum member information on Scribe boot in an effort to recover from a global DB quorum read by loading checkpoint from the summary instead.

Follow-up Work

Follow-up work will include making some changes to how we create/update the Checkpoint DB TTL index. Once that is complete we can stop scrubbing local checkpoint user data.

@github-actions github-actions bot added area: server Server related issues (routerlicious) base: main PRs targeted against main branch labels Mar 15, 2024
@znewton znewton marked this pull request as ready for review March 15, 2024 17:34
@github-actions github-actions bot added the public api change Changes to a public API label Mar 15, 2024
Copy link
Contributor

@kekachmar kekachmar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating more utils functions and tests.

Copy link
Contributor

@yangg-msft yangg-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@znewton znewton enabled auto-merge (squash) March 19, 2024 22:54
@znewton znewton merged commit 04a2cc9 into microsoft:main Mar 19, 2024
30 checks passed
znewton added a commit that referenced this pull request Mar 25, 2024
## Description

Continuing the work from #20150 to make sure we cover all the possible
edge cases.

## Reviewer Guidance

The control flow in Scribe LambdaFactory was a bit confusing, so
hopefully I didn't mess it up.
I understand the current control flow to be:

```
if (no global checkpoint)
  use Default checkpoint
elsif (global checkpoint was cleared or global checkpoint quorum was scrubbed)
  use Summary checkpoint
else
  use latest DB checkpoint (local or global)
```

My goal was to make the control flow behave as follows:

```
if (no global and no local checkpoint and no summary checkpoint)
  use Default checkpoint
elsif (
    global checkpoint was cleared and summary checkpoint ahead of local db checkpoint
    or latest DB checkpoint quorum was scrubbed
    or summary checkpoint ahead of latest DB checkpoint
  )
  use Summary checkpoint
else
  use latest DB checkpoint (local or global)
```
@znewton znewton deleted the server/not-an-azureuser-1 branch March 26, 2024 17:21
znewton added a commit to znewton/FluidFramework that referenced this pull request May 16, 2024
## Description

This PR _begins_ the process for fixing the `Provided user was not an
"AzureUser"` bug, which was caused by microsoft#19000. However, we cannot just
revert microsoft#19000 because it was implemented to resolve privacy and data
residency concerns.


![image](https://github.com/microsoft/FluidFramework/assets/14362200/948f8823-895e-4945-8e4a-b84dd6cdbfc7)

[ADO
7422](https://dev.azure.com/fluidframework/internal/_workitems/edit/7422)

Included here is part 1 & 2 of the code changes needed to resolve the
incident. Configs added are pretty self-explanatory:

- `scrubUserDataInGlobalCheckpoints`: this is going to stay enabled
- `scrubUserDataInLocalCheckpoints`: this is going to eventually be
disabled once work lands to enable a 24hr TTL in local db
- `scrubUserDataInSummaries`: this is going to stay disabled, but
including for completeness

Otherwise, the main change is to validate the quorum member information
on Scribe boot in an effort to recover from a global DB quorum read by
loading checkpoint from the summary instead.

### Follow-up Work

Follow-up work will include making some changes to how we create/update
the Checkpoint DB TTL index. Once that is complete we can stop scrubbing
local checkpoint user data.
znewton added a commit to znewton/FluidFramework that referenced this pull request May 17, 2024
## Description

Continuing the work from microsoft#20150 to make sure we cover all the possible
edge cases.

## Reviewer Guidance

The control flow in Scribe LambdaFactory was a bit confusing, so
hopefully I didn't mess it up.
I understand the current control flow to be:

```
if (no global checkpoint)
  use Default checkpoint
elsif (global checkpoint was cleared or global checkpoint quorum was scrubbed)
  use Summary checkpoint
else
  use latest DB checkpoint (local or global)
```

My goal was to make the control flow behave as follows:

```
if (no global and no local checkpoint and no summary checkpoint)
  use Default checkpoint
elsif (
    global checkpoint was cleared and summary checkpoint ahead of local db checkpoint
    or latest DB checkpoint quorum was scrubbed
    or summary checkpoint ahead of latest DB checkpoint
  )
  use Summary checkpoint
else
  use latest DB checkpoint (local or global)
```
znewton added a commit that referenced this pull request May 17, 2024
## Description

T9s v4 doesn't have the Quorum scrubbed users fix, and is also missing
the fix for correct Read/Write scopes on socket connection.
Cherry-picking some commits:

---

### server: cover edge cases for scrubbed checkpoint users
(#20259)

6718a9a

---

### server: provided user was not an azure user fixes (#20150)

786abfa

---

### server: fix mutable quorum snapshot (#20329)

5de9472

---

### fix: send correct connection scopes for client (#20312)

7d07fc5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: server Server related issues (routerlicious) base: main PRs targeted against main branch public api change Changes to a public API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants