Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kkachmar/remove session start metric #21125

Merged
merged 72 commits into from
May 21, 2024

Conversation

kekachmar
Copy link
Contributor

This change removes the SessionStartMetric from Scribe and Deli. The reasons behind this change are:

  1. There has been a bug where the size of the checkpoints is too large, causing checkpoints to fail.
  2. A likely cause of this is the successfullyStartedLambdas data object, which can be very large, and is used in the SessionStartMetric.
  3. When checkpoints fail, the successfullyStartedLambdas value in the checkpoint could be incorrect, which may lead to invalid metrics.
  4. With updated logs and metrics, we don't need this metric or the validation of successfullyStartedLambdas, as we have logging around session start behaviors such as the restoreFromCheckpoint metric and RunService metric, as well as logging for creating document partitions for document lambdas like Scribe and Deli.

@github-actions github-actions bot added area: server Server related issues (routerlicious) base: main PRs targeted against main branch labels May 16, 2024
@kekachmar kekachmar requested a review from zhangxin511 May 16, 2024 20:45
this.successfullyStartedLambdas.includes(val),
);
}

private logSessionEndMetrics(closeType: LambdaCloseType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do we don't log start but still log end?

Josmithr and others added 18 commits May 21, 2024 15:58
… and fix violations in code (microsoft#21076)

This package was using the `minimal-deprecated` configuration, which is
scheduled for deletion. Upgrades the package's config to the
`recommended` base and fixes violations in the code.

This PR disables some of the `recommended` rules due to the sheer number
of violations. These can and should be removed and fixed in the future.
For now, the primary goal here is to remove usages of the deprecated
config.


[AB#2900](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/2900)

---------

Co-authored-by: Alex Villarreal <[email protected]>
…lations in code (microsoft#21078)

This package was using the `minimal-deprecated` configuration, which is
scheduled for deletion. Upgrades the package's config to the
`recommended` base and fixes violations in the code.

This PR disables some of the `recommended` rules that are
TypeScript-specific. Longer term, we should create a better pattern for
JavaScript packages. For now, the primary goal here is to remove usages
of the deprecated config.

AB#8051
## Description

Adding a check in doc corruption logic to skip marking the doc as
corrupted if the doc partition is already closed (eg: if the session is
paused)

This will help reduce unnecessary doc corruptions in case of paused
sessions as well as reduce uncaught exception restarts due to emitting
error for a paused session.

---------

Co-authored-by: Shubhangi Agarwal <[email protected]>
…s logger (microsoft#20839)

## Description

ContainerRuntime.logger is currently public, though ideally, it should
not be, and it uses an internal type instead of the preferable base
type. When sharing the logger, the standard practice should be to pass
around the base logger, allowing each consumer to create a ChildLogger
from it.

## Changes

* Rename ContainerRuntime.logger to ContainerRuntime.baseLogger, change
its type to ITelemetryBaseLogger, and set its access level to private.

* Adjust the dependencies that receive the logger to accept the
ITelemetryBaseLogger type and utilize createChildLogger to create a
child logger as needed (this operation will be a no-op if the logger is
already a child logger).

* Modify the following classes in Fluid Framework (FF) so they no longer
directly access the logger from the runtime but instead receive their
own base logger via constructor parameters:

BlobManager
Summarizer
DataStoreContext

* DataStoreContext currently makes the same logger available through its
public API. This may be unnecessary—as indicated by its lack of use in
the FF codebase—especially since IFluidDataStoreRuntime already exposes
a logger. This exposure could potentially be removed to streamline the
API and enhance encapsulation.
## Description

Improve type test symbols handling. Well known symbols are preserved, as
is `symbol`, but unique symbols are replaced (with never), both when
used as keys and as value types preventing valse positives (detected API
breaks that are op-ops).

This was manually tested with the RC4 client packages, a determined to
fix the issue with the new fluid handle symbol.
…crosoft#20851)

microsoft#16885 added a new retry
mechanism in case of summarization failures. This was rolled out with
A/B testing and the experiment was successful. This PR makes the new
retry mechanism default and removes the old one.

**Notes**
- Added retry for these two additional scenarios - When a summary op or
summary ack is not received within the timeout, retry. This is because
these failures are mostly transient and a retry may fix them.
- Removed `refreshLatestAck` summary option because its not used
anymore. A lot of the code changes are because of this.


[AB#7605](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/7605)
## Description

microsoft#21070 made some
improvements to type tests which were needed to regenerate the type
tests for client. This change integrates that, as well as regenerates
the type tests with the new generator.
## Description

- This PR adds the capability for the azure client tests to run against
ephemeral containers when the azure__fluid__relay__service__ephemeral
environment variable is set to "true"
- This is done without any changes to the AzureClient by running the
ephemeral tests using manually created Axios requests with pre-generated
summary trees (stored in [ephemeralSummaryTrees.ts)
- The pre-generated summary trees were obtained by running a debugger
against the existing E2E tests, extracting the summaries created by the
AzureClient, and then modifying the IsEphemeralContainer property of
these objects
- Certain tests are skipped in ephemeral mode, because this method of
creating containers does not involve attachment/detachment logic
- Non-ephemeral E2E test behavior should remain unchanged


- Most of the file changes not in the packages/service-clients folder
are just some autogenerated file changes

---------

Co-authored-by: Brandon Diaz <“[email protected]”>
Co-authored-by: Matt Rakow <[email protected]>
…icrosoft#21066)

In preparation for updating the package to use our recommended base
eslint config, this PR enabled the `explicit-return-types` rule in the
package (excluding test code) and fixes violations (while disabling the
rule in select cases for functions that intentionally derive their
return type from other functions/libraries).

AB#6983
## Description

Adds support for OAuthBearer authentication flow for rdkafka client when
Kafka server expects/supports it. The 'tokenProvider' is supposed to be
set via nconf programatically.

## Breaking Changes


server/routerlicious/packages/services-ordering-rdkafka/src/rdkafkaBase.ts
introduces connect() method breaking change by changing its return value
from 'void' to 'Promise<void>'. This is needed for tokenProvider's call
convenience and overall maintainability/readability of the code.

## Reviewer Guidance

This change has been verified with FRS and three types of Kafka server:

* HDI Kafka w/ SSL auth
* Event Hubs w/ connection string
* Event Hubs w/ managed identity


Another change that is not directly related to the main topic is fix for
incorrect node-rdkafka's feature filtering
```
		const rdKafkaHasSSLEnabled = kafka.features.filter((feature) =>
			feature.toLowerCase().includes("ssl"),
		);
```
The `rdKafkaHasSSLEnabled` will have an empty array if `ssl` is not
included. Which later would be checked with the following line:
```
		if (!rdKafkaHasSSLEnabled) {
```
The problem with this line is that it will always be false whether
`rdKafkaHasSSLEnabled` contains `ssl` entry or it's completely empty.
Empty array is still not null/undefined, so the condition will be false.

I've changed it to an array of lower case values from the original one
of rdkafka (redundant step, since it always contains lower case
features, but to make sure it's always lower case and it doesn't change
the previous logic).
```
		const lcKafkaFeatures = kafka.features.map((s) => s.toLowerCase());
```

And new way of checking it traverses this new array and checks it with
the `includes()` method:

```
		if (!lcKafkaFeatures.includes("ssl")) {
```
Reading through stashedOps.spec.ts, I wanted to make a couple changes:

* Move helpers out of the top-level `describe` block, because they were
inconsistently using both "global" state from that scope and objects
passed in via parameters
* Moved `waitForSummary` helper out too, so callsites are explicit about
which container is being awaited
* I added comments and tweaked some names/strings in the test `handles
stashed ops created on top of sequenced local ops` after stepping
through it, it was a very interesting one to follow in terms of
learning.

Bonus change: Move event-specific logging props into the "details"
property. It's easier to find this info that way (you don't have to
remember the exact name), and avoids an explosion of columns when the
telemetry sink turns each prop into its own column.
## Description

Use node's bundled CA on windows.

Co-authored-by: Abram Sanderson <[email protected]>
## Description

Currently, we use the ropc flow with a confidential client to
authenticate in our e2e/stress tests against odsp. This is less
appropriate than using a public client flow, since our application
registration really only needs to have delegated permissions.
This PR adjusts things to use a public flow--I've updated the relevant
app registrations to allow both forms of auth already.

Using a public flow also means our infrastructure setup here aligns
exactly with [this relatively recent MSFT
guidance](https://learn.microsoft.com/en-us/entra/identity-platform/test-automate-integration-testing?tabs=dotnet).

---------

Co-authored-by: Abram Sanderson <[email protected]>
@github-actions github-actions bot added area: runtime Runtime related issues area: loader Loader related issues area: dds Issues related to distributed data structures area: build Build related issues area: dev experience Improving the experience of devs building on top of fluid dependencies Pull requests that update a dependency file area: odsp-driver and removed area: dev experience Improving the experience of devs building on top of fluid area: dds: propertydds area: driver Driver related issues public api change Changes to a public API area: dds: tree area: examples Changes that focus on our examples area: tests Tests to add, test infrastructure improvements, etc area: framework Framework is a tag for issues involving the developer framework. Eg Aqueduct documentation Improvements or additions to documentation area: dds: sharedstring labels May 21, 2024
@github-actions github-actions bot added area: dds: tree and removed area: odsp-driver area: runtime Runtime related issues area: dev experience Improving the experience of devs building on top of fluid dependencies Pull requests that update a dependency file area: loader Loader related issues area: build Build related issues labels May 21, 2024
@github-actions github-actions bot removed area: dds: tree area: dds Issues related to distributed data structures labels May 21, 2024
Copy link
Contributor

@alexvy86 alexvy86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving on behalf of devtools so the PR doesn't have to be recreated, but didn't look in detail at the changes.

@kekachmar kekachmar merged commit 15e9521 into microsoft:main May 21, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: server Server related issues (routerlicious) base: main PRs targeted against main branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.