Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

blobstore: Add metrics_tag config field #205

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

minor-fixes
Copy link
Contributor

@minor-fixes minor-fixes commented May 11, 2024

This change adds a single string field to the base blobstore config message, the value of which gets propagated to a tag value on exported metrics for the corresponding backend instance.

In particular, this change allows for attaching different tags on separate uses of a single "label" backend in order to collect metrics specific to a particular instantiation. Without this change, all accesses to the referenced backend shared one combined set of metrics; with this change, it's possible to set separate metrics_tag values on each instantiation and collect separate metrics, as well as tagging the underlying backend and grabbing aggregate metrics as well.

TODO before marking non-draft:

  • pass presubmit checks
  • drop spurious bzlmod changes
  • format code
  • second pass on proto comments
  • add appropriate test(s)?

Tested: see comment below

@minor-fixes
Copy link
Contributor Author

With the config attached below, I started a bb_storage instance, and issued 7 RPCs:

  • 1x "access through readCaching": grpc_cli call localhost:8980 ContentAddressableStorage.FindMissingBlobs 'instance_name: "" digest_function: 1 blob_digests { hash: "aed662c522d64bb9e73534f4cd9cd0d98c8407e737689f1a591e4409f4d9e0ce" size_bytes: 5 }'
  • 2x "access slow directly": grpc_cli call localhost:8980 ContentAddressableStorage.FindMissingBlobs 'instance_name: "slow" digest_function: 1 blob_digests { hash: "aed662c522d64bb9e73534f4cd9cd0d98c8407e737689f1a591e4409f4d9e0ce" size_bytes: 5 }'
  • 4x "access fast directly": grpc_cli call localhost:8980 ContentAddressableStorage.FindMissingBlobs 'instance_name: "fast" digest_function: 1 blob_digests { hash: "aed662c522d64bb9e73534f4cd9cd0d98c8407e737689f1a591e4409f4d9e0ce" size_bytes: 5 }'

The counts for the respective tags were:

  • top: 7 (sum of all RPCs)
  • read_cached: 1 (only one call to the readCaching backend)
  • slow_underlying: 3 (1 call to readCaching + 2 calls to 'slow' directly via instance name)
  • fast_underlying: 4 (0 calls to readCaching? + 4 calls to 'fast' directly via instance name)
  • slow_direct: 2 (counts only direct calls via instance name)
  • slow_cached: 1 (only one call to the readCaching backend)
  • fast_direct: 4 (counts only direct calls via instance name)
  • fast_cached: 0 (presumably; could not find metrics)

I wouldn't mind adding an appropriate test though; any ideas on a better testing approach?


Config:

local SmallCache = function(tag) {
  metricsTag: tag,
  'local': {
    keyLocationMapInMemory: { entries: 1000 },
    keyLocationMapMaximumGetAttempts: 8,
    keyLocationMapMaximumPutAttempts: 32,
    oldBlocks: 2,
    currentBlocks: 1,
    newBlocks: 1,
    blocksInMemory: {
      blockSizeBytes: 10 * 1000 * 1000,
    },
  },
};

{
  executeAuthorizer: { allow: {} },
  grpcServers: [
    {
      listenAddresses: ["127.0.0.1:8980"],
      authenticationPolicy: { allow: {} },
    },
  ],
  global: {
    diagnosticsHttpServer: {
      httpServers: [{ listen_addresses: ["127.0.0.1:9980"], authenticationPolicy: { allow: {} } }],
      enablePrometheus: true,
    },
  },
  contentAddressableStorage: {
    getAuthorizer: { allow: {} },
    putAuthorizer: { allow: {} },
    findMissingAuthorizer: { allow: {} },
    backend: {
      withLabels: {
        backend: {
          metricsTag: 'top',
          demultiplexing: {
            instanceNamePrefixes: {
              '': {
                backend: {
                  metricsTag: 'read_cached',
                  readCaching: {
                    slow: { label: 'slow', metricsTag: 'slow_cached' },
                    fast: { label: 'fast', metricsTag: 'fast_cached' },
                    replicator: { 'local': {} },
                  },
                },
              },
              'slow': {
                backend: {
                  label: 'slow',
                  metricsTag: 'slow_direct',
                },
              },
              'fast': {
                backend: {
                  label: 'fast',
                  metricsTag: 'fast_direct',
                },
              },
            },
          },
        },
        labels: {
          'fast': SmallCache('fast_underlying'),
          'slow': SmallCache('slow_underlying'),
        },
      },
    },
  },
}

@minor-fixes minor-fixes force-pushed the metrics_tag branch 4 times, most recently from 2e717ee to e8ef171 Compare May 13, 2024 00:47
@EdSchouten
Copy link
Member

Instead of actually having a separate metrics_tag, what are your thoughts on reusing the existing with_label / label facility? As in, make it so that using with_label causes a label="..." to be added to any BlobAccess underneath.

@minor-fixes
Copy link
Contributor Author

minor-fixes commented Nov 11, 2024

Instead of actually having a separate metrics_tag, what are your thoughts on reusing the existing with_label / label facility? As in, make it so that using with_label causes a label="..." to be added to any BlobAccess underneath.

I'm trying to make it a bit easier to probe metrics at any layer, so it's a bit more clear when things are misconfigured. The WithLabelsBlobAccess does look like a good candidate for this, but I'm concerned it will complicate things in practice:

  • in order to instrument a particular layer, it needs to be moved to a labeled backend, which is a moderate configuration change - if one were trying to probe metrics to debug a misconfiguration, such a config change to do so would create more uncertainty.
  • nesting labeled backends could be challenging to configure, and difficult to reason about (if children of labeled backends inherit parent labels)

With this approach, one only needs to add unique metricsTag specifiers in the config to desired points in the blobstore configuration stack.

As an added bonus, metrics for the same labeled backend can be tracked with different identifiers if the labeled backend is reused, making it possible to track metrics for disparate entrypoints into the same labeled backend separately if desired. In the example above, the labeled backend calls can be measured via the slow_underlying labels, and the calls from the two wrappers can also be recorded separately via the slow_direct and slow_cached labels.

@minor-fixes minor-fixes marked this pull request as ready for review November 11, 2024 18:52
WIP; more description to come
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants