Releases: fluxninja/aperture
Aperture v0.10.0-rc.2
Changelog
List of aperture PRs merged since 0.10.0-rc.1 release. For the full list of changes, see list of changes
Learning period via EMA warm up window (#921)
Description of change
- EMA emits invalid readings during warm-up by default.
- Increase the EMA warm-up period in the latency gradient policy to 1 minute.
- This ensures no actuation for at least the first minute of
traffic while Aperture learns the latency profile of a service.
Aperture v0.10.0-rc.1
Changelog
List of aperture PRs merged since 0.9.0 release. For the full list of changes, see list of changes
Remove unused CheckResponse.Error (#906)
This field described only authz-specific errors and was filled in
envoy.Handler.Check() response when also returning non-nil error, but in
such case the grpc framework was not using the response anyway.
This field was also used for metrics, but no codepath was actually
setting them, as flowcontrol never set these.
Also:
- Create errors using grpc/status package, so that we have control over
the grpc status.
- Add missing sampled logs for error conditions.
Drive-by:
- Remove unused error from ClassifierEngine.Classify(), as it's
infallible (all errors are reported individually per-label).
- Remove unused code from authz.go.
Aperture SDK for Javascript (#817)
Co-authored-by: Hasit Mistry
Add authzHandler to sdk-validator's grpc server (#797)
Description of change
Add authzHandler to sdk-validator's grpc server
- Add CommonHandler
- Refactor FlowControlHandler with CommonHandler
Alerts pipelines (#893)
Description of change
This introduces basic pipelines for Alerts including the following.
alerts.Alerter interface
This interface is propagated as part of the platform. Any interested
party can use it by calling the AddAlert(*alerts.Alert) method. In
particular, it will be used by components like #863.
Helper functions and methods are provided on the alerts.Alert struct
for easy construction of such alerts.
Alerts receiver
This receiver calls the AlertsChan() method of alerts.Alerter, converts
received alerts.Alert structs into OpenTelemetry Logs format and pushes
them to the next consumer.
Convenience functions are provided for conversion in both
directions, to be used in the Alertmanager exporter #862.
Alerts processors
The alerts processor adds proper labels to the alerts, i.e. agent_group,
instance and controller_id.
Ref: GH-861
flowcontrol: restructure codebase II (#898)
Description of change
Making room for adding more APIs (adapters, previews etc) under
flowcontrol.
Document Prometheus metrics and OLAP Flow events (#878)
Description of change
Closes: #720
Speed up ser/deserialization of CheckResponse in envoy authz (#881)
Now CheckResponse is binary-encoded in protobuf wire format and stored
in DynamicMetadata as base64 string. This speeds up serialization, but
also deserialization (in metrics processor).
No changes in the envoyfilter definition were needed, as envoy's access logger passes
StringValue from dynamic metadata as-is (previously, it was JSON-encoding a
StructValue into a string).
Note: metrics processor still accepts JSON-encoding, so other SDKs should
continue working without changes.
Aperture v0.9.0
Changelog
List of aperture PRs merged since 0.8.0 release. For the full list of changes, see list of changes
flowcontrol: restructure codebase II (#898)
Description of change
Making room for adding more APIs (adapters, previews etc) under
flowcontrol.
Document Prometheus metrics and OLAP Flow events (#878)
Speed up ser/deserialization of CheckResponse in envoy authz (#881)
Now CheckResponse is binary-encoded in protobuf wire format and stored
in DynamicMetadata as base64 string. This speeds up serialization, but
also deserialization (in metrics processor).
No changes in the envoyfilter definition were needed, as envoy's access logger passes
StringValue from dynamic metadata as-is (previously, it was JSON-encoding a
StructValue into a string).
Note: metrics processor still accepts JSON-encoding, so other SDKs should
continue working without changes.
Results
(Based on looking at pprof data)
- createExtAuthzResponse went from total 18% to total 2.6% (from about 50% of
authz.Check to about 10%).
- GetStruct went from total 6% to total 3% (from about 75% of
metricsprocessor.ConsumeLogs to about 40%)
- total ~20% improvement
- now agent's overhead is either comparable or slightly higher than istio
proxy's (before, it was noticeably higher). (Note: istio proxy might also have
sped up as a result of this change due to not needing to serialize
protobuf.Struct in access logs, although I haven't measured this precisely)
Use envoy authz in java sdk (#816)
buf dependencies were updated resulting in changes in many generated files.
Restructure flowcontrol directories (#884)
Description of change
Restructure directories
Invalid signals telemetry (#876)
Description of change
- valid label on signal_reading metric for indicating whether the
reading was valid.
- Rename label attribute_found on FluxMeter metric to valid to be
consistent with Signal metrics.
- A new panel in Signals dashboard: "Signal Validity (Frequency)"
panichandler: process panic handlers in the same go routine (#875)
Aperture v0.9.0-rc.3
Changelog
List of aperture PRs merged since 0.8.0 release. For the full list of changes, see list of changes
flowcontrol: restructure codebase II (#898)
Description of change
Making room for adding more APIs (adapters, previews etc) under
flowcontrol.
Document Prometheus metrics and OLAP Flow events (#878)
Speed up ser/deserialization of CheckResponse in envoy authz (#881)
Now CheckResponse is binary-encoded in protobuf wire format and stored
in DynamicMetadata as base64 string. This speeds up serialization, but
also deserialization (in metrics processor).
No changes in the envoyfilter definition were needed, as envoy's access logger passes
StringValue from dynamic metadata as-is (previously, it was JSON-encoding a
StructValue into a string).
Note: metrics processor still accepts JSON-encoding, so other SDKs should
continue working without changes.
Results
(Based on looking at pprof data)
- createExtAuthzResponse went from total 18% to total 2.6% (from about 50% of
authz.Check to about 10%).
- GetStruct went from total 6% to total 3% (from about 75% of
metricsprocessor.ConsumeLogs to about 40%)
- total ~20% improvement
- now agent's overhead is either comparable or slightly higher than istio
proxy's (before, it was noticeably higher). (Note: istio proxy might also have
sped up as a result of this change due to not needing to serialize
protobuf.Struct in access logs, although I haven't measured this precisely)
Use envoy authz in java sdk (#816)
buf dependencies were updated resulting in changes in many generated files.
Restructure flowcontrol directories (#884)
Description of change
Restructure directories
Invalid signals telemetry (#876)
Description of change
- valid label on signal_reading metric for indicating whether the
reading was valid.
- Rename label attribute_found on FluxMeter metric to valid to be
consistent with Signal metrics.
- A new panel in Signals dashboard: "Signal Validity (Frequency)"
panichandler: process panic handlers in the same go routine (#875)
Aperture v0.9.0-rc.2
Changelog
List of aperture PRs merged since 0.9.0-rc.1 release. For the full list of changes, see list of changes
Aperture v0.9.0-rc.1
Changelog
List of aperture PRs merged since 0.8.0 release. For the full list of changes, see list of changes
Use envoy authz in java sdk (#816)
buf dependencies were updated resulting in changes in many generated files.
Restructure flowcontrol directories (#884)
Description of change
Restructure directories
Invalid signals telemetry (#876)
Description of change
- valid label on signal_reading metric for indicating whether the
reading was valid.
- Rename label attribute_found on FluxMeter metric to valid to be
consistent with Signal metrics.
- A new panel in Signals dashboard: "Signal Validity (Frequency)"
panichandler: process panic handlers in the same go routine (#875)
remove unused panic handler
Aperture v0.8.0
Changelog
List of aperture PRs merged since 0.7.0 release. For the full list of changes, see list of changes
Revamp workload and flux meter metrics and labels (#843)
Description of change
- New label attribute_found in FluxMeter to denote if the attribute on
which the flux meter is based was found in the access log/span
- Removed label decision_type on summary workload_latency_ms since
it is now emitted only if response was received.
- New counter workload_requests_total to measure the workload
decisions count since the summary does not take into account the
scenarios where response is not received e.g. rejects or connection
resets.
- A new column response_received on OLAP Flow events to denote the
case when response is not received.
Ignore negative workload latency (#839)
Issue
- Workload latency in case of Envoy is calculated as:
workload_latency = response_latency - aperture_latency
- Workload latency can become negative in case of a connection reset
- If the connection is aborted by the client or server, Envoy immediately
terminates the connection for the other endpoint.
- In the Access Log, status code is set as 0 and response_latency is
set as zero.
- If the Authz call to Aperture Agent had succeeded for this request, then
aperture_latency is greater than zero.
Fix
- Ignore negative workload latency, i.e. don't populate the workload
latency column
- Publish Prometheus metrics for flux-meter or workload latency only if
the metric column is found
TickInfo in LoadDecision (#836)
Description of change
- Put TickInfo in LoadDecision to re-trigger fill-rate evaluation at
Agent.
Re-structure protos (#831)
Fix telemetry labels propagation (#835)
Description of change
This fixes a regression introduced in #828.
Dynamic Telemetry Flow Labels were added before labels filtering, which
led them to be incorrectly filtered out.
Bump OTEL to 0.63.0 (#834)
Description of change
Bumps OTEL and FN OTEL to 0.63.0. This removes Istio 1.15 compat hack as
it is included in the upstream OTEL.
Response status in telemetry (#828)
Description of change
This introduces the aperture.response_status column in telemetry. It
mirrors the implementation of the response_status label for metrics.
This also extends the above logic to treat 1xx, 2xx, and 3xx codes
as OK instead of only 2xx codes.
Besides this, some cleanup is done:
- The above logic is moved from FluxMeter to the OTEL package. This changes
the FluxMeter interface!
- A lot of logic is moved from metricsprocessor to
metricsprocessor/internal for better visibility and easier separation
of functions which are called directly in metricsprocessor and helpers.
- The above made creating unit tests much easier, so this PR also includes
some.
Ref: fluxninja/cloud#6788
Dry run mode for Load Actuator (#826)
Description of change
- Dry run mode for Load Actuator. No traffic can get dropped by the
Load Actuator in this mode. Useful for observing the behavior of the Load
Actuator without any disruptions.
- Load Actuator has a new Pass through mode
- Default to Pass through mode in case the multiplier is invalid and also
when there is no decision available at the Agent, including during
initialization
Rollup based on metrics (#821)
Closes: GH-515
docs: playground doc updates (#819)
Description of change
- Moved demo_app to playground
- Added more details to playground documentation
- Bump istio and other tools
Aperture v0.8.0-rc.4
Changelog
List of aperture PRs merged since 0.8.0-rc.3 release. For the full list of changes, see list of changes
Aperture v0.8.0-rc.3
Changelog
List of aperture PRs merged since 0.8.0-rc.2 release. For the full list of changes, see list of changes
Revamp workload and flux meter metrics and labels (#843)
Description of change
- New label attribute_found in FluxMeter to denote if the attribute on
which the flux meter is based was found in the access log/span
- Removed label decision_type on summary workload_latency_ms since
it is now emitted only if response was received.
- New counter workload_requests_total to measure the workload
decisions count since the summary does not take into account the
scenarios where response is not received e.g. rejects or connection
resets.
- A new column response_received on OLAP Flow events to denote the
case when response is not received.
Skip NaN auto-tokens (#840)
Ignore negative workload latency (#839)
Issue
- Workload latency in case of Envoy is calculated as:
workload_latency = response_latency - aperture_latency
- Workload latency can become negative in case of a connection reset
- If the connection is aborted by the client or server, Envoy immediately
terminates the connection for the other endpoint.
- In the Access Log, status code is set as 0 and response_latency is
set as zero.
- If the Authz call to Aperture Agent had succeeded for this request, then
aperture_latency is greater than zero.
Fix
- Ignore negative workload latency, i.e. don't populate the workload
latency column
- Publish Prometheus metrics for flux-meter or workload latency only if
the metric column is found
TickInfo in LoadDecision (#836)
Description of change
- Put TickInfo in LoadDecision to re-trigger fill-rate evaluation at
Agent.
Re-structure protos (#831)
Fix telemetry labels propagation (#835)
Description of change
This fixes a regression introduced in #828.
Dynamic Telemetry Flow Labels were added before labels filtering, which
led them to be incorrectly filtered out.
Aperture v0.8.0-rc.2
Changelog
List of aperture PRs merged since 0.8.0-rc.1 release. For the full list of changes, see list of changes
Fix telemetry labels propagation (#835)
Description of change
This fixes a regression introduced in #828.
Dynamic Telemetry Flow Labels were added before labels filtering, which
led them to be incorrectly filtered out.