diff --git a/rfcs/2023-05-03-data-volume-metrics.md b/rfcs/2023-05-03-data-volume-metrics.md
new file mode 100644
index 0000000000000..368fe874cd420
--- /dev/null
+++ b/rfcs/2023-05-03-data-volume-metrics.md
@@ -0,0 +1,240 @@
# RFC 2023-05-03 - Data Volume Insights metrics

Vector needs to be able to emit accurate metrics that can be usefully queried
to give users insight into the volume of data moving through the system.

## Scope

### In scope

- All volume event metrics within Vector need to emit the estimated JSON size of
  the event. With a consistent method for determining the size, it becomes much
  easier to accurately compare data in vs data out.
  - `component_received_event_bytes_total`
  - `component_sent_event_bytes_total`
  - `component_received_events_total`
  - `component_sent_events_total`
- The metrics sent by each sink need to be tagged with the ID of the source of
  the event, so that the route an event takes through Vector can be queried.
- Each event needs to be labelled with a `service`. This is a new concept
  within Vector and represents the application that generated the log,
  metric or trace.
- The `service` tag and `source` tag in the metrics need to be opt-in so users
  that don't need the increased cardinality are unaffected.

### Out of scope

- Separate metrics, `component_sent_bytes_total` and `component_received_bytes_total`,
  that indicate the network bytes sent and received by Vector are not considered here.

## Pain

Currently it is difficult to accurately gauge the volume of data moving
through Vector, and it is difficult to query where data being sent out
originally came from.

## Proposal

### User Experience

Global config options will be provided to indicate that the `service` tag and the
`source` tag should be sent. For example:

```yaml
telemetry:
  tags:
    service: true
    source_id: true
```

This will cause Vector to emit a metric like (note the last two tags):

```prometheus
vector_component_sent_event_bytes_total{component_id="out",component_kind="sink",component_name="out",
  component_type="console",host="machine",service="potato",source_id="stdin"} 123
```

The default will be to not emit these tags.

### Implementation

#### Metric tags

**service** - To attach the service, we need to add a new meaning to Vector -
              `service`. Any sources that receive data that could potentially
              be considered a service will need to indicate which field means
              `service`. This work has largely already been done with the
              Log Namespacing work, so it will be trivial to add this new
              meaning. Not all sources will be able to specify a specific
              field to indicate the `service`. In time it will be possible
              for this to be accomplished through `VRL`.

**source_id** - A new field will be added to the [Event metadata][event_metadata] -
                `Arc<OutputId>` - that will indicate the source of the event.
                `OutputId` will need to be serializable so it can be stored in
                the disk buffer. Since this field is just an identifier, it can
                still be used even if the source no longer exists when the event
                is consumed by a sink.
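As a rough illustration, a minimal sketch of the metadata change is below. The
field name, the `Option` wrapping, the derives, and the `OutputId` stand-in are
assumptions made for the sketch; the real `EventMetadata` and `OutputId` types
in `vector-core` carry considerably more detail:

```rust
use std::sync::Arc;

/// Stand-in for Vector's existing `OutputId` (a component ID plus an optional
/// named output port), defined here only to keep the sketch self-contained.
/// Per this RFC it will also need to be serializable for disk buffers.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct OutputId {
    pub component: String,
    pub port: Option<String>,
}

/// Illustrative sketch only: the real `EventMetadata` has many other fields.
#[derive(Clone, Debug)]
pub struct EventMetadata {
    /// The source output that produced this event. `None` when the event was
    /// created inside Vector itself (e.g. by the `lua` transform).
    pub source_id: Option<Arc<OutputId>>,
}
```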
We will need to audit all components to ensure the bytes emitted for the
`component_received_event_bytes_total` and `component_sent_event_bytes_total`
metrics are the estimated JSON size of the event.

These tags will be given the names configured in
[User Experience](#user-experience).

The `reduce` and `aggregate` transforms combine multiple events together. In
this case the `source` and `service` of the first event will be taken.

If there is no `source`, a source of `-` will be emitted. The only way this can
happen is if the event was created by the `lua` transform.

If there is no `service` available, a service of `-` will be emitted.

Emitting a `-` rather than not emitting anything at all makes it clear that
there was no value, rather than it having been forgotten, and ensures it
is clear that the metric represents no `service` or `source` rather than the
aggregate value across all services.

The [Component Spec][component_spec] will need updating to indicate that these
tags need to be included.

**Performance** - There is going to be a performance hit when emitting these
metrics. Currently, for each batch a single event is emitted containing the
count and size of the entire batch. With this change it will be necessary to
scan the entire batch to obtain the counts for each (source, service)
combination before emitting them. This will involve additional allocations to
maintain the counts, as well as the O(n) scan of the batch.

#### `component_received_event_bytes_total`

This metric is emitted by the framework [here][source_sender], so it looks like
the only change needed is to add the service tag.

#### `component_sent_event_bytes_total`

For stream-based sinks this will typically be the byte value returned by
`DriverResponse::events_sent`.

Despite this being in the [Component Spec][component_spec], not all sinks
currently conform to it.

As an example, from a cursory glance over a couple of sinks:

The AMQP sink currently emits this value as the length of the binary data that
is sent. By the time the data has reached the code where the
`component_sent_event_bytes_total` event is emitted, the event has been encoded
and its estimated JSON size has been lost. The sink will need to be updated so
that when the event is encoded, the encoded event together with the
pre-encoding estimated JSON byte size is sent to the service where the metric
is emitted.

The Kafka sink also currently sends the binary size, but it looks like the
estimated JSON byte size is easily accessible at the point of emission, so it
would not need much of a change.

To ensure that the correct value is sent in a type-safe manner, we will wrap
the estimated JSON size in a newtype:

```rust
pub struct JsonSize(usize);
```

The `EventsSent` metric will only accept this type.
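To illustrate how the newtype catches mistakes at compile time, here is a rough
sketch; the constructor, the accessor, and the exact
`EstimatedJsonEncodedSizeOf` signature are assumptions based on the plan of
attack below rather than settled API:

```rust
/// A byte size measured as estimated JSON, kept distinct from a bare `usize`
/// (such as an encoded wire size) so the two cannot be confused.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, PartialOrd, Ord)]
pub struct JsonSize(usize);

impl JsonSize {
    pub const fn new(bytes: usize) -> Self {
        Self(bytes)
    }

    pub const fn get(self) -> usize {
        self.0
    }
}

/// Sketch of the trait update from the plan of attack: returning `JsonSize`
/// means a sink cannot accidentally feed encoded (network) byte counts into
/// `EventsSent`, because the compiler rejects a bare `usize`.
pub trait EstimatedJsonEncodedSizeOf {
    fn estimated_json_encoded_size_of(&self) -> JsonSize;
}
```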
#### Registered metrics

It is currently not possible to have dynamic tags with preregistered metrics.

Preregistering these metrics is essential to ensure that they don't expire.

The current mechanism to expire metrics is to check if a handle to the given
metric is being held. If it isn't, and nothing has updated that metric in
the last cycle, the metric is dropped. If a metric is dropped, the next time
an event is emitted with those tags, the count starts at zero again.

We will need to introduce a registered event caching layer that will register
and cache new events keyed on the tags that are sent to it.

Currently a registered metric is stored in a `Registered<Event>`.

We will need a new struct that wraps this and is generic over a tuple of the
tags for each event and the event itself - e.g.
`Cached<(String, String), EventsSent>`. This struct will maintain a `BTreeMap`
of tags -> `Registered<Event>`. Since this will need to be shared across
threads, the cache will need to be stored in an `RwLock`.

In pseudo-Rust:

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

struct Cached<Tags, Event> {
    cache: Arc<RwLock<BTreeMap<Tags, Registered<Event>>>>,
    register: Box<dyn Fn(&Tags) -> Registered<Event> + Send + Sync>,
}

impl<Tags: Ord + Clone, Event> Cached<Tags, Event> {
    fn emit(&self, tags: &Tags, value: Event) {
        // Fast path: a metric for these tags is already registered.
        if let Some(event) = self.cache.read().unwrap().get(tags) {
            event.emit(value);
            return;
        }

        // Slow path: register a new metric for these tags and cache the
        // handle so the metric doesn't expire.
        let event = (self.register)(tags);
        event.emit(value);
        self.cache.write().unwrap().insert(tags.clone(), event);
    }
}
```

## Rationale

The ability to visualize the data flowing through Vector will allow users to
ascertain how effectively they are currently using Vector, and to optimise
their configurations to make the best use of Vector's features.

## Drawbacks

The additional tags being added to the metrics will increase the cardinality of
those metrics if they are enabled.

## Prior Art

## Alternatives

We could use an alternative measure instead of the estimated JSON size:

- *Network bytes* - provides a more accurate picture of the actual data being
  received and sent by Vector, but will regularly produce different sizes for
  an incoming event and the corresponding outgoing event.
- *In-memory size* - the size of the event as held in memory. This may be more
  accurate for determining the amount of memory Vector will be utilizing at
  any time, but will often be less accurate compared to the data as sent and
  received, which is often JSON.

## Outstanding Questions

## Plan Of Attack

Incremental steps to execute this change. These will be converted to issues after the RFC is approved:

- [ ] Add the `source` field to the Event metadata to indicate the source the event came from.
- [ ] Update the volume event metrics to take a `JsonSize` value. Use the compiler to ensure all metrics
      emitted use this. The `EstimatedJsonEncodedSizeOf` trait will be updated to return a `JsonSize`.
- [ ] Add the `service` meaning. Update any sources that potentially create a service to point the meaning
      at the relevant field.
- [ ] Introduce an event caching layer that caches registered events based on the tags sent to it.
- [ ] Update the emitted events to accept the new tags, taking the `telemetry` configuration options
      into account.
- [ ] There is going to be a performance hit with these changes. Add benchmarking to help us understand
      how large the impact will be.

## Future Improvements

- Logs emitted by Vector should also be tagged with `source_id` and `service`.
- This RFC proposes storing the source and service as strings. This incurs a cost, since scanning each
  event to count events by source and service will involve multiple string comparisons. A future
  optimization could be to hash the combination of these values at the source into a single integer.

[component_spec]: https://github.com/vectordotdev/vector/blob/master/docs/specs/component.md#componenteventssent
[source_sender]: https://github.com/vectordotdev/vector/blob/master/src/source_sender/mod.rs#L265-L268
[event_metadata]: https://github.com/vectordotdev/vector/blob/master/lib/vector-core/src/event/metadata.rs#L20-L38