Commit

Merge branch 'master' into martinhaintz/redact-screenshots-via-view-hierarchy

* master: (28 commits)
  feat(angular): Update SDK provider setup for Angular 19 (#11921)
  feat(dynamic-sampling): adapt docs to new dynamic sampling logic (#11886)
  update banner for post-launch week promotion (#11964)
  chore(android): Add masking options to AndroidManifest (#11863)
  Bump API schema to 2126f7dd (#11965)
  chore(Profiling): Add callouts and links to Android Profiling troubleshooting info (#11905)
  docs(flutter): Use sentry flutter init in samples (#11858)
  use native crypto to generate uuid (#11959)
  fix vercel integration 404 (#11958)
  Add RN Replay Privacy page (#11798)
  feat(dashboards): Add docs for Dashboard Edit Access Selector (#11822)
  feat(app-starts): Add RN SDK min version (#11650)
  feat(relay): Add Relay best practices guide (#11914)
  docs(sdks): New Scope APIs (#11943)
  docs(sdks): Span Sampling (#11940)
  Add include explaining sample code options (#11866)
  devenv: internal troubleshooting (#11947)
  Bump API schema to 0b18bfae (#11946)
  Bump API schema to 2bee5317 (#11945)
  feat: Link to Replay Issues when we mention Perf Issues as well (#11933)
  ...
mar-hai committed Nov 27, 2024
2 parents c301cee + a1313a6 commit a6383e4
Showing 105 changed files with 1,178 additions and 582 deletions.

@@ -6,29 +6,38 @@ sidebar_order: 1

![Sequencing](./images/sequencing.png)


<Alert title="💡 Note" level="info">

Dynamic Sampling currently operates on either spans or transactions to measure data throughput. This is controlled by the feature flag `organizations:dynamic-sampling-spans`, which is usually set to match whichever data category the organization's subscription is metered by. In development, it currently defaults to transactions.
The logic is identical for both data categories, so most of this documentation is kept generic, with important differences pointed out explicitly.

</Alert>


## Sequencing

Dynamic Sampling occurs at the edge of our ingestion pipeline, specifically in [Relay](https://github.com/getsentry/relay).

When events arrive, in a simplified model, they go through the following steps:

1. **Inbound data filters**: every event runs through inbound data filters as configured in project settings, such as legacy browsers or denied releases. Events dropped here are not counted towards quota and are not included in "total events" data.
2. **Quota enforcement**: Sentry charges for all further events sent in, before they are passed on to dynamic sampling.
3. **Metrics extraction**: after passing quotas, Sentry extracts metrics from the total incoming events. These metrics provide granular numbers for the performance and frequency of every event.
4. **Dynamic Sampling**: based on an internal set of rules, Relay determines a sample rate for every incoming event. A random number generator finally decides whether a payload should be kept or dropped (see the sketch after this list).
5. **Rate limiting**: events that are sampled by Dynamic Sampling will be stored and indexed. To protect the infrastructure, internal rate limits apply at this point. Under normal operation, this **rate limit is never reached** since dynamic sampling already reduces the volume of stored events.
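
A minimal sketch of the keep/drop decision in step 4, assuming the `rand` crate and a plain uniform draw (the actual sample rates come from Relay's internal rules):

```rust
use rand::Rng;

/// Step 4 in miniature: given a sample rate in [0.0, 1.0] determined by
/// internal rules, a uniform random draw decides whether to keep the payload.
fn keep_event(sample_rate: f64) -> bool {
    rand::thread_rng().gen::<f64>() < sample_rate
}

fn main() {
    // At a 20% sample rate, roughly one in five events is kept.
    let kept = (0..10_000).filter(|_| keep_event(0.2)).count();
    println!("kept {kept} of 10000 events");
}
```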

<Alert title="💡 Example" level="info">

A client is sending 1000 events per second to Sentry:
1. 100 events per second are from old browsers and get dropped through an inbound data filter.
2. The remaining 900 events per second show up as total events in Sentry.
3. Their current overall sample rate is at 20%, which statistically samples 180 events per second.
4. Since this exceeds the internal rate limit of 100 events per second assumed in this example, about 80 events per second are randomly dropped, and the rest are stored.

</Alert>

## Rate Limiting and Total Events

The ingestion pipeline has two kinds of rate limits that behave differently compared to organizations without dynamic sampling:

@@ -37,49 +46,49 @@

<Alert title="✨️ Note" level="info">

There is a dedicated rate limit for stored events after inbound filters and dynamic sampling. However, it does not affect total events: fidelity decreases with higher total event volumes, and this rate limit is not expected to trigger since Dynamic Sampling already reduces the stored event throughput.

</Alert>

## Rate Limiting and Trace Completeness

Dynamic sampling ensures complete traces by retaining all events associated with a trace if the head event is preserved.

Despite dynamic sampling providing trace completeness, events or other items (errors, replays, ...) may still be missing from a trace when rate limiting drops one or more of them. Rate limiting drops items without regard for the trace, making each decision independently and potentially resulting in broken traces.

<Alert title="💡 Example" level="info">

For example, if there is a trace from `Project A` to `Project B` and `Project B` is subject to rate limiting or quota enforcement, events of `Project B` from the trace initiated by `Project A` are lost.

</Alert>

## Client Side Sampling and Dynamic Sampling

Clients have their own [traces sample rate](https://docs.sentry.io/platforms/javascript/tracing/#configure). The client sample rate is a number in the range `[0.0, 1.0]` (from 0% to 100%) that controls **how many events arrive at Sentry**. While documentation will generally suggest a sample rate of `1.0`, for some use cases it might be better to reduce it.

Dynamic Sampling further reduces how many events get stored internally. **While most graphs and numbers in Sentry are based on metrics**, accessing spans and tags requires stored events. The sample rates apply on top of each other.

An example of client-side sampling and Dynamic Sampling, starting from 100k events and resulting in 15k stored events, is shown below:

![Client and Dynamic Sampling](./images/clientAndDynamicSampling.png)
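
As a back-of-the-envelope check, the rates simply multiply. The 50%/30% split below is assumed purely for illustration, chosen so that 100k events end up as 15k stored:

```rust
// Client-side and dynamic sample rates apply on top of each other:
// stored = total × client_rate × dynamic_rate.
fn stored_events(total: u64, client_rate: f64, dynamic_rate: f64) -> u64 {
    (total as f64 * client_rate * dynamic_rate).round() as u64
}

fn main() {
    // 100,000 × 0.5 × 0.3 = 15,000 stored events.
    assert_eq!(stored_events(100_000, 0.5, 0.3), 15_000);
}
```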

## Total Transactions

To collect unsampled information for “total” transactions in Performance, Alerts, and Dashboards, Relay extracts [metrics](https://getsentry.github.io/relay/relay_metrics/index.html) from spans and transactions. In short, these metrics comprise:

- Counts and durations for all events.
- A distribution (histogram) for all measurements, most notably the web vitals.
- The number of unique users (set).

Each of these metrics can be filtered and grouped by a number of predefined tags, [implemented in Relay](https://github.com/getsentry/relay/blob/master/relay-server/src/metrics_extraction/transactions/types.rs#L142-L157).
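
To make the shape of these aggregates concrete, here is a hedged sketch; the type and field names are hypothetical and do not mirror Relay's actual metrics model:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical aggregate per metric/tag combination, not Relay's types.
struct ExtractedMetrics {
    /// Counter: how many events were seen.
    count: u64,
    /// Distribution: event durations, backing quantiles such as p95.
    duration_ms: Vec<f64>,
    /// Distributions per measurement name, e.g. the "lcp" or "fid" web vitals.
    measurements: HashMap<String, Vec<f64>>,
    /// Set: distinct user identifiers, for unique-user counts.
    unique_users: HashSet<String>,
}
```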

For more granular queries, **stored events are needed**. _The purpose of dynamic sampling here is to ensure that there are always sufficient representative sample events._

<Alert title="💡 Example" level="info">

If Sentry applies a 1% dynamic sample rate, you can still receive accurate events per minute (SPM or TPM, depending on event type) and web vital quantiles through total event data backed by metrics. There is also a listing of each of these numbers by transaction.

When you go into the trace explorer or Discover, you might now want to split the data by a custom tag you’ve added to your events. This granularity is not offered by metrics, so **these queries need to use stored events**.

</Alert>

6 changes: 4 additions & 2 deletions develop-docs/development-infrastructure/environment/index.mdx
@@ -187,7 +187,9 @@

## Troubleshooting

The more up-to-date troubleshooting docs for the internal development environment on macOS are <Link to="https://www.notion.so/sentry/devenv-troubleshooting-1448b10e4b5d8080ba04f452e33de48d">here</Link>.

You might also be interested in <Link to="/development/continuous-integration/#troubleshooting-ci">Troubleshooting CI</Link>.

---

@@ -210,7 +212,7 @@

**Problem:** You see an error that mentions something like `pkg_resources.DistributionNotFound: The 'some_dependency<0.6.0,>=0.5.5' distribution was not found and is required by sentry`

**Solution:** Your virtualenv needs to be updated. Run `devenv sync`.

---

@@ -23,7 +23,7 @@ To add or manually update a dependency:
5. In that repo, add to or update `requirements-base.txt` or `requirements-dev.txt`, as appropriate. Note that many of our dependencies are pinned with lower bounds only, to encourage updating to latest versions, though we do use exact pins for certain core dependencies like `django`. Choose whichever one feels most appropriate in your case.
6. Run `make freeze-requirements`. You might need to wait a few minutes for the changes to `getsentry/pypi` to be deployed before this will work without erroring.
7. Commit your changes (which should consist of changes to both one of the `requirements` files and its corresponding lockfile) to a branch and open a PR in the relevant repo. If it's not obvious, explain why you're adding or updating the dependency. Tag `owners-python-build` if they haven't already been auto-tagged.
8. Merge your PR, pull `master`, and run `devenv sync`.

To update a dependency using GitHub Actions:

124 changes: 124 additions & 0 deletions develop-docs/ingestion/relay-best-practices.mdx
@@ -0,0 +1,124 @@
---
title: Relay Best Practices
---

Relay is a critical component in Sentry's infrastructure, handling hundreds of thousands of requests per second.
To make sure your changes to Relay go smoothly, there are some best practices to follow.

For general Rust development best practices, make sure you read the [Rust Development](/engineering-practices/rust/)
document.


## Forward Compatibility

Make sure changes to the event protocol and APIs are forward-compatible. Relay should not drop
or truncate data it does not understand. It is a supported use-case to have customers running
outdated Relays but up-to-date SDKs.
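
A common pattern for this, sketched here with `serde` and `serde_json` under assumed field names (this is not Relay's actual protocol type), is to capture unknown fields and re-emit them unchanged:

```rust
use std::collections::BTreeMap;

use serde::{Deserialize, Serialize};
use serde_json::Value;

// Forward-compatible payload: fields this version doesn't know about are
// collected into `other` on deserialization and written back on
// serialization, so an outdated Relay never drops data from a newer SDK.
#[derive(Serialize, Deserialize)]
struct Event {
    event_id: String,
    #[serde(flatten)]
    other: BTreeMap<String, Value>,
}
```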


## Feature Gate new Functionality


Consider making your new feature conditional. Not only does this allow a gradual roll-out, it can also
function as a kill-switch when something goes wrong.


### Sentry Features

Sentry features are best used for new product features. They can be gradually rolled out to a subset
of organizations and projects.


### Global Config

Kill-switches, sample- and rollout-rates can also be configured via "global config". Global config
is a simple mechanism to propagate a dynamic configuration from Sentry to Relay.

Global config works very well for technical rollouts and kill-switches.


### Relay Config

The Relay configuration file can also be used to feature-gate functionality. This is best suited
for configuration that is environment-specific or fundamental to how Relay operates, and should not
be used for product features.


## Regular expressions

Regular expressions are a powerful tool: they are well known and extremely performant.
But they aren't always the right choice for Relay.

Some of the issues listed below are specific to the [regex](https://docs.rs/regex/latest/regex/) Rust crate that Relay uses;
others apply to regular expressions in general.

<Alert level="info" title="Important">
Review the [Rust crate's documentation](https://docs.rs/regex/latest/regex/#overview) carefully before introducing a Regex.
</Alert>


### Compiling

Compiling a Regex is quite costly.

- Do not compile a regular expression for every element that is processed.
- Always cache regular expressions.

Static regular expressions can be cached using [`std::sync::LazyLock`](https://doc.rust-lang.org/beta/std/sync/struct.LazyLock.html).
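
For example, a minimal sketch (the pattern itself is only illustrative):

```rust
use std::sync::LazyLock;

use regex::Regex;

// Compiled once on first use, then reused for every element processed.
static HEX_ID: LazyLock<Regex> =
    LazyLock::new(|| Regex::new(r"^[0-9a-f]{32}$").expect("valid regex"));

fn is_hex_id(value: &str) -> bool {
    HEX_ID.is_match(value)
}
```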


### User defined Regular Expressions

Avoid exposing regular expressions to users. Instead, consider using a targeted domain-specific language or
glob-like patterns, for which Relay has good support.

<Alert level="warning" title="Important">
User-defined regexes can lead to unexpected CPU or memory usage.
Review the [Untrusted Input](https://docs.rs/regex/latest/regex/#untrusted-input) section in the regex crate documentation
carefully.
</Alert>
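
As a sketch of the glob-style alternative, using the `globset` crate purely for illustration (not necessarily the pattern engine Relay uses):

```rust
use globset::Glob;

// Compile a user-supplied glob (e.g. "*/health*") instead of a regex;
// glob matching has predictable cost on untrusted patterns.
fn matches_glob(pattern: &str, value: &str) -> bool {
    Glob::new(pattern)
        .map(|glob| glob.compile_matcher().is_match(value))
        .unwrap_or(false) // invalid patterns simply don't match
}
```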


### Avoid Unicode

When not strictly necessary, specify match groups without Unicode support. For example, `\w` matches around 140,000
distinct code points. When you only need to match ASCII, consider using an explicit group like `[0-9A-Za-z_]` instead,
or disable Unicode for the group with `(?-u:\w)`.
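
A small illustration of the difference, assuming the `regex` crate:

```rust
use regex::Regex;

fn main() {
    // Unicode-aware: `\w` covers letters and digits across all of Unicode.
    let unicode_word = Regex::new(r"\w+").unwrap();
    // ASCII-only alternatives, equivalent to each other:
    let ascii_explicit = Regex::new(r"[0-9A-Za-z_]+").unwrap();
    let ascii_no_unicode = Regex::new(r"(?-u:\w)+").unwrap();

    assert!(unicode_word.is_match("é"));
    assert!(!ascii_explicit.is_match("é"));
    assert!(!ascii_no_unicode.is_match("é"));
}
```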


## (De-)Serialization

(De-)serialization of JSON, protobuf, and MessagePack uses a considerable amount of CPU and memory. Be mindful of when and where
you (de-)serialize data.

Relay's architecture requires all CPU-heavy operations to run on a dedicated thread pool, which is used in the so-called
[processor service](https://github.com/getsentry/relay/blob/master/relay-server/src/services/processor.rs).


## Data Structures

All data structures come with a cost; choose the correct data structure for your workload.

- A `BTreeMap` may be more performant than a `HashMap` for small amounts of data.
- Use `ahash` (through `hashbrown::HashMap`) as the hasher for a `HashMap` to reduce hashing overhead in high-throughput maps.
- Use `smallvec` over `Vec` when your data is likely to be small.
- Avoid iterating through large collections at regular intervals; consider using a [priority queue](https://en.wikipedia.org/wiki/Priority_queue) or
  [heap](https://doc.rust-lang.org/std/collections/struct.BinaryHeap.html) instead (see the sketch below).
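
A minimal sketch of the last point, with illustrative names: deadlines go into a min-heap, so a periodic tick inspects only the earliest deadline instead of scanning the whole collection:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::Instant;

// Min-heap of (deadline, item id): `Reverse` flips Rust's max-heap ordering.
struct Expirations {
    heap: BinaryHeap<Reverse<(Instant, u64)>>,
}

impl Expirations {
    fn schedule(&mut self, deadline: Instant, id: u64) {
        self.heap.push(Reverse((deadline, id)));
    }

    /// Pops only items whose deadline has passed: O(log n) per expiry
    /// instead of O(n) per tick.
    fn pop_expired(&mut self, now: Instant) -> Vec<u64> {
        let mut expired = Vec::new();
        while let Some(&Reverse((deadline, id))) = self.heap.peek() {
            if deadline > now {
                break;
            }
            self.heap.pop();
            expired.push(id);
        }
        expired
    }
}
```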


## CPU and Memory constraints

Relay operates on untrusted user input; make sure enough precautions are taken to avoid a denial-of-service
attack.

- Apply size limits to payloads.
- Limit memory consumption in Relay. For example, apply size limits to decompressed output as well, not just to the incoming compressed data (see the sketch below).
- Avoid exponential algorithms; often better data structures and algorithms are available. If this is not possible, set a limit.
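
A sketch of the decompression point, using the `flate2` crate and an arbitrary limit purely for illustration:

```rust
use std::io::{Error, ErrorKind, Read, Result};

use flate2::read::GzDecoder;

const MAX_DECOMPRESSED: u64 = 20 * 1024 * 1024; // assumed 20 MiB limit

// Caps the *output* of decompression: a tiny compressed payload can
// otherwise expand into gigabytes (a "zip bomb").
fn decompress_limited(compressed: &[u8]) -> Result<Vec<u8>> {
    let mut out = Vec::new();
    GzDecoder::new(compressed)
        .take(MAX_DECOMPRESSED + 1) // stop reading just past the limit
        .read_to_end(&mut out)?;
    if out.len() as u64 > MAX_DECOMPRESSED {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "decompressed payload exceeds size limit",
        ));
    }
    Ok(out)
}
```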


## Don't panic

Relay needs to be able to handle any user input without a [panic](https://doc.rust-lang.org/std/macro.panic.html).
Always return useful errors when encountering unexpected or malformed inputs. This improves debuggability,
guarantees stability, and generates the necessary outcomes.
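
A minimal sketch of the pattern, with hypothetical names; real protocol parsing in Relay is more involved:

```rust
#[derive(Debug)]
enum ParseError {
    /// Input didn't have the expected shape; carries a reason for debugging.
    Malformed(&'static str),
}

// Returns an error for malformed input instead of panicking via
// `unwrap`/`expect`/indexing, so one bad payload can't take Relay down.
fn parse_length_prefix(input: &[u8]) -> Result<u32, ParseError> {
    let bytes: [u8; 4] = input
        .get(..4) // avoids the panicking `input[..4]`
        .and_then(|slice| slice.try_into().ok())
        .ok_or(ParseError::Malformed("expected at least 4 length bytes"))?;
    Ok(u32::from_be_bytes(bytes))
}
```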
