Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MPP-3352: Add first Glean metrics to measure email mask usage #4452

Merged
merged 24 commits into from
Mar 8, 2024

Conversation

jwhitlock
Copy link
Member

Add Glean metrics for measuring P0 email mask usage. This includes email mask creation, label updates, and deletion, as well as receiving emails (forwarded or blocked) through those masks.

This is tracked as epic MPP-3352. The related bug for data review is bug 1882565.

Some parts of this PR:

  • Add the requirements glean_parser, to generate Python code from Glean YAML, and django-ipware, to extract the user's IP address from headers
  • Add a Glean metrics YAML file at telemetry/glean/relay-server-metrics.yaml. In the future, this folder can hold front-end or other application metrics, and specification for other telemetry methods.
  • Add the generated Python code at privaterelay/glean/server_events.py
  • Add a CircleCI job to confirm that the generated Python code has not been modified, as well as code to regenerate the file after a requirement or metrics update.
  • Add privaterelay/glean_interface.py to simplify data extraction and opt-out checking, and to publish events as log entries formatted with MozLog. Add tests for the emitted logs.
  • Add Profile properties to determine the Relay user's subscription from current data. In the future, we may query Mozilla Accounts for more detailed information on the user's plan term (monthly or yearly).
  • Add glean event hooks to the API and email processing. Add tests for the emitted logs.

Glean Data

This adds five Glean events that have been identified as critical for creating a user engagement model for the email mask feature:

  • email_mask.created - An email mask is created
  • email_mask.label_updated - An email mask's label is updated
  • email_mask.deleted - An email mask is deleted
  • email.forwarded - An email to or from an email mask is forwarded
  • email.blocked - An email to or from an email mask is not forwarded

Some data is sent with each event type, identified with a * below. Others are specific to event types.

Event Measurement Name Measurement Description
* #client_id Firefox client ID
* #fxa_id Mozilla accounts user ID
* #platform Relay client platform
* #n_random_masks Number of random masks
* #n_domain_masks Number of premium subdomain masks
* #n_deleted_random_masks Number of deleted random masks
* #n_deleted_domain_masks Number of deleted domain masks
* #date_joined_relay Timestamp for joining Relay, seconds since epoch
* #premium_status Subscription type and term
* #date_joined_premium Timestamp for starting premium_status subscription, seconds since epoch, -1 if not subscribed
* #has_extension The user has the Relay Add-on
* #date_got_extension Timestamp for adding Relay Add-on, seconds since epoch, -1 if not used
* #mask_id Mask ID, 'R' or 'D' followed by a number
* #is_random_mask The mask is a random mask, instead of a domain mask
email_mask.created #created_by_api The mask was created via the API, rather than an incoming email
email_mask.created #has_website The mask was created by the Add-on or integration on a website
email.* #is_reply The email is a reply from the Relay user
email.blocked #reason Code describing why the email was blocked
email.blocked #can_retry The email processes can retry the email, so it may fail multiple times

How to test:

  • Install the requirements
  • Run the script to regenerate the Python interface:
    bash .circleci/python_job.bash run build_glean
    • There are no errors. git does not show any changed files
  • Run the script to check the Python generated code:
    bash .circleci/python_job.bash run check_glean
    • There are no errors. Final output is ✓ Files are identical
  • Run the server. Log in as a user with Mozilla accounts telemetry turned on (the default). Create a mask.
    • Among others, there should be a glean-server-event log, followed by a request.summary log, entry like:
    {"Timestamp": 1709148528194074112, "Type": "glean-server-event", "Logger": "fx-private-relay", "Hostname": "jwhitlock-MB-M2.local", "EnvVersion": "2.0", "Severity": 6, "Pid": 87746, "Fields": {"document_namespace": "relay-backend", "document_type": "events", "document_version": "1", "document_id": "695328d4-24fb-42b3-bd4d-853e92efbc0d", "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0", "ip_address": "", "payload": "{\"metrics\": {}, \"events\": [{\"category\": \"email_mask\", \"name\": \"created\", \"extra\": {\"client_id\": \"\", \"fxa_id\": \"fac5c63eb1c04dd4b9b0192f1a5483cb\", \"platform\": \"\", \"n_random_masks\": \"4\", \"n_domain_masks\": \"0\", \"n_deleted_random_masks\": \"1\", \"n_deleted_domain_masks\": \"0\", \"date_joined_relay\": \"1709058088\", \"premium_status\": \"email_unknown\", \"date_joined_premium\": \"-1\", \"has_extension\": \"false\", \"date_got_extension\": \"-1\", \"mask_id\": \"R7\", \"is_random_mask\": \"true\", \"created_by_api\": \"true\", \"has_website\": \"false\"}, \"timestamp\": 1709148528194}], \"ping_info\": {\"seq\": 0, \"start_time\": \"2024-02-28T19:28:48.194018+00:00\", \"end_time\": \"2024-02-28T19:28:48.194018+00:00\"}, \"client_info\": {\"telemetry_sdk_build\": \"glean_parser v12.0.1\", \"first_run_date\": \"Unknown\", \"os\": \"Unknown\", \"os_version\": \"Unknown\", \"architecture\": \"Unknown\", \"app_build\": \"Unknown\", \"app_display_version\": \"unknown\", \"app_channel\": \"local\"}}", "msg": "glean-server-event"}}
    {"Timestamp": 1709148528194563072, "Type": "request.summary", "Logger": "fx-private-relay", "Hostname": "jwhitlock-MB-M2.local", "EnvVersion": "2.0", "Severity": 6, "Pid": 87746, "Fields": {"errno": 0, "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0", "lang": "en-US,en;q=0.5", "method": "POST", "path": "/api/v1/relayaddresses/", "uid": 3, "rid": "677474f3-172b-4e06-81f0-60ef2a558d8d", "t": 61}}
  • Change the label of the mask
    • Among others, there should be a glean-server-event log, followed by a request.summary log, like:
    {"Timestamp": 1709148608654010880, "Type": "glean-server-event", "Logger": "fx-private-relay", "Hostname": "jwhitlock-MB-M2.local", "EnvVersion": "2.0", "Severity": 6, "Pid": 87746, "Fields": {"document_namespace": "relay-backend", "document_type": "events", "document_version": "1", "document_id": "374e0998-ec59-46e9-a5f3-f26f7bdb3ece", "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0", "ip_address": "", "payload": "{\"metrics\": {}, \"events\": [{\"category\": \"email_mask\", \"name\": \"label_updated\", \"extra\": {\"client_id\": \"\", \"fxa_id\": \"fac5c63eb1c04dd4b9b0192f1a5483cb\", \"platform\": \"\", \"n_random_masks\": \"4\", \"n_domain_masks\": \"0\", \"n_deleted_random_masks\": \"1\", \"n_deleted_domain_masks\": \"0\", \"date_joined_relay\": \"1709058088\", \"premium_status\": \"email_unknown\", \"date_joined_premium\": \"-1\", \"has_extension\": \"false\", \"date_got_extension\": \"-1\", \"mask_id\": \"R7\", \"is_random_mask\": \"true\"}, \"timestamp\": 1709148608653}], \"ping_info\": {\"seq\": 0, \"start_time\": \"2024-02-28T19:30:08.653926+00:00\", \"end_time\": \"2024-02-28T19:30:08.653926+00:00\"}, \"client_info\": {\"telemetry_sdk_build\": \"glean_parser v12.0.1\", \"first_run_date\": \"Unknown\", \"os\": \"Unknown\", \"os_version\": \"Unknown\", \"architecture\": \"Unknown\", \"app_build\": \"Unknown\", \"app_display_version\": \"unknown\", \"app_channel\": \"local\"}}", "msg": "glean-server-event"}}
    {"Timestamp": 1709148608654702080, "Type": "request.summary", "Logger": "fx-private-relay", "Hostname": "jwhitlock-MB-M2.local", "EnvVersion": "2.0", "Severity": 6, "Pid": 87746, "Fields": {"errno": 0, "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0", "lang": "en-US,en;q=0.5", "method": "PATCH", "path": "/api/v1/relayaddresses/7/", "uid": 3, "rid": "d041f4b3-376d-40a0-8188-96073ab894aa", "t": 50}}
  • In the profile menu, select "Manage your account" to go to Mozilla accounts. Under "Data Collection and Use", "Help improve ⁨Mozilla accounts⁩", turn from "On" to "Off". Return to Relay, sign out and sign in again. Update the label for a mask.
    • There is no glean-server-event log, just the request.summary log, like:
    {"Timestamp": 1709148843867493888, "Type": "request.summary", "Logger": "fx-private-relay", "Hostname": "jwhitlock-MB-M2.local", "EnvVersion": "2.0", "Severity": 6, "Pid": 87746, "Fields": {"errno": 0, "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0", "lang": "en-US,en;q=0.5", "method": "GET", "path": "/api/v1/relayaddresses/", "uid": 3, "rid": "0abf6db1-7bd6-476f-a688-f73c9f38297e", "t": 5}}

Copy link
Member

@groovecoder groovecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked thru all the code; still need to test and spot-check. Biggest blocker is that we should NOT send a glean event when the email is blocked because of an error. We should only send a glean even when the email is blocked because of a user action.

api/tests/views_tests.py Outdated Show resolved Hide resolved
api/views/__init__.py Outdated Show resolved Hide resolved
Comment on lines +538 to +547
@property
def plan(self) -> Literal["free", "email", "phone", "bundle"]:
"""The user's Relay plan as a string."""
if self.has_premium:
if self.has_phone:
return "bundle" if self.has_vpn else "phone"
else:
return "email"
else:
return "free"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (non-blocking): should we consider using @cached_property for this? I seem to remember that has_premium or has_phone can be quite slow when they access self.fxa?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is doing very little that would justify caching. If has_premium and has_phone are slow, should they be the cached properties? Or fxa?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but I think caching those caused some other stale data issues last time we tried it. I thought maybe caching at this level wouldn't cause those issues?

emails/models.py Show resolved Hide resolved
emails/models.py Show resolved Hide resolved
privaterelay/glean_interface.py Show resolved Hide resolved
data_reviews:
- https://bugzilla.mozilla.org/show_bug.cgi?id=1882565
notification_emails:
- [email protected]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking): [email protected] ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Glean documents have no guidance, so I reached out to internal Slack channel. According to @perrymcmanis144, we have to have a single person for metrics that do not expire. We could add a mailing list, but not replace the single person email.

However, [email protected] doesn't look like a good choice to me. There are many people on that list, including people no longer on the project. Once I have some experience with the kinds of emails sent to this address, we could add a maintained list with the people who should get notices and can do something about them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the drive-by, I just wanted to add some additional context to what Perry shared with you.

The Glean docs lack guidance here because this is a Data Stewarding concern (see the wiki on Data Collection for more info). That being said, I'll file a bug in Glean to try and add some breadcrumbs to this information to make it easier to understand the expectations here.

Permanent data-collection policy from Trust & Legal currently requires an IC to be responsible for the collection, in order to audit it from time to time to determine that it is still useful and needed. Temporary collections that expire don't run the same risk of indefinitely collecting some data that might not be useful or problematic.

As a side-note, many teams/project create a specific group email to serve as a backup for an IC bearing this responsibility (should the IC leave or not be available, etc.).

telemetry/glean/relay-server-metrics.yaml Show resolved Hide resolved
telemetry/glean/relay-server-metrics.yaml Outdated Show resolved Hide resolved
"""Check and return parts of the Glean payload that vary over test runs."""
event_ts_ms = payload["events"][0]["timestamp"]
event_time = datetime.fromtimestamp(event_ts_ms / 1000.0)
assert 0 < (datetime.now() - event_time).total_seconds() < 0.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (non-blocking): why do we assert that the event time must have been less than 0.5 seconds ago?

Copy link
Member Author

@jwhitlock jwhitlock Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timestamp should be in milliseconds. I'm trying to find a way to say "if we calculate the timestamp wrong, then it will be off by a factor of 1,000". This only tests the x1000 bigger, however, not the x1000 smaller. Maybe it is not worth it.

@jwhitlock jwhitlock force-pushed the add-glean-mpp-3352 branch from bf15260 to f84a5ae Compare March 5, 2024 20:42
@jwhitlock
Copy link
Member Author

jwhitlock commented Mar 6, 2024

@groovecoder, this is ready for re-review. Changes since the last review:

  • The data review in bug 1882565 was approved. I'll return to that with the changes made since the review opened.
  • I added the user to create_expected_glean_event, to centralize the boilerplate for checking user-derived fields
  • I created an AddressViewSet base class to store the generalized perform_create, perform_update, and perform_destroy methods. I could not figure out a way to do this as a mixin that made mypy happy.
  • Add comments for the leading 'R' and 'D' for mask_id metrics.
  • Fixed a typo usersuser
  • Use a new function log_email_dropped to log details, with message email_dropped, for emails dropped for reasons other than mask blocking settings. Tests are updated to check for the new logs.
  • Glean email_blocked events are now only for emails dropped due to mask blocking settings. Dropped can_retry from the event extra fields, since these can not be re-tried like some errors.
  • Add some comments around turning Glean event time fields into Python datetimes.
  • Add # pragma: no cover for some code that only runs when tests fail
  • Fix a test that was exiting early instead of running all assertions.

I've opened two issues for follow-on work:

  • MPP-3764 - Store subscription details for users
  • MPP-3765 - Distinguish between Firefox Integration calls and other API calls

@jwhitlock jwhitlock requested a review from groovecoder March 6, 2024 21:45
assert drop_log is None, "duplicate email_dropped log entry"
drop_log = record
assert drop_log is not None, "email_dropped log entry not found."
assert drop_log.levelno == logging.INFO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (non-blocking): This looks like emails are logged as "dropped" when there's an error, so should they be logged at logging.ERROR level instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only use ERROR logging if I want SREs to monitor and to get tracking Sentry issues. I didn't want that for these, but do you?

@@ -724,6 +810,16 @@ def successful_reply_test_implementation(
assert (last_en := self.relay_address.user.profile.last_engagement) is not None
assert last_en > self.pre_reply_last_engagement

def assert_my_log_email_dropped(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quibble:

Suggested change
def assert_my_log_email_dropped(
def assert_reply_log_email_dropped(

Copy link
Member Author

@jwhitlock jwhitlock Mar 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, thanks! I went with assert_log_reply_email_dropped

event_time=timestamp,
)

def assert_my_log_email_dropped(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quibble:

Suggested change
def assert_my_log_email_dropped(
def assert_bounce_log_email_dropped(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's more than bounce tests, so I went with assert_log_incoming_email_dropped

},
)
request_data = RequestData.from_request(request)
assert request_data.user_agent == "Mozilla/5.0 Firefox/125.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quibble (non-blocking): test name says this test is for ip - do we need to assert on the user agent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test for just the user agent, and dropped the user agent from the IP tests.

jwhitlock added 23 commits March 8, 2024 10:59
The initial metrics for measuring mask usage are in
telemetry/glean/relay-server-metrics.yaml. This new folder can be used
for front-end Glean metrics and other metrics specifications.

glean_parser is used to generate privaterelay/glean/server_events.py and
the empty __init__.py.  The generated file is post-processed with black.
A function in .circleci/python_job.bash has the generation commands and
the post-processing commands. This is used by a new job to validate in
CircleCI that the generated file was not hand-modified. It can also be
used to regenerate the file locally (tested on Linux and macOS).
These should eventually be stored on the profile, but for now these
properties provide the known subscription status for the user.
RelayGleanLogger extends the EventsServerEventLogger to emit records as
logs instead of print() statements. It takes data as User and *Address
models, and extracts the data needed for the .record_* methods.
Pass the related user to the test helper create_expected_glean_event, so
that the user-specific values such as fxa_id and date_joined_relay can
be extracted in the helper rather than each test function.
AddressViewSet is a parent class for RelayAddressViewSet and
DomainAddressViewSet, to avoid duplicating glean event logging.
Because RelayAddress and DomainAddress are in different tables, a row ID
might be used between them. Use a leading letter "R" or "D" to
distinguish between the two.
Send Glean events for blocked emails due to mask setting, such as block
all or block promotional. Emit non-Glean logs for other reasons emails
are blocked.
Coverage showed the (incorrect) metrics tests was never executed.
@jwhitlock jwhitlock force-pushed the add-glean-mpp-3352 branch from 1a9d4e0 to 37e8b6c Compare March 8, 2024 17:03
@jwhitlock jwhitlock added this pull request to the merge queue Mar 8, 2024
Merged via the queue into main with commit a942ac3 Mar 8, 2024
28 checks passed
@jwhitlock jwhitlock deleted the add-glean-mpp-3352 branch March 8, 2024 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants