Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MPP-3439: Refactor DataIssueTask / CleanerTask, clean users with blank emails #4721

Merged
merged 29 commits into from
Jun 20, 2024

Conversation

jwhitlock
Copy link
Member

Some users in production have user.email == "". This prevents us from forwarding them any emails.

At least one also has an email in their Mozilla account data in the database. The user.email should have been set to their socialaccount.extra_data["email"] during signup, but some other issue must have prevented it from happening, or caused it to be cleared.

This PR adds a cleaner MissingEmailCleaner to the cleaner lists, so that this issue can be fixed in production. The cleaners are run daily as a Cron Job in production and other environments, so these and future incidents will be fixed.

The Refactor

The changes for MPP-3439 are small compared to the refactor of DataIssueTask and the derived CleanerTask. However, none of the tests for the existing email cleaners changed, so the refactor (hopefully!) works the same.

It is probably more enlightening to read the code in emails/cleaners.py and privaterelay/cleaners.py than to try to understand all the code changes, or the explanation that follows. The purpose of all the refactoring is to make these easier to read and write, and require less repetitive code. Feel free to skip ahead to "How to Test".

Previously a cleaner would derive from CleanerTask and define the methods:

  • _get_counts_and_data - Make database COUNT queries, and create a two-level dictionary of counts, a "summary" plus one for each model. Return the counts and querysets for getting the problem rows.
  • _clean - Take the "problem row" querysets, and update the rows or iterate over the rows and fix them one at a time. Return the count of fixed rows.
  • markdown_report - Generate a blob of text representing the data and how many rows were cleaned.

After the refactor, these functions have been generalized and moved up into the parent classes. Because there is quite a lot of shared code, it has been moved to privaterelay/cleaner_task.py, leaving privaterelay/cleaners.py for the new MissingEmailCleaner.

After the refactor, a data cleaner derives from CleanerTask and defines a data_specification with two models:

  • DataModelSpec - This represents a model, such a User or RelayAddress, as well as some details for data cleaning
  • DataBisectSpec - A filter of a model or a further filtering, as well as the opposite of that filter. For example, User.objects.filter(is_active=True) and User.objects.exclude(is_active=True) bisects the User table into two sets of rows. All the current cleaning operations can be represented as a series of filters that bisect the rows until the problem rows, if any, remain.

The data cleaner class also defines functions with the pattern clean_<plural model_name>. These are passed a queryset representing the problem rows, if any. The function fixes the issues if it can, and returns a count of the issues it fixed.

The CleanerTask uses some new classes (ReportItem, CleanedItem, BaseDataItem, DataModelItem, and DataItem) to implement querying, generating counts, running cleaning (in cooperation with the clean_x methods) and creating reports. These mostly hold and validate data, and are thoroughly tested.

The unused DetectorTask is gone. The base class DataIssueTask could do the same, or be extended if we ever want to use this framework to explore the data but not fix it.

Other changes:

  • Added verbose_name_plural for RelayAddress and DomainAddress. This changes them from "addresss" to "addresses" in the Django admin, and also means it can be used in DataTask where a dictionary key is needed.
  • Added "raise NotImplementedError" to the coverage ignore list, so @abstractmethod functions do not appear as uncovered.

How to test:

In your local environment, ensure you have at least one staff user (can access the Django admin) and one regular user (logs in with Mozilla Accounts).

As the staff user, go to http://127.0.0.1:8000/admin/auth/user/

  • From the user list, select your regular user.
  • From the user detail page, clear the email then click the "Save" button.
  • On the user list, confirm the regular user's email is clear.

On this branch, run:

./manage.py cleanup_data -v 2 --missing-email --clean

The output contains a log entry and a text report, and is something like:

{"Timestamp": 1716414385805936896, "Type": "eventsinfo.cleanup_data", "Logger": "fx-private-relay", "Hostname": "jwhitlock-MB-M2.lan", "EnvVersion": "2.0", "Severity": 6, "Pid": 17036, "Fields": {"cleaned": true, "timers": {"query_s": 0.002, "clean_s": 0.002}, "tasks": {"missing-email": {"counts": {"summary": {"ok": 5, "needs_cleaning": 1, "cleaned": 1}, "users": {"all": 7, "email": 5, "!email": 2, "!email.fxa": 1, "cleaned": 1}}, "timers": {"query_s": 0.002, "clean_s": 0.002}}}, "rid": null, "msg": "cleanup_data complete, cleaned 1 of 1 issue."}}

# Summary

|    task     |issues|fixed|query (s)|fix (s)|
|------------:|-----:|----:|--------:|------:|
|missing-email|     1|    1|    0.002|  0.002|
|      _Total_|     1|    1|    0.002|  0.002|

# Details
## missing-email
When User.email is empty, we are unable to forward email to the Relay user. We
can get the email from the FxA profile if available.

Detected 1 issue in 0.002 seconds.
Cleaned 1 issue in 0.002 seconds.

Users:
  All: 7
    Email   : 5 ( 71.4%)
    No Email: 2 ( 28.6%)
      Has FxA: 1 ( 50.0%)
        Cleaned: 1 (100.0%)

As the staff user, refresh the user list at http://127.0.0.1:8000/admin/auth/user/ . The regular user should have their email again.

As a bonus, run all the cleaners:

./manage.py cleanup_data -v 2 --clean

jwhitlock added 25 commits May 17, 2024 12:41
Adds __repr__ and __eq__ for DataItem, further adjusts interfaces.
The tests exposed that DataItem was too generic, and required too many
run-time checks to ensure it was used correctly. Instead, factor to a
hierarchy with more constraints.
Fix users whose primary email was not set from their FxA data.
Be more consistant with Item suffix, and save ReportEntry for new data
structure.
This method collects the relevant data items, injects CleanedItem
records, and sorts in print order along with the data needed to generate
reports. It works in some corner cases that the markdown_report() method
would print in the wrong order.
All the current clean_users, clean_relay_addresses, etc. just used
item.get_queryset(), so pass this instead of the DataItem.
@jwhitlock jwhitlock force-pushed the clean-users-with-blank-emails-mpp-3439 branch from 7fe3563 to d4513d6 Compare May 22, 2024 22:51
@jwhitlock jwhitlock requested a review from groovecoder May 22, 2024 22:55
Copy link
Member

@groovecoder groovecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After I cleared a user's email and ran ./manage.py cleanup_data -v 2 --missing-email --clean I saw this ...

{"Timestamp": 1717186451021793024, "Type": "eventsinfo.cleanup_data", "Logger": "fx-private-relay", "Hostname": "groovetop", "EnvVersion": "2.0", "Severity": 6, "Pid": 19675, "Fields": {"cleaned": true, "timers": {"query_s": 0.008, "clean_s": 0.001}, "tasks": {"missing-email": {"counts": {"summary": {"ok": 3, "needs_cleaning": 0, "cleaned": 0}, "users": {"all": 4, "email": 3, "!email": 1, "!email.fxa": 0, "cleaned": 0}}, "timers": {"query_s": 0.008, "clean_s": 0.001}}}, "rid": null, "msg": "cleanup_data complete, cleaned 0 of 0 issues."}}

# Summary

|    task     |issues|fixed|query (s)|fix (s)|
|------------:|-----:|----:|--------:|------:|
|missing-email|     0|    0|    0.008|  0.001|
|      _Total_|     0|    0|    0.008|  0.001|

# Details
## missing-email
When User.email is empty, we are unable to forward email to the Relay user. We
can get the email from the FxA profile if available.

Detected 0 issues in 0.008 seconds.
Cleaned 0 issues in 0.001 seconds.

Users:
  All: 4
    Email   : 3 ( 75.0%)
    No Email: 1 ( 25.0%)
      Has FxA: 0 (  0.0%)

And the user's email address was not restored? 😢

@jwhitlock
Copy link
Member Author

jwhitlock commented Jun 3, 2024

Users:
All: 4
Email : 3 ( 75.0%)
No Email: 1 ( 25.0%)
Has FxA: 0 ( 0.0%)


And the user's email address was not restored? 😢

It restores the email by copying it from the FxA profile, and the user without an email doesn't have an FxA profile (Has FxA: 0). Sorry, "Regular User" meant a user with an FxA profile.

Copy link
Member

@groovecoder groovecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of all the refactoring is to make these easier to read and write, and require less repetitive code.

suggestion (non-blocking): Maybe it’s because I’ve been gone for a week, but this refactored data cleaner code doesn’t seem easier to read? Or at least, not for a "first-time" reader of the code. I had to debug it and to find the queryset filter() call that was failing, I had to navigate thru MissingEmailCleaner, CleanerTask, DataModelSpec, DataBisectSpec, BaseDataItem, etc. ... and I'm still not sure I understand it all.

Do you remember which parts of the refactor were the ones you meant specifically for read- and write-ability? Any chance of teasing those apart? The refactor that was meant to reduce code duplication seems to move a couple hundred lines of duplicate code into 800 lines of more abstract code?

model=User,
subdivisions=[
DataBisectSpec("email", ~Q(email__exact="")),
DataBisectSpec("!email.fxa", Q(socialaccount__provider="fxa")),
Copy link
Member

@groovecoder groovecoder Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hard-coded fxa here means that the application must be set up with "fxa" as the name of the SocialAccount SocialApp name or this code will fail to find the user's FXA, which caused my error with the spot-check.

suggestion (blocking): either change this value to check for something like Q(socialaccount__provider=SocialAccount.SocialApp.get(provider="fxa").name) (not exactly sure of the syntax), or let's update the README directions to always use "fxa" for the provider id+name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you tell me more about the bug you were seeing? The query Q(socialaccout_provider="fxa") means what you are suggesting - the SocialAccount where provider=fxa, not the name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I might be getting SocialAccount and SocialApplication mixed up - I'll dig in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I'm not. SocialApp has a provider that is fxa for Firefox Accounts, and SocialAccount should have provider set to fxa as well. What is SocialAccount.provider for the test user that wasn't working?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SocialAccount.provider for the user was accounts.stage.mozaws.net when it wasn't working, which seems to be copied from SocialApplication at creation-time ...

Test 1: SocialApplication "fxa"

image
  1. Sign-up new user
  2. Check http://127.0.0.1:8000/admin/socialaccount/socialaccount/
    • provider is "fxa"
  3. Delete the User.email value
  4. Run python manage.py cleanup_data -v 2 --missing-email --clean
    • Fixes the issue! (Cleaned: 1 (100.0%))

Test 2: SocialApplication "accounts.stage.mozaws.net"

image
  1. Change SocialApplication "Provider ID" and "Name" value to accounts.stage.mozaws.net
  2. Sign-up new user
  3. Check http://127.0.0.1:8000/admin/socialaccount/socialaccount/
    • provider is "accounts.stage.mozaws.net"
  4. Delete the User.email value
  5. Run python manage.py cleanup_data -v 2 --missing-email --clean
    • Doesn't find the FXA! (Has FxA: 0 ( 0.0%))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks. Mine is looks like this:

Screenshot 2024-06-13 at 12 58 02 PM

The optional provider_id was added in 0.55.0. I'll dig in and see what that is for.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR - you should not be setting provider_id, leave it blank like it is in dev, stage, and prod. However, I'll update the code to handle it.

The documentation is not very helpful. There's specific provider docs, and this doc for allauth.socialaccount.adapter.DefaultSocialAccountAdapter.get_provider():

Looks up a provider, supporting subproviders by looking up by provider_id.

https://docs.allauth.org/en/latest/socialaccount/adapter.html#allauth.socialaccount.adapter.DefaultSocialAccountAdapter.get_provider

The code is a little more helpful:

For providers that support subproviders, such as OpenID Connect and SAML, this ID identifies that instance. SocialAccount's originating from app will have their provider field set to the provider_id if available, else provider.

https://github.com/pennersr/django-allauth/blob/9be5ef00e8d1ea697baf046c00957f259133aae5/allauth/socialaccount/models.py#L47-L55

When a provider is providing OpenID Connect or SAML, then django-allauth has been removing the provider and using the generic provider:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit fixes the provider_id issue, with tests.

@jwhitlock
Copy link
Member Author

The "cleaner" part is in the cleaning tasks. It went from these 127 lines:

class ServerStorageCleaner(CleanerTask):
slug = "server-storage"
title = "Ensure no data is stored when server_storage=False"
check_description = (
"When Profile.server_storage is False, the addresses (both regular and domain)"
" should have empty data (the fields description, generated_for and used_on)."
)
def _get_counts_and_data(self) -> tuple[Counts, CleanupData]:
"""
Analyze usage of the server_storage flag and server-stored data.
Returns:
* counts: two-level dict of row counts for Profile, RelayAddress, and
DomainAddress
* cleanup_data: two-element dict of RelayAddresses and DomainAddress
queries to clear
"""
profiles_without_server_storage = Profile.objects.filter(server_storage=False)
no_store_relay_addresses = RelayAddress.objects.filter(
user__profile__server_storage=False
)
no_store_domain_addresses = DomainAddress.objects.filter(
user__profile__server_storage=False
)
blank_used_on = Q(used_on="") | Q(used_on__isnull=True)
blank_relay_data = blank_used_on & Q(description="") & Q(generated_for="")
blank_domain_data = blank_used_on & Q(description="")
empty_relay_addresses = no_store_relay_addresses.filter(blank_relay_data)
empty_domain_addresses = no_store_domain_addresses.filter(blank_domain_data)
non_empty_relay_addresses = no_store_relay_addresses.exclude(blank_relay_data)
non_empty_domain_addresses = no_store_domain_addresses.exclude(
blank_domain_data
)
empty_relay_addresses_count = empty_relay_addresses.count()
empty_domain_addresses_count = empty_domain_addresses.count()
non_empty_relay_addresses_count = non_empty_relay_addresses.count()
non_empty_domain_addresses_count = non_empty_domain_addresses.count()
counts: Counts = {
"summary": {
"ok": empty_relay_addresses_count + empty_domain_addresses_count,
"needs_cleaning": non_empty_relay_addresses_count
+ non_empty_domain_addresses_count,
},
"profiles": {
"all": Profile.objects.count(),
"no_server_storage": profiles_without_server_storage.count(),
},
"relay_addresses": {
"all": RelayAddress.objects.count(),
"no_server_storage": no_store_relay_addresses.count(),
"no_server_storage_or_data": empty_relay_addresses_count,
"no_server_storage_but_data": non_empty_relay_addresses_count,
},
"domain_addresses": {
"all": DomainAddress.objects.count(),
"no_server_storage": no_store_domain_addresses.count(),
"no_server_storage_or_data": empty_domain_addresses_count,
"no_server_storage_but_data": non_empty_domain_addresses_count,
},
}
cleanup_data: CleanupData = {
"relay_addresses": non_empty_relay_addresses,
"domain_addresses": non_empty_domain_addresses,
}
return counts, cleanup_data
def _clean(self) -> int:
"""Clean addresses with unwanted server-stored data."""
counts = self.counts
cleanup_data = self.cleanup_data
counts["relay_addresses"]["cleaned"] = cleanup_data["relay_addresses"].update(
description="", generated_for="", used_on=""
)
counts["domain_addresses"]["cleaned"] = cleanup_data["domain_addresses"].update(
description="", used_on=""
)
return (
counts["relay_addresses"]["cleaned"] + counts["domain_addresses"]["cleaned"]
)
def markdown_report(self) -> str:
"""Report on server-stored data found and optionally cleaned."""
def model_lines(name: str, counts: dict[str, int]) -> list[str]:
"""Get the report lines for a model (Profile, Relay Addr., Domain Addrs.)"""
total = counts["all"]
lines = [f"{name}:", f" All: {total}"]
if total == 0:
return lines
no_server_storage = counts["no_server_storage"]
lines.append(
" Without Server Storage: "
f"{self._as_percent(no_server_storage, total)}"
)
if no_server_storage == 0:
return lines
no_data = counts.get("no_server_storage_or_data")
has_data = counts.get("no_server_storage_but_data")
if no_data is None or has_data is None:
return lines
lines.extend(
[
" No Data : "
f"{self._as_percent(no_data, no_server_storage)}",
" Has Data: "
f"{self._as_percent(has_data, no_server_storage)}",
]
)
cleaned = counts.get("cleaned")
if cleaned is None or has_data == 0:
return lines
lines.append(f" Cleaned: {self._as_percent(cleaned, has_data)}")
return lines
lines: list[str] = (
model_lines("Profiles", self.counts["profiles"])
+ model_lines("Relay Addresses", self.counts["relay_addresses"])
+ model_lines("Domain Addresses", self.counts["domain_addresses"])
)
return "\n".join(lines)

to these 50:

class ServerStorageCleaner(CleanerTask):
slug = "server-storage"
title = "Ensure no data is stored when server_storage=False"
check_description = (
"When Profile.server_storage is False, the addresses (both regular and domain)"
" should have empty data (the fields description, generated_for and used_on)."
)
_blank_used_on = Q(used_on="") | Q(used_on__isnull=True)
data_specification = [
# Report on how many users have turned off server storage
DataModelSpec(
Profile,
[DataBisectSpec("server_storage", "user__profile__server_storage")],
omit_key_prefixes=["server_storage"],
metric_name_overrides={"!server_storage": "no_server_storage"},
report_name_overrides={"!server_storage": "Without Server Storage"},
)
] + [
# Detect users with no server storage but address records with data
DataModelSpec(
AddressModel,
[
DataBisectSpec("server_storage", "user__profile__server_storage"),
DataBisectSpec("!server_storage.empty", blank_data),
],
omit_key_prefixes=["server_storage"],
metric_name_overrides={
"!server_storage": "no_server_storage",
"!server_storage.empty": "no_server_storage_or_data",
"!server_storage.!empty": "no_server_storage_but_data",
},
report_name_overrides={
"!server_storage": "Without Server Storage",
"!server_storage.empty": "No Data",
"!server_storage.!empty": "Has Data",
},
ok_key="!server_storage.empty",
needs_cleaning_key="!server_storage.!empty",
)
for AddressModel, blank_data in [
(RelayAddress, _blank_used_on & Q(description="") & Q(generated_for="")),
(DomainAddress, _blank_used_on & Q(description="")),
]
]
def clean_relay_addresses(self, queryset: QuerySet[RelayAddress]) -> int:
return queryset.update(description="", generated_for="", used_on="")
def clean_domain_addresses(self, queryset: QuerySet[DomainAddress]) -> int:
return queryset.update(description="", used_on="")

The retained stuff:

  • slug - used on the command line
  • title- used on reports
  • check_description - Used for help output/

The new stuff:

  • data_specification - Says what model(s) we're looking at, what queries can get the problematic rows / items, how they should show up in logs, and how they should show up in human-level reports

The dropped stuff:

  • _get_counts_and_data() - This hand-coded method is replaced by generic methods
  • markdown_report() - This hand-coded method is replaced by generic methods

The altered stuff:

  • _clean() - This hand-coded method becomes smaller methods like clean_relay_addresses and clean_domain_addresses, often one-liners, called by generic methods

Do you think some more documentation would help? Or would you like it broken into smaller pieces that don't try to do so much at once?

@groovecoder
Copy link
Member

Do you think some more documentation would help? Or would you like it broken into smaller pieces that don't try to do so much at once?

I think it's the number of "generic methods" calling back and forth, using more generic types & objects that made it harder to debug the specific issue I was having.

@@ -84,10 +92,13 @@ def test_missing_email_cleaner_data_to_clear() -> None:
assert report == expected


@pytest.mark.parametrize("provider_id", ("", "fxa", "test.example.com"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: perfect, thanks!

@jwhitlock jwhitlock added this pull request to the merge queue Jun 20, 2024
Merged via the queue into main with commit 5cbdedf Jun 20, 2024
29 checks passed
@jwhitlock jwhitlock deleted the clean-users-with-blank-emails-mpp-3439 branch June 20, 2024 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants