
MPP-3815: Handle broken email processing #4702

Merged (9 commits, May 13, 2024)

Conversation

jwhitlock (Member)

This PR makes several changes to ./manage.py process_emails_from_sqs, now that it is being used to process the dead letter queue (DLQ) as well as the incoming emails. This PR replaces PR #4689. The changes to processing are:

  • Messages are now processed in a subprocess, which allows updating the healthcheck file and aborting processing when it appears stuck. A new setting, PROCESS_EMAIL_MAX_SECONDS_PER_MESSAGE (default 120), sets the per-message time limit. This will stop the email tasks and the DLQ task from getting killed by Kubernetes after 120 seconds.
  • The processor now captures exceptions and sends them to Sentry. This will stop the email tasks and the DLQ task from exiting on an unhandled exception.
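The subprocess-with-timeout flow could be sketched roughly as below. This is an illustration, not the PR's actual code: `run_with_timeout` is a hypothetical helper, and the `"fork"` context is an illustrative choice (it keeps module-level functions picklable by reference on Linux).

```python
import multiprocessing
import time

def run_with_timeout(func, args, max_seconds):
    """Run func(*args) in a subprocess; abort it if it exceeds max_seconds."""
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=1) as pool:
        async_result = pool.apply_async(func, args)
        try:
            return True, async_result.get(timeout=max_seconds)
        except multiprocessing.TimeoutError:
            # Abort the stuck worker ourselves, instead of letting
            # Kubernetes kill the whole pod after its own deadline.
            pool.terminate()
            return False, f"Timed out after {max_seconds} seconds."
        except Exception as exc:
            # In the real command, the exception would be reported to Sentry here.
            return False, exc

ok, result = run_with_timeout(pow, (2, 10), 5.0)         # fast task completes
stuck, error = run_with_timeout(time.sleep, (30,), 0.2)  # slow task is aborted
```

The key property is that a hung message handler can be terminated from the parent, which a thread cannot be.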

Additional changes:

  • Remove process_delayed_emails_from_sqs.py, which is now unused and had no tests
  • Add dependencies mypy-boto3-sns, mypy-boto3-sqs, and mypy-boto3-s3 for more boto3 type hints
  • Add type hints to all code in process_emails_from_sqs.py and its tests
  • Add tests for above functionality

How to test

If you have your local environment set up for email processing from a queue, enable your AWS credentials, run ./manage.py process_emails_from_sqs, and send some emails. You can also push the branch to Heroku and test it there.

jwhitlock added 9 commits May 13, 2024 09:34
This was replaced by process_emails_from_sqs with different parameters,
such as using the DLQ queue and deleting failed messages.
If this retry logic had run, it would have emitted counter metrics and error logs.
Since there are none, we can assume this path never happens, or is rare enough
that we can remove the logic and wait until it happens again.
Few changes are needed in the code. One change: the type hints say that
queue metrics, like ApproximateNumberOfMessages, are strings, so this
converts them to integers.
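The string-to-int conversion could look like the sketch below. `queue_backlog` and the fake queue object are hypothetical; the `TYPE_CHECKING` guard reflects that mypy-boto3-sqs provides stubs for type checking only and need not be installed at runtime.

```python
from types import SimpleNamespace
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Stub-only import from the mypy-boto3-sqs dev dependency.
    from mypy_boto3_sqs.service_resource import Queue

def queue_backlog(queue: "Queue") -> int:
    # Per the stubs, SQS attribute values are strings, so metrics like
    # ApproximateNumberOfMessages need an int() conversion before arithmetic.
    return int(queue.attributes["ApproximateNumberOfMessages"])

# Illustrative stand-in for a boto3 Queue resource:
fake_queue = SimpleNamespace(attributes={"ApproximateNumberOfMessages": "42"})
backlog = queue_backlog(fake_queue)
```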
Because of a reused variable, the queue data and the cycle number were
not present in the log.
Avoid .update() because 1) it requires more lines of code and 2) it is
not very compatible with TypedDict, and I hope to use that more in the
future.
multiprocessing runs the target function in a subprocess rather than a
thread. This is slower to start up than a thread, and requires
django.setup() to initialize the application. However, it does allow us
to terminate a stuck process, which happens sometimes in email
processing and frequently in DLQ processing.
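The initializer requirement could be sketched as below. The real command would pass django.setup as (or inside) the pool initializer; here a stand-in flag keeps the sketch runnable without Django, and the `"fork"` context is an illustrative choice.

```python
import multiprocessing

_READY = False

def init_worker() -> None:
    # Stand-in for django.setup(): each fresh subprocess must initialize
    # the application before any task touches models or settings.
    global _READY
    _READY = True

def check_ready() -> bool:
    return _READY

ctx = multiprocessing.get_context("fork")
with ctx.Pool(processes=1, initializer=init_worker) as pool:
    ready = pool.apply_async(check_ready).get(timeout=10)
# ready is True in the worker, while the parent's _READY stays False:
# the initializer ran in the subprocess before its first task.
```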
@jwhitlock jwhitlock requested a review from groovecoder May 13, 2024 14:54

@groovecoder groovecoder left a comment


Nice code; looks good; spot-check works well; tests pass. Just a couple clarifying questions.

Comment on lines +373 to +374
The retry logic was removed in May 2024 when no throttling or pause errors were
registered in the previous 6 months.
groovecoder (Member) commented:

praise: nice doc.

assert rec2_extra["success"] is True
assert rec2_extra["message_process_time_s"] < 120.0
assert rec2_extra["subprocess_setup_time_s"] == 1.0
assert mock_process_pool_future._timeouts == [1.0]
groovecoder (Member) commented:

question (non-blocking): I don't understand this _timeouts property: why is there a 1.0 value in the _timeouts when the message succeeded? That seems like there should be no timeout?

jwhitlock (Member, Author) replied:

Good question! I got the mock backwards. This line:

mock_process_pool_future._is_stalled.side_effect = [False, False, True]

should be:

mock_process_pool_future._is_stalled.side_effect = [True, True, False]

_timeouts is [1.0] because wait() was called once. When you call future.wait(1.0) on the mock future, it runs:

    def call_wait(timeout: float) -> None:
        mocked_clocks(timeout)
        mock_future._timeouts.append(timeout)
        if not mock_future._is_stalled():
            mock_future._ready = True
            try:
                ret = func(*args)
            except BaseException as e:
                if error_callback:
                    error_callback(e)
            else:
                if callback:
                    callback(ret)

So, _timeouts always has a value when future.wait() is called.
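A simplified, self-contained version of that mock mechanism, using the corrected [True, True, False] order from the reply (stalled on the first two polls, done on the third), could look like this. The names mirror the test helper but the sketch is illustrative:

```python
from unittest import mock

mock_future = mock.Mock()
mock_future._timeouts = []
mock_future._ready = False
mock_future._is_stalled = mock.Mock(side_effect=[True, True, False])

def call_wait(timeout: float) -> None:
    mock_future._timeouts.append(timeout)  # recorded on every wait() call
    if not mock_future._is_stalled():
        mock_future._ready = True

mock_future.wait = mock.Mock(side_effect=call_wait)

# The polling loop waits in 1-second slices until the future is ready.
while not mock_future._ready:
    mock_future.wait(1.0)
# _timeouts is [1.0, 1.0, 1.0]: three polls, the last of which succeeded.
```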

assert rec2_extra["error"] == "Timed out after 120.0 seconds."
assert rec2_extra["message_process_time_s"] >= 120.0
assert rec2_extra["subprocess_setup_time_s"] == 1.0
assert mock_process_pool_future._timeouts == [1.0] * 60
groovecoder (Member) commented:

question (non-blocking): Again I don't understand this _timeouts property: why is it a list of 60 1.0 values when the message timed out after 120 seconds? That seems like there should be 120 1.0 values?

jwhitlock (Member, Author) replied:

Good question! The loop includes a call to time.monotonic, which also increments the mocked clock, so each loop is 2 fake seconds. However, it makes more sense if future.wait does not increment the clock. I'll make that change.
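The 60-versus-120 discrepancy can be reproduced with a toy mocked clock (illustrative, not the test's actual helper): when both the wait and the monotonic check advance fake time, each iteration costs 2 fake seconds, so a 120-second deadline allows only 60 waits.

```python
class MockClock:
    """Fake monotonic clock that advances only when explicitly ticked."""
    def __init__(self) -> None:
        self.now = 0.0
    def tick(self, seconds: float = 1.0) -> float:
        self.now += seconds
        return self.now

clock = MockClock()
waits = 0
while clock.now < 120.0:
    clock.tick(1.0)  # mocked future.wait(1.0)
    waits += 1
    clock.tick(1.0)  # mocked time.monotonic() also advances the fake clock
# waits ends at 60, not 120: the monotonic() check consumed half the fake time.
```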

@groovecoder groovecoder added this pull request to the merge queue May 13, 2024
Merged via the queue into main with commit 7a76734 May 13, 2024
28 checks passed
@groovecoder groovecoder deleted the handle-broken-email-processing-mpp-3815 branch May 13, 2024 15:59

sentry-io bot commented May 19, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ ParamValidationError: Parameter validation failed: emails.utils in ses_send_raw_email

