Cleanup stale files #3584
Changes from 4 commits
@@ -11,6 +11,7 @@
from contentcuration.utils.garbage_collect import clean_up_contentnodes
from contentcuration.utils.garbage_collect import clean_up_deleted_chefs
from contentcuration.utils.garbage_collect import clean_up_feature_flags
from contentcuration.utils.garbage_collect import clean_up_stale_files
from contentcuration.utils.garbage_collect import clean_up_tasks
@@ -33,5 +34,7 @@ def handle(self, *args, **options):
        clean_up_deleted_chefs()
        logging.info("Cleaning up feature flags")
        clean_up_feature_flags()
        logging.info("Cleaning up stale file objects")
        clean_up_stale_files()
Review comment (rtibbles): This function is only cleaning up 10,000 files at a time, so each run of the garbage collection job (which runs daily on production) will only clear up 10,000 files - instead we should be running until there are no more files that meet the criteria. One possibility would be to return the number of deleted files, and keep on trying to delete until the number of deleted files is zero. If we don't do this, it will take us approximately 2 years to clear all the existing 'orphaned' file objects at just 10,000 a day!

Reply: @rtibbles It's actually 100k, not 10k. We decided to limit this because we don't know how long deleting the 7 million would take. I think the actual limit could be adjusted. We looked at what the cron schedule of the job was in production, and a very conservative value was chosen.

Reply (rtibbles): Ah, I assumed that this was intended as a batching parameter to limit the length of any particular delete query. Even with 100,000 it will still be 70 days before the backlog is cleared. Is there a specific concern with the scheduled job taking a long time? It seems like it will be a one-time long run, and after that it will just be keeping on top of it.

Reply: We saw that the cron jobs are scheduled to run every 24 hours. While looking at Django's delete() method, we noticed that it also triggers the deletion signals, which could take some time to process, along with the memory usage of deleting millions of file objects in one go. The potential for the job to run past 24 hours was our main concern for the limit.

Reply (rtibbles): There seem to be two concerns here - one about limiting the memory usage and time of the deletion operation, and one about the run time of the cron job. I think we should be limiting the memory usage, and could even reduce the batch size further to remove any concerns. For the cron job, I think it is more important that it runs to completion than that it finish within 24 hours, as this runs in parallel with the other deletions happening in the cron job. Without ensuring that all the file objects matching these criteria are deleted, we continue to run the risk of a file table filled with an ever-expanding number of 'junk' files. In terms of concerns about signals, there is one signal currently attached to the post_delete signal of the File object, which recalculates file storage (see studio/contentcuration/contentcuration/models.py, line 2232, at commit 4444843).

Reply: Since there is quite a bit of unknown here, it seems like it would be safer to limit it then. So how about a limit of 500k? Or 1 million?

Reply (rtibbles): I am fine with a higher limit - do we still want to batch, though, to limit memory usage? Presumably for the […]

Reply: Seems like it doesn't do any batching, but it does check if there are any signal handlers to see whether it needs to load the model representations - this might explain why the other delete methods for garbage collection are explicitly disabling the signal handlers with a context manager. https://docs.djangoproject.com/en/3.2/ref/models/querysets/#delete

Reply: So if we are going to increase the limit, we probably need to disable the signal handlers, rather than just deferring the user storage calculation. However, it seems like we don't actually need to run the user storage calculation at all: by definition, these files are not part of any active trees, and the user storage calculation only considers active trees.

Reply: Got it, I will update the function to include a higher limit and remove the storage calculation. Do you have any good examples and/or links for batching?
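A minimal sketch of the delete-until-empty loop discussed in this thread. This is illustrative only, not code from the PR: the function name and BATCH_SIZE are hypothetical, and the pk__in slicing assumes PostgreSQL (MySQL does not allow LIMIT inside an IN subquery).

```python
import datetime

from django.utils.timezone import now

from contentcuration.models import File

BATCH_SIZE = 100000  # hypothetical value; the thread discusses 100k up to 1M


def clean_up_stale_files_in_batches(last_modified=None):
    """Delete stale File rows in bounded batches until none remain."""
    if last_modified is None:
        last_modified = now() - datetime.timedelta(days=30)

    stale_files = File.objects.filter(
        contentnode__isnull=True,
        assessment_item__isnull=True,
        slideshow_slide__isnull=True,
        modified__lt=last_modified,
    )

    total_deleted = 0
    while True:
        # QuerySet.delete() rejects sliced querysets, so bound each batch with
        # an IN (... LIMIT n) subquery instead; the queryset is re-evaluated on
        # every iteration, so each pass picks up the next batch.
        deleted, _ = File.objects.filter(
            pk__in=stale_files.values_list("pk", flat=True)[:BATCH_SIZE]
        ).delete()
        total_deleted += deleted
        if deleted == 0:
            return total_deleted
```

As discussed above, a real implementation would also disable (or drop) the post_delete storage-recalculation signal before looping over millions of rows.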
        logging.info("Cleaning up tasks")
        clean_up_tasks()
@@ -1,4 +1,5 @@
import json
import random
import uuid
from datetime import datetime
from datetime import timedelta
@@ -11,6 +12,8 @@
from django.core.files.storage import default_storage
from django.urls import reverse_lazy
from le_utils.constants import content_kinds
from le_utils.constants import file_formats
from le_utils.constants import format_presets

from contentcuration import models as cc
from contentcuration.api import activate_channel
@@ -23,6 +26,7 @@
from contentcuration.utils.garbage_collect import clean_up_contentnodes
from contentcuration.utils.garbage_collect import clean_up_deleted_chefs
from contentcuration.utils.garbage_collect import clean_up_feature_flags
from contentcuration.utils.garbage_collect import clean_up_stale_files
from contentcuration.utils.garbage_collect import clean_up_tasks
from contentcuration.utils.garbage_collect import get_deleted_chefs_root
from contentcuration.views.internal import api_commit_channel
@@ -396,3 +400,56 @@ def test_recent_task(self):
            TaskResult.objects.get(task_id=self.recent_task.task_id)
        except TaskResult.DoesNotExist:
            self.fail("Task was removed")


TWO_DAYS_AGO = datetime.now() - timedelta(days=2)


def _create_stale_file(user, modified_date):
    checksum = '%32x' % random.getrandbits(16 * 8)

Review comment (on the checksum line): Cool!

    file = File(
        file_size=5,
        checksum=checksum,
        original_filename="SomeAudio.mp3",
        file_on_disk=f"{checksum}.mp3",
        file_format_id=file_formats.MP3,
        preset_id=format_presets.AUDIO,
        uploaded_by=user,
        duration=365,
    )
    file.save()

    # Manually set the last modified date of the file
    File.objects.filter(id=file.id).update(modified=modified_date)

    return file


class CleanupStaleFilesTestCase(StudioTestCase):

    def setUp(self):
        user = self.admin_user

        # create test stale files
        self.file_to_be_deleted = _create_stale_file(user, THREE_MONTHS_AGO)
        self.file_to_keep = _create_stale_file(user, TWO_DAYS_AGO)

        # run
        clean_up_stale_files()

    def test_deletes_correct_files(self):
        """
        Function should only delete files where ContentNode, AssessmentItem, and SlideshowSlide
        are null and the modified date is older than the date specified
        """
        with self.assertRaises(File.DoesNotExist):
            File.objects.get(id=self.file_to_be_deleted.id)

    def test_doesnt_delete_files_with_modified_date_less_than_date_specified(self):
        """
        Function should not delete files where the modified date is more recent than the date specified
        """
        try:
            File.objects.get(id=self.file_to_keep.id)
        except File.DoesNotExist:
            self.fail("File was deleted")
@@ -7,7 +7,9 @@

from celery import states
from django.conf import settings
from django.db.models import OuterRef
from django.db.models.expressions import CombinedExpression
from django.db.models.expressions import Exists
from django.db.models.expressions import F
from django.db.models.expressions import Value
from django.db.models.signals import post_delete
@@ -16,6 +18,7 @@

from contentcuration.constants import feature_flags
from contentcuration.db.models.functions import JSONObjectKeys
from contentcuration.decorators import delay_user_storage_calculation
from contentcuration.models import ContentNode
from contentcuration.models import File
from contentcuration.models import TaskResult
@@ -145,3 +148,20 @@ def clean_up_tasks():
    count, _ = TaskResult.objects.filter(date_done__lt=date_cutoff, status__in=states.READY_STATES).delete()

    logging.info("Deleted {} completed task(s) from the task table".format(count))


@delay_user_storage_calculation
def clean_up_stale_files(last_modified=now() - datetime.timedelta(days=30)):
Review comment: Might be good to define the default inside the function scope. At the moment, by defining it in the function definition, the default is evaluated only once, when the module is imported, rather than each time the function is called. The usual Python pattern for this is to set a default of None and compute the actual value inside the function when the argument is None. The big gotcha that can sometimes happen here is if the default value is set to a mutable type like a dict, and then successive invocations of the function can actually modify and share the default value. See https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments for more about that.

Reply: That is a good point that I did not realize, thank you! I will make those changes and update the PR.
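A minimal sketch of the None-default pattern described above (illustrative only, not the code merged in this PR):

```python
import datetime

from django.utils.timezone import now


def clean_up_stale_files(last_modified=None):
    # A default computed in the signature would be evaluated once, at import
    # time, and grow stale in a long-lived process; None defers it to call time.
    if last_modified is None:
        last_modified = now() - datetime.timedelta(days=30)
    ...
```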
    """
    Clean up files that aren't attached to any ContentNode, AssessmentItem, or SlideshowSlide and where
    the modified date is older than `last_modified`
    """
    files_to_clean_up = File.objects.filter(
        contentnode__isnull=True, assessment_item__isnull=True, slideshow_slide__isnull=True, modified__lt=last_modified
    )

    # Uses a subset of Python's array-slicing syntax to limit deletion to 100000 files per function run
    files_deleted, _ = File.objects.filter(Exists(files_to_clean_up.filter(pk=OuterRef('pk'))[:100000])).delete()

    logging.info("Files with a modified date older than {} should be deleted. Deleted {} file(s).".format(last_modified, files_deleted))
Review comment: It would be a good idea to note in your PR description why we deleted these!