Cleanup stale files #3584

Merged
Changes from 4 commits
74 changes: 30 additions & 44 deletions contentcuration/contentcuration/decorators.py
@@ -1,8 +1,8 @@
from contextlib import ContextDecorator

from django.conf import settings
from django.core.exceptions import ObjectDoesNotExist
from django.shortcuts import render

from contentcuration.models import Channel

ACCEPTED_BROWSERS = settings.HEALTH_CHECK_BROWSERS + settings.SUPPORTED_BROWSERS

@@ -35,48 +35,6 @@ def wrap(request, *args, **kwargs):
return wrap


def can_access_channel(function):
def wrap(request, *args, **kwargs):
try:
channel = Channel.objects.get(pk=kwargs["channel_id"])
except ObjectDoesNotExist:
return render(request, "channel_not_found.html")

if (
channel.public
or channel.editors.filter(id=request.user.id).exists()
or channel.viewers.filter(id=request.user.id).exists()
or (not request.user.is_anonymous and request.user.is_admin)
):
return function(request, *args, **kwargs)

return render(request, "channel_not_found.html", status=404)

wrap.__doc__ = function.__doc__
wrap.__name__ = function.__name__
return wrap


def can_edit_channel(function):
def wrap(request, *args, **kwargs):
try:
channel = Channel.objects.get(pk=kwargs["channel_id"], deleted=False)

if (
not channel.editors.filter(id=request.user.id).exists()
and not request.user.is_admin
):
return render(request, "unauthorized.html", status=403)

return function(request, *args, **kwargs)
except ObjectDoesNotExist:
return render(request, "channel_not_found.html")

wrap.__doc__ = function.__doc__
wrap.__name__ = function.__name__
return wrap


Comment (Member):
It would be a good idea to note in your PR description why we deleted these!

# NOTE: This approach only works because of a seeming bug in Django
# https://code.djangoproject.com/ticket/15855
# Essentially, what happens is that this decorator sets the Vary headers
@@ -104,3 +62,31 @@ def wrap(request, *args, **kwargs):
return response

return wrap


class DelayUserStorageCalculation(ContextDecorator):
"""
Decorator class that will dedupe and delay requests to enqueue storage calculation tasks for users
until after the wrapped function has exited
"""
depth = 0
queue = []

@property
def is_active(self):
return self.depth > 0

def __enter__(self):
self.depth += 1

def __exit__(self, exc_type, exc_val, exc_tb):
from contentcuration.utils.user import calculate_user_storage
self.depth -= 1
if not self.is_active:
user_ids = set(self.queue)
self.queue = []
for user_id in user_ids:
calculate_user_storage(user_id)


delay_user_storage_calculation = DelayUserStorageCalculation()
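The moved decorator can be exercised in isolation. Below is a minimal self-contained sketch (plain Python; `enqueued` and `touch_user_files` are hypothetical stand-ins for the real Celery task and a wrapped view, not code from this PR) showing how requests queued inside the wrapped function are deduped and flushed on exit:

```python
from contextlib import ContextDecorator

enqueued = []  # stand-in for calls to the real storage-calculation task

def calculate_user_storage(user_id):
    enqueued.append(user_id)

class DelayUserStorageCalculation(ContextDecorator):
    """Dedupe and delay storage-calculation requests until the outermost exit."""
    depth = 0
    queue = []

    @property
    def is_active(self):
        return self.depth > 0

    def __enter__(self):
        self.depth += 1

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.depth -= 1
        if not self.is_active:
            user_ids = set(self.queue)  # dedupe queued requests
            self.queue = []
            for user_id in user_ids:
                calculate_user_storage(user_id)

delay_user_storage_calculation = DelayUserStorageCalculation()

@delay_user_storage_calculation
def touch_user_files(user_ids):
    # While the decorator is active, requests are queued, not enqueued
    for uid in user_ids:
        delay_user_storage_calculation.queue.append(uid)

touch_user_files([1, 1, 2])
print(sorted(enqueued))  # [1, 2] - the duplicate request for user 1 collapsed
```

Because `ContextDecorator` runs the wrapped function inside `with self`, nested decorated calls only flush once the outermost call exits.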
@@ -11,6 +11,7 @@
from contentcuration.utils.garbage_collect import clean_up_contentnodes
from contentcuration.utils.garbage_collect import clean_up_deleted_chefs
from contentcuration.utils.garbage_collect import clean_up_feature_flags
from contentcuration.utils.garbage_collect import clean_up_stale_files
from contentcuration.utils.garbage_collect import clean_up_tasks


@@ -33,5 +34,7 @@ def handle(self, *args, **options):
clean_up_deleted_chefs()
logging.info("Cleaning up feature flags")
clean_up_feature_flags()
logging.info("Cleaning up stale file objects")
clean_up_stale_files()
Comment (Member):
This function is only cleaning up 10000 files at a time, so each run of the garbage collection job (which runs daily on production) will only clear up 10000 files - instead we should be running until there are no more files that meet the criteria. One possibility would be to return the number of deleted files, and keep on trying to delete until the number of deleted files is zero.

If we don't do this, it will take us approximately 2 years to clear all the existing 'orphaned' file objects at just 10,000 a day!
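The suggestion above (keep deleting until a pass removes nothing) can be sketched as follows. This is plain Python with hypothetical stand-ins, not the PR's code: `delete_stale_batch` plays the role of a limited queryset delete that returns its count, and the demo limit is far smaller than the production one:

```python
BATCH_LIMIT = 100  # illustrative; the real job uses a much larger limit

def delete_stale_batch(stale_ids, limit=BATCH_LIMIT):
    """Hypothetical stand-in for a limited queryset delete; returns count deleted."""
    deleted = len(stale_ids[:limit])
    del stale_ids[:limit]
    return deleted

def clean_up_all_stale_files(stale_ids):
    # Loop until a batch deletes zero rows, so no backlog is left behind
    total = 0
    while True:
        deleted = delete_stale_batch(stale_ids)
        if deleted == 0:
            return total
        total += deleted

backlog = list(range(250))
print(clean_up_all_stale_files(backlog))  # 250
print(backlog)  # []
```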

Comment (Member, @bjester, Aug 26, 2022):

@rtibbles It's actually 100k, not 10k. We decided to limit this because we don't know how long deleting all 7 million would take. I think the actual limit could be adjusted. We looked at what the cron schedule of the job was in production, and a very conservative value was chosen.

Comment (Member):

Ah, I assumed that this was intended as a batching parameter to limit the length of any particular delete query. Even with 100000 it will still be 70 days before the backlog is cleared.

Is there a specific concern with the scheduled job taking a long time? It seems like it will be a one time long run and then after that will just be keeping on top of it.

Comment (Member Author):

We saw that the cron jobs are scheduled to run every 24 hours. While looking at Django’s delete() method, we noticed that it also triggers the deletion signals, which could take some time to process, on top of the memory cost of deleting millions of file objects in one go. The potential for the job to run past 24 hours was our main concern for the limit.

Comment (Member):

There seem to be two concerns here - one about limiting the memory usage and time of the deletion operation, and one about the run time of the cronjob.

I think we should be limiting the memory usage, and could even reduce the batch size further to remove any concerns.

For the cronjob, I think it is more important that it runs to completion than it finish within 24 hours - as this is parallel to the other deletions that are happening in the cronjob. Without ensuring that all the file objects matching this criteria are deleted, we continue to run the risk of a file table filled with an ever expanding number of 'junk' files.

In terms of concerns about signals, there is one signal that is currently attached to the post_delete signal of the File object, which recalculates file storage:

def auto_delete_file_on_delete(sender, instance, **kwargs):
- you are already handling this using the decorator that delays storage recalculation.

Comment (Member):

I don't know that it would deadlock; maybe the subsequent or existing run would end up running superfluous deletes?

At the moment, I think this is plausible when running the KA EN Chef (although I am trying to fix the ricecooker behaviour to prevent this).

Since there is quite a bit of unknown here, seems like it would be safer to limit it then. So how about a limit of 500k? Or 1 million?

Comment (Member):

I am fine with a higher limit - do we still want to batch though to limit memory usage? Presumably for the post_delete signal the full model representation is being loaded, and I'm not sure if Django is batching that or not.

Comment (Member):

Seems like it doesn't do any batching, but it does check if there are any signal handlers to see whether it needs to load the model representations - this might explain why the other delete methods for garbage collection are explicitly disabling the signal handlers with a context manager.

https://docs.djangoproject.com/en/3.2/ref/models/querysets/#delete
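The pattern mentioned here (disabling signal handlers with a context manager) can be sketched generically. This is plain Python with a toy receiver list, an assumption standing in for Django's actual `Signal` API; in Django the equivalent would be `post_delete.disconnect(...)` before the bulk delete and `post_delete.connect(...)` after:

```python
from contextlib import contextmanager

post_delete_receivers = []  # toy stand-in for a signal's receiver list

def auto_delete_file_on_delete(instance):
    # expensive per-instance work we want to skip during bulk deletes
    instance["recalculated"] = True

post_delete_receivers.append(auto_delete_file_on_delete)

@contextmanager
def receiver_disabled(receivers, receiver):
    """Temporarily remove a receiver, restoring it even if the body raises."""
    receivers.remove(receiver)
    try:
        yield
    finally:
        receivers.append(receiver)

record = {"recalculated": False}
with receiver_disabled(post_delete_receivers, auto_delete_file_on_delete):
    for receiver in post_delete_receivers:  # nothing fires during the bulk delete
        receiver(record)
print(record["recalculated"])  # False
print(auto_delete_file_on_delete in post_delete_receivers)  # True
```

The `try`/`finally` matters: the receiver is reconnected even if the delete raises, so one failed run cannot leave signals permanently disabled.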

Comment (Member):

So if we are going to increase the limit, we probably need to disable the signal handlers, rather than just deferring the user storage calculation.

However, it seems like we don't actually need to run the user storage calculation at all, as by definition, these files are not part of any active trees (and the user storage calculation only calculates for active trees).

Comment (Member Author):

Got it, I will update the function to include a higher limit and remove the storage calculation. Do you have any good examples and/or links for batching?
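One generic way to batch is to walk the ids in fixed-size slices. A hedged sketch in plain Python (in Django the slices would be pk-ordered queryset slices rather than list slices, and `batched` is an illustrative helper, not an existing utility in this codebase):

```python
def batched(ids, batch_size):
    """Yield successive fixed-size batches from an ordered sequence of ids."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]

print(list(batched(list(range(10)), 3)))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Deleting one batch per iteration bounds both query size and the number of model instances loaded for signal handlers at any one time.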

logging.info("Cleaning up tasks")
clean_up_tasks()
@@ -1,11 +1,11 @@
import mock

from contentcuration.decorators import delay_user_storage_calculation
from contentcuration.tests.base import StudioTestCase
from contentcuration.utils.user import calculate_user_storage
from contentcuration.utils.user import delay_user_storage_calculation


class UserUtilsTestCase(StudioTestCase):
class DecoratorsTestCase(StudioTestCase):
@mock.patch("contentcuration.utils.user.calculate_user_storage_task")
def test_delay_storage_calculation(self, mock_task):
@delay_user_storage_calculation
@@ -1,4 +1,5 @@
import json
import random
import uuid
from datetime import datetime
from datetime import timedelta
@@ -11,6 +12,8 @@
from django.core.files.storage import default_storage
from django.urls import reverse_lazy
from le_utils.constants import content_kinds
from le_utils.constants import file_formats
from le_utils.constants import format_presets

from contentcuration import models as cc
from contentcuration.api import activate_channel
@@ -23,6 +26,7 @@
from contentcuration.utils.garbage_collect import clean_up_contentnodes
from contentcuration.utils.garbage_collect import clean_up_deleted_chefs
from contentcuration.utils.garbage_collect import clean_up_feature_flags
from contentcuration.utils.garbage_collect import clean_up_stale_files
from contentcuration.utils.garbage_collect import clean_up_tasks
from contentcuration.utils.garbage_collect import get_deleted_chefs_root
from contentcuration.views.internal import api_commit_channel
@@ -396,3 +400,56 @@ def test_recent_task(self):
TaskResult.objects.get(task_id=self.recent_task.task_id)
except TaskResult.DoesNotExist:
self.fail("Task was removed")


TWO_DAYS_AGO = datetime.now() - timedelta(days=2)


def _create_stale_file(user, modified_date):
checksum = '%32x' % random.getrandbits(16 * 8)
Comment (Member): Cool!

file = File(
file_size=5,
checksum=checksum,
original_filename="SomeAudio.mp3",
file_on_disk=f"{checksum}.mp3",
file_format_id=file_formats.MP3,
preset_id=format_presets.AUDIO,
uploaded_by=user,
duration=365,
)
file.save()

# Manually setting last modified date of file
File.objects.filter(id=file.id).update(modified=modified_date)

return file


class CleanupStaleFilesTestCase(StudioTestCase):

def setUp(self):
user = self.admin_user

# create test stale files
self.file_to_be_deleted = _create_stale_file(user, THREE_MONTHS_AGO)
self.file_to_keep = _create_stale_file(user, TWO_DAYS_AGO)

# run
clean_up_stale_files()

def test_deletes_correct_files(self):
"""
Function should only get files where ContentNode, AssessmentItem, and SlideshowSlide are null
and modified date is older than the date specified
"""
with self.assertRaises(File.DoesNotExist):
File.objects.get(id=self.file_to_be_deleted.id)

def test_doesnt_delete_files_with_modified_date_less_than_date_specified(self):
"""
Function should not delete files where the modified date is more recent than the date specified
"""
try:
File.objects.get(id=self.file_to_keep.id)
except File.DoesNotExist:
self.fail("File was deleted")
20 changes: 20 additions & 0 deletions contentcuration/contentcuration/utils/garbage_collect.py
@@ -7,7 +7,9 @@

from celery import states
from django.conf import settings
from django.db.models import OuterRef
from django.db.models.expressions import CombinedExpression
from django.db.models.expressions import Exists
from django.db.models.expressions import F
from django.db.models.expressions import Value
from django.db.models.signals import post_delete
@@ -16,6 +18,7 @@

from contentcuration.constants import feature_flags
from contentcuration.db.models.functions import JSONObjectKeys
from contentcuration.decorators import delay_user_storage_calculation
from contentcuration.models import ContentNode
from contentcuration.models import File
from contentcuration.models import TaskResult
@@ -145,3 +148,20 @@ def clean_up_tasks():
count, _ = TaskResult.objects.filter(date_done__lt=date_cutoff, status__in=states.READY_STATES).delete()

logging.info("Deleted {} completed task(s) from the task table".format(count))


@delay_user_storage_calculation
def clean_up_stale_files(last_modified=now() - datetime.timedelta(days=30)):
Comment (Member):
Might be good to define the default inside the function scope. At the moment, by defining it in the function definition, now() will be whatever now() was at the time the module was imported. So if this function was run from inside a process that had been running for a hundred days, it would actually be looking for last_modified that was 130 days ago.

The usual Python pattern for this is to set a default of None, and then do if last_modified is None inside the function and then set it to the default value.

The big gotcha that can sometimes happen here is if the default value is set to a mutable type like a dict, and then successive invocations of the function can actually modify and share the default value.

See here: https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments for more about that.

Comment (Member Author):

That is a good point that I did not realize, thank you! I will make those changes and update the PR.

"""
Clean up files that aren't attached to any ContentNode, AssessmentItem, or SlideshowSlide and where
the modified date is older than `last_modified`
"""

files_to_clean_up = File.objects.filter(
contentnode__isnull=True, assessment_item__isnull=True, slideshow_slide__isnull=True, modified__lt=last_modified
)

# Uses a subset of Python’s array-slicing syntax to limit deletion to 100000 files per function run
files_deleted, _ = File.objects.filter(Exists(files_to_clean_up.filter(pk=OuterRef('pk'))[:100000])).delete()

logging.info("Files with a modified date older than {} should be deleted. Deleted {} file(s).".format(last_modified, files_deleted))
2 changes: 1 addition & 1 deletion contentcuration/contentcuration/utils/publish.py
@@ -40,14 +40,14 @@
from past.utils import old_div

from contentcuration import models as ccmodels
from contentcuration.decorators import delay_user_storage_calculation
from contentcuration.statistics import record_publish_stats
from contentcuration.utils.cache import delete_public_channel_cache_keys
from contentcuration.utils.files import create_thumbnail_from_base64
from contentcuration.utils.files import get_thumbnail_encoding
from contentcuration.utils.parser import extract_value
from contentcuration.utils.parser import load_json_string
from contentcuration.utils.sentry import report_exception
from contentcuration.utils.user import delay_user_storage_calculation


logmodule.basicConfig()
29 changes: 1 addition & 28 deletions contentcuration/contentcuration/utils/user.py
@@ -1,39 +1,12 @@
import logging
from contextlib import ContextDecorator

from contentcuration.tasks import calculate_user_storage_task


class DelayUserStorageCalculation(ContextDecorator):
"""
Decorator class that will dedupe and delay requests to enqueue storage calculation tasks for users
until after the wrapped function has exited
"""
depth = 0
queue = []

@property
def is_active(self):
return self.depth > 0

def __enter__(self):
self.depth += 1

def __exit__(self, exc_type, exc_val, exc_tb):
self.depth -= 1
if not self.is_active:
user_ids = set(self.queue)
self.queue = []
for user_id in user_ids:
calculate_user_storage(user_id)


delay_user_storage_calculation = DelayUserStorageCalculation()


def calculate_user_storage(user_id):
"""TODO: Perhaps move this to User model to avoid unnecessary User lookups"""
from contentcuration.models import User
from contentcuration.decorators import delay_user_storage_calculation

if delay_user_storage_calculation.is_active:
delay_user_storage_calculation.queue.append(user_id)
2 changes: 1 addition & 1 deletion contentcuration/contentcuration/viewsets/sync/base.py
@@ -2,7 +2,7 @@

from search.viewsets.savedsearch import SavedSearchViewSet

from contentcuration.utils.user import delay_user_storage_calculation
from contentcuration.decorators import delay_user_storage_calculation
from contentcuration.viewsets.assessmentitem import AssessmentItemViewSet
from contentcuration.viewsets.bookmark import BookmarkViewSet
from contentcuration.viewsets.channel import ChannelViewSet