Cleanup stale files #3584
Conversation
wrap.__name__ = function.__name__
return wrap
It would be a good idea to note in your PR description why we deleted these!
def _create_stale_file(user, modified_date):
    checksum = '%32x' % random.getrandbits(16 * 8)
Cool!
…(previously test_user.py) for consistency and in response to user.py test case fail.
One important change is needed to make sure we delete everything; one other is more precautionary.
@delay_user_storage_calculation
def clean_up_stale_files(last_modified=now() - datetime.timedelta(days=30)):
Might be good to define the default inside the function scope. At the moment, by defining it in the function definition, now() will be whatever now() was at the time the module was imported. So if this function was run from inside a process that had been running for a hundred days, it would actually be looking for a last_modified that was 130 days ago.
The usual Python pattern for this is to set a default of None, and then do if last_modified is None inside the function and set it to the default value there.
The big gotcha that can sometimes happen here is if the default value is set to a mutable type like a dict, in which case successive invocations of the function can actually modify and share the default value.
See here: https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments for more about that.
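For illustration, the pattern described above would look roughly like this (a minimal sketch, assuming now comes from django.utils.timezone; not the exact code in this PR):

import datetime
from django.utils.timezone import now  # assumed source of now()

def clean_up_stale_files(last_modified=None):
    # Compute the default at call time rather than at import time.
    if last_modified is None:
        last_modified = now() - datetime.timedelta(days=30)
    ...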
That is a good point that I did not realize, thank you! I will make those changes and update the PR.
@@ -33,5 +34,7 @@ def handle(self, *args, **options):
        clean_up_deleted_chefs()
        logging.info("Cleaning up feature flags")
        clean_up_feature_flags()
        logging.info("Cleaning up stale file objects")
        clean_up_stale_files()
This function is only cleaning up 10,000 files at a time, so each run of the garbage collection job (which runs daily on production) will only clear 10,000 files - instead, we should keep running until there are no more files that meet the criteria. One possibility would be to return the number of deleted files, and keep trying to delete until the number of deleted files is zero.
If we don't do this, it will take us approximately 2 years to clear all the existing 'orphaned' file objects at just 10,000 a day!
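A rough sketch of that suggestion, assuming clean_up_stale_files is changed to return the number of rows it deleted (a hypothetical return value, not the current signature), and with clean_up_all_stale_files as a hypothetical wrapper name:

def clean_up_all_stale_files():
    # Repeat the bounded clean-up until a pass deletes nothing,
    # so a single daily run clears the whole backlog.
    while clean_up_stale_files() > 0:
        pass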
@rtibbles It's actually 100k, not 10k. We decided to limit this because we don't know how long deleting the 7 million files would take. I think the actual limit could be adjusted. We looked at the cron schedule of the job in production, and chose a very conservative value.
Ah, I assumed that this was intended as a batching parameter to limit the length of any particular delete query. Even with 100,000 it will still be 70 days before the backlog is cleared.
Is there a specific concern with the scheduled job taking a long time? It seems like it will be a one-time long run, and after that it will just be keeping on top of it.
We saw that the cron jobs are scheduled to run every 24 hours. While looking at Django’s delete() method, we noticed that it also triggers the deletion signals, which could take some time to process, and that deleting millions of file objects in one go would also carry a memory cost. The potential for the job to run past 24 hours was our main concern behind the limit.
There seem to be two concerns here - one about limiting the memory usage and time of the deletion operation, and one about the run time of the cronjob.
I think we should be limiting the memory usage, and could even reduce the batch size further to remove any concerns.
For the cronjob, I think it is more important that it runs to completion than that it finishes within 24 hours, as this runs in parallel to the other deletions happening in the cronjob. Without ensuring that all the file objects matching these criteria are deleted, we continue to run the risk of a file table filled with an ever-expanding number of 'junk' files.
In terms of concerns about signals, there is one signal that is currently attached to the post_delete signal of the File object, which recalculates file storage:
studio/contentcuration/contentcuration/models.py, line 2232 (at commit 4444843):

def auto_delete_file_on_delete(sender, instance, **kwargs):
I don't know that it would deadlock; maybe the subsequent or existing run would end up running superfluous deletes?
At the moment, I think this is plausible when running the KA EN Chef (although I am trying to fix the ricecooker behaviour to prevent this).
Since there is quite a bit of unknown here, it seems like it would be safer to limit it, then. So how about a limit of 500k? Or 1 million?
I am fine with a higher limit - do we still want to batch though to limit memory usage? Presumably for the post_delete signal the full model representation is being loaded, and I'm not sure if Django is batching that or not.
Seems like it doesn't do any batching, but it does check if there are any signal handlers to see whether it needs to load the model representations - this might explain why the other delete methods for garbage collection are explicitly disabling the signal handlers with a context manager.
https://docs.djangoproject.com/en/3.2/ref/models/querysets/#delete
So if we are going to increase the limit, we probably need to disable the signal handlers, rather than just deferring the user storage calculation.
However, it seems like we don't actually need to run the user storage calculation at all, as by definition, these files are not part of any active trees (and the user storage calculation only calculates for active trees).
Got it, I will update the function to include a higher limit and remove the storage calculation. Do you have any good examples and/or links for batching?
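For reference, one common batching pattern in Django is to delete in primary-key chunks so that the full queryset is never loaded at once; the delete_in_batches helper below is a hedged sketch, and DisablePostDeleteSignal and stale_files_queryset in the usage comment are assumptions based on this thread rather than the PR's exact code:

def delete_in_batches(queryset, batch_size=100000):
    # Delete a queryset in primary-key batches and return the total rows removed.
    total = 0
    while True:
        batch_ids = list(queryset.values_list("pk", flat=True)[:batch_size])
        if not batch_ids:
            break
        deleted, _ = queryset.model.objects.filter(pk__in=batch_ids).delete()
        total += deleted
    return total

# Hypothetical usage, with the post_delete handler disabled for the duration:
# with DisablePostDeleteSignal():
#     delete_in_batches(stale_files_queryset)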
… higher limit and added DisablePostDeleteSignal context manager
This looks good - we can see how it behaves on unstable and make further updates if needed!
Summary
Description of the change(s) you made
This PR adds a function that cleans up File objects to the garbage collection utilities. Files that have no associated ContentNode, AssessmentItem, or SlideShow and that have not been modified since the specified date (defaulting to one month ago if no date is given) will be deleted.
The garbage collect management command is updated to run this function for any files not modified in the last month. This PR also includes a code clean-up of the decorators.py file; the sections deleted within this file were unused.
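As a rough sketch of the selection criteria described above (the relation names contentnode, assessment_item, and slideshow_slide, the modified field, and the import path are assumptions for illustration rather than the exact implementation):

import datetime
from django.utils.timezone import now
from contentcuration.models import File  # assumed import path

def stale_files(last_modified=None):
    if last_modified is None:
        last_modified = now() - datetime.timedelta(days=30)
    # Files with no ContentNode, AssessmentItem, or SlideShow association
    # that have not been modified since the cutoff.
    return File.objects.filter(
        contentnode__isnull=True,
        assessment_item__isnull=True,
        slideshow_slide__isnull=True,
        modified__lte=last_modified,
    )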
Reviewer guidance
How can a reviewer test these changes?
Tests for this function have been added to the test_garbage_collect.py file. They cover the deletion of files that meet the specified requirements (no ContentNode, AssessmentItem, or SlideShow association, and not modified in the last month) and the non-deletion of files that have no ContentNode, AssessmentItem, or SlideShow association but have been modified more recently.
References
Fixes #3533
Contributor's Checklist
PR process:
- Has the CHANGELOG label been added to this PR? (Note: items with this label will be added to the CHANGELOG at a later time.)
- Has the docs label been added if this introduces a change that needs to be updated in the user docs?
- Are requirements.txt files also included in this PR?
Studio-specific:
- Has the notranslate class been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)?
- Are files placed in the pages, components, and layouts directories as described in the docs?
Testing:
Reviewer's Checklist
This section is for reviewers to fill out.
- Dependencies are up to date (yarn and pip)