
Cleanup stale files #3584

Merged

Conversation

@LianaHarris360 (Member) commented Aug 25, 2022

Summary

Description of the change(s) you made

This PR adds a function to the garbage collection utilities that cleans up stale File objects. Files with no associated ContentNode, AssessmentItem, or SlideShow that have not been modified since the specified date (defaulting to one month ago if no date is given) will be deleted.

The garbage collect management command is updated to run this function for any files not modified in the last month. This PR also includes a clean-up of the decorators.py file; the sections deleted from that file were unused.
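
For illustration only, the core query likely looks something like the sketch below (the import path and the field names on the File model are assumptions, not copied from the actual implementation):

from contentcuration.models import File  # import path assumed

def clean_up_stale_files(last_modified):
    # Hypothetical sketch: delete File objects with no owning ContentNode,
    # AssessmentItem, or SlideShow slide, last modified before the cutoff.
    File.objects.filter(
        contentnode__isnull=True,
        assessment_item__isnull=True,
        slideshow_slide__isnull=True,
        modified__lt=last_modified,
    ).delete()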

Reviewer guidance

How can a reviewer test these changes?

Tests for this function have been added to the test_garbage_collect.py file. They cover the deletion of files that meet the specified requirements (no ContentNode, AssessmentItem, or SlideShow association and not modified within the last month) and the non-deletion of files that have no ContentNode, AssessmentItem, or SlideShow association but have been modified more recently.

References

Fixes #3533

Contributor's Checklist

PR process:

  • If this is an important user-facing change, the CHANGELOG label has been added to this PR or a related issue. Note: items with this label will be added to the CHANGELOG at a later time
  • If this includes an internal dependency change, a link to the diff is provided
  • The docs label has been added if this introduces a change that needs to be updated in the user docs
  • If any Python requirements have changed, the updated requirements.txt files are also included in this PR
  • Opportunities for using Google Analytics here are noted
  • Migrations are safe for a large db

Studio-specific:

  • All user-facing strings are translated properly
  • The notranslate class has been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)
  • All UI components are LTR and RTL compliant
  • Views are organized into pages, components, and layouts directories as described in the docs
  • Users' storage used is recalculated properly on any changes to main tree files
  • If there are new ways this uses user data that need to be factored into our Privacy Policy, it has been noted.

Testing:

  • Code is clean and well-commented
  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Any new interactions have been added to the QA Sheet
  • Critical and brittle code paths are covered by unit tests

Reviewer's Checklist

This section is for reviewers to fill out.

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

wrap.__name__ = function.__name__
return wrap


Member:

It would be a good idea to note in your PR description why we deleted these!



def _create_stale_file(user, modified_date):
checksum = '%32x' % random.getrandbits(16 * 8)
Member:

Cool!

…(previously test_user.py) for consistency and in response to user.py test case fail.
@rtibbles (Member) left a comment:

One important change is needed to make sure we delete everything. The other is more precautionary.



@delay_user_storage_calculation
def clean_up_stale_files(last_modified=now() - datetime.timedelta(days=30)):
Member:

Might be good to define the default inside the function scope. At the moment, by defining it in the function definition, now() will be whatever now() was at the time the module was imported. So if this function was run from inside a process that had been running for a hundred days, it would actually be looking for last_modified that was 130 days ago.

The usual Python pattern for this is to set a default of None, and then do if last_modified is None inside the function and then set it to the default value.

The big gotcha that can sometimes happen here is if the default value is set to a mutable type like a dict, and then successive invocations of the function can actually modify and share the default value.

See here: https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments for more about that.
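
A minimal sketch of that pattern applied here (assuming now() is django.utils.timezone.now; the deletion logic itself is elided):

import datetime

from django.utils.timezone import now

def clean_up_stale_files(last_modified=None):
    # Evaluate the cutoff at call time rather than at module import time.
    if last_modified is None:
        last_modified = now() - datetime.timedelta(days=30)
    # ... delete File objects not modified since last_modified ...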

@LianaHarris360 (Member Author):

That is a good point that I did not realize, thank you! I will make those changes and update the PR.

@@ -33,5 +34,7 @@ def handle(self, *args, **options):
clean_up_deleted_chefs()
logging.info("Cleaning up feature flags")
clean_up_feature_flags()
logging.info("Cleaning up stale file objects")
clean_up_stale_files()
Member:

This function is only cleaning up 10000 files at a time, so each run of the garbage collection job (which runs daily on production) will only clear up 10000 files - instead we should be running until there are no more files that meet the criteria. One possibility would be to return the number of deleted files, and keep on trying to delete until the number of deleted files is zero.

If we don't do this, it will take us approximately 2 years to clear all the existing 'orphaned' file objects at just 10,000 a day!
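
A rough sketch of that suggestion (assuming clean_up_stale_files is changed to return the number of rows it deleted):

# In the management command: keep deleting batches until nothing is left to delete.
while True:
    deleted_count = clean_up_stale_files()
    if deleted_count == 0:
        break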

@bjester (Member) commented Aug 26, 2022:

@rtibbles It's actually 100k, not 10k. We decided to limit this because we don't know how long deleting the 7 million would take. I think the actual limit could be adjusted. We looked at the cron schedule of the job in production, and a very conservative value was chosen.

Member:

Ah, I assumed that this was intended as a batching parameter to limit the length of any particular delete query. Even with 100000 it will still be 70 days before the backlog is cleared.

Is there a specific concern with the scheduled job taking a long time? It seems like it will be a one time long run and then after that will just be keeping on top of it.

@LianaHarris360 (Member Author):

We saw that the cron jobs are scheduled to run every 24 hours. While looking at Django’s delete() method, we noticed that it also triggers the deletion signals, which could take some time to process, along with the memory usage of deleting millions of file objects in one go. The potential for the job to run past 24 hours was our main concern behind the limit.

Member:

There seem to be two concerns here - one about limiting the memory usage and time of the deletion operation, and one about the run time of the cronjob.

I think we should be limiting the memory usage, and could even reduce the batch size further to remove any concerns.

For the cronjob, I think it is more important that it runs to completion than that it finishes within 24 hours - as this is parallel to the other deletions that are happening in the cronjob. Without ensuring that all the file objects matching these criteria are deleted, we continue to run the risk of a file table filled with an ever-expanding number of 'junk' files.

In terms of concerns about signals, there is one signal that is currently attached to the post_delete signal of the File object, which recalculates file storage:

def auto_delete_file_on_delete(sender, instance, **kwargs):
You are already handling this using the decorator that delays storage recalculation.

Member:

I don't know that it would deadlock, maybe the subsequent or existing run would end up running superfluous deletes?

At the moment, I think this is plausible when running the KA EN Chef (although I am trying to fix the ricecooker behaviour to prevent this).

Since there is quite a bit of unknown here, seems like it would be safer to limit it then. So how about a limit of 500k? Or 1 million?

Member:

I am fine with a higher limit - do we still want to batch though to limit memory usage? Presumably for the post_delete signal the full model representation is being loaded, and I'm not sure if Django is batching that or not.

Member:

Seems like it doesn't do any batching, but it does check if there are any signal handlers to see whether it needs to load the model representations - this might explain why the other delete methods for garbage collection are explicitly disabling the signal handlers with a context manager.

https://docs.djangoproject.com/en/3.2/ref/models/querysets/#delete
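
For reference, a generic version of such a signal-disabling context manager could look like the sketch below; this is an assumption about the general shape of the helper, not Studio's actual implementation (a later commit in this PR adds a DisablePostDeleteSignal context manager):

from contextlib import contextmanager

from django.db.models.signals import post_delete

@contextmanager
def disable_post_delete_signal(sender, receiver):
    # Temporarily disconnect the receiver so a bulk delete skips the handler
    # and Django does not need to load full model instances for it.
    post_delete.disconnect(receiver, sender=sender)
    try:
        yield
    finally:
        post_delete.connect(receiver, sender=sender)

Usage would then be something like:

with disable_post_delete_signal(File, auto_delete_file_on_delete):
    stale_files.delete()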

Member:

So if we are going to increase the limit, we probably need to disable the signal handlers, rather than just deferring the user storage calculation.

However, it seems like we don't actually need to run the user storage calculation at all, as by definition, these files are not part of any active trees (and the user storage calculation only calculates for active trees).

@LianaHarris360 (Member Author):

Got it, I will update the function to include a higher limit and remove the storage calculation. Do you have any good examples and/or links for batching?

… higher limit and added DisablePostDeleteSignal context manager
@rtibbles (Member) left a comment:

This looks good - we can see how it behaves on unstable and make further updates if needed!

@rtibbles merged commit c603816 into learningequality:unstable on Sep 1, 2022
Successfully merging this pull request may close these issues.

Cleanup stale File objects during the garbage collection job
3 participants