clean serval data #614
There will be other files, and built models may also be sent back in this folder. I would be more interested in removing all files in the folders that are older than a certain date.
It would be nice for "max months" to be configurable separately for research and production. Specifically, I would like at least 2 months for production data. It isn't that large, and it can be very helpful for seeing what went wrong.
To make the max months configurable for production or research, it may be best to have the two paths at the top, with a switch for research or production.
I would be interested in a txt or csv output listing all the files it found, whether too old or not, with their creation timestamps and the determination to "delete or not delete". This could be output when dry run is invoked.
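A minimal sketch of what such a dry-run report could look like, assuming the script already has a list of (key, timestamp) pairs for the files it found. The names MAX_MONTHS, classify and write_report are hypothetical and not taken from clean_s3.py, the 1-month research value is only a placeholder, and a month is approximated as 30 days.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Hypothetical retention settings; the review asks for at least 2 months
# for production data. The research value is a placeholder.
MAX_MONTHS = {"production": 2, "research": 1}

FIELDS = ["key", "timestamp", "age_days", "decision"]


def classify(files, mode, now):
    """Return one report row per file without deleting anything."""
    # Approximation: one month is treated as 30 days.
    cutoff = now - timedelta(days=30 * MAX_MONTHS[mode])
    rows = []
    for key, timestamp in files:
        rows.append({
            "key": key,
            "timestamp": timestamp.isoformat(),
            "age_days": (now - timestamp).days,
            "decision": "delete" if timestamp < cutoff else "keep",
        })
    return rows


def write_report(rows, out):
    """Write the rows as CSV to any text file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        ("production/builds/a/train.src.txt", now - timedelta(days=90)),
        ("production/builds/b/train.src.txt", now - timedelta(days=10)),
    ]
    buffer = io.StringIO()
    write_report(classify(sample, "production", now), buffer)
    print(buffer.getvalue())
```

Keeping the classification separate from the deletion step means the same rows can drive both the dry-run report and the real run.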
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil)
Okay, I'll work on adding this. Currently I redirect the text output to a .txt file every time the cron job for this script runs.
Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @johnml1135)
scripts/clean_s3.py
line 15 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
It would be nice for "max months" to be configurable separately for research and production. Specifically, I would like at least 2 months for production data. It isn't that large, and it can be very helpful for seeing what went wrong.
Done.
scripts/clean_s3.py
line 47 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
There will be other files, and built models may also be sent back in this folder. I would be more interested in removing all files in the folders that are older than a certain date.
To clarify, which folders do you want to remove data from? The original post mentioned deleting "corpus and pretranslation files once a build has finished", so I imagine it's all the files in the builds folder. Would you also like to delete files in the models folder?
scripts/clean_s3.py
line 88 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
To make the max months configurable for production or research, it may be best to have the two paths at the top, with a switch for research or production.
Done.
Previously, mshannon-sil wrote…
Only builds. We need to keep the models. We allow models to be downloaded directly from the S3 bucket, so we need to keep them around (indefinitely). That is, we manage the model lifecycle (for now).
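A minimal sketch of a builds-only key filter, assuming object keys look like environment/builds/build-folder/file name. The environment names and file names are the ones listed in the PR description; combining them into a single pattern, and the names BUILD_FILE and is_cleanup_candidate, are assumptions for illustration and may not match clean_s3.py.

```python
import re

# Matches only the listed corpus and pretranslation files under builds/.
# Anything under models/ never matches, so models are always kept.
BUILD_FILE = re.compile(
    r"^(production|dev|int-qa|ext-qa)/builds/.+/"
    r"(pretranslate\.src\.json|pretranslate\.tgt\.json"
    r"|train\.src\.txt|train\.tgt\.txt)$"
)


def is_cleanup_candidate(key):
    """Return True if the S3 key is a build file eligible for age checking."""
    return BUILD_FILE.match(key) is not None


if __name__ == "__main__":
    for key in [
        "production/builds/abc/train.src.txt",
        "production/models/abc/model.tar.gz",
    ]:
        print(key, is_cleanup_candidate(key))
```

An allow-list pattern like this fails safe: a new folder or file type is left alone until it is explicitly added.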
I'll approve it once you have a dry run and can email me the results.
This PR addresses sillsdev/serval#468. The clean_s3.py script now also cleans up serval data. Any file pretranslate.src.json|pretranslate.tgt.json|train.src.txt|train.tgt.txt in the folders ^(production|dev|int-qa|ext-qa)/builds/.+ that is older than 1 month will be deleted when this script is run. The script is run every week on Sunday at 1am CT as a cron job on the AQuA server. This change is