clean serval data #614

Open
wants to merge 2 commits into
base: master

mshannon-sil
Collaborator

@mshannon-sil mshannon-sil commented Dec 23, 2024

This PR addresses sillsdev/serval#468. The clean_s3.py script now also cleans up Serval data: any file named pretranslate.src.json, pretranslate.tgt.json, train.src.txt, or train.tgt.txt under a path matching ^(production|dev|int-qa|ext-qa)/builds/.+ that is older than 1 month is deleted when the script runs. The script runs as a weekly cron job on the AQuA server, every Sunday at 1am CT.
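
As an illustration of the rule described above, here is a minimal sketch of the match-and-age check. It is not the code in clean_s3.py: the function name `is_stale_serval_file`, the escaped and end-anchored form of the pattern, and the 30-day approximation of "1 month" are all assumptions made for the example.

```python
import re
from datetime import datetime, timedelta, timezone

# Same alternatives as in the description above. Assumption: the dots are
# escaped and the pattern is anchored with "$" so only these exact file
# names match; the real script may be written differently.
SERVAL_FILE_PATTERN = re.compile(
    r"^(production|dev|int-qa|ext-qa)/builds/.+"
    r"(pretranslate\.src\.json|pretranslate\.tgt\.json|train\.src\.txt|train\.tgt\.txt)$"
)


def is_stale_serval_file(key, last_modified, now, max_age=timedelta(days=30)):
    """Return True if the key matches the pattern and is older than max_age.

    Assumption: "1 month" is approximated as 30 days.
    """
    return bool(SERVAL_FILE_PATTERN.match(key)) and (now - last_modified) > max_age
```

Passing `now` in explicitly keeps the check deterministic, which makes it easy to compare against the output of a dry run.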



@mshannon-sil mshannon-sil self-assigned this Dec 24, 2024
@johnml1135
Collaborator

scripts/clean_s3.py line 47 at r1 (raw file):

    regex_to_delete = re.compile(
        r"^(production|dev|int-qa|ext-qa)/builds/.+"
        r"(pretranslate.src.json|pretranslate.tgt.json|train.src.txt|train.tgt.txt)"

There will be other files in this folder, and built models may also be sent back to it. I would be more interested in removing all files in these folders that are older than a certain date.

@johnml1135
Collaborator

scripts/clean_s3.py line 15 at r1 (raw file):

    research_total_deleted, research_total_space_freed = clean_research(max_months, dry_run)
    print("Cleaning production")
    production_total_deleted, production_total_space_freed = clean_production(max_months, dry_run)

It would be nice if "max months" could be different for research and production. Specifically, I would like at least 2 months for production data. It isn't that large, and it can be very helpful for seeing what went wrong.

@johnml1135
Collaborator

scripts/clean_s3.py line 88 at r1 (raw file):

    args = parser.parse_args()

    clean_s3(args.max_months, args.dry_run)

To have the ability to change the max months for production or research, it may be best to split the two paths right at the top, with a switch for research or production.
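
One way the switch suggested above could look is sketched below. The positional `target` argument, the `--max-months` override, and the per-target defaults are hypothetical and are not taken from clean_s3.py. The production default of 2 months follows the request in this review; the research default of 1 month is an assumption.

```python
import argparse

# Hypothetical per-target defaults; only the production value comes from the
# review discussion ("at least 2 months for production data").
DEFAULT_MAX_MONTHS = {"research": 1, "production": 2}


def build_parser():
    parser = argparse.ArgumentParser(description="Clean old S3 data (sketch).")
    # The switch between the two cleaning paths.
    parser.add_argument("target", choices=["research", "production"])
    # Optional override; falls back to the per-target default when omitted.
    parser.add_argument("--max-months", type=int, default=None)
    parser.add_argument("--dry-run", action="store_true")
    return parser


def resolve_options(argv=None):
    """Parse argv and return (target, max_months, dry_run)."""
    args = build_parser().parse_args(argv)
    max_months = args.max_months
    if max_months is None:
        max_months = DEFAULT_MAX_MONTHS[args.target]
    return args.target, max_months, args.dry_run
```

With this layout, each cron entry would pick a target explicitly, and the retention period only needs to be passed when it differs from the default.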

@johnml1135
Collaborator

I would be interested in a txt or csv output listing all the files it found, whether too old or not, along with their creation timestamps and the determination to "delete" or "not delete". This could be output when dry run is invoked.
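
A report like the one requested above could be produced along these lines. This is a sketch under assumptions: the function name `write_dry_run_report` and the column names are invented for the example, and the object's last-modified time stands in for the creation timestamp, since that is the timestamp S3 object listings provide.

```python
import csv
import io
from datetime import datetime, timezone


def write_dry_run_report(objects, cutoff, out):
    """Write one CSV row per object with the keep/delete decision.

    objects: iterable of (key, last_modified) pairs (hypothetical shape).
    cutoff:  objects last modified before this time are marked "delete".
    out:     any text file-like object opened for writing.
    """
    writer = csv.writer(out)
    writer.writerow(["key", "last_modified", "decision"])
    for key, last_modified in objects:
        decision = "delete" if last_modified < cutoff else "keep"
        writer.writerow([key, last_modified.isoformat(), decision])
```

Because the function only writes rows and never deletes anything, the same listing can feed both the dry-run report and the real deletion step.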

Collaborator

@johnml1135 johnml1135 left a comment


Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil)

Collaborator Author

@mshannon-sil mshannon-sil left a comment


Okay, I'll work on adding this. Currently I redirect the text output to a .txt file every time the cron job for this script runs.

Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @johnml1135)


scripts/clean_s3.py line 15 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

It would be nice if "max months" could be different for research and production. Specifically, I would like at least 2 months for production data. It isn't that large, and it can be very helpful for seeing what went wrong.

Done.


scripts/clean_s3.py line 47 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

There will be other files in this folder, and built models may also be sent back to it. I would be more interested in removing all files in these folders that are older than a certain date.

To clarify, which folders do you want to remove data from? The original post mentioned deleting "corpus and pretranslation files once a build has finished", so I imagine it's all the files in the builds folder. Would you also like to delete files in the models folder?


scripts/clean_s3.py line 88 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

To have the ability to change the max months for production or research, it may be best to split the two paths right at the top, with a switch for research or production.

Done.

@johnml1135
Collaborator

scripts/clean_s3.py line 47 at r1 (raw file):

Previously, mshannon-sil wrote…

To clarify, which folders do you want to remove data from? The original post mentioned deleting "corpus and pretranslation files once a build has finished", so I imagine it's all the files in the builds folder. Would you also like to delete files in the models folder?

Only builds. We need to keep the models. We allow models to be downloaded directly from the S3 bucket, so we need to keep them around (indefinitely). That is, we manage the model lifecycle (for now).
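
The outcome of this thread (delete everything old under builds, never touch models) could be expressed roughly as follows. This is a sketch only: the function name `select_for_deletion`, the `(key, last_modified)` input shape, and the assumption that models live under a sibling `models/` prefix are all hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches any file under <environment>/builds/, regardless of its name.
# Keys under other folders (for example a models/ folder) never match.
BUILDS_PATTERN = re.compile(r"^(production|dev|int-qa|ext-qa)/builds/.+")


def select_for_deletion(objects, cutoff):
    """Return the keys under builds/ that were last modified before cutoff."""
    return [
        key
        for key, last_modified in objects
        if BUILDS_PATTERN.match(key) and last_modified < cutoff
    ]
```

Restricting the pattern to the builds prefix, rather than excluding models explicitly, means any folder added later is kept by default.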

@johnml1135
Collaborator

I'll approve it once you have a dry run and can email me the results.
