Tracking issue for only producing .xz on CI #80753

Closed
5 of 6 tasks
pietroalbini opened this issue Jan 6, 2021 · 6 comments
Labels
C-tracking-issue: Category: An issue tracking the progress of sth. like the implementation of an RFC
T-infra: Relevant to the infrastructure team, which will review and decide on the PR/issue.

@pietroalbini
Member

pietroalbini commented Jan 6, 2021

This issue tracks the work needed to only produce .xz tarballs on CI, recompressing them into .gz only during the release process. The effort will greatly reduce the amount of artifacts we need to store.
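
To illustrate the recompression step described above, here is a minimal sketch in Python. It is not the actual release tooling; the function name `recompress_xz_to_gz` and the choice to write the `.gz` next to the `.xz` are assumptions made for illustration.

```python
import gzip
import lzma
import shutil
from pathlib import Path


def recompress_xz_to_gz(xz_path: Path) -> Path:
    """Stream-decompress an .xz file and write a .gz copy next to it.

    Hypothetical helper: illustrates the idea only, not the release process.
    """
    if xz_path.suffix != ".xz":
        raise ValueError(f"expected a .xz file, got {xz_path}")
    gz_path = xz_path.with_suffix(".gz")  # foo.tar.xz -> foo.tar.gz
    with lzma.open(xz_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        # Copy in chunks so large tarballs never sit fully in memory.
        shutil.copyfileobj(src, dst)
    return gz_path
```

Since the `.gz` can always be regenerated from the `.xz` this way, only the `.xz` needs to be stored per CI build.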

Implementation work needed:

Bugs caused by this:

@pietroalbini pietroalbini added T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC labels Jan 6, 2021
@pietroalbini
Member Author

Forgot to create the tracking issue when the work started :)

@aidanhs
Member

aidanhs commented Feb 10, 2021

I've got an inventory of the rust-lang-ci2 bucket. Observations:

  • there are three 'categories' of gz
    1. rustc-builds-alt/.../*nightly*.gz - ~50k
    2. rustc-builds/.../*nightly*.gz - ~500k
    3. beta variants of 1 and 2 - ~20k
    4. version numbered variants of 1 and 2 - ~3k
  • we're still uploading beta .gz files (but not release ones)
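
A rough sketch of how keys from an inventory could be sorted into the categories above. The regular expressions and category names are assumptions inferred from the glob patterns in the list, not the exact queries used to produce these counts.

```python
import re
from collections import Counter

# Hypothetical patterns inferred from the categories listed above.
NIGHTLY_ALT = re.compile(r"^rustc-builds-alt/.+/.*nightly.*\.gz$")
NIGHTLY = re.compile(r"^rustc-builds/.+/.*nightly.*\.gz$")
BETA = re.compile(r"^rustc-builds(-alt)?/.+/.*beta.*\.gz$")


def classify(key: str) -> str:
    """Return a category label for one object key."""
    if not key.endswith(".gz"):
        return "not-gz"
    if NIGHTLY_ALT.match(key):
        return "nightly-alt"
    if NIGHTLY.match(key):
        return "nightly"
    if BETA.match(key):
        return "beta"
    # Everything else, which includes the version-numbered tarballs.
    return "other-gz"


def count_categories(keys):
    """Count how many keys fall into each category."""
    return Counter(classify(k) for k in keys)
```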

@aidanhs
Member

aidanhs commented Feb 11, 2021

I think this is done. I will check the next inventory and, if so, revert rust-lang/simpleinfra@16e3aaa. For the record, here's my dirty script:

import boto3
import botocore
import csv
import json
import os
import collections
from concurrent.futures import ThreadPoolExecutor, as_completed

def mks3():
    session = boto3.Session(profile_name='mfa-rust')
    s3 = session.client('s3')
    return s3

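# Keys already confirmed deleted, persisted so an interrupted run can resume.
if os.path.exists('progress.json'):
    with open('progress.json') as f:
        status = json.load(f)
else:
    status = {}
unsaved = 0
def save():
    print('Saving')
    # Write to a temp file first, then swap it in, so a crash cannot truncate progress.json.
    with open('progress.tmp.json', 'w') as f:
        json.dump(status, f)
    os.replace('progress.tmp.json', 'progress.json')
def set_deleted(key):
    global unsaved
    status[key] = True
    unsaved += 1
    # Checkpoint progress roughly every 100 keys.
    if unsaved > 100:
        unsaved = 0
        save()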

BUCKET = None
BATCH_SIZE = 100
NUM_WORKERS = 5
batches = []

def handle_batch(keys):
    global BUCKET
    s3 = mks3()
    to_delete = []
    deleted = []
    for key in keys:
        try:
            obj = s3.head_object(Bucket=BUCKET, Key=key)
            to_delete.append({ 'Key': key, 'VersionId': obj['VersionId'] })
        except botocore.exceptions.ClientError as e:
            if e.response['Error']['Code'] != '404':
                print('Failed to get object {} deets: {}'.format(key, e))
            else:
                print('Object {} missing, skipping'.format(key))
                deleted.append(key)
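    if len(to_delete) != 0:
        print('Deleting {} objects'.format(len(to_delete)))
        ret = s3.delete_objects(Bucket=BUCKET, Delete={ 'Objects': to_delete })
        # 'Deleted' may be absent from the response if every key in the batch failed.
        for obj in ret.get('Deleted', []):
            deleted.append(obj['Key'])
        if ret.get('Errors'):
            print(ret['Errors'])
            # Raise instead of assert so the check survives python -O.
            raise RuntimeError('delete_objects reported errors')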
    print('Finished deletion')
    return deleted

batch = []
rdr = csv.reader(open('nightlygzs'))
print('Creating batches')
for i, row in enumerate(rdr):
    if BUCKET is None:
        BUCKET = row[0]
    assert BUCKET == row[0]
    if row[1] in status:
        continue
    if len(batch) >= BATCH_SIZE:
        batches.append(batch)
        batch = []
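    batch.append(row[1])
# Flush the final partial batch, which the loop above never appends.
if batch:
    batches.append(batch)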

print('Got {} batches'.format(len(batches)))
executor = ThreadPoolExecutor(max_workers=NUM_WORKERS)
futures = [executor.submit(handle_batch, batch) for batch in batches]
print('Waiting for batch completion')
for i, complete in enumerate(as_completed(futures)):
    print('Completed batch {} of {}'.format(i+1, len(futures)))
    ret = complete.result()
    for key in ret:
        set_deleted(key)

save()
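
For reference, the batch-building step can also be written as a small generator that always emits the trailing partial batch. This is a sketch; `chunked` is a hypothetical helper name and not part of the script above. Note that S3's `delete_objects` accepts at most 1000 keys per request, so the batch size should stay at or below that.

```python
from typing import Iterable, Iterator, List


def chunked(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` keys, including the final partial one."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # the trailing partial batch must not be dropped
        yield batch
```

With this, the batches could be built as `batches = list(chunked(pending_keys, BATCH_SIZE))`, where `pending_keys` stands for the keys not yet recorded in `status`.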

@aidanhs
Member

aidanhs commented Feb 16, 2021

Correction - I deleted the nightly ones, so >500k combined. This looks like it's reduced our storage of ci2 from 78.5TiB to ~55TiB.

There are still a few other .gz files:

$ cat *.csv | grep '\.gz"$' | grep 'rustc-builds\(-alt\)*/.*/.*nightly.*\.gz"$' | wc -l
7090
$ cat *.csv | grep '\.gz"$' | grep -v 'rustc-builds\(-alt\)*/.*/.*nightly.*\.gz"$' | wc -l
26190

The first set are .gz files that mention nightly - not sure where they've come from, maybe they got missed in the initial run?
Of the results of the second command, ~21k are nightly and most of the others are release versions. I estimate that if we nuked the remainder of nightly/beta, that would free another 1TiB, but I'd wait until we stop uploading beta releases.
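
The two counts above, plus a size estimate, could be reproduced from the inventory rows with something like the sketch below. It assumes each row has the form `[bucket, key, ...]` (as in the deletion script) and that the object size in bytes sits in column 2; that column index is an assumption and depends on how the inventory is configured.

```python
import re

# Same pattern as the grep above, minus the trailing CSV quote.
NIGHTLY_GZ = re.compile(r"rustc-builds(-alt)*/.*/.*nightly.*\.gz$")
TIB = 1024 ** 4


def summarize_gz(rows, size_column=2):
    """Return (nightly_count, other_count, total_bytes) for .gz keys.

    `size_column` is an assumed position of the size field in each row.
    """
    nightly = other = total = 0
    for row in rows:
        key = row[1]
        if not key.endswith(".gz"):
            continue
        if NIGHTLY_GZ.search(key):
            nightly += 1
        else:
            other += 1
        total += int(row[size_column])
    return nightly, other, total


def to_tib(num_bytes: int) -> float:
    """Convert a byte count to TiB."""
    return num_bytes / TIB
```

The rows can come straight from `csv.reader` over the inventory files, which also strips the quoting that the grep patterns have to account for.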

@jyn514
Member

jyn514 commented Dec 17, 2022

What's the status of this? I think at this point we've stopped uploading new .gz files?

@pietroalbini
Member Author

This is done.

3 participants