Single Stage Detection with MLCube™ [request for feedback] #465

@sergey-serebryakov commented Apr 22, 2021

Updates

  • 06/05/2021-02 Added the --force-reinstall switch to the pip install command in the step-by-step guide below.
  • 06/05/2021-01 Fixed a bug: the "docker image exists" check now uses the docker command specified in the docker platform file; previously, the docker command was hard-coded for this check.
  • 05/05/2021-01 Added a missing dependency (unzip) to the MLCube™ Dockerfile.
  • 02/05/2021-01 Fixed errors in the Current implementation section related to installing MLCube from the GitHub repository.
  • 01/05/2021-01 The Vision section below now clearly states that it is not a working example.
  • 22/04/2021-01 All pending MLCube PRs have been merged into master.

Known problems

  • 05/05/2021 The user environment needs sudo to run Docker containers. A quick fix is to replace command: docker with command: sudo docker in docker.yaml, as sketched below.
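A minimal sketch of that workaround; only the one line in docker.yaml changes, and the surrounding keys stay as generated:

# docker.yaml: run containers through sudo if your user is not in the docker group
command: sudo docker    # was: command: docker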

Introduction

The MLCommons™ Best Practices WG is working to simplify the process of running ML workloads, including the MLCommons reference training and inference benchmarks. We have developed a prototype library that we call MLCube™.

The goal of this PR is to show how MLCube can be used to run MLCommons training and inference workloads, and to gather feedback.

Vision

This does not work yet! It requires a new MLCube release. This section describes our vision; the next section (Current implementation) shows a working example.

This section presents one possible way of interacting with MLCubes. To simplify the process of running ML models, users should only need to know the following:

  • They need to be aware of MLCube.
  • They need to know how to install it.
  • They need to know that they can run mlcube describe in an MLCube directory.

Install MLCube:

virtualenv -p python3 ./mlcube_env
source ./mlcube_env/bin/activate
pip install mlcube

Get the MLCommons SSD reference benchmark:

mlcube pull https://github.com/mlcommons/training --project single_stage_detector
cd ./single_stage_detector

Explore what tasks SSD MLCube supports:

mlcube describe

Run SSD benchmark using local Docker runtime:

# Download SSD dataset (~20 GB, ~40 GB space required)
mlcube run --task download_data --platform docker

# Download ResNet34 feature extractor
mlcube run --task download_model --platform docker

# Run benchmark
mlcube run --task train --platform docker

Current implementation

We'll be updating this section as we merge MLCube PRs and make new MLCube releases.

# Create Python environment 
virtualenv -p python3 ./env && source ./env/bin/activate

# Install MLCube and MLCube docker runner from GitHub repository (normally, users will just run `pip install mlcube mlcube_docker`)
git clone https://github.com/mlcommons/mlcube && cd ./mlcube
cd ./mlcube && python setup.py bdist_wheel  && pip install --force-reinstall ./dist/mlcube-* && cd ..
cd ./runners/mlcube_docker && python setup.py bdist_wheel  && pip install --force-reinstall --no-deps ./dist/mlcube_docker-* && cd ../../..

# Fetch the SSD workload
git clone https://github.com/mlcommons/training && cd ./training
git fetch origin pull/465/head:feature/mlcube-ssd && git checkout feature/mlcube-ssd
cd ./single_stage_detector

# Build MLCube docker image. We'll find a better way of integrating existing workloads
# with MLCube, so that MLCube runs this by itself (it can actually do it now, but in order
# to enable this, we would have to introduce more changes to the SSD repo).
docker build --build-arg http_proxy="${http_proxy}" --build-arg https_proxy="${https_proxy}" . -t mlcommons/train_ssd:0.0.1 -f Dockerfile.mlcube

# Show tasks implemented in this MLCube.
cd ./mlcube && mlcube describe

# Download SSD dataset (~20 GB, ~40 GB space required). Default paths = ./workspace/cache and ./workspace/data
# To override them, use --cache_dir=CACHE_DIR and --data_dir=DATA_DIR
mlcube run --task download_data --platform docker

# Download ResNet34 feature extractor. Default path = ./workspace/data
# To override, use: --data_dir=DATA_DIR
mlcube run --task download_model --platform docker

# Run benchmark. Default paths = ./workspace/data
# Parameters to override: --data_dir=DATA_DIR, --pretrained_backbone=PATH_TO_RESNET34_WEIGHTS, --parameters_file=PATH_TO_TRAINING_PARAMS
mlcube run --task train --platform docker
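For example, to train against data stored outside the default workspace, the overrides documented above can be combined as follows (the paths shown are illustrative, not part of the benchmark):

mlcube run --task train --platform docker \
    --data_dir=/datasets/coco2017 \
    --parameters_file=/configs/train_params.yaml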

@github-actions bot commented Apr 22, 2021

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@TheKanter changed the title from "Single Stage Detection with MLCube [request for feedback]" to "Single Stage Detection with MLCube™ [request for feedback]" Apr 23, 2021

class DownloadDataTask(object):

    urls = {

A contributor commented:

seems like this entire file except these urls is boilerplate copy/paste, so there should just be one copy of the code in the central mlcube distribution. Then these urls can go in the single_stage_detector/mlcube/.mlcube.yaml file.
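For illustration, a hedged sketch of what that could look like. The key layout below is hypothetical (the actual .mlcube.yaml schema may differ); the archives shown are the standard COCO 2017 downloads the SSD benchmark uses:

# single_stage_detector/mlcube/.mlcube.yaml (hypothetical layout)
tasks:
  download_data:
    urls:
      train_images: http://images.cocodataset.org/zips/train2017.zip
      val_images: http://images.cocodataset.org/zips/val2017.zip
      annotations: http://images.cocodataset.org/annotations/annotations_trainval2017.zip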



class DownloadModelTask(object):
    url = "https://download.pytorch.org/models/resnet34-333f7ec4.pth"

A contributor commented:

also move this to single_stage_detector/mlcube/.mlcube.yaml

cached_archive = cache_dir / archive_name
if not cached_archive.exists():
    print(f"Data ({name}) is not in cache ({cached_archive}), downloading ...")
    os.system(f"cd {cache_dir}; curl -O {DownloadDataTask.urls[name]};")

@matthew-frank commented Jun 29, 2021:
this should be followed by an md5sum check (and we should have the md5sum for each downloaded file in addition to just the url)
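A minimal sketch of such a check, assuming a hypothetical DownloadDataTask.md5sums dict that maps each archive name to its expected digest:

import hashlib

def md5_of(path, chunk_size=1 << 20):
    # Compute the MD5 digest of a file, reading in 1 MiB chunks.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# After the curl download, verify the archive before copying/extracting it.
expected = DownloadDataTask.md5sums[name]  # hypothetical dict: name -> expected md5
actual = md5_of(cached_archive)
if actual != expected:
    raise RuntimeError(f"MD5 mismatch for {cached_archive}: expected {expected}, got {actual}")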

shutil.copyfile(cached_archive, dest_archive)

print(f"Extracting archive ({archive_name}) ...")
os.system(f"cd {data_dir}; unzip {archive_name};")

A contributor commented:

it would be helpful to have a consistency check here as well (presumably not an md5sum of every file, but some indication that the user has the right data; for this case, perhaps the number of .jpgs?)
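A minimal sketch of that idea. The expected counts below are the documented sizes of the COCO 2017 splits; the directory layout is an assumption:

from pathlib import Path

# COCO 2017 ships 118287 train and 5000 val images; adjust if the layout differs.
EXPECTED_JPEGS = {"train2017": 118287, "val2017": 5000}

def check_image_count(data_dir, split):
    # Warn if the number of extracted .jpg files differs from the expected count.
    found = sum(1 for _ in (Path(data_dir) / split).glob("*.jpg"))
    if found != EXPECTED_JPEGS[split]:
        print(f"WARNING: {split} contains {found} .jpg files, expected {EXPECTED_JPEGS[split]}")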

-export DATASET_DIR="/data/coco2017"
-export TORCH_MODEL_ZOO="/data/torchvision"
+export DATASET_DIR=${DATASET_DIR:-"/data/coco2017"}
+export TORCH_MODEL_ZOO=${TORCH_MODEL_ZOO:-"/data/torchvision"}

A contributor commented:

I believe the TORCH_MODEL_ZOO variable is unnecessary if you are pre-downloading the resnet34-333f7ec4.pth file and using the --pretrained_backbone script argument. Also, PyTorch changes the name of this env var in almost every version, so you can't really depend on it (which is part of why we implemented the --pretrained_backbone script argument).
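For example, once download_model has fetched the weights, they can be passed explicitly through the parameter documented above (the workspace path is illustrative):

mlcube run --task train --platform docker \
    --pretrained_backbone=workspace/data/resnet34-333f7ec4.pth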

@matthew-frank commented:

This PR refers to the old, retired ssd-v1 benchmark, which was replaced by the RetinaNet benchmark.

In an effort to do a better job maintaining this repo, we're closing PRs for retired benchmarks. The old benchmark code still exists, but has been moved to https://github.com/mlcommons/training/tree/master/retired_benchmarks/ssd-v1/.

If you think there is useful cleanup to be done to the retired_benchmarks subtree, please submit a new PR.

@github-actions bot locked and limited conversation to collaborators Dec 2, 2022