Single Stage Detection with MLCube™ [request for feedback] #465

@sergey-serebryakov commented Apr 22, 2021

Updates

  • 06/05/2021-02 Added the --force-reinstall switch to the pip install command in the step-by-step guide below.
  • 06/05/2021-01 Fixed a bug: the "docker image exists" check now uses the docker command specified in the docker platform file; previously, the docker command was hard-coded for this check.
  • 05/05/2021-01 Added a missing dependency (unzip) to the MLCube™ Dockerfile.
  • 02/05/2021-01 Fixed errors in the Current implementation section related to installing MLCube from the GitHub repository.
  • 01/05/2021-01 The Vision section below now clearly states that it is not a working example.
  • 22/04/2021-01 All pending MLCube PRs have been merged into master.

Known problems

  • 05/05/2021 The user environment needs sudo to run Docker containers. A quick fix is to replace command: docker with command: sudo docker in docker.yaml, as sketched below.
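A minimal sketch of that workaround; only the one line in docker.yaml changes, and the surrounding keys stay as generated:

# docker.yaml: run containers through sudo if your user is not in the docker group
command: sudo docker    # was: command: docker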

Introduction

The MLCommons™ Best Practices WG is working to simplify the process of running ML workloads, including the MLCommons reference training and inference benchmarks. We have developed a prototype library that we call MLCube™.

The goal of this PR is to show how MLCube can be used to run MLCommons training and inference workloads, and to gather feedback.

Vision

This does not work yet! It requires a new MLCube release. This section describes our vision; the next section (Current implementation) shows a working example.

This section presents one possible way of interacting with MLCubes. To simplify the process of running ML models, users should only need to know the following:

  • They need to be aware of MLCube.
  • They need to know how to install it.
  • They need to know that they can run mlcube describe in an MLCube directory.

Install MLCube:

virtualenv -p python3 ./mlcube_env
source ./mlcube_env/bin/activate
pip install mlcube

Get the MLCommons SSD reference benchmark:

mlcube pull https://github.com/mlcommons/training --project single_stage_detector
cd ./single_stage_detector

Explore what tasks SSD MLCube supports:

mlcube describe

Run SSD benchmark using local Docker runtime:

# Download SSD dataset (~20 GB, ~40 GB space required)
mlcube run --task download_data --platform docker

# Download ResNet34 feature extractor
mlcube run --task download_model --platform docker

# Run benchmark
mlcube run --task train --platform docker

Current implementation

We'll be updating this section as we merge MLCube PRs and make new MLCube releases.

# Create Python environment 
virtualenv -p python3 ./env && source ./env/bin/activate

# Install MLCube and MLCube docker runner from GitHub repository (normally, users will just run `pip install mlcube mlcube_docker`)
git clone https://github.com/mlcommons/mlcube && cd ./mlcube
cd ./mlcube && python setup.py bdist_wheel  && pip install --force-reinstall ./dist/mlcube-* && cd ..
cd ./runners/mlcube_docker && python setup.py bdist_wheel  && pip install --force-reinstall --no-deps ./dist/mlcube_docker-* && cd ../../..

# Fetch the SSD workload
git clone https://github.com/mlcommons/training && cd ./training
git fetch origin pull/465/head:feature/mlcube-ssd && git checkout feature/mlcube-ssd
cd ./single_stage_detector

# Build MLCube docker image. We'll find a better way of integrating existing workloads
# with MLCube, so that MLCube runs this by itself (it can actually do it now, but in order
# to enable this, we would have to introduce more changes to the SSD repo).
docker build --build-arg http_proxy="${http_proxy}" --build-arg https_proxy="${https_proxy}" . -t mlcommons/train_ssd:0.0.1 -f Dockerfile.mlcube

# Show tasks implemented in this MLCube.
cd ./mlcube && mlcube describe

# Download SSD dataset (~20 GB, ~40 GB space required). Default paths = ./workspace/cache and ./workspace/data
# To override them, use --cache_dir=CACHE_DIR and --data_dir=DATA_DIR
mlcube run --task download_data --platform docker

# Download ResNet34 feature extractor. Default path = ./workspace/data
# To override, use: --data_dir=DATA_DIR
mlcube run --task download_model --platform docker

# Run benchmark. Default paths = ./workspace/data
# Parameters to override: --data_dir=DATA_DIR, --pretrained_backbone=PATH_TO_RESNET34_WEIGHTS, --parameters_file=PATH_TO_TRAINING_PARAMS
mlcube run --task train --platform docker
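For example, to train against data stored outside the default workspace, the overrides documented above can be combined as follows (the paths shown are illustrative, not part of the benchmark):

mlcube run --task train --platform docker \
    --data_dir=/datasets/coco2017 \
    --parameters_file=/configs/train_params.yaml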

@github-actions bot commented Apr 22, 2021

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@TheKanter changed the title from "Single Stage Detection with MLCube [request for feedback]" to "Single Stage Detection with MLCube™ [request for feedback]" Apr 23, 2021

class DownloadDataTask(object):

    urls = {

A contributor commented:

seems like this entire file except these urls is boilerplate copy/paste, so there should just be one copy of the code in the central mlcube distribution. Then these urls can go in the single_stage_detector/mlcube/.mlcube.yaml file.
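For illustration, a hedged sketch of what that could look like. The key layout below is hypothetical (the actual .mlcube.yaml schema may differ); the archives shown are the standard COCO 2017 downloads the SSD benchmark uses:

# single_stage_detector/mlcube/.mlcube.yaml (hypothetical layout)
tasks:
  download_data:
    urls:
      train_images: http://images.cocodataset.org/zips/train2017.zip
      val_images: http://images.cocodataset.org/zips/val2017.zip
      annotations: http://images.cocodataset.org/annotations/annotations_trainval2017.zip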



class DownloadModelTask(object):
    url = "https://download.pytorch.org/models/resnet34-333f7ec4.pth"

A contributor commented:

also move this to single_stage_detector/mlcube/.mlcube.yaml

cached_archive = cache_dir / archive_name
if not cached_archive.exists():
    print(f"Data ({name}) is not in cache ({cached_archive}), downloading ...")
    os.system(f"cd {cache_dir}; curl -O {DownloadDataTask.urls[name]};")

@matthew-frank commented Jun 29, 2021:
this should be followed by an md5sum check (and we should have the md5sum for each downloaded file in addition to just the url)
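A minimal sketch of such a check, assuming a hypothetical DownloadDataTask.md5sums dict that maps each archive name to its expected digest:

import hashlib

def md5_of(path, chunk_size=1 << 20):
    # Compute the MD5 digest of a file, reading in 1 MiB chunks.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# After the curl download, verify the archive before copying/extracting it.
expected = DownloadDataTask.md5sums[name]  # hypothetical dict: name -> expected md5
actual = md5_of(cached_archive)
if actual != expected:
    raise RuntimeError(f"MD5 mismatch for {cached_archive}: expected {expected}, got {actual}")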

shutil.copyfile(cached_archive, dest_archive)

print(f"Extracting archive ({archive_name}) ...")
os.system(f"cd {data_dir}; unzip {archive_name};")

A contributor commented:

it would be helpful to have a consistency check here as well (presumably not an md5sum of every file, but some indication that the user has the right data; for this case, perhaps the number of .jpgs?)
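A minimal sketch of that idea. The expected counts below are the documented sizes of the COCO 2017 splits; the directory layout is an assumption:

from pathlib import Path

# COCO 2017 ships 118287 train and 5000 val images; adjust if the layout differs.
EXPECTED_JPEGS = {"train2017": 118287, "val2017": 5000}

def check_image_count(data_dir, split):
    # Warn if the number of extracted .jpg files differs from the expected count.
    found = sum(1 for _ in (Path(data_dir) / split).glob("*.jpg"))
    if found != EXPECTED_JPEGS[split]:
        print(f"WARNING: {split} contains {found} .jpg files, expected {EXPECTED_JPEGS[split]}")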

-export DATASET_DIR="/data/coco2017"
-export TORCH_MODEL_ZOO="/data/torchvision"
+export DATASET_DIR=${DATASET_DIR:-"/data/coco2017"}
+export TORCH_MODEL_ZOO=${TORCH_MODEL_ZOO:-"/data/torchvision"}

A contributor commented:

I believe the TORCH_MODEL_ZOO variable is unnecessary if you are pre-downloading the resnet34-333f7ec4.pth file and using the --pretrained_backbone script argument. Also, PyTorch changes the name of this env var in almost every version, so you can't really depend on it (which is part of why we implemented the --pretrained_backbone script argument).
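For example, once download_model has fetched the weights, they can be passed explicitly through the parameter documented above (the workspace path is illustrative):

mlcube run --task train --platform docker \
    --pretrained_backbone=workspace/data/resnet34-333f7ec4.pth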

@matthew-frank commented:

This PR refers to the old, retired ssd-v1 benchmark, which was replaced by the RetinaNet benchmark.

In an effort to do a better job maintaining this repo, we're closing PRs for retired benchmarks. The old benchmark code still exists, but has been moved to https://github.com/mlcommons/training/tree/master/retired_benchmarks/ssd-v1/.

If you think there is useful cleanup to be done to the retired_benchmarks subtree, please submit a new PR.

@github-actions bot locked and limited conversation to collaborators Dec 2, 2022