AWS recently introduced the Sagemaker Operator for Kubernetes, which allows development teams to integrate Sagemaker services with existing Kubernetes infrastructure.
This project provides a set of scripts to build a container with a Mask R-CNN model, then train and deploy the model using an Amazon EKS cluster.
We use the TensorPack Mask/Faster R-CNN model implementation. The model is trained on the COCO 2017 dataset.
- You need an AWS account (reference). It's recommended to create all AWS resources in a single AWS region (e.g. us-west-2 or us-east-2).
- You need an EKS cluster with the Sagemaker Operator configured (setup reference).
- Ensure your AWS account has appropriate resource limits. You'll need at least 2 `ml.p3.16xlarge` instances for Mask R-CNN training; 4 instances are recommended.
- (recommended) Use Sagemaker managed Jupyter/JupyterLab notebooks. Note that your Sagemaker notebook needs an IAM role authorized to access the EKS cluster, and must have the `eksctl` and `kubectl` utilities configured. See the configuration steps in Step #2.
- Create an S3 bucket to store training data and training output (reference).
- Run `./prepare_s3_bucket.sh <YOUR_S3_BUCKET>`. This will upload the training dataset to your S3 bucket.
- Run `./build_and_push <YOUR_AWS_REGION>`. This will create the Mask R-CNN image with the training script and push it to AWS ECR. This operation will take 5-10 minutes to complete. If successful, you should see the URI of your image; you'll need it later.
- Update `train.yaml` as follows:
  - update "name" with a unique value; this will be the name of your Sagemaker training job;
  - update "trainingImage" with the URI of your container image (Prepare your environment, Step #3);
  - update "roleArn" with your Sagemaker execution role (reference);
  - update "region" with your AWS region;
  - update "S3OutputPath" and "inputDataPath" with your S3 bucket.
- Run `kubectl apply -f train.yaml` in a terminal. This will schedule the Sagemaker training job.
- Monitor your job in the Sagemaker console or by running `kubectl describe trainingjob`.
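To make the `train.yaml` edits above more concrete, here is a rough sketch of how those fields typically map onto the Sagemaker Operator's `TrainingJob` custom resource. All values are illustrative placeholders, and the exact schema may differ between operator versions; the `train.yaml` shipped with this repository is authoritative.

```yaml
apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: mask-rcnn-training                 # "name": unique training job name
spec:
  roleArn: arn:aws:iam::123456789012:role/SagemakerExecutionRole   # "roleArn"
  region: us-west-2                        # "region": your AWS region
  algorithmSpecification:
    trainingImage: 123456789012.dkr.ecr.us-west-2.amazonaws.com/mask-rcnn:latest   # "trainingImage"
    trainingInputMode: File
  outputDataConfig:
    s3OutputPath: s3://<YOUR_S3_BUCKET>/output      # "S3OutputPath"
  inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://<YOUR_S3_BUCKET>/data         # "inputDataPath"
  resourceConfig:
    instanceCount: 2                       # at least 2 ml.p3.16xlarge (see prerequisites)
    instanceType: ml.p3.16xlarge
    volumeSizeInGB: 100
  stoppingCondition:
    maxRuntimeInSeconds: 86400
```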
Work in progress. The following tasks are pending:
- Use a separate Sagemaker TF Serving image for inference.
  - Details on deployment of TF Serving containers in Sagemaker: https://sagemaker.readthedocs.io/en/stable/using_tf.html#deploy-tensorflow-serving-models
- Export the trained Tensorpack model into a format compatible with the TF Serving container (requires updating `train.py`).
  - Tensorpack has methods to export to the model.pb format: https://github.com/tensorpack/tensorpack/blob/master/examples/basics/export-model.py and https://tensorpack.readthedocs.io/_modules/tensorpack/tfutils/export.html
- Implement Sagemaker-specific code:
  - handle input and output data of inference requests;
  - test that the exported Tensorpack model implements the standard TF Serving interface.
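For the "handle input and output data" task, the Sagemaker TF Serving container convention is an `inference.py` exposing `input_handler`/`output_handler` functions. The sketch below is a minimal starting point, not this project's implementation: it assumes the model will accept base64-encoded image bytes via the TF Serving REST `"b64"` convention, which still needs to be confirmed against the exported Tensorpack signature.

```python
import base64
import json


def input_handler(data, context):
    """Pre-process a request into a TF Serving REST predict payload.

    `data` is a stream containing the raw request body; `context` carries
    request metadata such as `request_content_type` (Sagemaker TF Serving
    container convention).
    """
    if context.request_content_type == "application/json":
        # JSON bodies pass straight through to TF Serving.
        return data.read().decode("utf-8")
    if context.request_content_type == "application/x-image":
        # Wrap raw image bytes as a base64 "b64" instance, the TF Serving
        # REST API convention for binary tensor inputs.
        encoded = base64.b64encode(data.read()).decode("utf-8")
        return json.dumps({"instances": [{"b64": encoded}]})
    raise ValueError(
        "Unsupported content type: {}".format(context.request_content_type)
    )


def output_handler(response, context):
    """Post-process the TF Serving response; returns (body, content type)."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    return response.content, context.accept_header
```

Mask R-CNN output (boxes, classes, masks) will likely need extra post-processing in `output_handler` before it is useful to callers.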
The Mask R-CNN training script and Docker image are copied from this AWS repository.