Train and deploy a Mask R-CNN model using the SageMaker Operator for Kubernetes

AWS recently introduced the SageMaker Operator for Kubernetes, which allows development teams to integrate SageMaker services with their existing Kubernetes infrastructure.

Scope

This project provides a set of scripts to build a container with the Mask R-CNN model, then train and deploy the model using an Amazon EKS cluster.

We use the Tensorpack Mask/Faster-RCNN model implementation. The model is trained on the COCO 2017 dataset.

You need an AWS account (reference). It is recommended to create all AWS resources in a single AWS region (e.g. us-west-2 or us-east-2).
You need an EKS cluster with the SageMaker Operator configured (setup reference).
Ensure your AWS account has appropriate resource limits. You'll need at least 2 ml.p3.16xlarge instances for Mask R-CNN training; 4 instances are recommended.
(recommended) Use SageMaker managed Jupyter/JupyterLab notebooks. Note that your SageMaker notebook needs an IAM role that is authorized to access the EKS cluster, and it needs the eksctl and kubectl utilities configured. See the configuration steps in Step #2.
Create an S3 bucket to store training data and training output (reference).
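A quick pre-flight check can catch a missing utility before any script is run. The following is a minimal sketch, not part of this project: the function name check_tools is hypothetical, and including the AWS CLI (aws) in the list is an assumption, since the prerequisites above only name eksctl and kubectl.

```shell
#!/bin/sh
# Report which of the given command-line tools are missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
# (check_tools is a hypothetical helper, not shipped with this project.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tools aws eksctl kubectl prints one line per tool and returns a non-zero status if any of them is not installed.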

Run ./prepare_s3_bucket.sh <YOUR_S3_BUCKET>. This will upload the training dataset to your S3 bucket.
Run ./build_and_push <YOUR_AWS_REGION>. This will create the Mask R-CNN image with the training script and push it to Amazon ECR. This operation takes 5-10 minutes to complete. If successful, you should see the URI of your image. You'll need it later.
Update train.yaml as follows:

update the "name" field with a unique value. This will be the name of your SageMaker training job;
update "trainingImage" with the URI of your container image (Prepare your environment - Step #3);
update "roleArn" with your SageMaker execution role (reference);
update "region" with your AWS region;
update "S3OutputPath" and "inputDataPath" with your S3 bucket.
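Before applying the file, it can help to cross-check the values listed above. The sketch below makes two assumptions: that the image lives in a standard-region ECR registry, whose URIs follow the pattern <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>, and that the field names quoted above appear literally in train.yaml. Both helper names are hypothetical, and the field check is only a coarse text match, not a YAML validation.

```shell
#!/bin/sh
# Compose an ECR image URI from its parts (standard AWS regions).
# Format: <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ecr_image_uri() {
    account_id=$1
    region=$2
    repository=$3
    tag=${4:-latest}
    echo "${account_id}.dkr.ecr.${region}.amazonaws.com/${repository}:${tag}"
}

# Check that every field named in the list above occurs somewhere in the file.
# Matching is case-insensitive because the exact key casing in train.yaml
# is an assumption. Returns 0 if all fields are found, 1 otherwise.
check_train_yaml() {
    file=$1
    status=0
    for field in name trainingImage roleArn region S3OutputPath inputDataPath; do
        if ! grep -qi "$field" "$file"; then
            echo "field not found: $field"
            status=1
        fi
    done
    return "$status"
}
```

The account ID can be read with aws sts get-caller-identity --query Account --output text. The repository name and tag are whatever the build script printed, so the composed URI is only a way to double-check the value pasted into "trainingImage".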

Run kubectl apply -f train.yaml in a terminal. This will schedule the SageMaker training job.
Monitor your job in the SageMaker console or by running kubectl describe trainingjob.
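For unattended runs, the monitoring step can be wrapped in a small polling loop. This is a sketch under stated assumptions: the helper names are hypothetical, and the JSONPath field .status.trainingJobStatus is an assumption about how the operator exposes the job status, so confirm it against the output of kubectl describe trainingjob before relying on it. The status values themselves (InProgress, Completed, Failed, Stopping, Stopped) are the ones SageMaker reports for training jobs.

```shell
#!/bin/sh
# Return 0 if the given SageMaker training job status is terminal.
# SageMaker reports InProgress, Completed, Failed, Stopping, or Stopped.
is_terminal_status() {
    case "$1" in
        Completed|Failed|Stopped) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll a TrainingJob resource until it reaches a terminal status.
# Returns 0 if the job completed, 1 if it failed or was stopped,
# and 2 if kubectl itself failed.
# ASSUMPTION: the status is exposed at .status.trainingJobStatus.
wait_for_training_job() {
    job_name=$1
    interval=${2:-60}   # seconds between polls
    while :; do
        status=$(kubectl get trainingjob "$job_name" \
            -o jsonpath='{.status.trainingJobStatus}') || return 2
        echo "$(date '+%H:%M:%S') $job_name: ${status:-unknown}"
        if is_terminal_status "$status"; then
            [ "$status" = "Completed" ]
            return
        fi
        sleep "$interval"
    done
}
```

Calling wait_for_training_job with the value of the "name" field from train.yaml blocks until the job finishes, which makes it easy to chain follow-up steps in a script.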

Work in progress. The following tasks are pending:

Use a separate SageMaker TF Serving image for inference.
- Details on deploying TF Serving containers in SageMaker: https://sagemaker.readthedocs.io/en/stable/using_tf.html#deploy-tensorflow-serving-models
Export the trained Tensorpack model into a format compatible with the TF Serving container (requires updating train.py).
- Tensorpack has methods to export to the model.pb format: https://github.com/tensorpack/tensorpack/blob/master/examples/basics/export-model.py & https://tensorpack.readthedocs.io/_modules/tensorpack/tfutils/export.html
Implement SageMaker-specific code:
- handle input and output data of inference requests;
- test that the exported Tensorpack model implements the standard TF Serving interface.

The Mask R-CNN training script and Docker image are copied from this AWS repository.