AWS recently introduced the Sagemaker Operator for Kubernetes, which allows development teams to integrate Sagemaker services with existing Kubernetes infrastructure.
This project provides a set of scripts to build a container with a Mask R-CNN model, then train and deploy the model using an Amazon EKS cluster.
We use the TensorPack Mask/Faster R-CNN model implementation. The model is trained on the COCO 2017 dataset.
- You need an AWS account (reference). It's recommended to create all AWS resources in a single AWS region (e.g. us-west-2 or us-east-2).
- You need an EKS cluster with the Sagemaker Operator configured (setup reference).
- Ensure your AWS account has appropriate resource limits. You'll need at least 2 `ml.p3.16xlarge` instances for Mask R-CNN training; 4 instances are recommended.
- (recommended) Use Sagemaker managed Jupyter/JupyterLab notebooks. Note that your Sagemaker notebook needs an IAM role authorized to access the EKS cluster, and must have the `eksctl` and `kubectl` utilities configured. See the configuration steps in Step #2.
- Create an S3 bucket to store training data and training output (reference).
- Run `./prepare_s3_bucket.sh <YOUR_S3_BUCKET>`. This will upload the training dataset to your S3 bucket.
- Run `./build_and_push <YOUR_AWS_REGION>`. This will create the Mask R-CNN image with the training script and push it to AWS ECR. This operation will take 5-10 minutes to complete. If successful, you should see the URI of your image; you'll need it later.
- Update `train.yaml` as follows:
  - update "name" with a unique value; this will be the name of your Sagemaker training job;
  - update "trainingImage" with the URI of your container image (Prepare your environment, Step #3);
  - update "roleArn" with your Sagemaker execution role (reference);
  - update "region" with your AWS region;
  - update "S3OutputPath" and "inputDataPath" with your S3 bucket.
- Run `kubectl apply -f train.yaml` in a terminal. This will schedule the Sagemaker training job.
- Monitor your job in the Sagemaker console or by running `kubectl describe trainingjob`.
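To make the `train.yaml` edits above more concrete, here is a rough sketch of how those fields typically map onto the Sagemaker Operator's `TrainingJob` custom resource. All values are illustrative placeholders, and the exact schema may differ between operator versions; the `train.yaml` shipped with this repository is authoritative.

```yaml
apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: mask-rcnn-training                 # "name": unique training job name
spec:
  roleArn: arn:aws:iam::123456789012:role/SagemakerExecutionRole   # "roleArn"
  region: us-west-2                        # "region": your AWS region
  algorithmSpecification:
    trainingImage: 123456789012.dkr.ecr.us-west-2.amazonaws.com/mask-rcnn:latest   # "trainingImage"
    trainingInputMode: File
  outputDataConfig:
    s3OutputPath: s3://<YOUR_S3_BUCKET>/output      # "S3OutputPath"
  inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://<YOUR_S3_BUCKET>/data         # "inputDataPath"
  resourceConfig:
    instanceCount: 2                       # at least 2 ml.p3.16xlarge (see prerequisites)
    instanceType: ml.p3.16xlarge
    volumeSizeInGB: 100
  stoppingCondition:
    maxRuntimeInSeconds: 86400
```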
Work in progress. The following tasks are pending:
- Use a separate Sagemaker TF Serving image for inference.
  - Details on deployment of TF Serving containers in Sagemaker: https://sagemaker.readthedocs.io/en/stable/using_tf.html#deploy-tensorflow-serving-models
- Export the trained Tensorpack model into a format compatible with the TF Serving container (requires updating `train.py`).
  - Tensorpack has methods to export to the model.pb format: https://github.com/tensorpack/tensorpack/blob/master/examples/basics/export-model.py and https://tensorpack.readthedocs.io/_modules/tensorpack/tfutils/export.html
- Implement Sagemaker-specific code:
  - handle input and output data of inference requests;
  - test that the exported Tensorpack model implements the standard TF Serving interface.
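For the "handle input and output data" task, the Sagemaker TF Serving container convention is an `inference.py` exposing `input_handler`/`output_handler` functions. The sketch below is a minimal starting point, not this project's implementation: it assumes the model will accept base64-encoded image bytes via the TF Serving REST `"b64"` convention, which still needs to be confirmed against the exported Tensorpack signature.

```python
import base64
import json


def input_handler(data, context):
    """Pre-process a request into a TF Serving REST predict payload.

    `data` is a stream containing the raw request body; `context` carries
    request metadata such as `request_content_type` (Sagemaker TF Serving
    container convention).
    """
    if context.request_content_type == "application/json":
        # JSON bodies pass straight through to TF Serving.
        return data.read().decode("utf-8")
    if context.request_content_type == "application/x-image":
        # Wrap raw image bytes as a base64 "b64" instance, the TF Serving
        # REST API convention for binary tensor inputs.
        encoded = base64.b64encode(data.read()).decode("utf-8")
        return json.dumps({"instances": [{"b64": encoded}]})
    raise ValueError(
        "Unsupported content type: {}".format(context.request_content_type)
    )


def output_handler(response, context):
    """Post-process the TF Serving response; returns (body, content type)."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    return response.content, context.accept_header
```

Mask R-CNN output (boxes, classes, masks) will likely need extra post-processing in `output_handler` before it is useful to callers.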
The Mask R-CNN training script and Docker image are copied from this AWS repository.