Running Mask-RCNN using Sagemaker Operators for EKS

vdabravolski/Mask-RCNN-EKS

Train and deploy a Mask R-CNN model using the SageMaker Operator for Kubernetes

AWS recently introduced the SageMaker Operator for Kubernetes, which allows development teams to integrate SageMaker services with existing Kubernetes infrastructure.

Scope

This project provides a set of scripts to build a container with the Mask R-CNN model, then train and deploy the model using an Amazon EKS cluster.

We use the TensorPack Mask/Faster-RCNN model implementation. The model is trained on the COCO 2017 dataset.

Prerequisites

  1. You need an AWS account (reference). It's recommended to create all AWS resources in a single AWS region (e.g. us-west-2 or us-east-2).
  2. You need an EKS cluster with the SageMaker Operator configured (setup reference).
  3. Ensure your AWS account has appropriate resource limits. You'll need at least 2 ml.p3.16xlarge instances for Mask R-CNN training; 4 instances are recommended.
  4. (Recommended) Use SageMaker managed Jupyter/JupyterLab notebooks. Note that your SageMaker notebook needs an IAM role that is authorized to access the EKS cluster, and it must have the eksctl and kubectl utilities configured. See the configuration steps in Step #2.
  5. Create an S3 bucket to store training data and training output (reference).
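
Before starting, it can help to confirm that the command-line utilities named in the prerequisites are actually on your PATH. The following is a minimal sketch, not part of this repository: the function name check_tools is hypothetical, and it only checks that each tool can be found, not that it is configured for your cluster.

```shell
# Hypothetical helper (not part of this repository).
# Reports which of the given command-line tools are missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call with the utilities listed in the prerequisites:
#   check_tools eksctl kubectl
```

A non-zero return value means at least one tool was not found; the names of the missing tools are written to standard error.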

Training the Mask R-CNN model

  1. Run ./prepare_s3_bucket.sh <YOUR_S3_BUCKET>. This uploads the training dataset to your S3 bucket.
  2. Run ./build_and_push <YOUR_AWS_REGION>. This builds the Mask R-CNN image with the training script and pushes it to Amazon ECR. The operation takes 5-10 minutes. If it succeeds, you should see the URI of your image; you will need it in the next step.
  3. Update train.yaml as follows:
  • update the "name" field with a unique value. This will be the name of your SageMaker training job;
  • update "trainingImage" with the URI of your container image (the URI shown in step 2);
  • update "roleArn" with your SageMaker execution role (reference);
  • update "region" with your AWS region;
  • update "S3OutputPath" and "inputDataPath" with your S3 bucket.
  4. Run kubectl apply -f train.yaml in a terminal. This schedules the SageMaker training job.
  5. Monitor your job in the SageMaker console or by running kubectl describe trainingjob.
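
Because the job name must be unique and the other values are easy to mistype, a small pre-flight check before editing train.yaml may save a failed submission. The sketch below is an illustration under stated assumptions, not part of this repository: make_job_name and validate_train_values are hypothetical helpers, the name is restricted to lowercase letters, digits, and hyphens (at most 63 characters) so that it is acceptable both as a Kubernetes object name and as a SageMaker job name, and the other checks only test the general shape of an ECR image URI, an IAM role ARN, and an S3 path.

```shell
# Hypothetical helpers (not part of this repository).

# Build a job name of the form <prefix>-YYYYMMDD-HHMMSS (UTC timestamp).
make_job_name() {
  printf '%s-%s\n' "$1" "$(date -u +%Y%m%d-%H%M%S)"
}

# Sanity-check values before copying them into train.yaml.
# Args: job name, image URI, role ARN, S3 path. Returns 0 if all look plausible.
validate_train_values() {
  ok=0
  # Name: lowercase letters, digits, hyphens; no leading or trailing hyphen.
  case "$1" in
    ''|*[!a-z0-9-]*|-*|*-) echo "bad job name: $1" >&2; ok=1 ;;
  esac
  [ "${#1}" -le 63 ] || { echo "job name longer than 63 characters" >&2; ok=1; }
  # Image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>[:tag]
  case "$2" in
    *.dkr.ecr.*.amazonaws.com*/*) ;;
    *) echo "bad image URI: $2" >&2; ok=1 ;;
  esac
  # Role ARN: arn:aws:iam::<account>:role/<name>
  case "$3" in
    arn:aws*:iam::*:role/*) ;;
    *) echo "bad role ARN: $3" >&2; ok=1 ;;
  esac
  # S3 path: s3://<bucket>[/<prefix>]
  case "$4" in
    s3://?*) ;;
    *) echo "bad S3 path: $4" >&2; ok=1 ;;
  esac
  return "$ok"
}

# Example flow (placeholders in angle brackets are yours to fill in):
#   JOB_NAME=$(make_job_name mask-rcnn)
#   validate_train_values "$JOB_NAME" "<IMAGE_URI>" "<ROLE_ARN>" "s3://<YOUR_S3_BUCKET>"
#   kubectl apply -f train.yaml
#   kubectl describe trainingjob
```

These checks catch only formatting mistakes; they do not verify that the image, role, or bucket exists in your account.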

Hosting the trained model

Work in progress. The following tasks are pending:

  1. Use a separate SageMaker TF Serving image for inference.
  2. Export the trained Tensorpack model into a format compatible with the TF Serving container (requires updating train.py).
  3. Implement SageMaker-specific code:
    • handle the input and output data of inference requests;
    • test that the exported Tensorpack model implements the standard TF Serving interface.

Credits

The Mask-RCNN training script and Docker image are copied from this AWS repository.
