SageMaker Model Hosting L2 Constructs

  • Original Author(s): @petermeansrock, @mattmcclean, @l2yao, @jetterdj, @foxpro24, @rangoju
  • Tracking Issue: #431
  • API Bar Raiser: @kaizencc

This feature supports the creation of Amazon SageMaker real-time inference hosted endpoints using a new set of L2 constructs for the Endpoint, EndpointConfig, and Model CloudFormation resources.

Working Backwards


feat(sagemaker): add model hosting L2 constructs


Amazon SageMaker Construct Library

cfn-resources: Stable

All classes with the Cfn prefix in this module (CFN Resources) are always stable and safe to use.

cdk-constructs: Experimental

The APIs of higher level constructs in this module are experimental and under active development. They are subject to non-backward compatible changes or removal in any future version. These are not subject to the Semantic Versioning model and breaking changes will be announced in the release notes. This means that while you may use them, you may need to update your source code when upgrading to a newer version of this package.

Amazon SageMaker provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly. Amazon SageMaker is a fully managed service that covers the entire machine learning workflow: label and prepare your data, choose an algorithm, train the model, tune and optimize it for deployment, make predictions, and take action. Your models get to production faster with much less effort and lower cost.


Install the module:

$ npm i @aws-cdk/aws-sagemaker

Import it into your code:

import * as sagemaker from '@aws-cdk/aws-sagemaker';


To create a machine learning model with Amazon SageMaker, use the Model construct. This construct includes properties that can be configured to define model components, including the model inference code as a Docker image and an optional set of separate model data artifacts. See the AWS documentation to learn more about SageMaker models.

Single Container Model

If a single container is sufficient for your inference use case, you can define a single-container model:

import * as sagemaker from '@aws-cdk/aws-sagemaker';
import * as path from 'path';

const image = sagemaker.ContainerImage.fromAsset(path.join('path', 'to', 'Dockerfile', 'directory'));
const modelData = sagemaker.ModelData.fromAsset(path.join('path', 'to', 'artifact', 'file.tar.gz'));

const model = new sagemaker.Model(this, 'PrimaryContainerModel', {
  containers: [
    {
      image: image,
      modelData: modelData,
    },
  ],
});

Inference Pipeline Model

An inference pipeline is an Amazon SageMaker model that is composed of a linear sequence of multiple containers that process requests for inferences on data. See the AWS documentation to learn more about SageMaker inference pipelines. To define an inference pipeline, you can provide additional containers for your model:

import * as sagemaker from '@aws-cdk/aws-sagemaker';

declare const image1: sagemaker.ContainerImage;
declare const modelData1: sagemaker.ModelData;
declare const image2: sagemaker.ContainerImage;
declare const modelData2: sagemaker.ModelData;
declare const image3: sagemaker.ContainerImage;
declare const modelData3: sagemaker.ModelData;

const model = new sagemaker.Model(this, 'InferencePipelineModel', {
  containers: [
    { image: image1, modelData: modelData1 },
    { image: image2, modelData: modelData2 },
    { image: image3, modelData: modelData3 },
  ],
});

Container Images

Inference code can be stored in Amazon Elastic Container Registry (Amazon ECR). It is specified via the image property of ContainerDefinition, which accepts an instance of a class that extends the ContainerImage abstract base class.

Asset Image

Reference a local directory containing a Dockerfile:

import * as sagemaker from '@aws-cdk/aws-sagemaker';
import * as path from 'path';

const image = sagemaker.ContainerImage.fromAsset(path.join('path', 'to', 'Dockerfile', 'directory'));

ECR Image

Reference an image available within ECR:

import * as ecr from '@aws-cdk/aws-ecr';
import * as sagemaker from '@aws-cdk/aws-sagemaker';

const repository = ecr.Repository.fromRepositoryName(this, 'Repository', 'repo');
const image = sagemaker.ContainerImage.fromEcrRepository(repository, 'tag');
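For images in ECR, the reference ultimately resolves to a registry path of the form registry/repository:tag or registry/repository@digest. The following plain TypeScript sketch illustrates only that string format. `ecrImageUri` is a hypothetical helper rather than part of the proposed API, and it assumes the commercial-partition DNS suffix `amazonaws.com`; other partitions use different suffixes.

```typescript
// Hypothetical helper illustrating the ECR image URI format; not part of the proposed API.
// Assumes the commercial-partition DNS suffix "amazonaws.com".
function ecrImageUri(
  account: string,
  region: string,
  repository: string,
  ref: { tag?: string; digest?: string } = {},
): string {
  if (ref.tag !== undefined && ref.digest !== undefined) {
    throw new Error('Specify either a tag or a digest, not both');
  }
  const registry = `${account}.dkr.ecr.${region}.amazonaws.com`;
  // Digests are appended with '@', tags with ':'.
  if (ref.digest !== undefined) {
    return `${registry}/${repository}@${ref.digest}`;
  }
  // Fall back to 'latest', matching the default tag of fromEcrRepository.
  return `${registry}/${repository}:${ref.tag ?? 'latest'}`;
}

console.log(ecrImageUri('012345678910', 'us-east-1', 'repo', { tag: 'tag' }));
```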

Model Artifacts

If you choose to decouple your model artifacts from your inference code (as is natural, given the different rates of change between inference code and model artifacts), you can specify the artifacts via the modelData property. This property accepts an instance of a class that extends the ModelData abstract base class. By default, no model artifacts are associated with a model.

Asset Model Data

Reference local model data:

import * as sagemaker from '@aws-cdk/aws-sagemaker';
import * as path from 'path';

const modelData = sagemaker.ModelData.fromAsset(path.join('path', 'to', 'artifact', 'file.tar.gz'));

S3 Model Data

Reference an S3 bucket and object key as the artifacts for a model:

import * as s3 from '@aws-cdk/aws-s3';
import * as sagemaker from '@aws-cdk/aws-sagemaker';

const bucket = new s3.Bucket(this, 'MyBucket');
const modelData = sagemaker.ModelData.fromBucket(bucket, 'path/to/artifact/file.tar.gz');
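Under the ModelDataConfig contract, model data resolves to an S3 URI that must point to a single gzip-compressed tar archive. The sketch below illustrates what that URI looks like for a given bucket and object key, using a hypothetical helper, `s3ModelDataUri`. The suffix check mirrors the documented requirement and is not taken from the construct's implementation.

```typescript
// Hypothetical helper: builds the S3 URI that a ModelDataConfig-style `uri` would carry.
// The .tar.gz check mirrors the documented requirement; it is not the CDK implementation.
function s3ModelDataUri(bucketName: string, objectKey: string): string {
  if (!objectKey.endsWith('.tar.gz')) {
    throw new Error(`Model artifacts must be a gzip-compressed tar archive (.tar.gz): ${objectKey}`);
  }
  // Strip leading slashes so the URI does not contain an empty path segment.
  const key = objectKey.replace(/^\/+/, '');
  return `s3://${bucketName}/${key}`;
}

console.log(s3ModelDataUri('my-bucket', 'path/to/artifact/file.tar.gz'));
```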

Model Hosting

Amazon SageMaker provides model hosting services for model deployment. It exposes an HTTPS endpoint through which your machine learning model serves inferences.

Endpoint Configuration

By using the EndpointConfig construct, you can define an endpoint configuration that can be used to provision one or more endpoints. In this configuration, you identify one or more models to deploy and the resources that you want Amazon SageMaker to provision. You define one or more production variants, each of which identifies a model. Each production variant also describes the resources that you want Amazon SageMaker to provision. If you are hosting multiple models, you also assign a variant weight to specify how much traffic you want to allocate to each model. For example, suppose that you want to host two models, A and B, and you assign traffic weight 2 for model A and 1 for model B. Amazon SageMaker distributes two-thirds of the traffic to model A and one-third to model B:

import * as sagemaker from '@aws-cdk/aws-sagemaker';

declare const modelA: sagemaker.Model;
declare const modelB: sagemaker.Model;

const endpointConfig = new sagemaker.EndpointConfig(this, 'EndpointConfig', {
  instanceProductionVariants: [
    {
      model: modelA,
      variantName: 'variantA',
      initialVariantWeight: 2.0,
    },
    {
      model: modelB,
      variantName: 'variantB',
      initialVariantWeight: 1.0,
    },
  ],
});
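The weight arithmetic divides each variant's weight by the sum of all weights. The sketch below assumes the documented default weight of 1.0 for variants that omit `initialVariantWeight`. `trafficShares` is a hypothetical helper for illustration only and is not part of the proposed API.

```typescript
// Illustrative only: how relative variant weights translate into traffic shares.
interface VariantWeight {
  variantName: string;
  initialVariantWeight?: number; // defaults to 1.0, as documented for ProductionVariantProps
}

function weightOf(v: VariantWeight): number {
  return v.initialVariantWeight ?? 1.0;
}

function trafficShares(variants: VariantWeight[]): Map<string, number> {
  const total = variants.reduce((sum, v) => sum + weightOf(v), 0);
  if (total <= 0) {
    throw new Error('At least one variant must have a positive weight');
  }
  // Each variant receives weight / (sum of all weights) of the traffic.
  return new Map(variants.map((v) => [v.variantName, weightOf(v) / total] as [string, number]));
}

const shares = trafficShares([
  { variantName: 'variantA', initialVariantWeight: 2.0 },
  { variantName: 'variantB', initialVariantWeight: 1.0 },
]);
console.log(shares);
```

With the weights from the example above, variantA receives two-thirds of the traffic and variantB receives one-third.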


When you create an endpoint from an EndpointConfig, Amazon SageMaker launches the ML compute instances and deploys the model or models as specified in the configuration. To get inferences from the model, client applications send requests to the Amazon SageMaker Runtime HTTPS endpoint. For details, see the InvokeEndpoint API reference. Defining an endpoint requires, at minimum, the associated endpoint configuration:

import * as sagemaker from '@aws-cdk/aws-sagemaker';

declare const endpointConfig: sagemaker.EndpointConfig;

const endpoint = new sagemaker.Endpoint(this, 'Endpoint', { endpointConfig });
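Client applications need IAM permission to call InvokeEndpoint, and the proposed `grantInvoke` method on IEndpoint is meant to provide it. The sketch below shows, in plain TypeScript, the kind of policy statement we would expect that grant to amount to: an Allow on the `sagemaker:InvokeEndpoint` action, scoped to the endpoint's ARN. The helper names are hypothetical, and the exact statement the construct emits may differ.

```typescript
// Illustrative sketch of the IAM statement that grantInvoke is expected to amount to.
interface PolicyStatementJson {
  Effect: 'Allow' | 'Deny';
  Action: string[];
  Resource: string[];
}

// Hypothetical helper: SageMaker endpoint ARNs follow arn:<partition>:sagemaker:<region>:<account>:endpoint/<name>.
function endpointArn(partition: string, region: string, account: string, endpointName: string): string {
  return `arn:${partition}:sagemaker:${region}:${account}:endpoint/${endpointName}`;
}

// Hypothetical helper: allow invoking exactly one endpoint.
function invokeEndpointStatement(arn: string): PolicyStatementJson {
  return { Effect: 'Allow', Action: ['sagemaker:InvokeEndpoint'], Resource: [arn] };
}

const arn = endpointArn('aws', 'us-east-1', '012345678910', 'my-endpoint');
console.log(JSON.stringify(invokeEndpointStatement(arn)));
```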


To enable autoscaling on a production variant, use the autoScaleInstanceCount method:

import * as sagemaker from '@aws-cdk/aws-sagemaker';

declare const endpointConfig: sagemaker.EndpointConfig;

const endpoint = new sagemaker.Endpoint(this, 'Endpoint', { endpointConfig });
const productionVariant = endpoint.findInstanceProductionVariant('variantName');
const instanceCount = productionVariant.autoScaleInstanceCount({
  maxCapacity: 3,
});
instanceCount.scaleOnInvocations('LimitRPS', {
  maxRequestsPerSecond: 30,
});
For guidance on load testing to determine the maximum requests per second per instance, see the Amazon SageMaker documentation on load testing auto scaling configurations.
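Application Auto Scaling addresses a SageMaker variant by the resource ID `endpoint/<endpoint-name>/variant/<variant-name>` and the scalable dimension `sagemaker:variant:DesiredInstanceCount`. Invocation-based target tracking uses the `SageMakerVariantInvocationsPerInstance` metric, which counts invocations per instance per minute. AWS load-testing guidance derives the target value as max RPS * safety factor * 60, with a recommended safety factor of 0.5. The sketch below encodes those values. The claim that `scaleOnInvocations` applies exactly this formula, with 0.5 as its default, is our assumption and is not specified by this RFC.

```typescript
// Sketch of the values a target-tracking policy on a SageMaker variant works with.
// The formula follows AWS load-testing guidance; treating it as what scaleOnInvocations
// computes internally is an assumption.
const SCALABLE_DIMENSION = 'sagemaker:variant:DesiredInstanceCount';

function variantResourceId(endpointName: string, variantName: string): string {
  return `endpoint/${endpointName}/variant/${variantName}`;
}

// SageMakerVariantInvocationsPerInstance is measured per minute, hence the factor of 60.
function invocationsPerInstanceTarget(maxRequestsPerSecond: number, safetyFactor = 0.5): number {
  if (maxRequestsPerSecond <= 0 || safetyFactor <= 0) {
    throw new Error('maxRequestsPerSecond and safetyFactor must be positive');
  }
  return maxRequestsPerSecond * safetyFactor * 60;
}

console.log(SCALABLE_DIMENSION, variantResourceId('my-endpoint', 'variantName'));
console.log(invocationsPerInstanceTarget(30));
```

For the `maxRequestsPerSecond: 30` example above, this yields a target of 900 invocations per instance per minute.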


To monitor CloudWatch metrics for a production variant, use one or more of the metric convenience methods:

import * as sagemaker from '@aws-cdk/aws-sagemaker';

declare const endpointConfig: sagemaker.EndpointConfig;

const endpoint = new sagemaker.Endpoint(this, 'Endpoint', { endpointConfig });
const productionVariant = endpoint.findInstanceProductionVariant('variantName');
productionVariant.metricModelLatency().createAlarm(this, 'ModelLatencyAlarm', {
  threshold: 100000,
  evaluationPeriods: 3,
});
Ticking the box below indicates that the public API of this RFC has been signed-off by the API bar raiser (the status/api-approved label was applied to the RFC pull request):

[x] Signed-off by API Bar Raiser @kaizencc

Public FAQ

What are we launching today?

We are launching the first set of L2 constructs for the SageMaker module, introducing the Endpoint construct alongside its dependencies EndpointConfig and Model. Together, these constructs enable customers to deploy a machine learning model to an Amazon SageMaker-hosted endpoint which can be used for real-time inference via SageMaker's InvokeEndpoint API.

Why should I use this feature?

SageMaker hosting for real-time inference provides a fully managed, auto-scalable solution for customers who want to deploy machine learning models behind an interactive endpoint.

Internal FAQ

Why are we doing this?

The tracking GitHub issue for the module has 48 +1s, so there appears to be sufficient public demand for higher-level constructs above the existing L1s.

As SageMaker models are composed of an algorithm (expressed as a Docker image) and data (expressed as an S3 object), the CDK's support for image and file assets would allow a customer to fully specify their endpoints' AWS infrastructure and resource dependencies solely using the CDK.

Assets aside, a multi-variant, auto-scalable, CloudWatch-monitored endpoint within a VPC can be specified in just under 100 lines of code using the proposed L2 CDK constructs, which generate a 1000+ line CloudFormation template. Producing an equivalent template with the existing SageMaker L1 constructs can prove challenging. Customers must manually wire L1 SageMaker attributes (e.g., production variant names) into L2 constructs from other modules (e.g., CloudWatch, Application Auto Scaling), which leaves room for manual error.

Why should we not do this?

In the time since the original PR for these constructs was authored in 2020, SageMaker has expanded its feature set to include Amazon SageMaker Pipelines, a CI/CD offering for training and deploying models. This offering directs customers to SageMaker Studio for interacting with their pipeline, which itself can be programmatically manipulated using the SageMaker Python SDK. Given the user experience difference between these new SageMaker products and other AWS infrastructure-as-code solutions (e.g., CloudFormation and the CDK), it's unclear how broader adoption of SageMaker CDK constructs aligns with the SageMaker product vision.

What is the technical solution (design) of this feature?

The proposed design has been fully implemented in CDK PR #20113. Each of the following sections lays out the proposed interfaces needed for each L2 construct along with any supporting classes.


  • IModel -- interface for defined and imported models

    export interface IModel extends cdk.IResource, iam.IGrantable, ec2.IConnectable {
      /**
       * Returns the ARN of this model.
       * @attribute
       */
      readonly modelArn: string;
      /**
       * Returns the name of this model.
       * @attribute
       */
      readonly modelName: string;
      /** The IAM role associated with this Model. */
      readonly role?: iam.IRole;
      /** Adds a statement to the IAM role assumed by the instance. */
      addToRolePolicy(statement: iam.PolicyStatement): void;
    }
  • ModelProps -- configuration for defining a Model

    export interface ModelProps {
      /**
       * The IAM role that the Amazon SageMaker service assumes.
       * @default - a new IAM role will be created.
       */
      readonly role?: iam.IRole;
      /**
       * Name of the SageMaker Model.
       * @default - AWS CloudFormation generates a unique physical ID and uses that ID for the model's
       * name.
       */
      readonly modelName?: string;
      /**
       * The VPC to deploy model containers to.
       * @default - none
       */
      readonly vpc?: ec2.IVpc;
      /**
       * The VPC subnets to use when deploying model containers.
       * @default - none
       */
      readonly vpcSubnets?: ec2.SubnetSelection;
      /**
       * The security groups to associate with the Model. If no security groups are provided and 'vpc' is
       * configured, one security group will be created automatically.
       * @default - A security group will be automatically created if 'vpc' is supplied
       */
      readonly securityGroups?: ec2.ISecurityGroup[];
      /**
       * Specifies the container definitions for this model, consisting of either a single primary
       * container or an inference pipeline of multiple containers.
       * @default - none
       */
      readonly containers?: ContainerDefinition[];
      /**
       * Whether to allow the SageMaker Model to send all network traffic.
       * If set to false, you must individually add traffic rules to allow the
       * SageMaker Model to connect to network targets.
       * Only used if 'vpc' is supplied.
       * @default true
       */
      readonly allowAllOutbound?: boolean;
    }
  • ModelBase -- abstract base definition class shared by defined and imported models

    abstract class ModelBase extends cdk.Resource implements IModel {
      /**
       * Returns the ARN of this model.
       * @attribute
       */
      public abstract readonly modelArn: string;
      /**
       * Returns the name of the model.
       * @attribute
       */
      public abstract readonly modelName: string;
      /** Execution role for SageMaker Model. */
      public abstract readonly role?: iam.IRole;
      /** The principal this Model is running as. */
      public abstract readonly grantPrincipal: iam.IPrincipal;
      /**
       * An accessor for the Connections object that will fail if this Model does not have a VPC
       * configured.
       */
      public get connections(): ec2.Connections { ... }
      /**
       * The actual Connections object for this Model. This may be unset in the event that a VPC has not
       * been configured.
       * @internal
       */
      protected _connections: ec2.Connections | undefined;
      /** Adds a statement to the IAM role assumed by the instance. */
      public addToRolePolicy(statement: iam.PolicyStatement) { ... }
    }
  • Model -- defines a SageMaker model (with helper methods for importing a model)

    export class Model extends ModelBase {
      /**
       * Imports a Model defined either outside the CDK or in a different CDK stack.
       * @param scope the Construct scope.
       * @param id the resource id.
       * @param modelName the name of the model.
       */
      public static fromModelName(scope: Construct, id: string, modelName: string): IModel { ... }
      /**
       * Imports a Model defined either outside the CDK or in a different CDK stack.
       * @param scope the Construct scope.
       * @param id the resource id.
       * @param attrs the attributes of the model to import.
       */
      public static fromModelAttributes(scope: Construct, id: string, attrs: ModelAttributes): IModel { ... }
      /**
       * Returns the ARN of this model.
       * @attribute
       */
      public readonly modelArn: string;
      /**
       * Returns the name of the model.
       * @attribute
       */
      public readonly modelName: string;
      /** Execution role for SageMaker Model. */
      public readonly role?: iam.IRole;
      /** The principal this Model is running as. */
      public readonly grantPrincipal: iam.IPrincipal;
      private readonly subnets: ec2.SelectedSubnets | undefined;

      constructor(scope: Construct, id: string, props: ModelProps = {}) { ... }
    }
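ModelBase's `connections` accessor is described as failing when no VPC is configured, and it is backed by an optional `_connections` field. The following CDK-free sketch shows that fail-fast pattern. `FakeConnections`, `ModelBaseSketch`, and `ModelSketch` are hypothetical stand-ins for ec2.Connections, ModelBase, and Model, and the error message is invented for illustration.

```typescript
// Hypothetical stand-in for ec2.Connections so the sketch runs without the CDK.
class FakeConnections {
  public readonly securityGroupIds: string[];

  constructor(securityGroupIds: string[]) {
    this.securityGroupIds = securityGroupIds;
  }
}

abstract class ModelBaseSketch {
  // Left undefined when the model is not deployed into a VPC.
  protected _connections: FakeConnections | undefined;

  // Fails fast instead of returning undefined, so a missing VPC configuration surfaces immediately.
  public get connections(): FakeConnections {
    if (this._connections === undefined) {
      throw new Error('This Model is not configured with a VPC, so it has no network connections');
    }
    return this._connections;
  }
}

class ModelSketch extends ModelBaseSketch {
  constructor(props: { securityGroupIds?: string[] } = {}) {
    super();
    // Only models placed in a VPC get a Connections object.
    if (props.securityGroupIds !== undefined) {
      this._connections = new FakeConnections(props.securityGroupIds);
    }
  }
}
```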
Container Definition

When defining a model above, the ContainerDefinition interface encapsulates both the specification of model inference code as a ContainerImage and an optional set of artifacts as ModelData. The image is specified as a Docker registry path while the model artifacts must be stored in S3.

  • ContainerDefinition -- describes the container, as part of model definition above

    export interface ContainerDefinition {
      /** The image used to start a container. */
      readonly image: ContainerImage;
      /**
       * A map of environment variables to pass into the container.
       * @default - none
       */
      readonly environment?: { [key: string]: string };
      /**
       * Hostname of the container.
       * @default - none
       */
      readonly containerHostname?: string;
      /**
       * S3 path to the model artifacts.
       * @default - none
       */
      readonly modelData?: ModelData;
    }
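At synthesis time, each ContainerDefinition presumably maps onto the container property of the CloudFormation `AWS::SageMaker::Model` resource (`Image`, `Environment`, `ContainerHostname`, `ModelDataUrl`). The sketch below operates on already-resolved strings. It assumes that a single definition populates `PrimaryContainer` while several populate `Containers` (an inference pipeline). `ResolvedContainer` and the helper functions are hypothetical.

```typescript
// Simplified, already-resolved view of a ContainerDefinition (image and model data as strings).
interface ResolvedContainer {
  imageName: string;
  modelDataUri?: string;
  environment?: Record<string, string>;
  containerHostname?: string;
}

// Subset of the CloudFormation AWS::SageMaker::Model container property.
interface CfnContainerJson {
  Image: string;
  ModelDataUrl?: string;
  Environment?: Record<string, string>;
  ContainerHostname?: string;
}

function toCfnContainer(c: ResolvedContainer): CfnContainerJson {
  const out: CfnContainerJson = { Image: c.imageName };
  // Omit optional keys entirely rather than emitting undefined values.
  if (c.modelDataUri !== undefined) out.ModelDataUrl = c.modelDataUri;
  if (c.environment !== undefined) out.Environment = c.environment;
  if (c.containerHostname !== undefined) out.ContainerHostname = c.containerHostname;
  return out;
}

// Assumption: one container -> PrimaryContainer; several -> Containers (inference pipeline).
function modelContainerProperties(
  containers: ResolvedContainer[],
): { PrimaryContainer: CfnContainerJson } | { Containers: CfnContainerJson[] } {
  if (containers.length === 0) {
    throw new Error('A model needs at least one container');
  }
  return containers.length === 1
    ? { PrimaryContainer: toCfnContainer(containers[0]) }
    : { Containers: containers.map(toCfnContainer) };
}

console.log(JSON.stringify(modelContainerProperties([{ imageName: 'repo:latest' }])));
```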
Container Image

The following interface and abstract class provide mechanisms for configuring a container image. They closely mirror analogous entities from the ECS module, but rather than binding to an ECS task definition, they bind to a SageMaker model.

  • ContainerImageConfig -- the configuration for creating a container image

    export interface ContainerImageConfig {
      /**
       * The image name. Images in Amazon ECR repositories can be specified by either using the full
       * registry/repository:tag or registry/repository@digest. For example,
       * 012345678910.dkr.ecr.<region-name>.amazonaws.com/<repository-name>:latest or
       * 012345678910.dkr.ecr.<region-name>.amazonaws.com/<repository-name>@sha256:94afd1f2e64d908bc90dbca0035a5b567EXAMPLE.
       */
      readonly imageName: string;
    }
  • ContainerImage -- abstract class defining bind contract for images alongside static factory methods to enable different sources (e.g., image in ECR repository, local Dockerfile)

    export abstract class ContainerImage {
      /** Reference an image in an ECR repository. */
      public static fromEcrRepository(repository: ecr.IRepository, tag: string = 'latest'): ContainerImage { ... }
      /**
       * Reference an image that's constructed directly from sources on disk.
       * @param directory The directory where the Dockerfile is stored
       * @param options The options to further configure the selected image
       */
      public static fromAsset(directory: string, options: assets.DockerImageAssetOptions = {}): ContainerImage { ... }
      /** Called when the image is used by a Model. */
      public abstract bind(scope: Construct, model: Model): ContainerImageConfig;
    }
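The bind contract lets a Model resolve an image lazily. The static factories return concrete subclasses, and the Model calls `bind()` to obtain a ContainerImageConfig when the image is attached. The sketch below illustrates that pattern without the CDK. `ModelLike`, `ContainerImageSketch`, and `RepositoryImageSketch` are hypothetical stand-ins. The real ECR and asset implementations would presumably also grant the model's role permission to pull the image.

```typescript
// Minimal stand-ins so the sketch runs without the CDK; every name here is hypothetical.
interface ModelLike {
  modelName: string;
}

interface ContainerImageConfigSketch {
  imageName: string;
}

abstract class ContainerImageSketch {
  // Static factory: callers depend only on the abstract type, never on the concrete subclass.
  public static fromRepositoryUri(repositoryUri: string, tag: string = 'latest'): ContainerImageSketch {
    return new RepositoryImageSketch(repositoryUri, tag);
  }

  // Called by the model when the image is attached, mirroring ContainerImage.bind.
  public abstract bind(model: ModelLike): ContainerImageConfigSketch;
}

class RepositoryImageSketch extends ContainerImageSketch {
  private readonly repositoryUri: string;
  private readonly tag: string;

  constructor(repositoryUri: string, tag: string) {
    super();
    this.repositoryUri = repositoryUri;
    this.tag = tag;
  }

  public bind(_model: ModelLike): ContainerImageConfigSketch {
    // A real implementation would presumably also grant the model's role pull access here.
    return { imageName: `${this.repositoryUri}:${this.tag}` };
  }
}
```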
Model Data

Analogous to the above pairing of ContainerImageConfig and ContainerImage, the following interface and abstract class provide mechanisms for customers to specify the source of their model artifacts, either in an S3 bucket or a local file asset.

  • ModelDataConfig -- the configuration needed to reference model artifacts

    export interface ModelDataConfig {
      /**
       * The S3 path where the model artifacts, which result from model training, are stored. This path
       * must point to a single gzip compressed tar archive (.tar.gz suffix).
       */
      readonly uri: string;
    }
  • ModelData -- model data represents the source of model artifacts, which will ultimately be loaded from an S3 location

    export abstract class ModelData {
      /**
       * Constructs model data which is already available within S3.
       * @param bucket The S3 bucket within which the model artifacts are stored
       * @param objectKey The S3 object key at which the model artifacts are stored
       */
      public static fromBucket(bucket: s3.IBucket, objectKey: string): ModelData { ... }
      /**
       * Constructs model data that will be uploaded to S3 as part of the CDK app deployment.
       * @param path The local path to a model artifact file as a gzipped tar file
       * @param options The options to further configure the selected asset
       */
      public static fromAsset(path: string, options: assets.AssetOptions = {}): ModelData { ... }
      /**
       * This method is invoked by the SageMaker Model construct when it needs to resolve the model
       * data to a URI.
       * @param scope The scope within which the model data is resolved
       * @param model The Model construct performing the URI resolution
       */
      public abstract bind(scope: Construct, model: IModel): ModelDataConfig;
    }

Endpoint Configuration

  • IEndpointConfig -- the interface for a SageMaker EndpointConfig resource

    export interface IEndpointConfig extends cdk.IResource {
      /**
       * The ARN of the endpoint configuration.
       * @attribute
       */
      readonly endpointConfigArn: string;
      /**
       * The name of the endpoint configuration.
       * @attribute
       */
      readonly endpointConfigName: string;
    }
  • EndpointConfigProps -- construction properties for a SageMaker EndpointConfig

    export interface EndpointConfigProps {
      /**
       * Name of the endpoint configuration.
       * @default - AWS CloudFormation generates a unique physical ID and uses that ID for the
       * endpoint configuration's name.
       */
      readonly endpointConfigName?: string;
      /**
       * Optional KMS encryption key that SageMaker uses to encrypt data on the storage volume attached
       * to the ML compute instances hosting the endpoint.
       * @default - none
       */
      readonly encryptionKey?: kms.IKey;
      /**
       * A list of instance production variants. You can always add more variants later by calling
       * {@link EndpointConfig#addInstanceProductionVariant}.
       * @default - none
       */
      readonly instanceProductionVariants?: InstanceProductionVariantProps[];
    }
  • EndpointConfig -- defines a SageMaker EndpointConfig (with helper methods for importing an endpoint config)

    export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
      /**
       * Imports an EndpointConfig defined either outside the CDK or in a different CDK stack.
       * @param scope the Construct scope.
       * @param id the resource id.
       * @param endpointConfigName the name of the endpoint configuration.
       */
      public static fromEndpointConfigName(scope: Construct, id: string, endpointConfigName: string): IEndpointConfig { ... }
      /** The ARN of the endpoint configuration. */
      public readonly endpointConfigArn: string;
      /** The name of the endpoint configuration. */
      public readonly endpointConfigName: string;

      constructor(scope: Construct, id: string, props: EndpointConfigProps = {}) { ... }

      /**
       * Add instance production variant to the endpoint configuration.
       * @param props The properties of a production variant to add.
       */
      public addInstanceProductionVariant(props: InstanceProductionVariantProps): void { ... }
      /** Get instance production variants associated with endpoint configuration. */
      public get instanceProductionVariants(): InstanceProductionVariant[] { ... }
      /**
       * Find instance production variant based on variant name.
       * @param name Variant name from production variant
       */
      public findInstanceProductionVariant(name: string): InstanceProductionVariant { ... }
    }
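`addInstanceProductionVariant` and `findInstanceProductionVariant` imply that the construct keeps its variants addressable by name. A minimal CDK-free sketch of such a registry follows. It fills in the documented defaults (weight 1.0, one instance, ml.t2.medium). `VariantRegistry` is hypothetical, and both behaviors it enforces (rejecting duplicate names and throwing on a failed lookup) are assumptions rather than statements from this RFC.

```typescript
// Hypothetical, simplified variant props (model referenced by name, instance type as a string).
interface VariantPropsSketch {
  variantName: string;
  modelName: string;
  initialVariantWeight?: number;
  initialInstanceCount?: number;
  instanceType?: string;
}

// Fully specified variant, as stored after defaults are applied.
interface VariantSketch {
  variantName: string;
  modelName: string;
  initialVariantWeight: number;
  initialInstanceCount: number;
  instanceType: string;
}

class VariantRegistry {
  private readonly variants = new Map<string, VariantSketch>();

  public add(props: VariantPropsSketch): void {
    // Assumption: variant names must be unique within one endpoint configuration.
    if (this.variants.has(props.variantName)) {
      throw new Error(`There is already a production variant named '${props.variantName}'`);
    }
    // Apply the documented defaults so every stored variant is fully specified.
    this.variants.set(props.variantName, {
      variantName: props.variantName,
      modelName: props.modelName,
      initialVariantWeight: props.initialVariantWeight ?? 1.0,
      initialInstanceCount: props.initialInstanceCount ?? 1,
      instanceType: props.instanceType ?? 'ml.t2.medium',
    });
  }

  public get all(): VariantSketch[] {
    return Array.from(this.variants.values());
  }

  public find(name: string): VariantSketch {
    const variant = this.variants.get(name);
    if (variant === undefined) {
      throw new Error(`No production variant named '${name}'`);
    }
    return variant;
  }
}

const config = new VariantRegistry();
config.add({ variantName: 'variantA', modelName: 'modelA', initialVariantWeight: 2.0 });
console.log(config.find('variantA'));
```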
Production Variants

To accommodate A/B testing of model behaviors, an endpoint config supports the specification of multiple production variants. Each variant's weight determines the traffic distribution to itself relative to the other configured variants.

  • ProductionVariantProps -- common construction properties for all production variant types (e.g., instance, serverless) (note, not exported)

    interface ProductionVariantProps {
      /**
       * Determines initial traffic distribution among all of the models that you specify in the
       * endpoint configuration. The traffic to a production variant is determined by the ratio of the
       * variant weight to the sum of all variant weight values across all production variants.
       * @default 1.0
       */
      readonly initialVariantWeight?: number;
      /** The model to host. */
      readonly model: IModel;
      /** Name of the production variant. */
      readonly variantName: string;
    }
  • InstanceProductionVariantProps -- construction properties for an instance production variant

    export interface InstanceProductionVariantProps extends ProductionVariantProps {
      /**
       * The size of the Elastic Inference (EI) instance to use for the production variant. EI instances
       * provide on-demand GPU computing for inference.
       * @default - none
       */
      readonly acceleratorType?: AcceleratorType;
      /**
       * Number of instances to launch initially.
       * @default 1
       */
      readonly initialInstanceCount?: number;
      /**
       * Instance type of the production variant.
       * @default - ml.t2.medium instance type.
       */
      readonly instanceType?: InstanceType;
    }
  • ProductionVariant -- represents common attributes of all production variant types (e.g., instance, serverless) once associated to an EndpointConfig (note, not exported)

    interface ProductionVariant {
      /**
       * Determines initial traffic distribution among all of the models that you specify in the
       * endpoint configuration. The traffic to a production variant is determined by the ratio of the
       * variant weight to the sum of all variant weight values across all production variants.
       */
      readonly initialVariantWeight: number;
      /** The name of the model to host. */
      readonly modelName: string;
      /** The name of the production variant. */
      readonly variantName: string;
    }
  • InstanceProductionVariant -- represents an instance production variant that has been associated with an EndpointConfig

    export interface InstanceProductionVariant extends ProductionVariant {
      /**
       * The size of the Elastic Inference (EI) instance to use for the production variant. EI instances
       * provide on-demand GPU computing for inference.
       * @default - none
       */
      readonly acceleratorType?: AcceleratorType;
      /** Number of instances to launch initially. */
      readonly initialInstanceCount: number;
      /** Instance type of the production variant. */
      readonly instanceType: InstanceType;
    }
  • AcceleratorType -- enum-like class of supported Elastic Inference (EI) instance types for SageMaker instance-based production variants; EI instances provide on-demand GPU computing for inference

    export class AcceleratorType {
      /** ml.eia1.large */
      public static readonly EIA1_LARGE = AcceleratorType.of('ml.eia1.large');

      /* Additional supported accelerator types */

      /**
       * Builds an AcceleratorType from a given string or token (such as a CfnParameter).
       * @param acceleratorType An accelerator type as string
       * @returns A strongly typed AcceleratorType
       */
      public static of(acceleratorType: string): AcceleratorType;
      /**
       * Return the accelerator type as a string.
       * @returns The accelerator type as a string
       */
      public toString(): string;
    }
  • InstanceType -- enum-like class of supported instance types for SageMaker instance-based production variants

    export class InstanceType {
      /** ml.c4.2xlarge */
      public static readonly C4_2XLARGE = InstanceType.of('ml.c4.2xlarge');

      /* Additional supported instance types */

      /**
       * Builds an InstanceType from a given string or token (such as a CfnParameter).
       * @param instanceType An instance type as string
       * @returns A strongly typed InstanceType
       */
      public static of(instanceType: string): InstanceType;
    }
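AcceleratorType and InstanceType follow CDK's enum-like class pattern. Well-known values are static constants created via `of()`, and `of()` also accepts arbitrary strings, so newly released types can be used without waiting for a library update. The plain TypeScript sketch below shows the pattern. `InstanceTypeSketch` is hypothetical and does not model CDK tokens such as CfnParameter values.

```typescript
// Plain TypeScript sketch of the enum-like class pattern; CDK tokens are not modeled here.
class InstanceTypeSketch {
  // Well-known values are exposed as static constants...
  public static readonly C4_2XLARGE = InstanceTypeSketch.of('ml.c4.2xlarge');
  public static readonly T2_MEDIUM = InstanceTypeSketch.of('ml.t2.medium');

  // ...while of() remains an escape hatch for values the class does not list yet.
  public static of(instanceType: string): InstanceTypeSketch {
    return new InstanceTypeSketch(instanceType);
  }

  private readonly instanceType: string;

  // Private constructor: instances can only be obtained through the constants or of().
  private constructor(instanceType: string) {
    this.instanceType = instanceType;
  }

  public toString(): string {
    return this.instanceType;
  }
}

console.log(`${InstanceTypeSketch.C4_2XLARGE}`, `${InstanceTypeSketch.of('ml.g5.xlarge')}`);
```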


  • IEndpoint -- the interface for a SageMaker Endpoint resource

    export interface IEndpoint extends cdk.IResource {
      /**
       * The ARN of the endpoint.
       * @attribute
       */
      readonly endpointArn: string;
      /**
       * The name of the endpoint.
       * @attribute
       */
      readonly endpointName: string;
      /**
       * Permits an IAM principal to invoke this endpoint.
       * @param grantee The principal to grant access to
       */
      grantInvoke(grantee: iam.IGrantable): iam.Grant;
    }
  • EndpointProps -- construction properties for a SageMaker endpoint

    export interface EndpointProps {
      /**
       * Name of the endpoint.
       * @default - AWS CloudFormation generates a unique physical ID and uses that ID for the
       * endpoint's name.
       */
      readonly endpointName?: string;
      /** The endpoint configuration to use for this endpoint. */
      readonly endpointConfig: IEndpointConfig;
    }
  • EndpointBase -- abstract base definition class shared by defined and imported endpoints

    abstract class EndpointBase extends cdk.Resource implements IEndpoint {
      /**
       * The ARN of the endpoint.
       * @attribute
       */
      public abstract readonly endpointArn: string;
      /**
       * The name of the endpoint.
       * @attribute
       */
      public abstract readonly endpointName: string;
      /**
       * Permits an IAM principal to invoke this endpoint.
       * @param grantee The principal to grant access to
       */
      public grantInvoke(grantee: iam.IGrantable) { ... }
    }
  • Endpoint -- defines a SageMaker endpoint (with helper methods for importing an endpoint)

    export class Endpoint extends EndpointBase {
      /**
       * Imports an Endpoint defined either outside the CDK or in a different CDK stack.
       * @param scope the Construct scope.
       * @param id the resource id.
       * @param endpointName the name of the endpoint.
       */
      public static fromEndpointName(scope: Construct, id: string, endpointName: string): IEndpoint { ... }
      /**
       * The ARN of the endpoint.
       * @attribute
       */
      public readonly endpointArn: string;
      /**
       * The name of the endpoint.
       * @attribute
       */
      public readonly endpointName: string;

      constructor(scope: Construct, id: string, props: EndpointProps) { ... }

      /** Get instance production variants associated with endpoint. */
      public get instanceProductionVariants(): IEndpointInstanceProductionVariant[] { ... }
      /**
       * Find instance production variant based on variant name.
       * @param name Variant name from production variant
       */
      public findInstanceProductionVariant(name: string): IEndpointInstanceProductionVariant { ... }
    }
Endpoint Production Variants

When monitoring or auto-scaling real-time inference endpoints, both CloudWatch and Application Auto Scaling operate at the level of endpoint name + variant name. For this reason, once a variant has been attached to an endpoint, this RFC allows customers to retrieve IEndpointInstanceProductionVariant object instances from their endpoint for the purposes of referencing CloudWatch metrics or an Application Auto Scaling BaseScalableAttribute.

  • IEndpointProductionVariant -- represents the features common to all production variant types (e.g., instance, serverless) that have been associated with an endpoint (note, not exported)

    interface IEndpointProductionVariant {
      /** The name of the production variant. */
      readonly variantName: string;
      /**
       * Return the given named metric for Endpoint.
       * @default - sum over 5 minutes
       */
      metric(namespace: string, metricName: string, props?: cloudwatch.MetricOptions): cloudwatch.Metric;
    }
  • IEndpointInstanceProductionVariant -- represents an instance production variant that has been associated with an endpoint

    export interface IEndpointInstanceProductionVariant extends IEndpointProductionVariant {
      /**
       * Metric for the number of invocations
       * @default - sum over 5 minutes
       */
      metricInvocations(props?: cloudwatch.MetricOptions): cloudwatch.Metric;
      /**
       * Metric for the number of invocations per instance
       * @default - sum over 5 minutes
       */
      metricInvocationsPerInstance(props?: cloudwatch.MetricOptions): cloudwatch.Metric;
      /**
       * Metric for model latency
       * @default - average over 5 minutes
       */
      metricModelLatency(props?: cloudwatch.MetricOptions): cloudwatch.Metric;
      /**
       * Metric for overhead latency
       * @default - average over 5 minutes
       */
      metricOverheadLatency(props?: cloudwatch.MetricOptions): cloudwatch.Metric;
      /**
       * Metric for the number of invocations by HTTP response code
       * @default - sum over 5 minutes
       */
      metricInvocationResponseCode(responseCode: InvocationHttpResponseCode, props?: cloudwatch.MetricOptions): cloudwatch.Metric;
      /**
       * Metric for disk utilization
       * @default - average over 5 minutes
       */
      metricDiskUtilization(props?: cloudwatch.MetricOptions): cloudwatch.Metric;
      /**
       * Metric for CPU utilization
       * @default - average over 5 minutes
       */
      metricCpuUtilization(props?: cloudwatch.MetricOptions): cloudwatch.Metric;
      /**
       * Metric for memory utilization
       * @default - average over 5 minutes
       */
      metricMemoryUtilization(props?: cloudwatch.MetricOptions): cloudwatch.Metric;
      /**
       * Metric for GPU utilization
       * @default - average over 5 minutes
       */
      metricGpuUtilization(props?: cloudwatch.MetricOptions): cloudwatch.Metric;
      /**
       * Metric for GPU memory utilization
       * @default - average over 5 minutes
       */
      metricGpuMemoryUtilization(props?: cloudwatch.MetricOptions): cloudwatch.Metric;
      /**
       * Enable autoscaling for SageMaker Endpoint production variant
       * @param scalingProps EnableScalingProps
       */
      autoScaleInstanceCount(scalingProps: appscaling.EnableScalingProps): ScalableInstanceCount;
    }
    class EndpointInstanceProductionVariant implements IEndpointInstanceProductionVariant { ... }
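The metric helpers above share one dimension set (endpoint name plus variant name) and differ mainly in metric name, namespace, and default statistic. The following is a minimal, library-free sketch of that pattern, not the module's implementation: `MetricSpec`, `MetricOverrides`, and `VariantMetricsSketch` are hypothetical stand-ins for `cloudwatch.Metric`, `cloudwatch.MetricOptions`, and `EndpointInstanceProductionVariant`. It assumes SageMaker's published CloudWatch conventions: invocation metrics in the `AWS/SageMaker` namespace, instance utilization metrics in `/aws/sagemaker/Endpoints`, and `EndpointName`/`VariantName` dimensions.

```typescript
// Hypothetical stand-in for cloudwatch.Metric, reduced to the fields discussed here.
interface MetricSpec {
  readonly namespace: string;
  readonly metricName: string;
  readonly dimensions: Readonly<Record<string, string>>;
  readonly statistic: 'Sum' | 'Average';
  readonly periodMinutes: number;
}

// Hypothetical caller overrides, loosely modeled on cloudwatch.MetricOptions.
interface MetricOverrides {
  readonly statistic?: 'Sum' | 'Average';
  readonly periodMinutes?: number;
}

// SageMaker publishes invocation and instance utilization metrics in different namespaces.
const INVOCATION_NAMESPACE = 'AWS/SageMaker';
const INSTANCE_NAMESPACE = '/aws/sagemaker/Endpoints';

// Hypothetical sketch of an endpoint-bound instance production variant.
class VariantMetricsSketch {
  readonly endpointName: string;
  readonly variantName: string;

  constructor(endpointName: string, variantName: string) {
    this.endpointName = endpointName;
    this.variantName = variantName;
  }

  // Every metric is scoped to the endpoint/variant pair; defaults to a sum over 5 minutes.
  metric(namespace: string, metricName: string, props: MetricOverrides = {}): MetricSpec {
    return {
      namespace,
      metricName,
      dimensions: { EndpointName: this.endpointName, VariantName: this.variantName },
      statistic: props.statistic ?? 'Sum',
      periodMinutes: props.periodMinutes ?? 5,
    };
  }

  // Counts keep the generic default (sum).
  metricInvocations(props: MetricOverrides = {}): MetricSpec {
    return this.metric(INVOCATION_NAMESPACE, 'Invocations', props);
  }

  // Latency defaults to an average unless the caller overrides the statistic.
  metricModelLatency(props: MetricOverrides = {}): MetricSpec {
    return this.metric(INVOCATION_NAMESPACE, 'ModelLatency', {
      ...props,
      statistic: props.statistic ?? 'Average',
    });
  }

  // Utilization metrics live in the instance namespace and also default to an average.
  metricCpuUtilization(props: MetricOverrides = {}): MetricSpec {
    return this.metric(INSTANCE_NAMESPACE, 'CPUUtilization', {
      ...props,
      statistic: props.statistic ?? 'Average',
    });
  }
}

const variant = new VariantMetricsSketch('my-endpoint', 'variantA');
const latency = variant.metricModelLatency({ periodMinutes: 1 });
```

Funneling every helper through a single `metric()` method keeps the dimension set in one place, which is the property that makes these helpers only meaningful once a variant is bound to a deployed endpoint.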

Is this a breaking change?


What alternative solutions did you consider?

  1. In the earliest PR attempting to add SageMaker L2 constructs to the CDK, the author did not create an EndpointConfig construct, instead hiding the resource's creation behind Endpoint (to which production variants could be added). Although simpler, this approach prevents customers from reusing configuration across endpoints. For this reason, this RFC incorporates an explicit L2 construct for endpoint configuration, which enables use cases like the following:
    1. Producer A exposes ten endpoints, each unique to a different consumer (let's label these B through K).
    2. Each of these endpoints could use one of, say, three endpoint configs (let's label these 1 through 3) based on the features needed by each consumer.
    3. Consumer B's endpoint is currently associated with endpoint config 1.
    4. At some later point, consumer B wants to leverage a new feature, so, in collaboration with B, producer A updates B's endpoint to reference endpoint config 3. As a result, without switching endpoints, consumer B is able to begin using the features enabled via the pre-built, shared endpoint config 3.
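The reuse scenario above can be modeled in a few lines. This is a hypothetical, library-free sketch: `EndpointConfigSketch` and `EndpointSketch` only mimic the relationship the RFC describes (many endpoints referencing a small set of shared configs) and are not the proposed construct APIs; the feature names and the initial split of consumers across configs are arbitrary.

```typescript
// Hypothetical stand-in for a shared, reusable endpoint configuration.
interface EndpointConfigSketch {
  readonly name: string;
  readonly features: readonly string[];
}

// Hypothetical stand-in for an endpoint: it holds only a reference to a config.
class EndpointSketch {
  readonly consumer: string;
  private currentConfig: EndpointConfigSketch;

  constructor(consumer: string, config: EndpointConfigSketch) {
    this.consumer = consumer;
    this.currentConfig = config;
  }

  get config(): EndpointConfigSketch {
    return this.currentConfig;
  }

  // Re-pointing the endpoint changes its features; the endpoint itself is not replaced.
  useConfig(config: EndpointConfigSketch): void {
    this.currentConfig = config;
  }
}

// Three shared configs (1 through 3); feature names are illustrative only.
const config1: EndpointConfigSketch = { name: 'config-1', features: ['single-variant'] };
const config2: EndpointConfigSketch = { name: 'config-2', features: ['single-variant', 'multi-variant'] };
const config3: EndpointConfigSketch = { name: 'config-3', features: ['multi-variant', 'data-capture'] };

// Producer A exposes ten endpoints, one per consumer B through K.
// B-F start on config 1 and G-K on config 2 (an arbitrary split).
const endpoints = new Map<string, EndpointSketch>();
for (const consumer of 'BCDEFGHIJK'.split('')) {
  const initial = consumer < 'G' ? config1 : config2;
  endpoints.set(consumer, new EndpointSketch(consumer, initial));
}

// Consumer B opts into config 3 without switching endpoints.
const endpointB = endpoints.get('B')!;
endpointB.useConfig(config3);
```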

What are the drawbacks of this solution?

Production variants are configured via the EndpointConfig construct, but monitoring and auto-scaling a deployed production variant are only possible once that EndpointConfig has been associated with an Endpoint (the dimensions for most SageMaker model hosting metrics consist of the endpoint name and the production variant name). For this reason, this RFC proposes an Endpoint.findProductionVariant(string) method whose return value exposes the metric* and autoScaleInstanceCount helper methods demonstrated in the README. Although not necessarily a drawback, this separation of configuration-time and deploy-time APIs appears to be a novel pattern for the CDK and thus has the potential to confuse customers.
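The configuration-time/deploy-time split described above can be illustrated without the CDK. In this hypothetical sketch, variants are declared on a config by name only, and an endpoint-bound handle (which knows both the endpoint name and the variant name needed for metric dimensions) is obtained by name lookup, failing fast on an unknown name. `ConfigSketch`, `DeployedEndpointSketch`, `BoundVariantSketch`, and the error messages are assumptions, not the module's API.

```typescript
// Configuration time: a variant is just a name plus sizing; no endpoint exists yet.
interface VariantConfigSketch {
  readonly variantName: string;
  readonly initialInstanceCount: number;
}

class ConfigSketch {
  readonly variants: VariantConfigSketch[] = [];

  // Variant names must be unique within a config, since lookups are by name.
  addVariant(variant: VariantConfigSketch): this {
    if (this.variants.some((v) => v.variantName === variant.variantName)) {
      throw new Error(`Duplicate variant name: ${variant.variantName}`);
    }
    this.variants.push(variant);
    return this;
  }
}

// Deploy time: the handle carries the full metric dimension set.
interface BoundVariantSketch {
  readonly endpointName: string;
  readonly variantName: string;
}

class DeployedEndpointSketch {
  readonly endpointName: string;
  readonly config: ConfigSketch;

  constructor(endpointName: string, config: ConfigSketch) {
    this.endpointName = endpointName;
    this.config = config;
  }

  // Similar in spirit to the RFC's Endpoint.findProductionVariant(string).
  findProductionVariant(name: string): BoundVariantSketch {
    const found = this.config.variants.find((v) => v.variantName === name);
    if (!found) {
      throw new Error(`No variant named '${name}' on endpoint ${this.endpointName}`);
    }
    return { endpointName: this.endpointName, variantName: found.variantName };
  }
}

const config = new ConfigSketch()
  .addVariant({ variantName: 'blue', initialInstanceCount: 2 })
  .addVariant({ variantName: 'green', initialInstanceCount: 1 });
const endpoint = new DeployedEndpointSketch('prod-endpoint', config);
const green = endpoint.findProductionVariant('green');
```

Failing fast on an unknown name surfaces a typo at synthesis time rather than producing a metric whose dimensions match nothing in CloudWatch.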

What is the high-level project plan?

As the proposed design has been fully implemented in CDK PR #20113, the delivery timeline of the implementation of this RFC will be contingent upon the scope of changes requested by reviewers. For baking, the L2 constructs for this module would be marked as experimental, leaving room for further adjustments prior to marking the APIs as stable.

Are there any open issues that need to be addressed later?

Feature Additions

The following list describes, at a high level, future additions that can be made to the L2 constructs to enable SageMaker features that are not yet covered by this RFC but are already supported via CloudFormation. For the purposes of this RFC, this list should be reviewed to ensure that the proposed APIs are appropriately extensible to support these use cases.

  1. AWS::SageMaker::EndpointConfig features:
    1. Serverless Inference: By default, upon endpoint deployment, SageMaker will provision EC2 instances (managed by SageMaker) for hosting purposes. To shield customers from the complexity of forecasting fleet sizes, the ServerlessConfig attribute was added to the ProductionVariant CloudFormation structure of an endpoint config resource. This configuration removes the need for customers to specify instance-specific settings (e.g., instance count, instance type), abstracting the runtime compute from customers, much in the same way Lambda does for its customers. In preparation for the addition of this feature into the CDK, all concrete production variant related classes and attributes have been prefixed with the string [Ii]nstance to designate that they are only associated with instance-based hosting. When later adding serverless support to the SageMaker module, [Ss]erverless-prefixed analogs can be created with attributes appropriate for the use-case with appropriate plumbing to the L1 constructs. Note, there are a number of features which do not yet work with serverless variants, so it may be necessary to incorporate a number of new synthesis-time checks or compile-time contracts to guard against mixing incompatible features. For example, as discussed with the bar raiser, alongside the proposed EndpointConfigProps attribute instanceProductionVariants?: InstanceProductionVariantProps[], a new mutually exclusive attribute serverlessProductionVariant?: ServerlessProductionVariantProps (as only a single variant is supported with serverless inference) could be added with a synthesis-time check confirming that the customer hasn't configured both instance-based and serverless production variants.
    2. Asynchronous Inference: By default, a deployed endpoint is synchronous: a customer issues an InvokeEndpoint operation to SageMaker with an attached input payload, and the resulting response contains the output payload from the endpoint. To instead support asynchronous invocation, the AsyncInferenceConfig CloudFormation attribute was added to the endpoint config resource. To interact with an asynchronous endpoint, a customer issues an InvokeEndpointAsync operation to SageMaker with an attached input location in S3; SageMaker asynchronously reads the input from S3, invokes the endpoint, and writes the output to an S3 location specified within the AsyncInferenceConfig attribute. As discussed with the RFC bar raiser, there are a few ways to tackle the addition of this functionality. One option is to add attribute(s) to the L2 endpoint config construct to support asynchronous inference, along with synthesis-time error handling to catch configuration conflicts (e.g., asynchronous endpoints are only capable of supporting a single instance-based production variant today). Alternatively, an AsyncEndpointConfig subclass of EndpointConfig could be introduced to provide a better compile-time contract to customers (while still implementing the generic functionality within EndpointConfig). Either way, the proposed contracts would only undergo backward-compatible changes.
    3. Model Monitoring: For the purposes of monitoring model performance, the DataCaptureConfig CloudFormation attribute was added, which allows customers to configure a sampling rate of input and/or output endpoint requests that SageMaker should publish to an S3 destination. This functionality is a side effect of normal endpoint operation and has no bearing on other construct APIs, meaning its addition should be confined to new attribute(s) on the endpoint config construct.
  2. AWS::SageMaker::Endpoint features:
    1. Retention of Variant Properties: Once an endpoint has been deployed, the desired instance count and desired weight can be dynamically adjusted per production variant without changing the backing endpoint config resource. These changes can be made either automatically via Application Auto Scaling or manually by the customer via the SageMaker UpdateEndpointWeightsAndCapacities operation. After such changes, by default, when a SageMaker endpoint is updated to use a new endpoint config resource (such as when a CloudFormation change to an endpoint config results in resource replacement), the desired instance count and desired weight are reset to match the new endpoint config resource. To bypass this reset, the RetainAllVariantProperties boolean flag was added to the endpoint resource; when it is set to true, these variant properties are not reset. In addition to this field, ExcludeRetainedVariantProperties was also added to the endpoint resource to allow for selective retention of variant properties (e.g., keeping the desired instance count while resetting the desired weight). As the default behavior (no retention) is already in place, adding this functionality should consist of incorporating new attribute(s) on the Endpoint L2 construct's props interface and plumbing them through to the underlying L1 resource definition.
    2. Deployment Guardrails: By default, when updating an endpoint, SageMaker uses an all-at-once blue/green deployment strategy: a new fleet is provisioned with the new configuration, and upon successful provisioning, traffic is flipped and the old fleet is terminated. To augment this functionality, the DeploymentConfig attribute was added to the Endpoint resource, which now allows (1) the specification of a CloudWatch alarm for auto-rollback and (2) additional deployment policies beyond all-at-once, including canary and linear deployment strategies (along with more fine-grained timing settings). Adding this functionality should consist of incorporating new attribute(s) on the Endpoint L2 construct's props interface and plumbing them through to the underlying L1 resource definition. This work should also include support for the RetainDeploymentConfig boolean flag, which controls whether to reuse the previous deployment configuration or use the new one. Note that a number of SageMaker features prevent the use of deployment configuration, so defending against incompatible combinations of features may improve the customer experience with the Endpoint construct.
  3. AWS::SageMaker::Model features:
    1. Multi-Model Endpoints: By default (and as described in the technical solution above), SageMaker expects the model data URL on each container to point to an S3 object containing a gzipped tar file of artifacts, which will be automatically extracted upon instance provisioning. To support colocation of multiple logical models in a single container, the Mode attribute was added to the ContainerDefinition CloudFormation structure to explicitly configure either SingleModel mode (the default) or MultiModel mode. In multi-model mode, SageMaker instead expects the customer-configured model data URL to point to an S3 path under which multiple gzipped tar files exist. When invoking a multi-model endpoint, the client must specify the target model, i.e., the exact S3 path suffix pointing to a specific gzipped tar file. To accommodate this feature, the proposed ModelData.fromAsset API should be adjusted to support zip file assets capable of containing one or more gzipped tar files. Even though the code need not be aware of .tar.gz files specifically, it might provide a better customer experience to at least put up guardrails that prevent zip file assets from being used in single-model mode, whereas multi-model mode could be more permissive.
    2. Direct Invocation of Multi-Container Endpoints: By default (and as described in the proposed README), when a customer specifies multiple containers for a model, the containers are treated as an inference pipeline (also referred to as a serial pipeline). This means that the containers are treated as an ordered list, wherein the output of one container at runtime is passed as input to the next. Only the output from the last container is surfaced to the client invoking the model. To support a different invocation paradigm, the InferenceExecutionConfig structure was added to the model CloudFormation resource which allows customers to either explicitly configure Serial invocation mode (the default, as an inference pipeline) or the new Direct invocation mode. When using direct mode, a client invoking an endpoint must specify a container to target with their request; SageMaker then invokes only that single container. As SageMaker exposes a new dimension for CloudWatch metrics specific to each directly-invokable container, other than exposing a new inference execution mode attribute on the Model construct, this feature would likely also warrant the addition of a findContainer(containerHostName: string) method to IEndpointProductionVariant which will return a new interface on which additional metric* APIs are present for generating CloudWatch metrics against the dimension consisting of endpoint, variant, and container combined.
    3. Private Docker Registries: The ImageConfig type was added to the existing ContainerDefinition CloudFormation structure in order for customers to specify that a VPC-connected Docker registry will act as the source of the container's image (as opposed to ECR which acts as the default platform repository). This new type also contains an optional RepositoryAuthConfig nested structure in order to specify the ARN of a Lambda function capable of serving repository credentials to SageMaker. In order to deliver this functionality in a backward-compatible way, inspiration can be taken from ECS's ContainerImage.fromRegistry API (note though, ECS sources credentials from Secrets Manager rather than Lambda) in order to make the following additions to the SageMaker module:
      1. Add attributes to ContainerImageConfig to support the specification of a non-platform repository along with an optional Lambda function ARN.
      2. Implement a new, non-exported RegistryImage subclass of ContainerImage whose constructor takes an optional Lambda IFunction instance for generating a ContainerImageConfig instance with the appropriate Lambda function ARN for serving credentials.
      3. On ContainerImage, add a new static fromRegistry method which takes a props object consisting of an optional Lambda IFunction instance. This method acts as a simple static factory method for the non-exported RegistryImage class.
    4. Network Isolation: The EnableNetworkIsolation CloudFormation boolean flag (defaults to false) on a model resource prevents inbound and outbound network calls to/from the model container. Incorporating such an attribute into the Model L2 construct should not conflict with any proposed API.
    5. AWS Marketplace Models: The ModelPackageName string attribute was added to the ContainerDefinition CloudFormation structure to specify the ARN of a reusable, versioned model which can be listed on the AWS Marketplace. When creating a Model resource from a model package, the customer need no longer specify a container image as the model package contains all information about the underlying container(s) required for inference. To incorporate this support into the SageMaker module, it would likely entail creating a new L2 construct ModelPackage to represent the AWS::SageMaker::ModelPackage CloudFormation resource and modifying the proposed ContainerDefinition interface to support an optional IModelPackage as an attribute (while making image: ContainerImage an optional attribute).
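The synthesis-time checks proposed above for serverless and asynchronous inference could take roughly the following shape. This is a hypothetical sketch: `instanceProductionVariants` and `serverlessProductionVariant` follow the attribute names floated with the bar raiser, while `asyncInference`, the remaining prop shapes, and the error messages are assumptions for illustration. It collects all errors rather than throwing on the first one.

```typescript
// Hypothetical prop shapes; only the two variant attribute names come from the RFC text.
interface InstanceProductionVariantProps {
  readonly variantName: string;
  readonly instanceType: string;
  readonly initialInstanceCount?: number;
}

interface ServerlessProductionVariantProps {
  readonly variantName: string;
  readonly memorySizeInMB: number;
  readonly maxConcurrency: number;
}

interface AsyncInferenceProps {
  readonly outputS3Uri: string;
}

interface EndpointConfigPropsSketch {
  readonly instanceProductionVariants?: InstanceProductionVariantProps[];
  // Singular: serverless inference supports only one variant.
  readonly serverlessProductionVariant?: ServerlessProductionVariantProps;
  readonly asyncInference?: AsyncInferenceProps;
}

// Returns validation errors; an empty list means the props are consistent.
function validateEndpointConfigProps(props: EndpointConfigPropsSketch): string[] {
  const errors: string[] = [];
  const instanceCount = props.instanceProductionVariants?.length ?? 0;
  const hasServerless = props.serverlessProductionVariant !== undefined;

  // Instance-based and serverless variants are mutually exclusive.
  if (instanceCount > 0 && hasServerless) {
    errors.push('Cannot configure both instance-based and serverless production variants.');
  }
  // Asynchronous endpoints support only a single instance-based variant.
  if (props.asyncInference !== undefined) {
    if (hasServerless) {
      errors.push('Asynchronous inference requires an instance-based production variant.');
    } else if (instanceCount !== 1) {
      errors.push('Asynchronous inference supports exactly one instance-based production variant.');
    }
  }
  return errors;
}
```

Returning a list of messages rather than throwing matches the `validate(): string[]` shape of the `constructs` library's `IValidation` interface (registered via `node.addValidation`), which is one plausible home for such a check inside the construct.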
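Plumbing the variant-retention flags from L2 props to the L1 resource is mostly a mapping exercise. The sketch below is hypothetical: `retainAllVariantProperties` and `excludeRetainedVariantProperties` are assumed L2 prop names, and the output object only mirrors the shape of the `RetainAllVariantProperties` and `ExcludeRetainedVariantProperties` CloudFormation properties (a list of `{ VariantPropertyType }` entries) rather than using the real L1 class. It also assumes that exclusions only refine a `RetainAllVariantProperties: true` setting, and rejects them otherwise.

```typescript
// Values accepted by the CloudFormation VariantProperty.VariantPropertyType field.
type VariantPropertyType = 'DesiredInstanceCount' | 'DesiredWeight' | 'DataCaptureConfig';

// Hypothetical L2 props for variant retention.
interface EndpointRetentionProps {
  readonly retainAllVariantProperties?: boolean;
  readonly excludeRetainedVariantProperties?: VariantPropertyType[];
}

// Plain-object mirror of the relevant L1 properties (not the real CfnEndpoint class).
interface CfnEndpointRetentionSketch {
  RetainAllVariantProperties?: boolean;
  ExcludeRetainedVariantProperties?: { VariantPropertyType: VariantPropertyType }[];
}

function renderRetention(props: EndpointRetentionProps): CfnEndpointRetentionSketch {
  const excluded = props.excludeRetainedVariantProperties ?? [];
  // Exclusions only make sense when everything else is being retained.
  if (excluded.length > 0 && props.retainAllVariantProperties !== true) {
    throw new Error('excludeRetainedVariantProperties requires retainAllVariantProperties to be true');
  }
  return {
    RetainAllVariantProperties: props.retainAllVariantProperties,
    ExcludeRetainedVariantProperties:
      excluded.length > 0 ? excluded.map((t) => ({ VariantPropertyType: t })) : undefined,
  };
}

// Keep the desired instance count but reset the desired weight on config replacement.
const rendered = renderRetention({
  retainAllVariantProperties: true,
  excludeRetainedVariantProperties: ['DesiredWeight'],
});
```

Leaving both properties `undefined` when no retention props are given preserves the existing default (no retention) without emitting extra keys.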
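The three-step plan for private Docker registries follows a static-factory pattern similar to ECS's `ContainerImage.fromRegistry`. Below is a hypothetical, library-free sketch of it: `FunctionRef`, `ContainerImageConfigSketch`, `RegistryImageProps`, the argument-free `bind()`, and the leading `imageName` parameter (borrowed from the ECS API) are assumptions standing in for the real CDK types such as `IFunction` and the SageMaker `Model`. The `'Vpc'` access mode and credentials-provider ARN mirror the CloudFormation `ImageConfig.RepositoryAccessMode` and `RepositoryAuthConfig.RepositoryCredentialsProviderArn` fields.

```typescript
// Hypothetical stand-in for a Lambda IFunction: only its ARN matters here.
interface FunctionRef {
  readonly functionArn: string;
}

// Step 1: config shape extended with registry access mode and an optional credentials provider.
interface ContainerImageConfigSketch {
  readonly imageName: string;
  // Mirrors CloudFormation ImageConfig.RepositoryAccessMode ('Platform' or 'Vpc').
  readonly repositoryAccessMode?: 'Platform' | 'Vpc';
  // Mirrors ImageConfig.RepositoryAuthConfig.RepositoryCredentialsProviderArn.
  readonly repositoryCredentialsProviderArn?: string;
}

interface RegistryImageProps {
  readonly credentialsProvider?: FunctionRef;
}

abstract class ContainerImage {
  // Step 3: a simple static factory that hides the non-exported subclass.
  static fromRegistry(imageName: string, props: RegistryImageProps = {}): ContainerImage {
    return new RegistryImage(imageName, props.credentialsProvider);
  }

  abstract bind(): ContainerImageConfigSketch;
}

// Step 2: in a real module this class would not be exported.
class RegistryImage extends ContainerImage {
  private readonly imageName: string;
  private readonly credentialsProvider?: FunctionRef;

  constructor(imageName: string, credentialsProvider?: FunctionRef) {
    super();
    this.imageName = imageName;
    this.credentialsProvider = credentialsProvider;
  }

  bind(): ContainerImageConfigSketch {
    return {
      imageName: this.imageName,
      repositoryAccessMode: 'Vpc',
      repositoryCredentialsProviderArn: this.credentialsProvider?.functionArn,
    };
  }
}

const image = ContainerImage.fromRegistry('registry.example.internal/inference:1.0', {
  credentialsProvider: { functionArn: 'arn:aws:lambda:us-east-1:123456789012:function:registry-creds' },
});
const imageConfig = image.bind();
```

Because callers only ever see the abstract `ContainerImage`, the factory can be added without changing any existing signature, which is what keeps the addition backward-compatible.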

Rough Edges

  1. As observed with Lambda and EKS, the Elastic Network Interfaces (ENIs) associated with a SageMaker model's VPC are not always cleaned up in a timely manner after downstream compute resources are deleted. As a result, attempts to delete a SageMaker endpoint along with its networking resources (e.g., subnets, security groups) from a CloudFormation stack may cause the stack operation to fail as the ENIs are still in use. From a CDK integration test perspective, specifying --no-clean will allow the generation of a snapshot regardless of whether stack deletion will succeed or fail but may hinder snapshot re-generation by subsequent CDK contributors. For this reason, it may be helpful to exclude VPC specification from the endpoint integration test at this time.

Cross-module API Convergence

  1. This RFC proposes a new ContainerImage API for the SageMaker module which closely resembles the same-named API from the ECS module. The primary difference between the two is that the ECS module's API binds on an ECS TaskDefinition, whereas this new SageMaker module's API binds on a SageMaker Model. There may be an opportunity to unify these APIs in the future, assuming that binding to a common type (e.g., IGrantable) would be sufficient for both use cases.
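One way to picture such a convergence is an image abstraction that binds against any principal able to receive permissions, rather than against a module-specific type. This is a speculative, library-free sketch: `GrantableSketch` is a hypothetical stand-in for `iam.IGrantable`, `SharedContainerImage` and `EcrImageSketch` are invented names, and the two ECR actions shown are an illustrative subset of what an image pull needs, not a complete policy.

```typescript
// Hypothetical minimal stand-in for iam.IGrantable: something that can receive permissions.
interface GrantableSketch {
  readonly grantPrincipal: { readonly name: string; readonly actions: Set<string> };
}

interface ImageBindingSketch {
  readonly imageUri: string;
}

// A unified image abstraction needing only a grantable, not a TaskDefinition or Model.
interface SharedContainerImage {
  bind(grantee: GrantableSketch): ImageBindingSketch;
}

class EcrImageSketch implements SharedContainerImage {
  readonly repositoryUri: string;
  readonly tag: string;

  constructor(repositoryUri: string, tag: string) {
    this.repositoryUri = repositoryUri;
    this.tag = tag;
  }

  bind(grantee: GrantableSketch): ImageBindingSketch {
    // Grant an illustrative subset of the ECR actions used when pulling an image.
    for (const action of ['ecr:BatchGetImage', 'ecr:GetDownloadUrlForLayer']) {
      grantee.grantPrincipal.actions.add(action);
    }
    return { imageUri: `${this.repositoryUri}:${this.tag}` };
  }
}

// Each consumer exposes the role it runs as; the image does not care which module it comes from.
const modelRole = { grantPrincipal: { name: 'SageMakerModelRole', actions: new Set<string>() } };
const taskRole = { grantPrincipal: { name: 'EcsTaskRole', actions: new Set<string>() } };
const sharedImage = new EcrImageSketch('123456789012.dkr.ecr.us-east-1.amazonaws.com/inference', 'latest');
const fromModel = sharedImage.bind(modelRole);
const fromTask = sharedImage.bind(taskRole);
```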

