This checklist is your guide to the best practices for deploying secure, scalable, and highly available infrastructure in Azure. Before you go live, go through each item, and make sure you haven't missed anything important!

ghostinthewires/Azure-Readiness-Checklist

Azure Readiness Checklist

Are you ready to go to prod on Azure? Use this checklist to find out

Building production-grade infrastructure (as in, the type of infrastructure you’d bet your company on) involves a thousand little details. The vast majority of developers don’t know what those details are, so when you’re estimating a project, you usually forget about a number of critical and time-consuming details.

To avoid this issue, every time you go to work on a new piece of infrastructure, go through the following checklist: https://azurechecklist.com/

PRs Welcome


Contributing

Open an issue or a pull request to suggest changes or additions.

Guide

The Azure Readiness Checklist repository consists of two branches:

1. master

This branch consists of the README.md file that is automatically reflected on the Azure Readiness Checklist website.

2. develop

This branch is used for significant changes to the structure or content, if needed. For fixing small errors or adding a new item, it is preferable to use the master branch.

Support

If you have any questions or suggestions, don't hesitate to use Twitter:


Azure Readiness Checklist Badge

If you want to show you are following the rules of the Azure Readiness Checklist, put this badge on your README file!

Azure Readiness Checklist followed

[![Azure Readiness Checklist followed](https://img.shields.io/badge/Azure%20Readiness%20Checklist-Followed-brightgreen)](https://github.com/ghostinthewires/Azure-Readiness-Checklist/)

Below you will find a raw version of the checklist but I highly recommend using the dynamic version at https://azurechecklist.com/ which allows you to generate a report!

Everything you need to do before you go live

This checklist is your guide to the best practices for deploying secure, scalable, and highly available infrastructure in Azure. Before you go live, go through each item, and make sure you haven't missed anything important!

Not every single piece of infrastructure needs every single item on the list but you should consciously and explicitly document which items you’ve implemented, which ones you’ve decided to skip, and why.

  1. Server-side
  2. Client-side
  3. Data
  4. Scalability and High Availability
  5. Continuous Integration
  6. Continuous Delivery
  7. Networking
  8. Security and Governance
  9. Monitoring
  10. Cost optimization

Server-side

Build VM images

If you want to run your apps directly on Virtual Machines, you should package them as a managed image using PowerShell or a tool such as Packer. Although I recommend Docker for all stateless apps (see below), I recommend directly using VM images and VM Instances for all stateful apps, such as any app that writes to its local disk (e.g., WordPress, Jenkins).

Deploy VM images using scale sets

The best way to deploy a VM image is typically to run it as a scale set. This will allow you to spin up multiple VM instances that run your VM image, scale the number of instances up and down in response to load, and automatically replace failed instances.

Build Docker images

If you want to run your apps as containers, you should package your apps as Docker images and push those images to the Azure Container Registry (ACR). I recommend Docker for all stateless apps and for local development (along with Docker Compose).

Deploy Docker images using AKS

For running Docker containers in Azure I recommend using Azure Kubernetes Service (AKS), which is Azure's managed Kubernetes service.
Another option is Azure Container Instances (ACI), a service where Azure manages and scales the underlying VM Instances for you and you just hand it Docker containers to run. However, this is not recommended for scenarios where you need full container orchestration, including service discovery across multiple containers, automatic scaling, and coordinated application upgrades.

Deploy serverless apps using Azure Functions and API Management

If you want to build serverless apps, I recommend you use Azure Functions. You can expose your Azure Functions as HTTP endpoints using API Management.

Configure CPU, memory, and GC settings

Configure CPU settings, memory settings (e.g., -Xmx, -Xms settings for a JVM), and GC settings (if applicable) for your app. If you're deploying directly on VM instances, these should be configured based on the available CPU and memory on your VM instance (see Instance Types). If you are deploying Docker containers, then tell the scheduler the resources your app needs, and it will automatically try to find a VM instance that has those resources.
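As a rough illustration of sizing memory settings from the resources available on a VM instance, the sketch below derives JVM heap flags from the instance's total memory. The function name `jvm_heap_flags` and the 75% heap fraction are assumptions made for this example, not values from the checklist; tune them for your app.

```python
def jvm_heap_flags(total_memory_mb: int, heap_fraction: float = 0.75) -> list[str]:
    """Return -Xms/-Xmx flags sized as a fraction of the instance's memory.

    heap_fraction is a hypothetical default; the remainder is headroom for
    the OS, thread stacks, and off-heap memory.
    """
    if total_memory_mb <= 0:
        raise ValueError("total_memory_mb must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mb = int(total_memory_mb * heap_fraction)
    # Setting -Xms equal to -Xmx avoids heap resizing at runtime.
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]
```

For an instance with 8192 MB of memory, this yields `-Xms6144m -Xmx6144m`.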

Configure hard drives

Configure the OS disk on each VM Instance with enough space for your app and log files. For further data storage, attach one or more Data disks.

Client-side

Pick a JavaScript framework

If you are building client-side applications in the browser, you may wish to use a JavaScript framework such as React, Angular, or Ember. You'll need to update your build system to build and package the code appropriately (see Continuous Integration).

Pick a compile-to-JS language

JavaScript has a number of problems and limitations, so you may wish to use a compile-to-JS language, such as TypeScript, Scala.js, PureScript, Elm, or ClojureScript. You'll need to update your build system to build and package the code appropriately (see Continuous Integration).

Pick a compile-to-CSS language

CSS has a number of problems and limitations, so you may wish to use a compile-to-CSS language, such as SASS, less, cssnext, or postcss. You'll need to update your build system to build and package the code appropriately (see Continuous Integration).

Optimize your assets

All CSS and JavaScript should be minified and all images should be compressed. You may wish to concatenate your CSS and JavaScript files and sprite images to reduce the number of requests the browser has to make. Make sure to enable gzip compression. Much of this can be done using a build system such as Grunt, Gulp, or Broccoli.
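To get a feel for how much gzip compression helps with text assets, the following sketch measures the fraction of bytes saved on a payload using Python's standard `gzip` module. The helper name `gzip_savings` is an assumption for this example; in practice your web server or CDN performs the compression.

```python
import gzip


def gzip_savings(payload: bytes) -> float:
    """Return the fraction of bytes saved by gzip-compressing payload."""
    if not payload:
        raise ValueError("payload must not be empty")
    compressed = gzip.compress(payload)
    return 1 - len(compressed) / len(payload)


if __name__ == "__main__":
    # Stand-in for a real stylesheet; repetitive text compresses very well.
    css = b"body { margin: 0; padding: 0; }\n" * 200
    print(f"gzip saves {gzip_savings(css):.0%} on this sample")
```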

Use a static content server

You should serve all your static content (CSS, JS, images, fonts) from a static content server so that your dynamic web framework (e.g., Rails, Node.js, or Django) can focus solely on processing dynamic requests. The best static content host to use with Azure is Blob Storage.

Use a CDN

Use Azure CDN as a Content Delivery Network (CDN) to cache and distribute your content across servers all over the world. This significantly reduces latency for users and is especially effective for static content.

Data

Deploy relational databases

Use Azure Database to run MySQL, PostgreSQL, SQL Server, or MariaDB. Azure Database supports automatic failover, read replicas, and automated backup.

Deploy NoSQL databases

Use Azure Cache for Redis if you want key-value storage. If you need a managed, eventually consistent document store, consider Azure Cosmos DB, a highly scalable, cloud-native NoSQL database. Azure Cosmos DB supports automatic failover, read replicas, and automated backup.

Deploy queues

Although Azure Queue Storage is good for simple use cases, for more advanced situations I recommend using either Service Bus, Event Hubs or Event Grid. You can find a comparison here.

Deploy search tools

Use Azure Search for operations such as full text search. Alternatively, you can run the Elasticsearch Service (ELK stack).

Deploy stream processing tools

Use Event Hubs to process streaming data. For big data stream processing there are multiple options; this comparison should help.

Deploy a data warehouse

Use Azure SQL Data Warehouse (now part of Azure Synapse Analytics) for data warehousing.

Deploy big data systems

Use Azure HDInsight to run Hadoop, Spark, HBase, Presto, and Hive.

Set up scheduled jobs

Use Azure Logic Apps to reliably run background jobs on a schedule (cron jobs).

Configure disk space

Configure enough disk space on your system for all the data you plan to store. If you are running a data storage system yourself, you'll probably want to store the data on one or more Data disks that can be attached and detached as VM instances are replaced.

Configure backup

Configure backup for all of your data stores, ensuring they are geo-redundant. Most Azure-managed data stores, such as Azure SQL, support automated backups. For backing up VM instances and attached disks, consider using Azure Backup.

Configure cross-subscription backup

Copy all of your backups to a separate Azure subscription for extra redundancy. This ensures that if a disaster happens in one Azure subscription—e.g., an attacker gets in or someone accidentally deletes all the backups—you still have a copy of your data available elsewhere.

Test your backups

If you never test your backups, they probably don't work. Create automated tests that periodically restore from your backups to check they are actually working.

Set up schema management

For data stores that use a schema, such as relational databases, define the schema in schema migration files, check those files into version control, and apply the migrations as part of the deployment process. See Flyway and Liquibase.
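The idea behind migration tools can be sketched in a few lines: keep an ordered list of migrations, record which versions have been applied, and apply only the missing ones. This is a minimal illustration using SQLite from the Python standard library, not how Flyway or Liquibase are implemented; the `MIGRATIONS` list, the `schema_version` table, and the `apply_migrations` function are all hypothetical names.

```python
import sqlite3

# Hypothetical, ordered migrations; in a real project these would live in
# versioned files checked into version control.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]


def apply_migrations(conn: sqlite3.Connection, migrations=MIGRATIONS) -> list[int]:
    """Apply migrations not yet recorded in schema_version; return versions applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    newly_applied = []
    for version, statement in sorted(migrations):
        if version in applied:
            continue  # already applied in an earlier deployment
        conn.execute(statement)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        conn.commit()
        newly_applied.append(version)
    return newly_applied
```

Running `apply_migrations` as part of every deployment is safe because a second run finds nothing left to apply.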

Scalability and High Availability

Choose between a Monolith and Microservices

Ignore the hype and stick with a monolithic architecture as long as you possibly can. Microservices have massive costs (operational overhead, performance overhead, more failure modes, loss of transactions/atomicity/consistency, difficulty in making global changes, backwards compatibility requirements), so only use them when your company grows large enough that you can't live without one of the benefits they provide (support for different technologies, support for teams working more independently from each other). See Don't Build a Distributed Monolith, Microservices — please, don't, and Microservice trade-offs for more info.

Configure service discovery

If you do go with microservices, one of the problems you'll need to solve is how services can discover the IPs and ports of other services they depend on. Some of the solutions you can use include Azure Service Fabric, Azure Kubernetes Service (AKS), and Consul.

Use multiple Instances

Always run more than one copy (i.e., more than one VM instance or Docker container) of each stateless application. This allows you to tolerate the app crashing, allows you to scale the number of copies up and down in response to load, and makes it possible to do zero-downtime deployments.

Use multiple Availability Zones

Configure your Scale Sets, databases, and other resources to make use of Availability Zones. To achieve comprehensive business continuity on Azure, build your application architecture using a combination of Availability Zones and Azure region pairs. You can synchronously replicate your applications and data using Availability Zones within an Azure region for high availability, and asynchronously replicate across Azure regions for disaster recovery protection.

Set up load balancing

Distribute load across your apps and Availability Zones using Azure Load Balancers, which are designed for high availability and scalability. Use Azure Application Gateway for all HTTP/HTTPS traffic and Traffic Manager for DNS-based traffic routing.

Use Auto Scaling

Use auto scaling to automatically scale the number of resources you're using up to handle higher load and down to save money when load is lower.
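To make the scaling decision concrete, here is a generic proportional rule that picks an instance count so average CPU moves toward a target, clamped between a minimum and a maximum. This is an illustrative sketch only and is not how Azure autoscale rules are expressed; the function name and the default target of 60% are assumptions.

```python
import math


def desired_instances(
    current: int,
    cpu_percent: float,
    target_percent: float = 60.0,
    minimum: int = 2,
    maximum: int = 10,
) -> int:
    """Scale the instance count in proportion to load, within fixed bounds."""
    if current < 1 or target_percent <= 0:
        raise ValueError("current must be >= 1 and target_percent must be > 0")
    # If 4 instances run at 90% CPU and the target is 60%, we need 4 * 90 / 60 = 6.
    desired = math.ceil(current * cpu_percent / target_percent)
    # Never drop below the minimum (availability) or exceed the maximum (cost).
    return max(minimum, min(maximum, desired))
```

Keeping the minimum at two or more preserves the ability to tolerate a single instance failing.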

Configure Auto Recovery

Configure a process supervisor such as systemd or supervisord to automatically restart failed processes. Configure your Scale Sets and Load Balancer health checks to automatically replace failed VM instances. Use your Docker orchestration tool to monitor the health of your Docker containers and automatically restart failed ones (e.g., Azure Monitor for Containers).

Configure graceful degradation

Handle failures in your dependencies (e.g., a service not responding) by using graceful degradation patterns, such as retries (with exponential backoff and jitter), circuit breaking, timeouts, deadlines, and rate limiting.
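A minimal sketch of retries with exponential backoff and jitter is shown below. It uses the "full jitter" variant, where each delay is a random value up to a capped exponential bound. The function name `call_with_retries` and its default parameters are assumptions for illustration; production code would usually also restrict which exceptions are retried.

```python
import random
import time


def call_with_retries(func, attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call func(); on failure, wait a jittered, exponentially growing delay and retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Full jitter: random delay in [0, min(cap, base * 2**attempt)).
            sleep(rng() * min(cap, base * 2 ** attempt))
```

The `sleep` and `rng` parameters exist so the behavior can be tested without real waiting or randomness.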

Perform load tests and use chaos engineering

Run load tests against your infrastructure to figure out when it falls over and what the bottlenecks are. Use chaos engineering to continuously test the resilience of your infrastructure (see also chaos monkey). I highly recommend using Gremlin.

Continuous Integration

Pick a Version Control System

Check all code into a Version Control System (VCS). The most popular choice these days is Git. You can use GitHub, GitLab, or BitBucket to host your Git repo but I highly recommend using Azure Repos.

Do code reviews

Set up a code review process in your team to ensure all commits are reviewed. Pull requests are an easy way to do this.

Configure a build system

Set up a build system for your project; I recommend using Azure Pipelines. The build system is responsible for compiling your app, as well as many other tasks described below.

Use dependency management

Your build system should allow you to explicitly define all of the dependencies for your apps. Each dependency should be versioned, and ideally, the versions of all dependencies, including transitive dependencies, are captured in a lock file (e.g., read about Yarn's lock file and Go's dep lock file). I recommend using Azure Artifacts.

Configure static analysis

Configure your build system so it can run static analysis tools on your code, such as linters and code coverage. I recommend SonarCloud for Azure DevOps.

Set up automatic code formatting

Configure your build system to automatically format the code according to a well-defined style (e.g., with Go, you can run go fmt; with Terraform, you can run terraform fmt). This way, all your code has a consistent style, and your team doesn't have to spend any time arguing about tabs vs spaces or curly brace placement.

Set up automated tests

Configure your build system so it can run automated tests on your code. I recommend Azure Test Plans.

Publish versioned artifacts

Configure your build system so it can package your app into a deployable "artifact," such as a NuGet package or Docker image. Each artifact should be immutable and have a unique version number that makes it easy to figure out where it came from (e.g., tag Docker images with the Git commit ID). Push the artifact to an artifact repository (e.g., ACR for Docker images) from which it can be deployed.
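One way to make an image traceable to its source is to build the tag from the version number and the Git commit ID. The sketch below shows the idea; the function name `image_tag`, the tag layout, and the registry name used in the usage note are hypothetical.

```python
import re


def image_tag(registry: str, app: str, version: str, commit_id: str) -> str:
    """Build an image reference of the form registry/app:version-shortsha."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit_id):
        raise ValueError("commit_id must be a 7-40 character hex Git commit ID")
    # The short commit ID ties the artifact back to the exact source revision.
    return f"{registry}/{app}:{version}-{commit_id[:7]}"
```

For example, `image_tag("myregistry.azurecr.io", "web", "0.3.2", "a1b2c3d4e5f6")` returns `myregistry.azurecr.io/web:0.3.2-a1b2c3d`.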

Set up a build server

Set up a server to automatically run builds, static analysis, automated tests, etc. after every commit. I recommend you use a hosted system such as Azure DevOps.

Continuous Delivery

Create deployment environments

Define separate "environments" such as dev, stage, and prod. Each environment can either be a separate Azure subscription (recommended for larger teams and security-sensitive and compliance use cases) or separate VNets within a single Azure subscription (recommended only for smaller teams).

Set up per-environment configuration

Your apps may need different configuration settings in each environment: e.g., different memory settings, different features on or off. Define these either in Variable Groups or config files that get checked into version control (e.g., dev-config.yml, stage-config.yml, prod-config.yml) and packaged with your app artifact (i.e., packaged directly into the Docker image for your app), and have your app's boot code pick the proper config file for the current environment during boot.
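The boot-time selection can be as simple as mapping an environment name to a file name. In this sketch the `APP_ENV` environment variable, the `config` directory, and the `config_path` function are assumptions; only the file naming pattern comes from the text above.

```python
import os
from typing import Optional

VALID_ENVIRONMENTS = ("dev", "stage", "prod")


def config_path(environment: Optional[str] = None, config_dir: str = "config") -> str:
    """Return the config file for the given (or current) environment."""
    # Fall back to the hypothetical APP_ENV variable, defaulting to dev.
    env = environment or os.environ.get("APP_ENV", "dev")
    if env not in VALID_ENVIRONMENTS:
        # Fail fast at boot rather than running prod with the wrong settings.
        raise ValueError(f"unknown environment: {env!r}")
    return f"{config_dir}/{env}-config.yml"
```

Rejecting unknown environment names at boot is a deliberate choice: a typo should stop the app from starting rather than silently load a default.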

Define your infrastructure as code

Do not deploy anything by hand using the Azure Portal. Instead, define all of your infrastructure as code using tools such as Terraform and Azure Resource Manager Templates.

Test your infrastructure code

If all of your infrastructure is defined as code, you can create automated tests for it. The goal is to verify your infrastructure works as expected after every single commit, long before those infrastructure changes affect prod. See Terratest for more info.

Set up immutable infrastructure

Don't update VM instances or Docker containers in place. Instead, launch completely new VM instances and new Docker containers and, once those are up and healthy, remove the old VM instances and Docker containers. Since we never "modify" anything, but simply replace, this is known as immutable infrastructure, and it makes it easier to reason about what's deployed and to manage that infrastructure.

Promote artifacts

Deploy each immutable artifact to one environment at a time, and promote it to the next environment after testing. For example, you might deploy v0.3.2 to dev, and test it there. If it works well, you promote the exact same artifact, v0.3.2, to stage, and test it there. If all goes well, you finally promote v0.3.2 to prod. Since it's the exact same code in every environment, there's a good chance that if it works in one environment, it'll also work in the others.

Roll back in case of failure

If you use immutable, versioned artifacts as your unit of deployment, then any time something goes wrong, you have the option to roll back to a known-good state by deploying a previous version. If your infrastructure is defined as code, you can also see what changed between versions by looking at the diffs in version control.

Automate your deployments

One of the advantages of defining your entire infrastructure as code is that you can fully automate the deployment process, making deployments faster, more reliable, and less stressful.

Do zero-downtime deployments

There are several strategies you can use for zero-downtime deployments, such as blue-green deployment (works best for stateless apps) or rolling deployment (works best for stateful apps).

Use canary deployments

Instead of deploying the new version of your code to all servers, and risking a bug affecting all users at once, you limit the possible damage by first deploying the new code to a single "canary" server. You then compare the canary to a "control" server running the old code and make sure there are no unexpected errors, performance issues, or other problems. If the canary looks healthy, roll out the new version of your code to the rest of the servers. If not, roll back the canary.
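The canary-versus-control comparison can be reduced to a simple check on error rates. The sketch below is one hedged way to express it; the function name `canary_is_healthy` and the one-percentage-point tolerance are assumptions, and a real rollout would typically also compare latency and other signals.

```python
def canary_is_healthy(
    canary_errors: int,
    canary_requests: int,
    control_errors: int,
    control_requests: int,
    tolerance: float = 0.01,
) -> bool:
    """Return True if the canary's error rate is within tolerance of the control's."""
    if canary_requests <= 0 or control_requests <= 0:
        raise ValueError("both servers must have served at least one request")
    canary_rate = canary_errors / canary_requests
    control_rate = control_errors / control_requests
    # Healthy means the canary is no worse than the control plus the tolerance.
    return canary_rate <= control_rate + tolerance
```

If this returns `False`, roll back the canary; otherwise continue the rollout to the remaining servers.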

Use feature toggles

Wrap all new functionality in an if-statement that only evaluates to true if the feature toggle is enabled. By default, all feature toggles are disabled, so you can safely check in and even deploy code that isn't completely finished (as long as it compiles!), and it won't affect any user. When the feature is done, you can use a UI to gradually enable the feature toggle for specific users: e.g., initially just for your company's employees, then for 1% of all users, then 10% of all users, and so on. At any stage, if anything goes wrong, you can turn the feature toggle off again. Feature toggles allow you to separate deployment of new code from the release of new features in that code. They also allow you to do bucket testing. See LaunchDarkly, Split, and Optimizely for more info.
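A percentage rollout needs each user to land in the same bucket every time, so that raising the percentage only ever adds users. The sketch below does this by hashing the feature name and user ID; it is an illustration of the idea, not the API of LaunchDarkly, Split, or Optimizely, and all names in it are hypothetical.

```python
import hashlib


def bucket(feature: str, user_id: str) -> int:
    """Map a (feature, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature: str, user_id: str, rollout_percent: int,
               allow_list=frozenset()) -> bool:
    """Return True if the toggle is on for this user."""
    if user_id in allow_list:
        return True  # e.g., company employees get the feature first
    # Users in buckets below the rollout percentage see the feature.
    return bucket(feature, user_id) < rollout_percent


def greeting(user_id: str, rollout_percent: int = 0) -> str:
    # New functionality wrapped in an if-statement; disabled by default.
    if is_enabled("new-greeting", user_id, rollout_percent):
        return "Welcome to the new experience!"
    return "Welcome!"
```

Including the feature name in the hash means different features roll out to different subsets of users rather than always the same ones.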

Networking

Set up VNets

Create one or more VNets, each with their own IP address range (see VNet Planning), and deploy all of your apps into those VNets.

Set up subnets

Create six "tiers" of subnets in each VNet: gateway, management, firewall, web-tier, business-tier and data-tier. See A Reference VNet Architecture.
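When planning address space, it can help to carve the VNet range into the six tiers programmatically. The sketch below uses Python's standard `ipaddress` module; the 10.0.0.0/16 range in the usage note, the /24 subnet size, and the `plan_subnets` function are assumptions for illustration, not recommendations from the reference architecture.

```python
import ipaddress

TIERS = ["gateway", "management", "firewall", "web-tier", "business-tier", "data-tier"]


def plan_subnets(vnet_cidr: str, new_prefix: int = 24) -> dict[str, str]:
    """Assign one consecutive, non-overlapping subnet of the VNet to each tier."""
    vnet = ipaddress.ip_network(vnet_cidr)
    subnets = vnet.subnets(new_prefix=new_prefix)
    plan = {}
    for tier in TIERS:
        try:
            plan[tier] = str(next(subnets))
        except StopIteration:
            raise ValueError(
                f"{vnet_cidr} is too small for {len(TIERS)} /{new_prefix} subnets"
            ) from None
    return plan
```

For example, `plan_subnets("10.0.0.0/16")` assigns `10.0.0.0/24` to the gateway tier and `10.0.5.0/24` to the data tier, leaving the rest of the range free for growth.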

Configure Network Security Groups

Create Network Security Groups (NSGs) to control what traffic can go between different subnets. I recommend allowing the firewall subnets to receive traffic from anywhere, the web-tier subnets to only receive traffic from the firewall subnets, and so on.
By default, an NSG denies inbound traffic from the Internet but allows traffic within the VNet and outbound traffic to the Internet, so add explicit rules where you need tighter isolation. Follow the Principle of Least Privilege and open up the absolute minimum number of ports you can for each resource. When opening up a port, you can also specify either the CIDR block (IP address range) or an Application Security Group that is allowed to access that port. Reduce these to solely trusted servers where possible. For example, VM instances should only allow RDP access (port 3389) from a single, locked-down, trusted server (the Bastion Host).

Configure Static IPs

By default, all Azure resources (e.g., VM instances, Load Balancers etc.) have dynamic IP addresses that could change over time (e.g., after a redeploy). When possible, use Service Discovery to find the IPs of services you depend on. If that's not possible, you can create static IP addresses.

Configure DNS using Azure DNS

Manage DNS entries using Azure DNS. You can buy public domain names by using a third-party domain name registrar or create custom private domain names, accessible only from within your VNet, using Azure Private DNS.

Security and Governance

Configure encryption in transit

Encrypt all network connections using TLS. Many Azure services support TLS connections by default (e.g., Azure SQL) or if you enable them (e.g., Azure App Service). You can get free, auto-renewing TLS certificates for your public domain names from Let's Encrypt.

Configure encryption at rest

Enable encryption on the OS and Data disks of each VM instance. Many Azure services optionally support encryption: e.g., see Always Encrypted for Azure SQL.

Deploy a Bastion Host

All VM instances should be in a private subnet and NOT accessible directly from the public Internet. Only a single, locked-down VM instance, known as the Bastion Host, should run in the public subnets. You must first connect to the Bastion Host, which gets you "in" to the network, and then you can use it as a "jump host" to connect to the other VM instances. I recommend using Azure Bastion which is a fully-managed PaaS service.

Deploy a VPN Server

I typically recommend running a VPN Server as the entrypoint to your network. OpenVPN is the most popular option for running a VPN server. However, I would recommend using a PaaS option such as VPN Gateway. Alternatively, to extend your on-premises networks I would recommend using ExpressRoute.

Set up a secrets management solution

NEVER store secrets in plaintext. Developers should store their secrets in a secure secrets manager, such as pass, 1Password, or LastPass. Applications should store all their secrets (such as DB passwords and API keys) either in secret variables within an Azure DevOps variable group or in a secret store such as Azure Key Vault or HashiCorp Vault.

Use server hardening practices

Every server should be hardened to protect against attackers. This may include: running CIS Hardened Images, unattended upgrades to automatically install critical security patches, firewall software, anti-virus software, and file integrity monitoring software.

Go through the OWASP Top 10

Browse through the Top 10 Application Security Risks list from the Open Web Application Security Project (OWASP) and check your app for vulnerabilities such as injection attacks, CSRF, and XSS.

Review against the latest CIS Azure Benchmark

Review your environment against the latest CIS Microsoft Azure Foundations Benchmark to check that the security recommendations from the Center for Internet Security have been applied, hardening the environment against potential exploits.

Go through a security audit

Have a third party security service perform a security audit and do penetration testing on your services. Fix any issues they uncover.

Sign up for security advisories

Join the security advisory mailing lists for any software you use and monitor those lists for announcements of critical security vulnerabilities.

Set up automated security tests

Configure your build system so it can run automated security tests on your code. I recommend WhiteSource Bolt for Azure DevOps.

Create Active Directory Users

Create an Active Directory User for each developer. Accounts should not be shared.

Create Active Directory Groups

Manage permissions for Active Directory users using Active Directory Groups. Follow the Principle of Least Privilege, assigning the minimum permissions possible to each Active Directory Group and User.

Create Active Directory Roles

Give your Active Directory Groups access to Azure resources by assigning Roles (RBAC).

Create a password policy and enforce MFA

Set a password policy that requires a long password for all users and require every user to enable Multi-Factor Authentication (MFA).

Record audit Logs

Configure audit logs of all changes happening in your Azure subscription. I recommend Azure Security Center to help manage this.

Configure Azure Policy

Configure Azure Policy to monitor for compliance across your Azure resources and enforce different rules and effects to meet your company requirements. Azure provides many example policy definitions to get you started. In addition, you can use Azure Policy initiative definitions, which group policies together to monitor and enforce them for a common goal.
An example Initiative definition is Audit VMs with insecure password security settings. This initiative includes policies for password complexity, password re-use policy, password age policy, amongst others, all to achieve the goal of ensuring password security settings are correct.
Policies can also be deployed using Azure Blueprints. This includes blueprint samples for ISO 27001 and CIS Microsoft Azure Foundations Benchmark. Deploying your infrastructure and policies together using blueprints means you can deploy your solution in a way that meets your compliance needs using a trusted, repeatable process.

Monitoring

Track availability metrics

The most basic set of metrics: can a user access your product or not? Useful tools: Application Insights, which is part of Azure Monitor.

Track business metrics

Metrics around what users are doing with your product, such as what pages they are viewing, what items they are buying, and so on. Useful tools: Google Analytics and Mixpanel.

Track application metrics

Metrics around what your application is doing, such as QPS, latency, and throughput. Useful tools: Application Insights, which is part of Azure Monitor.

Track server metrics

Metrics around what your hardware is doing, such as CPU, memory, and disk usage. Useful tools: Application Insights, which is part of Azure Monitor.

Configure services for observability

Record events and stream data from all services. Slice and dice it using tools such as Kafka, Honeycomb, and of course Application Insights, which is part of Azure Monitor.

Store logs

To prevent log files from taking up too much disk space, configure log rotation on every server. To be able to view and search all log data from a central location (i.e., a web UI), set up log aggregation using tools such as Azure Monitor, Filebeat, Logstash etc.

Set up alerts

Configure alerts when critical metrics cross pre-defined thresholds, such as CPU usage getting too high or available disk space getting too low. Most of the metrics and log tools listed earlier in this section support alerting. Set up an on-call rotation using tools such as PagerDuty, Opsgenie and VictorOps.

Cost optimization

Pick proper VM instance types and sizes

Azure offers a number of different instance types, each optimized for different purposes: compute, memory, storage, GPU, etc. Use Azure Price to slice and dice the different instance types across a variety of parameters. Try out a variety of instance sizes by load testing your app on each type and picking the best balance of performance and cost. In general, running a larger number of smaller instances ("horizontal scaling") is going to be cheaper, more performant, and more reliable than a smaller number of larger instances ("vertical scaling").

Use Low Priority VM instances for background jobs

Low Priority VM instances are available in conjunction with Azure Batch and are offered at a much lower price than what you'd pay on-demand (as much as 80% lower!). When there is capacity to fulfill your request, Azure will give you the VM instances at that price. Note that if Azure needs to reclaim that capacity, it may terminate the VM instance at any time. This makes Low Priority VM instances a great way to save money on any workload that is non-urgent (e.g., all background jobs, machine learning, image processing) and pre-production environments.

Use Azure Reserved instances for dedicated work

Azure Reserved instances allow you to reserve capacity ahead of time in exchange for a significant discount (up to 72%) over on-demand pricing. This makes Reserved Instances a great way to save money when you know for sure that you are going to be using a certain number of instances consistently for a long time period. Azure Reserved instances are a billing optimization, so no code changes are required: just reserve the Instance Type, and next time you use it, Azure will charge you less for it.

Shut down VM instances when not using them

You can shut down VM instances when you're not using them, such as in your pre-prod environments at night and on weekends. You could even create an Azure Automation solution that does this on a regular schedule.

Use Scale Sets

Use Scale Sets to increase the number of VM instances when load is high and then to decrease it again—and thereby save money—when load is low.

Use Docker when possible

If you deploy everything as a VM image directly on your VM instances, then you will typically run exactly one type of app per VM instance. If you use a Docker orchestration tool (e.g., AKS), you can give it a cluster of VM instances to manage, and it will deploy Docker containers across the cluster as efficiently as possible, potentially running multiple apps on the same instances when resources are available.

Use Azure Functions when possible

For all short (5 min or less) background jobs, cron jobs, ETL jobs, event processing jobs, and other glue code, use Azure Functions. You not only have no servers to manage, but Azure Function pricing is incredibly cheap, with the first 1 million executions and 400,000 GB-seconds per month being completely free! After that, it's just £0.150 per million executions and £0.000012 for every GB-second.
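To estimate a monthly bill from the figures quoted above, you can subtract the free grant and price the remainder. This is a simplified sketch: the function name is hypothetical, it assumes every execution has the same duration and memory size, and it ignores any rounding rules or other charges that apply to real invoices.

```python
FREE_EXECUTIONS = 1_000_000
FREE_GB_SECONDS = 400_000
PRICE_PER_MILLION_EXECUTIONS = 0.150  # GBP, as quoted above
PRICE_PER_GB_SECOND = 0.000012  # GBP, as quoted above


def monthly_cost(executions: int, avg_duration_s: float, memory_gb: float) -> float:
    """Estimate the monthly Azure Functions cost in GBP after the free grant."""
    gb_seconds = executions * avg_duration_s * memory_gb
    billable_executions = max(0, executions - FREE_EXECUTIONS)
    billable_gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
    cost = (
        billable_executions / 1_000_000 * PRICE_PER_MILLION_EXECUTIONS
        + billable_gb_seconds * PRICE_PER_GB_SECOND
    )
    return round(cost, 2)
```

For example, 3 million executions of 1 second each at 0.5 GB use 1,500,000 GB-seconds. After the free grant, that is 2 million billable executions (£0.30) plus 1,100,000 billable GB-seconds (£13.20), or about £13.50 for the month.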

Clean up old data with Azure Blob Lifecycle Management

If you have a lot of data in Azure Blob Storage, make sure to take advantage of Azure Blob Lifecycle Management to save money. You can configure lifecycle management rules to move blobs older than a certain age to cheaper storage tiers or to delete them entirely.

Clean up unused resources

Use Azure Advisor to identify unused or underutilised Azure resources, such as old VM instances that no one is using any more.

Learn to analyze your Azure bill

Learn to use tools such as Azure Advisor and Cloudyn to understand where you're spending money. If you find something you can't explain, reach out to Azure Support, and they will help you track it down.

Create billing alerts

Create alerts to notify you when your Azure bill crosses important thresholds. Make sure to have several levels of alerts: e.g., at the very least, one when the bill is a little high, one when it's really high, and one when it is approaching bankruptcy levels.
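The multi-level alerting idea can be sketched as a lookup against tiered thresholds. The tier fractions (50%, 80%, and 100% of a monthly budget) and the `alert_level` function are assumptions chosen for this example; in Azure you would configure the equivalent thresholds on a budget rather than in application code.

```python
from typing import Optional

# Hypothetical tiers: (fraction of monthly budget, alert name).
ALERT_TIERS = (
    (0.5, "a little high"),
    (0.8, "really high"),
    (1.0, "approaching bankruptcy"),
)


def alert_level(spend: float, budget: float, tiers=ALERT_TIERS) -> Optional[str]:
    """Return the name of the highest tier the current spend has crossed, if any."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    level = None
    for fraction, name in sorted(tiers):
        if spend >= budget * fraction:
            level = name  # keep the highest threshold crossed so far
    return level
```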


Inspired by the Gruntwork Production Readiness Checklist which covers AWS
