This is the first post in a two-part blog series about how Coursera migrated from EC2 Classic to VPC. This post highlights general strategies and service migration, while the next post will deal with migrating our storage subsystems.
As Coursera grows, the infrastructure team is always looking for ways to provide better performance and security while reducing operating costs. Late last year, we identified that migrating from EC2 Classic (“Classic”) to Amazon VPC would give us (i) better networking infrastructure, (ii) access to more instance types, and (iii) richer security features, such as more flexible security groups and better network filtering systems.
These features allow us access to higher performance instances at a lower cost while also enabling new platform features, such as our new programming assignment grading system, to be developed with the more flexible security controls in VPC. Thus, we developed and executed a plan to migrate the bulk of our infrastructure into VPC.
Before we jump into the migration strategy, here is a brief overview of our infrastructure ecosystem. Coursera's backend is a service-oriented architecture, with over 70 services of various types communicating with each other. We have three tiers of services:
- Edge Services. This is where we handle incoming requests and route them to their correct destination. We also handle session management and authentication in this layer.
- Product Services. These services power the bulk of our product offerings. For instance, we have a service for our discussion forum, another for our assignment grading system and yet another to manage our authoring interfaces.
- Backend Services. These services provide generalized store abstractions for use by product services. For example, our document store and asset service are utilized by almost all our product services to store and retrieve learner and course data.
We use a blue-green deployment pattern, where each new version of a service is built and deployed with its own autoscaling group (ASG). To validate and deploy a new build of a service, we use canary requests and granular traffic shaping. Much of this is orchestrated using Zookeeper, which also serves as our service discovery layer.
Beneath the services, our backend is ultimately powered by a combination of Apache Cassandra and Amazon RDS for general data storage, Apache Kafka for pub-sub messaging and Amazon Redshift for data analysis.
General Migration Strategy
Contrary to popular opinion, we decided to migrate our services to VPC first, before our data storage servers. This was lower risk for us because we could easily roll a service back to the EC2 Classic platform if anything went wrong, simply by deploying a new version of the service within Classic. This would be hard to do with data stores, where migration is often a multi-step process potentially involving downtime. We also picked non-critical services first, as that allowed us to gain hands-on operational experience with the migration process. That experience was invaluable in helping us migrate Cassandra, RDS and Zookeeper later.
Our migration strategy heavily depended on a relatively recently introduced feature, Amazon ClassicLink, to enable communications between instances within a VPC and instances in Classic. ClassicLink allowed us to migrate each service and data store independently while allowing all instances to talk to each other properly, eliminating the risk inherent in an all-or-nothing migration.
In designing our architecture within VPC, we decided to go with a broadly similar design to our EC2 Classic system with a parallel set of security groups. Since VPC allows changing security groups of running instances on the fly, we have the option to move to a more sophisticated security and network architecture later on.
We also decided to allocate a /16 for the new production VPC so that we have enough IP addresses as Coursera grows.
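To give a sense of how much room a /16 provides, here is a small sketch using Python's standard `ipaddress` module. The `10.0.0.0/16` block and the choice of /20 subnets are assumptions for illustration; the actual CIDR range and subnet layout are not stated here.

```python
import ipaddress

# Hypothetical CIDR block; only the /16 size comes from the text above.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Total addresses in the block. AWS reserves a few addresses in every
# subnet, so the usable count is slightly lower in practice.
total_addresses = vpc.num_addresses

# One possible layout (an assumption): carve the VPC into /20 subnets,
# for example to spread instances across availability zones.
subnets = list(vpc.subnets(new_prefix=20))

print(total_addresses)                        # 65536
print(len(subnets))                           # 16
print(subnets[0], subnets[0].num_addresses)   # 10.0.0.0/20 4096
```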
ClassicLink, Proxies and Route53
Before starting to migrate services, we turned on ClassicLink for all our instances in Classic so migrated services in VPC could still communicate with unmigrated instances.
As RDS and Redshift do not support ClassicLink, we worked around this by creating proxy instances in Classic running HAProxy in TCP proxy mode. For high availability, we run two instances of this proxy for each RDS database and Redshift instance, each in a different availability zone. We then enabled ClassicLink on the proxy instances so that instances in VPC can reach the databases through them.
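As a rough illustration of what such a proxy needs, the sketch below renders a minimal HAProxy `listen` section in TCP mode. The helper `render_tcp_proxy`, the section name, the hostname and the port are all hypothetical; our actual proxy configuration is not shown in this post.

```python
def render_tcp_proxy(name: str, listen_port: int,
                     backend_host: str, backend_port: int) -> str:
    """Render a minimal HAProxy 'listen' section that forwards raw TCP."""
    lines = [
        f"listen {name}",
        f"    bind *:{listen_port}",   # accept connections on this port
        "    mode tcp",                # pass bytes through, no HTTP parsing
        f"    server {name}-backend {backend_host}:{backend_port} check",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical database endpoint and port, for illustration only.
print(render_tcp_proxy("rds-main", 3306, "db.example.internal", 3306))
```

Because the proxy works at the TCP level, clients speak the database's native protocol through it unchanged.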
To make the proxies transparent to all other instances, we created a private DNS zone for the VPC in Route53 and made sure that the domain names used for the proxies in the private zone matched the domain names for the underlying resources in the public zone. We configured it in this manner so that no configuration change would be necessary when migrating a service between Classic and VPC. Instances in Classic resolve a name to the underlying resource's IP, while services in VPC resolve the same name to the proxy.
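The effect of this split-horizon setup can be modeled in a few lines. This is a toy simulation rather than Route53 API code; the hostname and both addresses are placeholders.

```python
# Hypothetical records: the same name exists in both zones.
PUBLIC_ZONE = {"db.example.com": "203.0.113.10"}   # underlying resource
PRIVATE_ZONE = {"db.example.com": "10.0.1.25"}     # proxy address


def resolve(name: str, in_vpc: bool) -> str:
    """Return the address a client would see for `name`.

    Clients inside the VPC are answered from the private zone;
    clients in Classic are answered from the public zone.
    """
    zone = PRIVATE_ZONE if in_vpc else PUBLIC_ZONE
    return zone[name]


print(resolve("db.example.com", in_vpc=False))  # 203.0.113.10
print(resolve("db.example.com", in_vpc=True))   # 10.0.1.25
```

A service keeps the same connection string in both environments; only the answer to the DNS query differs.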
We use our own homegrown deployment tool, Quack, to deploy our services. Quack uses the AWS Java SDK to interact with AWS APIs to create and terminate instances and autoscaling groups. A simple code change was made to Quack to enable ClassicLink on all new VPCs created and to allow creation of VPC-based autoscaling groups.
We identified a non-critical service, render (our server-side rendering service for React pages), to move to VPC first. We created a new autoscaling group within VPC and allowed the service to start up and register itself with our Zookeeper instances. We then shifted traffic to the new instances slowly over the course of 15 minutes, while monitoring our internal metrics via Datadog and watching for any learner reports. Once 100% of traffic to the service was served by VPC instances, we could safely terminate the Classic autoscaling group.
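The gradual shift can be sketched as a loop that raises the VPC share of traffic step by step and backs out if a health check fails. This is a simplified model, not Quack's implementation: `shift_traffic`, `set_vpc_weight` and `is_healthy` are hypothetical names, and the step percentages are made up.

```python
from typing import Callable, Iterable


def shift_traffic(
    steps: Iterable[int],
    set_vpc_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> int:
    """Walk the VPC traffic share through `steps`, backing out on failure.

    Returns the final percentage of traffic served from VPC.
    """
    current = 0
    for percent in steps:
        set_vpc_weight(percent)
        if not is_healthy():
            # Send everything back to the still-running Classic group.
            set_vpc_weight(0)
            return 0
        current = percent
    return current


applied = []  # records each weight that was set
final = shift_traffic([1, 10, 50, 100], applied.append, lambda: True)
print(final, applied)  # 100 [1, 10, 50, 100]
```

In a real rollout, `is_healthy` would wait and consult monitoring before returning, and the Classic group would only be terminated after the function returns 100.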
We repeated this for all of our 70+ services over the course of three weeks, and we are very happy to report that the migration caused zero downtime and required no intervention from our product engineers. The switch also enabled us to start using t2 and m4 instance types for our low and high volume production services respectively, reducing the cost of running our infrastructure.
In Part 2 of this blog post series, you will learn how we migrated our Cassandra and RDS instances from Classic to VPC.