Out with the old…
In the early days of Coursera, we needed a variety of long-running jobs to support our platform, such as batch email sending, class-wide quiz regrades, and gradebook exports for our instructors. To run them, we built Cascade, a simple PHP framework in which worker threads poll Amazon SQS for new jobs and execute them.
However, the system we had built had a number of drawbacks, such as a lack of isolation between colocated workers and a fragile, manual deployment process. In addition, tight integration with SQS made for a poor development story: developers found it difficult to prototype and test new jobs on our framework. At first, building Cascade in PHP allowed us to integrate tightly with existing code for our online PHP stack. However, as we transitioned to Scala for both the online and offline worlds, confining our jobs to PHP became a hindrance rather than an advantage. As a result, we decided to write a more flexible successor to Cascade, without the shortcomings of our first system.
… in with the new
We named this new system “Iguazú,” after the famous South American waterfalls. Rather than construct Iguazú from scratch, we chose to leverage Docker and Mesos, which were a great fit for our needs in several ways. We also generalized the framework to support pluggable queuing services, which streamlines the development lifecycle by letting developers run against local queues.
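To make the idea of a pluggable queuing service concrete, here is a minimal sketch in Python. It is not Iguazú's actual code: the names JobQueue, LocalJobQueue, and drain are hypothetical, chosen only for illustration. The point is that job-running code depends on a small interface, so an in-memory queue can stand in for a hosted service during local development.

```python
import queue
from abc import ABC, abstractmethod
from typing import Callable, Optional


class JobQueue(ABC):
    """Hypothetical interface that any queuing backend would implement."""

    @abstractmethod
    def enqueue(self, job: dict) -> None:
        """Add a job description to the queue."""

    @abstractmethod
    def dequeue(self) -> Optional[dict]:
        """Return the next job, or None if the queue is empty."""


class LocalJobQueue(JobQueue):
    """In-memory backend for local prototyping and tests (an assumption)."""

    def __init__(self) -> None:
        self._jobs: "queue.Queue[dict]" = queue.Queue()

    def enqueue(self, job: dict) -> None:
        self._jobs.put(job)

    def dequeue(self) -> Optional[dict]:
        try:
            return self._jobs.get_nowait()
        except queue.Empty:
            return None


def drain(job_queue: JobQueue, handler: Callable[[dict], None]) -> int:
    """Run handler on every queued job and return how many were processed."""
    processed = 0
    while (job := job_queue.dequeue()) is not None:
        handler(job)
        processed += 1
    return processed
```

A backend for a hosted queue would implement the same two methods, so code such as drain would not need to change when moving from a laptop to production.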
As a lightweight packaging tool, Docker makes it easy to transition away from Cascade: we simply bundle our existing code for long-running jobs inside a Docker image. Moreover, deployment has become a quick two-step process: build a new Docker image using a Dockerfile, and upload it to a private registry.
While Docker helps us manage our job code, Mesos does the heavy lifting in managing how the jobs are run. By design, Mesos allows us to isolate our jobs and ensure that no single runaway job will cause other jobs to be terminated. Furthermore, Mesos still leaves us with enough control over how our jobs are scheduled and run, allowing us to autoscale without terminating machines that are still running jobs.
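The scale-down rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Iguazú's scheduler and not a Mesos API: the function scale_down_candidates and its input, a mapping from host name to the number of jobs currently running there, are hypothetical.

```python
from typing import Dict, List


def scale_down_candidates(running_jobs: Dict[str, int], max_to_remove: int) -> List[str]:
    """Pick up to max_to_remove hosts that are not running any jobs.

    running_jobs maps a host name to its current job count (hypothetical
    input; a real system would obtain this from its scheduler).
    """
    # Only idle hosts are eligible, so no in-flight job is ever interrupted.
    idle_hosts = sorted(host for host, count in running_jobs.items() if count == 0)
    return idle_hosts[:max(0, max_to_remove)]
```

Busy hosts are never returned, even when that means the cluster shrinks by fewer machines than requested.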
By using Mesos and Docker, we have built a new job-running system that we plan to use for many functions across Coursera, with use cases ranging from export jobs for instructors to grading student-submitted programming assignments to running batch analytics jobs for internal teams. We are currently vetting Iguazú in production and making it as robust and performant as we need it to be. Nevertheless, Mesos and Docker already provide us with numerous wins that we believe will make our new system a great tool for the many kinds of jobs we want to run.