Data plays an important role at Coursera. We use data to improve our learner experience, gather insights into MOOC pedagogy, and give instructors insight into their courses via our instructor dashboards. The data infrastructure team at Coursera seeks to provide data consumers with great tools that enable them to transform and analyze data effectively.
One need that arose was the ability to write complicated data flows that leverage Hadoop MapReduce. MapReduce is revolutionary in that two simple distributed operations, map and reduce, can be used to effectively parallelize computations across large datasets. However, this same simple API makes it inconvenient to perform operations like joins, secondary sorts, and aggregations, because each of them has to be expressed by hand in terms of map and reduce steps.
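To make the inconvenience concrete, here is a minimal sketch of an inner join written in the map/shuffle/reduce style. It is a toy in-memory model built on plain Scala collections, not Hadoop's API, and the record types (`User`, `Enrollment`) and sample data are hypothetical, chosen only for illustration.

```scala
// A toy, in-memory model of a reduce-side join. This is NOT Hadoop's API;
// all names and data below are hypothetical and exist only for illustration.
object ReduceSideJoin {
  final case class User(userId: Int, country: String)
  final case class Enrollment(userId: Int, courseId: String)

  // Values are tagged so the reducer can tell which input they came from.
  sealed trait Tagged
  final case class FromUser(country: String) extends Tagged
  final case class FromEnrollment(courseId: String) extends Tagged

  // Map phase: emit (joinKey, taggedValue) for every record of either input.
  def mapUser(u: User): (Int, Tagged) = (u.userId, FromUser(u.country))
  def mapEnrollment(e: Enrollment): (Int, Tagged) =
    (e.userId, FromEnrollment(e.courseId))

  // Shuffle phase: group values by key (done by the framework in real MapReduce).
  def shuffle(pairs: Seq[(Int, Tagged)]): Map[Int, Seq[Tagged]] =
    pairs.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2) }

  // Reduce phase: for one key, pair every user value with every enrollment value.
  def reduce(userId: Int, values: Seq[Tagged]): Seq[(Int, String, String)] = {
    val countries = values.collect { case FromUser(c) => c }
    val courses = values.collect { case FromEnrollment(c) => c }
    for (country <- countries; course <- courses)
      yield (userId, country, course)
  }

  // Wire the three phases together; keys are sorted only to make output stable.
  def join(users: Seq[User], enrollments: Seq[Enrollment]): Seq[(Int, String, String)] = {
    val mapped = users.map(mapUser) ++ enrollments.map(mapEnrollment)
    shuffle(mapped).toSeq.sortBy(_._1).flatMap { case (k, vs) => reduce(k, vs) }
  }

  def main(args: Array[String]): Unit = {
    val users = Seq(User(1, "US"), User(2, "SG"))
    val enrollments =
      Seq(Enrollment(1, "ml"), Enrollment(1, "algo"), Enrollment(3, "crypto"))
    join(users, enrollments).foreach(println)
  }
}
```

Even in this simplified form, the join needs tagging, grouping, and pairing logic that has nothing to do with the question being asked of the data. That bookkeeping is what a higher-level library can take over.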
Coursera has started writing some of its Hadoop transformations in Scalding, and so far the results have been great. Scalding is concise, performant (including powerful optimizations like skewed join support), and allows us to write all our transformations in Scala. In addition, Scalding makes it easy to unit-test our data flows without having to run Hadoop at all.
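As a rough illustration of both points, the sketch below shows a word-count job and an in-memory test for it, modeled on the word-count example that ships with Scalding. It assumes Scalding is on the classpath and uses the fields-based API with `JobTest`. The names `WordCountJob`, `WordCountJobCheck`, `inputFile`, and `outputFile` are placeholders, and this is not one of Coursera's actual jobs.

```scala
import com.twitter.scalding._

// A word-count job in Scalding's fields-based API (illustrative only).
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    // Split each line into words; 'line is the field TextLine produces.
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    // Count the occurrences of each word.
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}

// Runs the job against in-memory data using JobTest; no Hadoop cluster needed.
object WordCountJobCheck {
  def main(cliArgs: Array[String]): Unit = {
    JobTest(new WordCountJob(_))
      .arg("input", "inputFile")
      .arg("output", "outputFile")
      // Fake contents for the source: (offset, line) pairs.
      .source(TextLine("inputFile"), List((0, "hack hack hack and hack")))
      // Inspect what the job wrote to its sink.
      .sink[(String, Long)](Tsv("outputFile")) { outputBuffer =>
        val counts = outputBuffer.toMap
        assert(counts("hack") == 4L)
        assert(counts("and") == 1L)
      }
      .run
      .finish
  }
}
```

The source and sink are replaced by in-memory buffers keyed on the same `TextLine` and `Tsv` definitions the job uses, so the flow itself runs unchanged under test.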
It’s only natural, then, that as part of our Talks @ Coursera series, we had the pleasure of hosting Ian O’Connell, Scalding contributor and Sr. Software Engineer at Twitter, to talk about real-world examples of using Scalding and Algebird at Twitter scale.
We also had Daniel Chia talk about why Coursera uses Scalding, and what our experience has been like so far.
We hope to see you at our next talk soon!