
Facebook’s Java-Based Presto Engine Soon to Go Open Source

Many people regard Facebook as a frivolity, and some go so far as to call it a waste of time. We can also vouch for the existence of the uber-rare Facebook holdouts – people who have never once had a profile and do not plan on having one anytime soon.

However, whatever the naysayers think, and whatever Forbes reports about Facebook soon facing “Death by a Thousand SnapChats,” the fact is that Facebook has grown to one billion users in less than a decade. Because of this, it could be regarded as one of the world’s biggest laboratories for scaling up information technology. In fact, from under that pile of cat pictures and gaming requests there is soon to emerge something that no one can consider frivolous – the Java-based Presto engine that Facebook uses internally to query over 250 PB of data.

Before going any further, let us discuss Hadoop, what makes it tick, and what Facebook is doing differently with it. Hadoop is an open source framework that implements MapReduce, a programming model developed by none other than Google that lets a single job be split up and spread across thousands of machines.

MapReduce is somewhat tricky to explain. It is not a piece of software, nor a language, nor really an algorithm; it is a very clever model that borrows from functional programming so that legions of commodity servers can finish in mere hours operations that could otherwise take days. A map step turns each input record into key-value pairs, and a reduce step combines all the values that share a key. Google used MapReduce to replace much of its original engine code as well as the other internal functions that required its engineers to analyze the content of billions of indexed web pages and other pieces of data. Google published a paper detailing MapReduce in 2004, and that paper became the foundation of Hadoop.
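To make the model a little less abstract, here is a minimal sketch of the classic word-count job in plain Python. It runs in a single process, so it only illustrates the shape of the idea; a real framework such as Hadoop would run the map calls on many machines and handle the grouping step itself. The function names (map_phase, shuffle, reduce_phase, word_count) are our own illustrative choices, not part of any Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key (a framework does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call is independent of the others, which is why a
    # cluster could run them in parallel on separate machines.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input data, made up for this illustration.
documents = ["cat pictures and gaming requests", "more cat pictures"]
counts = word_count(documents)
```

Because no map call depends on another, and each reduce call sees only the values for its own key, the same two functions can in principle be handed to a cluster without being rewritten. That independence is what the model borrows from functional programming.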

Like many massive web companies, Facebook relies on Hadoop in conjunction with Hive, the data warehousing tool built on top of Hadoop that Facebook itself originally developed. However, Facebook has found that the batch processing model inherent to MapReduce has limitations (at least for its needs) – hence the need for Presto.

We will have to wait until around fall 2013 to really be able to dig into Presto, but some details are already available. Though similar to Cloudera’s Impala, Presto was built internally and from the ground up to address Facebook’s needs. It is interesting to see such a spirited back-and-forth of data technologies. Even if some question the social relevance of social media, we are at least seeing some impressive data science tools emerge from the web wars.