We've got to maintain a certain level of 'street-cred'.

A Look at MongoDB and Hadoop

Hadoop is one of the hottest technologies around right now, yet a lot of people (even those knowledgeable about IT) stumble when asked to define it.

Calling it a “big data” solution is accurate but somewhat incomplete because it is only one of many ways of managing high-volume data analytics. Calling it a “database” is completely wrong, as is calling it a “file system.”

The core of Hadoop is the API and task engine for MapReduce jobs, which operate on data sets distributed across nodes by the Hadoop Distributed File System (HDFS, the open-source twin of the Google File System) and can optionally use the HBase columnar database (based on Google’s Bigtable). With Hadoop 2.0 around the corner, however, even these three components will no longer be required.
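To make the MapReduce model concrete, here is a minimal sketch of the canonical word-count job written against Hadoop’s standard mapreduce API; the input and output paths are placeholders passed on the command line, not anything specific to this article.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs near whichever node holds each HDFS block, emitting (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sums the counts for each word after the shuffle.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}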

With Hadoop becoming an increasingly flexible platform, where do NoSQL document stores such as MongoDB fit in? Both technologies are exploding in popularity, but since Hadoop’s native database is column-oriented, does that make it incompatible with MongoDB?

The answer is an emphatic no. One advantage of the simplified, schema-free nature of a database such as MongoDB is that it is not difficult to adapt it to work with other technologies. That is especially valuable when the database can store massive amounts of data and the processing engine can churn through enormous jobs in batches.
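As a rough illustration of that schema-free flexibility (the connection string, database, and field names below are hypothetical), documents with entirely different shapes can sit side by side in a single MongoDB collection with no schema migration, which is what makes it easy to hand the same collection to another system for batch processing:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class SchemaFreeExample {
    public static void main(String[] args) {
        // Connection string, database, and collection names are illustrative only.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("demo").getCollection("events");

            // Two documents with different shapes live in the same collection;
            // no schema change is needed before a batch job reads them.
            events.insertOne(new Document("type", "click")
                    .append("url", "/home")
                    .append("ts", System.currentTimeMillis()));
            events.insertOne(new Document("type", "purchase")
                    .append("items", java.util.Arrays.asList("sku-1", "sku-2"))
                    .append("total", 42.50));
        }
    }
}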

With MongoDB’s Hadoop adapter, only a few tweaks to configuration files and class definitions are needed to feed MongoDB data into Hadoop. More complex aggregation and analysis becomes possible by feeding the results of batch jobs back into MongoDB. The connector is also effective for data warehousing or for serving as a complex ETL mechanism. Ultimately, no matter how you decide to use them, MongoDB and Hadoop make an excellent pair.
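As a rough sketch of how such a job might be wired up (assuming the connector’s MongoInputFormat/MongoOutputFormat classes and its mongo.input.uri/mongo.output.uri settings), the configuration below swaps MongoDB collections in for the usual HDFS input and output paths; the URIs and collection names are made up, and the mapper and reducer are omitted for brevity:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class MongoHadoopJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read source documents from this collection (hypothetical URI).
        conf.set("mongo.input.uri", "mongodb://localhost:27017/demo.events");
        // Write the batch results back into MongoDB for further querying.
        conf.set("mongo.output.uri", "mongodb://localhost:27017/demo.event_counts");

        Job job = Job.getInstance(conf, "mongo-hadoop sketch");
        job.setJarByClass(MongoHadoopJobSketch.class);

        // The connector's InputFormat/OutputFormat replace the usual HDFS file
        // formats, so the mappers see BSON documents directly.
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);

        // Mapper and Reducer classes (omitted here) are set as in any other job:
        // job.setMapperClass(...); job.setReducerClass(...);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}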