MongoDB with Apache Spark

Author: Wavicle Data Solutions

With the advent of big data technology, Apache Spark has gained much popularity in the world of distributed computing by offering a faster, easier to use, and in-memory framework as compared to the MapReduce framework. It can be used with any distributed storage such as HDFS, Apache Cassandra and MongoDB.

MongoDB is one of the most popular NoSQL databases. Its unique capabilities to store document-oriented data using the built-in sharding and replication features provide horizontal scalability and high availability.

By using Apache Spark as a data processing platform on top of a MongoDB database, one can leverage the following Spark API features:

The Resilient Distributed Datasets model
The SQL (HiveQL) abstraction
The Machine learning libraries – Scala, Java, Python, and R

Mongodb Connector for Spark Features

The MongoDB connector for Spark is an open-source project written in Scala, to read and write data from MongoDB.

The connector offers various features that include:

The ability to read/write BSON documents directly from/to MongoDB.
Converting a MongoDB collection into a Spark RDD.
Utility methods to load collections directly into a Spark Data Frame or Dataset.
Predicate pushdown, which is a Spark SQL’s Catalyst optimization to push the where clause filters and the select projections down to the data source to prevent unnecessary loading of data into memory. When considering MongoDB as the data source, the connector will convert the Spark’s filters to a MongoDB aggregation pipeline match. As a result, the actual filtering and projections are done on the MongoDB node before returning the data to the Spark node.
Integration with the MongoDB aggregation pipeline where the connector accepts MongoDB’s pipeline definitions on a MongoRDD to execute aggregations on the MongoDB nodes instead of the Spark nodes.

This feature holds good in rare cases since most of the work to optimize the data load in the workers is done automatically by the connector.

Data locality – If the Spark nodes are deployed on the same nodes as the MongoDB nodes and correctly configured with a Mongo Sharded Partitioner, then the Spark nodes will load the data according to their locality in the cluster. This will avoid costly network transfers when first loading the data in the Spark nodes.

Technical Architecture

Different use cases can benefit from Spark built on top of a MongoDB database. They all take advantage of MongoDB’s built-in replication and sharding mechanisms to run Spark on the same large MongoDB cluster used by the business applications to store their data. Typically, applications read/write on the primary replica set while the Spark nodes read data from a secondary replica set.

Spark fuels analytics and it can be used to extract data from MongoDB, run complex queries and then write the data back to another MongoDB collection. This processing power of Spark eliminates the need of new data storage. If there is an already existing centralized storage such as a Data Lake built with HDFS for instance, Spark can extract and transform data from MongoDB before writing it to HDFS.

The advantage is that Spark can be used as a simple and effective ETL tool to move the data from MongoDB to the data lake. The ability to load the data on Spark nodes based on their MongoDB shard location is another optimization from MongoDB. The MongoDB connector’s utility methods simplify the interactions between Spark and MongoDB, thus making it a powerful combination to build sophisticated analytical applications.