With the advent of big data technology, Apache Spark has gained much
popularity in the world of distributed computing by offering a faster,
easier-to-use, in-memory framework compared to the MapReduce
framework. It can be used on top of a variety of distributed storage
systems such as HDFS, Apache Cassandra, and MongoDB.
MongoDB is one of the most popular NoSQL databases. Its ability to
store document-oriented data, combined with built-in sharding and
replication, provides horizontal scalability and high availability.
By using Apache Spark as a data processing platform on top of a
MongoDB database, one can leverage the following Spark API features:
The Resilient Distributed Datasets model
The SQL (HiveQL) abstraction
The machine learning libraries, with APIs in Scala, Java, Python, and R
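As a rough illustration of the first two points, the following spark-shell-style sketch assumes the 2.x MongoDB Spark connector and a hypothetical test.orders collection on a local mongod; the URIs, database, collection, and field names are placeholders.

import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark

// Placeholder URIs; point these at your own deployment.
val spark = SparkSession.builder()
  .appName("spark-sql-over-mongodb")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.orders")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.orders")
  .getOrCreate()

// RDD abstraction: the collection as a MongoRDD[Document].
val ordersRdd = MongoSpark.load(spark.sparkContext)
println(ordersRdd.count())

// SQL abstraction: the collection as a DataFrame with an inferred schema.
val orders = MongoSpark.load(spark)
orders.createOrReplaceTempView("orders")
spark.sql("SELECT status, COUNT(*) AS cnt FROM orders GROUP BY status").show()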
MongoDB Connector for Spark Features
The MongoDB Connector for Spark is an open-source project, written in Scala, for reading data from and writing data to MongoDB.
The connector offers various features that include:
The ability to read/write BSON documents directly from/to MongoDB.
Converting a MongoDB collection into a Spark RDD.
Utility methods to load collections directly into a Spark DataFrame or Dataset.
Predicate pushdown, a Spark SQL (Catalyst) optimization that pushes
where-clause filters and select projections down to the data source to
prevent unnecessary loading of data into memory. With MongoDB as the
data source, the connector converts Spark's filters into a MongoDB
aggregation pipeline $match stage. As a result, the actual filtering and
projections are done on the MongoDB nodes before the data is returned to
the Spark nodes (illustrated in the sketch after this list).
Integration with the MongoDB aggregation pipeline,
where the connector accepts MongoDB pipeline definitions on a MongoRDD
and executes the aggregation on the MongoDB nodes instead of the Spark
nodes. This feature is needed only in rare cases, since most of the work
of optimizing the data load on the workers is done automatically by the
connector.
Data locality – if the Spark nodes are deployed on
the same nodes as the MongoDB nodes and correctly configured with the
MongoShardedPartitioner, then the Spark nodes will load the data
according to its locality in the cluster. This avoids costly
network transfers when the data is first loaded onto the Spark nodes.
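To make the pushdown, pipeline, and locality features concrete, here is a spark-shell-style sketch against the same hypothetical test.orders collection, again assuming the 2.x connector; the URIs, field names, and shard key are placeholders, and the partitioner settings only apply to a sharded cluster.

import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark
import org.bson.Document

val spark = SparkSession.builder()
  .appName("connector-features")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.orders")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.orders")
  // Hypothetical data-locality settings for a sharded cluster with
  // co-located Spark and MongoDB nodes; customerId is a placeholder shard key.
  .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
  .config("spark.mongodb.input.partitionerOptions.shardkey", "customerId")
  .getOrCreate()

// Predicate pushdown: the filter and projection below can be pushed to
// MongoDB as aggregation $match/$project stages, so only the matching
// fields of the matching documents leave the MongoDB nodes.
val delivered = MongoSpark.load(spark)
  .filter("status = 'DELIVERED'")
  .select("customerId", "total")
delivered.show()

// Explicit aggregation pipeline on a MongoRDD: the $match stage runs on
// the MongoDB nodes before any document reaches Spark.
val ordersRdd = MongoSpark.load(spark.sparkContext)
val matched = ordersRdd.withPipeline(Seq(Document.parse("{ $match: { status: 'DELIVERED' } }")))
println(matched.count())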
Several use cases can benefit from Spark built on top of a MongoDB database.
They all take advantage of MongoDB’s built-in replication and sharding
mechanisms to run Spark on the same large MongoDB cluster used by the
business applications to store their data. Typically, the applications
read from and write to the primary members of the replica set, while the
Spark nodes read from the secondary members.
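Assuming the 2.x connector, the read preference used by the Spark nodes can be set through the connector's input configuration; the hosts and namespace below are placeholders.

import org.apache.spark.sql.SparkSession

// Hypothetical configuration: secondaryPreferred keeps the analytical
// reads off the primary that serves the business application.
val spark = SparkSession.builder()
  .appName("analytics-on-secondaries")
  .config("spark.mongodb.input.uri", "mongodb://host1,host2,host3/test.orders")
  .config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
  .getOrCreate()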
Spark fuels analytics: it can be used to extract data from
MongoDB, run complex queries, and then write the results back to another
MongoDB collection. This processing power of Spark eliminates the need
for a new data store. If a centralized store already exists, such as a
data lake built on HDFS for instance, Spark can
extract and transform data from MongoDB before writing it to HDFS.
The advantage is that Spark can be used as a simple and effective ETL
tool to move the data from MongoDB to the data lake. The ability to
load the data on Spark nodes based on their MongoDB shard location is
another optimization offered by the MongoDB connector. The connector's utility
methods simplify the interactions between Spark and MongoDB, making
the combination a powerful platform for building sophisticated analytical
applications.
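A minimal sketch of such an ETL flow, again assuming the 2.x connector (the collections, field names, and HDFS path are placeholders): read a collection from MongoDB, aggregate it, and write the result both to another MongoDB collection and to Parquet files on HDFS.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import com.mongodb.spark.MongoSpark

val spark = SparkSession.builder()
  .appName("mongo-etl")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.orders")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.orderSummaries")
  .getOrCreate()

// Extract: load the source collection from MongoDB.
val orders = MongoSpark.load(spark)

// Transform: a simple aggregation over hypothetical fields.
val summary = orders.groupBy("customerId").agg(sum("total").as("total_spent"))

// Load, option 1: write the result back to another MongoDB collection
// (the one named in spark.mongodb.output.uri).
MongoSpark.save(summary.write.mode("overwrite"))

// Load, option 2: write the same result into a data lake on HDFS as Parquet.
summary.write.mode("overwrite").parquet("hdfs:///datalake/order_summaries")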