A Data Migration Story: Leveraging Databricks for Performance, Maintenance, and Cost Benefits

Author: Wavicle Data Solutions


The amount of data produced worldwide continues to increase rapidly, and businesses that don’t adapt will not be competitive. Even though most organizations understand this, it remains challenging to develop and deploy a data solution in which data is democratized and prepared for advanced analytics. At Wavicle, migrating data from on-premises to cloud-based solutions is one of our specialties. Businesses that attempt a cloud migration without strong strategic planning and an experienced migration partner like Wavicle can fail to realize the benefits of these solutions and can ultimately increase operational costs and reduce productivity.

 

Many of our customers are focused on building advanced analytics with high-performance platforms that include built-in tools, frameworks, and accelerators that allow for improved operational efficiency, reduced costs, and the deployment of valuable insights. In this example, we helped a customer migrate from AWS Elastic MapReduce (EMR), a managed cluster platform for running big data frameworks such as Apache Spark and Apache Hadoop, to the Databricks Lakehouse to reach its operational and business goals.

 

Why Databricks Lakehouse

Databricks, the lakehouse company, provides data engineering tools for processing and transforming vast volumes of data to build machine learning models. Databricks is built on top of distributed cloud computing environments like Azure, AWS, or Google Cloud that facilitate running applications on CPUs or GPUs based on analysis requirements.

 

It simplifies big data analytics through a lakehouse architecture, which adds data warehousing capabilities to a data lake. This eliminates the unwanted data silos created when data is pushed into separate data lakes or multiple data warehouses, and it gives data teams a single source of data.

 

Two key areas made Databricks an ideal platform for this migration:

  1. The AWS EMR cluster did not support effective autoscaling. When working with large data volumes, it is crucial that clusters don’t use more resources than necessary. Databricks still requires exploring and finalizing a configuration that works, but scaling involves less overhead.
  2. When AWS EMR pipelines failed, we typically needed to restart them from scratch. This has cost implications because the cluster and other resources are billed again for the full run. In comparison, Databricks provides a repair run option that restarts a pipeline from the point of failure.
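
To make the second point concrete, here is a minimal sketch in Python of why restarting from the point of failure costs less than restarting from scratch. The task names, durations, and hourly rate are hypothetical illustrations, not figures from this migration, and the model deliberately ignores cluster start-up time and retries.

```python
# Hypothetical pipeline: (task name, runtime in hours). Not real figures.
TASKS = [("ingest", 1.0), ("transform", 2.0), ("publish", 0.5)]
HOURLY_RATE = 10.0  # assumed cluster cost per hour, for illustration only


def rerun_cost(tasks, failed_task, from_scratch, rate=HOURLY_RATE):
    """Cost of the second run after `failed_task` failed on the first run."""
    names = [name for name, _ in tasks]
    # A full restart reruns every task; a repair starts at the failed task.
    start = 0 if from_scratch else names.index(failed_task)
    hours = sum(duration for _, duration in tasks[start:])
    return hours * rate


full_restart = rerun_cost(TASKS, "publish", from_scratch=True)
repair_only = rerun_cost(TASKS, "publish", from_scratch=False)
print(f"Full restart: ${full_restart:.2f}, repair from failure: ${repair_only:.2f}")
```

In this toy model, a failure in the last task means the full restart pays for all 3.5 hours again, while the repair pays only for the final 0.5 hours. The later a pipeline fails, the larger the gap.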

 

The motivation to migrate

This company’s motivation to migrate was based on four main points:

 

(1) Enhanced processing

One main reason to migrate was the advanced processing engine in the Databricks Lakehouse. Businesses have or should have a data management roadmap that details their journey up the data maturity curve. A robust processing engine that supports faster data processing and analysis is central to the goal of providing predictive analytics and automated decision-making (the top of the data maturity curve). The Databricks Lakehouse supports deploying advanced tools such as machine learning and artificial intelligence to drive the data maturity roadmap.

 

(2) Reduced costs

Compared to AWS EMR, migrating to Databricks reduced the overall cost of managing a large data platform. This is due to the Databricks data processing engine, which shortens the computing time needed to process the data and, in turn, lowers operational spend. Databricks also offers a pay-as-you-go pricing model that helps customers save money when compared to alternatives with fixed pricing models.

 

(3) Collaboration and data sharing

The Databricks Lakehouse offers a centralized platform that supports data management and processing. This architecture makes it easy for teams to collaborate on data projects. Further, the lakehouse platform enables data sharing across the enterprise. This allows companies to realize the power of data sharing, which results in improved data insights and decision-making.

 

(4) Security

Data security is a significant concern for companies managing sensitive personally identifiable information (PII), protected health information (PHI), or other private data. The lakehouse platform provides strong encryption, strong access controls, and robust auditing capabilities. These features are central to a mature data security program to serve the enterprise.

 

 

Managing the migration to Databricks

Wavicle’s team of expert cloud migration consultants has supported numerous seamless Databricks migrations. The following are some of the highlights:

 

  • We migrated the data pipelines (PySpark code) from AWS EMR to Databricks. Though many of our customers prefer a lift-and-shift approach, we worked with our customers to ensure many optimizations were included that leverage specific features in Databricks, which resulted in performance improvements.
  • We performed migrations where the data remained in Amazon S3. In such a no-data migration project, the PySpark code on Databricks reads the data from Amazon S3, performs transformations, and persists the results back to Amazon S3.
  • We converted existing PySpark DataFrame API scripts to Spark SQL. PySpark’s pyspark.sql module allows SQL queries to run against DataFrames and tables. This change was intended to make the code more maintainable.
  • We fine-tuned Spark code to reduce the run time of data pipelines and improve performance.
  • We used Hive tables.
  • The deployment team tested jobs on multiple clusters because, in Databricks, each cluster configuration is billed at a different rate. The team then selected the job cluster that improved performance and reduced cost.
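
The cluster selection in the last highlight can be sketched as a small comparison. This is an illustrative sketch only: the cluster labels, run times, hourly costs, and the `pick_cluster` helper are hypothetical and are not measurements or tooling from this migration.

```python
from dataclasses import dataclass


@dataclass
class TrialRun:
    cluster: str          # hypothetical cluster label
    runtime_hours: float  # measured job run time on that cluster
    hourly_cost: float    # assumed all-in cost per hour for that cluster

    @property
    def total_cost(self) -> float:
        return self.runtime_hours * self.hourly_cost


def pick_cluster(trials, max_runtime_hours):
    """Return the cheapest trial that meets the run-time target, or None."""
    eligible = [t for t in trials if t.runtime_hours <= max_runtime_hours]
    return min(eligible, key=lambda t: t.total_cost, default=None)


# Illustrative numbers only.
trials = [
    TrialRun("small", runtime_hours=4.0, hourly_cost=5.0),
    TrialRun("medium", runtime_hours=1.5, hourly_cost=10.0),
    TrialRun("large", runtime_hours=1.0, hourly_cost=20.0),
]
best = pick_cluster(trials, max_runtime_hours=2.0)
print(best.cluster, best.total_cost)
```

The design point is that the cheapest cluster per hour is not necessarily the cheapest per job: here the hypothetical "medium" cluster costs more per hour than "small" but finishes sooner, so its total cost per run is lower while still meeting the run-time target.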

 

The results of a seamless Databricks migration

Migrating to the Databricks Lakehouse provides many benefits to the enterprise, including an improved data processing engine, reduced costs, improved security, and enhanced data sharing and collaboration capabilities. Our team completed this Databricks migration successfully and ensured all the best practices were followed. With this migration, the customer achieved the following:

 

Performance benefits
  • Achieved an overall 60% improvement in data pipeline execution time
  • Reduced the run time of some transformations from four hours on AWS EMR to only 90 minutes on Databricks
  • Gained the ability to run some transformations over five years’ worth of data that did not run successfully on AWS EMR
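
As a quick check on the single transformation example above, the short Python calculation below converts the two reported run times (four hours and 90 minutes) into a percentage reduction. It covers only that one example; the overall 60% figure is a separate aggregate that this sketch does not attempt to reproduce.

```python
def percent_reduction(before_minutes, after_minutes):
    """Percentage by which run time dropped relative to the original run time."""
    return (before_minutes - after_minutes) / before_minutes * 100


# Four hours on AWS EMR versus 90 minutes on Databricks, as reported above.
reduction = percent_reduction(before_minutes=4 * 60, after_minutes=90)
speedup = (4 * 60) / 90
print(f"{reduction:.1f}% less run time, about {speedup:.2f}x faster")
```

For these two values the reduction works out to 62.5%, which is in line with the overall improvement reported for the pipelines.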

 

Cost benefits
  • Achieved approximately a 20% overall cost reduction

 

Maintenance benefits
  • Reduced cluster spin-up time from 10-15 minutes on AWS EMR to 5-7 minutes on Databricks
  • Obtained the ability to automatically update to the latest versions of components like JAR files
  • Gained access to built-in support for data warehouses and tools like notebooks, clusters, and models

 

Databricks intrinsic features
  • Leveraged features such as autoscaling, repair run, and Photon-enabled clusters

 

Interested in Databricks? We can help you evaluate what’s best for your business and gain the most value from your investment. Contact our team today to learn more and get started.

 

Authored by

Bob Fairchild, Director of Data Engineering

Kannathasan Kesavamani, Senior Data Engineer

Saravanan Marimuthu, Senior Data Engineer

Sivaraj Shanmugam, Lead Data Engineer