AWS Glue or IBM DataStage: Benefits and best practices for migrating to serverless ETL

Author: Ranjith Ramachandran


 

Now that much of the world’s data resides in cloud or hybrid data warehouses and data lakes, many organizations are turning their focus to extract, transform, and load (ETL) modernization.

 

That is, they are shifting from on-premises ETL solutions to serverless ETL solutions that offer lower cost and maintenance requirements, increased scalability, and easier integration with other cloud services and applications.

 

To illustrate this point, the market for cloud computing data integration services is expected to more than double in the coming years, growing from $445.3 billion in 2021 to $947.3 billion by 2026.

 

In this blog, we’ll explore migrating from the on-premises IBM DataStage solution to AWS Glue, a serverless ETL solution. We’ll cover the benefits of serverless data integration and then outline key steps for a successful migration to AWS Glue.

 

About IBM DataStage

IBM DataStage is an ETL solution that helps businesses design, develop, and run jobs that collect and deliver data. It was originally designed as an on-premises solution, which requires license-holders to install the software locally, handle software upgrades internally, and manage/maintain the infrastructure that runs it.

 

Though DataStage is available on the cloud via IBM Cloud Pak for Data, many customers are still using the DataStage on-prem solution.

 

About AWS Glue

AWS Glue is a serverless, cloud-native data integration service that helps customers discover, prepare, and combine data for analytics, machine learning, and application development.

 

It is a fully managed data integration service that allows companies to categorize their data, clean it, enrich it, and move it reliably between various data stores and data streams without the responsibility for infrastructure management and maintenance.

 

For organizations looking to increase the performance and throughput of their data integration solutions, AWS Glue can be a good fit. It runs on a built-in Apache Spark engine with its own compute and processing capacity, supports both batch and streaming data, and handles concurrent jobs well.

 

Additionally, it integrates natively with the AWS ecosystem, which includes technology and tools for IoT, streaming, analytics, machine learning, natural language processing, and much more. It also integrates with third-party technologies through non-native connectors.

 

Benefits of migrating to serverless ETL

Legacy ETL solutions are being put to the test as data volumes continue to grow and more users wish to engage with the data.

 

With much of our data now residing in a big data landscape, older data integration solutions often struggle to keep up with modern use cases that demand concurrent workloads, large data loads, real-time and event-based jobs, and streaming data. Or they may struggle with poorly designed queries that affect performance. In either case, users become frustrated by slow (or missing) data, and the data loses its value.

 

Moving to a cloud-based, serverless data integration solution offers several benefits that help overcome these challenges:

  • Scalability
    On-prem ETL tools are complex to install, manage, and scale. As organizations increasingly run real-time and/or concurrent ETL jobs, existing servers can easily become overwhelmed, resulting in delayed or broken jobs. Scalability is a key success metric when it comes to distributed data processing. AWS Glue achieves its scalability by using containers and clusters that scale automatically to meet today’s dynamic workloads.
  • Cost
    Maintenance and licensing costs of on-prem solutions can quickly add up. Additionally, advanced features, like a centralized data catalog and streaming data, often must be licensed separately. Connectors are also expensive and lack usage-based pricing. When it comes to cost, AWS Glue has an all-in-one pricing model that includes infrastructure. With no licensing or maintenance fees, you only pay for what you consume, and you’re not locked into a fixed time commitment. This leads to a more predictable and ultimately lower cost because there are fewer maintenance, management, and execution costs. Third-party connectors are not included in Glue pricing and would be paid for separately.
  • Managed service
    On-prem ETL tools require your organization to manage and maintain not only the software but also the infrastructure that runs it. This means more cost and resources. AWS Glue is offered as a fully managed service, meaning AWS manages the environment so you don’t have to maintain any software or infrastructure. AWS boasts 99.95% uptime.
  • Performance
    On-prem ETL tools are limited by their underlying infrastructure and may therefore struggle with performance as data grows or complexity increases. AWS is constantly launching new features and making improvements. AWS Glue 3.0, the latest version at the time of writing, is up to 4X faster than Glue 2.0.
  • Skillset
    On-prem ETL tools require advanced skills to build ETL jobs and pipelines. AWS Glue uses Scala and Python with Spark engines and serves different personas, such as data engineers, data scientists, data stewards, and data-savvy team members. It offers no-code/low-code options (such as Glue Studio, DataBrew, and built-in ML transforms) along with advanced coding options that rely on widely available skillsets like Python.

 

Checklist: Best practices for migrating from IBM DataStage to AWS Glue

If your organization is planning a migration from IBM DataStage to AWS Glue, there are several steps to keep in mind. Below are some helpful processes to consider when making this important transition.

 


When it comes to development and testing environments:

  • Set up the AWS Glue libraries (v1.0) available through public Amazon S3 buckets
  • Consider packaging the libraries as Docker containers for portability
  • Set up an IDE or notebook environment such as PyCharm or Jupyter
  • Explore using Glue development endpoints and Glue Studio as an interactive alternative
  • Build unit test suites leveraging the local libraries:
    • Mock data or use sample files
    • PyTest or ScalaTest
    • Modularize the code for streamlined testing
    • Integrate with your source code repository locally
  • Leverage the full open-source Spark APIs
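
As a minimal sketch of the "modularize and unit test" items above, the example below keeps the transformation logic in a plain function and exercises it with mock data in a PyTest-style test. It is written in pure Python so it runs without Spark or the Glue libraries installed; in a real Glue job the same rule would typically be applied to a Spark DataFrame. The function name clean_customers and the column names are hypothetical and do not come from any DataStage or Glue API.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def clean_customers(records: List[Record]) -> List[Record]:
    """Drop rows without a customer_id and normalize the email column.

    Hypothetical business rule, used only to illustrate a testable unit.
    """
    cleaned = []
    for row in records:
        if not row.get("customer_id"):
            continue  # skip rows that cannot be keyed
        email = (row.get("email") or "").strip().lower()
        cleaned.append({"customer_id": row["customer_id"], "email": email or None})
    return cleaned


def test_clean_customers_drops_missing_ids_and_normalizes_email():
    # Mock data stands in for a sample file from the source system.
    sample = [
        {"customer_id": "1", "email": "  Ana@Example.COM "},
        {"customer_id": None, "email": "ghost@example.com"},
        {"customer_id": "2", "email": None},
    ]
    assert clean_customers(sample) == [
        {"customer_id": "1", "email": "ana@example.com"},
        {"customer_id": "2", "email": None},
    ]
```

Keeping the rule separate from the job's read and write steps means the test can run on a laptop or in a CI pipeline before anything is deployed to AWS.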

 

When you reach the deployment environment:

  • Confirm the network communication paths are available for your resources
    • Subnet, firewall, and DNS configurations
    • Available IP addresses for higher-DPU jobs
  • Ensure AWS Glue has the right access
    • Check ACLs and egress security groups
    • VPC Flow Logs can be used to troubleshoot connectivity issues
  • Explore accessing S3 via a VPC endpoint
  • Consider using AWS Glue with VPC endpoints
  • Start with the standard worker type
    • Use the G.1X or G.2X worker types for memory-intensive jobs
  • Set up the Spark UI for better details about job metrics and performance (Spark jobs)
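
The "available IP addresses" item above can be checked with a little arithmetic before a job is scaled up. The sketch below uses only the Python standard library and rests on two assumptions: that AWS reserves five addresses in every subnet, and that each Glue worker needs one IP address in the subnet used by the job's connection. The function names and the ips_in_use parameter are hypothetical.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_IPS_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Return how many addresses in the subnet are assignable to resources."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_IPS_PER_SUBNET, 0)


def subnet_can_host(cidr: str, workers: int, ips_in_use: int = 0) -> bool:
    """Check whether the subnet has room for the job.

    Assumption: one IP address per Glue worker. ips_in_use is the number of
    addresses already taken by other resources in the same subnet.
    """
    return usable_ips(cidr) - ips_in_use >= workers


if __name__ == "__main__":
    print(usable_ips("10.0.1.0/24"))           # 256 addresses minus 5 reserved
    print(subnet_can_host("10.0.2.0/28", 20))  # a /28 is too small for 20 workers
```

If the check fails, the options are a larger subnet, fewer workers, or moving other resources out of the subnet.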

 

Lastly, when it comes to the deployment process, consider these steps:

  • Confirm the network communication paths are available for your resources
  • Maintain the Glue crawler/job definition in your source code repo
    • JSON files or CloudFormation templates
  • Depending on your Git lifecycle practices, build/create the updated codebase
    • Example: Create a Python library, JAR, configuration files, modified job definition, etc.
    • Execute the test cases on local sample data during this step
  • Deploy the artifacts to a staging environment on AWS
    • Create/update the Glue crawler/jobs
    • Move the generated libraries and scripts to S3
    • Run manual/automated integration tests, data validation
    • Approve production deployment
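
To make the "job definition in your source code repo" step more concrete, here is a small sketch of a JSON job definition and the code that validates it before deployment. The key names follow the parameters of the boto3 Glue client's create_job call as we understand them; the job name, role, bucket, and paths are hypothetical placeholders. The sketch deliberately makes no AWS calls, so it can run as part of a local build.

```python
import json

REQUIRED_KEYS = {"Name", "Role", "Command"}

# Hypothetical job definition as it might be stored in the repository.
JOB_DEFINITION = """
{
  "Name": "orders_daily_load",
  "Role": "GlueJobRole",
  "Command": {
    "Name": "glueetl",
    "ScriptLocation": "s3://example-artifacts/scripts/orders_daily_load.py",
    "PythonVersion": "3"
  },
  "GlueVersion": "3.0",
  "WorkerType": "G.1X",
  "NumberOfWorkers": 10,
  "DefaultArguments": {
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": "s3://example-artifacts/spark-logs/"
  }
}
"""


def load_job_definition(text: str) -> dict:
    """Parse the JSON definition and fail fast if a required key is missing."""
    definition = json.loads(text)
    missing = REQUIRED_KEYS - definition.keys()
    if missing:
        raise ValueError(f"job definition is missing: {sorted(missing)}")
    return definition


def to_update_request(definition: dict) -> dict:
    """Shape the definition for an update call, where the job name is passed
    separately from the rest of the job settings."""
    body = {key: value for key, value in definition.items() if key != "Name"}
    return {"JobName": definition["Name"], "JobUpdate": body}


if __name__ == "__main__":
    job = load_job_definition(JOB_DEFINITION)
    print(job["Name"], job["WorkerType"], job["NumberOfWorkers"])
    # In a deployment script, and assuming boto3 is installed and configured:
    #   glue = boto3.client("glue")
    #   glue.create_job(**job)                     # first deployment
    #   glue.update_job(**to_update_request(job))  # later deployments
```

Because the definition is plain JSON, changes to worker type, Glue version, or Spark UI settings show up in code review alongside the script changes they accompany.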

 

Leverage automation tools to reduce manual effort

ETL code conversion traditionally has been a complex and time-consuming manual effort. Today, you can leverage ETL code converters to reduce the level of effort and improve the quality of converted code. For example, Wavicle’s code converters can successfully convert 90% or more of the DataStage mappings. As a result, the overall effort of conversion is greatly reduced.

 

Final thoughts

DataStage and AWS Glue are both rated among the best ETL solutions available and can both be good options for your organization, depending on your data management environment. If you are integrating data into your cloud environment with requirements for real-time data integration and concurrent workloads, you might consider the advantages of AWS Glue.

 

ETL modernization is an important but challenging process for many organizations. To help you identify and deploy the best tools for your organization’s use cases, Wavicle offers expert consulting services to help you assess, mobilize, migrate, and modernize your AWS environment.

 

Learn more about Wavicle’s Glue Converter solution and how to migrate your data in 80-90% less time.