Wavicle Insights, Opinions, Commentary, and More.
How to get faster, more reliable analytics from data lakes
Date: Wednesday December 4, 2019
Modern data architecture with Delta Lake and Talend
Unmanaged data lakes lead to slow analytics
If your data lakes have grown to the extent that analytics performance is suffering, you’re not alone. The very nature of data lakes, which allows organizations to quickly ingest raw, granular data for deep analytics and exploration, can actually stand in the way of fast, accurate insights.
Data lakes remain a go-to repository for storing all types of structured and unstructured, historical and transactional data. But with the high volume of data that is added every day, it is very easy to lose control of indexing and cataloging the contents of the data lake.
The data becomes unreliable, inconsistent, and generally hard to find. This has several effects on the business, ranging from poor decisions based on delayed or incomplete data to an inability to meet compliance mandates.
Databricks has designed a new solution to restore reliability to data lakes: Delta Lake. Based on a webinar Wavicle delivered with Databricks and Talend, this article will explore the challenges that data lakes present to organizations and explain how Delta Lake can help.
Core challenges with data lakes
Data lakes were designed to solve the problem of data silos, providing access to a single repository of any type of data from any source. Yet most organizations find it impossible to keep up with the extreme growth of data lakes.
The term “data swamp” has been used to describe data lakes that have no curation or data lifecycle management and little to no contextual metadata or data governance. Because of the way it is stored, the data becomes hard to use or outright unusable.
Users may start to complain that analytics are slow, that data is unreliable or inconsistent, or that it simply takes too long to find the data they’re looking for. These performance and reliability problems have a variety of causes, including:
- Too many small files (or a few very large ones) mean more time is spent opening and closing files than reading their content (this is even worse with streaming data)
- Partitioning or indexing breaks down when data has many dimensions and/or high-cardinality columns
- Storage systems and processing engines struggle to handle a large number of subdirectories and files
- Failed production jobs corrupt the data, requiring tedious recovery
- Lack of consistency makes it hard to mix appends, deletes, and upserts and get consistent reads
- Lack of schema enforcement creates inconsistent, low-quality data
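To make the last point concrete, here is a minimal sketch of what enforcing a schema at write time means. It is plain Python, not Delta Lake code: `EXPECTED_SCHEMA`, `schema_violations`, and `write_batch` are hypothetical names invented for this illustration, and the columns are assumed to describe a click-event table.

```python
from datetime import date

# Hypothetical expected schema for a click-event table: column name -> Python type.
EXPECTED_SCHEMA = {"date": date, "eventId": str, "eventType": str, "data": str}


def schema_violations(record, schema=EXPECTED_SCHEMA):
    """Return a sorted list of problems; an empty list means the record conforms."""
    problems = []
    for column in schema.keys() - record.keys():
        problems.append(f"missing column: {column}")
    for column in record.keys() - schema.keys():
        problems.append(f"unexpected column: {column}")
    for column in schema.keys() & record.keys():
        if not isinstance(record[column], schema[column]):
            problems.append(f"wrong type for {column}: {type(record[column]).__name__}")
    return sorted(problems)


def write_batch(table, batch):
    """Append the batch only if every record conforms; otherwise reject the whole batch."""
    for record in batch:
        problems = schema_violations(record)
        if problems:
            raise ValueError(f"batch rejected: {problems}")
    table.extend(batch)


table = []
good = {"date": date(2019, 12, 4), "eventId": "e1", "eventType": "click", "data": "home"}
bad = {"date": "2019-12-04", "eventId": 42, "eventType": "click"}  # wrong types, no "data"

write_batch(table, [good])
try:
    write_batch(table, [good, bad])  # one bad record rejects the entire batch
except ValueError as err:
    print(err)
print(len(table))  # 1: only the first, valid batch was stored
```

Without a check like this, the malformed record would land in the lake silently and surface later as a failed or misleading query.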
Delta Lake to the rescue
Databricks has created Delta Lake, which solves many of these challenges and restores reliability to data lakes with minimal changes to data architecture. Databricks defines Delta Lake as an open source “storage layer that sits on top of data lakes to ensure reliable data sources for machine learning and other data science-driven pursuits.” Several features of Delta Lake enable users to query large volumes of data for accurate, reliable analytics. These include ACID compliance, time travel (data versioning), unified batch and streaming processing, scalable storage, metadata, and schema check and validation. Addressing the major challenges of data lakes, Delta Lake delivers:
- Reliability: Failed write jobs do not update the commit log, so readers of a Delta table never see the partial or corrupt files that a failed job leaves behind.
- Consistency: Changes to Delta tables are stored as ordered, atomic commits, and each commit is a set of actions recorded as a file in the log directory. Readers use the log to obtain an atomic, consistent snapshot on every read. In practice, most writes don’t conflict, and isolation levels are tunable.
- Performance: Small files are compacted using OPTIMIZE, and data can be further organized with multi-dimensional clustering on multiple columns (ZORDER BY).
- Reduced system complexity: Delta Lake handles both batch and streaming data (via a direct integration with Structured Streaming for low-latency updates), including the ability to write batch and streaming data concurrently to the same table.
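The reliability and consistency points above can be pictured with a toy model. The sketch below is plain Python, not Delta Lake's actual log format: `ToyTable` and its methods are hypothetical names, and the "files" live in a dictionary. It only illustrates the idea that a data file becomes visible when a commit referencing it is appended to an ordered log.

```python
class ToyTable:
    """In-memory stand-in for a table directory plus its ordered commit log."""

    def __init__(self):
        self.files = {}  # file name -> rows; a file may exist without being committed
        self.log = []    # commit N is self.log[N]: a list of ("add" | "remove", name)

    def write(self, name, rows, fail_before_commit=False):
        self.files[name] = rows           # step 1: write the data file
        if fail_before_commit:
            raise RuntimeError("job failed before commit")
        self.log.append([("add", name)])  # step 2: a single log append publishes it
        return len(self.log) - 1          # version number of the new commit

    def compact(self, names, new_name):
        """Rewrite several small files as one and swap them in with a single commit."""
        self.files[new_name] = [row for n in names for row in self.files[n]]
        self.log.append([("remove", n) for n in names] + [("add", new_name)])
        return len(self.log) - 1

    def snapshot(self, version=None):
        """Replay the log up to `version` (default: latest) to list the visible files."""
        last = len(self.log) - 1 if version is None else version
        visible = []
        for actions in self.log[: last + 1]:
            for action, name in actions:
                if action == "add":
                    visible.append(name)
                else:
                    visible.remove(name)
        return visible

    def read(self, version=None):
        return [row for name in self.snapshot(version) for row in self.files[name]]


table = ToyTable()
v0 = table.write("part-0", ["a", "b"])
v1 = table.write("part-1", ["c"])
try:
    table.write("part-2", ["corrupt"], fail_before_commit=True)
except RuntimeError:
    pass  # part-2 exists in storage but was never committed
v2 = table.compact(["part-0", "part-1"], "part-3")

print(table.read())    # ['a', 'b', 'c'] (the uncommitted file is invisible)
print(table.read(v0))  # ['a', 'b'] (the table as of the first commit)
```

Reading an older version by replaying only part of the log is the same idea that underlies time travel, and the compaction step shows how files can be swapped without readers ever seeing a half-finished state.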
Architecting a modern Delta Lake platform
Below is a sample architecture of a Delta Lake platform. In this example, we’ve shown the data lake on the Microsoft Azure cloud platform, using Azure Blob storage and an analytics layer consisting of Azure Data Lake Analytics and HDInsight. An alternative would be to use Azure Blob storage with no compute attached to it.
Alternatively, in an Amazon Web Services environment, the data lake can be built on Amazon S3, with all other analytical services sitting on top of S3.
In this example, Talend provides data integration. It provides a rich base of built-in connectors as well as MQTT and AMQP to connect to real-time streams, allowing for easy ingestion of real-time, batch, and API data into the data lake environment. Talend has voiced its support of Delta Lake, committing to “natively integrate data from any source to and from Delta Lake.”
Following the architecture are instructions for converting a data lake to Delta Lake using Talend for data integration.
Creating or converting a data lake project to Delta Lake through Talend
Below are instructions that highlight how to use Delta Lake through Talend.
Configuration: Set up the Big Data Batch job with the Spark configuration under the Run tab. Set the distribution to Databricks and select the corresponding version. Under the Databricks section, update the Databricks endpoint (it could be Azure or AWS), cluster ID, and authentication token.
Sample flow: In this sample job, click events are collected from a mobile app, joined against the customer profile, and loaded as a Parquet file into DBFS. This DBFS file is used in the next step to create the Delta table.
Create Delta table: Creating a Delta table requires the USING DELTA clause in the DDL. In this case, because the file is already in DBFS, LOCATION is specified so the table knows where to fetch its data.
Convert to Delta table: If the source files are in Parquet format, we can use the SQL CONVERT TO DELTA statement to convert the files in place and create an unmanaged table:
SQL: CONVERT TO DELTA parquet.`/delta_sample/clicks`
Partition data: Delta Lake supports table partitioning. Partitioning the data speeds up queries whose predicates involve the partition columns.
CREATE TABLE clicks ( date DATE, eventId STRING, eventType STRING, data STRING) USING delta PARTITIONED BY (date)
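Why partitioning helps can be sketched in a few lines of plain Python. This is an illustration only, not how Delta Lake is implemented: `partition_by_date` and `read_for_date` are hypothetical helpers, and the `date=YYYY-MM-DD` keys are an assumed directory-style layout. The point is that a query filtering on the partition column only has to open the matching partition.

```python
from collections import defaultdict
from datetime import date


def partition_by_date(rows):
    """Group rows into one bucket per date, mimicking a date=YYYY-MM-DD directory layout."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"date={row['date'].isoformat()}"].append(row)
    return dict(partitions)


def read_for_date(partitions, wanted):
    """Return the matching rows and the number of partitions that had to be opened."""
    key = f"date={wanted.isoformat()}"
    opened = [key] if key in partitions else []
    rows = [row for k in opened for row in partitions[k]]
    return rows, len(opened)


rows = [
    {"date": date(2019, 12, 1), "eventId": "e1", "eventType": "click", "data": "home"},
    {"date": date(2019, 12, 1), "eventId": "e2", "eventType": "click", "data": "cart"},
    {"date": date(2019, 12, 2), "eventId": "e3", "eventType": "view", "data": "home"},
]
partitions = partition_by_date(rows)
matches, opened = read_for_date(partitions, date(2019, 12, 2))

print(sorted(partitions))    # ['date=2019-12-01', 'date=2019-12-02']
print(len(matches), opened)  # 1 1 (one matching row, one partition opened)
```

A filter on a non-partition column such as eventType would not benefit in the same way, because matching rows could sit in any partition.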
Batch upserts: To merge a set of updates and inserts into an existing table, we can use the MERGE INTO statement. For example, the following statement takes a stream of updates and merges it into the clicks table. If a click event is already present with the same eventId, Delta Lake updates the data column using the given expression. When there is no matching event, Delta Lake adds a new row.
MERGE INTO clicks USING updates ON clicks.eventId = updates.eventId WHEN MATCHED THEN UPDATE SET clicks.data = updates.data WHEN NOT MATCHED THEN INSERT (date, eventId, data) VALUES (date, eventId, data)
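The semantics of this upsert can be sketched in plain Python. The function below is a hypothetical illustration, not Delta Lake's implementation: `merge_into` is an invented name, rows are simple dictionaries, and columns not listed in the INSERT clause (here eventType) are assumed to be left empty. Duplicate keys inside `updates` are simply applied in order, which is a simplification.

```python
def merge_into(target, updates, key="eventId"):
    """Upsert `updates` into `target` in place, matching rows on `key`."""
    index = {row[key]: row for row in target}
    for update in updates:
        existing = index.get(update[key])
        if existing is not None:
            # WHEN MATCHED THEN UPDATE SET data = updates.data
            existing["data"] = update["data"]
        else:
            # WHEN NOT MATCHED THEN INSERT (date, eventId, data)
            new_row = {
                "date": update["date"],
                key: update[key],
                "eventType": None,  # not listed in the INSERT, so left empty
                "data": update["data"],
            }
            target.append(new_row)
            index[update[key]] = new_row
    return target


clicks = [{"date": "2019-12-01", "eventId": "e1", "eventType": "click", "data": "old"}]
updates = [
    {"date": "2019-12-02", "eventId": "e1", "data": "new"},    # matches e1: update
    {"date": "2019-12-02", "eventId": "e2", "data": "first"},  # no match: insert
]
merge_into(clicks, updates)

print([(row["eventId"], row["data"]) for row in clicks])
# [('e1', 'new'), ('e2', 'first')]
```

Note that the matched row keeps its original date and eventType; only the data column is overwritten, as in the SQL statement.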
Read Table: All Delta tables can be accessed either by choosing the file location or using the delta table name.
SQL: Either SELECT * FROM delta.`/delta_sample/clicks` or SELECT * FROM clicks
Talend for data egress, analytics, and machine learning at a high level:
- Data egress: Using Talend API services, create APIs faster by eliminating the need to use multiple tools or to code manually. Talend covers the complete API development lifecycle, from design, testing, documentation, and implementation through to deployment, using simple, visual tools and wizards.
- Machine learning: With the Talend toolset, machine learning components are ready to use off the shelf. This ready-made ML software allows data practitioners, no matter their level of experience, to work easily with algorithms without needing to know how an algorithm works or how it was constructed. At the same time, experts can fine-tune those algorithms as desired. Talend machine learning components include tALSModel, tRecommend, tClassifySVM, tClassify, tDecisionTreeModel, tPredict, tGradientBoostedTreeModel, tSVMModel, tLogisticRegressionModel, tNaiveBayesModel, and tRandomForestModel.