Accelerated Data Validation With Wavicle’s Data Capture Tool
Managing data is difficult and often involves complex processes. Data quality and ETL testing are critical to validate, verify, and qualify data while preventing duplicate records and data loss. This testing ensures that data moves from heterogeneous source environments into a central data repository while adhering to strict quality protocols and compliance regulations.
However, data testing often relies on manual validation and auditing. Without proper quality checks, this manual work increases the risk of defects leaking into the production environment. Ultimately, it is difficult for even the most talented and certified testers to create a testing strategy that accurately covers all possible data combinations and potential failures.
Project background
One of North America’s largest collision repair center operators approached Wavicle for help building testing strategies and solutions within strict deadlines. They needed data validated at two stages: first, when it moved from its source into a data lake, and second, when it moved from the data lake into a data warehouse. The testing strategy included validating data in various file types and comparing it across the staging and warehouse layers.
This project included several challenges:
- An accelerated timeline of two months
- Large data volumes with complex logic
- Comparison of data across multiple file types, including XML, JSON, CSV, and Excel
- Database-to-database validation for a pipeline in which files arrived from the business, were loaded into a data lake (Athena), moved into a data warehouse (Redshift) based on business logic, and were ultimately surfaced on dashboards (a simplified sketch of this kind of validation follows the list)
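As a rough illustration of what this kind of database-to-database validation can involve (not Wavicle’s actual tool), the sketch below compares row counts for one table between the Athena data lake layer and its Redshift counterpart. It assumes the pyathena and redshift_connector Python packages, and every connection setting and table name is a hypothetical placeholder.

```python
# Minimal sketch of database-to-database validation between an Athena data
# lake table and its Redshift warehouse counterpart. All connection settings
# and table names below are hypothetical placeholders.
from pyathena import connect as athena_connect
import redshift_connector

ATHENA_KWARGS = {
    "s3_staging_dir": "s3://example-athena-results/",  # assumed results bucket
    "region_name": "us-east-1",
}
REDSHIFT_KWARGS = {
    "host": "example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    "database": "analytics",
    "user": "validator",
    "password": "example-password",
}


def athena_count(table: str) -> int:
    """Count rows in a data lake table through Athena."""
    cursor = athena_connect(**ATHENA_KWARGS).cursor()
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]


def redshift_count(table: str) -> int:
    """Count rows in the corresponding warehouse table through Redshift."""
    cursor = redshift_connector.connect(**REDSHIFT_KWARGS).cursor()
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]


if __name__ == "__main__":
    lake = athena_count("datalake.sales_orders")
    warehouse = redshift_count("dw.sales_orders")
    status = "PASS" if lake == warehouse else "FAIL"
    print(f"sales_orders row counts: lake={lake}, warehouse={warehouse} -> {status}")
```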
Wavicle’s approach
The Wavicle team proposed a data capture tool that compares various data sources and identifies variances within the same data group. Using multiple connectors to relational and non-relational databases, cloud databases, and filesystems, the tool compares the test data and generates smart reports of the results. The reports are then stored in, and accessed from, user-defined Amazon S3 storage.
This approach ensured the quality of the test deliverables and kept the focus on testing itself. It gave testers a hassle-free way to validate high volumes of data, including datasets of more than one million records. Because the reports narrowed results down to the exact differences between the source and target data, testers could easily compare outputs and identify bugs.
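A minimal sketch of that comparison-and-report pattern, assuming two CSV extracts, a hypothetical order_id key column, and the pandas and boto3 packages, might look like the following; the bucket and file names are placeholders rather than the tool’s real configuration.

```python
# Minimal sketch of comparing a source extract against a target extract and
# publishing a mismatch report to S3. File names, bucket, and key column are
# hypothetical placeholders, not Wavicle's actual implementation.
import io

import boto3
import pandas as pd


def compare_extracts(source_csv: str, target_csv: str, key: str) -> pd.DataFrame:
    """Return rows that appear on only one side of the comparison."""
    source = pd.read_csv(source_csv)
    target = pd.read_csv(target_csv)
    merged = source.merge(target, on=key, how="outer",
                          suffixes=("_source", "_target"), indicator=True)
    # Keep rows missing on either side; a fuller tool would also diff values.
    return merged[merged["_merge"] != "both"]


def publish_report(report: pd.DataFrame, bucket: str, s3_key: str) -> None:
    """Write the mismatch report to user-defined S3 storage."""
    buffer = io.StringIO()
    report.to_csv(buffer, index=False)
    boto3.client("s3").put_object(Bucket=bucket, Key=s3_key,
                                  Body=buffer.getvalue().encode("utf-8"))


if __name__ == "__main__":
    mismatches = compare_extracts("source_extract.csv", "target_extract.csv",
                                  key="order_id")
    publish_report(mismatches, bucket="example-test-results",
                   s3_key="reports/order_diff.csv")
    print(f"{len(mismatches)} mismatched rows written to S3")
```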
Wavicle’s team used their data capture accelerator to address a variety of challenges, including the following:
- Complexity: The tool organizes test plans into workspaces, reducing the complexity of managing test plans and retesting. Output-matching reports made it simpler to evaluate results and quickly understand the actual errors for reporting.
- Time: The tool significantly reduced the time needed to create test plans and the corresponding SQL, verify the results, and communicate errors, resulting in a 50% reduction in data validation effort on complex data platforms. Properly testing data flows can require hundreds of test steps for an average-size implementation, so the time and cost savings can be significant.
- Effort: Parameterizing the query for multiple batches in the same test plan reduced the effort that would be required to create multiple test plans.
- Validation: Manual file-to-file validation took approximately 2 to 3 hours; execution with Wavicle’s data capture accelerator cut that time by roughly 75%, to about 30 minutes.
- Defect identification: The tool rapidly identified bugs using a smart report.
In addition, Wavicle’s data capture tool simplified defect retesting: once a defect was fixed, no new test plans or additional setup were needed. Editing an existing test plan in response to minor changes was faster and more straightforward than creating a new one. Test plans could also parameterize their queries, enabling the team to run the same SQL with different values for the test conditions, as sketched below.
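Here is a minimal sketch of that kind of query parameterization, assuming the PyAthena client; the table, column, batch dates, and S3 staging bucket are illustrative placeholders, not the project’s actual test plans.

```python
# Minimal sketch of running one parameterized test-plan query for several
# batches instead of writing a separate test plan per batch. Table names,
# columns, batch dates, and the staging bucket are hypothetical placeholders.
from pyathena import connect

# One test-plan query with a placeholder condition; PyAthena's default
# "pyformat" parameter style binds %(batch_date)s at execution time.
TEST_PLAN_SQL = """
    SELECT COUNT(*) AS record_count
    FROM datalake.orders
    WHERE batch_date = %(batch_date)s
"""

BATCHES = ["2021-01-01", "2021-01-02", "2021-01-03"]

cursor = connect(s3_staging_dir="s3://example-athena-results/",  # assumed bucket
                 region_name="us-east-1").cursor()

for batch in BATCHES:
    cursor.execute(TEST_PLAN_SQL, {"batch_date": batch})
    count = cursor.fetchone()[0]
    print(f"batch {batch}: {count} records")
```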
During the project, Wavicle’s team had to adapt to overcome unforeseen challenges. For example, a server went down while running high-volume test plans; the team deployed a fix that brought the server back up automatically after 15 minutes of downtime and improved performance through code changes. Handling comparisons of XML and hierarchical JSON against CSV files or databases was also challenging. Using JSON conversion utilities, the team converted the XML files to JSON, loaded them into MongoDB, and queried MongoDB to flatten the data, which enabled comparison against flat sources such as CSV files or database tables; a simplified sketch of this flow follows.
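One way that flow could be sketched in Python, assuming the xmltodict, pymongo, and pandas packages; the file names, the orders/order XML structure, the MongoDB database and collection, and the order_id key are hypothetical placeholders rather than the project’s actual artifacts.

```python
# Minimal sketch of the XML -> JSON -> MongoDB -> flatten -> compare flow.
# File names, database/collection names, and fields are hypothetical.
import pandas as pd
import xmltodict
from pymongo import MongoClient

# 1. Convert the hierarchical XML document into a JSON-like dict.
with open("repair_orders.xml", "rb") as f:
    doc = xmltodict.parse(f)

# 2. Load the converted documents into MongoDB.
collection = MongoClient("mongodb://localhost:27017")["validation"]["repair_orders"]
collection.delete_many({})              # start from a clean collection
records = doc["orders"]["order"]        # assumed structure of the XML
if isinstance(records, dict):           # a single <order> parses as a dict
    records = [records]
collection.insert_many(records)

# 3. Query MongoDB and flatten the nested documents into a tabular frame.
flattened = pd.json_normalize(list(collection.find({}, {"_id": 0})), sep="_")
flattened = flattened.astype(str)

# 4. Compare the flattened data against the flat CSV on a key column.
csv_data = pd.read_csv("repair_orders.csv", dtype=str)
diff = flattened.merge(csv_data, on="order_id", how="outer",
                       suffixes=("_xml", "_csv"), indicator=True)
print(diff[diff["_merge"] != "both"])
```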
Project results
Wavicle’s data capture tool reduced risk by making it possible to manage hundreds of testing tasks. The tool also made it easier to understand test outputs and identify key errors, allowing for faster and more reliable break/fix cycles.
Wavicle’s consultants completed the two-month project on time, tested vast volumes of data on the order of a few hundred million records, and identified all data mismatches. The customer recognized and appreciated this success: execution was three times faster than manual execution, and the testing quality was noteworthy.