Data scientists and data engineers are both critical roles for data-driven organizations. When they work well together, it can be magical. But too often, their relationships are fraught with tension and misunderstanding.
The tension arises because their roles and tasks are related, but the boundaries between them are imprecise and each side often lacks clarity about what the other does. Both groups use similar language for slightly different tasks, and those subtleties can cause conflict and confusion and even bring projects to a standstill.
In this article, we’ll look at three areas where data engineering and data science teams do things differently, which can lead to conflict if not properly managed or understood.
1. Data ingestion versus curation
When a data scientist asks a data engineer, “Can you get this data for me?” data engineers often hear, “Can you curate this data for me?”
Data curation is a very different and more time-consuming process than what the data scientist is requesting. This matters because a misunderstanding about the effort involved can push the request down the priority list and delay valuable projects.
Yes, the data scientist is asking for a new data source to be loaded, but there are implied or unspoken caveats about how the data will be used.
Ideally, the data science group simply wants data loaded into their secure artificial intelligence (AI) lab, assurance that there is no malware present, and clarity about any legal requirements regarding access and usage of the data. Nothing else.
A snapshot of the data should be loaded into the AI lab. This is not a production data location. It is not available to the business for decision-making and has not yet been assessed for quality.
The data scientist believes the data might be useful for business decisions but has not yet made the final determination. The 80-20 rule applies here, and only 20% of the data will likely have value.
Data scientists will do their own data profiling and data quality assessment and determine suitability for business decision-making. This could serve as a sort of triage for data engineering regarding what data really should be curated.
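As a rough illustration of that triage, the Python sketch below profiles a small snapshot and lists the columns that might be worth proposing for curation. The sample rows, the column names, and the 50% missing-value threshold are all assumptions made for the example, not rules from any particular tool.

```python
def profile_snapshot(rows):
    """Return the missing rate and distinct-value count for each column."""
    columns = sorted({key for row in rows for key in row})
    profile = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing.
        present = [v for v in values if v not in (None, "")]
        profile[col] = {
            "missing_rate": 1 - len(present) / len(rows) if rows else 0.0,
            "distinct": len(set(present)),
        }
    return profile


def triage(profile, max_missing_rate=0.5):
    """Columns that are neither mostly empty nor constant (assumed criteria)."""
    return [
        col
        for col, stats in profile.items()
        if stats["missing_rate"] <= max_missing_rate and stats["distinct"] > 1
    ]


# Hypothetical snapshot loaded into the AI lab.
snapshot = [
    {"customer_id": "a1", "region": "east", "fax": None},
    {"customer_id": "a2", "region": "west", "fax": ""},
    {"customer_id": "a3", "region": "east", "fax": None},
    {"customer_id": "a4", "region": None, "fax": "555-0100"},
]

profile = profile_snapshot(snapshot)
candidates = triage(profile)
```

Here the mostly empty fax column drops out, leaving a shorter list for data engineering to consider curating.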
2. Data wrangling versus data engineering
Data wrangling is the process data scientists use to extract, transform and load a one-time snapshot of data into a one-time analysis data set. Often written in Python and sometimes built in another tool, wrangling processes differ from data engineering pipelines in several ways.
• Data wrangling code runs on a snapshot of data and only has to run correctly once (the last time it is run).
• Data wrangling code only needs to handle the edge cases that exist in the snapshot used for the one-time analysis data set.
• If wrangling code fails, it is a minor annoyance in the development process.
Conversely, data engineering pipelines:
• Must run against streaming data and must run correctly every time.
• Must handle edge cases that have not been seen before; otherwise, the job may fail.
• Can cause a major production problem when they fail.
Data wrangling is a process that differs from data engineering. It has a different purpose (build a one-time table on static data versus update/append a dynamic production table with incremental and/or streaming data). It should be treated differently. Ideally, data wranglers should work with data engineers so that their wrangling code can evolve into robust engineered data pipelines.
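The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the "id" and "amount" fields, and the idea of setting bad records aside in a quarantine list are all hypothetical, and a real pipeline would also need logging, alerting and schema management.

```python
def wrangle_once(rows):
    """Wrangling style: assumes every row in the snapshot is clean."""
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]


def pipeline_step(rows):
    """Pipeline style: set a bad record aside instead of failing the whole job."""
    loaded, quarantined = [], []
    for r in rows:
        try:
            loaded.append({"id": r["id"], "amount": float(r["amount"])})
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the offending row and the reason so it can be reviewed later.
            quarantined.append({"row": r, "reason": type(exc).__name__})
    return loaded, quarantined


# The snapshot the data scientist wrangled contains only clean rows.
snapshot = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3"}]

# Later production data contains edge cases the snapshot never showed.
incoming = [{"id": 3, "amount": "7.25"}, {"id": 4, "amount": "N/A"}, {"id": 5}]

analysis_rows = wrangle_once(snapshot)
loaded, quarantined = pipeline_step(incoming)
```

Calling wrangle_once on the incoming rows would raise an error at the first bad record, which is acceptable in the lab and a problem in production. The second function is one possible starting point for evolving wrangling code into a pipeline.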
3. AI modeling versus production scoring
The split between modeling code and production scoring engines is similar to that of wrangling and engineering code for the underlying data. Modeling code built by data scientists early in the process is code designed to run in an AI lab to identify the best (or better) analytical model(s). A scoring engine is a production-level program designed to score incoming data with the selected model(s).
AI code by the modeling group:
• Runs on a snapshot of data to determine the best model(s); since this often runs multiple models, it may have a significant run time.
• Is validated, often cross-validated, against only the snapshot of data and may fail when an edge case appears that it hasn’t seen before.
• Causes only a minor setback to a development project when it fails in the lab environment.
A production scoring engine:
• Must score incremental incoming data with the best model(s); run time is at a premium.
• Must anticipate edge cases it has never seen before.
• Can halt a production job and significantly impact the business when it fails in the production environment.
Model development and production scoring have different objectives and should be handled differently. There is enough similarity that strong collaboration between data science and data engineering will facilitate better results.
Encapsulating modeling code in a container and pushing it to production is not the viable option it is often presented to be. First, the modeling code must evolve into a scoring engine.
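To illustrate what that evolution might involve, here is a minimal Python sketch. The coefficients, feature names and status strings are hypothetical stand-ins, not a real model or a recommended design. It simply shows the scoring logic unchanged while the production wrapper checks for inputs the lab snapshot never contained.

```python
import math

# Hypothetical coefficients standing in for a model selected in the AI lab.
WEIGHTS = {"tenure_months": -0.04, "support_tickets": 0.30}
INTERCEPT = -0.5


def lab_score(record):
    """Lab style: fine on the snapshot, raises on unexpected input."""
    z = INTERCEPT + sum(w * record[name] for name, w in WEIGHTS.items())
    return 1 / (1 + math.exp(-z))


def production_score(record):
    """Scoring-engine style: check known failure modes and report a status."""
    features = {}
    for name in WEIGHTS:
        value = record.get(name)
        # Reject missing, non-numeric, or non-finite values before scoring.
        if (
            isinstance(value, bool)
            or not isinstance(value, (int, float))
            or not math.isfinite(value)
        ):
            return {"score": None, "status": f"rejected: bad {name}"}
        features[name] = value
    return {"score": lab_score(features), "status": "ok"}


good = production_score({"tenure_months": 12, "support_tickets": 2})
bad_type = production_score({"tenure_months": "twelve", "support_tickets": 2})
missing = production_score({"tenure_months": 12})
```

A rejected record returns a status the calling job can log or route elsewhere, so one unexpected input does not halt the whole production run. A real scoring engine would add much more, such as latency budgets, model versioning and monitoring.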
Conclusion
The activities that data scientists and data engineers perform, while similar, are far enough apart in their paradigms that careful thought and collaboration are critical. Understanding the subtle distinctions between their roles and tasks can eliminate roadblocks and accelerate value.
This article was originally published as Three Keys To A Harmonious Relationship Between Data Science And Data Engineering on June 3, 2022, on Forbes.