Data Quality

Deep dive into data quality issues: causes and solutions

In this blog post we will dive into the main causes of data quality issues and offer possible solutions. We will also see which types of data quality issues a universal semantic layer can solve, and how.

Data quality is crucial for any initiative to get value from data, whether through analytics and BI or through predictive models and AI. Building data products on top of poor data is like cooking a meal with rotten ingredients: no matter how great the cook is, the dish will not be pleasant to the palate.

Recently I had a conversation with a data lead at a medium-sized tech company. Unsurprisingly, she said that “trust” in data was her biggest challenge.

“It often happens that stakeholders ask me for a new analysis, and I estimate the time to be just a few hours. But as we dive a bit into the numbers, I find out that the numbers do not align and something does not make sense. So it turns out that from just a few hours of estimated work, we find ourselves working on “simple” tasks for days and sometimes even weeks.”

We all know that feeling.

And we also know that it can get worse: finding out about those misaligned numbers after we have already delivered the results to the business stakeholders.

Diving into such issues can spiral into the tedious work of finding out why numbers do not match. And since data moves through many stations, quality issues can surface at any station of the data supply chain, from data producers all the way downstream to the data consumption layer.

The data supply chain

Producers > Extract > Load > Transform > Semantics 

Producers

Producers are people, software code and machines that produce (generate) data. This is where raw data is born. These can be software developers firing events or writing to the app’s operational databases, sensors (hardware) sending data samples to a data lake, people manually adding data to CRMs, third-party tools that expose structured or unstructured data, and more.

Data Producers - Potential Issues

There may be different reasons for quality issues related to producers.

Data Producers - Solutions

In the production phase, the best-case scenario is to catch data quality issues before they ever reach and affect the data.

For software code, this can be achieved by applying CI/CD checks to software development workflows, catching issues in code before they are deployed to production and affect the data (e.g. upriver has a great product for exactly that).
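
To make this concrete, here is a toy illustration of what such a pre-deployment check could look like: a CI test that fails the build if application code emits an event that no longer matches the agreed schema. This is a sketch of the idea, not any specific product’s implementation; `EVENT_CONTRACT` and `emit_signup_event` are hypothetical names.

```python
# Agreed-upon schema for the "signup" event (hypothetical contract).
EVENT_CONTRACT = {
    "event_name": str,
    "user_id": int,
    "occurred_at": str,  # ISO-8601 timestamp
    "plan": str,
}

def emit_signup_event() -> dict:
    # Stand-in for the application code under test.
    return {"event_name": "signup", "user_id": 42,
            "occurred_at": "2024-01-01T00:00:00Z", "plan": "pro"}

def test_signup_event_matches_contract():
    event = emit_signup_event()
    # Fail the build if a field was dropped or added...
    assert set(event) == set(EVENT_CONTRACT), "unexpected or missing fields"
    # ...or if a field's type changed.
    for field, expected_type in EVENT_CONTRACT.items():
        assert isinstance(event[field], expected_type), f"bad type for {field}"
```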

As a second line of defense, in case bad data has already flowed into the system from the producers, we want to monitor our raw data for changes that might indicate quality issues. This is where data observability tools come in (e.g. Monte Carlo, BigEye and Elementary are well-known data observability products).
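
As a sketch of the kind of check these tools automate, consider a simple volume test: compare the latest day’s row count against a recent baseline and flag large deviations. This is a minimal illustration, not how any specific vendor implements it; the daily counts would come from a GROUP BY query over the raw table.

```python
def volume_anomaly(daily_counts: list[int], threshold: float = 0.5) -> bool:
    """daily_counts: row counts per day, oldest first, newest last."""
    *history, today = daily_counts
    baseline = sum(history) / len(history)
    # Flag the latest day if it deviates from the baseline by more than 50%.
    return abs(today - baseline) / baseline > threshold

# A sudden drop in daily volume, e.g. a producer silently stopped firing events:
assert volume_anomaly([10_120, 9_980, 10_050, 4_200])
```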

Extract and Load

Moving data from one place to another is a task that requires several steps, mainly extracting data from the source and loading it into the target system. For example, extracting from an operational database like MySQL and loading it into a data warehouse platform like Snowflake. There are many tools available for this step, some of them being Fivetran, Stitch, Rivery and AirByte.

Extract and Load - Potential Issues

Extract and Load - Solutions

First, make sure you connect your ETL tool to the data source with CDC (change data capture) enabled. If CDC is not possible for some reason, try loading data incrementally using the binlog. If that is not possible either, use an incremental key where it makes sense, or a full refresh for small amounts of data where none of the other methods work.

In case you’re using an incremental key, make sure to set the refresh window to be big enough to cover rows that might be updated retroactively at the source, to avoid gaps of missing data.
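
A minimal sketch of that idea, assuming the source rows carry an `updated_at` column and the loader upserts by primary key (both assumptions; your schema may differ):

```python
from datetime import datetime, timedelta

# Lookback window: wide enough to re-read rows that were updated
# retroactively at the source after we first loaded them.
LOOKBACK = timedelta(days=3)

def extraction_filter(last_loaded_at: datetime) -> str:
    """Build the WHERE clause for the next incremental run."""
    since = last_loaded_at - LOOKBACK
    return f"WHERE updated_at >= '{since.isoformat()}'"

# Because the loader merges (upserts) rows by primary key, re-reading
# the lookback window updates existing rows instead of duplicating them.
print(extraction_filter(datetime(2024, 5, 1)))
# -> WHERE updated_at >= '2024-04-28T00:00:00'
```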

Here too, data observability applied to our data lake (or data warehouse) can help us monitor and detect issues as they happen, though only after our data has already been affected. If you’re using dbt, other tools that can help with extract and load issues are dbt source freshness checks and some dbt tests and constraints.
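
The logic behind a source freshness check is simple: compare the newest loaded timestamp against an agreed threshold, which is roughly what dbt source freshness does under the hood. A minimal sketch, where the `_loaded_at` column name is an assumption:

```python
from datetime import datetime, timedelta, timezone

def is_stale(max_loaded_at: datetime, warn_after: timedelta) -> bool:
    """max_loaded_at: e.g. the result of SELECT MAX(_loaded_at) FROM raw.orders."""
    return datetime.now(timezone.utc) - max_loaded_at > warn_after

# Warn if no new rows have landed in the last 12 hours.
stale = is_stale(datetime(2024, 5, 1, tzinfo=timezone.utc),
                 warn_after=timedelta(hours=12))
```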

Transformations

Transforming data from its raw shape into a shape that is valuable to the business requires multiple steps, and this is probably the most complicated and error-prone part of the data supply chain.

Data transformations can be split into two main areas: 

  1. Dim and Fact tables - Normalized data model
  2. Applying business logic - Entities denormalization

Dim and Fact tables

Often referred to as a “data normalization” process, this part of data transformations is about creating independent Dim and Fact tables.

Dim and Fact tables represent “things” in a clear and clean way. Fact tables represent actions or events that happened, while dim tables represent entities and the things we know about them (usually static attributes like “customer name”, or references to many-to-one relationships).

This part of the transformations is more technical in nature and is mainly about cleaning data and changing its form (e.g. filters, unions, pivots); we should try to avoid applying business logic and semantic definitions here.
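
To illustrate, here is a minimal sketch in pandas of splitting raw order records into a fact table and a dim table. The column names are illustrative, not a universal schema:

```python
import pandas as pd

raw = pd.DataFrame({
    "order_id":      [1, 2, 3],
    "customer_id":   [10, 10, 11],
    "customer_name": ["Ada", "Ada", "Grace"],
    "amount":        [99.0, 15.5, 42.0],
    "ordered_at":    pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-02"]),
})

# Fact table: one row per event, keys and measures only.
fct_orders = raw[["order_id", "customer_id", "amount", "ordered_at"]]

# Dim table: one row per entity, with its static attributes.
dim_customers = raw[["customer_id", "customer_name"]].drop_duplicates("customer_id")
```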

Dim and Fact tables - potential Issues

Dim and Fact tables - Solutions

Applying business logic - Entities Denormalization

This is where the magic of turning data into information happens. 

After transforming raw data into clean dim and fact tables, we need to apply semantics and business-specific logic. This step is often referred to as “denormalization” or creating “wide tables”. This is where we define business entities, their relations (how they should be joined and how they relate to data assets) and the features each entity has. It turns our data into a meaningful, analysis-ready shape.
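
Reusing the `fct_orders` and `dim_customers` tables from the sketch in the previous section, denormalization might look like this; the “large order” rule is a made-up example of business logic:

```python
# Join facts to dims and apply business logic to get an analysis-ready wide table.
wide_orders = (
    fct_orders
    .merge(dim_customers, on="customer_id", how="left")
    .assign(is_large_order=lambda df: df["amount"] >= 50)  # hypothetical business rule
)
```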

Applying business logic is the second piece of the data transformations part, and its importance is usually overlooked. Business definitions are often scattered across different tools, hidden in SQL code and in BI tools, creating serious data trust and accessibility issues.

Applying business logic - potential Issues

Applying business logic - Solutions
