๐Ÿ” Power BI Retail ETL Part 4 | Investigating Data Quality Issues in Online Retail Data

๐Ÿš€ Power BI Retail ETL Project Part 4

๐Ÿ” Investigating Real-World Data Quality Issues in the Online Retail II Dataset

Welcome back to the Retail ETL Pipeline Series.

In Part 1, we imported the Online Retail II dataset into Power BI. In Part 2, we combined both yearly datasets into a unified staging layer. In Part 3, we performed data profiling to understand the overall health of the dataset.

Now comes one of the most important responsibilities of every Data Engineer:

๐Ÿ’ก Data Quality Assessment

Before building dashboards, KPIs, reports, or business analytics, we must understand the quality of the data itself.

Professional Data Engineers never assume data is correct. They investigate. They validate. They challenge assumptions.

That is exactly what we did in this phase of the project.


๐Ÿ“Š What We Discovered During Data Profiling

After combining both years of retail data, our staging layer contained:

  • ✅ More than 1 Million Transaction Records
  • ✅ Multiple Countries
  • ✅ Multiple Years of Business Activity
  • ✅ Real Customer Purchasing History
  • ✅ Real Data Quality Challenges

And this is where things became interesting.


๐Ÿ‘ค Missing Customer IDs (23%)

The first major issue identified was missing customer information.

Power Query profiling revealed that approximately:

⚠ 23% of Customer IDs were missing

At first glance, this might look like bad data.

However, experienced Data Engineers know that missing values often tell a business story.

Possible explanations include:

  • Guest purchases
  • Anonymous customers
  • System limitations
  • Legacy transaction records
  • Operational exceptions

Rather than deleting the records immediately, we documented the issue for further investigation.


๐Ÿ”„ Negative Quantities Discovered

The next investigation focused on transaction quantities.

Filtering records where:

Quantity < 0

revealed numerous unusual transactions.

Examples included:

  • Damaged inventory
  • Lost stock
  • Manual corrections
  • Returns
  • Business adjustments

Several records contained descriptions such as:

  • lost
  • damages
  • sold as gold
  • manual adjustments

This is exactly the type of operational complexity found inside real business systems.


๐Ÿ’ฐ Negative Price Transactions

The next validation focused on product pricing.

Filtering:

Price < 0

revealed a very small number of records.

Interestingly, all of them were associated with:

๐Ÿ“‰ Adjust Bad Debt

These were not product sales.

They represented accounting adjustments where unpaid customer balances were written off.

This discovery highlights an important lesson:

Raw datasets often contain operational, accounting, and business process records that are not immediately obvious.

๐Ÿงพ Zero Price Transactions

Further investigation focused on records where:

Price = 0

These records contained various unusual business activities including:

  • Inventory corrections
  • Returns
  • Stock adjustments
  • Damaged items
  • Lost inventory
  • Internal operational records

Again, these are not necessarily errors.

They represent real-world business events that occurred during retail operations.


๐Ÿ— Why This Matters for Data Engineering

Many beginners believe ETL starts with dashboards.

In reality:

Dashboards come last. Data understanding comes first.

Before creating reports, engineers must answer questions like:

  • What data is missing?
  • Which records represent returns?
  • Which transactions are adjustments?
  • What business rules are hidden inside the data?
  • Which records should remain and which should be cleaned?

These decisions directly affect business reporting accuracy.


๐Ÿ“ˆ What We Have Learned So Far

  • ✅ Importing Real Business Data
  • ✅ Combining Multiple Sources
  • ✅ Building a Staging Layer
  • ✅ Data Profiling Techniques
  • ✅ Missing Value Investigation
  • ✅ Return & Cancellation Analysis
  • ✅ Inventory Adjustment Discovery
  • ✅ Accounting Adjustment Identification

Step by step, our retail dataset is revealing how real businesses operate behind the scenes.


๐Ÿš€ What Happens Next?

Now that we understand the major data quality issues, the next phase begins:

๐Ÿงน Data Cleaning & Transformation

In the next part of the series, we will begin creating a cleaner and more reliable dataset by applying practical ETL rules used in professional environments.

This is where our raw retail data starts transforming into analytics-ready business information.


๐Ÿ’™ The DETLENG Philosophy

Most people collect datasets.

Data Engineers build understanding.

At DETLENG, our goal is not simply to learn tools.

Our goal is to understand how data behaves, how businesses operate, and how modern engineering transforms raw information into trusted insights.

๐Ÿš€ Great dashboards are built on clean data.
๐Ÿ’ก Great Data Engineers are built through curiosity.
❤️ Every expert started with a single row of data.

๐ŸŒ DETLENG.COM

Build • Learn • Engineer • Grow

One Dataset. Seven ETL Journeys. Endless Possibilities.

Popular posts from this blog

Production ETL Pipeline Execution Overview

Choosing the Right Dataset for a Realistic ETL / ELT Pipeline Project