🚀 Power BI Retail ETL Project – Part 3: Data Profiling & Data Quality Assessment
🚀 Power BI Retail ETL Project – Part 3
Data Profiling & Data Quality Assessment
Understanding Your Data Before Cleaning It
DETLENG.COM • One Dataset • Seven ETL Journeys • Endless Possibilities
📖 What We Accomplished So Far
In Part 1, we imported the Online Retail II Dataset into Power BI and loaded both yearly worksheets into Power Query.
In Part 2, we combined both datasets using Append Queries and created our first professional staging layer:
- ✅ retail_raw_2009_2010
- ✅ retail_raw_2010_2011
- ✅ retail_staging
Now it is time for one of the most important activities in modern Data Engineering:
🔍 Data Profiling
Before cleaning data, engineers first investigate the quality of the data.
Professional ETL pipelines never start by deleting records blindly. They start by understanding the dataset.
🎯 Today's Goal
We want answers to several important questions:
- How many records do we have?
- Which columns contain missing values?
- Are there any errors?
- Can we identify potential data quality issues?
- What should be cleaned later?
Think of this step as a health checkup for your data.
⚡ Enabling Data Profiling
Inside Power Query, navigate to:
View → Column Quality
Then switch profiling from:
Top 1000 Rows
to
Entire Dataset
This ensures Power BI analyzes all records rather than only a sample.
📸 Insert Data Profiling Screenshot Here
📊 What Did We Discover?
Dataset Size
After combining both years, our staging table contains:
✅ 1,067,371 Records
This is no longer a small Excel exercise.
This is a realistic business dataset suitable for professional ETL practice.
Column Quality Results
- ✅ Invoice → 100% Valid
- ✅ StockCode → 100% Valid
- ✅ Quantity → 100% Valid
- ✅ InvoiceDate → 100% Valid
- ✅ Price → 100% Valid
- ✅ Country → 100% Valid
- ⚠ Description → Less than 1% Empty
- ⚠ Customer ID → 23% Empty
🤔 Should We Delete Missing Customer IDs?
Many beginners immediately think:
❌ Missing values = Delete rows
Professional Data Engineers think differently.
Before removing data, we first ask:
- Could these be guest purchases?
- Could these be anonymous customers?
- Could the business intentionally allow sales without customer registration?
- Could this be a data capture issue?
At this stage, we do not delete anything.
We investigate first and decide later.
🏗 Why This Step Matters
A dashboard can hide bad data.
A Data Engineer must find bad data before it reaches the dashboard.
This is why Data Profiling is one of the most valuable skills in ETL and Analytics Engineering.
- ✅ Detect missing values
- ✅ Detect errors
- ✅ Understand data quality
- ✅ Prepare cleaning strategy
- ✅ Prevent reporting problems later
🚀 Updated ETL Architecture
retail_raw_2009_2010
↓
retail_raw_2010_2011
↓
APPEND
↓
retail_staging
↓
Data Profiling
↓
Future Cleaning Steps
Every enterprise ETL platform follows a similar layered approach.
🎓 What Happens Next?
Now that we understand the quality of our data, we can begin investigating real business issues.
In the next part, we will explore:
- 🔍 Cancelled Orders
- 🔍 Returned Transactions
- 🔍 Negative Quantities
- 🔍 Business Exceptions
- 🔍 Retail Data Quality Rules
This is where the project starts becoming a real Data Engineering case study.
💡 DETLENG Engineering Insight
Never clean data before understanding it.
The best Data Engineers are not the ones who delete data fastest.
They are the ones who understand the business behind the data.
🚀 One Dataset. Seven ETL Journeys. Endless Possibilities.
Build • Learn • Engineer • Grow
🌐 DETLENG.COM
