⚠️ Portfolio Demonstration: All records, error counts, quality scores, and data samples are AI-generated synthetic data for demonstration. No real company data is shown. The audit framework and dimension structure are illustrative of real methodology.
📋 Executive Summary

The Problem: Nobody Knows How Bad the Data Is

The company had been accumulating data debt for years. Customer records had duplicate entries, addresses with missing postal codes, phone numbers in 14 different formats, email addresses that were clearly fake, and transaction records that didn't match between systems. Revenue reports didn't tie out. Marketing was emailing people who'd been dead for years. Sales reps were calling numbers that belonged to other companies.

But nobody knew the scale of the problem. Was it 5% of records? 50%? Which types of errors were most common? Which tables were the worst offenders? Without a structured audit, remediation was impossible — the team was playing whack-a-mole with individual bad records instead of fixing root causes.

We conducted a structured data quality audit across 8 dimensions — Completeness, Accuracy, Consistency, Timeliness, Uniqueness, Validity, Integrity, and Conformity — profiling every record in the 4 core tables (Customers, Transactions, Products, Interactions). Then we built automated cleanup scripts and deployed real-time prevention rules that catch errors at the point of entry.

340K
Total data quality issues found
28.3% of all records affected
58 → 94
Overall quality score (0–100)
+36 points post-remediation
94%
New errors prevented
Via automated validation rules
4 weeks
Audit to full remediation
Including prevention deployment

The 8-Dimension Audit Framework

Data quality isn't one thing — it's eight. We scored every table across each dimension on a 0–100 scale, creating a quality fingerprint that shows exactly where the problems are concentrated:

Quality Scorecard — Before vs After

Radar view across all 8 dimensions

Issues by Dimension

Count of issues found per quality dimension
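For illustration, here is a minimal sketch of how per-dimension scores like these might be computed with pandas. The column names, the natural key used for duplicates, and the email validity check are assumptions for the example, not the production profiling code, and only three of the eight dimensions are shown.

```python
import pandas as pd

def dimension_scores(df: pd.DataFrame, critical_fields: list[str]) -> dict[str, float]:
    """Score one table 0-100 on a few of the eight dimensions (illustrative subset)."""
    scores = {}

    # Completeness: share of critical-field cells that are non-null
    scores["completeness"] = df[critical_fields].notna().to_numpy().mean() * 100

    # Uniqueness: share of rows that are not duplicated on an assumed natural key
    scores["uniqueness"] = (1 - df.duplicated(subset=["name", "email"]).mean()) * 100

    # Validity: share of non-null emails passing a basic format check
    emails = df["email"].dropna().astype(str)
    scores["validity"] = (
        emails.str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").mean() * 100 if len(emails) else 100.0
    )

    return {dim: round(score, 1) for dim, score in scores.items()}

# Example usage (table and column names are placeholders):
# customers = pd.read_sql("SELECT * FROM customers", conn)
# print(dimension_scores(customers, ["email", "phone", "postal", "company"]))
```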
🔍 Finding #1

42% of Customer Records Are Missing Critical Fields

The worst dimension was Completeness. In the Customers table alone, 42% of records were missing at least one critical field — postal code (34% null), phone number (28% null), email (19% null), or company name (15% null). Some records had nothing but a name and an ID — imported from a legacy system migration in 2019 that nobody ever cleaned up.

The downstream impact: marketing campaigns were being sent to incomplete segments, shipping estimates were wrong (no postal code), and the sales team couldn't call or email a quarter of their pipeline.

Field Completeness — Customer Table

% of records with non-null values per field
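A per-field completeness profile like the one charted above takes only a few lines of pandas. This is a sketch with assumed field names; the commented output is illustrative, consistent with the null rates quoted above.

```python
import pandas as pd

def field_completeness(df: pd.DataFrame) -> pd.Series:
    """Percent of records with a non-null value, per column, sorted worst-first."""
    return (df.notna().mean() * 100).round(1).sort_values()

# Example against the Customers table (illustrative field names and output):
# customers = pd.read_sql("SELECT * FROM customers", conn)
# print(field_completeness(customers[["postal", "phone", "email", "company"]]))
# postal     66.0
# phone      72.0
# email      81.0
# company    85.0
```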

Key insight: The 2019 legacy migration was the root cause of 61% of all completeness issues. A single bulk import — done without validation — created more data debt than 4 years of organic data entry combined.

🐛 Finding #2

What Bad Data Actually Looks Like

Abstract quality scores are useful for executives, but the engineering team needs to see the actual mess. Here's a side-by-side comparison of real records before and after cleanup — every highlighted field was an issue our scripts detected and fixed:

❌ Before (Raw Data)
id: 10482
name: john doe
email: john@gmial.com
phone: (555) 123-4567
postal: NULL
company: Acme corp.
created: 13/25/2023
status: actve
revenue: $1,234.56
source: fb ad
✓ After (Cleaned)
id: 10482
name: John Doe
email: john@gmail.com
phone: +15551234567
postal: 90210 (geocoded)
company: Acme Corp
created: 2023-03-25
status: active
revenue: 1234.56 (numeric)
source: facebook_paid

This single record had 9 issues across 5 quality dimensions: extra whitespace and casing in the name (Conformity), inconsistent casing and trailing punctuation in the company name (Conformity), a typo in the email domain (Accuracy), an inconsistent phone format (Consistency), a null postal code (Completeness), an invalid date format (Validity), a misspelled enum value (Validity), a currency symbol in a numeric field (Conformity), and an unstandardized source label (Consistency).
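As a sketch of what the cleanup scripts do to a record like this, the helper below applies several of the fixes shown above: name casing, known email-domain typos, phone normalization, enum mapping, and numeric parsing. The lookup tables, field names, and US-default phone assumption are illustrative; the real scripts also handled dates, geocoding, and company-name normalization.

```python
import re

# Illustrative lookup tables; the production scripts used much larger mappings.
DOMAIN_TYPOS = {"gmial.com": "gmail.com", "yaho.com": "yahoo.com"}
ENUM_MAP = {"actve": "active", "fb ad": "facebook_paid"}

def clean_record(rec: dict) -> dict:
    out = dict(rec)

    # Conformity: collapse whitespace and title-case the name
    out["name"] = " ".join(rec["name"].split()).title()

    # Accuracy: fix known typos in the email domain
    local, _, domain = rec["email"].strip().lower().partition("@")
    out["email"] = f"{local}@{DOMAIN_TYPOS.get(domain, domain)}"

    # Consistency: strip phone formatting to digits, assume +1 for 10-digit US numbers
    digits = re.sub(r"\D", "", rec["phone"])
    out["phone"] = "+1" + digits if len(digits) == 10 else "+" + digits

    # Validity / Consistency: map free-text values to the controlled vocabulary
    out["status"] = ENUM_MAP.get(rec["status"].strip().lower(), rec["status"])
    out["source"] = ENUM_MAP.get(rec["source"].strip().lower(), rec["source"])

    # Conformity: strip currency symbols and commas so revenue is numeric
    out["revenue"] = float(re.sub(r"[^\d.]", "", str(rec["revenue"])))

    return out
```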

👥 Finding #3

14,200 Duplicate Customer Records — Some with 6+ Copies

The Uniqueness dimension revealed 14,200 duplicate customer records — 8.4% of the table. The worst case: one customer existed 7 times with slight name variations ("John Doe", "john doe", "J. Doe", "JOHN DOE", "John D.", "Jon Doe", "Doe, John"). Each duplicate had its own transaction history, making the customer's true lifetime value invisible.

We built a fuzzy matching engine that groups probable duplicates by normalized name + email + phone similarity, then merges transaction histories into a single golden record.
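A minimal sketch of the matching logic, using Python's standard-library difflib for string similarity; the weights, threshold, and blocking strategy in the production engine were more involved than this.

```python
from difflib import SequenceMatcher

def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and sort tokens so 'Doe, John' matches 'john doe'."""
    tokens = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower()).split()
    return " ".join(sorted(tokens))

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def is_probable_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Weighted blend of name/email/phone similarity; weights and threshold are illustrative."""
    score = (
        0.5 * similarity(normalize_name(rec_a["name"]), normalize_name(rec_b["name"]))
        + 0.3 * similarity(rec_a.get("email") or "", rec_b.get("email") or "")
        + 0.2 * similarity(rec_a.get("phone") or "", rec_b.get("phone") or "")
    )
    return score >= threshold

# Probable duplicates are grouped into clusters and their transaction
# histories merged into a single golden record (merge logic not shown).
```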

Duplicate Distribution

How many copies per duplicate cluster

Issues by Table

Which tables have the most quality problems
📊 Finding #4

The Error Taxonomy: 340K Issues in 12 Categories

We classified every issue into a structured taxonomy. The top 3 categories alone — null values (31%), format inconsistencies (22%), and duplicates (14%) — account for 67% of all issues. These are also the three most automatable fix types, which is why the remediation scripts achieved such high coverage.

Error Taxonomy — All 340K Issues Classified

Percentage of total issues by error type

Key insight: 67% of all data quality issues fall into just 3 categories that can be fixed with automated scripts. The remaining 33% require human review — but focusing automation on the top 3 first gives you the highest ROI per engineering hour.
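For a sense of how the taxonomy numbers above roll up, here is a sketch of the aggregation step: tally classified issues by category and measure how much of the total the automatable categories cover. The issue-record shape and category names are assumptions for the example.

```python
from collections import Counter

def taxonomy_summary(issues: list[dict]) -> list[tuple[str, float]]:
    """Share of total issues per category, sorted descending (as in the chart above)."""
    counts = Counter(issue["category"] for issue in issues)
    total = sum(counts.values())
    return [(cat, round(100 * n / total, 1)) for cat, n in counts.most_common()]

# Illustrative usage: the top 3 categories are the automatable ones.
# AUTOMATABLE = {"null_value", "format_inconsistency", "duplicate"}
# shares = taxonomy_summary(all_issues)
# automatable_pct = sum(pct for cat, pct in shares if cat in AUTOMATABLE)  # ~67
```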

🛠️ Finding #5

From Audit to Prevention in 4 Weeks

Week 1 — Discovery
Data Profiling & Dimension Scoring
Ran automated profiling on all 4 core tables (1.2M records). Scored each table across 8 quality dimensions. Produced the initial quality scorecard, showing an overall score of 58/100.
Week 2 — Classification
Error Taxonomy & Root Cause Analysis
Classified all 340K issues into 12 error categories. Identified the 2019 legacy migration as the root cause of 61% of completeness issues. Mapped every error type to its automation potential.
Week 3 — Remediation
Cleanup Scripts & Deduplication
Built and deployed 18 Python cleanup scripts targeting the top error categories. Ran fuzzy deduplication, standardized formats, fixed typos, backfilled nulls via geocoding and enrichment APIs. Quality score jumped to 89/100.
Week 4 — Prevention
Validation Rules & Monitoring Dashboard
Deployed 24 real-time validation rules in the data pipeline (dbt tests + custom SQL checks). Built a quality monitoring dashboard that alerts when any dimension drops below threshold. New error introduction dropped 94%.
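As an illustration of the Week 4 alerting step, here is a sketch of the dimension-threshold check a daily monitoring job could run after each refresh. The thresholds, table name, score inputs, and alert sink are placeholders, not the deployed configuration.

```python
# Illustrative thresholds per dimension; real values were tuned per table.
THRESHOLDS = {"completeness": 90, "uniqueness": 95, "validity": 92}

def check_thresholds(table: str, scores: dict[str, float]) -> list[str]:
    """Return one alert message per dimension that has dropped below its threshold."""
    return [
        f"[data-quality] {table}.{dim} = {score:.1f}, below threshold {THRESHOLDS[dim]}"
        for dim, score in scores.items()
        if dim in THRESHOLDS and score < THRESHOLDS[dim]
    ]

# Example with hypothetical scores from the daily profiling job:
# for msg in check_thresholds("customers", {"completeness": 88.2, "validity": 96.4}):
#     send_alert(msg)  # placeholder alert sink (e.g. Slack, email)
```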

Quality Score Progression — Weekly

Overall score (0–100) across the 4-week remediation sprint

Prevention Automation: 24 Rules That Catch Errors at Entry

Fixing bad data is expensive. Preventing it is cheap. We deployed 24 automated validation rules that run on every data write, catching errors before they enter the system (a sketch of the first rule appears after the list):

Email format validation
Regex + MX record check. Rejects obviously fake domains (test.com, example.com) and typos (gmial, yaho).
ON INSERT → customers.email
Duplicate prevention
Fuzzy match on name + email + phone before insert. Blocks creation if match confidence >85%.
ON INSERT → customers.*
Phone standardization
Auto-converts any phone format to E.164 international standard (+[country][number]).
ON INSERT/UPDATE → customers.phone
Null field alerting
Allows insert but flags record for review if critical fields (email, phone, postal) are null.
ON INSERT → customers.* WHERE field IS NULL
Date format enforcement
Auto-converts common formats (MM/DD/YYYY, DD-Mon-YY) to YYYY-MM-DD and rejects any date it cannot parse.
ON INSERT → *.date_fields
Enum standardization
Maps free-text values to controlled vocabulary. "fb ad" → "facebook_paid", "actve" → "active".
ON INSERT/UPDATE → *.enum_fields
Referential integrity check
Blocks transactions with non-existent customer_id or product_id. Catches broken foreign keys at write time.
ON INSERT → transactions.*
Anomaly detection
Flags transactions >3σ from mean (value, quantity). Catches fat-finger entries like $100,000 instead of $1,000.
ON INSERT → transactions.amount
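To make the rules concrete, here is a minimal sketch of the email-format rule, the first in the list above. The regex, domain lists, and return shape are illustrative, and the MX-record lookup is omitted to keep the sketch self-contained.

```python
import re

# Domains the rule rejects outright and common typo corrections it suggests
# (illustrative lists; the deployed rule used a longer set plus an MX lookup).
FAKE_DOMAINS = {"test.com", "example.com"}
TYPO_DOMAINS = {"gmial.com": "gmail.com", "yaho.com": "yahoo.com"}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]{2,}$", re.IGNORECASE)

def validate_email(email: str) -> tuple[bool, str]:
    """Return (accept, reason). Conceptually runs ON INSERT against customers.email."""
    email = email.strip().lower()
    if not EMAIL_RE.match(email):
        return False, "malformed address"
    domain = email.rsplit("@", 1)[1]
    if domain in FAKE_DOMAINS:
        return False, f"placeholder domain {domain}"
    if domain in TYPO_DOMAINS:
        return False, f"likely typo, did you mean @{TYPO_DOMAINS[domain]}?"
    return True, "ok"

# validate_email("john@gmial.com")  -> (False, "likely typo, did you mean @gmail.com?")
# validate_email("jane@acme.io")    -> (True, "ok")
```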

Interactive Quality Monitor

Select a table and time window to explore quality scores, issue trends, and dimension breakdowns. This monitoring view runs daily in production.

Data Quality Monitoring Dashboard
Quality Scorecard
Issues by Dimension
Issue Trend (Weekly)
Error Category Breakdown

Key Features

📐 8-dimension audit framework

Completeness, Accuracy, Consistency, Timeliness, Uniqueness, Validity, Integrity, Conformity — each scored 0–100 per table for a precise quality fingerprint.

🐛 340K issue classification

Every issue categorized into a 12-type error taxonomy with root cause tagging, automation potential scoring, and remediation priority ranking.

🛠️ 18 automated cleanup scripts

Python + SQL scripts targeting the top error categories: deduplication, format standardization, null backfill, typo correction, and enum mapping.

🛡️ 24 real-time prevention rules

Validation rules running on every data write — from email format checks to anomaly detection — cutting new error introduction by 94%.

📊 Quality monitoring dashboard

Daily-refresh dashboard tracking quality scores across all 8 dimensions with threshold alerts, trend lines, and table-level drill-down.

📖 Data quality playbook

Delivered a 30-page runbook documenting every dimension definition, scoring methodology, cleanup script, prevention rule, and escalation procedure.

The Outcome

58 → 94

Quality score improved 36 points

94% fewer

New errors entering the system

4 weeks

From audit to fully automated prevention

How much is bad data costing you?

If your team is making decisions on data you don't trust, a structured audit can tell you exactly where the problems are — and a remediation plan can fix them in weeks, not months.

Work With Us

Want to tackle a similar challenge?

Book a free 20-minute diagnostic and we'll give you an honest read on your data, reporting, or analytics setup — no sales pitch.

Book a Free Diagnostic →
View All Case Studies