Project 02 · Lakehouse + orchestration
Telecom Billing Lakehouse — Medallion Architecture
Bronze, Silver, and Gold tiers built over synthetic Call Detail Records. Airflow orchestrates ingestion into Iceberg tables on S3, Great Expectations enforces contracts at the boundary, and dbt models analyst-ready marts on top.
Tier drill-down
Bronze, Silver, Gold — what changes between them
Click a tier above or use the tabs below to see the schema, sample rows, transformations, and data-quality checks at each stage of the medallion.
Bronze · Raw landings
Append-only landing zone. We keep raw payloads as they arrived so any downstream issue is replayable.
Sample rows
| raw_payload | ingested_at | source_file | partition |
|---|---|---|---|
| {"caller":"+1-214-555-0142","callee":"+1-415-555-0188",… | 2026-04-01T08:15:01Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08 |
| {"caller":"+1-512-555-0190","callee":"+44-20-7946-0958"… | 2026-04-01T08:15:02Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08 |
| {"caller":"+1-214-555-0173","callee":"+1-718-555-0119",… | 2026-04-01T08:15:13Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08 |
| {"caller":"+1-415-555-0188","callee":"+1-214-555-0142",… | 2026-04-01T08:15:45Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08 |
| {"caller":"+1-303-555-0166","callee":"+1-303-555-0166",… | 2026-04-01T08:16:04Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08 |
Schema
- raw_payload · string (JSON)
- ingested_at · timestamp
- source_file · string
- schema_version · string
- ingest_partition · date
Transformations applied
- Schema-on-read JSON parse
- Append to ingest partition
- Capture source file + ingested_at
Data-quality checks
- Schema validation via Great Expectations
- Null-rate <= 2% on caller/callee
- Source-file checksum match
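The Bronze gate can be approximated in a few lines of plain Python. In the project itself these run as a Great Expectations suite; the function names and inline sample below are illustrative, not the repo's:

```python
import hashlib
import json

def parse_payloads(raw_lines):
    """Schema-on-read: parse each raw JSON payload, keeping failures as None."""
    out = []
    for line in raw_lines:
        try:
            out.append(json.loads(line))
        except json.JSONDecodeError:
            out.append(None)
    return out

def null_rate(records, field):
    """Fraction of records where `field` failed to parse or is null/empty."""
    bad = sum(1 for r in records if r is None or r.get(field) in (None, ""))
    return bad / len(records)

def checksum_ok(file_bytes, expected_sha256):
    """Source-file checksum match against the value recorded at hand-off."""
    return hashlib.sha256(file_bytes).hexdigest() == expected_sha256

records = parse_payloads([
    '{"caller": "+1-214-555-0142", "callee": "+1-415-555-0188"}',
    '{"caller": "+1-512-555-0190", "callee": null}',
])
assert null_rate(records, "caller") <= 0.02  # passes the Bronze gate
assert null_rate(records, "callee") == 0.5   # this batch would be rejected
```

A failed check here blocks the downstream Airflow task, so bad files never reach Silver.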
Orchestration
Daily Airflow DAG
A linear pipeline that generates synthetic CDR data, lands it on Bronze, validates with Great Expectations, normalizes into Silver, builds Gold marts via dbt, and notifies on completion.
Airflow DAG · daily lakehouse pipeline
6 tasks · runs nightly at 02:00 UTC · LocalExecutor
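In spirit the DAG is a linear chain where any failure blocks everything downstream. A stdlib sketch of that control flow (task names are stand-ins, not the repo's Airflow operators):

```python
def generate_cdrs(ctx):  ctx["generated"] = True           # synthetic CDR batch
def land_bronze(ctx):    ctx["bronze"] = True              # append to Iceberg Bronze
def validate_ge(ctx):    ctx["ge_passed"] = ctx["bronze"]  # Great Expectations gate
def build_silver(ctx):   ctx["silver"] = True              # normalize + flag rows
def run_dbt_gold(ctx):   ctx["gold"] = True                # dbt builds the marts
def notify(ctx):         ctx["notified"] = True            # completion ping

PIPELINE = [generate_cdrs, land_bronze, validate_ge,
            build_silver, run_dbt_gold, notify]

def run_daily(pipeline):
    """Run tasks strictly in order; a failed GE gate short-circuits the rest,
    mirroring how a failed task blocks its downstream tasks in Airflow."""
    ctx = {}
    for task in pipeline:
        task(ctx)
        if task is validate_ge and not ctx["ge_passed"]:
            return ctx  # Silver, Gold, and the notification never run
    return ctx

state = run_daily(PIPELINE)
assert state["notified"]
```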
Data quality
Scorecard across the four pillars
Completeness, validity, uniqueness, and freshness, measured across the dbt and Great Expectations test suites and rolled up into a simple scorecard.
Completeness
% of records with all required fields populated.
Validity
% of records that pass type, range, and regex checks.
Uniqueness
% of records with no duplicate primary key.
Freshness
% of partitions arriving within their SLA window.
| Metric | Value (%) | Description |
|---|---|---|
| Completeness | 99.4 | % of records with all required fields populated. |
| Validity | 98.7 | % of records that pass type, range, and regex checks. |
| Uniqueness | 100.0 | % of records with no duplicate primary key. |
| Freshness | 99.9 | % of partitions arriving within their SLA window. |
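As a sketch, the four percentages can be computed from a batch of Silver rows like this. Column names such as `call_id`, `event_time`, and the per-row freshness simplification are assumptions; the project derives the real numbers from dbt and GE test results, with freshness measured per partition:

```python
from datetime import datetime, timedelta

def scorecard(rows, required, sla=timedelta(hours=2)):
    """Roll a batch of rows up into the four pillar percentages."""
    n = len(rows)
    complete = sum(all(r.get(f) not in (None, "") for f in required) for r in rows)
    valid = sum(1 for r in rows if r.get("is_valid"))
    unique = len({r["call_id"] for r in rows})
    fresh = sum(1 for r in rows if r["ingested_at"] - r["event_time"] <= sla)
    return {k: round(100 * v / n, 1) for k, v in [
        ("completeness", complete), ("validity", valid),
        ("uniqueness", unique), ("freshness", fresh)]}

t0 = datetime(2026, 4, 1, 8, 0)
rows = [
    {"call_id": "a", "caller": "+1-214-555-0142", "callee": "+1-415-555-0188",
     "is_valid": True, "event_time": t0, "ingested_at": t0 + timedelta(minutes=15)},
    {"call_id": "b", "caller": "+1-512-555-0190", "callee": "",
     "is_valid": False, "event_time": t0, "ingested_at": t0 + timedelta(hours=3)},
]
print(scorecard(rows, required=["caller", "callee"]))
# {'completeness': 50.0, 'validity': 50.0, 'uniqueness': 100.0, 'freshness': 50.0}
```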
dbt lineage
From sources to marts
Hover or focus a model to see its source file. The graph mirrors the dbt project's actual model layout.
dbt lineage · sources to gold
Architecture
From producer to BI
Mermaid diagram of the full pipeline. The engineering write-up underneath calls out the trade-offs that recruiters and reviewers tend to ask about.
flowchart LR
G[Synthetic CDR generator] -->|parquet| R[(MinIO/S3 raw zone)]
R -->|Airflow ingest DAG| BR[(Iceberg Bronze)]
BR -->|Airflow transform DAG<br/>GE checks| SI[(Iceberg Silver)]
SI -->|dbt run| GO[("Gold marts<br/>revenue_by_market, arpu_monthly, churn_signals")]
GO --> BI[BI / consumers]
Why medallion over a single table? Telecom CDR feeds are high-volume and lossy. Keeping an append-only Bronze layer means a schema change, a parsing bug, or a vendor-side correction can be replayed without re-ingesting from upstream. Silver is where validation happens, but rows are never dropped; bad rows are flagged via is_valid + validation_notes so analysts can audit them. Gold is the only layer BI tools see.
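The flag-don't-drop rule in Silver amounts to something like this sketch (the E.164 regex and the self-call rule are illustrative checks, not the project's full list):

```python
import re

E164 = re.compile(r"^\+[1-9]\d{6,14}$")  # loose E.164 shape

def validate_row(row):
    """Silver never drops rows: tag with is_valid + validation_notes instead."""
    notes = []
    for field in ("caller", "callee"):
        num = (row.get(field) or "").replace("-", "")
        if not E164.match(num):
            notes.append(f"{field}: not E.164")
    if row.get("caller") == row.get("callee"):
        notes.append("self-call")
    return {**row, "is_valid": not notes, "validation_notes": "; ".join(notes)}

row = validate_row({"caller": "+1-303-555-0166", "callee": "+1-303-555-0166"})
# row["is_valid"] is False; row["validation_notes"] == "self-call"
```

Analysts can then slice on `is_valid` in Gold without ever losing the audit trail.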
Why Iceberg over Delta? Both work. Iceberg's hidden partitioning and partition evolution made schema migrations on the synthetic data far less painful than rewrite-on-write tables, and it gives engine portability if a downstream consumer wants Trino or Athena instead of Spark. Delta would be a fine choice in a Spark-mostly shop.
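Hidden partitioning deserves a concrete illustration. Iceberg's day() transform maps a timestamp column to whole days since the Unix epoch, so the engine prunes files from an ordinary WHERE clause on the timestamp and nobody hand-writes dt=.../hr=... predicates. A quick sketch of the transform itself:

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_transform(ts):
    """Iceberg's day() partition transform: whole days since 1970-01-01 UTC."""
    return (ts - EPOCH).days

# Two calls from the same day land in the same hidden partition.
a = day_transform(datetime(2026, 4, 1, 8, 15, tzinfo=timezone.utc))
b = day_transform(datetime(2026, 4, 1, 23, 59, tzinfo=timezone.utc))
assert a == b == 20544
```

Because the partition value is derived, the spec can later evolve (say, day to hour) without rewriting existing data files.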
Why Great Expectations at Bronze and dbt tests at Silver/Gold? GE catches the bad records before they enter the lake — schema, null rates, source checksums. dbt tests catch the bad relationships: uniqueness, referential integrity, accepted values. The two tools cover complementary failure modes, and both block the downstream task in Airflow if they fail.
Stack
What this project uses, and why
- Apache Airflow
- Apache Iceberg
- MinIO / S3
- Great Expectations
- dbt
- Terraform
See the full code
The repo runs locally on docker-compose with no AWS account. Airflow, MinIO, Iceberg, GE, and the dbt project are all there.