
Project 02 · Lakehouse + orchestration

Telecom Billing Lakehouse — Medallion Architecture

Bronze, Silver, and Gold tiers built over synthetic Call Detail Records (CDRs). Airflow orchestrates ingestion into Iceberg tables on S3, Great Expectations enforces contracts at the boundary, and dbt models analyst-ready marts on top.

Bronze (raw landings: raw_payload · ingested_at) → Silver (cleansed events: caller · callee · duration_sec) → Gold (business marts: market · arpu · churn_rate)

Tier drill-down

Bronze, Silver, Gold — what changes between them

The sections below walk through the schema, sample rows, transformations, and data-quality checks at each stage of the medallion.

Bronze · Raw landings

~ 1.2 B rows (illustrative)

Append-only landing zone. We keep raw payloads as they arrived so any downstream issue is replayable.

Sample rows

raw_payload | ingested_at | source_file | partition
{"caller":"+1-214-555-0142","callee":"+1-415-555-0188",… | 2026-04-01T08:15:01Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08
{"caller":"+1-512-555-0190","callee":"+44-20-7946-0958"… | 2026-04-01T08:15:02Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08
{"caller":"+1-214-555-0173","callee":"+1-718-555-0119",… | 2026-04-01T08:15:13Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08
{"caller":"+1-415-555-0188","callee":"+1-214-555-0142",… | 2026-04-01T08:15:45Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08
{"caller":"+1-303-555-0166","callee":"+1-303-555-0166",… | 2026-04-01T08:16:04Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08

Schema

  • raw_payload · string (JSON)
  • ingested_at · timestamp
  • source_file · string
  • schema_version · string
  • ingest_partition · date

Transformations applied

  • Schema-on-read JSON parse
  • Append to ingest partition
  • Capture source file + ingested_at

Data-quality checks

  • Schema validation via Great Expectations
  • Null-rate <= 2% on caller/callee
  • Source-file checksum match
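
In the project these checks run as Great Expectations suites; as a framework-free illustration, here is roughly what the null-rate contract and source-file checksum boil down to. Function names and the exact thresholds wiring are assumptions for this sketch, not the project's actual code:

```python
import hashlib
import json


def sha256_of(path: str) -> str:
    """Checksum used to compare a landed file against its manifest entry."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def null_rate(rows: list[dict], field: str) -> float:
    """Fraction of rows where `field` is missing, null, or empty."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(field) in (None, ""))
    return missing / len(rows)


def validate_landing(raw_lines: list[str], max_null_rate: float = 0.02) -> bool:
    """Schema-on-read JSON parse plus the <=2% null-rate contract on caller/callee."""
    rows = [json.loads(line) for line in raw_lines]  # raises on malformed JSON
    return all(null_rate(rows, f) <= max_null_rate for f in ("caller", "callee"))
```

A failing contract here corresponds to the GE task failing in Airflow and blocking the downstream transform.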

Orchestration

Daily Airflow DAG

A linear pipeline that generates synthetic CDR data, lands it on Bronze, validates with Great Expectations, normalizes into Silver, builds Gold marts via dbt, and notifies on completion.

Airflow DAG · daily lakehouse pipeline

6 tasks · runs nightly at 02:00 UTC · LocalExecutor

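The six tasks form a strictly linear dependency chain. As a framework-free sketch of the execution order (the real project wires these as Airflow operators; the task names below are hypothetical labels mirroring the description above):

```python
from graphlib import TopologicalSorter

# Hypothetical task names for the six-step pipeline described above.
DEPS = {
    "land_bronze":        {"generate_synthetic_cdr"},
    "validate_bronze_ge": {"land_bronze"},
    "normalize_silver":   {"validate_bronze_ge"},
    "build_gold_dbt":     {"normalize_silver"},
    "notify":             {"build_gold_dbt"},
}


def run_order(deps: dict) -> list[str]:
    """Topological order: for a linear chain there is exactly one valid sequence."""
    return list(TopologicalSorter(deps).static_order())


print(run_order(DEPS))
```

Because every edge also carries a "fail stops downstream" semantic in Airflow, a failed GE validation or dbt test halts everything after it.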

Data quality

Scorecard across the four pillars

Completeness, validity, uniqueness, and freshness — measured by the dbt + Great Expectations test surface and surfaced into a simple scorecard.

Values are illustrative placeholders for the demo - real numbers come from the project's Great Expectations + dbt test runs.

Data quality scorecard - illustrative placeholder values.
Metric | Value (%) | Description
Completeness | 99.4 | % of records with all required fields populated.
Validity | 98.7 | % of records that pass type, range, and regex checks.
Uniqueness | 100.0 | % of records with no duplicate primary key.
Freshness | 99.9 | % of partitions arriving within their SLA window.
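
The real percentages come from GE and dbt test results, but the metrics themselves are simple ratios. A minimal sketch of how three of the pillars could be computed over Silver-shaped rows (field names and the primary-key choice are assumptions for illustration):

```python
import re

REQUIRED = ("caller", "callee", "duration_sec")
PHONE_SHAPE = re.compile(r"^\+[\d\-]+$")  # loose phone-number shape for the demo


def completeness(rows: list[dict]) -> float:
    """% of records with all required fields populated."""
    ok = sum(1 for r in rows if all(r.get(f) not in (None, "") for f in REQUIRED))
    return 100.0 * ok / len(rows)


def validity(rows: list[dict]) -> float:
    """% of records passing type, range, and regex checks."""
    def valid(r: dict) -> bool:
        d = r.get("duration_sec")
        return (isinstance(d, int) and d >= 0
                and bool(PHONE_SHAPE.match(str(r.get("caller", "")))))
    return 100.0 * sum(valid(r) for r in rows) / len(rows)


def uniqueness(rows: list[dict], key=("caller", "callee", "started_at")) -> float:
    """% of records whose (assumed) primary key is not a duplicate."""
    seen: set[tuple] = set()
    dupes = 0
    for r in rows:
        k = tuple(r.get(f) for f in key)
        dupes += k in seen
        seen.add(k)
    return 100.0 * (len(rows) - dupes) / len(rows)
```

Freshness is the odd one out: it is measured per partition against an SLA window rather than per record.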

dbt lineage

From sources to marts

The graph below mirrors the dbt project's actual model layout, from Bronze sources through Silver models to Gold marts.

dbt lineage · sources to gold


Sources (Bronze) → Silver models → Gold marts

Architecture

From producer to BI

Mermaid diagram of the full pipeline. The engineering write-up underneath calls out the trade-offs that recruiters and reviewers tend to ask about.

flowchart LR
    G[Synthetic CDR generator] -->|parquet| R[(MinIO/S3 raw zone)]
    R -->|Airflow ingest DAG| BR[(Iceberg Bronze)]
    BR -->|Airflow transform DAG + GE checks| SI[(Iceberg Silver)]
    SI -->|dbt run| GO[(Gold marts: revenue_by_market, arpu_monthly, churn_signals)]
    GO --> BI[BI / consumers]

Why medallion over a single table? Telecom CDR volumes are huge and lossy. Keeping a Bronze append-only layer means a schema change, a parsing bug, or a vendor-side correction can be replayed without re-ingesting from upstream. Silver is where validation happens but rows are never dropped — bad rows are flagged via is_valid + validation_notes so analysts can audit them. Gold is the only layer BI tools see.
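
The flag-don't-drop pattern described above is simple in code. A minimal sketch, assuming hypothetical field names (the project's actual Silver transform runs inside the Airflow/Iceberg pipeline):

```python
def flag_row(row: dict) -> dict:
    """Silver-style validation: annotate rather than drop.

    Bad rows keep flowing with is_valid=False plus human-readable
    validation_notes, so analysts can audit them downstream.
    """
    notes = []
    if not row.get("caller"):
        notes.append("missing caller")
    if not row.get("callee"):
        notes.append("missing callee")
    d = row.get("duration_sec")
    if not isinstance(d, int) or d < 0:
        notes.append("bad duration_sec")
    return {**row, "is_valid": not notes, "validation_notes": "; ".join(notes)}
```

Gold marts then filter on `is_valid`, while the flagged rows remain queryable in Silver for auditing.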

Why Iceberg over Delta? Both work. Iceberg's hidden partitioning and partition evolution made schema migrations on the synthetic data far less painful than rewrite-on-write tables, and it gives engine portability if a downstream consumer wants Trino or Athena instead of Spark. Delta would be a fine choice in a Spark-mostly shop.

Why Great Expectations at Bronze and dbt tests at Silver/Gold? GE catches the bad records before they enter the lake — schema, null rates, source checksums. dbt tests catch the bad relationships: uniqueness, referential integrity, accepted values. The two tools cover complementary failure modes, and both block the downstream task in Airflow if they fail.

Stack

What this project uses, and why

  • Apache Airflow · orchestrates the nightly ingest → validate → transform → dbt pipeline
  • Apache Iceberg · open table format; hidden partitioning and partition evolution ease schema migrations
  • MinIO / S3 · object storage for the lake; MinIO stands in for S3 locally
  • Great Expectations · data contracts at the Bronze boundary (schema, null rates, checksums)
  • dbt · Silver/Gold modeling plus uniqueness, referential-integrity, and accepted-values tests
  • Terraform · infrastructure as code

See the full code

The repo runs locally on docker-compose with no AWS account. Airflow, MinIO, Iceberg, GE, and the dbt project are all there.