Project 02 · Lakehouse + orchestration
Telecom Billing Lakehouse — Medallion Architecture
Bronze, Silver, and Gold tiers built over synthetic Call Detail Records. Airflow orchestrates ingestion into Iceberg tables on S3, Great Expectations enforces contracts at the boundary, and dbt models analyst-ready marts on top.
Tier drill-down
Bronze, Silver, Gold — what changes between them
Click a tier above or use the tabs below to see the schema, sample rows, transformations, and data-quality checks at each stage of the medallion.
Bronze · Raw landings
Append-only landing zone. We keep raw payloads as they arrived so any downstream issue is replayable.
Sample rows
| raw_payload | ingested_at | source_file | partition |
|---|---|---|---|
| {"caller":"+1-214-555-0142","callee":"+1-415-555-0188",… | 2026-04-01T08:15:01Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08 |
| {"caller":"+1-512-555-0190","callee":"+44-20-7946-0958"… | 2026-04-01T08:15:02Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08 |
| {"caller":"+1-214-555-0173","callee":"+1-718-555-0119",… | 2026-04-01T08:15:13Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08 |
| {"caller":"+1-415-555-0188","callee":"+1-214-555-0142",… | 2026-04-01T08:15:45Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08 |
| {"caller":"+1-303-555-0166","callee":"+1-303-555-0166",… | 2026-04-01T08:16:04Z | cdr_2026_04_01_08.json.gz | dt=2026-04-01/hr=08 |
Schema
- raw_payload · string (JSON)
- ingested_at · timestamp
- source_file · string
- schema_version · string
- ingest_partition · date
Transformations applied
- Schema-on-read JSON parse
- Append to ingest partition
- Capture source file + ingested_at
Data-quality checks
- Schema validation via Great Expectations
- Null-rate <= 2% on caller/callee
- Source-file checksum match
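The Bronze gate can be approximated in a few lines of plain Python. In the project itself these run as a Great Expectations suite; the function names and inline sample below are illustrative, not the repo's:

```python
import hashlib
import json

def parse_payloads(raw_lines):
    """Schema-on-read: parse each raw JSON payload, keeping failures as None."""
    out = []
    for line in raw_lines:
        try:
            out.append(json.loads(line))
        except json.JSONDecodeError:
            out.append(None)
    return out

def null_rate(records, field):
    """Fraction of records where `field` failed to parse or is null/empty."""
    bad = sum(1 for r in records if r is None or r.get(field) in (None, ""))
    return bad / len(records)

def checksum_ok(file_bytes, expected_sha256):
    """Source-file checksum match against the value recorded at hand-off."""
    return hashlib.sha256(file_bytes).hexdigest() == expected_sha256

records = parse_payloads([
    '{"caller": "+1-214-555-0142", "callee": "+1-415-555-0188"}',
    '{"caller": "+1-512-555-0190", "callee": null}',
])
assert null_rate(records, "caller") <= 0.02  # passes the Bronze gate
assert null_rate(records, "callee") == 0.5   # this batch would be rejected
```

A failed check here blocks the downstream Airflow task, so bad files never reach Silver.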
Orchestration
Daily Airflow DAG
A linear pipeline that generates synthetic CDR data, lands it on Bronze, validates with Great Expectations, normalizes into Silver, builds Gold marts via dbt, and notifies on completion.
Airflow DAG · daily lakehouse pipeline
6 tasks · runs nightly at 02:00 UTC · LocalExecutor
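In spirit the DAG is a linear chain where any failure blocks everything downstream. A stdlib sketch of that control flow (task names are stand-ins, not the repo's Airflow operators):

```python
def generate_cdrs(ctx):  ctx["generated"] = True           # synthetic CDR batch
def land_bronze(ctx):    ctx["bronze"] = True              # append to Iceberg Bronze
def validate_ge(ctx):    ctx["ge_passed"] = ctx["bronze"]  # Great Expectations gate
def build_silver(ctx):   ctx["silver"] = True              # normalize + flag rows
def run_dbt_gold(ctx):   ctx["gold"] = True                # dbt builds the marts
def notify(ctx):         ctx["notified"] = True            # completion ping

PIPELINE = [generate_cdrs, land_bronze, validate_ge,
            build_silver, run_dbt_gold, notify]

def run_daily(pipeline):
    """Run tasks strictly in order; a failed GE gate short-circuits the rest,
    mirroring how a failed task blocks its downstream tasks in Airflow."""
    ctx = {}
    for task in pipeline:
        task(ctx)
        if task is validate_ge and not ctx["ge_passed"]:
            return ctx  # Silver, Gold, and the notification never run
    return ctx

state = run_daily(PIPELINE)
assert state["notified"]
```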
Data quality
Scorecard across the four pillars
Completeness, validity, uniqueness, and freshness, measured across the dbt and Great Expectations test suites and rolled up into a simple scorecard.
Completeness
% of records with all required fields populated.
Validity
% of records that pass type, range, and regex checks.
Uniqueness
% of records with no duplicate primary key.
Freshness
% of partitions arriving within their SLA window.
| Metric | Value (%) | Description |
|---|---|---|
| Completeness | 99.4 | % of records with all required fields populated. |
| Validity | 98.7 | % of records that pass type, range, and regex checks. |
| Uniqueness | 100.0 | % of records with no duplicate primary key. |
| Freshness | 99.9 | % of partitions arriving within their SLA window. |
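As a sketch, the four percentages can be computed from a batch of Silver rows like this. Column names such as `call_id`, `event_time`, and the per-row freshness simplification are assumptions; the project derives the real numbers from dbt and GE test results, with freshness measured per partition:

```python
from datetime import datetime, timedelta

def scorecard(rows, required, sla=timedelta(hours=2)):
    """Roll a batch of rows up into the four pillar percentages."""
    n = len(rows)
    complete = sum(all(r.get(f) not in (None, "") for f in required) for r in rows)
    valid = sum(1 for r in rows if r.get("is_valid"))
    unique = len({r["call_id"] for r in rows})
    fresh = sum(1 for r in rows if r["ingested_at"] - r["event_time"] <= sla)
    return {k: round(100 * v / n, 1) for k, v in [
        ("completeness", complete), ("validity", valid),
        ("uniqueness", unique), ("freshness", fresh)]}

t0 = datetime(2026, 4, 1, 8, 0)
rows = [
    {"call_id": "a", "caller": "+1-214-555-0142", "callee": "+1-415-555-0188",
     "is_valid": True, "event_time": t0, "ingested_at": t0 + timedelta(minutes=15)},
    {"call_id": "b", "caller": "+1-512-555-0190", "callee": "",
     "is_valid": False, "event_time": t0, "ingested_at": t0 + timedelta(hours=3)},
]
print(scorecard(rows, required=["caller", "callee"]))
# {'completeness': 50.0, 'validity': 50.0, 'uniqueness': 100.0, 'freshness': 50.0}
```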
dbt lineage
From sources to marts
Hover or focus a model to see its source file. The graph mirrors the dbt project's actual model layout.
dbt lineage · sources to gold
Architecture
From producer to BI
Mermaid diagram of the full pipeline. The engineering write-up underneath calls out the trade-offs that recruiters and reviewers tend to ask about.
flowchart LR
G[Synthetic CDR generator] -->|parquet| R[(MinIO/S3 raw zone)]
R -->|Airflow ingest DAG| BR[(Iceberg Bronze)]
BR -->|Airflow transform DAG<br/>GE checks| SI[(Iceberg Silver)]
SI -->|dbt run| GO[("Gold marts<br/>revenue_by_market, arpu_monthly, churn_signals")]
GO --> BI[BI / consumers]
Why medallion over a single table? Telecom CDR feeds are high-volume and lossy. Keeping an append-only Bronze layer means a schema change, a parsing bug, or a vendor-side correction can be replayed without re-ingesting from upstream. Silver is where validation happens, but rows are never dropped; bad rows are flagged via is_valid + validation_notes so analysts can audit them. Gold is the only layer BI tools see.
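The flag-don't-drop rule in Silver amounts to something like this sketch (the E.164 regex and the self-call rule are illustrative checks, not the project's full list):

```python
import re

E164 = re.compile(r"^\+[1-9]\d{6,14}$")  # loose E.164 shape

def validate_row(row):
    """Silver never drops rows: tag with is_valid + validation_notes instead."""
    notes = []
    for field in ("caller", "callee"):
        num = (row.get(field) or "").replace("-", "")
        if not E164.match(num):
            notes.append(f"{field}: not E.164")
    if row.get("caller") == row.get("callee"):
        notes.append("self-call")
    return {**row, "is_valid": not notes, "validation_notes": "; ".join(notes)}

row = validate_row({"caller": "+1-303-555-0166", "callee": "+1-303-555-0166"})
# row["is_valid"] is False; row["validation_notes"] == "self-call"
```

Analysts can then slice on `is_valid` in Gold without ever losing the audit trail.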
Why Iceberg over Delta? Both work. Iceberg's hidden partitioning and partition evolution made schema migrations on the synthetic data far less painful than rewrite-on-write tables, and it gives engine portability if a downstream consumer wants Trino or Athena instead of Spark. Delta would be a fine choice in a Spark-mostly shop.
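Hidden partitioning deserves a concrete illustration. Iceberg's day() transform maps a timestamp column to whole days since the Unix epoch, so the engine prunes files from an ordinary WHERE clause on the timestamp and nobody hand-writes dt=.../hr=... predicates. A quick sketch of the transform itself:

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_transform(ts):
    """Iceberg's day() partition transform: whole days since 1970-01-01 UTC."""
    return (ts - EPOCH).days

# Two calls from the same day land in the same hidden partition.
a = day_transform(datetime(2026, 4, 1, 8, 15, tzinfo=timezone.utc))
b = day_transform(datetime(2026, 4, 1, 23, 59, tzinfo=timezone.utc))
assert a == b == 20544
```

Because the partition value is derived, the spec can later evolve (say, day to hour) without rewriting existing data files.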
Why Great Expectations at Bronze and dbt tests at Silver/Gold? GE catches the bad records before they enter the lake — schema, null rates, source checksums. dbt tests catch the bad relationships: uniqueness, referential integrity, accepted values. The two tools cover complementary failure modes, and both block the downstream task in Airflow if they fail.
Stack
What this project uses, and why
- Apache Airflow
- Apache Iceberg
- MinIO / S3
- Great Expectations
- dbt
- Terraform
See the full code
The repo runs locally on docker-compose with no AWS account. Airflow, MinIO, Iceberg, GE, and the dbt project are all there.