Project 01 · Streaming + lakehouse
Real-Time Fraud Signals Pipeline
A production-pattern streaming pipeline: synthetic transaction events ingested through Kafka, scored by Spark Structured Streaming with exactly-once semantics, persisted to Delta, modeled in dbt, and surfaced to an operator dashboard.
Live demo
Synthetic stream + rolling throughput
Plays back a 1,000-event fixture at ~2 events per second. About 5% are anomalies, balanced across velocity, geographic, and amount-outlier patterns.
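As an illustration of the playback mechanics, here is a minimal sketch of replaying a JSON-lines fixture at a fixed rate; the path, field layout, and helper name are assumptions for the example, not the repo's actual demo code.

```python
import json
import time

def play_fixture(path="fixtures/transactions_1k.jsonl", rate_hz=2.0):
    """Replay a JSON-lines fixture at a fixed rate (~2 events/sec by default)."""
    with open(path) as f:
        events = [json.loads(line) for line in f]
    for event in events:
        yield event                # hand the event to the feed widget / producer
        time.sleep(1.0 / rate_hz)  # pace the stream at rate_hz events per second
```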
Live transaction feed
Synthetic stream from fixture · 25 sample txns (8 shown below)
| Account | Merchant | Country | Amount | Flag |
|---|---|---|---|---|
| ACCT_••84 | Whole Foods Market | US | $42.18 | |
| ACCT_••84 | Starbucks | US | $12.50 | |
| ACCT_••70 | Spotify | US | $9.99 | |
| ACCT_••55 | LuxeWatches Online | RU | $4280.00 | FRAUD: AMOUNT OUTLIER |
| ACCT_••44 | Target | US | $18.75 | |
| ACCT_••70 | Best Buy | US | $119.45 | |
| ACCT_••91 | Gulf Gas Station | US | $67.20 | |
| ACCT_••82 | Crypto Exchange Wire | VN | $1899.00 | FRAUD: GEO ANOMALY |
Throughput (events/sec)
Rolling 60-second window · synthetic stream
Latency (p50 / p95 / p99)
p50 ~120 ms · p95 ~410 ms · p99 ~860 ms
Latency values are placeholders; real numbers come from local Spark runs in the project repo.
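The percentiles above would come from instrumenting the local Spark runs; one hedged way to collect them (the helper and polling cadence are illustrative, not the repo's instrumentation) is to sample each micro-batch's trigger duration from the query's progress reports:

```python
import time

def sample_trigger_latency(query, samples=100, poll_s=1.0):
    """Collect per-micro-batch trigger durations (ms) from a running StreamingQuery."""
    durations, seen = [], set()
    while len(durations) < samples:
        p = query.lastProgress                  # dict for the latest micro-batch, or None
        if p and p["batchId"] not in seen:      # skip batches already sampled
            seen.add(p["batchId"])
            durations.append(p["durationMs"]["triggerExecution"])
        time.sleep(poll_s)
    durations.sort()
    return {q: durations[int(len(durations) * q / 100) - 1] for q in (50, 95, 99)}
```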
Synthetic throughput oscillates between roughly 14 and 22 events/sec across the trailing 60 seconds.
Detection logic
Three fraud patterns
Each pattern is implemented in PySpark with a watermarked window or stateful aggregation. See the project repo for the full operators and unit tests.
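All three snippets below operate on a streaming DataFrame named df. A minimal sketch of how that frame might be built from the Kafka topic follows; the broker address and event schema are assumptions for the example, not the repo's definitions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fraud-signals").getOrCreate()

# Assumed event shape; the producer's fixture defines the authoritative schema.
schema = StructType([
    StructField("account_id", StringType()),
    StructField("merchant", StringType()),
    StructField("country", StringType()),
    StructField("amount", DoubleType()),
    StructField("ts", TimestampType()),
])

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "tx-events")
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("e"))
      .select("e.*"))
```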
Velocity
Many transactions on the same account in a short window — often after a card-not-present compromise.
Detection logic
from pyspark.sql.functions import window, count

# stateful aggregation per account, last 60s window
df.withWatermark("ts", "5 minutes") \
  .groupBy(window("ts", "60 seconds"), "account_id") \
  .agg(count("*").alias("txn_count")) \
  .filter("txn_count > 8")

Geographic Anomaly
Two transactions on the same account from countries that cannot be reached in the elapsed time.
Detection logic
prev = lag("country").over(by_account)
gap = (col("ts") - lag("ts").over(by_account)).cast("long")
df.withColumn("impossible_travel",
(prev != col("country")) & (gap < 3600))Amount Outlier
Amount falls outside the account's recent distribution — Z-score above the rolling threshold.
Detection logic
mean = avg("amount").over(by_account_30d)
sd = stddev("amount").over(by_account_30d)
df.withColumn("z", (col("amount") - mean) / sd) \
.filter("abs(z) > 3.5")Architecture
From event to dashboard
The Mermaid diagram below shows the flow; the engineering write-up underneath calls out the trade-offs that recruiters and senior reviewers tend to ask about.
flowchart LR
    P[Synthetic event producer] -->|JSON events| K[(Kafka topic: tx-events)]
    K -->|Structured Streaming| S[Spark consumer<br/>watermark + checkpoint]
    S -->|exactly-once| D[(Delta lakehouse<br/>bronze + silver)]
    D --> M[dbt models<br/>fct_transactions, dim_account]
    M --> B[Streamlit dashboard]
    S -. scored events .-> A[Anomaly detector<br/>velocity, geo, z-score]
    A --> D
Why streaming over batch? Fraud ages out fast: a batch ETL that lands an anomaly two hours late is informational, not operational. The producer publishes JSON events to a Kafka topic partitioned by account_id, which preserves per-account ordering even when consumers scale horizontally.
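To make the partitioning choice concrete, here is a minimal producer sketch keyed by account_id; the kafka-python client, broker address, and helper name are assumptions, not necessarily what the repo's producer uses.

```python
import json
from kafka import KafkaProducer  # kafka-python client, assumed for the example

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(event: dict) -> None:
    # Keying by account_id hashes every event for an account to the same partition,
    # so per-account ordering survives horizontal consumer scaling.
    producer.send("tx-events", key=event["account_id"], value=event)
```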
Why Spark Structured Streaming over Flink? Both are mature; Flink has lower latency at the tail and richer stateful semantics. I picked Spark for this project because (a) the downstream lakehouse is already Spark-native via Delta, so the interop story is one engine instead of two, and (b) Structured Streaming's batch-equivalence model matches how analytics teams already reason about SQL. For sub-100ms detection, Flink would be the right call.
Exactly-once trade-offs. Spark's exactly-once guarantee is "exactly-once across the checkpoint boundary," which assumes the sink is idempotent. Delta MERGE on a deterministic key is idempotent; raw Kafka-to-Kafka is not. I document the failure modes (e.g., committing the offset before the Delta write) in the repo's docs/architecture.md, alongside the watermark policy that bounds late-event memory.
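To make the idempotent-sink point concrete, here is a hedged sketch of a foreachBatch upsert into Delta keyed on a deterministic event id; the paths, the event_id column, and the scored_stream name are assumptions, not the repo's code.

```python
from delta.tables import DeltaTable

def upsert_to_silver(batch_df, batch_id):
    # MERGE on a deterministic key makes replays idempotent: if a micro-batch is
    # reprocessed after a failure, matched rows are updated rather than duplicated.
    silver = DeltaTable.forPath(spark, "/lake/silver/transactions")  # spark = active SparkSession
    (silver.alias("t")
     .merge(batch_df.alias("s"), "t.event_id = s.event_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

(scored_stream.writeStream
 .foreachBatch(upsert_to_silver)
 .option("checkpointLocation", "/lake/_checkpoints/silver_transactions")
 .start())
```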
Stack
What this project uses, and why
- Python
- PySpark
- Apache Kafka
- Delta Lake
- dbt
- Streamlit
- Docker
- GitHub Actions
See the full code
The repo runs locally with docker-compose; no cloud account is required. The architecture doc, dbt tests, and the event producer are all there.