
Project 01 · Streaming + lakehouse

Real-Time Fraud Signals Pipeline

A production-pattern streaming pipeline: synthetic transaction events ingested through Kafka, scored by Spark Structured Streaming with exactly-once semantics, persisted to Delta, modeled in dbt, and surfaced to an operator dashboard.

Producer → Kafka → Spark Stream → Delta Lake → dbt → Dashboard

Live demo

Synthetic stream + rolling throughput

Plays back a 1,000-event fixture at ~2 events per second. About 5% are anomalies, balanced across velocity, geographic, and amount-outlier patterns.
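The fixture playback can be approximated with a small generator. This is a sketch under assumptions: the function and field names here are illustrative, not taken from the repo; it only shows how a ~5% anomaly rate split across the three patterns might be synthesized.

```python
import random

# The three fraud patterns the pipeline detects (see "Detection logic" below).
FRAUD_PATTERNS = ["velocity", "geo_anomaly", "amount_outlier"]

def make_event(rng: random.Random, i: int) -> dict:
    """Synthesize one transaction; ~5% carry a fraud-pattern tag."""
    is_fraud = rng.random() < 0.05
    return {
        "event_id": i,
        "account_id": f"ACCT_{rng.randint(10, 99)}",
        "amount": round(rng.uniform(5, 200), 2),
        "country": "US",
        "fraud_pattern": rng.choice(FRAUD_PATTERNS) if is_fraud else None,
    }

def make_fixture(n: int = 1000, seed: int = 7) -> list[dict]:
    """Deterministic 1,000-event fixture, seeded for repeatable playback."""
    rng = random.Random(seed)
    return [make_event(rng, i) for i in range(n)]
```

Seeding the RNG keeps the demo deterministic, so the same anomalies appear in the same order on every replay.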

Live transaction feed

Synthetic stream from fixture · 25 sample txns

  • ACCT_••84 · Whole Foods Market · US · $42.18
  • ACCT_••84 · Starbucks · US · $12.50
  • ACCT_••70 · Spotify · US · $9.99
  • ACCT_••55 · LuxeWatches Online · RU · $4,280.00 · FRAUD: AMOUNT OUTLIER
  • ACCT_••44 · Target · US · $18.75
  • ACCT_••70 · Best Buy · US · $119.45
  • ACCT_••91 · Gulf Gas Station · US · $67.20
  • ACCT_••82 · Crypto Exchange Wire · VN · $1,899.00 · FRAUD: GEO ANOMALY

Window of last 8 events: 6 normal · 2 flagged

Throughput (events/sec)

Rolling 60-second window · synthetic stream

Latency p50 / p95 / p99 (illustrative placeholders)

  • p50 ~120 ms
  • p95 ~410 ms
  • p99 ~860 ms

Latency values shown as placeholders. Real numbers come from local Spark runs in the project repo.

Throughput data points, last 60 seconds

[Line chart: events per second over the trailing 60 s, oscillating between roughly 14.4 and 21.6 events/sec · synthetic stream]

Detection logic

Three fraud patterns

Each pattern is implemented in PySpark with a watermarked window or stateful aggregation. See the project repo for the full operators and unit tests.

Velocity

Many transactions on the same account in a short window — often after a card-not-present compromise.

Detection logic

# flag accounts with more than 8 transactions in any 60-second window;
# the 5-minute watermark bounds how late events may still be counted
from pyspark.sql.functions import count, window

df.withWatermark("ts", "5 minutes") \
  .groupBy(window("ts", "60 seconds"), "account_id") \
  .agg(count("*").alias("txn_count")) \
  .filter("txn_count > 8")

Geographic Anomaly

Two transactions on the same account from countries that cannot be reached in the elapsed time.

Detection logic

from pyspark.sql import Window
from pyspark.sql.functions import col, lag

by_account = Window.partitionBy("account_id").orderBy("ts")
prev = lag("country").over(by_account)
gap = col("ts").cast("long") - lag("ts").over(by_account).cast("long")
df.withColumn("impossible_travel",
   (prev != col("country")) & (gap < 3600))
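The gap check above collapses "cannot be reached" into a fixed one-hour threshold. A physically grounded version compares great-circle distance against a plausible travel speed. A standalone sketch (the speed ceiling and coordinates are illustrative, not from the repo):

```python
from math import asin, cos, radians, sin, sqrt

MAX_SPEED_KMH = 900.0  # roughly commercial-flight cruise speed

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + \
        cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def impossible_travel(p1, p2, gap_seconds):
    """True if covering the distance would need more than MAX_SPEED_KMH."""
    if gap_seconds <= 0:
        return True  # simultaneous events in two places
    km = haversine_km(*p1, *p2)
    return km / (gap_seconds / 3600.0) > MAX_SPEED_KMH
```

For example, New York to Moscow is roughly 7,500 km, so two transactions an hour apart from those locations would imply ~7,500 km/h and be flagged.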

Amount Outlier

Amount falls outside the account's recent distribution — Z-score above the rolling threshold.

Detection logic

from pyspark.sql import Window
from pyspark.sql.functions import avg, col, stddev

by_account_30d = Window.partitionBy("account_id") \
    .orderBy(col("ts").cast("long")).rangeBetween(-30 * 86400, 0)
mean = avg("amount").over(by_account_30d)
sd   = stddev("amount").over(by_account_30d)
df.withColumn("z", (col("amount") - mean) / sd) \
  .filter("abs(z) > 3.5")

Architecture

From event to dashboard

The Mermaid diagram below shows the flow; the engineering write-up underneath calls out the trade-offs that recruiters and senior reviewers tend to ask about.

flowchart LR
  P[Synthetic event producer] -->|JSON events| K[(Kafka topic: tx-events)]
  K -->|Structured Streaming| S[Spark consumer<br/>watermark + checkpoint]
  S -->|exactly-once| D[(Delta lakehouse<br/>bronze + silver)]
  D --> M[dbt models<br/>fct_transactions, dim_account]
  M --> B[Streamlit dashboard]
  S -.scored events.-> A[Anomaly detector<br/>velocity, geo, z-score]
  A --> D

Why streaming over batch? Fraud ages out fast. A batch ETL that lands an anomaly two hours late is informational, not operational. The producer publishes JSON events to a Kafka topic partitioned by account_id, which preserves per-account ordering even when consumers scale horizontally.
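Keying by account_id is what preserves per-account ordering: Kafka's default partitioner hashes the message key, so every event for one account lands on the same partition. A minimal model of that property (Kafka itself uses murmur2; crc32 here is a deterministic stand-in for illustration):

```python
import json
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Deterministic stand-in for Kafka's default key partitioner:
    same key -> same partition, hence per-account ordering."""
    return zlib.crc32(key) % num_partitions

def to_record(event: dict) -> tuple[bytes, bytes]:
    """Key the message by account_id; value is the JSON-encoded event."""
    return event["account_id"].encode(), json.dumps(event).encode()
```

Two events for the same account always map to the same partition, regardless of how many consumer instances are reading the topic.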

Why Spark Structured Streaming over Flink? Both are mature; Flink has lower latency at the tail and richer stateful semantics. I picked Spark for this project because (a) the downstream lakehouse is already Spark-native via Delta, so the interop story is one engine instead of two, and (b) Structured Streaming's batch-equivalence model matches how analytics teams already reason about SQL. For sub-100ms detection, Flink would be the right call.

Exactly-once trade-offs. Spark's exactly-once guarantee is "exactly-once across the checkpoint boundary," which assumes the sink is idempotent. Delta MERGE on a deterministic key is idempotent; raw Kafka-to-Kafka is not. I document the failure modes (e.g., committing the offset before the Delta write) in the repo's docs/architecture.md, alongside the watermark policy that bounds late-event memory.
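The idempotent-sink claim can be illustrated without Spark: a MERGE keyed on a deterministic event_id upserts by key, so replaying a batch after a failed offset commit leaves the table unchanged. A pure-Python model of those semantics (Delta's actual MERGE runs inside Spark; this only demonstrates the behavior):

```python
def merge_batch(table: dict, batch: list[dict]) -> dict:
    """Model of: MERGE INTO table USING batch
    ON table.event_id = batch.event_id
    WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT."""
    for row in batch:
        table[row["event_id"]] = row  # matched -> update, else insert
    return table

table: dict = {}
batch = [{"event_id": "e1", "amount": 42.18},
         {"event_id": "e2", "amount": 9.99}]
merge_batch(table, batch)
merge_batch(table, batch)  # replay after a failed offset commit: no duplicates
```

An append-only sink replayed the same way would double-write both rows, which is exactly the failure mode the deterministic merge key prevents.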

Stack

What this project uses, and why

  • Python
  • PySpark
  • Apache Kafka
  • Delta Lake
  • dbt
  • Streamlit
  • Docker
  • GitHub Actions

See the full code

The repo runs locally via docker-compose with no cloud account needed. Architecture doc, dbt tests, and the producer are all there.