Project 01 · Streaming + lakehouse
Real-Time Fraud Signals Pipeline
A production-pattern streaming pipeline: synthetic transaction events ingested through Kafka, scored by Spark Structured Streaming with exactly-once semantics, persisted to Delta, modeled in dbt, and surfaced to an operator dashboard.
Live demo
Synthetic stream + rolling throughput
Plays back a 1,000-event fixture at ~2 events per second. About 5% are anomalies, balanced across velocity, geographic, and amount-outlier patterns.
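As an illustration of the playback mechanics, here is a minimal sketch of replaying a JSON-lines fixture at a fixed rate; the path, field layout, and helper name are assumptions for the example, not the repo's actual demo code.

```python
import json
import time

def play_fixture(path="fixtures/transactions_1k.jsonl", rate_hz=2.0):
    """Replay a JSON-lines fixture at a fixed rate (~2 events/sec by default)."""
    with open(path) as f:
        events = [json.loads(line) for line in f]
    for event in events:
        yield event                # hand the event to the feed widget / producer
        time.sleep(1.0 / rate_hz)  # pace the stream at rate_hz events per second
```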
Live transaction feed
Synthetic stream from fixture · 25 sample txns (8 shown below)
| Account | Merchant | Country | Amount | Flag |
|---|---|---|---|---|
| ACCT_••84 | Whole Foods Market | US | $42.18 | |
| ACCT_••84 | Starbucks | US | $12.50 | |
| ACCT_••70 | Spotify | US | $9.99 | |
| ACCT_••55 | LuxeWatches Online | RU | $4280.00 | FRAUD: AMOUNT OUTLIER |
| ACCT_••44 | Target | US | $18.75 | |
| ACCT_••70 | Best Buy | US | $119.45 | |
| ACCT_••91 | Gulf Gas Station | US | $67.20 | |
| ACCT_••82 | Crypto Exchange Wire | VN | $1899.00 | FRAUD: GEO ANOMALY |
Throughput (events/sec)
Rolling 60-second window · synthetic stream
Latency (p50 / p95 / p99)
p50 ~120 ms · p95 ~410 ms · p99 ~860 ms
Latency values are placeholders; real numbers come from local Spark runs in the project repo.
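The percentiles above would come from instrumenting the local Spark runs; one hedged way to collect them (the helper and polling cadence are illustrative, not the repo's instrumentation) is to sample each micro-batch's trigger duration from the query's progress reports:

```python
import time

def sample_trigger_latency(query, samples=100, poll_s=1.0):
    """Collect per-micro-batch trigger durations (ms) from a running StreamingQuery."""
    durations, seen = [], set()
    while len(durations) < samples:
        p = query.lastProgress                  # dict for the latest micro-batch, or None
        if p and p["batchId"] not in seen:      # skip batches already sampled
            seen.add(p["batchId"])
            durations.append(p["durationMs"]["triggerExecution"])
        time.sleep(poll_s)
    durations.sort()
    return {q: durations[int(len(durations) * q / 100) - 1] for q in (50, 95, 99)}
```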
Synthetic throughput oscillates between roughly 14 and 22 events/sec across the trailing 60 seconds.
Detection logic
Three fraud patterns
Each pattern is implemented in PySpark with a watermarked window or stateful aggregation. See the project repo for the full operators and unit tests.
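All three snippets below operate on a streaming DataFrame named df. A minimal sketch of how that frame might be built from the Kafka topic follows; the broker address and event schema are assumptions for the example, not the repo's definitions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fraud-signals").getOrCreate()

# Assumed event shape; the producer's fixture defines the authoritative schema.
schema = StructType([
    StructField("account_id", StringType()),
    StructField("merchant", StringType()),
    StructField("country", StringType()),
    StructField("amount", DoubleType()),
    StructField("ts", TimestampType()),
])

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "tx-events")
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("e"))
      .select("e.*"))
```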
Velocity
Many transactions on the same account in a short window — often after a card-not-present compromise.
Detection logic
from pyspark.sql.functions import window, count

# stateful aggregation per account, last 60s window
df.withWatermark("ts", "5 minutes") \
  .groupBy(window("ts", "60 seconds"), "account_id") \
  .agg(count("*").alias("txn_count")) \
  .filter("txn_count > 8")

Geographic Anomaly
Two transactions on the same account from countries that cannot be reached in the elapsed time.
Detection logic
prev = lag("country").over(by_account)
gap = (col("ts") - lag("ts").over(by_account)).cast("long")
df.withColumn("impossible_travel",
(prev != col("country")) & (gap < 3600))Amount Outlier
Amount falls outside the account's recent distribution — Z-score above the rolling threshold.
Detection logic
mean = avg("amount").over(by_account_30d)
sd = stddev("amount").over(by_account_30d)
df.withColumn("z", (col("amount") - mean) / sd) \
.filter("abs(z) > 3.5")Architecture
From event to dashboard
The Mermaid diagram below shows the flow; the engineering write-up underneath calls out the trade-offs that recruiters and senior reviewers tend to ask about.
flowchart LR
    P[Synthetic event producer] -->|JSON events| K[(Kafka topic: tx-events)]
    K -->|Structured Streaming| S[Spark consumer<br/>watermark + checkpoint]
    S -->|exactly-once| D[(Delta lakehouse<br/>bronze + silver)]
    D --> M[dbt models<br/>fct_transactions, dim_account]
    M --> B[Streamlit dashboard]
    S -. scored events .-> A[Anomaly detector<br/>velocity, geo, z-score]
    A --> D
Why streaming over batch? Fraud ages out fast: a batch ETL that lands an anomaly two hours late is informational, not operational. The producer publishes JSON events to a Kafka topic partitioned by account_id, which preserves per-account ordering even when consumers scale horizontally.
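To make the partitioning choice concrete, here is a minimal producer sketch keyed by account_id; the kafka-python client, broker address, and helper name are assumptions, not necessarily what the repo's producer uses.

```python
import json
from kafka import KafkaProducer  # kafka-python client, assumed for the example

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(event: dict) -> None:
    # Keying by account_id hashes every event for an account to the same partition,
    # so per-account ordering survives horizontal consumer scaling.
    producer.send("tx-events", key=event["account_id"], value=event)
```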
Why Spark Structured Streaming over Flink? Both are mature; Flink has lower latency at the tail and richer stateful semantics. I picked Spark for this project because (a) the downstream lakehouse is already Spark-native via Delta, so the interop story is one engine instead of two, and (b) Structured Streaming's batch-equivalence model matches how analytics teams already reason about SQL. For sub-100ms detection, Flink would be the right call.
Exactly-once trade-offs. Spark's exactly-once guarantee is "exactly-once across the checkpoint boundary," which assumes the sink is idempotent. Delta MERGE on a deterministic key is idempotent; raw Kafka-to-Kafka is not. I document the failure modes (e.g., committing the offset before the Delta write) in the repo's docs/architecture.md, alongside the watermark policy that bounds late-event memory.
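To make the idempotent-sink point concrete, here is a hedged sketch of a foreachBatch upsert into Delta keyed on a deterministic event id; the paths, the event_id column, and the scored_stream name are assumptions, not the repo's code.

```python
from delta.tables import DeltaTable

def upsert_to_silver(batch_df, batch_id):
    # MERGE on a deterministic key makes replays idempotent: if a micro-batch is
    # reprocessed after a failure, matched rows are updated rather than duplicated.
    silver = DeltaTable.forPath(spark, "/lake/silver/transactions")  # spark = active SparkSession
    (silver.alias("t")
     .merge(batch_df.alias("s"), "t.event_id = s.event_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

(scored_stream.writeStream
 .foreachBatch(upsert_to_silver)
 .option("checkpointLocation", "/lake/_checkpoints/silver_transactions")
 .start())
```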
Stack
What this project uses, and why
- Python
- PySpark
- Apache Kafka
- Delta Lake
- dbt
- Streamlit
- Docker
- GitHub Actions
See the full code
The repo runs locally with docker-compose; no cloud account is required. The architecture doc, dbt tests, and the event producer are all there.