Project 03 · AI-assisted data engineering
AI-Assisted SQL Optimizer
A CLI that uses Claude to suggest Spark and Snowflake query rewrites and partition strategies. Benchmarked against a 50-query corpus with sqlglot-driven heuristics and an optional EXPLAIN verifier.
Live demo
Optimize a query
Pick one of the five corpus samples. The original goes in; Claude returns a rewrite, reasoning, and an estimated cost delta. Output is replayed from a fixture so the page never makes an API call by default.
The demo shows the original SQL next to an inline diff of original vs optimized, followed by Claude's reasoning for the rewrite:
- Push country and date filters into subqueries so each side scans fewer rows before the join.
- On Snowflake / Iceberg this lets the engine prune partitions and apply Bloom filters before shuffle.
- Typical 3-10x reduction in shuffle bytes on joins with selective filters.
Pattern note · Filters higher in the plan get pushed down to scan-time, dropping rows before they participate in the join.
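For a concrete picture of this pattern, here is a minimal sketch of predicate pushdown using sqlglot's built-in optimizer rules (separate from the Claude-driven rewrite in the demo). The table and column names are made up, and exact output varies by sqlglot version.

```python
import sqlglot
from sqlglot.optimizer import optimize

# Toy query: the filter sits above the derived table.
sql = """
SELECT u.user_id
FROM (SELECT user_id, country FROM users) AS u
WHERE u.country = 'DE'
"""

# optimize() runs sqlglot's rule set, including pushdown_predicates; the
# country filter ends up inside the derived table, so rows are dropped at
# scan time rather than after the subquery materializes.
print(optimize(sqlglot.parse_one(sql, read="spark")).sql(dialect="spark", pretty=True))
```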
Corpus
Estimated cost reduction by query
Per-query cost-reduction estimates from the 50-query corpus; the five queries in the live demo are shown here, and full results are in the repo. The numbers are illustrative: they reflect planner heuristics, not production runs. EXPLAIN-based scoring is noted in the repo (see the methodology below).
| Query | Estimated cost reduction |
|---|---|
| Predicate pushdown into the inner join | 62% |
| Broadcast small dimension instead of shuffle join | 45% |
| Add partition predicate so the planner can prune | 88% |
| Replace self-join with window function | 70% |
| Project only required columns | 30% |
Methodology
How the optimizer works, scores, and where it fails
Three parts: how it works, how suggestions are scored, and the limitations a senior reviewer would call out.
How it works
- sqlglot parses the input into an AST, and a small heuristic pass collects findings: function-wrapped partition columns, SELECT *, exact COUNT DISTINCT, missing broadcast hints. A sketch of this pass follows the list.
- Findings plus the raw SQL are sent to the Anthropic Messages API with a structured optimizer prompt. Output is constrained to a JSON envelope of rewrite + reasoning (also sketched below).
- An optional EXPLAIN runner (Spark or Snowflake) executes both plans against a shadow database and reports the cost delta back to the CLI.
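A minimal sketch of the heuristic pass, assuming an illustrative finding vocabulary and a hard-coded partition-column list; the repo's actual identifiers and configuration will differ.

```python
import sqlglot
from sqlglot import exp

# Assumed partition-column list; a real run would read this from config.
PARTITION_COLUMNS = {"event_date"}

def collect_findings(sql: str, dialect: str = "spark") -> list[str]:
    ast = sqlglot.parse_one(sql, read=dialect)
    findings = []

    # SELECT * forces the engine to read every column.
    if next(ast.find_all(exp.Star), None) is not None:
        findings.append("select_star")

    # A partition column wrapped in a function defeats partition pruning.
    for func in ast.find_all(exp.Func):
        for col in func.find_all(exp.Column):
            if col.name in PARTITION_COLUMNS:
                findings.append(f"function_wrapped_partition_column:{col.name}")

    # Exact COUNT(DISTINCT ...) is often replaceable with an approximate count.
    for count in ast.find_all(exp.Count):
        if isinstance(count.this, exp.Distinct):
            findings.append("exact_count_distinct")

    return findings

print(collect_findings("SELECT * FROM events WHERE year(event_date) = 2024"))
```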
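And a sketch of the Claude call itself. The model id, prompt text, and envelope field names here are assumptions for illustration; the real prompt template is versioned in the repo.

```python
import json
import anthropic

# Stand-in for the versioned optimizer prompt shipped in the repo.
OPTIMIZER_PROMPT = (
    'You are a SQL optimizer. Reply with JSON only: '
    '{"rewrite": "<sql>", "reasoning": ["<step>", ...]}'
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def request_rewrite(sql: str, findings: list[str]) -> dict:
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # example model id
        max_tokens=2048,
        system=OPTIMIZER_PROMPT,
        messages=[
            {"role": "user", "content": json.dumps({"sql": sql, "findings": findings})}
        ],
    )
    # The prompt constrains the reply to the JSON envelope of rewrite + reasoning.
    return json.loads(message.content[0].text)
```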
Scoring rubric
- Keyword overlap with ground-truth labels in corpus/ground_truth.yaml: coarse, but cheap to run on every change. A sketch follows the list.
- EXPLAIN cost delta when a Spark or Snowflake engine is reachable; treated as approximate, not authoritative.
- Human review on the full 50-query corpus, recorded in the repo with the reviewer's rationale per query.
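A sketch of the keyword-overlap scorer from the first rubric item; the YAML layout and the query id are assumed for illustration, not the repo's actual schema.

```python
import yaml

def keyword_overlap(suggestion: str, labels: list[str]) -> float:
    """Fraction of ground-truth keywords that appear in the suggestion text."""
    text = suggestion.lower()
    hits = sum(1 for label in labels if label.lower() in text)
    return hits / len(labels) if labels else 0.0

with open("corpus/ground_truth.yaml") as f:
    ground_truth = yaml.safe_load(f)  # assumed shape: {query_id: [keyword, ...]}

# "q02" is a made-up query id for illustration.
print(keyword_overlap("Broadcast the small dimension table", ground_truth["q02"]))
```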
Limitations
- Claude can hallucinate column or table names that are not in the schema, especially for less common dialects; a mitigation sketch follows this list.
- EXPLAIN-based scoring is approximate: a "cheaper" plan can still be slower in practice.
- Dialect differences between Spark SQL and Snowflake surface edge cases (QUALIFY, lateral views, named struct fields).
- Suggestions require human review before applying to production. The CLI defaults to printing diffs, never auto-rewriting files.
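One cheap guard against the first failure mode, hallucinated names, is to check every table a rewrite references against a known schema before showing it. A sketch, with an assumed table list:

```python
import sqlglot
from sqlglot import exp

# Assumed known-table set; a real check would pull this from the catalog.
KNOWN_TABLES = {"orders", "users", "events"}

def unknown_tables(rewrite: str, dialect: str = "spark") -> set[str]:
    """Return any table names in the rewrite that are not in the schema."""
    ast = sqlglot.parse_one(rewrite, read=dialect)
    return {t.name for t in ast.find_all(exp.Table)} - KNOWN_TABLES

print(unknown_tables("SELECT * FROM orders o JOIN order_facts f ON o.id = f.id"))
# -> {'order_facts'}
```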
Live mode
Want to run this against a real Claude key?
The on-page demo uses pre-computed responses for security: no API keys are bundled into the client. The GitHub repo ships the full CLI plus an optional Cloudflare Worker proxy at src/workers/sql-optimizer-proxy/ for live API calls when self-hosted with your own ANTHROPIC_API_KEY. When you deploy the Worker, set PUBLIC_LIVE_DEMO_URL at build time and the demo swaps fixture lookup for a Worker fetch. The default build never reaches the network.
Architecture
How it works under the hood
Sequence diagram from the project content. Mirrors the animated hero diagram, with the additional EXPLAIN-runner branch the hero collapses into the Verified node.
```mermaid
sequenceDiagram
  participant U as User
  participant CLI as sql-optimizer CLI
  participant A as Analyzer (sqlglot + heuristics)
  participant C as Claude API
  participant E as EXPLAIN runner (optional)
  U->>CLI: analyze query.sql
  CLI->>A: parse + collect findings
  A-->>CLI: AST + findings
  CLI->>C: optimizer_prompt + sql + findings
  C-->>CLI: rewrite + reasoning
  CLI->>E: EXPLAIN original / EXPLAIN rewrite
  E-->>CLI: cost deltas
  CLI-->>U: markdown diff + cost report
```
Why heuristics + LLM, not just the LLM? Sending raw SQL to Claude works, but the model spends most of its budget re-deriving facts a parser already knows: which columns appear in the projection, where the partition column lives, whether the predicate is wrapped in a function. A small sqlglot pass extracts those cheaply and ships them as findings, leaving the model to decide what to do with them.
Why EXPLAIN is optional. EXPLAIN cost numbers are informative, not ground truth. They reflect the planner’s stale statistics and its own broadcast threshold. The CLI treats them as a tiebreaker between candidate rewrites, never as the deciding signal.
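A sketch of that tiebreaker logic, assuming candidates carry an id and a heuristic score; the repo's actual types and field names will differ.

```python
def choose_candidate(
    candidates: list[dict],
    explain_costs: dict[str, float] | None = None,
) -> dict:
    """Pick a rewrite by heuristic score; use EXPLAIN cost only to break ties."""
    top = max(c["heuristic_score"] for c in candidates)
    tied = [c for c in candidates if c["heuristic_score"] == top]
    if len(tied) == 1 or not explain_costs:
        return tied[0]
    # EXPLAIN cost is consulted only among heuristic ties, never to override.
    return min(tied, key=lambda c: explain_costs.get(c["id"], float("inf")))
```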
Prompt sensitivity. Small changes to the optimizer prompt materially change suggestion quality. The prompt template is versioned in the repo, and corpus tests pin a known prompt version so regressions surface in CI rather than in production.
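One way such a pin can look in pytest, assuming the prompt lives at prompts/optimizer_prompt.txt with its recorded hash alongside (both paths are assumptions):

```python
import hashlib
from pathlib import Path

def test_prompt_template_is_pinned():
    """Fail CI if the optimizer prompt changes without re-running the corpus."""
    prompt = Path("prompts/optimizer_prompt.txt").read_bytes()
    recorded = Path("prompts/optimizer_prompt.sha256").read_text().strip()
    assert hashlib.sha256(prompt).hexdigest() == recorded, (
        "Prompt template changed: re-run the corpus benchmark and update the pin"
    )
```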
Stack
What this project uses, and why
- Python, the implementation language for the CLI and heuristic pass
- Typer, the CLI framework (entrypoint sketched below)
- Anthropic SDK, for the Messages API calls to Claude
- sqlglot, for SQL parsing, AST analysis, and dialect handling
- pytest, for the corpus regression tests
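A minimal Typer entrypoint matching the flow in the architecture diagram. It assumes the collect_findings and request_rewrite sketches above are importable; the command and option names are illustrative, not the repo's actual interface.

```python
from pathlib import Path

import typer

app = typer.Typer()

@app.command()
def analyze(
    query_file: Path,
    dialect: str = typer.Option("spark", help="SQL dialect for parsing"),
) -> None:
    """Parse a query, ask Claude for a rewrite, and print the result."""
    sql = query_file.read_text()
    findings = collect_findings(sql, dialect)  # heuristic pass, sketched earlier
    result = request_rewrite(sql, findings)    # Claude call, sketched earlier
    typer.echo(result["rewrite"])
    for step in result["reasoning"]:
        typer.echo(f"- {step}")

if __name__ == "__main__":
    app()
```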
See the full code
The CLI, the 50-query corpus, the prompt template, and the optional EXPLAIN runner are all in the repo. Self-hosted live mode goes through a small Cloudflare Worker that holds your key.