Project 03 · AI-assisted data engineering
AI-Assisted SQL Optimizer
A CLI that uses Claude to suggest Spark and Snowflake query rewrites and partition strategies. Benchmarked against a 50-query corpus with sqlglot-driven heuristics and an optional EXPLAIN verifier.
Live demo
Optimize a query
Pick one of the five corpus samples. The original goes in; Claude returns a rewrite, reasoning, and an estimated cost delta. Output is replayed from a fixture so the page never makes an API call by default.
The demo shows the original SQL next to an inline diff of original vs optimized, followed by Claude's reasoning for the rewrite:
- Push country and date filters into subqueries so each side scans fewer rows before the join.
- On Snowflake / Iceberg this lets the engine prune partitions and apply Bloom filters before shuffle.
- Typical 3-10x reduction in shuffle bytes on joins with selective filters.
Pattern note · Filters higher in the plan get pushed down to scan-time, dropping rows before they participate in the join.
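For a concrete picture of this pattern, here is a minimal sketch of predicate pushdown using sqlglot's built-in optimizer rules (separate from the Claude-driven rewrite in the demo). The table and column names are made up, and exact output varies by sqlglot version.

```python
import sqlglot
from sqlglot.optimizer import optimize

# Toy query: the filter sits above the derived table.
sql = """
SELECT u.user_id
FROM (SELECT user_id, country FROM users) AS u
WHERE u.country = 'DE'
"""

# optimize() runs sqlglot's rule set, including pushdown_predicates; the
# country filter ends up inside the derived table, so rows are dropped at
# scan time rather than after the subquery materializes.
print(optimize(sqlglot.parse_one(sql, read="spark")).sql(dialect="spark", pretty=True))
```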
Corpus
Estimated cost reduction by query
Per-query cost-reduction estimates from the 50-query corpus; the five queries in the live demo are shown here, and full results are in the repo. The numbers are illustrative: they reflect planner heuristics, not production runs. EXPLAIN-based scoring is noted in the repo (see the methodology below).
| Query | Estimated cost reduction |
|---|---|
| Predicate pushdown into the inner join | 62% |
| Broadcast small dimension instead of shuffle join | 45% |
| Add partition predicate so the planner can prune | 88% |
| Replace self-join with window function | 70% |
| Project only required columns | 30% |
Methodology
How the optimizer works, scores, and where it fails
Three parts: how it works, how suggestions are scored, and the limitations a senior reviewer would call out.
How it works
- sqlglot parses the input into an AST, and a small heuristic pass collects findings: function-wrapped partition columns, SELECT *, exact COUNT DISTINCT, missing broadcast hints. A sketch of this pass follows the list.
- Findings plus the raw SQL are sent to the Anthropic Messages API with a structured optimizer prompt. Output is constrained to a JSON envelope of rewrite + reasoning (also sketched below).
- An optional EXPLAIN runner (Spark or Snowflake) executes both plans against a shadow database and reports the cost delta back to the CLI.
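A minimal sketch of the heuristic pass, assuming an illustrative finding vocabulary and a hard-coded partition-column list; the repo's actual identifiers and configuration will differ.

```python
import sqlglot
from sqlglot import exp

# Assumed partition-column list; a real run would read this from config.
PARTITION_COLUMNS = {"event_date"}

def collect_findings(sql: str, dialect: str = "spark") -> list[str]:
    ast = sqlglot.parse_one(sql, read=dialect)
    findings = []

    # SELECT * forces the engine to read every column.
    if next(ast.find_all(exp.Star), None) is not None:
        findings.append("select_star")

    # A partition column wrapped in a function defeats partition pruning.
    for func in ast.find_all(exp.Func):
        for col in func.find_all(exp.Column):
            if col.name in PARTITION_COLUMNS:
                findings.append(f"function_wrapped_partition_column:{col.name}")

    # Exact COUNT(DISTINCT ...) is often replaceable with an approximate count.
    for count in ast.find_all(exp.Count):
        if isinstance(count.this, exp.Distinct):
            findings.append("exact_count_distinct")

    return findings

print(collect_findings("SELECT * FROM events WHERE year(event_date) = 2024"))
```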
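And a sketch of the Claude call itself. The model id, prompt text, and envelope field names here are assumptions for illustration; the real prompt template is versioned in the repo.

```python
import json
import anthropic

# Stand-in for the versioned optimizer prompt shipped in the repo.
OPTIMIZER_PROMPT = (
    'You are a SQL optimizer. Reply with JSON only: '
    '{"rewrite": "<sql>", "reasoning": ["<step>", ...]}'
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def request_rewrite(sql: str, findings: list[str]) -> dict:
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # example model id
        max_tokens=2048,
        system=OPTIMIZER_PROMPT,
        messages=[
            {"role": "user", "content": json.dumps({"sql": sql, "findings": findings})}
        ],
    )
    # The prompt constrains the reply to the JSON envelope of rewrite + reasoning.
    return json.loads(message.content[0].text)
```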
Scoring rubric
- Keyword overlap with ground-truth labels in corpus/ground_truth.yaml: coarse, but cheap to run on every change. A sketch follows the list.
- EXPLAIN cost delta when a Spark or Snowflake engine is reachable; treated as approximate, not authoritative.
- Human review on the full 50-query corpus, recorded in the repo with the reviewer's rationale per query.
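A sketch of the keyword-overlap scorer from the first rubric item; the YAML layout and the query id are assumed for illustration, not the repo's actual schema.

```python
import yaml

def keyword_overlap(suggestion: str, labels: list[str]) -> float:
    """Fraction of ground-truth keywords that appear in the suggestion text."""
    text = suggestion.lower()
    hits = sum(1 for label in labels if label.lower() in text)
    return hits / len(labels) if labels else 0.0

with open("corpus/ground_truth.yaml") as f:
    ground_truth = yaml.safe_load(f)  # assumed shape: {query_id: [keyword, ...]}

# "q02" is a made-up query id for illustration.
print(keyword_overlap("Broadcast the small dimension table", ground_truth["q02"]))
```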
Limitations
- Claude can hallucinate column or table names that are not in the schema, especially for less common dialects; a mitigation sketch follows this list.
- EXPLAIN-based scoring is approximate: a "cheaper" plan can still be slower in practice.
- Dialect differences between Spark SQL and Snowflake surface edge cases (QUALIFY, lateral views, named struct fields).
- Suggestions require human review before applying to production. The CLI defaults to printing diffs, never auto-rewriting files.
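One cheap guard against the first failure mode, hallucinated names, is to check every table a rewrite references against a known schema before showing it. A sketch, with an assumed table list:

```python
import sqlglot
from sqlglot import exp

# Assumed known-table set; a real check would pull this from the catalog.
KNOWN_TABLES = {"orders", "users", "events"}

def unknown_tables(rewrite: str, dialect: str = "spark") -> set[str]:
    """Return any table names in the rewrite that are not in the schema."""
    ast = sqlglot.parse_one(rewrite, read=dialect)
    return {t.name for t in ast.find_all(exp.Table)} - KNOWN_TABLES

print(unknown_tables("SELECT * FROM orders o JOIN order_facts f ON o.id = f.id"))
# -> {'order_facts'}
```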
Live mode
Want to run this against a real Claude key?
The on-page demo uses pre-computed responses for security: no API keys are bundled into the client. The GitHub repo ships the full CLI plus an optional Cloudflare Worker proxy at src/workers/sql-optimizer-proxy/ for live API calls when self-hosted with your own ANTHROPIC_API_KEY. When you deploy the Worker, set PUBLIC_LIVE_DEMO_URL at build time and the demo swaps fixture lookup for a Worker fetch. The default build never reaches the network.
Architecture
How it works under the hood
Sequence diagram from the project content. Mirrors the animated hero diagram, with the additional EXPLAIN-runner branch the hero collapses into the Verified node.
```mermaid
sequenceDiagram
  participant U as User
  participant CLI as sql-optimizer CLI
  participant A as Analyzer (sqlglot + heuristics)
  participant C as Claude API
  participant E as EXPLAIN runner (optional)
  U->>CLI: analyze query.sql
  CLI->>A: parse + collect findings
  A-->>CLI: AST + findings
  CLI->>C: optimizer_prompt + sql + findings
  C-->>CLI: rewrite + reasoning
  CLI->>E: EXPLAIN original / EXPLAIN rewrite
  E-->>CLI: cost deltas
  CLI-->>U: markdown diff + cost report
```
Why heuristics + LLM, not just the LLM? Sending raw SQL to Claude works, but the model spends most of its budget re-deriving facts a parser already knows: which columns appear in the projection, where the partition column lives, whether the predicate is wrapped in a function. A small sqlglot pass extracts those cheaply and ships them as findings, leaving the model to decide what to do with them.
Why EXPLAIN is optional. EXPLAIN cost numbers are informative, not ground truth. They reflect the planner’s stale statistics and its own broadcast threshold. The CLI treats them as a tiebreaker between candidate rewrites, never as the deciding signal.
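A sketch of that tiebreaker logic, assuming candidates carry an id and a heuristic score; the repo's actual types and field names will differ.

```python
def choose_candidate(
    candidates: list[dict],
    explain_costs: dict[str, float] | None = None,
) -> dict:
    """Pick a rewrite by heuristic score; use EXPLAIN cost only to break ties."""
    top = max(c["heuristic_score"] for c in candidates)
    tied = [c for c in candidates if c["heuristic_score"] == top]
    if len(tied) == 1 or not explain_costs:
        return tied[0]
    # EXPLAIN cost is consulted only among heuristic ties, never to override.
    return min(tied, key=lambda c: explain_costs.get(c["id"], float("inf")))
```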
Prompt sensitivity. Small changes to the optimizer prompt materially change suggestion quality. The prompt template is versioned in the repo, and corpus tests pin a known prompt version so regressions surface in CI rather than in production.
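One way such a pin can look in pytest, assuming the prompt lives at prompts/optimizer_prompt.txt with its recorded hash alongside (both paths are assumptions):

```python
import hashlib
from pathlib import Path

def test_prompt_template_is_pinned():
    """Fail CI if the optimizer prompt changes without re-running the corpus."""
    prompt = Path("prompts/optimizer_prompt.txt").read_bytes()
    recorded = Path("prompts/optimizer_prompt.sha256").read_text().strip()
    assert hashlib.sha256(prompt).hexdigest() == recorded, (
        "Prompt template changed: re-run the corpus benchmark and update the pin"
    )
```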
Stack
What this project uses, and why
- Python, the implementation language for the CLI and heuristic pass
- Typer, the CLI framework (entrypoint sketched below)
- Anthropic SDK, for the Messages API calls to Claude
- sqlglot, for SQL parsing, AST analysis, and dialect handling
- pytest, for the corpus regression tests
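A minimal Typer entrypoint matching the flow in the architecture diagram. It assumes the collect_findings and request_rewrite sketches above are importable; the command and option names are illustrative, not the repo's actual interface.

```python
from pathlib import Path

import typer

app = typer.Typer()

@app.command()
def analyze(
    query_file: Path,
    dialect: str = typer.Option("spark", help="SQL dialect for parsing"),
) -> None:
    """Parse a query, ask Claude for a rewrite, and print the result."""
    sql = query_file.read_text()
    findings = collect_findings(sql, dialect)  # heuristic pass, sketched earlier
    result = request_rewrite(sql, findings)    # Claude call, sketched earlier
    typer.echo(result["rewrite"])
    for step in result["reasoning"]:
        typer.echo(f"- {step}")

if __name__ == "__main__":
    app()
```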
See the full code
The CLI, the 50-query corpus, the prompt template, and the optional EXPLAIN runner are all in the repo. Self-hosted live mode goes through a small Cloudflare Worker that holds your key.