When you let an LLM turn natural language into SQL, you need to know three things: is the SQL correct, will it run on your database, and is it efficient? SQLclMCP is an open-source framework that answers those questions by comparing LLM-generated SQL against human-written baselines on Oracle Database, using the Model Context Protocol (MCP) and a 500-question TPC-H benchmark. MCP keeps “how SQL is generated” behind a single HTTP API: the evaluator sends a natural-language question and gets back SQL, so you can swap models, prompts, or even the server implementation and still run the same evaluation. This article walks through the pipeline, shows how to run it and what gets measured, includes a few example graphs and tables, and covers the Oracle gotchas we fixed in the prompt.
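To make that contract concrete, here is a minimal sketch of the evaluator’s side of the call. The endpoint path, port, and JSON keys (`question`, `sql`) are illustrative assumptions, not the framework’s actual API:

```python
import requests

# Assumed endpoint and payload shape for illustration only; the real
# SQLclMCP server may expose a different route and JSON schema.
MCP_URL = "http://localhost:8080/generate-sql"

def generate_sql(question: str) -> str:
    """Send a natural-language question over HTTP and get back a SQL string."""
    resp = requests.post(MCP_URL, json={"question": question}, timeout=60)
    resp.raise_for_status()
    return resp.json()["sql"]

# Swapping the model, prompt, or server behind MCP_URL leaves this call
# unchanged, so the same evaluation loop can score any generator.
print(generate_sql("Total revenue per region for orders placed in 1995"))
```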
Why This Matters
Natural language to SQL (NL2SQL) works well for ad-hoc questions and app backends, right up until the model returns the wrong rows, or a query that fails or runs too slowly in production. To ship with confidence you need three guarantees: the result set is correct (the same logical result as the intended query), the SQL executes on your database without syntax or runtime errors, and it is efficient enough (reasonable latency and plan quality, checked for example with Oracle’s EXPLAIN PLAN). The only reliable way to get those guarantees is to compare LLM output against a gold standard on a real database, in a repeatable pipeline, so you can improve prompts, compare models, and catch dialect gotchas (Oracle vs MySQL syntax, date handling with EXTRACT, row limiting without LIMIT, and the like). This framework gives you that pipeline.
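As a rough sketch of what those three checks look like against Oracle (using python-oracledb; the connection details, result comparison, and plan handling here are simplified assumptions, not the framework’s exact logic):

```python
from collections import Counter
import oracledb

# Assumed credentials and DSN for illustration; the framework's own harness
# manages its connection to the TPC-H schema.
conn = oracledb.connect(user="tpch", password="secret", dsn="localhost/FREEPDB1")

def check(candidate_sql: str, gold_sql: str) -> dict:
    """Run the three checks on one candidate query: does it execute, does it
    match the gold result, and what does the optimizer plan look like.
    Ordering, rounding, and timeout rules are deliberately simplified."""
    cur = conn.cursor()
    out = {"executes": False, "correct": False, "plan": None}

    # 1. Executability: does Oracle accept and run the SQL at all?
    try:
        cur.execute(candidate_sql)
        candidate_rows = cur.fetchall()
        out["executes"] = True
    except oracledb.DatabaseError as exc:
        out["error"] = str(exc)
        return out

    # 2. Correctness: same multiset of rows as the human-written baseline.
    cur.execute(gold_sql)
    out["correct"] = Counter(candidate_rows) == Counter(cur.fetchall())

    # 3. Efficiency signal: capture the estimated plan for later review.
    cur.execute("EXPLAIN PLAN FOR " + candidate_sql)
    cur.execute("SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY())")
    out["plan"] = "\n".join(r[0] for r in cur.fetchall())
    return out
```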