The promise (and reality) of natural language to SQL

Unless you've spent the last year in a particularly deep cave, you've probably heard the phrase "just ask your data a question" thrown around with wild abandon. The dream is seductive: instead of bothering your already overwhelmed data team with yet another ad hoc query, you simply type "How many users spent more than 20 minutes on site yesterday?" and watch the magic happen. SQL appears, results flow, insights materialise.

I'll be honest, I've been one of the people promising this future. But having spent considerable time poking, prodding, and occasionally pleading with these tools, I think it's time for a more nuanced conversation about where we actually are. Because here's the thing: Natural Language to SQL is genuinely impressive. It's also genuinely limited. And understanding the gap between those two truths is crucial if you want to get real value from these technologies.

What are we actually talking about?

At its core, Natural Language to SQL (NL-SQL) does exactly what it says on the tin. You write a question in normal human language, and an AI model translates that into SQL that queries your specific data set/project/table. In the BigQuery world, this means turning on the Gemini API, clicking "generate SQL with Gemini," and watching your question transform into a fully working query within seconds.

This isn't some distant future technology, it's available right now in BigQuery. And it works. Sort of.

March 2026 update

Since we first wrote this, the underlying models powering NL-SQL in Google Cloud particularly via Gemini, have improved noticeably in handling longer prompts and more detailed schema context. Larger context windows mean you can now pass more metadata (table descriptions, example queries, glossary definitions) directly into a single interaction.

The metadata problem

Here's where things get interesting. We recently ran a webinar where we put these tools through their paces using the Google Analytics 4 data for the Google Store. For those who've worked with GA4 data in BigQuery, you'll understand the pain, the event-based schema with its nested parameters is one of the most challenging structures to query correctly.

We asked a simple question: "How many sessions came from Google?"

The NL-SQL produced a query that, on first inspection, looked perfect. It counted distinct combinations of user pseudo ID and session ID, filtered by source. Textbook stuff. But look closer and you'll spot two problems. First, it used traffic_source.source instead of the session-scoped source from event parameters or the newer session_traffic_source_last_click.manual_campaign.source field, a subtle but important distinction in attribution. Second, it filtered for "Google" with a capital G, while the actual data uses lowercase.

March 2026 update

NL-SQL tools are better at matching natural language to column names, and BigQuery's Gemini integration more reliably handles nested fields and UNNEST patterns. But improvements are probabilistic, not deterministic. Well-designed table descriptions and example queries can reduce error rates significantly - but human validation, especially for attribution logic and business-defined metrics, remains essential.

This isn't a failure of the AI. It's a failure of context.

The model doesn't know your data. It doesn't know that your organisation uses lowercase values, or that session attribution lives in a different place than first-user attribution. It's making educated guesses based on schema names and general knowledge of analytics concepts. Sometimes those guesses are right. Often, they're plausibly wrong, which is arguably worse than being obviously wrong.

The semantic layer is everything

In our second example, we simply asked NL2SQL to show how many bots were on site. The immediate output was nonsensical - the SQL generated counts the number of unique user_pseudo_ids where user_pseudo_ids is NULL…

The problem here is that this information is held under event_params - a nested column that NL2SQL cannot read information about just from looking at column names.

So how do you fix this? Metadata. Lots of it.

When we added table and column descriptions explaining that bot detection flags could be found in event parameters, the model correctly started pulling from the right place. When we pre-unnested the data with clearly named columns like session_source, the attribution problem disappeared.

This is the uncomfortable truth about conversational analytics: the more effort you put into describing your data, the better results you get out. There's no magic shortcut. If you want these tools to understand that "conversions" means your specific purchase event, or that "engaged users" follows your particular definition, you need to tell them. Explicitly. Repeatedly.

Think of it like onboarding a new analyst. You wouldn't expect them to intuitively understand your attribution model, your naming conventions, or your business-specific metrics on day one. The AI is no different, except it forgets everything between prompts unless you've baked that knowledge into the system instructions.

March 2026 update

The industry has shifted toward formal semantic layers (LookML, dbt Semantic Layer, modern BI metric tools) as the grounding layer for AI-driven analytics. When NL-SQL runs against curated models with clearly defined metrics, hallucination drops dramatically. The emerging best practice isn't "add more metadata" - it's "don't expose raw data to AI in the first place."

The tool landscape

Currently, there are several ways to access NL-SQL capabilities in the Google Cloud ecosystem, each with different trade-offs.

BigQuery's built-in Gemini integration is the most accessible starting point. It's quick, free (beyond standard Gemini token costs), and genuinely useful for analysts who have enough SQL knowledge to critique and refine the output. This is the key point: it's a productivity accelerator for people who already understand SQL, not a replacement for SQL knowledge.

Looker Studio Pro's Conversational Analytics promises a more user-friendly interface. You create agents, connect them to data sources, and let stakeholders ask questions directly. In theory, this is exactly what data teams have been crying out for, a way to handle ad hoc questions without becoming a permanent bottleneck.

In practice? It's frustratingly inconsistent. We had an agent working perfectly one day, returning session-attributed traffic correctly. The next day, same agent, same question, different answer, it was arguing with us about whether it was doing session-level attribution (it wasn't). By day 3 everything fell apart, it refused to talk to us entirely, complaining about access which was unchanged. Add to all of this a level of obscurity between questions asked and BQ costs, which could get away from you if not properly managed. The technology will get there, the potential is obvious, but right now it's not something I'd put in front of stakeholders expecting reliable, trustworthy results.

The Conversational Analytics API and Agent Development Kit (ADK) represent the more powerful end of the spectrum. These let you build custom applications with multiple data sources, sophisticated prompting, and the ability to embed conversational interfaces wherever your users already work, Slack, Looker Studio dashboards, and internal tools.

The trade-off is complexity. You're now in the territory of Python development, cloud infrastructure, and understanding concepts like MCP toolboxes. But you also get far more control over the "semantic layer", those crucial descriptions, glossaries, and relationship mappings that make the difference between useful and useless responses.

March 2026 update

BigQuery + Gemini is more predictable when working against curated views. Looker Studio Pro has improved governance and query auditability, though conversational agents still aren't fully deterministic. Custom implementations via Google Cloud APIs are increasingly where serious deployments land - RAG patterns, structured system prompts, and automated SQL validation are now standard practice.

The pattern that emerges

If there's one insight I'd want you to take away, it's this: there's a direct relationship between effort invested and value extracted.

NL-SQL in BigQuery is low effort but only useful for analysts. Looker Studio Pro is moderate effort but currently too buggy for production use. The API and ADK approaches require significant investment but unlock genuine democratisation of data across an organisation.

The deciding factor isn't the technology, it's the semantic layer you build around your data. The descriptions, the glossaries, the example queries, the business context. Without that investment, even the most sophisticated tools will give you plausible-sounding nonsense.

March 2026 update

The organisations extracting real value share three habits: restricting accessible tables, pre-defining metrics, and programmatically validating generated SQL before execution. Reliability comes from tighter constraints, not smarter models.

What about dashboards?

A reasonable question at this point: if conversational analytics works, do we still need dashboards?

Absolutely. Dashboards remain essential for consistent, repeatable reporting, the kind of thing multiple stakeholders need to glance at and get the same answer. Conversational interfaces are brilliant for exploration, ad hoc questions, and surfacing unexpected insights. But when the CFO needs to see last month's revenue broken down by region, you want a dashboard, not a chatbot.

This isn't an either/or proposition. It's about using the right tool for the right job.

March 2026 update

Rather than replacing dashboards, teams are layering conversational interfaces on top of curated reporting environments - governed KPIs at the top, exploratory chat within safe data boundaries. Chat works best as an augmentation layer, not a replacement.

Getting started without getting lost

If you're thinking about exploring this space, here's my honest advice.

First, assess your data readiness. If your data is a mess, inconsistent naming, undocumented transformations, tribal knowledge locked in people's heads, these tools will amplify the mess, not fix it. Get your house in order first.

Second, start with a specific use case. Don't try to build a general-purpose "ask anything" solution. Identify a particular stakeholder group with a particular type of question, build for that, and learn from the experience.

Third, budget time for the semantic layer. This is where the real work happens. Every hour spent documenting your data, defining your metrics, and creating clear descriptions will pay dividends in the quality of AI-generated responses.

Finally, keep humans in the loop. Monitor what questions people are asking, what answers the system is giving, and where it's going wrong. Feed that learning back into your system instructions. This isn't a "set and forget" technology.

March 2026 update

Conversational analytics should be treated like any production system. Forward-thinking teams are creating benchmark question sets, tracking accuracy over time, measuring cost per answered question, and logging hallucination types. If you're not measuring answer accuracy, you're operating on vibes.

The honest assessment

Natural Language to SQL represents genuine progress. It can meaningfully accelerate workflows for data-literate users and, with sufficient investment, can put analytical capabilities in the hands of stakeholders who'd never write SQL themselves.

But it's not magic. It requires good data, extensive metadata, and realistic expectations. The gap between the marketing promise ("just ask your data anything!") and the practical reality ("carefully instruct the AI about your specific context and verify its answers") is substantial.

That gap will close over time. The technology is improving rapidly, and each generation of models gets better at understanding context and nuance. But right now, in early 2025, the organisations getting value from these tools are the ones approaching them with eyes open, investing in the unsexy work of data documentation while keeping humans firmly in the loop.

If you want to explore how conversational analytics could work for your organisation, or if you need help getting your data foundations in place first, get in touch. We've built these solutions, broken these solutions, and learned quite a lot in the process. Sometimes the most valuable insight is knowing which pitfalls to avoid.

March 2026 update

Models, context handling, and tooling have all improved. But the fundamental truth holds: AI is a force multiplier for well-modelled, well-documented data and a chaos amplifier for poorly governed data. The organisations succeeding here aren't chasing novelty. They're investing in modeling discipline, semantic clarity, and human oversight.

This post is based on a recent Measurelab webinar on Conversational Analytics. Watch the full recording to see the tools in action, including the parts where they didn't quite work as expected.

FAQs

Is Natural Language to SQL accurate enough for financial reporting?

No. In 2026, NL-SQL is considered an exploratory tool. For deterministic financial reporting, traditional dashboards or governed semantic layers (like LookML) remain the industry standard to ensure 100% data accuracy and prevent costly hallucinations.

How do I stop AI from hallucinating SQL queries?

The most effective way to eliminate hallucinations is to provide "grounding" via a semantic layer. This includes adding rich table descriptions, providing example SQL queries (Few-Shot Prompting), and using pre-joined views to flatten complex nested structures like GA4 event parameters.

What is the difference between NL2SQL and a Semantic Layer?

NL2SQL is the generative engine that performs the translation from English to code, while the Semantic Layer is the "map" it uses to understand context. Without a semantic layer, the AI guesses your business logic; with it, the AI follows your predefined metrics and relationship rules.

Ask your data a question: the promise (and reality) of natural language to SQL

What are we actually talking about?

The metadata problem

This isn't a failure of the AI. It's a failure of context.

The semantic layer is everything

The tool landscape

The pattern that emerges

What about dashboards?

Getting started without getting lost

The honest assessment

FAQs

Is Natural Language to SQL accurate enough for financial reporting?

How do I stop AI from hallucinating SQL queries?

What is the difference between NL2SQL and a Semantic Layer?

Suggested content

The AI revolution already happened

Measurelab awarded Google Cloud Marketing Analytics Specialisation

BigQuery AI.GENERATE tutorial: turn SQL queries into AI-powered insights