
How to review SQL code in BigQuery for duplication and errors

Steven Elliott · 26 October 2022 · 4 min read

Sometimes we can spend hours, days or even weeks writing SQL in BigQuery. Hopefully what you end up with is code that runs error free. Unfortunately, just because a query has run and completed without any errors, it does not mean it's error free! Joins can cause duplication, calculations can be incorrect, and columns can be in the wrong order for unioned tables. The possibilities are endless. Lucky us!

The larger the dataset, the harder it is to notice any discrepancies. It is good practice to have a logical process for checking through your SQL code to help identify errors. None of the checks below are foolproof, but they should help you spot issues. This article outlines what I do to avoid the most common mistakes.

Understanding your data

The best way to ensure you are extracting the correct data is to understand your data in the first place. Unfortunately, we do not always have that luxury: we often deal with lots of different data from lots of different companies, set up in many different ways.

The best way to understand these tables is to review the table schema so you understand each column, and then ask the people who have worked with the tables before (or the stakeholder directly) any questions you may have.
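A quick way to review a schema without leaving the query editor is BigQuery's INFORMATION_SCHEMA views. A minimal sketch, assuming placeholder project, dataset and table names (`my_project`, `my_dataset`, `orders`):

```sql
-- List every column in a table with its data type and nullability,
-- in the order the columns appear in the schema.
SELECT
  column_name,
  data_type,
  is_nullable
FROM `my_project.my_dataset`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'orders'
ORDER BY ordinal_position;
```

Knowing which columns can be NULL is particularly useful later, when you come to check joins.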

Order by

Using ORDER BY in BigQuery is a great way to spot unwanted duplication. This is especially true if the duplication is caused by a JOIN, as it should be fairly obvious when there are multiples of each line of data.

Sometimes it will take a little more hunting, but putting data in order lets you cluster groups of data together, making it easier to review and check against the original dataset.
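The idea above can be sketched in two ways, assuming a placeholder table `my_project.my_dataset.orders` with an `order_id` column that should uniquely identify each row:

```sql
-- Sort by the column(s) that should uniquely identify a row;
-- duplicates then appear as adjacent identical lines in the results.
SELECT *
FROM `my_project.my_dataset.orders`
ORDER BY order_id, order_date;

-- A more direct check: any key appearing more than once is a duplicate.
SELECT
  order_id,
  COUNT(*) AS row_count
FROM `my_project.my_dataset.orders`
GROUP BY order_id
HAVING COUNT(*) > 1;
```

If the second query returns no rows, the key is genuinely unique; if it returns anything, those are the rows to investigate.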

Test a subset of data

You can use ORDER BY as above to choose a selection of data that contains as many variations of the final dataset as possible. Analyse this output against the original dataset to ensure the final data extract behaves as you would expect. A subset lets you focus on a small selection of data; usually, if you can get the code to work for the small subset, it will work for the larger dataset too.
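In practice this usually means adding a narrow filter to the query under review. A minimal sketch, again using placeholder names (`orders`, `order_date`, `customer_id`):

```sql
-- Restrict the query to a single day of data that you can
-- realistically eyeball against the source, then sort it so
-- related rows sit together for comparison.
SELECT *
FROM `my_project.my_dataset.orders`
WHERE order_date = '2022-10-01'
ORDER BY customer_id, order_date;
```

A WHERE filter like this is preferable to LIMIT for testing, because LIMIT returns an arbitrary slice of rows rather than a coherent, checkable subset.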

Compare calculated fields

When you add a calculated field, the data is grouped together in the way you have defined. It is good practice to do a quick check against the original data, filtering it by whatever you are grouping on. Do the counts and sums correlate as they should? If so, bingo! If not, you may need to break up your SQL code and investigate why you are getting differences.
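The check above can be sketched as a pair of queries, assuming placeholder names (`orders`, `customer_id`, `order_value`) and a hypothetical customer ID `C123`:

```sql
-- The aggregated query under review.
SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(order_value) AS total_value
FROM `my_project.my_dataset.orders`
GROUP BY customer_id;

-- Spot-check one group against the raw rows: the number of rows
-- returned here, and the sum of order_value across them, should
-- match the aggregated figures for C123 in the query above.
SELECT *
FROM `my_project.my_dataset.orders`
WHERE customer_id = 'C123';
```

Picking two or three groups to check by hand, including an edge case such as a customer with NULLs in `order_value`, catches most aggregation mistakes.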

Quick and dirty row counts

A lot of duplication, or a reduction in rows, can appear once you start joining tables. This can happen for a multitude of reasons, such as joining on columns that contain NULL values. A really quick and easy way to find out if anything has changed is to make a note of the row count in the bottom right corner of the BigQuery results panel before and after the join. If there is a change, was it approximately what you expected?
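If you would rather not rely on eyeballing the results panel, the same before-and-after check can be written as a single query. A sketch, assuming placeholder tables `orders` and `customers` joined on `customer_id`:

```sql
-- Compare the row count of the base table with the row count
-- after the join. If rows_after > rows_before, the join key is
-- not one-to-one (duplication); if it is smaller on an INNER
-- JOIN, some keys failed to match (e.g. NULLs or missing rows).
SELECT
  (SELECT COUNT(*)
   FROM `my_project.my_dataset.orders`) AS rows_before,
  (SELECT COUNT(*)
   FROM `my_project.my_dataset.orders` AS o
   JOIN `my_project.my_dataset.customers` AS c
     ON o.customer_id = c.customer_id) AS rows_after;
```

Running this once per join as you build up a long query makes it much easier to pinpoint exactly which join changed the row count.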

Peer review

It should go without saying, but it is beneficial to implement peer reviews of your work. The more questions the reviewer asks, the better. Sometimes we look at the same piece of work for so long that we get lost in it and can't see certain issues. Having someone else look at your work and question why you wrote it the way you did helps you evaluate it better.

Hopefully the points above help you write less error-prone SQL in BigQuery. If you have any questions, or want us to check or peer review your code, then get in touch and we can book in a chat!
