How to review SQL code in BigQuery for duplication and errors
Sometimes we can spend hours, days or even weeks writing SQL in BigQuery. Hopefully what you end up with is code that runs error-free. Unfortunately, just because code has run and completed without any errors does not mean it is error-free! Joins can cause duplication, calculations can be incorrect, and columns can be in the wrong order in unioned tables. The possibilities are endless, lucky us!
The larger the dataset, the harder it is to notice discrepancies. It is good practice to have a process for checking through your SQL code in a logical way to help identify errors. None of the checks below are foolproof, but they should help you spot issues. This article outlines what I do to avoid the most common mistakes.
Understanding your data
The best way to ensure you are extracting the correct data is to understand the data in the first place. Unfortunately, we do not always have that luxury, as we often deal with many different datasets from many different companies, set up in all sorts of ways.
The best way to understand these tables is to review the table schema so you understand each column, then put any questions you have to the people who have worked with the tables before (or to the stakeholder directly).
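In BigQuery, a quick way to review a schema is to query the dataset's INFORMATION_SCHEMA views. A minimal sketch, assuming a hypothetical project `my-project`, dataset `sales` and table `orders` (replace with your own names):

```sql
-- List every column in the hypothetical `orders` table, with its type
-- and whether it allows NULLs, in the order the columns are defined.
SELECT
  column_name,
  data_type,
  is_nullable
FROM `my-project.sales.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'orders'
ORDER BY ordinal_position;
```

Knowing which columns are nullable is especially useful later, since joining on nullable columns is a common source of row-count surprises.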
Order your data
Using ORDER BY in BigQuery is a great way to spot unwanted duplication. This is especially true when the duplication is caused by a JOIN, as it should be fairly obvious when there are multiples of each row of data.
Sometimes it will take a little more hunting, but putting data in order allows you to cluster together groups of data making it easier to review and potentially check against the original dataset.
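As a sketch of both approaches, assuming a hypothetical joined table `orders_with_customers` keyed by `order_id` (names are illustrative only):

```sql
-- Eyeball the data: sorting clusters repeated keys next to each other,
-- so duplicated rows from a join stand out.
SELECT *
FROM `my-project.sales.orders_with_customers`
ORDER BY order_id, customer_id;

-- Or confirm duplication directly rather than scanning by eye:
-- any order_id appearing more than once is returned here.
SELECT
  order_id,
  COUNT(*) AS row_count
FROM `my-project.sales.orders_with_customers`
GROUP BY order_id
HAVING COUNT(*) > 1;
```

The GROUP BY ... HAVING variant is handy on large tables where scrolling through sorted output is impractical.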
Test a subset of data
Use the ORDER BY technique above to choose a selection of data that contains as many variations of the final dataset as possible. Analyse this output against the original dataset to ensure the final extract behaves the way you expect. Working with a subset lets you focus on a small selection of data, and usually, if the code works for the small subset, it will work for the larger dataset too.
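One way to carve out such a subset is to filter on a handful of known keys that cover the main variations. A minimal sketch, with hypothetical customer IDs:

```sql
-- Restrict the query to a few hand-picked customers chosen to cover
-- the main variations (e.g. one with many orders, one with none,
-- one with NULLs in joined columns). IDs here are made up.
SELECT *
FROM `my-project.sales.orders`
WHERE customer_id IN ('C001', 'C014', 'C102')
ORDER BY customer_id, order_date;
```

The same WHERE clause can be applied to both the original table and your final query, making a side-by-side comparison straightforward.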
Compare calculated fields
When you use calculated fields, the data is grouped together in the way you have defined. It is good practice to do a quick check against the original data, filtering the original data by the values you are grouping on. Do the counts and sums correlate as they should? If so, bingo! If not, you may need to break up your SQL and investigate why you are getting differences.
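Concretely, the check looks something like this, assuming a hypothetical `orders` table with an `order_value` column:

```sql
-- The aggregated query under review: one row per customer.
SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(order_value) AS total_value
FROM `my-project.sales.orders`
GROUP BY customer_id;

-- Spot-check a single group against the raw rows. The count and sum
-- here should match that customer's row in the query above exactly.
SELECT
  COUNT(*) AS order_count,
  SUM(order_value) AS total_value
FROM `my-project.sales.orders`
WHERE customer_id = 'C001';  -- hypothetical ID
```

Checking two or three groups this way is usually enough to catch a miswritten GROUP BY or an aggregate applied to the wrong column.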
Quick and dirty row numbers
A lot of duplication, or a reduction in rows, can appear once you start joining tables. This can happen for a multitude of reasons, such as joining on columns that contain null values. A really quick and easy way to find out whether anything has changed is to note the row count shown in the results pane of BigQuery before and after the join. If there is a change, is it approximately what you expected?
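If you would rather not rely on the results pane, COUNT(*) gives the same answer explicitly. A sketch with the same hypothetical tables as above:

```sql
-- Row count before the join:
SELECT COUNT(*) AS row_count
FROM `my-project.sales.orders`;

-- Row count after the join. For a one-to-many-free join on a unique,
-- non-null key this should equal the count above. A jump suggests
-- duplicate keys on the right-hand side; a drop (with an INNER JOIN)
-- suggests unmatched or NULL join keys.
SELECT COUNT(*) AS row_count
FROM `my-project.sales.orders` AS o
LEFT JOIN `my-project.sales.customers` AS c
  ON o.customer_id = c.customer_id;
```

Running the two counts as separate statements in the same script makes the before/after comparison easy to repeat after any change to the join.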
Peer review your work
It should go without saying, but it is beneficial to have peers review your work. The more questions they ask, the better. Sometimes we look at the same piece of work for so long that we get lost in it and can no longer see certain issues. Having someone else question why you have written the code the way you have helps you evaluate it more objectively.
Hopefully the points above help you write less error-prone SQL in BigQuery. If you have any questions, or would like us to check or peer review your code, get in touch and we can book in a chat!