How to flatten tables with user-defined functions in BigQuery

Flattening tables in Google BigQuery can be painful when doing multiple fields with UNNEST. User-defined functions (UDFs) make life easier and this one will do it for you automatically.

As we all know Google is continuously improving and expanding the Google Cloud Platform and BigQuery has truly benefited from this. It’s taken a simple scalable data warehouse into what is now a comprehensive and dynamic system. As you may guess, I’m a massive fan of BigQuery and it’s user-defined functions.

BigQuery offers a huge amount of functionality including Data Definition Language (DDL) and Data Manipulation Language (DML), however, sometimes there simply isn’t the function to help you complete what you are doing. Often in the past, I would end up accessing BigQuery via Python or SAS and applying the missing function.

User-Defined Functions (UDFs) cover some of the current gaps, and since they are now persistent (I’ll explain this further later), you can share and re-use these with your colleagues.

What is a UDF?

A UDF simply is a function you create that applies a logic not available within the standard SQL function to your data. Generally, you will be using JavaScript, SQL, Simple Logic or base Python (albeit in a rather hacky approach).

How do you generate UDFs?

Using BigQuery’s comprehensive guides, you can see the general structure is very similar to some DDL statements, however, it can take a little getting used to. I go into this in more detail over on Medium with examples.

How to flatten tables with UDFs

Imagine a scenario where you have a Google Analytics 4 (GA4) or Firebase (FB) dataset where you need to flatten multiple fields. Normally you would need to write out the standard UNNEST script of:

Code language: JavaScript (javascript)

However, that is super painful to repeat multiple times and likely to produce errors. So using a UDF, scripting and the fab SQL execute statement, I have generated the code below which will automatically UNNEST your GA4 data.

Be careful to select the specific dates and events you wish to review. You can even remove the SQL execute and simply copy the query into another window or a tool such as Python.

DECLARE SQLRUN STRING DEFAULT ''; CREATE TEMP FUNCTION ga4_firebase( key1 STRING, params ARRAY <STRUCT <key STRING, value STRUCT <string_value STRING, int_value INT64, float_value FLOAT64, double_value FLOAT64 >>>) AS ( ( SELECT param.value FROM UNNEST(params) param WHERE param.key=key1) ); # ***************************************** # Set the tables names and define row numbers and update times # ***************************************** SET SQLRUN = ( SELECT CONCAT('SELECT EVENT_NAME , ', STRING_AGG(CONCAT("ga4_firebase('",key,"', event_params).", event_parameter_value, " AS ", key, ',' ), '\n'), ' FROM `project.analytics_yyyymmdd.events_*` WHERE _TABLE_SUFFIX = "20210801" ORDER BY 1' ) FROM ( SELECT DISTINCT event_name, params.key AS key, (CASE WHEN params.value.string_value IS NOT NULL THEN 'string_value' WHEN params.value.int_value IS NOT NULL THEN 'int_value' WHEN params.value.double_value IS NOT NULL THEN 'double_value' WHEN params.value.float_value IS NOT NULL THEN 'float_value' END ) AS event_parameter_value, FROM `project.analytics_yyyymmdd.events_*`, UNNEST(event_params) AS params WHERE _table_suffix BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)) AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) ) AND EVENT_NAME = 'page_view' ); # ************************************* # Execute the query # ************************************* EXECUTE IMMEDIATE (SQLRUN);
Code language: PHP (php)

Saving and persisting UDFs

I mentioned earlier that recently Google introduced persistent UDFs, ones that can be saved in your database and this is relatively simple to do. You simply need a destination dataset and the name of your UDF. This will then save in your dataset and include a function flag to make them easy to spot. You can then see the function values when you click on them and apply them to any code using the dataset.function_name.

Code language: CSS (css)

In fact, a great UDF community has formed an open-sourced UDF project which you can actually apply to your own work, including many statistical techniques such as P-Values and much more simply by using the following syntax:

SELECT bqutil.fn.udf_name(variable)
Code language: CSS (css)

Be sure to check them out!

Written by

Lead data specialist, and data programming obsessive. Loves statistical languages especially Python with a passion for Google Cloud Platform for apps and web marketing data.

Subscribe to our newsletter: