How to flatten tables with user-defined functions in BigQuery

Flattening tables in Google BigQuery can be painful when doing multiple fields with UNNEST. User-defined functions (UDFs) make life easier and this one will do it for you automatically.

As we all know Google is continuously improving and expanding the Google Cloud Platform and BigQuery has truly benefited from this. It’s taken a simple scalable data warehouse into what is now a comprehensive and dynamic system. As you may guess, I’m a massive fan of BigQuery and it’s user-defined functions.

BigQuery offers a huge amount of functionality including Data Definition Language (DDL) and Data Manipulation Language (DML), however, sometimes there simply isn’t the function to help you complete what you are doing. Often in the past, I would end up accessing BigQuery via Python or SAS and applying the missing function.

User-Defined Functions (UDFs) cover some of the current gaps, and since they are now persistent (I’ll explain this further later), you can share and re-use these with your colleagues.

What is a UDF?

A UDF simply is a function you create that applies a logic not available within the standard SQL function to your data. Generally, you will be using JavaScript, SQL, Simple Logic or base Python (albeit in a rather hacky approach).

How do you generate UDFs?

Using BigQuery’s comprehensive guides, you can see the general structure is very similar to some DDL statements, however, it can take a little getting used to. I go into this in more detail over on Medium with examples.

How to flatten tables with UDFs

Imagine a scenario where you have a Google Analytics 4 (GA4) or Firebase (FB) dataset where you need to flatten multiple fields. Normally you would need to write out the standard UNNEST script of:

(SELECT VALUE.INT_VALUE FROM UNNEST(EVENT_PARAMS) WHERE KEY = 'CCC') 

However, that is super painful to repeat multiple times and likely to produce errors. So using a UDF, scripting and the fab SQL execute statement, I have generated the code below which will automatically UNNEST your GA4 data.

Be careful to select the specific dates and events you wish to review. You can even remove the SQL execute and simply copy the query into another window or a tool such as Python.


 DECLARE
  SQLRUN STRING DEFAULT '';
CREATE TEMP FUNCTION
  ga4_firebase( key1 STRING,
    params ARRAY <STRUCT <key STRING,
    value STRUCT <string_value STRING,
    int_value INT64,
    float_value FLOAT64,
    double_value FLOAT64 >>>) AS ( (
    SELECT
      param.value
    FROM
      UNNEST(params) param
    WHERE
      param.key=key1) );
  # *****************************************************************************
  # Set the tables names and define row numbers and update times
  # *****************************************************************************
SET
  SQLRUN = 
  (
  SELECT
    CONCAT('SELECT EVENT_NAME , ', STRING_AGG(CONCAT("ga4_firebase('",key,"',    event_params).", event_parameter_value, " AS ", key, ',' ), '\n'), '  FROM  `project.analytics_yyyymmdd.events_*` WHERE _TABLE_SUFFIX = "20210801"  ORDER BY 1' )
  FROM (
    SELECT
      DISTINCT event_name,
      params.key AS key,
      (CASE
          WHEN params.value.string_value IS NOT NULL THEN 'string_value'
          WHEN params.value.int_value IS NOT NULL THEN 'int_value'
          WHEN params.value.double_value IS NOT NULL THEN 'double_value'
          WHEN params.value.float_value IS NOT NULL THEN 'float_value'
      END
        ) AS event_parameter_value,
    FROM
      `project.analytics_yyyymmdd.events_*`,
      UNNEST(event_params) AS params
    WHERE
      _table_suffix BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
      AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) )
     AND  EVENT_NAME = 'page_view'
    );
  # *****************************************************************************
  # Execute the query
  # *****************************************************************************
EXECUTE IMMEDIATE
  (SQLRUN);

Saving and persisting UDFs

I mentioned earlier that recently Google introduced persistent UDFs, ones that can be saved in your database and this is relatively simple to do. You simply need a destination dataset and the name of your UDF. This will then save in your dataset and include a function flag to make them easy to spot. You can then see the function values when you click on them and apply them to any code using the dataset.function_name.

CREATE FUNCTION  UDFS.ga4_firebase

In fact, a great UDF community has formed an open-sourced UDF project which you can actually apply to your own work, including many statistical techniques such as P-Values and much more simply by using the following syntax:

SELECT bqutil.fn.udf_name(variable)

Be sure to check them out!

Share:
Written by

Subscribe to our newsletter: