How to access BigQuery data from server-side GTM (via Firestore)

This solution allows data held in BigQuery to be queried and stored in Firestore, where it can then be accessed on the fly by server-side GTM. This allows tags to be augmented with additional information before they are sent to their endpoints. For example:

  1. A purchase event in the client-side GTM container triggers an HTTP request to the server-side container
  2. The HTTP request contains the event name ‘Purchase’ and the item ID ‘14589’
  3. The item ID is used to look up item information using a Firestore lookup variable
  4. All information about that specific item is returned to the server container, where it can be added to the event tag before it is sent to GA4 (sketched below)
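As a purely hypothetical sketch of that enrichment step, with made-up collection, field names and values:

const itemDoc = {
  // Hypothetical Firestore document stored under items/14589
  item_id: '14589',
  item_name: 'Trail Running Shoe',
  item_category: 'Footwear',
  price: 89.99
};

// Incoming event from the client-side container
const incomingEvent = { event_name: 'Purchase', item_id: '14589' };

// Server-side enrichment before the event is forwarded to GA4
const enrichedEvent = { ...incomingEvent, ...itemDoc };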

The Google Cloud services used are:

  • BigQuery: This is where the main tables of data live. They can either be queried as a one-off to build a JSON file of the desired data for Firestore, or they can be queried regularly so that up-to-date data keeps feeding into Firestore
  • Cloud Storage: Cloud Storage acts as the intermediary between BigQuery and Firestore. When a query is run in BigQuery, its results are stored in a storage bucket in the form of a JSON file. This JSON file is then uploaded to Firestore using a Cloud Function
  • Cloud Function: The Cloud Function in this case is used to take the JSON file stored in the Cloud Storage bucket and import it into Firestore. The function is triggered by a Pub/Sub message
  • Google Workflows: Google Workflows is used to dictate the order and flow of tasks that need to take place. It uses a YAML definition to break down the various stages and execute each one before moving on to the next. This is where the BigQuery table is queried and the results stored. Only then is a Pub/Sub message sent to the function to load the file into Firestore. The workflow can also be triggered using Cloud Scheduler
  • Cloud Scheduler: This allows the workflow to be triggered at specific intervals. It uses cron notation, as in the example below
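For example, a Cloud Scheduler cron expression of 0 6 * * * would trigger the workflow once a day at 06:00; the exact schedule is entirely up to you.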

Contents

  1. Prerequisites
  2. Create a cloud function
  3. Create a Google Workflow
  4. Test
  5. Accessing the data from ssGTM
  6. Check the look-up is working
  7. Conclusions

Prerequisites

Some things you will need to have in place before proceeding:

  • A GCP project
  • Some BigQuery data
  • A query that collects the necessary data from BigQuery
  • A Google Cloud Storage bucket to store the query results in

You also need to enable these services:

  • Google Workflows
  • Google Cloud Functions
  • Google Pub/Sub

Create a new service account that is able to do the following:

  • Query the BigQuery tables where the data you want to move lives
  • Read and write files in the relevant Cloud Storage bucket
  • Create new Cloud Scheduler jobs
  • Read and write to Firestore
  • Send Pub/Sub messages

Create a cloud function

First, we need to create a Pub/Sub-triggered Cloud Function that will write the necessary data to Firestore. There is no need to alter the code below, as all of the information required will be contained within the Pub/Sub message that triggers the function.

Settings

The settings to apply to the function are:

  • Environment: 1st gen
  • Trigger type: Cloud Pub/Sub (you can create a topic from within the function UI if you have not already done so)
  • Memory allocated: 128 MB (for small amounts of data this will work; if you notice the function timing out, adjust the memory accordingly)
  • Timeout: Depends on the file size you want to process (see limits). For files that are not that large, you could start with the default 60 seconds.
  • Runtime: Node.js 16

Then attach the service account created earlier.

Package.json

 {
    "name": "gcf-cloudstorage-to-firestore",
    "version": "1.0.0",
    "dependencies": {
        "@google-cloud/storage": "^5.20.3",
        "firebase-admin": "^10.2.0",
        "split": "^1.0.1"
    }
}

index.js

'use strict';

const admin = require('firebase-admin');
const {Storage} = require('@google-cloud/storage');
const split = require('split');

/**
 * Triggered from a Pub/Sub message.
 *
 * @param {!Object} event Event payload.
 * @param {!Object} context Metadata for the event.
 */
exports.loadCloudStorageToFirestore = async(event, context) => {

    const pubSubMessage = event.data ? Buffer.from(event.data, 'base64').toString(): '{}';
    const config = JSON.parse(pubSubMessage);

    console.log(config)

    if (typeof config.projectId != 'undefined') {

        const projectId = config.projectId;
        const bucketName = config.bucketName;
        const bucketPath = config.bucketPath;
        const firestoreCollection = config.firestoreCollection;
        const firestoreKey = config.firestoreKey;

        console.log(`Initiated new import to Firebase: gs://${bucketName}/${bucketPath}`)

        // Init Firebase
        if (admin.apps.length === 0) {
            admin.initializeApp({ projectId: projectId })
        }

        // Init Storage
        const storage = new Storage()
        const bucket = storage.bucket(bucketName);
        const file = bucket.file(bucketPath);

        let keysWritten = 0;

        try {

            // TO-DO: Remove old records

            // Read the exported JSON file from Cloud Storage and write each record to Firestore.
            // The stream is wrapped in a Promise so this async function only resolves once the
            // whole file has been processed (and so the reject calls below are defined).
            await new Promise((resolve, reject) => {
                file.createReadStream()
                    .on('error', error => reject(error))
                    .on('response', (response) => {
                        // connection to GCS opened
                    }).pipe(split())
                    .on('data', async record => {
                        if (!record || record === "") return;
                        keysWritten++;

                        const data = JSON.parse(record);
                        const key = data[firestoreKey].replace(/[/]|\./g, '');

                        try {
                            await admin.firestore().collection(firestoreCollection).doc(key).set(data)
                        } catch(e) {
                            console.log(`Error setting document: ${e}`);
                        }
                    })
                    .on('end', () => {
                        console.log(`Successfully written ${keysWritten} keys to Firestore.`);
                        resolve();
                    })
                    .on('error', error => reject(error));
            });

        } catch(e) {
            console.log(`Error importing ${bucketPath} to Firestore: ${e}`);
        }
    }

};

Once finished, deploy the function and wait to see that it throws no errors.
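If you want to confirm the function behaves as expected before wiring up the workflow, you could publish a configuration message to its topic yourself. Below is a minimal sketch using the @google-cloud/pubsub Node.js client; the topic name and config values are placeholders you would swap for your own, and publishMessage assumes a reasonably recent version of the library.

// publish-test-message.js — sends a one-off config message to the function's topic
const { PubSub } = require('@google-cloud/pubsub');

async function publishTestMessage() {
  const pubsub = new PubSub({ projectId: '<your-project-id>' });

  // This mirrors the pubsub_message object defined later in the workflow YAML
  const config = {
    projectId: '<your-firestore-project-id>',
    bucketName: '<your-export-bucket>',
    bucketPath: 'firestore-export/firestore-export.json',
    firestoreCollection: '<your-firestore-collection>',
    firestoreKey: '<your-key-to-use-as-firestore-document-id>'
  };

  // Pub/Sub delivers the message data to the function base64-encoded,
  // which the function then decodes and parses as JSON
  const messageId = await pubsub
    .topic('<your-pubsub-topic-name>')
    .publishMessage({ data: Buffer.from(JSON.stringify(config)) });

  console.log(`Published test message ${messageId}`);
}

publishTestMessage().catch(console.error);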

Create a Google Workflow

Create a new Google Workflow and apply the service account created earlier. You will need to copy in the YAML code below, which will create a visual representation of the workflow:

- init:
    assign:
    - project_id: "<your-project-id>"
    - bq_dataset_export: "<your-bq-dataset-for-export-table>"
    - bq_table_export: "<your-bq-tablename-for-export-table>"
    - bq_query: >
            select
                user_id,
                device_first,
                channel_grouping_first
            from
                `bigquery.table`
    - gcs_bucket: "<your-export-bucket>"
    - gcs_filepath: "firestore-export/firestore-export.json"
    - pubsub_topic: "<your-pubsub-topic-name>"
    - pubsub_message: {
            "projectId": "<your-firestore-project-id>",
            "bucketName": "<your-export-bucket>",
            "bucketPath": "firestore-export/firestore-export.json",
            "firestoreCollection": "<your-firestore-collection>",
            "firestoreKey": "<your-key-to-use-as-firestore-document-id>"
        }
- bigquery-create-export-table:
    call: googleapis.bigquery.v2.jobs.insert
    args:
        projectId: ${project_id}
        body:
            configuration:
                query:
                    query: ${bq_query}
                    destinationTable:
                        projectId: ${project_id}
                        datasetId: ${bq_dataset_export}
                        tableId: ${bq_table_export}
                    create_disposition: "CREATE_IF_NEEDED"
                    write_disposition: "WRITE_TRUNCATE"
                    allowLargeResults: true
                    useLegacySql: false

- bigquery-table-to-gcs:
    call: googleapis.bigquery.v2.jobs.insert
    args:
        projectId: ${project_id}
        body:
            configuration:
                extract:
                    compression: NONE
                    destinationFormat: "NEWLINE_DELIMITED_JSON"
                    destinationUris: ['${"gs://" + gcs_bucket + "/" + gcs_filepath}']
                    sourceTable:
                        projectId: ${project_id}
                        datasetId: ${bq_dataset_export}
                        tableId: ${bq_table_export}
- publish_message_to_pubsub:
    call: googleapis.pubsub.v1.projects.topics.publish
    args:
      topic: ${"projects/" + project_id + "/topics/" + pubsub_topic}
      body:
        messages:
          - data: ${base64.encode(json.encode(pubsub_message))}

You will need to add in some values here, which will be used to access BigQuery, run a query, store the results in Cloud Storage and then trigger a Pub/Sub message containing the necessary Firestore information. It is important to note that the dataset and table variables you define are where the query results will be stored: DO NOT use the same dataset and table name as the table being queried, or all of its data will be replaced with the query result.

Descriptions of the values required above:

  • project_id: GCP project ID
  • bq_dataset_export: The name of the dataset you want your BigQuery query results to live in
  • bq_table_export: The name of the table you want your BigQuery query results to live in
  • bq_query: The SQL query that will return the data you want from the source table
  • gcs_bucket: Name of the GCS bucket you wish to store your query results in
  • gcs_filepath: What you want the file to be called; ensure it ends in .json
  • pubsub_topic: Name of the Pub/Sub topic which triggers your cloud function
  • pubsub_message: The message sent to the Cloud Function, containing:
    • “projectId”: The ID of your Firestore project
    • “bucketName”: Bucket where your BigQuery JSON file can be found
    • “bucketPath”: Name of the JSON file
    • “firestoreCollection”: The collection you wish to add the data to in Firestore
    • “firestoreKey”: The field in each record to use as the Firestore document ID (see the example below)
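To make firestoreKey concrete, here is a hypothetical example of how one line of the exported newline-delimited JSON file becomes a Firestore document. The field names follow the example query in the YAML above; the collection name is made up.

// One record (one line) from firestore-export.json, as exported by BigQuery
const record = {
  user_id: 'abc123',
  device_first: 'mobile',
  channel_grouping_first: 'Organic Search'
};

// With firestoreCollection set to 'users' and firestoreKey set to 'user_id',
// the Cloud Function writes this record to the document users/abc123:
// admin.firestore().collection('users').doc('abc123').set(record);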

Test

Now that you have the function, bucket, Pub/Sub topic and workflow in place, run the workflow to check that everything works as it should. You may not immediately see the data appear in Firestore; it may take 5 or 10 minutes, but once it does, you should see something like this:

The collection name is on the far left, the list of added documents is in the center, and the key-value pairs contained within each document are on the right. If you do not see the data after 5 or 10 minutes, here are some things to check.

Troubleshooting

  • Check the Cloud Storage bucket to see if the JSON file is present. If not, there is an error at the start of your workflow that is stopping data from moving between BigQuery and the bucket. Check:
    • The query is valid
    • The dataset name is correct
    • The table name is correct
    • The project ID is correct
  • If the file is within the bucket, check the cloud function:
    • Did it run? If not, there is potentially an issue with the Pub/Sub topic; ensure that the topic within your YAML file is correct and that it matches the topic that triggers the Cloud Function
    • If it did run, check the function's logs. These will give some information on what is happening. For example:
      • Permission errors: make sure the correct permissions are granted to the service account. If not, update them and try again
      • Error messages mentioning a variable being ‘undefined’ probably point to a value you defined in the workflow YAML file. Check through these and ensure they are correct.

Accessing the data from ssGTM

Now that all of the data you wish to access is within Firestore, we can create a ‘Firestore Lookup’ variable from within ssGTM in order to retrieve it for use in our tags.

To do this, head to your ssGTM instance and navigate to Variables. Select ‘New’ under ‘User-Defined Variables’ and then select ‘Firestore Lookup’. This variable allows you to access information within Firestore based on some lookup value.

Above you can see an example configuration of the Firestore look-up variable. Here is a description of each field:

  • Lookup type:
    • Document path: Look up a document by specifying its full path
    • Collection path & query: Look up a document within a collection where the document meets the specified query criteria. If multiple documents are returned from the query, only the first document is used
  • Query conditions: Checks a field within a (Firestore) document to see if it matches the value. If it does, the variable returns the value found at the Key Path. So in plain English, if you had a product ID, you could check through the documents in Firestore for a matching product ID and return the corresponding desired value, as in the sketch below.
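For illustration, a ‘Collection path & query’ lookup behaves roughly like the following Firestore query. The variable does all of this for you; the collection name, field names and key path here are hypothetical.

const admin = require('firebase-admin');

admin.initializeApp(); // uses the default credentials of the GCP project

// Roughly what the Firestore Lookup variable does for a
// 'Collection path & query' lookup with key path 'price':
async function lookupItemPrice(itemId) {
  const snapshot = await admin.firestore()
    .collection('items')                // collection path (hypothetical)
    .where('item_id', '==', itemId)     // query condition
    .limit(1)                           // only the first matching document is used
    .get();

  if (snapshot.empty) return undefined;
  return snapshot.docs[0].get('price'); // key path: the field whose value is returned
}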

One other thing of note: by default, ssGTM will look in the Firestore database of the project the container is provisioned on, but you can specify another GCP project to look up the value in. This is very handy, as many people will have one project for their server and a second for their data warehouse. In order to look up a value in another project, the ssGTM container project needs to have the correct permissions in the project in which the data lives. To do this, simply set up a service account with the correct permissions.

Check the look-up is working

Now that everything is set up, we can preview our ssGTM container to check that the expected value is appearing in our new Firestore Lookup variable.

As you can see in our test the expected value is returned and can now be sent on with any tag we wish.

Conclusions

This has the potential to be a very powerful way of enriching events with new, useful information before they are sent from the server. Obvious examples are things like product data and user information, but I am sure more new and interesting use cases will reveal themselves.

Most of the information in this post is adapted from two blogs, and code is also borrowed from a Stacktonic git repository that has an accompanying blog post (see below).

References

Cloud function code and workflow code from Krisjan Oldekamp at Stacktonic.

Simo Ahava’s blog on enriching server-side data with Cloud Firestore.

Krisjan Oldekamp’s blog on exporting BigQuery data to Google Firestore and GTM server.
