
Google Analytics data import: automate data import with Python & Cloud functions

Victor Sarker · 16 November 2020 · 6 min read

Why automate data import into Google Analytics

A few months back I was granted access to a Google Cloud Source Repository (private Git repositories hosted on Google Cloud, part of Google’s Cloud developer tools). It contained Python code and configuration details for automating the extraction of data from a BigQuery table and uploading it to Google Analytics as a custom dimension via the Management API, on a desired schedule.

Choosing the right approach for blog word counts

Being primarily a JavaScript-focussed individual, I thought this was a good opportunity to expand my knowledge and work on something a bit different to my usual Google Tag Manager shenanigans. The only problem was coming up with a suitable scenario where I could make use of this repository. Thankfully, while discussing the potential applications of this code, I remembered a previous client request: to collect the word count of the client’s blog articles so that it was available in a custom dimension.

This request went round to a few Measurelabbers who each came up with a different approach.

Limitations of GTM and manual CSV uploads

My initial suggestion was to use GTM (naturally) to scrape the page in a custom JS variable, calculate a word count and pass the value on in a custom dimension field. However, this means running a bunch of additional JS code in the user’s browser, and potentially delaying the page view tag so it only fires once the visible content has loaded.

Another suggested approach was to get the client to compile a CSV of word counts against page paths and use the Data Import feature in Google Analytics. However, that approach requires repetitive manual input from the client every time a new blog is added.
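For context, whichever route the data takes, the file GA’s Data Import expects is just a small CSV keyed on page path, with the word count sitting in whichever custom dimension slot has been configured. The dimension index and the numbers below are purely illustrative:

```
ga:pagePath,ga:dimension1
/blog/automating-ga-data-import/,1250
/blog/another-blog-post/,870
```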

Using Python and Google Cloud to solve the problem

Yet another suggestion was uploading the data to Google Cloud Storage or BigQuery and handling the data that way. This Google Cloud Solution offered (at least in part) a way to get round these obstacles.

With this problem in mind, I decided a good way to utilise this repo would be to set up a word count custom dimension for Measurelab’s own blog. Now that I had an objective, it was time to get started.

Scraping the blog content with Python

First of all, I needed to get hold of the blog word counts. Since I didn’t want to bloat the site with additional JS in GTM to scrape pages in real time, I figured I’d acquire the pages a different way: a Python script that could crawl the site at a time when there wouldn’t be much traffic and extract the information from there.

I’m not going to go into the laborious, line-by-line detail of how the site-crawling code works, as this post is already running long enough.
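For the curious, though, a rough sketch of the crawl looks something like the below. It assumes paginated blog list pages and uses an illustrative URL and CSS selectors rather than the actual Measurelab markup:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.measurelab.co.uk/blog/"  # illustrative starting point

def get_post_urls():
    """Loop through the paginated blog list pages and collect the post URLs."""
    post_urls = []
    page = 1
    while True:
        list_url = BASE_URL if page == 1 else f"{BASE_URL}page/{page}/"
        response = requests.get(list_url)
        if response.status_code != 200:
            break  # ran out of list pages
        soup = BeautifulSoup(response.text, "html.parser")
        links = [a["href"] for a in soup.select("article h2 a")]  # assumed selector
        if not links:
            break
        post_urls.extend(links)
        page += 1
    return post_urls
```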

Extracting and cleaning the text

Once the script is running, it identifies how many blog list pages there are, loops through those list pages to collect the blog URLs and then scrapes each post for its HTML content. Having already inspected a blog page, I’d identified the elements that contain the text of the post; that text gets extracted, cleaned (tokens made up entirely of punctuation, e.g. code snippets, are thrown out) and counted. During this process I can also scrape other pieces of information where available, such as the blog author and the date the blog was published.
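A minimal sketch of that extract-and-count step might look like this, with the CSS selectors again being assumptions rather than the site’s real markup:

```python
import string
import requests
from bs4 import BeautifulSoup

def count_words(post_url):
    """Fetch a blog post, extract the body text and count the words,
    discarding tokens made up entirely of punctuation (e.g. code snippets)."""
    soup = BeautifulSoup(requests.get(post_url).text, "html.parser")
    body = soup.select_one("div.entry-content")  # assumed container for the post text
    if body is None:
        return 0, None
    tokens = body.get_text(separator=" ").split()
    words = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
    # Grab extras where the tags exist, e.g. the author
    author_tag = soup.select_one("a[rel='author']")  # assumed markup
    author = author_tag.get_text(strip=True) if author_tag else None
    return len(words), author
```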

Formatting data for BigQuery

Once the data has been scraped, it’s formatted into a dataframe ready to be exported to a designated BigQuery table. Since this script is designed to run periodically, I’ve included a code block that queries the existing table and compares it with the current dataframe. If the two are identical, no new blog has been posted and there’s no need to update BigQuery; if the dataframe differs from what’s currently in BigQuery, the table is overwritten with the results of the latest scrape.
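A simplified version of that compare-and-overwrite logic might look like the below, using pandas-gbq; the project and table names are placeholders, and the comparison is deliberately naive:

```python
import pandas as pd
import pandas_gbq

PROJECT_ID = "my-gcp-project"               # placeholder
TABLE_ID = "analytics.blog_word_counts"     # placeholder dataset.table

def update_table_if_changed(scraped_df: pd.DataFrame) -> None:
    """Overwrite the BigQuery table only when the scrape produced new data."""
    try:
        existing_df = pandas_gbq.read_gbq(
            f"SELECT * FROM `{TABLE_ID}`", project_id=PROJECT_ID
        )
    except Exception:
        existing_df = pd.DataFrame()  # table doesn't exist yet, treat as empty

    if set(existing_df.columns) == set(scraped_df.columns):
        cols = list(scraped_df.columns)
        # Sort both frames so the comparison isn't thrown off by row order
        if existing_df[cols].sort_values(cols).reset_index(drop=True).equals(
            scraped_df[cols].sort_values(cols).reset_index(drop=True)
        ):
            return  # identical: no new posts, leave BigQuery alone

    pandas_gbq.to_gbq(scraped_df, TABLE_ID, project_id=PROJECT_ID, if_exists="replace")
```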

Setting up the Cloud Function to push data to Google Analytics

Now that there’s data available in BigQuery, the BQ_to_GA repository can finally be set up. The repo comes with a README.md file which describes the steps required to deploy the Cloud Function.

Prerequisites and configuration

The prerequisites to this are:

  1. Access to a Google Cloud Platform project (with billing enabled) and an active Google Analytics account.
  2. The Cloud Source Repositories, Cloud Functions and Cloud Scheduler APIs enabled in the GCP project.
  3. The Analytics API enabled in the GCP project (step 1 of the Analytics API quickstart) and the service account’s email address added to the Google Analytics account.

Once all of the above has been set up, it’s simply a case of cloning the repo into a Cloud Shell session, installing the Python dependencies and filling in the config.properties file with the required information (service account key, BigQuery details, GA details, etc.).

Deploying the Cloud Function and scheduling updates

There is a function included to test the connection to Google Analytics, which returns the GA property ID. If that succeeds, the Cloud Function can be deployed. In doing so, a Pub/Sub topic is referenced (or created if it doesn’t exist), and Cloud Scheduler can be used to publish to that topic on a desired schedule, triggering the function.
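The upload itself goes through the Management API’s Data Import endpoint. A stripped-down sketch of that flow, including the connection test, might look like this; all IDs and file names are placeholders, and this isn’t the repo’s actual code:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# Placeholder values; in the repo these come from config.properties
KEY_FILE = "service-account.json"
ACCOUNT_ID = "12345678"
WEB_PROPERTY_ID = "UA-12345678-1"
CUSTOM_DATA_SOURCE_ID = "abc123DEF456"  # the Data Set ID from GA Admin > Data Import

credentials = service_account.Credentials.from_service_account_file(
    KEY_FILE, scopes=["https://www.googleapis.com/auth/analytics.edit"]
)
analytics = build("analytics", "v3", credentials=credentials)

# Quick connection test: fetch the web property to confirm access
prop = analytics.management().webproperties().get(
    accountId=ACCOUNT_ID, webPropertyId=WEB_PROPERTY_ID
).execute()
print(prop["id"])

# Upload the CSV (ga:pagePath, ga:dimensionX) to the Data Import data set
media = MediaFileUpload("word_counts.csv", mimetype="application/octet-stream")
analytics.management().uploads().uploadData(
    accountId=ACCOUNT_ID,
    webPropertyId=WEB_PROPERTY_ID,
    customDataSourceId=CUSTOM_DATA_SOURCE_ID,
    media_body=media,
).execute()
```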

Automating the workflow

After a successful deployment, and seeing a new file appear in GA’s Data Import, I altered the Cloud Function so that my page-crawling, word-counting code runs first. Since there are only a hundred or so blog posts on the Measurelab website, the whole process can be handled in one go with an execution time of around 2 minutes (Cloud Functions have a maximum timeout of 9 minutes, i.e. 540 seconds). For larger scrapes, or code that takes longer to run, the process would need to be split up further.
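To give a feel for how it hangs together, here’s a sketch of a Pub/Sub-triggered entry point that chains the illustrative helpers from the earlier snippets; the real function and variable names in my code differ:

```python
import pandas as pd

def main(event, context):
    """Entry point for a Pub/Sub-triggered Cloud Function (1st gen signature).
    get_post_urls, count_words, update_table_if_changed and upload_to_ga are the
    illustrative helpers sketched above, not the repo's actual function names."""
    rows = []
    for url in get_post_urls():
        word_count, author = count_words(url)
        rows.append({"page_path": url, "word_count": word_count, "author": author})

    scraped_df = pd.DataFrame(rows)
    update_table_if_changed(scraped_df)  # refresh BigQuery only if something changed
    upload_to_ga()                       # then push the new file to GA Data Import
```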

Results and next steps

So, in conclusion: each night at 23:45 some Python code is invoked, and any time a blog has been posted on the Measurelab website (like this one), a table in BigQuery is updated and a new file is uploaded to GA, all in a matter of minutes, without anyone needing to push any buttons.

Any comments, queries or feedback can be sent to victor@measurelab.co.uk

