
Data pipeline optimisation with Google Cloud and Dataform

Prasanna Venkatesan · 22 April 2025 · 3 min read

In a recent engagement, we set out to transform a client’s data pipelines, tackling performance and cost inefficiencies in their Google Cloud BigQuery environment. Our work culminated in a comprehensive optimisation strategy combining Dataform, improved SQL practices, and tailored solutions, delivering significant performance gains and cost savings. Here’s a deep dive into the highlights of the project.

Identifying inefficiencies in BigQuery workflows

We began by analysing the existing data architecture, identifying key areas of inefficiency:

  1. Long view chains: Multiple layers of stacked views increased complexity and degraded performance (sketched after this list).
  2. Lack of incremental logic: Transformations processed entire datasets unnecessarily, increasing costs.
  3. Unoptimised tables: Absence of partitioning and clustering led to full table scans.
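
To make the first point concrete, here is a minimal sketch of the view-chain anti-pattern (the dataset and view names are hypothetical, not the client's). Because BigQuery views are logical rather than materialised, querying the top view re-executes every layer beneath it on each read:

    -- Hypothetical long view chain: each layer selects from the one
    -- below, so a query against v_orders_report recomputes the stack.
    create or replace view reporting.v_orders_base as
    select * from raw.orders;

    create or replace view reporting.v_orders_enriched as
    select o.*, c.segment
    from reporting.v_orders_base as o
    join raw.customers as c using (customer_id);

    create or replace view reporting.v_orders_report as
    select segment, sum(total_amount) as revenue
    from reporting.v_orders_enriched
    group by segment;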

Transitioning to Dataform

To address these challenges, we transitioned from BigQuery Scheduled Queries to Dataform, unlocking the following benefits:

  • Improved maintainability: Modular SQL scripts made workflows easier to debug and maintain.
  • Version control: Seamless integration with Git allowed better collaboration and change tracking.
  • Automated dependency management: Dependencies were automatically managed, minimising manual intervention and errors.
  • Incremental processing capabilities: Focused data processing only on updated data, enhancing efficiency.
  • Assertions for data quality: Automated checks ensured consistency and reliability throughout the pipeline (see the example after this list).
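
As a sketch of the dependency-management and assertion points (model and column names here are hypothetical; incremental processing is illustrated in the next section), a Dataform SQLX file declares its upstream dependencies with ref() and its data-quality checks in an assertions block:

    -- definitions/stg_orders.sqlx (hypothetical model)
    config {
      type: "table",
      assertions: {
        uniqueKey: ["order_id"],
        nonNull: ["order_id", "customer_id", "order_date"]
      }
    }

    select
      order_id,
      customer_id,
      order_date,
      total_amount
    -- ref() records the dependency on the upstream model, letting
    -- Dataform build the DAG and run models in the correct order.
    from ${ref("raw_orders")}

Each model lives in its own version-controlled file, and the assertions run alongside every execution, surfacing data-quality problems instead of letting them propagate downstream.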

Key optimisation strategies

  • Query structure improvements: Refactored nested subqueries into Common Table Expressions (CTEs), improving readability and debugging efficiency (see the before/after sketch after this list).
  • Consolidation of intermediate tables: Combined redundant views into fewer tables, reducing complexity and redundant operations.
  • Implementation of incremental tables: Processed only new or updated data, significantly reducing data scanned and associated costs.
  • Partitioning of tables: Partitioned tables based on relevant date fields, improving query performance.
  • Eliminating full table scans: Restructured incremental tables to prevent full table scans, ensuring more targeted data processing (the second sketch after this list combines the last three points).
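
First, the query-structure point. The example below is a hypothetical illustration rather than client code; both versions return the same result, but the refactor names each step so it can be read, debugged, and tested in isolation:

    -- Before: logic buried in nested subqueries.
    select *
    from (
      select customer_id, sum(total_amount) as revenue
      from (
        select customer_id, total_amount
        from analytics.orders
        where order_status = 'COMPLETE'
      )
      group by customer_id
    )
    where revenue > 1000;

    -- After: the same logic expressed as named CTEs.
    with completed_orders as (
      select customer_id, total_amount
      from analytics.orders
      where order_status = 'COMPLETE'
    ),
    customer_revenue as (
      select customer_id, sum(total_amount) as revenue
      from completed_orders
      group by customer_id
    )
    select *
    from customer_revenue
    where revenue > 1000;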
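
Second, incremental processing, partitioning, and the elimination of full scans combined in one model. This is a minimal sketch under assumed names and an assumed three-day lookback; partitionBy, clusterBy, uniqueKey, updatePartitionFilter, and when(incremental(), ...) are standard Dataform constructs:

    -- definitions/orders_daily.sqlx (hypothetical model)
    config {
      type: "incremental",
      -- uniqueKey turns incremental runs into a MERGE, so re-processed
      -- rows update in place instead of duplicating.
      uniqueKey: ["order_date", "customer_id"],
      bigquery: {
        partitionBy: "order_date",
        clusterBy: ["customer_id"],
        -- Restrict the MERGE to recent target partitions so the
        -- destination table is never fully scanned.
        updatePartitionFilter:
          "order_date >= date_sub(current_date(), interval 3 day)"
      }
    }

    select
      order_date,
      customer_id,
      count(*) as orders,
      sum(total_amount) as revenue
    from ${ref("stg_orders")}
    -- On incremental runs, the constant date predicate lets BigQuery
    -- prune source partitions; the lookback absorbs late-arriving data.
    ${when(incremental(),
      `where order_date >= date_sub(current_date(), interval 3 day)`)}
    group by order_date, customer_id

With this shape, both sides of the run are bounded: the source read is pruned to a few partitions, and the merge into the target touches only the partitions named in updatePartitionFilter.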

Cost savings analysis

Our optimisation efforts translated into substantial cost savings:

  1. Data processed for key tables:
    • BigQuery costs were reduced by over 40% annually due to a 92% reduction in data processed for some large output tables.
  2. Dashboard optimisation:
    • Dashboard redesigns also contributed to query optimisation. By separating lightweight and heavy charts onto different pages and consolidating scorecards, we eliminated redundant queries and unnecessary filters, cutting the data read from the final output tables by hundreds of gigabytes per dashboard load.
  3. Operational efficiencies:
    • Reduced manual dependency management, debugging, and new-model setup saved approximately 32 engineer hours and close to 20% in cost each month.

Conclusion

This project demonstrates how targeted optimisations can transform data pipelines, improving performance while dramatically reducing costs. Leveraging tools like Dataform and best practices in SQL and BigQuery, we delivered a smarter, more efficient solution tailored to the client’s needs.
