Dataform for BigQuery: A basic end-to-end guide

Prasanna Venkatesan · 18 March 2025 · 6 min read

Dataform is a powerful tool for managing your data workflows in a structured, version-controlled, and automated way. Whether you're a beginner or an experienced data engineer, Dataform simplifies SQL-based transformations while integrating seamlessly with Google BigQuery.

Although this blog offers a basic introduction to Dataform's functionality, users can achieve significantly more with Dataform. From advanced scheduling, parameterised queries, and dependency management to complex data modelling, the possibilities are vast.

We will be releasing more blogs diving deeper into various aspects of Dataform, covering best practices, optimisation techniques, and real-world use cases. This post will also serve as a live blog: we will keep adding new features and updates, and link to detailed Dataform-related articles as they become available.

Make sure to bookmark this blog and stay tuned for upcoming posts that will help you leverage Dataform to its full potential!

Now, we'll take you through an end-to-end setup of Dataform, starting from creating a workspace to building target tables, while exploring all its key features along the way.

Setting up your Dataform workspace

Before we dive into defining sources and building tables, let’s set up our Dataform workspace.

1: Create a Google Cloud project

  1. Go to Google Cloud Console.
  2. Create a new project or select an existing one.
  3. Enable the BigQuery API.

2: Enable Dataform in Google Cloud

  1. Navigate to Google Cloud Dataform.
  2. Click Create Repository and give it a name.
  3. Choose a location (typically the same as your BigQuery dataset).
  4. Initialise the repository with a new workspace.

3: Configure Dataform

After setting up your Dataform workspace, it's important to configure the workflow_settings.yaml file. This file defines the default project, location, datasets, and Dataform core version for your workspace.

Example workflow_settings.yaml configuration:

defaultProject: dataform-64568
defaultLocation: EU
defaultDataset: dataform
defaultAssertionDataset: dataform_assertions
dataformCoreVersion: 3.0.0

This configuration lets you execute all transformations and assertions within the specified project and datasets, streamlining your workflow management.
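
Because defaultAssertionDataset points at dataform_assertions, any built-in assertions you attach to a table are created in that dataset automatically. Here is a minimal illustrative sketch; the table name example_table and the nonNull/uniqueKey checks are examples rather than part of the pipeline we build below, and ${ref("cycle_hire")} refers to the source we declare in the next section:

config {
  type: "table",
  name: "example_table",
  assertions: {
    nonNull: ["rental_id"],   // run fails if rental_id contains NULLs
    uniqueKey: ["rental_id"]  // run fails if rental_id is not unique
  }
}

SELECT rental_id, bike_id
FROM ${ref("cycle_hire")}

The generated assertion views land in dataform_assertions, keeping data quality checks separate from your reporting datasets.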


Defining your data sources

Now that your workspace is set up, let’s define the source dataset and tables. We’ll use a free BigQuery dataset for demonstration.

1: Choose a public dataset

Google provides a variety of public datasets. Let’s use bigquery-public-data.london_bicycles.cycle_hire.

2: Define the source in Dataform

  • Delete all other folders and files except the definitions/ folder and workflow_settings.yaml file.
  • Create a new directory called source/ under definitions/.
  • Define the source tables under the definitions/source/ directory.

1. Bike Hire Source Data - cycle_hire.sqlx

config {
  type: "declaration",
  database: "bigquery-public-data",
  schema: "london_bicycles",
  name: "cycle_hire",
  description: "Cycle Hire Data"
}

2. Bike stations source data - cycle_stations.sqlx

config {
  type: "declaration",
  database: "bigquery-public-data",
  schema: "london_bicycles",
  name: "cycle_stations",
  description: "Cycle Stations Data"
}

Creating transformations

Let’s create transformations that will process and refine our London bicycle hire dataset. We will build intermediate tables, apply multiple transformations, and finalise the results into target tables.

Step 1: Create intermediate tables

Intermediate tables help us clean and enrich the raw data before final aggregation. These should be placed under the definitions/intermediate/ directory.

1. Normalise Bike Ride Durations

config {
  type: "table",
  schema: "intermediate",
  name: "normalized_cycle_rides"
}

-- In the config above, if schema is not specified, the table is created in the default dataset defined in workflow_settings.yaml.
-- If name is not specified, the table takes its name from the file. It is good practice to keep table names the same as their file names.

SELECT
  rental_id,
  bike_id,
  start_station_id,
  end_station_id,
  TIMESTAMP_DIFF(end_date, start_date, SECOND) AS ride_duration_seconds,
  start_date,
  end_date
FROM ${ref("cycle_hire")}

Step 2: Aggregate daily bike usage

config {
  type: "table",
  schema: "intermediate",
  name: "daily_cycle_usage"
}

-- start_station_id and end_station_id are carried through so the summary report in the next step can join station details onto the daily aggregates.
SELECT
  DATE(start_date) AS ride_date,
  bike_id,
  start_station_id,
  end_station_id,
  COUNT(*) AS total_rides
FROM ${ref("normalized_cycle_rides")}
GROUP BY 1, 2, 3, 4

Step 3: Define final output tables

Final tables consolidate the transformed data for reporting and analysis. These should be placed under the definitions/output/ directory.

1. Summary Report

config {
  type: "table",
  schema: "output",
  name: "summary_report"
}

-- In the public london_bicycles dataset, cycle_stations exposes the station key as id and its label as name.
SELECT
  d.ride_date,
  d.bike_id,
  d.total_rides,
  s.name AS start_station,
  e.name AS end_station
FROM ${ref("daily_cycle_usage")} d
LEFT JOIN ${ref("cycle_stations")} s ON d.start_station_id = s.id
LEFT JOIN ${ref("cycle_stations")} e ON d.end_station_id = e.id

2. Peak Ride Hours

config {
  type: "view",
  schema: "output",
  name: "peak_ride_hours"
}

SELECT
  EXTRACT(HOUR FROM start_date) AS hour_of_day,
  COUNT(*) AS ride_count
FROM ${ref("normalized_cycle_rides")}
GROUP BY 1
ORDER BY ride_count DESC

Instead of storing the data physically, we can create views for flexible and cost-efficient querying.


Incremental tables are another type of table that can be defined in Dataform. To learn more about these tables and their uses, please refer to this blog.
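
As a quick taste, an incremental table only processes new rows on each run instead of rebuilding the whole table. Here is a minimal sketch against our source data; the name cycle_rides_incremental and the choice of start_date as the watermark column are illustrative:

config {
  type: "incremental",
  schema: "intermediate",
  name: "cycle_rides_incremental"
}

SELECT
  rental_id,
  bike_id,
  start_date
FROM ${ref("cycle_hire")}

-- Applied only on incremental runs: restrict the scan to rows newer than what is already in the table.
${when(incremental(), `WHERE start_date > (SELECT MAX(start_date) FROM ${self()})`)}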


The "Not found: Dataset <project_id>: intermediate was not found in location" error message appears in the Dataform UI only when the underlying tables have not yet been created. The error will disappear once the pipeline has been executed and the tables have been created.

Step 4: Partitioning and clustering in Dataform

Both of these help optimise query performance and reduce costs in BigQuery by improving how data is stored and accessed.

Partitioning

Partitioning divides a table into segments based on a specific column (e.g., ride_date). Queries that filter by this column will scan only the relevant partitions, improving efficiency.

Clustering

Clustering organises data within partitions based on certain columns (e.g., bike_id and start_station). This speeds up queries that filter or group by these columns.

By implementing partitioning and clustering in our summary_report table, we can ensure efficient storage and faster query execution. This is particularly beneficial for queries that filter on a specific date, as it reduces the amount of data scanned in BigQuery, leading to cost savings. Let's implement partitioning and clustering in the summary_report table as shown below.

config {
  type: "table",
  schema: "output",
  name: "summary_report",
  bigquery: {
    partitionBy: "ride_date",
    clusterBy: ["bike_id", "start_station"]
  }
}

SELECT
  d.ride_date,
  d.bike_id,
  d.total_rides,
  s.name AS start_station,
  e.name AS end_station
FROM ${ref("daily_cycle_usage")} d
LEFT JOIN ${ref("cycle_stations")} s ON d.start_station_id = s.id
LEFT JOIN ${ref("cycle_stations")} e ON d.end_station_id = e.id

Step 5: Compile and run transformations

Now that we have defined our transformations and output tables, we need to execute them in Dataform to create the required tables and views in BigQuery.

Executing Multiple Actions

  • Click Start Execution to validate your SQL code and check for errors.
  • Navigate to the Actions section in Dataform.
  • Click Multiple actions… to selectively run the required transformations.
  • Choose the following actions to be executed:
    • summary_report
    • peak_ride_hours
  • Select the Include dependencies option.
  • Click Run Selected Actions to execute and create these objects in your BigQuery dataset.

By selecting both actions with Include dependencies, Dataform builds the intermediate tables first and then creates the summary table and the view, respecting the dependencies between intermediate and output objects.
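
If you find yourself selecting the same set of actions repeatedly, you can also group actions with tags and run everything that shares a tag in one go. A minimal sketch, where the "daily" tag name is just an example:

config {
  type: "table",
  schema: "output",
  name: "summary_report",
  tags: ["daily"]  // any action tagged "daily" can be selected and executed together
}

When starting an execution you can then select actions by tag rather than picking them individually, which becomes handy once the number of tables grows.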

Step 6: Customising Dataform structure

This blog uses a sample structure to keep the walkthrough simple. Once you have grasped the fundamentals, you can adapt the layout to your project's specific needs: depending on the complexity of your data workflows, teams may organise the Dataform repository differently, optimise queries for efficiency, or integrate additional features such as parameterised datasets and advanced scheduling. Dataform is flexible enough to fit a wide range of data engineering requirements.

