
A step-by-step guide to migrating scheduled queries to Dataform

Katie Kaczmarek · 27 November 2024 · 6 min read

Managing scheduled queries in BigQuery often feels limiting — there’s no version control, no easy collaboration, and scaling can be difficult.

If you’ve ever wondered how to make SQL workflows smoother, Dataform is your answer. In this post, I’ll show you how I migrated a BigQuery scheduled query to Dataform and how it transformed the way I manage my data pipelines.

After all, we all want to know who’s been touching our queries, don’t we?

Getting started in Dataform

The first thing you need to do is set up a repository and link Dataform to GitHub. If you’re unfamiliar with setting up a Dataform repository, check out my detailed guide on how to set up a Dataform repository with GitHub & Google Cloud integration.

When moving an existing query to Dataform, the aim is to separate any large blocks of SQL into their own views. That way we can track down errors easily when they occur and control the full data pipeline.

Breaking down your queries: How to structure your SQL in Dataform

Once you have your workspace in Dataform, clear out any example .sqlx files to start with a blank slate. A good structure for any project includes three directories: sources, intermediate, and output.

To create these directories:

  1. Click the three dots next to Definitions.
  2. Choose Create Directory.
  3. Create one directory each for sources, intermediate, and output. You do this by adding the name to the end of the directory path (the resulting layout is sketched below).
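
Once all three exist, the definitions folder might look something like this (the file names here are just placeholders that I’ll reuse in the examples below):

definitions/
  sources/
    raw_orders.sqlx
  intermediate/
    int_daily_orders.sqlx
  output/
    daily_orders_report.sqlx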

Sources

In this directory we declare each of the tables that we will reference in our SQL. Each time you create a new directory or file, click the three dots next to the relevant section (e.g., Definitions, sources) and choose Create File. Give the source a name and add it to the end of the file path, followed by “.sqlx”. Then add code to your new file that looks something like this:

config {
  type: "declaration",
  database: "[name of project]",
  schema: "[name of dataset]",
  name: "[name of table]",
  description: "[a brief description of what this table is]"
}

You will need to repeat the above for each data source you need for your SQL code.
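
For example, a declaration for a hypothetical raw orders table might look like this (the project, dataset, and table names below are placeholders, not anything you need to match):

config {
  type: "declaration",
  database: "my-gcp-project",
  schema: "shop_raw",
  name: "raw_orders",
  description: "Raw orders loaded from the shop backend, one row per order"
}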

Intermediate

In this directory we want to create views for any WITH statements or subqueries that exist in our code. Once again, create a file, give the intermediate view a name, and add this to the end of the file path followed by “.sqlx”. The config for this section of code will look something like this:

config {
  type: "view",
  schema: "[name of dataset you want the view stored in]",
  tags: ["[any words that might be associated with the table]"],
  description: "[brief description of what this code is doing]"
}

The schema refers to the dataset that will be created in BigQuery when the code runs. We create views because they don’t incur storage costs in BigQuery: only the query definition is stored, not the data itself. This means you can break your pipeline into as many steps as you like without worrying about additional storage charges.

After the config you can add your SQL query as it appears in BigQuery and it will be executed in much the same way. The only change you need to make is to reference the declaration we created in the sources directory within the FROM statement. It should look something like this:

#referencing a source
FROM ${ref('[name given to the source config]')}

#referencing an intermediate view
FROM ${ctx.ref('[name given to the intermediate view]')}
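
Putting the config and the query together, a complete intermediate file might look something like the sketch below. The source name and the field names are the hypothetical ones from earlier, so swap in whatever your own query uses:

config {
  type: "view",
  schema: "intermediate",
  tags: ["orders"],
  description: "Daily order totals, pulled out of the original WITH statement"
}

#aggregate raw orders to one row per day and country
SELECT
  DATE(order_timestamp) AS order_date,
  country,
  COUNT(*) AS orders,
  SUM(order_value) AS revenue
FROM ${ref("raw_orders")}
GROUP BY order_date, country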

You can check what the code will look like in BigQuery by clicking on the Compiled Queries tab. This shows the SQL as it will be executed, highlights any errors, and tells you how many GB the run will process.

To execute your query and create the view in BigQuery, make sure you are in the intermediate file you want to execute, click Start execution in the top bar, choose Actions, and select the file you just created. Click Start execution and the view will be created.

Repeat the above intermediate steps for any WITH statements or subqueries within your code. The ${ref()} you use in the FROM statement can refer to sources or to other intermediate files; this is how we build up the code in BigQuery.

Output

Once we are satisfied that all sources and intermediate tables have been created, we can move on to creating our output files. These contain the final SQL that creates tables in BigQuery: essentially, the last bit of code in our original scheduled query. Create a file, give the output table a name, and add this to the end of the file path followed by “.sqlx”.

There are a few options you can use for the config type here. The main ones are table and incremental. Table performs a complete refresh on every run (just like a CREATE OR REPLACE TABLE), while incremental adds new data without touching existing records (compiling to an INSERT INTO, or to a MERGE if you include a unique key). There is a sketch of an incremental config at the end of this section.

config {
  type: "table",
  schema: "[name of dataset]",
  tags: ["[any words that might be associated with the table]"],
  bigquery: {
    partitionBy: "[date field you want to partition by]",
    clusterBy: ["[fields you want to cluster by]"]
  },
  description: "[a brief description of what this table is]"
}

Use the above config to create a new table each time the code is run, and add your final SQL code after the config. To execute your query and create the table in BigQuery, make sure you are in the output file you want to execute, click Start execution in the top bar, choose Actions, and select the file you just created. Click Start execution and the table will be created.
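
If you go for the incremental type instead, a minimal sketch might look like this, again using the hypothetical names from earlier. The uniqueKey tells Dataform which columns to merge on, and the when(incremental(), ...) line only applies the date filter on incremental runs, so a full refresh still rebuilds everything:

config {
  type: "incremental",
  schema: "reporting",
  uniqueKey: ["order_date", "country"],
  bigquery: {
    partitionBy: "order_date"
  },
  description: "Daily orders report, with only new dates merged in on each run"
}

SELECT
  order_date,
  country,
  orders,
  revenue
FROM ${ref("int_daily_orders")}
${when(incremental(), `WHERE order_date > (SELECT MAX(order_date) FROM ${self()})`)}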


Schedule your Dataform pipeline

Once everything has been created in Dataform you will need to schedule the pipeline. Google’s guide on scheduling Dataform executions can help you automate your workflows.

Hopefully this will help you get started with Dataform, and when you are ready you can move on to more advanced features like assertions, pre and post operations, dependencies, and I’m sure a load more that I’m not aware of yet! All of these are additional things you can set within the config. But if you are writing SQL already, you are halfway there.

