Behind the Cloud – Data Preparation in BigQuery

In this episode of Behind the Cloud, Matt explores BigQuery’s new Data Preparation feature—an AI-powered, point-and-click tool for transforming data with ease. From natural language transformations using Gemini to automated workflows and scheduled outputs, learn how Google is simplifying data prep for analytics and machine learning.

Watch below 👇 on YouTube, or catch more from the series here.

Transcript

[00:00:00] Hello and welcome to another episode of Behind the Cloud. Today we're going to look at a feature, or a service, or whatever you want to call it, within BigQuery called data preparation. Data preparation is a point-and-click way of transforming data. Some of you may remember a service called Dataprep that used to exist within Google Cloud, built by a company called Trifacta.

[00:00:30] I believe Google bought them. From what I can see, this looks very similar. It displays your data, it shows you all the various metrics, and it allows you to create a series of transformations. You can schedule it to run these transformations on a regular basis. And they've added a nice helping of Gemini and AI to give you a natural-language way of creating these transformations, joins and filters.

[00:00:54] So I thought it'd be a great opportunity for us to have a look around another interesting AI data transformation feature that Google is adding to BigQuery. Let's go and create a new data preparation. What we need to do is go into BigQuery, go to the little dropdown here, and click Data preparation.

[00:01:14] What we can do, which is quite nice, is search for tables within our cloud environment. I'm going to search for a public dataset. Here we go. Let's just add this as a source. Once you've added it as a source, you'll see this screen. I'm going to collapse the panel on the left so we can get a better view of things.
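
The video does all of this through the point-and-click UI, but if you want a feel for what "adding a source" amounts to, here is a minimal sketch using the google-cloud-bigquery Python client. The project ID and table name are placeholders; the actual public table used in the video isn't named.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project
SOURCE_TABLE = "bigquery-public-data.example_dataset.example_table"  # placeholder

# Preview a handful of rows, much like the sample grid Data Preparation shows you.
for row in client.list_rows(SOURCE_TABLE, max_results=50):
    print(dict(row))
```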

[00:01:36] On the right-hand side, you've got your Gemini assistant, which is going to help you and suggest potential transformations you can apply to the data. You'll also see any steps that have already been applied. Here we've just got a read step, reading from the public dataset. Currently we've got a load of suggestions.

[00:01:55] As I was saying, you can apply any of these suggestions, copy the code, or rate them. Then, looking at the data itself, we can see all of the different columns. We can get an idea of duplication, so whether there are any rows that have the same values in them, and we can look at the skew. We can see, for example, that all 50 rows displayed on this page have the same value of 2025.
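
If you want to reproduce that duplication and skew readout outside the UI, a rough hand-rolled equivalent is a couple of GROUP BY style queries. This isn't what Data Preparation runs under the hood, just one way to get similar numbers; the table name and the year column are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project
SOURCE_TABLE = "bigquery-public-data.example_dataset.example_table"  # placeholder

# How many rows share each value of the (hypothetical) year column: a quick skew check.
skew_sql = f"""
SELECT year, COUNT(*) AS row_count
FROM `{SOURCE_TABLE}`
GROUP BY year
ORDER BY row_count DESC
"""
for row in client.query(skew_sql).result():
    print(row.year, row.row_count)

# Count fully duplicated rows by serialising each whole row to JSON and comparing.
dupes_sql = f"""
SELECT COUNT(*) - COUNT(DISTINCT TO_JSON_STRING(t)) AS duplicate_rows
FROM `{SOURCE_TABLE}` AS t
"""
print(next(iter(client.query(dupes_sql).result())).duplicate_rows)
```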

[00:02:20] because this is 2025 data. You'll be very familiar with this kind of diagram if you've been working in data science or in data preparation for machine learning modeling. You can see all of the various data types. You can select specific rows if you need to, and you can select a specific column and get suggestions from Gemini based specifically on that column.

[00:02:46] You can also go into the schema and see what's going on there. So, for example, say we wanted to drop a particular column. Let's drop temp. It's probably the worst one we could drop, but we'll drop it. If I click here and click drop, obviously I'm not dropping it from the source table. What I'm doing is dropping it from my output table, the table that will exist after I've completed my preparation steps. You can see this if you go into our applied steps now.
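
In SQL terms, that drop step is roughly a SELECT * EXCEPT over the source: the source table is never modified, the column simply doesn't make it into the output. A small sketch, with placeholder project and table names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project
SOURCE_TABLE = "bigquery-public-data.example_dataset.example_table"  # placeholder

# Exclude the temp column from the output without touching the source table.
rows = client.query(f"SELECT * EXCEPT (temp) FROM `{SOURCE_TABLE}`").result()
print([field.name for field in rows.schema])  # 'temp' no longer appears
```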

[00:03:19] We are reading the data and then we are dropping that column, and that's part and parcel of our transformation steps towards an output. You'll also see an error at the top saying you have a problem here: this isn't going anywhere yet. What you can do is specify an output. So at the bottom here, if I click here and click destination, I can put this into a dataset.

[00:03:44] So let's create a new dataset called, for example, sample_prep. Leave it in there, add the tags, and set all the various specific settings we would expect when creating a dataset normally. Let's create that dataset, sample_prep, and then I can chuck the output into a table, which I can create as well.

[00:04:09] I'll call it prep_data, save, and then what we should see is: read this data, drop this column, write this data into this new table. So you can see how this would be useful for running exactly the same preparation steps over a source when you're getting it ready for machine learning models, et cetera.
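
For comparison, here is roughly the same read, drop, write flow scripted with the Python client: create the output dataset, then run the query with a destination table set. The dataset and table names mirror the video (sample_prep, prep_data) but are otherwise placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project
SOURCE_TABLE = "bigquery-public-data.example_dataset.example_table"  # placeholder
DEST_DATASET = "your-project-id.sample_prep"
DEST_TABLE = f"{DEST_DATASET}.prep_data"

# Create the output dataset if it does not already exist.
client.create_dataset(bigquery.Dataset(DEST_DATASET), exists_ok=True)

# Run the "read, drop temp, write" pipeline into the destination table.
job_config = bigquery.QueryJobConfig(
    destination=DEST_TABLE,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(
    f"SELECT * EXCEPT (temp) FROM `{SOURCE_TABLE}`", job_config=job_config
).result()
print(f"Wrote prepared data to {DEST_TABLE}")
```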

[00:04:32] A couple of other things you can do in here: you can filter things out. So if we go back to the data, let's actually go back to the suggestions in the first instance. Now, I'm not saying any of these transformations are correct or incorrect, I'm just doing them as an example here. So let's just apply this transformation, changing year to an INT64.

[00:04:57] So let's apply that, just so we can see an example of it appear within the applied steps. Once you've applied it, you'll see... well, sorry, if you're filtering, you'll see the number of rows drop. What we should see with our year is that it is now an integer: converts column year from type STRING to INT64.
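
That cast step, written out as SQL, is essentially a SAFE_CAST wrapped in a SELECT * REPLACE. A sketch, again with placeholder names (note the client reports the type as INTEGER, BigQuery's legacy name for INT64):

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project
SOURCE_TABLE = "bigquery-public-data.example_dataset.example_table"  # placeholder

# Replace the STRING year column with an INT64 version; SAFE_CAST gives NULL
# instead of an error if a value isn't numeric.
sql = f"""
SELECT * REPLACE (SAFE_CAST(year AS INT64) AS year)
FROM `{SOURCE_TABLE}`
"""
rows = client.query(sql).result()
year_field = next(f for f in rows.schema if f.name == "year")
print(year_field.field_type)  # INTEGER (BigQuery's legacy name for INT64)
```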

[00:05:20] Now our column is an INT64. You can also filter. Let's say we want to filter to a specific month. Within most of these things in here, you can just ask it in natural language. So if we say "filter mo to only 01"... it's not always perfect, it doesn't always like it. I think they've tried to be really strict with what it outputs.

[00:05:52] If it doesn't think it can do it, it just says it can't do it, which is probably better than it taking a stab in the dark and causing issues. So if I apply that, it will apply the filter to the data. You can see we've got far fewer rows now, and all of our months are 01.
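
The natural-language filter boils down to an ordinary WHERE clause. Here is the cast and the month filter combined into one query, with the same placeholder names as before:

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project
SOURCE_TABLE = "bigquery-public-data.example_dataset.example_table"  # placeholder

# Cast year to INT64 and keep only January rows (mo is assumed to be a
# zero-padded STRING month column, as in the video).
sql = f"""
SELECT * REPLACE (SAFE_CAST(year AS INT64) AS year)
FROM `{SOURCE_TABLE}`
WHERE mo = '01'
"""
result = client.query(sql).result()
print(f"{result.total_rows} rows remain after filtering to mo = '01'")
```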

[00:06:11] And then within our applied steps we can see another applied step. So as you can imagine, you can build out all these different steps: you can join the data with other data, you can add in other transformations, you can change columns to calculated values, et cetera. All the kinds of steps you would potentially do in a data transformation.
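
To give a flavour of how those steps stack up, here is one hypothetical way a calculated column and a join might look once expressed as a single query. The lookup table, the station columns, and the Fahrenheit-to-Celsius conversion are all invented for illustration, not taken from the video.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project
SOURCE_TABLE = "bigquery-public-data.example_dataset.example_table"  # placeholder
LOOKUP_TABLE = "your-project-id.sample_prep.station_lookup"          # placeholder

# Stack the earlier steps with a calculated column and a join against a
# hypothetical lookup table. Column names (temp, stn, station_id, station_name)
# are invented for illustration.
sql = f"""
SELECT
  s.* EXCEPT (temp),
  (SAFE_CAST(s.temp AS FLOAT64) - 32) * 5 / 9 AS temp_celsius,
  l.station_name
FROM `{SOURCE_TABLE}` AS s
LEFT JOIN `{LOOKUP_TABLE}` AS l
  ON s.stn = l.station_id
WHERE s.mo = '01'
"""
print(client.query(sql).result().total_rows)
```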

[00:06:31] What you can then do is schedule this, so the transformation runs on a regular basis. Once you've got everything right and the output is exactly as you want it, you can schedule it to run daily, and then potentially you've got all the actions that come off the back of that cleaned data appearing in the necessary dataset and table.
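
Data Preparation handles the scheduling itself from the UI. If you wanted a purely programmatic analogue for the same cleaned output, one option is a BigQuery scheduled query created through the Data Transfer Service client (the google-cloud-bigquery-datatransfer package). A sketch following that pattern, with placeholder names throughout:

```python
from google.cloud import bigquery_datatransfer

project_id = "your-project-id"  # placeholder
transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# A scheduled query that rebuilds the prepared table once a day.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="sample_prep",
    display_name="Daily prep of example table",
    data_source_id="scheduled_query",
    params={
        "query": (
            "SELECT * REPLACE (SAFE_CAST(year AS INT64) AS year) "
            "FROM `bigquery-public-data.example_dataset.example_table` "  # placeholder
            "WHERE mo = '01'"
        ),
        "destination_table_name_template": "prep_data",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(project_id),
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {transfer_config.name}")
```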

[00:06:55] Have a play with that and see what you think. Thanks for watching, and I hope that was useful. We've just wrapped up going through all the new BigQuery and Google Cloud announcements from Next '25, so we'll certainly be delving into some of those individual features that were announced, the agents, the new tooling, the new branding for BigQuery, in future videos.

[00:07:18] So keep an eye out for that. If you've got any other ideas for videos and things you'd like me to cover, please give us a shout. Please like and subscribe, because it helps us grow the channel and keep building out more of these videos. And yeah, catch you on the next one. Thank you.
