Behind the Cloud: Dataform, what is it and why does it matter?

In this episode of Behind the Cloud, Matt discusses Dataform: what it is and why it matters.

Transcript

Cloud Data Warehousing

[00:00:00] Matt: Welcome to Behind the Cloud. Today we’re exploring Dataform, but first a little bit of scene setting. Over the past number of years, cloud computing, specifically cloud data warehousing, has advanced significantly. Huge amounts of data can be queried in seconds. The scalability of the platforms is near infinite from both a performance and a cost perspective.

[00:00:22] Matt: They’re managed services, meaning you don’t have to worry about infrastructure or software updates. Your only concern is data and providing insights for your organisation. Best of all, data warehouses are SQL based, a skill that is abundant in most data teams, and all of this makes data warehouses powerful transformation engines.

[00:00:42] Matt: This has facilitated a trend away from ETL, or Extract, Transform, Load, where data is transformed before being loaded into a data warehouse, to ELT, or Extract, Load, Transform, where data is transformed once it’s already in the data warehouse. Great, you might think. So, let’s get data teams running SQL queries and transforming data left, right and centre.
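
The ELT pattern described above can be sketched in a few lines. This is a minimal illustration only: an in-memory SQLite database stands in for a cloud data warehouse, and the table and column names (`raw_orders`, `orders_by_country`, `amount_pence`) are hypothetical. The point is the ordering: raw data is loaded first, and the transformation is a SQL statement run inside the database afterwards.

```python
import sqlite3

# In-memory SQLite stands in for a cloud data warehouse (illustrative assumption).
conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows land in the warehouse untransformed.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, country TEXT, amount_pence INTEGER)"
)
raw_rows = [(1, "gb", 1250), (2, "GB", 300), (3, "fr", 999)]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: SQL executed inside the warehouse builds a cleaned table
# from the raw one (normalised country codes, pence converted to pounds).
conn.execute(
    """
    CREATE TABLE orders_by_country AS
    SELECT UPPER(country) AS country,
           SUM(amount_pence) / 100.0 AS revenue_pounds
    FROM raw_orders
    GROUP BY UPPER(country)
    """
)

result = conn.execute(
    "SELECT country, revenue_pounds FROM orders_by_country ORDER BY country"
).fetchall()
print(result)
```

In an ETL pipeline the upper-casing and unit conversion would happen in application code before the insert; here the raw table is kept as loaded and the cleaned table is derived from it with SQL.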

Best Practices in Data Engineering

[00:01:06] Matt: But therein lies a bit of a problem. Data engineering and software engineering teams have for years been developing sets of best practices that ensure that applications and pipelines perform the way they should be performing. They can work on sets of code in development environments in confidence, knowing that they will not be affecting any published front-facing solutions.

[00:01:27] Matt: They can work in tandem on the same sets of code, knowing that tools like GitHub will manage versioning and merging.

[00:01:33] Matt: They can set up unit tests so that they know of any issues quickly and can fix them before they’re released. If the keys to the transformation kingdom are to be handed over to the analysts and the SQL experts, they also need to adopt these best practices.

Introduction to Dataform

[00:01:48] Matt: Failure to do so is going to quickly lead to a lack of trust in inevitably flawed and unavailable data. Enter Dataform. Back in 2020, Google acquired Dataform, and it has since integrated it directly into the UI of BigQuery. In Dataform, managing SQL queries and transformations is all streamlined into one platform. 

[00:02:13] Matt: Features like version control and automated scheduling are all complemented by checks that validate data models and maintain data quality. These elements make Dataform a powerful tool for ensuring reliable analytics. All of this sits in GitHub repositories that allow for different analysts to work on different branches of the same transformations, all without affecting the current version that is serving reports or models on the front side.
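
One common way to express the data quality checks mentioned above is as queries that select the rows breaking a rule, so that a check passes when its query returns nothing. The sketch below illustrates that idea in Python with SQLite as a stand-in database. It is not Dataform's own syntax, and the helper `failing_rows`, the `customers` table and the check names are all hypothetical.

```python
import sqlite3


def failing_rows(conn, assertion_sql):
    """Run an assertion query; any rows it returns are rule violations."""
    return conn.execute(assertion_sql).fetchall()


# In-memory SQLite stands in for the warehouse (illustrative assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "c@example.com")],
)

# Each check is a query that selects the rows violating the rule.
checks = {
    "email_not_null": "SELECT * FROM customers WHERE email IS NULL",
    "customer_id_unique": (
        "SELECT customer_id FROM customers "
        "GROUP BY customer_id HAVING COUNT(*) > 1"
    ),
}

failures = {name: failing_rows(conn, sql) for name, sql in checks.items()}
failed_checks = sorted(name for name, rows in failures.items() if rows)
print(failed_checks)
```

Running checks like these after every transformation means a broken table is flagged before it reaches a report, which is the same role unit tests play for application code.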

Dataform’s Integration with BigQuery

[00:02:37] Matt: JavaScript can be added directly into SQL transformations to layer in more programmatic approaches to data manipulation and increase efficiency. There is obviously a learning curve for the effective use of Dataform as it requires SQL proficiency for query writing, JavaScript knowledge for scripting, and familiarity with GitHub for version control.
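
To give a feel for what a programmatic approach to SQL buys you, the sketch below generates a repetitive query from a list instead of writing each column by hand. Dataform itself uses JavaScript for this; Python is used here only to keep the illustration runnable on its own, and the function `build_pivot_sql`, the `sales` table and its columns are hypothetical.

```python
import sqlite3


def build_pivot_sql(table, key, category_column, value_column, categories):
    """Generate one SUM(CASE ...) column per category instead of hand-writing each.

    The inputs are trusted constants defined in code, not user input,
    so plain string formatting is acceptable for this sketch.
    """
    columns = ",\n    ".join(
        f"SUM(CASE WHEN {category_column} = '{c}' THEN {value_column} ELSE 0 END)"
        f" AS {c}_total"
        for c in categories
    )
    return f"SELECT {key},\n    {columns}\nFROM {table}\nGROUP BY {key}"


# In-memory SQLite stands in for the warehouse (illustrative assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, channel TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "web", 10), ("north", "store", 5), ("south", "web", 7)],
)

channels = ["web", "store"]
sql = build_pivot_sql("sales", "region", "channel", "amount", channels)
pivot = conn.execute(sql + "\nORDER BY region").fetchall()
print(sql)
print(pivot)
```

Adding a new channel then means adding one item to the list rather than editing the query in several places.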

[00:02:59] Matt: But all of these things are often simple to pick up and already available within many data teams. In summary then, the evolution of cloud data warehousing and the emergence of tools like Dataform represent a paradigm shift in data management and analytics. Dataform’s integration into BigQuery and its user-friendly interface bridge the gap between traditional data warehousing and modern data engineering best practices.

Skill Requirements for Using Dataform

[00:03:29] Matt: It empowers data teams by providing robust version control, automated scheduling, and data validation features, which are all essential for maintaining high-quality, trustworthy data. Its support for JavaScript within SQL queries offers an advanced level of flexibility and power in data transformations.

[00:03:49] Matt: The journey towards adopting all of these best practices and tools isn’t just a tech upgrade, but it’s a strategic move towards more efficient, reliable, and insightful data and analytics. Now we’ll go into a lot more detail on how to use Dataform in a future video, but I hope this has given you some insight into the why and the what.

[00:04:08] Matt: If you have not yet subscribed, please do. We have a very long list of ideas in the pipeline of what we’re going to do for videos and I hope that all of this will help you on your GCP journeys. Thanks!

Written by

Matthew is the Engineering Lead at Measurelab and loves solving complex problems with code, cloud technology and data. Outside of analytics, he enjoys playing computer games, woodworking and spending time with his young family.