#125 Proactive pipelines – the power of shifting left (with Chad Sanderson at Gable)

In this episode of The Measure Pod, Dara and Matt are joined by Chad Sanderson from Gable.ai, a seasoned expert in the data space with a diverse background ranging from journalism to data science. Chad shares insights from his extensive career, which includes roles at large corporations like Subway, Oracle, and Microsoft, as well as at smaller startups. They discuss the concept of shifting left and how this approach aims to improve data quality, reduce costs, and accelerate data delivery by addressing issues earlier in the data lifecycle.
Show notes
- OpenAI ChatGPT agent
- Google Analytics MCP server
- Import Reddit ads cost data into Google Analytics
- More from The Measure Pod
- Why Apple is playing it slow with AI
- Chad Sanderson and Gable.ai
Share your thoughts and ideas on our Feedback Form.
Follow Measurelab on LinkedIn
Transcript
You need to figure out what are like the 5 or 10 most valuable use cases of data at your business.
Chad
You can’t just unleash AI onto a code base and ask it to start doing things because it will make changes that have horrific impacts somewhere else.
Chad
[00:00:00] Dara: Hello, and welcome back to the Measure Pod. I’m Dara, I’m joined as always by my co-host, Matthew. Hello, Matthew.
[00:00:10] Matt: Hello, Dara.
[00:00:15] Dara: I was trying to think of a way of, actually, I want to kind of keep it interesting and kind of introduce you in different ways, but I just, I can’t, I’m just, I’m just bad at improvising.
[00:00:32] Matt: You don’t need to, AI can do that for you before every episode. Just type in a new introduction for me, and that’s how you introduce me each week.
[00:00:41] Dara: Quite a while back, a, a previous co-host on this podcast actually tried to replace me. He who shall not be named, he, he who shall not be named, this is a test to see if he’s listening.
[00:00:56] Dara: Let’s see. But yeah, he tried to replace me, as if you could, with AI. No chance. No. No chance.
[00:01:02] Matt: No. I think the tone and the depth of knowledge, perhaps you could, but I’ve not nailed the Irish accent yet.
[00:01:11] Dara: I actually feel quite good that you used the word depth in front of knowledge when talking about me.
[00:01:16] Matt: Yeah. I was, I was insinuating that he lacked it. But if you took, if you took it as a compliment, I, no, listen, I
[00:01:21] Dara: listen, I’ve gotta take the compliments where I can get them. No, you’re, you’re irreplaceable. Of course. Oh, obviously, obviously. So, so listen, no pressure. But we’re starting to garner a little bit of attention and admiration.
[00:01:37] Dara: I would go as far as to say for our unique, I’m going to call it unique, style of journalism. Odd news, you mean? Yeah. Specifically the news section. Salty News. Yeah. Yeah, yeah. Salty news. Yeah. I’ve had, I’ve received compliments, right. Yeah. How many? They’re not, they’re not from my mom because she doesn’t believe in compliments and she hates the podcast.
[00:02:01] Dara: She told me it’s rubbish. But yeah, from, from other people they’ve said, oh yeah, really? You know, really good. Interesting news. I mean, I don’t believe a word of it they’ve said, but it’s, it’s interesting.
[00:02:12] Matt: Is it Claude and Brian? And it might have been various other AIs. No, they can be quite cruel as well.
[00:02:20] Matt: Actually. I find them very sycophantic a lot of the time. Depending on which one I’m talking to.
[00:02:26] Dara: I must not be using them right.
[00:02:30] Matt: Shout at me. Call me a naughty boy.
[00:02:34] Dara: I think. Tell me I have depth of knowledge. So do we have any news? Okay. Is there any news? Has anything actually happened? I mean, nothing has happened in AI, that’s for sure.
[00:02:45] Matt: Well, yeah, things have happened in AI. I think maybe, you know, if people are liking insulting news, then we just stick with the fact that most of our news is about AI. Not, not to worry about the fact that less, less of it is about marketing analytics, but, it’s, it’s not really our fault, is it? It’s intertwined.
[00:03:03] Dara: We, we, we just, we just report on the news that is available. It’s not, you know, if, if, if, if we’re not doing enough marketing analytics news, then the marketing analytics world just needs to create more.
[00:03:15] Matt: Yeah. Get some controversies. Isn’t the, I don’t know, the head of [insert marketing company here] going to have an affair and hug someone in front of a Coldplay concert or something, just to drum up something that we can, we can go onto.
[00:03:31] Matt: Yeah.
[00:03:32] Dara: Yeah. Ah, damn, there. That’s an example. There goes my one news article for the week. You’ve stolen my thunder.
[00:03:39] Matt: Sorry. Yes, there is news. I have some news.
[00:03:41] Dara: Do you want to kick off, off with the news? I do too. We, I, I also, you know, do work and produce news. It’s not, it’s not all you doing all this research. I mean, it might be mostly you, but, you know, I occasionally find the odd, silly news article.
[00:03:58] Matt: You’ve got a, somebody’s complimented you and now, now this is your most important thing. Just a pencil behind your ear and a hat that says Press. Yeah, OpenAI. I’ve heard of it. ChatGPT. Yes. ChatGPT Agent. So this is like, they’ve released this new agent within the ChatGPT interface, which can go and do tasks for you on its own computer.
[00:04:22] Matt: So it can go and check an email and then look at your calendar and pull out, create meeting notes for you, or create spreadsheets. And so you can just kind of go and use a browser and a, and a computer to perform various tasks for you in an agentic way. I can’t see it yet in my ChatGPT instance, even though it’s supposed to be out.
[00:04:44] Matt: I dunno if it’s a UK thing, that sometimes happens, that we lag behind a bit. But yeah, it’s kind of an evolution of things like Claude’s computer use, and ChatGPT had something similar as well. I can’t remember the name of it, but it’s, it’s kind of combining deep research, the, the more advanced models, and using a computer to be agentic.
[00:05:12] Dara: I have so many, so many thoughts, so many questions. Mainly I’m just thinking about a little character with a tiny computer going around doing things for you. ’Cause when I hear it has a computer, all I can think of is a physical computer, but that’s probably not what you’re talking about really.
[00:05:30] Matt: Is it like a CRT monitor as well and like a gray computer like from the nineties? I mean, is there any other kind of computer?
[00:05:37] Dara: Not in my mind. Floppy disc drives, the worst. Yeah, I, I don’t know how I feel about it. With a lot of this stuff, I feel kind of slightly a mixture of excited and also terrified that it’ll, it’ll go off-piste and, hmm, I dunno, buy loads of tungsten cubes that I didn’t want, and I’ll be broke. How do you, that’s, that’s always the risk.
[00:05:58] Dara: Ah, I mean, it’s so tempting, isn’t it? It’s like, here’s this thing that can go and do stuff for you. You’re like, sign me up. Mm. Here’s my credit card information. Here’s all my permissions. Just, just knock yourself out. Go, go nuts. Do everything for me.
[00:06:15] Matt: I, I find it hard with all these things as well. It’s like when we were talking about all the stuff from Next, ages ago now. How long ago were we talking about that? It feels like seven years ago.
[00:06:26] Matt: It feels like 90% of the stuff is here and we can’t actually touch it, like Agentspace. I can’t touch that and see. I don’t want to talk about it, have any talks anymore. No, that’s it for me. Yeah.
[00:06:35] Dara: I don’t want to talk about it. Don’t care.
[00:06:37] Matt: Don’t want to, didn’t want to access it anyway. No, we sound like sore losers. I, I don’t know if it’s useful or not at the moment.
[00:06:45] Matt: I’m sure somebody in the US, one of our tens of US fans, will tell us if they’re playing with it, but at the minute it exists, and maybe we’ll get it, and maybe in two weeks’ time I’ll be saying, wow, this is mind-blowing, it’s bought a ton of tungsten that I’m selling for profit and my life has changed. But life is good.
[00:07:06] Dara: I don’t know. Yeah. Yeah. If you are listening and you have access to it, can you let us know? But can you prove in your feedback that it’s actually you telling us and not your agent? No, you can’t. We’ll never know. Anyway, yeah, it’d be nice. We need to remember these things that we’ve talked about and complained about not having access to.
[00:07:27] Dara: So when we actually do have them, we can do a follow up and say, now I have it. It’s rubbish. AI’s not stealing anyone’s job. Let’s move on.
[00:07:35] Matt: Yeah, I’ll write that down to remember to talk about it next time. Next, in AI-related news: there’s, there’s been a Google Analytics MCP server released.
[00:07:51] Matt: It’s not, it’s like, it’s, it’s an open source thing, and I think Google are managing the, managing the contributions to that, the Git repository, the Git project. So it’s kind of managed by Google, but open source. MCP, for those who don’t know, I could see Dara Googling it now, is Model Context Protocol.
[00:08:11] Matt: So it, it’s, I could have shown you that. Yeah. It’s an easy way, a standardized way of LLMs interacting with tool sets. So you create these little servers that have a set of common functions that the LLM can go and, and run. So one’s been made for Google Analytics, so you can kind of chat to your chat interface and it will go and tell you what’s inside your Google Analytics platform, or how many users you’ve got, or, I assume, be able to alter settings, kind of as a conversational layer over the top of the admin and, and data settings of your Google Analytics platform.
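To make that a little more concrete, here is a rough sketch of what a tool on one of these servers can look like, assuming the official MCP Python SDK (pip install mcp) and its FastMCP helper. The tool name, parameters, and stubbed body are invented for illustration; this is not the actual Google Analytics MCP server’s code.

```python
# A minimal, hypothetical MCP server exposing one Google Analytics-flavoured tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ga4-demo")

@mcp.tool()
def get_active_users(property_id: str, start_date: str, end_date: str) -> str:
    """Report active users for a GA4 property over a date range."""
    # A real server would call the GA4 Data API here (see the sketch further
    # down) and return the result for the LLM to read.
    return f"activeUsers for properties/{property_id}: <stubbed>"

if __name__ == "__main__":
    # Serves over stdio so an MCP-aware client can discover and call the tool.
    mcp.run()
```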
[00:08:49] Matt: So it’s pretty interesting. I was playing with Stape’s, I think we mentioned that last week, or two weeks ago on the one that Dennis was on. I was playing with Stape’s MCP server and it, it’s really interesting just to get an idea of what’s contained there and say, oh, you know, what’s going on in this Google Analytics platform?
[00:09:08] Matt: What’s going on in this Google Tag Manager platform? And it can just kind of give you the lowdown and give you a good jumpstart in terms of context. So I’ve not played with the Google Analytics one yet, but it looks interesting.
[00:09:18] Dara: So hang on. Are you saying that Stape’s one lets you do what the Google Analytics one will let you do, or did I misunderstand that?
[00:09:27] Matt: The Stape one is for Google Tag Manager. Okay. And the Google Analytics one is for GA4. Yes. Google Analytics.
[00:09:33] Dara: Yeah. I was going to ask, is it, so just to make sure I’m understanding, it gives you a way, within certain parameters, to actually make changes without doing it through the,
[00:09:42] Matt: I think essentially anything you could, you could access and control with the APIs, the various APIs that connect to these things.
[00:09:48] Matt: So there’s, there’s a couple of different APIs that interact with Google Analytics. There’s like the Admin API and the Data API, and they kind of do what they say on the tin: Admin API, you can mess with settings; Data API, you can look at what’s in there. So yeah, a lot of them, and it depends on what the developers put in there.
[00:10:06] Matt: Like maybe they’ve just created an MCP server that just lets you read and interrogate things. Or they may have also included the functions to delete and create things. So with the Stape one, you can, the one example I was playing with, as I said, was: go create a new workspace in this container and create all these tags in it.
[00:10:30] Matt: And it did go away and do all of that. So it was creating and, and building things. So be careful because you could be, you could be in fact affecting things you don’t know about if you’re not careful with the language that you’re saying to these things.
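For a flavour of the Data API side of that, here is a small sketch of pulling a GA4 report directly in Python, the kind of call an MCP tool would wrap. It assumes the google-analytics-data client library and Application Default Credentials with access to the property; the property ID is a placeholder.

```python
# A rough sketch: active users by country for the last seven days.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

def active_users_by_country(property_id: str) -> None:
    client = BetaAnalyticsDataClient()
    request = RunReportRequest(
        property=f"properties/{property_id}",  # e.g. "properties/123456789"
        dimensions=[Dimension(name="country")],
        metrics=[Metric(name="activeUsers")],
        date_ranges=[DateRange(start_date="7daysAgo", end_date="today")],
    )
    for row in client.run_report(request).rows:
        print(row.dimension_values[0].value, row.metric_values[0].value)
```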
[00:10:46] Dara: And look, with a lot of these AI releases, it’s basically just highlighting to me all the things that I used to do that took ages, that now just don’t need doing.
[00:10:57] Dara: I, I dunno whether I’d feel bitter about that, or actually kind of happy that I did those things at a time when there wasn’t AI. ’Cause if there was, I probably would’ve had to figure out something new, some new skills. But yeah, all these things that used to just take you, I mean, I remember going through GA, I’m going to bore our listeners now, a little trip.
[00:11:14] Dara: This is like old man Dara shouting at the clouds. But I used to, you know, that was a big part of it, you know, going through especially really big accounts, when you’d have to go through and make loads of admin changes, and you’d be doing it. And then there was obviously the Management API, but you know, a lot of this stuff used to be done manually, and now, oh, they don’t know how lucky they have it.
[00:11:31] Matt: The kids of today. Yeah. I think that’s, I think I find it really useful. Like I said, a brand new account I have no idea about, I’ve never touched, never looked at before, and I just want a quick way of chatting to someone that knows everything about that account. And yeah, LLMs are
[00:11:51] Matt: perfect for that. Like how, how’s consent working here? Oh, they’re using, they’re using OpenAI, I’m sorry, they’re using OneTrust, and they’ve got Consent Mode v2, and this is set up and that’s set up. And just quickly, I’ve got that knowledge without having to sort of unpick it myself, like, like I did in the old days.
[00:12:09] Dara: Yeah, the good old days. Yeah. Okay. Google Analytics MCP server. Watch this, watch this space. Yes.
[00:12:15] Matt: Watch this space. And the final one I had was actually about Google Analytics again, not AI-related, believe it or not: you can now import Reddit Ads cost data directly into, into Google Analytics.
[00:12:33] Matt: So this is part of them expanding out, being able to interact and bring in data from other ads platforms. That was interesting.
[00:12:44] Dara: Yeah, it is. I must admit, I didn’t even know you could do ads on Reddit. Learn something new every day. You can. And if you want to, you can
[00:12:53] Matt: get the ads cost data into GA4, which, if you are running ads on Reddit, you would definitely want to do.
[00:12:59] Matt: There you go. Win-win. I had some AI news. Can’t remember what it is now. No, I do. I mean, the Meta one. I’m going to be quick, ’cause we, we’ve dwelled on the good news that you, you know, carefully researched and pulled together. But the, the continuation of the, the Zuckerberg Meta thing is just off the scale.
[00:13:26] Matt: All this money that they’re spending, 15 billion I read, and it could be up to a hundred billion they spend to try and win this AI arms race, this superintelligence race. But there’s just a lot of stuff about that at the moment. How, how much they’re, they’re going big on this, even bigger than their, well, they’re, yeah.
[00:13:43] Matt: Meta is going even bigger than their previous success, the Metaverse. I mean, that was a gamble that really paid off. So I’m not convinced. I’m not in the Metaverse right now. I think it works so well we don’t realize we’re in there. Yeah, yeah. But yeah, Zuck’s gone nuts. We call ’em Zuck on this podcast ’cause we’re close friends.
[00:14:04] Matt: But yeah, he’s, he’s on a spending frenzy at the moment. And I read an interesting article. I don’t think this is really news, but it was an interesting perspective, which was kind of looking at why Apple has been so slow to get involved. And it was saying it could just be as simple as: they’re slow and they’re going to get left behind.
[00:14:25] Matt: But the, the article was suggesting that maybe they actually don’t think the technology that’s out there now, I guess particularly around hardware, is good enough. And they’re going to wait until they have something really, really good and then they’re going to release that. So time will tell.
[00:14:40] Matt: Yeah. It’s interesting, isn’t it, because there’s a lot of people making gigantic bets, like Google, like Microsoft, like Meta, like Salesforce, like pretty much any large technology company, pretty much betting the farm on it. And you kind of think, they must, hopefully they all know something, or they can see something in the future that means they’re confident enough to, to put large portions of their company at stake.
[00:15:11] Matt: I mean, think of Google, basically upending the entire main part of its business, search ads revenue, and moving to AI-related searches. They have to know something. And I don’t know, it just smacks a bit of Apple, like, well, you know, this is what, traditionally, what Apple do, isn’t it?
[00:15:32] Matt: They wait a little bit. A product category will appear, say a foldable phone or, I don’t know, like Siri on their phone originally. And they’ll wait a few years and generations to allow them to perfect it internally, then release something that they think is better than everyone else’s, and they’ll rebrand it as something completely different.
[00:15:53] Dara: And, and it usually is. Exactly. It usually is. Exactly. But this is fast.
[00:15:57] Matt: This is so much faster than any of that before. And it does feel like maybe they’re playing 3D, the 4D chess. Or maybe they are.
[00:16:07] Dara: We just can’t comprehend the, you know, the strategic intelligence of it.
[00:16:12] Matt: I mean, they got, I mean, Apple’s glass in itself, that Liquid Glass nonsense that they’ve banged on about for the new UI, that’s, it felt like that’s all their last big developer announcement was about.
[00:16:26] Matt: I was thinking, shouldn’t you be talking about AI?
[00:16:28] Dara: AI? Yeah.
[00:16:29] Matt: No, but look at this. It’s liquid glass.
[00:16:32] Dara: That’s quite impressive. Yeah.
[00:16:35] Matt: Yeah. Wow. Cool. It’s glass, but it’s really, I think we had that in Windows Vista.
[00:16:38] Dara: Yeah. Okay. That’ll, that’ll do. That’ll do for news this week. I think we’re joined on the podcast today by Chad Sanderson.
[00:16:46] Dara: From Gable. Matthew discovered Chad, I guess, through an article that he wrote, which was quite inspiring, around the shift-left paradigm. I’m going to hand it over to you, Matthew, to talk a little bit more about what you found so inspiring about that and, and what led to us getting Chad on the podcast.
[00:17:03] Matt: Yeah, I, I mean, I’d like to claim I discovered him and, and all of his successes after that were down to me. But, but I mean, yeah, that was a big, that was a big word I used. What I think it means is, I, I came across this article, the Shift Left Manifesto. Companies and teams need to really begin to understand and, and track and agree upon data governance policies on the far left of their, of their processes, right.
[00:17:32] Matt: Where it’s being, where the data’s being produced, to make sure that downstream, when it’s reaching warehouses or other data products, it’s, it’s correct, and any issues that arise from it, or any breakages or any, any changes in schema and things, are quickly identified, or that people are considering it, developers are considering it, when they’re making changes to sites.
[00:17:57] Matt: So, I’ll butcher an explanation of, of what the Shift Left Manifesto is, and, and, and Chad does a really good job of explaining it. So I think without further ado, we should just jump in and listen to what Chad has to say.
[00:18:12] Dara: A very warm welcome, to the Measure Pod, to Chad Sanderson from Gable.
[00:18:17] Dara: So Chad, firstly, welcome. Hello. Thanks for agreeing to join us and, and come on the podcast and talk to us today.
[00:18:23] Chad: Great to be here, and thanks for having me.
[00:18:25] Dara: Just before we kind of dive in, ’cause I know I, I think Matthew’s going to be really keen to dive into what we’re going to talk about today, but if you could just
[00:18:32] Dara: tee things up by giving our listeners a bit of an intro into, into you personally, a little bit about your background. But also, feel free to talk about Gable a little bit as well, just to give us a bit of background and, and kind of set up the conversation that we’re going to have today.
[00:18:46] Chad: So my name is Chad Sanderson. I’ve been in the data space for most of my tech career. Prior to that, I did have a totally different life where I was a journalist, sort of traveling around Southeast Asia and covering a bunch of things. But since I’ve been in the data space, I’ve done a bit of everything. So data science, sort of analytics, experimentation, machine learning, data engineering.
[00:19:10] Chad: I’ve run data engineering orgs, data platform teams, machine learning and sort of AI platform teams. And then in terms of the companies I’ve worked at, I’ve, I’ve also done a bit of everything. So I’ve worked from sort of small, your more traditional small businesses, up to very large corporations, like Subway or Oracle and, and Microsoft, to startups.
[00:19:35] Chad: So I did that for a pretty long time. And then the last job I had, I was running the data platform team and machine learning platform organization at a late-stage freight technology startup in Seattle called Convoy. And when I was there, I was trying to take what I thought of as more traditional approaches to data governance and data quality.
[00:19:59] Chad: And I was not having a lot of success in terms of adoption. So my team, which is the data engineering org, essentially was constantly being hit by outage-related issues and tickets and just all sorts of things that really made our work a lot less about doing interesting data-related initiatives and more about solving bugs, fixing problems, and putting out fires.
[00:20:23] Chad: And that wasn’t good for the business. And so we, we sort of set out on this journey that ended up taking a few years, to say, like, this, this is not normal, right? Like other engineering teams within the company, they don’t have to constantly advocate for their existence. They don’t have to constantly be asked, like, what’s, what’s the ROI of adding a new headcount?
[00:20:44] Chad: They’re just sort of doing valuable things, and that’s the world that we wanted to get into. And we realized that there was sort of this fundamental problem with data management, and really the core of that fundamental problem, which is what I think we’re going to get into and is the heart of shift left, is that you, you can’t have a downstream team like data engineering
[00:21:04] Chad: be accountable and responsible for quality and governance if the data is produced by someone else. That’s kind of the core underlying thesis that we realized. And so the mission of Gable is to help those upstream teams that are generating all the data actually take ownership of the data that they’re producing, and create incentives for them to do that that go beyond, well, we need to make the life of the data engineers a little bit better.
[00:21:28] Matt: I, I wonder if we can contextualize that a little bit, because I think, like you said, data teams or data engineering teams kind of, not lag behind, but the maturity is a little bit behind, say, the software engineers that have been through it and developed all these processes and these roles and this way of working that solve some of these problems.
[00:21:48] Matt: Do you think you could sort of set the context for us a little bit? Like, what, what has software engineering done to solve some of these problems that we are noticing in the data engineering world, or in the data world generally?
[00:22:01] Chad: I think that’s a great question. So I’ll, I’ll sort of, I’ll, I’ll answer that, but I’ll also sort of, preface it by saying I think that a lot of the problems that software engineers have specifically when it comes to data management are the same problems that data teams have actually put the right systems in place to solve.
[00:22:22] Chad: They just don’t have the ability to solve them because the software engineers aren’t following those same practices. But on the flip side of that, you have had, I would say, cost centers that have existed in the past, like security, like DevOps, that are, are like, they had the same pain that data teams do now, where there were small teams that were thinking mainly about quality, right?
[00:22:47] Chad: How do you do something the right way? How do you facilitate, like how do you write software that is secure across the organization? And when you have like a small number of people responsible for this sort of definitionally, you become a bottleneck. You kind of have to be, you can’t release your software unless you’re following the rules of the security team.
[00:23:07] Chad: And when that happens, you sort of become like a glorified crossing guard in the company and you’re not adding value. And, and many times this puts you in direct conflict with the teams that are making all the money in the business, right? In order for me to ship some features that I think are going to help us hit our sales targets, I have to go through the crossing guard.
[00:23:29] Chad: And so you’re actually my enemy, right? You’re actually the enemy of the company, and that’s just not a good position you want to be in for your career. And, and so what a lot of these, like, cost centers, like DevOps is another one, or the, you know, people who are focused on releases and deployments and things like that,
[00:23:51] Chad: what they decided to do was this, this idea of shifting left. That is to say: we need to get the teams that are responsible for making the money in the company, so the application developers who are writing the software that’s generating all the value, to care about the quality, right? They need to care about deployment.
[00:24:11] Chad: They need to care about not shipping any bugs. They need to care about writing secure software, because if our responsibility is to fix it all after the fact, it leads to all those bad outcomes that I mentioned before. And I think where there’s a parallel to data here is that if you’re an application developer and you’re writing software that is producing data, you’re generating an API, you’ve got a transactional database, it’s getting loaded into some downstream file storage system, you should probably be thinking about, well, what, what does this data actually mean?
[00:24:41] Chad: Who should be using it? Who should be allowed to use it? What are the quality rules around it? How do I communicate changes to the people who have taken dependencies on it? So on and so forth. And if that was the level of thinking that producers had when managing their data, a lot of our quality issues on the downstream side would go away.
[00:25:00] Chad: Right? And so that’s where I think there’s parallels to the engineering world.
[00:25:04] Dara: So whose, whose responsibility is it? Because the shifting left is moving it from the kind of the end users back to the source and, and back to the people actually producing the code. So within a business, whose responsibility is it to actually do the shifting?
[00:25:19] Chad: Yeah, I think that’s a good question. I think that data teams need to help with the education, because there are two things a lot of software engineers simply don’t really know that much about. They don’t really know about the problems and the challenges that the data engineering organization has, number one.
[00:25:39] Chad: And number two, they have their own problems with data management that they haven’t yet attributed data quality and data governance as the solutions to. So I can give you an example of that, right? Like, let’s say that I’m a front-end engineer and I’m consuming an API that’s been created by some backend team, and I’m populating my UI, and I’ve got fields and forms where people come in and they’ll sign up for my website and type in their name and all this other useful information.
[00:26:08] Chad: Well, if that backend team decides to update their schema, maybe they have a first name field, or a first name column and a last name column, and they merge those together into a name column, and they’ve done this for some arbitrary reason, well, it’s going to break my UI, right? That’s a data quality issue.
[00:26:25] Chad: But software engineers don’t think of it as a quality issue. Data engineers do, like, think about testing and validation for data in that, in that way, and, and they don’t. And so I think there’s a, there’s a whole spectrum of education that needs to happen for engineers to really realize, like, oh, wait a second.
[00:26:47] Chad: The techniques that you guys are talking about, like schema evolution and data validation and, you know, tests on business logic, and looking at latency, like, not just the latency of my service, but how long does it take for data to reach, like, one point after it’s been generated, and how many records do we expect to see?
[00:27:08] Chad: They’re, they’re just not really thinking that way. So I think that’s the responsibility of the data teams. I think that the ultimate owner for quality at the source is going to be the application developer. And this is what I alluded to in the earlier part of the conversation when I was talking a bit about Gable, which is it’s unfortunately not enough to go to an app developer and say, hey, I need you to do a lot more work on behalf of the data engineering organization.
[00:27:37] Chad: They’re going to say, I, I don’t, I don’t want to do that. I, I already, I already have a full backlog. I have a roadmap. I’m not thinking about you. So what we actually need is we need them to care about data selfishly, right? What are the things that they would benefit from in their own day-to-day workflows that would improve their lives significantly and give us high quality data as a byproduct.
[00:28:03] Chad: And that’s, those are the systems that need to exist and ultimately need to be given to the upstream teams.
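Chad’s front-end example from a moment ago is easy to picture in code. Below is a toy sketch of a consumer pinning down the payload shape it depends on, so the hypothetical first-name/last-name merge fails a check instead of silently breaking the UI. It assumes pydantic v2, and the field names and payloads are made up.

```python
from pydantic import BaseModel, ValidationError

class UserPayload(BaseModel):
    first_name: str
    last_name: str
    email: str

# What the backend used to send: validates fine.
UserPayload.model_validate(
    {"first_name": "Ada", "last_name": "Lovelace", "email": "ada@example.com"}
)

# After the backend merges the name fields "for some arbitrary reason":
try:
    UserPayload.model_validate({"name": "Ada Lovelace", "email": "ada@example.com"})
except ValidationError:
    print("schema drift caught before it reached the UI")
```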
[00:28:08] Matt: I think that kind of, that leads onto a question I was going to ask. In, in, say, say in our case, we’re talking about marketing teams or people who are producing marketing data. What kind of tools and approaches might somebody take to make them care selfishly about that data?
[00:28:24] Matt: So if, if we, we’ve gone to all this extent of setting up a data layer and getting everything the right way, and everything’s working perfectly and is feeding into all of our downstream marketing systems, and then like you say, somebody changes something. The data layer is broken. Some of my values change, it completely changes our schema.
[00:28:41] Matt: How, how do we make them care if I’m a marketing person sitting there and trying to get engagement?
[00:28:46] Chad: Yeah, it’s a great question. So I think it’s really going to depend on, it’s going to depend on the incentives of that upstream team. There are some upstream teams that kind of already understand why they might want to do this, because maybe they’ve made some changes in the past.
[00:29:06] Chad: They caused an outage to a machine learning model, or an important dashboard, or, you know, like an API or another service that was consuming that data, and they realized, like, when this happens, it’s on me, it’s on my head. I have to stop what I’m doing, I need to go in and deal with this, and we need to open up a ticket.
[00:29:26] Chad: And that’s just problematic, and I, and I don’t want to deal with it, right? That’s, that’s sort of one thing. The other thing is: is there a scenario where that upstream team would actually have a better development experience if other engineers in the company were following this paradigm? So an example of that might be, hey, I’m a software developer and I’m going in to build a new feature and I need to write some events, or I need to create some, you know, I need to emit some data so I can put it in my transactional system, but I don’t know what data we have.
[00:30:02] Chad: And so I need to go in, I need to figure that out. I need to understand my service, understand where all this data is coming from, see if we have it or not, and then I need to go in and actually create this data, emit it. And then I find out three months later I’m actually duplicating something that already exists that’s maintained by someone else. And like, what, what do I do now?
[00:30:18] Chad: And like, what, what do I do now? So if in that world, the team that was creating that initial data had a data contract, for example, that they made public through a catalog and the engineer could simply look and say, do we have the data I need to emit from my service or not? And if we do, great, I’ll just reuse it.
[00:30:37] Chad: And if we don’t, cool, I can build it and then put it into the catalog. I’ve just shortened my own development cycle by a significant amount. Right. And those are, those are two examples, but that’s the type that’s like the way to approach it, right? Which is like, what are the things that this person wants?
[00:30:53] Matt: Basically everybody works with data to some degree, and can you give them the tools that make working with data easier, that result in higher quality data for all of their consumers. And you mentioned data contracts there. Do you want to just go into a bit more detail about, about data contracts and what they are?
[00:31:12] Chad: Yeah. So, a data contract is the, the best way to think about it is like, it is an agreement between a producer and a consumer. What is the data that is being generated from some source system? What are the expectations on that data? And when you provide that agreement, when you make it visible to the broader organization, you’re sort of saying, here are the things that you can take a dependency on, and you don’t have to worry about being broken.
[00:31:38] Chad: So I think about it as the data equivalent of an API, right? But instead of just focusing on, like, hey, here’s some things that you can call and here’s what you’re going to get back, these are the rules of the data itself. Here’s how many records you can expect to see. Here’s what the contents of the data actually means.
[00:31:56] Chad: Here’s what the schema is going to look like, and here’s how it’s going to evolve over time. The nice thing about a data contract is that, as it changes, because like the reality is these things do not stay static, right? Your databases are going to evolve, your events are going to evolve. And the point is not to have a static contract, it’s to have a system that is recording, at the moment of change, that something has happened.
[00:32:21] Chad: Figure out who is going to be affected, who are the consumers of that thing, and give them the proper amount of heads up and notice. And the notice might be, Hey, we’re going to be making this change in a week, and then you’re going to migrate to the new sort of data structure that we’re pushing out. Or it could be an opportunity for the consumer to say, wait a second, we actually have a business critical dependency on the data that you’re changing.
[00:32:44] Chad: We’re going to need to do a lot more work before you, like, suddenly flip a switch and cause something to, to blow up. The last thing I’ll say about contracts is: generally when I talk about this, people are thinking of schema primarily, and like, that makes sense. But a contract really needs to go beyond schema to capture anything, holistically, that you would expect from the data, including the meaning and the context of the data itself.
[00:33:11] Chad: And there’s ways that you can get to that, you know, purely just through, you know, talking about what the data should mean and kind of writing down, Hey, this is what this data means and how you should use it. But as we move into this world of AI where artificial intelligence is looking at code as the source of truth, AI should be able to start to deduce what the meaning of data is in a previous version of code versus what it becomes in a new version of code.
[00:33:39] Chad: And flag if that violates constraints in a contract. And there’s so many cool things that you can start to do once you factor in that contextual understanding as part of the contract as well.
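As a concrete illustration of what Chad describes, here is a minimal sketch of a contract: a schema, one non-schema expectation, the registered consumers, and a check that runs at the moment of change. Every name in it is hypothetical; this is not Gable’s implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    name: str
    owner: str
    schema: dict              # field name -> type, e.g. {"order_id": "string"}
    min_daily_records: int    # an expectation beyond the schema itself
    consumers: list = field(default_factory=list)  # teams with a dependency

    def breaking_changes(self, proposed_schema: dict) -> list:
        """Compare a proposed schema against the contract."""
        breaks = []
        for col, typ in self.schema.items():
            if col not in proposed_schema:
                breaks.append(f"removed field: {col}")
            elif proposed_schema[col] != typ:
                breaks.append(f"retyped field: {col} {typ} -> {proposed_schema[col]}")
        return breaks

    def notify_consumers(self, breaks: list) -> None:
        # In a real system this would be a ticket or a migration window,
        # not a print statement.
        for consumer in self.consumers:
            print(f"heads-up to {consumer}: {breaks}")

orders = DataContract(
    name="orders_v1",
    owner="checkout-team",
    schema={"order_id": "string", "amount": "float"},
    min_daily_records=10_000,
    consumers=["data-engineering", "fraud-model"],
)

# A producer proposes stringifying amounts; the contract flags it upstream,
# before anything blows up downstream.
breaks = orders.breaking_changes({"order_id": "string", "amount": "string"})
if breaks:
    orders.notify_consumers(breaks)
```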
[00:33:47] Dara: So should the, should the data contract be something that’s actually defined and agreed mutually between the end users of the data?
[00:33:54] Dara: So say the data team, and the actual programmers coding the creation of that, that data? Or is it more of a ‘this is our list of demands and you just need to, you just need to adhere to this’?
[00:34:05] Chad: I, I think it’s another one of, it’s going to be a very unsatisfying answer because I, I think it depends on the type of organization that you’re in and the use case.
[00:34:13] Chad: So the way that I’ve seen it play out is that there are some cases where the producers have all the power in the relationship, right? Where like, I’m, I’m a software engineer and I am, you know, the, the downstream data teams, you guys are cool. You’re building dashboards, you’re doing all sorts of great stuff.
[00:34:33] Chad: But if those dashboards suddenly disappeared tomorrow, the business still revolves around me. So what I’m going to do is like, I, I don’t want to waste my time dealing with outages that you tell me about. So I am going to define what the contract is, and what the API is. I’m going to give that to you. You can take the dependencies on the things that I make available through my contract.
[00:34:56] Chad: And if you go beyond that, and then that breaks, well, I mean, that’s your own problem now, right? That, that’s sort of one, mechanism. And I’ve seen, I’ve seen, I’ve seen it go that way. You also have, the way that you, you sort of talked about at the end there, where the consumer has all the power, and this is definitely true.
[00:35:13] Chad: Like one ex, one place where it’s very true is in financial services. So if you’ve got, let’s say, a risk model or a fraud model, and these are audited by the federal government, and if you don’t pass those audits because the data is low quality, or you’re using PII in the wrong way, you could be fined billions of dollars. Well, guess who has the power in the relationship.
[00:35:36] Chad: Now it’s, it’s me, right? Like you, you might own the front end UI of some mobile app, but I’m responsible for making sure that the business doesn’t fail because we get fined, you know, to oblivion by the government. And so now I can dictate the rules to you, right? I can say this is what you have to do because this is what we are using in these models and this is what the federal government expects.
[00:35:59] Chad: And then I also think that there is, there is another option where it’s kind of in the middle, where it, it almost works in the way that a product manager might create a requirements document for a software engineer when they’re building a new feature, right? Where they’ll say, hey, this is the problem that we’re trying to solve.
[00:36:18] Chad: This is kind of the shape of the solution that we want to have in place. And this is the, the, the outcome, right? This is the ROI of what will occur if you give this to us. And then, what the downstream team would basically do is say like, we’ve got some meaningful use case for this data. We are having quality issues or we’re having governance issues that are resulting in some direct ROI impact on our team.
[00:36:44] Chad: Like, we’re not able to make effective decisions. Or maybe, you know, if we could improve our model’s accuracy by 5% by increasing the quality of the data, then we make a hundred million dollars, or, or whatever it is. And I’m going to give those requests, via data contract, to the upstream teams. And now that becomes a conversation that you can start to have at, at the leadership level.
[00:37:05] Chad: Which is, it’s not just do I do this work to make the data engineering team happy, it’s do I do this work to solve some business problem, and do we prioritize this or do we not, and what’s the level of effort versus impact? And if the level of effort is, I just need to make sure that a field is not nullable, and I potentially save millions of dollars, that’s a very easy conversation to have at that point.
[00:37:27] Matt: Just to, just to keep thinking about these data contracts, still pulling that thread, and, and to take it into the marketing analytics world for a second. We, we come across a lot of clients who may struggle with, with big centralized, governed warehouses, for example, that are just so inaccessible to them that it becomes a bottleneck for them actually being able to achieve anything.
[00:37:52] Matt: So equally, with, with a contract going back to the source and understanding what the data format and context and all the rest of it should be, is there a world where, as long as that sort of separate team, that marketing team, can agree upon some sort of data contract with this central governed data, they can, they can peel off a piece and begin to work at it a bit more freely, and be able to remove some bottlenecks too, to get the actions they want, and stop sitting around waiting 20, 20 months to get a very small piece of what they need?
[00:38:28] Chad: Exactly. I mean, I, I think that that is the ultimate vision of the contract, and if, if you sort of do this the right way, and I have seen it done the right way, you end up in a world where you basically have two environments for your data. One environment is this sometimes very messy, complicated spaghetti nonsense.
[00:38:50] Chad: That exactly as you said, is very difficult to penetrate. I don’t really know what this stuff is, but maybe there’s some gold nuggets in there somewhere if I root around in the mud for long enough. And then the other world is the data under contract. This is, everything is owned. It’s clearly defined exactly what it is, how to use it, how it feeds from one system into another.
[00:39:13] Chad: people have visibility bidirectionally, right? So I know when the upstream system changes in some way, but the upstream system also knows when I want to use this data for something. And that’s another really important thing too, for a variety of reasons. And I think in that world, it just becomes so much easier to use data to solve business problems because you know exactly what you can trust, how to use it, where it’s, where it can be found when it changes, yada, yada, yada.
[00:39:40] Chad: So I, I, I do think that that is the, that’s sort of the ideal future. The problem is, or, or the challenge is getting to that future because you have these two groups kind of sitting on either side, the producer on one side, consumer on the other side, and a lot of like crap that’s just emerged in the middle over time.
[00:40:02] Chad: And sort of being able to bridge the connection, you know, the, the sort of connective tissue between either side, to get to the meat in the middle, is a very, very labor-intensive process. Like, it’s very difficult to unravel all of that spaghetti to say, I, as a producer, am supporting dashboard A, marketing initiative B, email automation C. If I know that, that’s amazing, and I’m going to do so many things on my end to make sure that you have what you need.
[00:40:32] Chad: But until I get to a point where I can know that definitively, and I understand how my system connects to your system, it’s very, very hard to have, to have that level of ownership. And that’s where I believe technology comes into play. I think the purpose of the technology in this space is to do, in large part, that automated unraveling of the system, and being able to connect those two sides together.
[00:40:53] Chad: And then you understand what each side’s responsibility is to the other.
[00:40:58] Dara: What does that look like? Because I was going to ask you about that. What, what does that, what does that look like in practice? ’Cause, as you say, there’s a huge amount of spaghetti to unravel, and the two sides, they’re quite, they’re not quite opposite ends.
[00:41:10] Dara: They, you know, they, they potentially are speaking in different languages. So, you know, you might have SQL versus, you know, some programming language. Where does that start? How, how do you start that process? What does that actually look like in terms of bridging those two completely opposite worlds?
[00:41:28] Chad: I mean, so you, you nailed the problem exactly. I, I think the way that you framed it, that they’re literally speaking two different languages, is really the core of the issue. You might have a software developer writing an application, or a developer writing something in Swift on their sort of iOS app.
[00:41:47] Chad: You may have a Java developer working in the backend. They might be using two different types of databases. That gets handed off to some file storage system where it all comes together. And then you might have a data engineering team writing Spark pipelines, and then you might have an analyst team writing, writing SQL.
[00:42:05] Chad: And I think that translation issue is the core problem, right? That is the problem. If you can overcome that, meaning you’re able to look at all the different, like, stuff that people are creating that is connected together, and abstract all that complexity into a well-understood flow between people and teams and, and technology, then you can start taking action: like, okay, well, well, what do I do now that I understand that A is connected to B in this way, for this, for this reason.
[00:42:35] Chad: And there actually are, there actually is tech that is designed for exactly that. It just has not ever been used in the data space. It’s mainly been used in areas like security. This is another big reason why I bring up security a lot because I think that security is one of the best analogies for data.
[00:42:55] Chad: Security, really, for the past 30 years or so, has had that exact problem, right? From the, from the point of view of, like, I need to understand what the vulnerabilities are in my, my code base. I need to understand, if I have a vulnerability over here in one system, what is the potential blast impact of that on all the other services that I have?
[00:43:19] Chad: What, what could be contaminated and how scary is it? And that determines what I prioritize. It’s like, okay, if I have a vulnerability over here, but it’s totally isolated and the attacker’s not going to be able to get anything interesting, it’s just a lower priority for me to solve than if I have a vulnerability that connects to like all of my sensitive, customer data.
[00:43:38] Chad: And so they’ve invented techniques that sort of revolve around code parsing: being able to look at a code base, irrespective of what that code base is, trace through that system, through, through different languages, through different technologies, and look at the handoffs and say, I now understand the flow of, of information.
[00:44:00] Chad: But like I said, that’s something that’s been applied specifically in the context of understanding vulnerabilities. It’s called taint analysis. It’s never really been applied from the perspective of like, well, I need to know where data starts and where it ends up. And I think that’s the type of technology that’s going to be pretty revolutionary for this space.
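For readers curious what taint analysis looks like at its very simplest, here is a toy, single-file sketch: mark values returned by a named source function, then propagate that mark through assignments to see where the data ends up. Real tools do this across languages, services, and far richer code shapes; the function names below are invented.

```python
import ast

CODE = """
raw = read_user_table()
cleaned = normalize(raw)
report = summarize(cleaned)
banner = get_static_banner()
"""

SOURCE_FUNCS = {"read_user_table"}  # where the data "starts"

def tainted_vars(code: str) -> set:
    """Variables that transitively carry data from a source function."""
    tainted = set()
    for node in ast.parse(code).body:
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            call = node.value
            arg_names = [a.id for a in call.args if isinstance(a, ast.Name)]
            from_source = (
                isinstance(call.func, ast.Name) and call.func.id in SOURCE_FUNCS
            )
            if from_source or any(a in tainted for a in arg_names):
                tainted.update(t.id for t in node.targets if isinstance(t, ast.Name))
    return tainted

print(tainted_vars(CODE))  # {'raw', 'cleaned', 'report'}; banner is untouched
```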
[00:44:21] Dara: So how much, this is probably going to be a case-by-case basis as well, but how much of the problem, if you were to assign percentages to it. So there’s obviously the, you know, education piece around getting people to buy in, adopt, and then there’s the, the technical challenge maybe of unraveling all of this and understanding what connects to what.
[00:44:39] Dara: So as a kind of rule of thumb, how much of the problem is adoption? How much of it is technology? And maybe what other pieces might, might fall into that, into that pie as well?
[00:44:51] Chad: Yeah. I think there are, unfortunately, huge investments required for both. Like, what I just mentioned a second ago is only part of the technology investment.
[00:45:05] Chad: Basically, you know, bringing in tech that’s been around for a very, very long time and applying it in new ways. But data also has some unique challenges associated with it, where there’s a lot of contextual information that would be very hard to get from a purely deterministic code parser. And there’s, there’s all, all types of things like that, that are, that are very technically challenging to solve.
[00:45:28] Chad: But to your point, even if you are able to solve those problems, it still doesn’t necessarily mean that the upstream engineers would use this technology. And that’s where I think there, it sort of goes back to what I was saying before where you, you can’t just provide the visibility into who’s using what and say, Hey, I want you to use this so that you can prevent problems to downstream teams.
[00:45:52] Chad: You have to figure out how to, number one, make that absolutely trivial to onboard, meaning the cost of getting started on a system like that has to be basically zero, right? Because you don’t want to add any new, any additional effort; it has to fit within the existing workflow of these other teams.
[00:46:11] Chad: And I’ll, and I’ll give you just one quick kind of story. I saw this when I was working at Convoy, but I know a lot of other people that have gone through this as well, where they had a data team, and they said, hey, here’s sort of all of my tables and all my queries and all my metrics, and yada yada.
[00:46:27] Chad: And I want my engineering team to start looking at this catalog of information before they make changes. Tracing how the code that they’re writing is going to affect this web of interconnected systems. Then doing the hard work to tell the affected people when they will be impacted. And the success rate of those initiatives has basically been 0% because an engineer will see that and go, this is a mess.
[00:46:57] Chad: I have no idea how to connect my code to some arbitrary table that’s sitting in file storage somewhere. I’m not going to do this work, because the cost of doing it is so high that it’s not even worth the value anymore. The other problem, I would say, is that that type of work is not
[00:47:18] Chad: local enough to the individual developer, right? Developers generate enormous amounts of data. Not all of that data has utility in the business, and it’s not all being used in the same way, right? You might have a table that’s sitting in Snowflake or Databricks or whatever, but there’s only one person using that table to generate a dashboard every six months for something that’s kind of interesting.
[00:47:44] Chad: Or you might have a table that’s being used to convert one currency into another that’s being used by every single component of the product. And there’s not a great way for these upstream developers to know that and then prioritize what they should be taking action on versus, versus what they shouldn’t.
[00:48:03] Chad: So when we talk about adoption, I think those are the types of problems that need to be solved, right? Like you need to be able to give a workflow to a developer that’s easy to onboard that gives them context that basically tells them what to do and who to talk to, and then also gives them the next step.
[00:48:22] Chad: And that’s a really, really hard problem to solve as well.
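The prioritization problem Chad raises, which table fans out everywhere versus which one feeds a twice-yearly dashboard, comes down to walking a lineage graph. A minimal sketch, with made-up edges; in practice the graph would be derived from code parsing, query logs, or a catalog:

```python
from collections import deque

LINEAGE = {  # asset -> assets that consume it directly
    "orders_table": ["currency_fx", "quarterly_dashboard"],
    "currency_fx": ["checkout_service", "pricing_model", "fraud_model"],
    "quarterly_dashboard": [],
    "checkout_service": [],
    "pricing_model": [],
    "fraud_model": [],
}

def downstream(asset: str) -> set:
    """Everything transitively affected by a change to `asset`."""
    seen, queue = set(), deque([asset])
    while queue:
        for consumer in LINEAGE.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

# The FX table fans out across the product; the dashboard does not.
print(len(downstream("currency_fx")))          # 3
print(len(downstream("quarterly_dashboard")))  # 0
```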
[00:48:26] Matt: And, and I suppose, with the best will in the world, say you get all this spot on, and you get a, you get a contract in place, and, and you’re able to communicate those things down to the developers. It’s still, it’s never going to be perfect, is it?
[00:48:41] Matt: There’s still going to be a shared sense of responsibility across the entirety of that, of that left to right. And each individual component of that, from, from the developers, from the, the data engineers setting up the Spark pipelines, to, to the, to the people creating models and using dbt or Dataform, they’re going to have to have
[00:49:03] Matt: diligence in alerting and monitoring and things like that, to be able to feed back issues. ’Cause, tell me if I’m wrong, but I’m assuming even if you get to the, the, the holy grail here, it’s never going to be absolutely perfect and no problems will ever feed through. You can’t just switch off and say, right, we’ve got it now, we don’t need to worry about monitoring any of this stuff. Exactly.
[00:49:21] Chad: I mean, that’s exactly correct. So the, like, this is, this is kind of the, the heart of the massive change that I am trying to push and trying to recommend. And this, this isn’t just a change in the world of data. It’s actually a huge change in the world of software as well.
[00:49:40] Chad: What we have basically done in the software engineering industry is treat code as an island. We’ve sort of, we’ve tried to manage code in a very isolated way, and that’s a result of the, of the adoption of the cloud and the adoption of applications and things like that. And what happened, you know, sort of in the 2000s and going into the early 2010s, is we became a very application-centric universe, right?
[00:50:12] Chad: Everything became centered around the app. How do we allow software engineers to ship changes to their applications as fast as humanly possible? And it turns out that the best way to do that is to move in in isolation. So you have small teams that have a lot of autonomy. They have a lot of control. They can kind of make changes to whatever they want, and as long as they provide interfaces to other teams in the company, then you don’t really have to worry about things like a big unified plan or unified architecture or whatever, whatever.
[00:50:43] Chad: This is not the way that data operated for the previous 30 or 40 years, right? Like, it was very structured. You had, you know, ERD diagrams, and you had data architects, and they were sort of thinking about how all the databases are connected, and you had architects that would sort of build out your transform layer and then make that available to your analysts and all these other people.
[00:51:04] Chad: And that just kinda went away overnight, right? The cloud just sort of evaporated it. But you know, there’s no such thing as a free lunch. Like, the reason we did those things, the, the reason, the reason that happened, is because you need some level of coordination and structure in order to have an accurate representation of the world, in order to do analysis, right?
[00:51:26] Chad: That, that just had to be true. And I think what software developers are starting to run up against now is that the previous assumption, that we can just live in a world where we’re totally isolated from everyone, doesn’t work. Like, you, you do reach a limit there, right?
[00:51:43] Chad: Like if I am writing software and you’ve taken a dependency on me, whether I want to be isolated or not, we are connected, right? Like there is a relationship that you and I have, and my view is that the world of code is a graph, right? It’s, it’s, it’s a network. It’s multiple nodes, people who are generating things and writing software and connections between multiple nodes.
[00:52:05] Chad: And that requires asking the question: how do you manage quality if you are a single node in a graph that’s making a change that has impacts across the entire system? That means your, your unit tests shouldn’t just be focused on your node, right? They should be focused on other people as well. That’s a really, really big change.
[00:52:26] Chad: It’s a change to how we think about software and how we think about quality and change management and stuff like that. But I think that’s what’s going to happen. I, I, I think the world is going to move to that over the next 10 years, and I think AI is going to be a huge tailwind for a bunch of reasons.
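One hedged sketch of what ‘unit tests focused on other people’ could mean in practice: a producer’s CI run executes checks registered by its downstream consumers, so a change is judged against the whole graph rather than one node. The registry, consumer names, and payloads here are all invented.

```python
# Checks downstream consumers have registered against this producer's output.
CONSUMER_CHECKS = {
    "fraud-model": lambda row: isinstance(row.get("amount"), float),
    "email-automation": lambda row: "@" in row.get("email", ""),
}

def broken_consumers(sample_row: dict) -> list:
    """Which consumers a proposed change would break."""
    return [name for name, check in CONSUMER_CHECKS.items() if not check(sample_row)]

# A proposed producer change that stringifies amounts:
print(broken_consumers({"amount": "12.50", "email": "a@b.com"}))  # ['fraud-model']
```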
[00:52:39] Matt: That's pretty much my next question. This has obviously been a problem for a while, but do you think the pace of change and everything that's going on with AI is shining more of a spotlight on it? Like you say, you can't just rely on trusting the data, or chucking a load of crap into a data warehouse and letting data engineers figure it out and try to pull some sort of order out of it.
[00:53:03] Matt: The more garbage that you've got sitting in one of these systems that are beginning to feed into AI or ML models, or whatever it may be, the worse this problem gets, and the faster it gets worse.
[00:53:16] Chad: Well, exactly. I think the problem has always been there. That really is the core issue, right?
[00:53:21] Chad: I'm producing, I'm writing some code, I have a service, I'm doing a thing, I've got all these dependencies. But everyone's sort of pretending as if those hard dependencies don't exist. And the way that engineers have managed that is that principal engineers and senior developers will take it upon themselves to understand all the interconnections between their system and other systems, right?
[00:53:51] Chad: It's not a part of your testing framework. It's not a part of your unit tests or your automation or any of your other quality tools. It's just in my brain. I just know that I call this thing, this is what I depend on, and this is the way that I'm using this stuff. As we move to a world of AI, where the AI is now a developer, living in the code base, making changes to software, and doing this rapidly.
[00:54:20] Chad: The AI is not going to have the context of how one system is dependent on another system. It's not able to get up from its desk and walk around to the other side of the building to talk to the engineer who manages the API. It can maybe reference some documentation, but that's not going to be good enough in a system that has these cascading dependencies and all this complexity and thousands of files and so on.
[00:54:41] Chad: And so the tailwind that I’m mentioning is software developers starting to realize that actually you can’t just unleash AI willy-nilly onto a code base and ask it to start doing things because it’ll make changes that have horrific impacts somewhere else. And there’s no human in the loop to validate that that change actually makes sense.
[00:55:02] Chad: And so if that’s the world that we want, if we want a world of autonomous sort of AI agents able to do things, it means that you need to more or less give them a map of the world. Like, here is how my stuff is interconnected. This is what takes a dependency on this, and this is what it’s intended to do.
[00:55:18] Chad: And it's all the same problem. Data teams suffer from it. Software engineers suffer from it. But I think data organizations are oftentimes the tip of the spear, because they don't have the same type of personal relationship with software engineers that software engineers have with each other, right?
[00:55:35] Chad: Where they just get up and say, okay, well I know this guy, I’m going to go talk to them about this. Software engineers generally don’t do that with data engineers. And so they’re just impacted a lot more.
[00:55:43] Matt: Yeah. And if there's any sort of gap in an LLM's knowledge, it will just make it up, hallucinate it completely, and carry on.
[00:55:52] Dara: Yeah. So if we're thinking about AI as something that sits on top and magnifies a problem that already exists, do you see more practical, short-term uses for AI in terms of actually solving it? Can AI be a part of organizations actually shifting left?
[00:56:10] Chad: Yes, I definitely think it can, in a few different ways, if you're able to combine it with other forms of technology that facilitate shifting left. So, for example, one of the things that you can do with pretty sophisticated code parsing methods is figure out how one piece of code is connected to another piece of code, and that data flows from here and ends up there.
[00:56:37] Chad: What that doesn't tell you, though, is context. What is happening in the code? How is the data being transformed? What does it mean? How is it actually being used? If you don't have that context, you're still operating with limited information. And this is something that AI is actually really good at, right?
[00:56:56] Chad: Where it struggles is when you need an explicitly right answer. Where it does great is when something is subject to interpretation. Kind of like writing an essay, right? There's no real right way to write an essay. You can get the general themes correct.
[00:57:16] Chad: And it's like, okay, this feels right, this feels like something that's digestible by people. And interpreting what code is doing, interpreting what data is doing, is exactly the type of thing that LLMs are really great at. So you combine the deterministic systems, which say, I know that this goes from here to here to here, with the LLM saying, given that concrete deterministic knowledge, here is what I think this data is doing.
[00:57:41] Chad: It makes it unbelievably, ten times easier to shift left, because you're operating at the level of the abstraction of context and not at the level of code and functions and classes and things like that. And that just makes decision making way easier too.
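For a sense of what that combination looks like, here's a minimal sketch. The deterministic half uses Python's built-in ast module to extract which names feed each assignment; the interpretive half is left as a hypothetical llm_summarise call, since that plumbing depends on whichever LLM stack you use:

```python
import ast

# A stand-in for real pipeline code (function names are invented).
SOURCE = """
raw = load("events")
clean = dedupe(raw)
revenue = aggregate(clean)
"""

def assignment_lineage(source: str) -> list[tuple[str, list[str]]]:
    """Deterministically extract which names each assignment reads."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            target = node.targets[0].id
            inputs = [n.id for n in ast.walk(node.value)
                      if isinstance(n, ast.Name)]
            edges.append((target, inputs))
    return edges

lineage = assignment_lineage(SOURCE)
# [('raw', ['load']), ('clean', ['dedupe', 'raw']),
#  ('revenue', ['aggregate', 'clean'])]

# The non-deterministic half: hand the concrete lineage to an LLM and
# ask what the data means (llm_summarise is hypothetical, not a real API).
# summary = llm_summarise(f"Explain this data flow: {lineage}")
```

The split matters: the graph comes from parsing and is exactly right; only the narrative on top of it is left to interpretation, which is the part LLMs are good at.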
[00:57:58] Dara: So say one of our listeners is listening to this and thinking, okay, I'm a data engineer, or I'm a data analyst.
[00:58:07] Dara: Wow, my company really needs to shift left, I completely buy into this. Where do they start? What's their first step?
[00:58:13] Chad: So I think the first step, and this is what I always recommend, comes before you think about technology, before you think about what tools to go and buy.
[00:58:25] Chad: It's to figure out your use cases. Here's a common problem I've seen. Sometimes I'll go to data platform teams or data teams, and they'll say, Chad, we really want to shift left and we want data contracts and we want all these things. And I'll say, great, tell me the top five use cases where adding a data contract in your organization is going to save the business money or save them time.
[00:58:46] Chad: And they go, uh, I don't know, let me go find out. Right? And that's not where you want to be. I think that a lot of data-engineering-oriented technology, probably for the last 15 or 20 years, hasn't been use case driven. It's been driven more by best practices, because we as data teams understand what should probably be done, what's the right way to write code and do data validation and so on.
[00:59:18] Chad: But it's been disconnected from what the business actually needs. And you can't do data contracts everywhere, for all the reasons we've mentioned, right? There's too much data, there's too much code. It's going to be a huge workload for engineers to take on.
[00:59:34] Chad: You need to figure out the five or 10 most valuable use cases of data at your business, which are not necessarily limited to analytics. It could be operational, right? You might say, I've got a front end team that's consuming a whole bunch of data from a backend team, and I've got examples of where they've made changes in the past that caused an outage, our app went down for six hours, and that caused us to lose X amount of dollars.
[01:00:00] Chad: Right? You could put a contract there. And the reason that would be useful is that now all of the data coming off of that backend team, regardless of where it's going, whether it's hitting the front end team or the analytical database or the marketing team, is going to be quality data, right?
[01:00:17] Chad: And so you find those examples where the data coming off a system is important enough that it requires a contract, and you start there. You get your organization to agree that if we had a contract in place at X point in time, it would've prevented some horrific incident from happening. And then you forward-project: we think that this type of incident is going to happen at least once a year, twice a year, once a month, whatever.
[01:00:46] Chad: And now you actually have your ROI calculation, right? It's like, well, okay, I think I can save you $10 million a year by having 10 data contracts in these strategic places around the company. That becomes a no-brainer for your executive team. Once you get alignment on that, then the question becomes, okay, now how do we do it?
[01:01:05] Chad: And that's a fun conversation. Now we're talking about technology, and what we do first, and MVP versus scale. But you've got to get to the point where we all philosophically agree that it's good to make sure the data is high quality from a certain set of systems first.
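The ROI framing Chad describes really is back-of-the-envelope arithmetic, which is part of why it lands with executives. A sketch with entirely invented numbers:

```python
# Back-of-the-envelope ROI for one data contract (all figures invented).
incident_cost = 250_000      # dollars lost per data incident
incidents_per_year = 4       # expected frequency of this incident class
prevention_rate = 0.8        # share of incidents a contract would catch
contract_cost = 50_000       # annual cost to implement and maintain it

annual_saving = incident_cost * incidents_per_year * prevention_rate
roi = (annual_saving - contract_cost) / contract_cost

print(f"Expected saving: ${annual_saving:,.0f}/year, ROI: {roi:.1f}x")
# Expected saving: $800,000/year, ROI: 15.0x
```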
[01:01:20] Dara: And just a bit of a follow-on from that. I would imagine you've seen success stories, but you've also seen companies attempt this and have it go wrong. So when someone has the idea that they want to do this but it doesn't quite work out, what are the typical reasons you see, versus the companies who actually successfully manage to shift left and see all the benefits?
[01:01:44] Chad: Yeah, so maybe I'll start with the success stories and give you some good news and some aspirational things to hope for. We have seen companies where they started with a small set of data contracts, and those data contracts were so powerful that within one or two months the CTO issued an executive order that said all data needs to be under contract.
[01:02:10] Chad: That's how quickly this stuff can take hold. And the example I'm thinking of is an organization where one of the main ways they made money was advertising, an ad-tech-based business.
[01:02:36] Chad: The way that they charged their customers was that they would have events emitted from their front end systems, basically an impression event. So when a user visits a page, they would charge that back to a customer, a CPC type of thing. The problem they were having was that any engineer in the company could go and add that impression data to any page. And if that wasn't accounted for by the data engineering organization, your customers are just going to be accruing spend that you could be charging for.
[01:03:03] Chad: Spend that doesn't make its way back to finance. This had happened a few times, so they said, okay, let's just tackle that problem first. They went out, they implemented three or four data contracts, they prevented one or two outages, and they said, okay, we're sold on this.
[01:03:20] Chad: Not only is this useful for preventing our advertising spend issues from happening, it turns out to be an incredibly effective QA mechanism as well for building features and APIs: as we're building this API, here's what we expect the data to look like coming out the other end. And then you can use that as a way of bringing in product, bringing in data engineering, bringing in the analyst team. And they just started building better, more robust features that were solving customer problems better.
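To picture what a contract on an event like that might look like, here's a deliberately minimal sketch in plain Python. The field names are invented, and real contract tooling (schema registries, or products like Gable) does far more than type-checking, but the shape is the same: an explicit, testable agreement about the data crossing a boundary.

```python
# A minimal data contract for an ad impression event (fields invented).
IMPRESSION_CONTRACT = {
    "event_name": str,
    "page_id": str,
    "advertiser_id": str,
    "timestamp_ms": int,
    "billable": bool,
}

def validate(event: dict, contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the event conforms."""
    errors = [f"missing field: {f}" for f in contract if f not in event]
    errors += [
        f"wrong type for {f}: expected {t.__name__}"
        for f, t in contract.items()
        if f in event and not isinstance(event[f], t)
    ]
    return errors

event = {
    "event_name": "impression",
    "page_id": "home",
    "advertiser_id": "acme",
    "timestamp_ms": 1_700_000_000_000,
    "billable": True,
}
assert validate(event, IMPRESSION_CONTRACT) == []
```

Enforced at the producer's boundary, a check like this catches a malformed impression before it ever reaches billing, which is the outage class Chad's example is about.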
[01:03:50] Chad: So I think when it goes right, that's what you see: a lot of conversation between different groups that maybe weren't necessarily talking before, and real problems being solved. When it doesn't go well, I think the primary signal that it's probably not going to go well is when teams are spending more time talking about what their spec is going to look like than about use cases.
[01:04:21] Chad: They're focused too much on how are we going to implement this, and not on where it's going to accrue business value. And in almost every single situation I've seen that happen, after six months the adoption is terrible. Right? Because there hasn't been a compelling story made to the leadership and to the engineering team about why they should do it.
[01:04:42] Chad: They've just provided a capability, and then it becomes a cost. And even if you are able to convince a few engineers to do it to be nice or whatever, there are so many issues that come out of that, so many problems that you then need to coordinate. They'll just say, hey, I don't want this anymore.
[01:05:00] Chad: And not only that, but: I don't really care about the data. Anytime I interact with the data group or the data engineering group, it's a cost for me, so screw them. And that's the worst possible place to be, right? The end output is that they care about you less. So that's why I stress so much that if you start from the use case and work back to the tech, it's going to be so much easier to get people aligned than the other way around.
[01:05:23] Dara: Is the approach different depending on the size of the organization? And actually, maybe a starter question is: does this work for all sizes of companies? And if it does, is the approach different?
[01:05:37] Chad: That’s a great question, and I think the short answer is no, it doesn’t work for all sizes of companies.
[01:05:42] Chad: Well, let me reframe that. Yes, it works for all sizes of companies, but I think it's going to be much easier to do at a different size of company. When you're an early stage company, the way that you primarily use data is not to solve some production-level operational problem, right?
[01:06:07] Chad: You're trying to find product market fit. That's why you're building software. And the role of data in those early stage organizations or smaller companies is to basically provide supporting evidence that you're doing the right thing on the product side. In that world, accuracy and correctness and governance and quality of the data are actually secondary to directionality, right?
[01:06:30] Chad: As long as the data is directionally true, as long as you're giving me the right signal that the thing I built is working the way I expect, that I'm making more money, I don't need it to be totally accurate. And that's fine, but it just means that data contracts in that world are probably not going to be worth the effort it takes to create and maintain them.
[01:06:53] Chad: ’cause the, the op, the, the cost there is, well, I have to go slower on shipping software. and in bigger companies though, it’s the exact inverse, right? It’s, I, I am, I’m using data everywhere. I have an enormous amount of data that’s probably exceptionally useful. I am using it in all types of interesting, valuable ways.
[01:07:14] Chad: And the cost of you slightly slowing down to give me a contract and take this approach to quality has an outsized impact on value, because there are a thousand different teams using this data point to do something useful. So it's just way easier to get alignment, way easier to get adoption.
[01:07:31] Chad: And oftentimes in those companies, the engineers are already causing outages, they're already causing problems, people are already coming and complaining to them about it, and they're trying to figure out how to solve it. I think that in the future, having a data contract is going to be like writing a unit test or writing an integration test.
[01:07:52] Chad: Anytime there's a handoff point from one technological system to another that's doing some type of exchange facilitated by data, there will be a contract ensuring that there are clear expectations between those two parties. I believe that's going to become standard.
[01:08:11] Chad: But the way that type of standardization happens, I think, will follow the same path as APIs, or microservices, I guess, which weren't really a huge thing until Amazon popularized them. As a big company, it makes sense why you don't necessarily want a monolith. You can't have 10,000 people all operating within the same code base.
[01:08:31] Chad: You break things off into microservices, and then the engineers who work there go, oh, this is amazing, it allows us to work on code faster, it's cleaner, we can decouple a bunch of things. And they bring that thinking into other companies, and now a lot of businesses are starting from microservices from day one.
[01:08:48] Chad: I think we'll see the same thing here: the big enterprises are the ones who will start with data contracts, because that's where it's needed the most. Those people will leave, they'll go to startups and say, hey, it just makes sense to do this, and you'll start to see it become more of the common lexicon. With one exception.
[01:09:04] Chad: Small company. The tight category of small companies that I think is really going to need data contracts from day one are the small companies that are offering data products as their primary service or, or a product offering, right? Meaning like, let’s say I have a machine learning model and I’m producing a data set, I, I’m producing something interesting.
[01:09:24] Chad: Well, I want to have a contract in place both with the teams who are producing all of the data that my scientists are using to create their model, and probably with my customers, so that I'm not changing something that causes our customers to blow up if they've taken a dependency on my data. So I think data products are probably going to want those contracts early-ish.
[01:09:48] Matt: My final question is really about gazing into a crystal ball and trying to think about what's coming up over the next five years, if you can even do that anymore with the pace at which things are changing.
[01:10:03] Matt: You kind of touched on it a little bit, I suppose, with the proclamation that everyone will be using data contracts and it'll be as common as a unit test, but have you got any other thoughts or insights on what you think might be coming down the road for data in the next five years?
[01:10:17] Chad: I think we're going to see a lot more artificial intelligence starting to cut off the annoying parts of the jobs of data engineers.
[01:10:29] Chad: So for example, pipeline development seems to me like the most likely candidate to be handed off to AI, because it really is: hey, if I know the shape of the data that's coming in, and I know where this data needs to land, there's no reason why an artificial intelligence can't automatically spin up, say, an Airflow job with that context, move the data to the right location, and then schedule the job.
[01:10:59] Chad: it’s actually better, I think, for an AI to do something like that because they can just exist in perpetuity across the ecosystem and make sure all the data when it is produced, arrives in the right place in the right format. And I think that’s going to be a good thing for data engineers because it means they’re not going to be spending their time writing pipelines.
[01:11:16] Chad: Data engineers, instead, are going to be thinking about interesting data problems, which I think is great. My sort of hot take, though, is that where I think AI is not going to be particularly useful, for the next few years anyway, is in analysis. This is the main place that people believe AI is going to be useful, because it's sort of semantic and it talks like a person.
[01:11:43] Chad: It's like, oh, well why can't an AI just analyze some data and make a decision? And it really gets back to what we talked about before, which is the underlying context. In so many places, like a corporate setting, 90% of the work to make a decision is actually gathering context.
[01:12:06] Chad: It's people working together to gather context from each other, right? But a lot of that gets hidden away and not really included as part of the job description. What's more focused on is the technical ability: oh, I wrote the query, I produced the thing.
[01:12:24] Chad: And I think that analysis is basically like that, right? In order to actually analyze a thing, you need to do data validation, and data validation is not a deterministic process. You're figuring out: what does this data mean? Where does it come from? Does it represent the thing that I think it does?
[01:12:40] Chad: I need to map my understanding of this data to my understanding of the actual real-world systems. I need to take into account the context of everything else that was built. I need to look at the numbers. Whereas an AI, to the point that you guys made earlier, is always going to give you an answer, right?
[01:12:58] Chad: It's very rare that the AI is going to say, I don't have enough information to answer this confidently, and a person will. And I think for that reason, you're just not going to see AI taking over these decisions that require a level of criticality and thought and introspection and conversation with many other people for a long time.
[01:13:19] Chad: That’s my, that’s my point of view.
[01:13:21] Matt: Yeah, I can see those roles being augmented, but not going completely hands-off.
[01:13:27] Dara: It’s, yeah. I. Alright, well we’re going to bring you back in five years time, Chad, and see how your predictions have panned out. but for now, thank you again for joining us today for a really interesting conversation.
[01:13:42] Dara: That's it for this week's episode of The Measure Pod. We hope you enjoyed it and picked up something useful along the way. If you haven't already, make sure to subscribe on whatever platform you're listening on so you don't miss future episodes.
[01:13:53] Matt: And if you're enjoying the show, we'd really appreciate it if you left us a quick review. It really helps more people discover the pod and keeps us motivated to bring you more. So thanks for listening, and we'll catch you next time.
Further reading
- Webinar: Meet Dataform, the smart solution to fragile SQL setups
- Measurelab awarded Google Cloud Marketing Analytics Specialisation