#143 Stick, twist, or wait for Clippy: A week inside OpenAI

Dara Fitzgerald · 26 June 2026

Dara and Matthew spend a week putting OpenAI through its paces (ChatGPT, Codex and deep research) against their daily driver, Claude. A blind four-way experiment building a running-race predictor surfaces a clear split: deep research goes to OpenAI, while coding and long-horizon agent work stay with Claude. They close with a stick-or-twist verdict, taking in Anthropic's recursive self-improvement claims, a near-trillion-dollar valuation, Microsoft's own models, and the case for renaming everything Clippy.

"It felt like Anthropic is further along in that long-horizon agentic capability compared to ChatGPT." - Matthew

"It seems like ChatGPT's deep research is just unanimously better... but the coding was unanimously better for Claude Code." - Dara

Full transcript (AI-generated)

[00:00:00] Lizzie: Hello, and welcome to "The Measure Pod" by Measurelab, the podcast dedicated to the ever-changing world of data and analytics, with your hosts, Dara Fitzgerald and Matthew Hewson. Between them, they've spent more years than they'd like to admit wrestling with dashboards, data quality, and the occasional Google curveball.

[00:00:32] Lizzie: So join us as we share stories about how analytics really works today and where it might be headed tomorrow. Let's get into it.

[00:00:40] Dara: Hello, and welcome back to "The Measure Pod." I'm Dara. I'm joined by Matthew. How are you today, Matthew?

[00:00:47] Matthew: I'm good. Thank you. Yeah. Nice to see that you remember your own name now.

[00:00:51] Matthew: Getting back into the swing of things.

[00:00:53] Dara: Yeah, you said that last week. Come on. Come on.

[00:00:55] Matthew: No, I didn't.

[00:00:56] Dara: I once... Who doesn't occasionally forget who they are? [00:01:00]

[00:01:00] Matthew: That's true. That is true. Not often happens sober, but yeah.

[00:01:06] Dara: Who says I'm sober?

[00:01:08] Matthew: No. Yeah, you gotta have fun somewhere. Exactly. Podcast drinking.

[00:01:11] Dara: Yeah. Yeah, exactly. But I'm all right. Good. I'm good. How are you? I'm also all right. Thanks for asking.

[00:01:18] Matthew: Yeah, no problem.

[00:01:19] Dara: Okay, that's enough for today. See you next time, everyone. Great. Yes, no I'm good. I'm good. I'm I'm looking forward to this one. I think it's gonna be interesting. We probably always say this before we start recording, but I thought I'd say it as we s- hit record this time.

[00:01:35] Dara: This one's gonna be interesting. Yeah. It's gonna be a wild ride.

[00:01:39] Matthew: It's gonna be good or terrible.

[00:01:42] Dara: No, nowhere in between. Listen, like most things in life, they're either good or terrible. Who likes the in between?

[00:01:50] Matthew: Nobody

[00:01:51] Dara: Not interested

[00:01:52] Matthew: Nobody. If I find it, if I find it's feeling a bit mediocre, I'm just gonna tank it just to pull it down,

[00:01:58] Dara: e- exactly. [00:02:00] Exactly. Exactly. We will kick off with you. I've done a lot of legwork on the news this week. As u- as usual, let's be honest, as usual. But I'll let you, I'll let you steal my thunder. I'll let you have the credit. I'll let you pretend you've done all the research on the news this week.

[00:02:16] Matthew: Thank you for that. There is a... We normally say this flippantly and as a joke, but it is a little bit s- has been a little bit slower. It's hard this because we're t- we're technically recording this probably what?

[00:02:29] Dara: In the future.

[00:02:29] Matthew: Three weeks before we're actually gonna release it?

[00:02:32] Dara: Or in the past.

[00:02:34] Matthew: Yeah, in the past. I'm gonna say it's been a bit of a slow news week, but that is from the 28th of May to, to today, the 8th of June. By the time this comes out, Mythos is probably out and self-recursive is probably... There's Terminators wandering the streets and mowing people down. I don't know. But-

[00:02:53] Dara: People are gonna be like, "What is this guy talking about?

[00:02:55] Dara: Slow news week?" The world as we know it has been [00:03:00] turned upside down, and he calls that a slow... God, it's hard to please this fella.

[00:03:03] Matthew: Who knows? I will... Maybe if there's something absolutely catastrophic happens, I might have to just jump in and do an ad- an addendum to this news item this news section.

[00:03:12] Matthew: But here is, here's what I have as I know it.

[00:03:14] Dara: Maybe this will be replaced by a fully AI-voiced podcast instead, and it'll just be agent to agent.

[00:03:22] Matthew: Yeah. It's- Yeah ...

[00:03:23] Dara: who knows? 'Cause, 'cause we'll all be locked up in little tiny boxes.

[00:03:28] Matthew: In our own little our own little AI spaces

[00:03:30] Dara: Wow we're five minutes into the episode, not even, and already we've gone to humans being kept in tiny boxes.

[00:03:38] Matthew: Dystopia. Yeah.

[00:03:39] Dara: Yeah.

[00:03:40] Matthew: Blame Hollywood. It's... news. So there's a few pieces. Nothing major, I don't think, but... So Anthropic's released another one of its "is it marketing, is it truth?"

[00:03:52] Dara: Yeah ...

[00:03:53] Matthew: blog posts, which is around recursive self-improvement.

[00:03:58] Matthew: And essentially what they're saying [00:04:00] is that they've noticed a very significant trend, to the point where their engineers are now internally producing eight times as much code as they were at the end of last year, and this code is now actively going into improving the models. They're also noticing that at the end of last year the quality of the code was at or just below that of some of their top engineers, and they anticipate that very shortly, by the middle of this year, the code quality is gonna be on a par or better.

[00:04:31] Matthew: In fact, it might have been, it might have said it's on a par currently, but they expect it to start exceeding their engineers' standard outputs in the middle of this year.

[00:04:40] Dara: I I'm probably not supposed to be. You were on a roll there and I've probably just completely ruined that, but it's too late now.

[00:04:46] Dara: It's started. I wanna know how do they assess that? Because what does, what do they mean by... Do y- yeah, because does the, like semantically correct and whatever, and but what but it surely it [00:05:00] depends partly as well on what context it is and what, and the prompting.

[00:05:03] Dara: I, I've just, I'd love to know how they judge that. What does that mean exactly?

[00:05:09] Matthew: Yeah,

[00:05:09] Matthew: I don't know. It's- my guess would be- It's- ... things like, I don't know, efficiency the need for intervention, things like that that, like if it's able to just one shot produce efficient, secure code and a post-review comes out as golden without any need for change Maybe.

[00:05:33] Matthew: I don't know. Your guess is as good as mine. It's- Yeah

[00:05:36] Dara: it's interesting, isn't it? 'Cause it's like, what does... It's- Do you see what I'm getting at? What, like surely already it's capable of being as good as any developer if it gets the right context and the right... 'Cause it's with some of the stuff, it's ob- it's obviously objective, like it's either right or it's wrong, and then other stuff may be less but if they're just doing it purely based on something like efficiency or security or whatever, then you'd [00:06:00] think if it was prompted in the right way a, one of those developers using it to produce code is probably better than one of those developers producing code on their own, you would think.

[00:06:14] Matthew: Certainly more productive.

[00:06:15] Dara: More productive, yeah. Maybe not better. That's why I'm curious what better means in this case.

[00:06:23] Matthew: I think probably the entire... Again, there's a bit of the old marketing spin that Anthropic is pretty good at: the "we're helping humanity" but, at the same time, bigging up what we are able to do and produce, hand in hand.

[00:06:39] Matthew: They're the masters of that. But it's something that the entire, the entirety of software engineering is probably wrestling with, like what is-

[00:06:48] Dara: What means-

[00:06:48] Matthew: What is, and what is a good engineer at this point? Somebody could be technically brilliant But they refuse or can't gel with working in, in, with these [00:07:00] LLMs, so they are technically now worse than somebody who is below their theoretical skill level because they, what- for whatever reason, yeah, they can't produce as much.

[00:07:09] Matthew: Ultimately, it comes down to it comes down to the fundamentals of security and scalability and all of those kinds of things. But the buck stops with being able to produce features and profitability, and if you can ship more with these things that's the way, the least friction, that's the way it's gonna go, isn't it?

[00:07:27] Dara: Absolutely, yeah.

[00:07:28] Matthew: Yeah. As part of that, towards the end of that post, they essentially call for a slowdown on the production of frontier models. They're theorizing that they're gonna get to this recursive self-improvement in short order, but they are not convinced that they or the world are ready for a place where AIs are able to [00:08:00] improve themselves in that way.

[00:08:02] Matthew: Yes, they call for a slowdown. Being cynical, I don't think that's going to happen in any way, shape, or form, and maybe they know that as well, which is why they can confidently say it out loud, but-

[00:08:16] Dara: It makes me think of the most bizarre connection or parallel, whatever. I don't know what it is.

[00:08:19] Dara: Have you seen the Arnie documentary, Pumping Iron?

[00:08:23] Matthew: No. I can't wait to see where this goes.

[00:08:27] Dara: This is the point we lose all our listeners. They're like, "He's finally completely lost it." So there's a moment, this really happens, there's a moment in one of the Mr. Olympias that Arnie won.

[00:08:36] Dara: Lou Ferrigno, Lou Ferrigno, who was the actual Incredible Hulk on the TV series back in the-

[00:08:40] Matthew: yeah ...

[00:08:41] Dara: in the '80s. So he was competing against Arnie, and the two of them they, the judges couldn't decide who's who's the winner. So Arnie said... Yeah, exactly. Yeah. So Arnie said, "Let's just walk off stage.

[00:08:53] Dara: Let's not give them the, let's not lower ourselves. Let's just walk off the stage." And then as they went to walk off, Arnie [00:09:00] turned round and went, "I'm the winner, I'm the winner." And he basically claimed the prize. And it just strikes me that it's a bit like that, that they're maybe saying, "Look, let's all stop.

[00:09:09] Dara: Let's all stop." And then if anyone else actually listens to them, they'll just say, "Ha, we've invented AGI."

[00:09:15] Matthew: Yeah. And they know they can't... Yeah, they know they can't. Yeah. It seems like just a clever thing to put out there.

[00:09:21] Dara: I think they've been inspired by Arnie. I think they're following in his footsteps.

[00:09:25] Dara: They thought, "That's a good idea. Let's do that."

[00:09:27] Matthew: Yeah. Who's the muscliest foundation lab? We're the m- we're the muscliest guys, but if you wanna start, whatever. Yeah, very interesting link there.

[00:09:36] Dara: Yeah. Yeah. I thought you'd like that one, yeah.

[00:09:38] Matthew: Very human. Don't think AI could come up with that one yet.

[00:09:42] Dara: Definitely not. No, definitely not.

[00:09:45] Matthew: As well, and this links to the second news item, they might be saying this stuff because they're in the throes of raising their IPO, which currently would put them on track to be [00:10:00] the most valuable startup ever, 'cause it looks as though the valuation currently is hovering around a trillion.

[00:10:09] Matthew: 900, 965 billion is the last one I saw.

[00:10:14] Dara: Let's just call it a trillion, shall we?

[00:10:15] Matthew: Let's call it an easy tril.

[00:10:17] Dara: Round it up to the tril.

[00:10:19] Matthew: Yeah. And it's, yeah, they're just... It's, there's a lot of stuff going around about what this does just generally to the stock market, and index funds, and all this sort of stuff when you're talking about these kinds of silly money things appearing.

[00:10:33] Matthew: But yeah, that's what they're going for... And that would put them past OpenAI's valuation, I believe, the current IPO valuation. Which is interesting in and of itself, which we might talk about a bit in the OpenAI section. Spoilers, don't know if I've said about that yet, but anyway. Yeah, so bonkers money, basically.

[00:10:52] Dara: Pretty mad, isn't it? It's pretty mad and, this'll be the point, that if they have any skeletons in the closet they'll come out. Or if anything is, [00:11:00] if there's any- Yeah ... tensions brewing then, this'll expose them. It's gonna be an interesting-

[00:11:05] Matthew: Yeah ...

[00:11:05] Dara: interesting point in time for them.

[00:11:07] Dara: But just crazy numbers and crazy growth and everything else. It's just, it's mad. It's-

[00:11:14] Matthew: Yeah. They've pa- I think we've hinted at this on a couple of episodes but never had any numbers to hand, because that's what we do.

[00:11:19] Dara: Yeah, we don't like numbers really, do we?

[00:11:21] Matthew: No. But they have, Anthropic has recently passed OpenAI in annualized run rate revenue.

[00:11:29] Matthew: They're now up to 30 billion, and they've passed OpenAI, who are at 24 billion. And the change, I think in... Let me just find the number exactly. But it was something silly like... One of the craziest numbers was that they went from 14 billion to 30 billion, as in they exponentially increased their annualized run rate in eight [00:12:00] weeks.

[00:12:01] Matthew: It is literally just a vertical line, and they look to be spending quite significantly less than OpenAI on training their models, and less on just generally serving the models, which is also interesting. I was thinking about this actually, because OpenAI have a ton more monthly or daily users.

[00:12:24] Matthew: But that they... No they've got the sort of general p- person ChatGPT market sewed up Yeah. Which is why they're so much more reliant on getting this ad stuff to work, because they need to get revenue from that. They've got all these free tiers in which they're serving up compute to all these people for nothing, and trying to hope that they'll drop into that day-to-day, yeah, how do I make a cake?

[00:12:49] Matthew: Whereas Anthropic is really concentrated on Claude Code and CoWork, and delivering value there. They're not-- Like, people don't tend to use Claude for how you bake a cake. It's much [00:13:00] more about doing. So they have less hangers on, as it were, if you... That's probably a bit rude to call those hangers on. But they just deliver value to a smaller subset, and that, yeah, it seems to be working unbelievably for them.

[00:13:15] Dara: It really is unbelievable. The, just the, that growth trajectory is just insane. And you definitely wouldn't wanna be... would you wanna be, would you wanna be in a position in OpenAI? Just not sure I would. They're having to-- They've had to roll out ads almost in desperation, and Anthropic are just like, are able to now say, "We're not gonna do that."

[00:13:37] Dara: And Google at the moment don't have to. Maybe Google will go that way if their own ad business starts to decline, but they've got all the cloud revenue then as well,

[00:13:47] Matthew: yeah, these two don't really have that.

[00:13:49] Dara: No. OpenAI are the only ones who have really had to think, "Sh- shit, our back's against the wall.

[00:13:53] Dara: We need to get some revenue in."

[00:13:54] Matthew: Yeah. Their ARR was still 24 billion [00:14:00] to Anthropic's 30.

[00:14:01] Dara: Chump change really, isn't it?

[00:14:02] Matthew: But yeah, but none, neither of them are profitable, and it's worth pointing out and saying.

[00:14:06] Dara: Doesn't matter. Companies don't need to be profitable, do they?

[00:14:09] Matthew: Apparently not. Hell, I think Uber was unprofitable until two years ago. But, yeah, I think the idea is that the intelligence the cost per intelligence unit is reducing, and once, once they hit a critical mass where they're getting lots of use out of it, that reduces down, and then the sort of theory is sort of 2029, where those two, two things pass it, this is just a drop in the ocean.

[00:14:30] Matthew: Who cares? 'Cause it's just gonna be a license to print trillions of pounds for any investor,

[00:14:36] Dara: if money means anything at that point.

[00:14:39] Matthew: Yes. Yeah. True. True. But who knows? You're going back into dystopian again

[00:14:45] Dara: And that's just where I live. My mind is a dark place.

[00:14:50] Matthew: Yes. Yeah, I've always said that.

[00:14:52] Matthew: Yeah, and to move off Anthropic then for a minute, Microsoft have brought out a few models [00:15:00] themselves. Have they now? Yes. They've brought out to, to wean themselves off OpenAI a touch. They've brought out a couple of models. I've got the name for them somewhere. I have to cut this silence.

[00:15:16] Matthew: It's-

[00:15:17] Dara: But yes

[00:15:17] Matthew: they've released MAI Code One Flash.

[00:15:22] Dara: Ooh, great name.

[00:15:24] Matthew: Yeah, which takes written descriptions from people and spits out source code for applications and websites. So yes they're clearly trying to dip their toe in somewhere. They, they have a lot of data centers, don't they?

[00:15:36] Matthew: You would assume that they've got some compute behind them.

[00:15:39] Dara: Yeah, they do. I've just gone and ended up on playground.microsoft.ai. It looks like it's got an image generator, a transcription thing for audio to text, and a voice one.

[00:15:53] Matthew: I'm not seeing the other one. Maybe that's not fully available to us yet.

[00:15:58] Matthew: It's... No, probably [00:16:00] not. Anthropic are the only good ones who give us access to everything. At the same time as everyone else. But again, we'll come onto that later.

[00:16:06] Dara: Yeah. Indeed. Yeah, indeed. But that's interesting, yeah. And so what yeah, I wonder what, It's- Yeah, how that's gonna play out then because they're still...

[00:16:14] Dara: Y- like you said, wean themselves off, but have I missed a piece of news? Are they- Are they pulling out of OpenAI or are they-

[00:16:25] Matthew: I don't think so. I just know that there's some... Obviously, when they first invested in OpenAI, it was a not-for-profit, and now they've got this for-profit arm. There was some funkiness going on with the definition of AGI, 'cause that was technically some point at which, I don't know, OpenAI could separate, or... I can't remember exactly the terms of it.

[00:16:47] Matthew: So I don't know if there's, there, there has been a bit of friction and-

[00:16:50] Dara: Yeah. It's just interesting that there's, if there is friction and then bringing out a model is like really showing your hand, isn't it? It's-

[00:16:58] Matthew: Yeah. But [00:17:00] again, they they kind of position themselves a little bit like Google, because I think a lot of OpenAI's models are probably trained on Microsoft data centers 'cause it, that, and that's what the, that's what the initial sort of part- It's-

[00:17:13] Matthew: part of the initial partnership was about, was giving them the compute resource to compete with DeepMind, who had Google's cloud to- It's- ... to train things. Yeah. I don't wanna do a Microsoft version of our,

[00:17:25] Dara: Don't care.

[00:17:26] Matthew: I don't care. Move it up. But yeah, I can't f- I can't ever see myself using Microsoft models.

[00:17:32] Matthew: It just feels too gross.

[00:17:34] Dara: It is a bit gross, yeah. Sorry Bill.

[00:17:38] Matthew: Yeah, sorry, Bill. Think about Epstein Island today.

[00:17:48] Matthew: Maybe let's stick to tech. Let's not go down that road. Moving on swiftly. Gemini 3.5 Pro was supposed to come out in June, may well have come out by the time this goes out.

[00:17:57] Dara: It's June now.

[00:17:58] Matthew: I know. Some [00:18:00] point in June. I think last time we talked about... I said there was like a thinking, but I think it was the 3.5 Flash or Lite thinking.

[00:18:09] Matthew: There is a Pro version due to come out, and it's supposed to come out at some point in June. And finally, it's Claude. This Mythos, I think, appeared briefly for some in the wild as like a UI item. Didn't do anything, but it just-

[00:18:26] Dara: That's how it starts.

[00:18:27] Matthew: Yeah. Some, so it appeared as like a p- a Mythos preview appeared in a couple of places and then disappeared.

[00:18:33] Matthew: They have recently expanded out the Project Glasswing to more companies.

[00:18:39] Dara: Okay. So it's heading there.

[00:18:40] Matthew: Yeah. But I... and I know they mentioned in the 4.8 release that they were working on a new tier of model that would be coming out relatively soon. So you gotta assume it's some version of Mythos that's gonna sit above Opus- soon. [00:19:00]

[00:19:00] Dara: Very interesting

[00:19:02] Matthew: Yeah.

[00:19:03] Dara: That's the news. That is the news. Except this placeholder, which we're gonna fill with all of the big things that happen between now and when this gets released.

[00:19:12] Matthew: Yeah. Yeah. And if we're still saying this, then nothing happened, which I doubt.

[00:19:17] Dara: That's unlikely, isn't it?

[00:19:18] Dara: Yeah. Yeah. Okay, so onto our topic. So we've been promising this for a while: a bit of an OpenAI deep dive, because we've covered Anthropic a lot. We've covered Google quite a bit. To be fair, on the Google front we haven't dug into Gemini a huge amount. It's come up in the news a lot, and we've done episodes around kind of Google's position and things.

[00:19:41] Dara: But oh, I need to be careful or I'm gonna talk us into doing a Gemini, doing this Gemini. This is how we get into this trouble in the first place. Anyway, let's see. Park that thought. But we we figured we'd do a bit of an exploration of OpenAI, because neither of us have really used...

[00:19:59] Dara: we [00:20:00] obviously both started out on ChatGPT, but then fairly early on moved across to Claude. And have been, with a bit of Gemini thrown in there, sprinkled in there along the way as well. But things have changed a lot at OpenAI since we last were properly using their tools, so we figured we'd spend a bit of time exploring them and give our thoughts.

[00:20:22] Matthew: Yeah, I think there's a bit, there's a bit to say about the context of, like, where they sit a- amongst the others as well. What their aim... That we've just briefly touched on in the news section, but, like how they compare to what the apps and what they're trying to deliver and what they're trying to push is all really interesting as well.

[00:20:38] Matthew: Bit of scene setting as well,

[00:20:39] Dara: yeah. Where do we start?

[00:20:45] Matthew: I don't know.

[00:20:47] Dara: This isn't necessarily the main starting point, but you said there about where they sit and where they're positioned, and obviously, like we talked about earlier, and this has come up a few times before, ChatGPT has got the bigger user base, but it's a [00:21:00] lot of... you had some terrible, horrible, mean term for them, hangers-on or something.

[00:21:04] Dara: I would never say such a thing, but- ... you've said it now, so that's the definition, so that's what we're gonna go with. But it's all on you. But yeah, no, they've got more of your kind of general kind of mass market usage. Whereas Claude has from, for a while now, been a lot of kind of focus on the coding aspect.

[00:21:22] Dara: But my kind of initial, very initial take when I started using ChatGPT, a- again now for this episode, and Codex, was they're really similar. Even the way they're designed and everything. I was finding it... Sometimes I had to double check which one I was in, 'cause I had them both open. They're not identical, but just, when you're a bit fixated on what it is you're doing.

[00:21:44] Matthew: Yeah.

[00:21:44] Dara: They're, generally speaking, they're quite similar. Like even Codex now is very much like CoWork, whereas it wasn't up until recently. I think it was just a CLI, and now the desktop app is very CoWork-y. So a lot of the, like, where you [00:22:00] expect to find things, they're in similar places, and they work in a vaguely similar way.

[00:22:05] Dara: Obviously, performance is different. We're gonna go into that. But just as a general kind of, what you might use them for, I found a lot of parity. I did. From my kind of use cases, I thought, "Do you know what? I could probably interchange between them without too much, without noticing too much of a real practical difference."

[00:22:25] Matthew: Yeah, I guess the main difference I noticed was that it, it did feel a little less mature, the Codex app, compared to Anthropic, because Anthropic have kinda combined their three verticals into that single application now. So you've got the chat, you've got the co-work functionality where it's connecting to all of your apps, et cetera, and then you've got code in there where you can run cloud versions of Claude code and things, and it all kinda sits in the same interface.

[00:22:51] Matthew: What I was finding with... correct me if I'm wrong here, 'cause I didn't end up using the app, the Codex app a lot for reasons that I'll come onto, [00:23:00] but I, it felt like it was very much just like the co-work equivalent as an app. It was a standalone app.

[00:23:06] Dara: No. This I found the same and it's really weird 'cause I had the ChatGPT app and I had the Codex app, but a lot of what I was...

[00:23:12] Dara: But I was doing coding in the CLI. So I was like, "Hang on. Why have I got this Codex app and a ChatGPT app? Why are they two separate apps?"

[00:23:21] Matthew: And I wonder if that's because they, because of what we've talked about, that they've got so many hangers on. They've got so many people who aren't gonna be interested in running code or even...

[00:23:34] Matthew: they're just people who just want the chat interface and apps in the chat interface, and booking hotels and doing all those sorts of things, that if they pull all of it together it starts to conflate things a little and make it confusing for their huge user base.

[00:23:47] Dara: Yeah. Yeah you're probably right.

[00:23:49] Dara: Yeah, no, I think you're probably right 'cause it's a bit like with the, within the Claude desktop app, you wonder like sometimes why would you use chat, but then you might use chat to do deep [00:24:00] research. But that's about the only thing chat really does now that you can't do in co-work. And then on the OpenAI side you've got that same thing, but you've got it across two different apps.

[00:24:10] Dara: So you're right. Probably for most people they might just download ChatGPT, and they'll use that, and that's fine. Or they'll just do it in the browser. But yeah, when you're kinda trying to use all of it, you end up thinking, "Hang on. Why am I..." But like the terminal open for... Which you don't have to do.

[00:24:25] Dara: You could obviously do the coding in Codex, the desktop app, but I was choosing to do it in the CLI. So then you've got three different apps open, and you think, "Surely there's an easier way. Surely there's an easier way than this." But yeah, maybe it is just as you said, that they're just aiming them at different parts of their target audience.

[00:24:46] Matthew: Yeah, you'd have to say one of the...

[00:24:48] Matthew: So they roughly have the same set of things, right? They have the chat, they have the relatively new Codex co-working type equivalent, and then they have the [00:25:00] CLI coding stuff going on. And the way I felt it was, when I was in, say, the chat app... I actually use chat through the web UI.

[00:25:11] Matthew: I didn't even think about downloading the sh- the app, but I was using it through the web UI. That feels more mature than just Claude's chat. It feels like that, you could tell they've been there for longer, and they've spent more time on that piece. Whereas the other two, the Cowork and the CLI, I still felt that Anthropic are far ahead on those two other pieces, like Co- Claude Code and the Cowork stuff.

[00:25:38] Matthew: Anthropic still has a thing. And I noticed, I went and was, I was on a VPN, just h- how... I forgot I was on a VPN for the US, 'cause I was trying to use G- Gemini Omni to create silly videos. And I went into ChatGPT, and then I was asking it something, and it's popped up an app, and like it was showing me a map that I could interact with and stuff.

[00:25:59] Matthew: I was like, "Oh yeah, [00:26:00] forgot about these apps." But then I looked into it, and that's only 'cause I was on the US. It still doesn't exist in the UK, and that was like, oh, that's terrible, 'cause that was a really... I was like, "Go ChatGPT." But yeah, that, that would be my assessment.

[00:26:14] Dara: And that was interesting actually what you said there with the, 'cause the geo restrictions was part, was one part of something that I found really both frus- frustrating and amusing, where as well as doing...

[00:26:26] Dara: So we'll get into this, but we both picked an example of something, and we went and worked on it to compare. But alongside that, I just wanted a simple list of what each of them does. Sorry, so a list of what they both offer that has some equivalence. Features where, even if they're not the same, you can vaguely do the same thing across both of them.

[00:26:48] Dara: And then where each of them have something that the other one doesn't. And using AI to get that list is an area where AI still has some work to do, because I got each of them, I gave each of [00:27:00] them a fair shot and said, "Look, do this for me," but knowing that they'd be a little bit biased. But it wasn't even just that they were biased, they just had their weaknesses in terms of getting real, up-to-date information.

[00:27:11] Dara: And then I thought I'd use Gemini as the arbitrator, and Gemini even confidently said, "Oh, you're right to give this job to me, 'cause I'm not a horse in the race, and, I've got the most up-to-date real-time information," yada, yada, yada. So it built me the list, and I passed it back through the other two, and they ripped it apart

[00:27:29] Matthew: Maybe that's in the training data, just to disparage

[00:27:32] Dara: Yeah, I think so.

[00:27:33] Dara: But I ended up in this l- never-ending... I had to just stop eventually. I think I was on round six of passing the feedback back round, and I just, all I wanted was it complete. I should have just manually done it. I just wanted a list of what they do. But anyway the point within the point was part of that is because of the geo-restrictions.

[00:27:50] Dara: It was including things saying they're available, and then actually they're available to some people but not to others. So it's hard. It's a bit of a moving target trying to find [00:28:00] out what... And because they're releasing stuff left and center, and deprecating stuff. One of the features that came up, I don't know if I can even remember what it was now, Canvas I think it was called in ChatGPT, but they got rid of it.

[00:28:12] Dara: It's just rolled into something else now. There were things showing up on the list that don't exist as of last month. There were things that weren't showing up that do exist now. And then things that are either partially available or whatever. And then mix in your little sprinkling of hallucination as well, and it's actually quite hard to get this definitive list.

[00:28:30] Dara: I gave up on that in the end. I think that is... I need to be careful not to just wang on about Anthropic the whole time. But it is one thing, and I think we've commented on this in the past, but Google and OpenAI are terrible for just US-based releases.

[00:28:47] Matthew: Or OpenAI is very US-based with its releases. Gemini does the same thing, but Google goes one step further and announces things three years before they come out, and then by the time it actually reaches you, it looks completely different, it's called something different. [00:29:00] With Anthropic, as far as I can tell, most announcements happen, and then I open Anthropic, and it's there, and I can use it.

[00:29:07] Matthew: I've never quite felt the sort of US, UK, or Europe, or rest of the world divide. No. And if they can do it, surely... There can't be a good enough reason why the others don't if Anthropic are able to do it.

[00:29:22] Matthew: No. No. But yeah, I agree.

[00:29:25] Matthew: It is difficult to know the parity, and it's difficult with, like I was finding... Maybe I'll come onto this when we come to the project. But when I was using various features and things in the CLI, I was struggling... They're all named slightly different things, but they do similar things, and it was hard to navigate my way around, like, how do I get it to do this?

[00:29:47] Matthew: How do I get it to do this thing I've got muscle memory for in Claude Code in here? Which, in some cases, I couldn't do.

[00:29:52] Dara: No, exactly. And that, plus trying to match the equivalent model and the settings across them as well. [00:30:00] It's trying to make sure that you're using a comparable model and a comparable sort of thinking level.

[00:30:06] Dara: I don't know, reasoning level. And then how autonomous it is as well. So I was trying to step my way through quite carefully, thinking, "I don't want to bias this by accidentally giving one of them an advantage over the other just 'cause I'm not so used to using OpenAI." Also, normally, when you're using Claude, you're not necessarily thinking...

[00:30:25] Dara: You're thinking about maybe things like that in terms of what output you want, but you're not thinking about it in terms of comparing it to anything else. I was trying to just make sure that I was doing everything fairly between the two of them.

[00:30:36] Matthew: Yeah. I still have one good place to start.

[00:30:40] Matthew: We've already started, that's what I'm on about, but one good place to continue starting. How did you find the sort of downloading, setting up, and starting work with Codex, like the application?

[00:30:54] Dara: Will it annoy you if I say it was really easy and smooth, and I had no errors or anything like you did?

[00:30:59] Dara: It [00:31:00] just, yeah, everything just seemed to work okay for me.

Matthew: Mine, yeah, mine was terrible. So I downloaded it. It offered up... Another interesting thing is this migration stuff that's built into it, and I'm sure there are migration assistants in the others as well, but when I opened it for the first time, it was like, "Do you wanna migrate from Anthropic or wherever?"

[00:31:22] Matthew: And obviously, I said yes. But it didn't work. It brought in all these chats, but it was like it had no idea about them. I was hoping it would bring in memory and things like that, 'cause one of the things I found later was I was able to get better answers because Claude has just...

[00:31:38] Matthew: I've just got a load of memory... I don't even know where it's stored. It's just all over the place. It seems to know these things, so I was getting better answers in that way. So yeah, that happened, and it didn't really properly migrate. And then I don't know if this was because of that or just some particular version of the app I had, but it was incredibly unstable.

[00:31:58] Matthew: It crashed [00:32:00] continuously. I think it crashed about eight times in a few hours by the time I gave up and just started using the Codex CLI instead of bothering with that app. The actual Codex application itself, for me, was unusable.

[00:32:15] Dara: I didn't do any coding through the Codex app. I just did some conversations.

[00:32:22] Dara: I did the coding through the CLI, just in the terminal.

[00:32:26] Matthew: Yeah. Yeah.

[00:32:27] Dara: So maybe I would've hit more issues if I'd tried to do-

[00:32:31] Matthew: Maybe. Maybe. There was something I was clearly triggering that you weren't, that was just causing it to fall over every five seconds.

[00:32:38] Dara: No, I had no-

[00:32:39] Matthew: It is a newer app in comparison to Claude Desktop, but still.

[00:32:45] Dara: Yeah, no, I had no major hiccups technically on either side. But then I already had Claude set up. But it behaved itself, which it doesn't always. And the OpenAI stuff worked okay as well. [00:33:00] Lucky you. Yeah. I'm not usually so lucky with these things. My swearing at it was quite minimal actually.

[00:33:06] Dara: It was a bit... I had to go and swear at the wall just to get the-

[00:33:09] Matthew: Maybe we should- Just to get my bears ... At the end of this podcast, we'll have a stick or twist, and you've gotta decide whether you're gonna stick with Anthropic or you're gonna twist to OpenAI.

[00:33:20] Dara: That could be interesting. Yeah. That'd be interesting. Yeah. Yeah. And so how did you find Codex? Should we go into the little projects we played with?

[00:33:29] Dara: Yeah. So I'd made life a bit difficult for myself, to be honest, I think. I probably spent all the time doing the test and not a lot of time processing the results and preparing to talk about the results.

[00:33:42] Dara: I wrote a big, long document here that I'm gonna probably have to try and smoothly scan through and pick bits out. But so what I decided to do... I thought I'd pick something non-work related, partly, and this is maybe overthinking it, 'cause I was gonna use [00:34:00] my work Claude account, so I didn't want it to have any context that might be useful for it from previous conversations.

[00:34:06] Dara: So I decided to build a... there's loads of these online, but I decided to build a running race predictor. So, to predict the time that you'd run in a race based on previous race results and training leading up to the race. So useful for me, something that I would use personally.
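For a sense of what a predictor like this computes, here is a minimal sketch using Riegel's formula, one widely used published method (predicted time = previous time × (target distance / previous distance)^1.06). This is an assumption for illustration only: the episode does not say which methods the four apps ended up using, the function names are hypothetical, and this version ignores training data entirely.

```python
def riegel_predict(prev_time_s: float, prev_dist_km: float,
                   target_dist_km: float, exponent: float = 1.06) -> float:
    """Predict a race time in seconds from one previous result.

    Riegel's formula: T2 = T1 * (D2 / D1) ** exponent.
    The exponent 1.06 is the commonly quoted default; it is a tunable assumption.
    """
    if prev_time_s <= 0 or prev_dist_km <= 0 or target_dist_km <= 0:
        raise ValueError("times and distances must be positive")
    return prev_time_s * (target_dist_km / prev_dist_km) ** exponent


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS, rounded to the nearest second."""
    total = round(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Hypothetical input: a 50:00 10K used to predict a half marathon (21.0975 km).
    predicted = riegel_predict(50 * 60, 10.0, 21.0975)
    print(format_hms(predicted))
```

Because the exponent is above 1, the predicted pace slows as the target distance grows, which is the fatigue effect the formula is meant to capture.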

[00:34:24] Dara: And I wanted to have minimal involvement. So the way I did it, I did two deep research pieces, one with ChatGPT and one with Claude, and I fed the same prompt into both. I said, "This is what I wanna build. All I'm gonna give you is this prompt. You need to research all the available best practice methods for predicting race times."

[00:34:49] Dara: There's a bunch of them out there already, and it's, "Go and research them and bake the logic that you think is best into a standalone web [00:35:00] app." And the output of the deep research needed to be the brief for the coding agent. So I fed that prompt into ChatGPT and into Claude.

[00:35:10] Dara: So I got two outputs, and I fed each of those outputs into both Claude Code and Codex, but blind, so they didn't know which one had produced which brief. So I ended up with four running apps. And then those four were assessed based on accuracy and quality of the code. And actually, this is interesting 'cause this links back to what we talked about earlier about what "better code" means, because I had them assessed then by Gemini, and it looked at the prediction, it looked at the quality of the code, and it looked at a bunch of other things as well, some kind of usability things.
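The blind two-by-two design described here can be sketched in a few lines. This is an illustration under assumptions, not Dara's actual setup: the function name, the "Brief A/B" labels and the seeded shuffle are all hypothetical. The idea is that each coding agent only ever sees a neutral label, while the experimenter keeps the key that maps labels back to the research tool that wrote each brief.

```python
import itertools
import random


def build_blind_runs(brief_sources, coding_agents, seed=0):
    """Return (runs, key) for a blind brief-by-agent experiment.

    runs: one entry per (brief label, coding agent) pair, with no source names.
    key:  label -> source mapping, held back until the results are judged.
    """
    rng = random.Random(seed)
    shuffled = list(brief_sources)
    rng.shuffle(shuffled)  # so "Brief A" does not always mean the same tool
    labels = [f"Brief {chr(ord('A') + i)}" for i in range(len(shuffled))]
    key = dict(zip(labels, shuffled))
    runs = [
        {"brief_label": label, "agent": agent}
        for label, agent in itertools.product(labels, coding_agents)
    ]
    return runs, key


if __name__ == "__main__":
    runs, key = build_blind_runs(
        ["ChatGPT deep research", "Claude deep research"],
        ["Claude Code", "Codex"],
    )
    for run in runs:
        print(run)  # four runs, one per resulting app
```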

[00:35:55] Dara: And the... Shall I move on to the results or, how are we gonna, [00:36:00] do we do the big reveal at the end or should I just jump straight to the results now? Yeah.

[00:36:06] Dara: So the deep research was significantly better with ChatGPT.

[00:36:13] Matthew: Yes ...

[00:36:14] Dara: and the brief that it produced from that to go into the coding agent was better.

[00:36:20] Matthew: I might as well mention mine, 'cause I did a test comparing deep research on both, and I have some headline stats. Exactly the same prompt into both. ChatGPT took 29 minutes to do the deep research, versus six to seven minutes for Claude.

[00:36:43] Matthew: The searches ChatGPT did was like 530-odd versus... It didn't have a direct comparison of searches. Claude called it sources, which I assume was the same thing, which was 230 sources.

[00:36:57] Dara: My searches and sources were [00:37:00] more similar to each other, but the times are just very similar to yours.

[00:37:04] Dara: So 32 minutes for ChatGPT, and nine minutes 59 seconds for Claude. But Claude's sources was 245, and ChatGPT's searches was 277, so they were closer on that. But the time it took ChatGPT... And I was annoyed at first, 'cause I was saying, "Oh, that's a black mark against ChatGPT," but the quality of it was...

[00:37:27] Dara: And this was judged, in my case... I don't know how you did it. I got them to judge... So I got a fresh Claude session and a fresh ChatGPT session, and fed just the output of the research back in, and then I also fed it into Gemini, and all three of them said ChatGPT's was better. So even Claude.

[00:37:47] Matthew: Mine was corrupted, essentially, because I had it do some deep research on something I was looking up around Scene.

[00:37:56] Matthew: So I fed both of the outputs [00:38:00] into Gemini, but the output for Claude was elevated 'cause it knew what Scene was, which was part of what I was thinking: there's a little bit of a walled garden there. It's got so much context tucked away on things.

[00:38:14] Matthew: It knew what Scene was, it knew what kind of application and everything, and it fed that into its context when it was doing the research. Whereas ChatGPT went more generic because it didn't know what Scene was. So essentially Gemini said this: "It's a lot more thorough, and as a standalone piece of research, a lot better for ChatGPT.

[00:38:33] Matthew: But the most useful piece of research probably to you right now is the Claude one," because it knew what Scene was. In the end, I combined them both together and used that. But yeah, I kinda corrupted it a touch.

[00:38:44] Dara: That makes sense. That's why I went with the running app thing, 'cause I was like, I'll keep anything...

[00:38:50] Dara: I'll try my best to keep it unbiased. Obviously, it's hard 'cause there's lots of other places where it could swing one way or the other. But yeah, it seems like ChatGPT's deep research is [00:39:00] just unanimously better.

[00:39:02] Matthew: Yeah. It'd be interesting to look at Gemini's as well, because I have... I remember finding Gemini's deep research really good in comparison to OpenAI's at one point.

[00:39:11] Matthew: I think I shared two documents internally saying, "This was ChatGPT's, this was Gemini's," and Gemini's was much more thorough. So you can imagine, with the might of Google in search, they would have some pretty good clout behind it, so I'd be interested to compare those two.

[00:39:30] Dara: One thing I noticed with Gemini, 'cause I haven't used Gemini much lately, is it was unbelievably quick. On one occasion I had to actually say, "Have you actually checked this? Have you verified this? Because you just spat this answer out." It was when I was getting it to do the list of features, which actually turned out not to be entirely right anyway.

[00:39:52] Dara: But it was saying, "Yes, I pulled all this information in real time from the interwebs." But it was blindingly quick. I couldn't believe how quick [00:40:00] its response was.

[00:40:01] Matthew: Yeah. That might... I think their default model is that 3.5 Flash now. So I bet that is quick to respond.

[00:40:06] Dara: I'm not sure... I think I was on Thinking. Maybe I wasn't. No, I was on Flash, yeah. Yeah, so maybe that's why. So then I fed those two briefs as they were into Claude Code and into Codex and had them both make two apps. So as I said, I ended up with four, and the deep research was undeniably better from ChatGPT, but the coding was unanimously better for Claude Code.

[00:40:38] Dara: But the reason, or partly why I was saying it's interesting about what "better" means, is because of the most accurate predictor... So what I did with this, I then took historical data of mine and fed it in to get it to do a prediction of a real race that I'd done, and the most accurate one was the ChatGPT deep [00:41:00] research and the ChatGPT coding agent.

[00:41:04] Dara: But the code wasn't as good, and Gemini was judging it based on, I don't have the detail of exactly what, but it was judging it based on quality of code. And again, I didn't get time to dig into exactly what measures it looked at or how it defined good quality code.

[00:41:23] Dara: But Gemini and Claude and ChatGPT... 'Cause again, for all of this, I fed it through all of them anyway, and I used Gemini as the kind of official judge, but I was curious to see what the others would think as well, and all three of them thought that Claude Code's code was better. But it wasn't the most accurate predictor, which was the point of the app.

[00:41:44] Dara: So I thought that was interesting. But maybe this is me, because I'm thinking more of the output and not of the line-by-line quality of the code. But to me, I would say the winner, quote unquote, would be the one that [00:42:00] did the best job.

[00:42:02] Matthew: Yeah. I guess there's a lot of aspects to an app as well as...

[00:42:07] Matthew: I suppose it could get the right prediction, but then have a giant glaring security hole in the middle of it that gives everyone 50-

[00:42:14] Dara: Which would be bad, yeah.

[00:42:15] Matthew: But there's like the visual aspect and UI and all those sorts of things that would need to be considered as well, right?

[00:42:21] Matthew: But-

[00:42:21] Dara: Yeah. Yeah.

[00:42:22] Matthew: On the model, I think ultimately GPT 5.5 I've found to be a really good model. It was a good model. Yeah... It didn't feel like it was massively behind anything else. It felt pretty much on parity with the experience I've been having with 4.8.

[00:42:38] Matthew: It didn't feel like I was taking a step down or, and it didn't feel like I was taking a giant leap up, but it felt, yeah, it felt good. It felt right. So-

[00:42:47] Dara: Yeah. Comparable. Yeah, I agree. Yeah.

[00:42:49] Matthew: So I guess the only difference there that you're experiencing in the code quality is from the chops of Claude Code, right?

[00:42:58] Matthew: Like how much more mature that is as a [00:43:00] product compared to... 5.5, interestingly, while we're on the model, when I was doing some research, and this might be wrong, 'cause I poked at this...

[00:43:11] Dara: Bit of salt.

[00:43:12] Matthew: Bit of salt in there.

[00:43:12] Dara: Yeah, about time we sprinkled some salt in this episode.

[00:43:16] Matthew: 5.5 was apparently their first rebuilt model since 4.5, as in retrained model. Oh. Which would throw our point- It would, yeah. It would, out the window, yeah... It's apparently rebuilt with agentic workloads being the focus. I've not been able to find a reliable...

[00:43:40] Matthew: I don't know where it got that from. Could be hallucinated, but I felt we hadn't been salty enough, so I'm chucking it out there anyway.

[00:43:45] Dara: Yeah. Let's just say it's true. Yeah. So something I had, I don't know if you had this, and I did try and match them, so I don't know if this was something in the settings.

[00:43:55] Dara: I probably could have [00:44:00] stuck at it, but I couldn't immediately get ChatGPT to go into dangerous mode or whatever. It was saying something like it couldn't do it for whatever reason, and I just thought, "Okay, fine. I'll do it in auto edit mode," or whatever that one was called, the kinda standard one.

[00:44:14] Dara: And I did the same for Claude and Codex. And one of the things that I just observed was Claude asked me... it's called an intervention or something, is it? There is a technical word, isn't there, for when it asks you for permission. It's not asking you to reply, but it's asking you to approve something.

[00:44:32] Dara: And so these are both for Claude doing the work, but it asked me six times on the Claude-generated brief, and it asked me five times on the ChatGPT-generated brief. And Codex asked me zero times for both briefs. It just did it, and came back and said, "It's done that."

[00:44:56] Dara: Which I thought was interesting. I don't know what it means or why, but [00:45:00] I found as well, and this is not thoroughly tested, that even when I was using Codex as like a Cowork equivalent or even just chatting with ChatGPT, I felt like it was being a little more proactive and not trying to come back and check things with me compared to Claude.

[00:45:19] Dara: You could see it figuring things out. It was like, "Oh, the thing he asked, I don't have access to that, but I could look somewhere else." And it was going and actually trying to figure things out. Whereas Claude by default is trying to say, "Oh, you didn't mount that folder." And then sometimes you even have to say, "Ask me for permission for it then, and I'll give it to you," and then it will do it.

[00:45:37] Dara: But ChatGPT just seemed a little bit more fluid in that respect, which is not necessarily a good or a bad thing. So obviously you might want to give it permission to things again. But it just seemed a little bit more willing to go the extra mile without checking in on everything with me.

[00:45:56] Matthew: Yeah, I think it [00:46:00] feels more convenient, but...

[00:46:01] Matthew: It feels like the harms may outweigh the convenience of it, especially if it's just going and doing things and looking in folders without checking. And maybe this comes from Anthropic's safety bent, that they are, apparently, responsible AI builders, and they bake that stuff in more.

[00:46:21] Matthew: But what I found... Is that a wrap-up on your coding experiment? Or did it give any other-

[00:46:31] Dara: Loads, loads more to tell you.

[00:46:33] Matthew: Let me tell you about my coding thing, 'cause it fits nicely into that bit. I didn't do a comparison. I just used Codex CLI to build out a new app.

[00:46:44] Matthew: So I started off with a bit of a conversation and brief, just coming up with an idea, 'cause I didn't really have one at the time. Something relatively relevant to what we do. So I was looking at this sort of tag management and things like this, and [00:47:00] saying, how does that work in the modern era?

[00:47:01] Matthew: Is that an old solution trying to adapt to a new problem? If you had to build it today from scratch, how would you do it? And I found that conversational aspect pretty good. Like it was putting forward suggestions. It was pushing back a couple of times saying, "I wouldn't really go in that direction," and, "I'd go in that way."

[00:47:18] Matthew: So it wasn't... It felt like they'd tried to pull back on the sycophancy of it a touch in places, and it felt like quite a good brainstorming partner. And then I started to build out this, it's essentially a bot and LLM analytics platform. So it just concentrates on the bot and LLM traffic your site gets, and how you can improve that, because those are the future people who are gonna be using your website.

[00:47:46] Matthew: That was the idea anyway. And yeah, what it produced was good in the end, and it was able to go off and create things autonomously. But ultimately I found it struggled, or I couldn't get [00:48:00] it to go long horizon in the same way that I could with Claude. I felt like, especially with auto mode now in Claude, I can pretty much walk away and be confident that it's just working along in the background.

[00:48:14] Matthew: And I found it harder to do long horizon tasks with Codex. And I know you can push it into modes, you can get it into a dangerous mode. But there's a critical difference between Claude's auto mode and dangerous mode for these things. Claude's auto mode passes off decisions to a Sonnet model to say, "I wanna do this.

[00:48:38] Matthew: Do you think I'm okay to do this?" And then that's assessed, and then if it's okay, it continues working through its process. So it kinda has an arbiter for the things it's doing. And when that pushes back and says, "No, you can't do that," that's when it comes back to you, which is few and far between.
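The arbiter pattern Matthew describes can be sketched roughly as follows. This is a toy illustration based only on his description, not Anthropic's implementation: the `Action` type, the function names and the rule-based stand-in for the reviewing model are all hypothetical. The point is the control flow: each proposed action goes to a reviewer first, and the human is only asked when the reviewer says no.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Action:
    """A step the agent proposes to take (hypothetical type for this sketch)."""
    description: str


def run_with_arbiter(
    actions: List[Action],
    arbiter: Callable[[Action], bool],
    ask_human: Callable[[Action], bool],
) -> List[Tuple[str, str]]:
    """Gate each action through an arbiter, escalating to a human on refusal."""
    log = []
    for action in actions:
        if arbiter(action):
            # The arbiter is happy, so work continues without interrupting anyone.
            log.append((action.description, "auto-approved"))
        elif ask_human(action):
            # The arbiter pushed back, so the decision comes back to the user.
            log.append((action.description, "human-approved"))
        else:
            log.append((action.description, "blocked"))
    return log


if __name__ == "__main__":
    # Stand-in arbiter: a simple rule instead of a model call (an assumption).
    def cautious(action: Action) -> bool:
        return "delete" not in action.description

    steps = [Action("read README.md"), Action("delete build folder")]
    print(run_with_arbiter(steps, cautious, ask_human=lambda a: False))
```

A mode with no checks at all would correspond to an arbiter that always returns `True`, which is why the two feel so different in practice.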

[00:48:52] Matthew: With dangerous mode, it's just, "Do whatever you want. Don't check. Just go and do whatever." And I felt this [00:49:00] today when I was back using Claude Code again. I set it a task to review an issue in GitHub and to come up with a plan of action and to assess a few things.

[00:49:13] Matthew: And that used Ultra, and it went off and spun up like 16 agents and ran for an hour in the background. Just ran away and came back with this plan and all of these inputs. I struggled to get it working that way in ChatGPT, and maybe I wouldn't have cared as much, but very recently I've started to move more and more into this sort of autonomous handoff way of working, and it felt really jarring to have to keep going and clicking yes, even if it was only like a handful of times.

[00:49:43] Matthew: You come back, you're like, "Ah, yeah. Yeah." So it felt like Anthropic is further along in that long horizon agentic capability compared to ChatGPT. But it might be some settings that I couldn't quite figure out or [00:50:00] find. I couldn't find any equivalent of auto mode, like the handoff checker that exists in Claude.

[00:50:06] Matthew: But some things I did find and use were like the goals... how to play with them. So it was building out this model for identifying bot traffic and coming back asking me all the time. In the end, I set it a goal to review, improve, and get the identification model and algorithm as good as it could.

[00:50:28] Matthew: So it looped over those things a little while and returned something to me. Similar to loops, I think, in Claude. I think they're the same thing called different things, but it did a pretty good job of that. And ultimately what it output was like the UI. I didn't do like a big design file.

[00:50:45] Matthew: I got it to just create a UI. It came up with the concept of, what's it called? Actors, like human actors, LLM actors, bot actors. It came up with this whole concept for the application and then built it out, and ultimately, it worked. And the model was good. I just [00:51:00] found it a little bit more limiting in terms of long horizon.

[00:51:05] Dara: Yeah. And maybe you're right. I don't know. I said this earlier as well, didn't I? But not knowing exactly if they're set up equivalently, because I think there probably are settings in there that maybe... It doesn't mean that... that probably is a difference, and that probably is a genuine difference in how they perform in that long horizon stuff.

[00:51:21] Dara: But yeah, I didn't spend enough time, 'cause we've been using Claude for so long. I was quickly trying to set things up and run them, and I didn't have time to interrogate exactly what all the different options are. But something that I'll say, going back to the quality and stuff, like the quality of Claude Code's code is obviously better because it seems like across the board that was the finding.

[00:51:46] Dara: But Codex was cheaper and quicker, and it comes down to what you value, what you're looking for. And it's the same as choosing the model. It's, what do you need for the job at hand? [00:52:00] Because, I don't think I have the tokens in front of me, but Claude Code used up a lot more tokens to produce that better code, which maybe you think, "Of course, 'cause it's doing better work."

[00:52:08] Dara: But you'd have to get into, what's the meaningful difference in the quality? And how much do I wanna pay for that difference? And the speed thing is another one. If it's a production-ready application, you obviously want it to be as top-notch as possible.

[00:52:27] Dara: But maybe for other things, if it's just for your own use, you don't want it to not work, but you might prefer to go for something quicker and slightly less robust, or less perfect code, and you might favor using fewer tokens and having it quicker potentially.

[00:52:43] Matthew: Yeah.

[00:52:44] Matthew: I think the token thing's funny 'cause it's hard to know with Claude what tokens it's actually using. They've done a lot of work with the cache and caching plugins, so it could be like 90% of them aren't actually something you're [00:53:00] being charged for. But yeah, I know what you mean, and as well, it depends on the way you're using it.

[00:53:04] Matthew: Claude could be slower, but you could step away and do something else for a long period of time 'cause of this long horizon, so it could feel quicker. But then at the same time, if you're just wanting to bash through things in the interface and sit in that session, then yeah, maybe quicker models might be the way to go.

[00:53:24] Matthew: Maybe Gemini 3.5 Flash is the way to go 'cause of how quick it is. I found as well, and again, this could be a settings issue, but I did some research after. I found it didn't tend to go into parallel agents.

[00:53:38] Dara: No, I didn't see that at any point.

[00:53:40] Matthew: Apparently it is a capability, but with Claude Code, very much now, its recommended approach is to spin up sub-agents, parallel agents with all their own context windows and everything like that.

[00:53:52] Matthew: So it'll just spin up 16 agents and work through different aspects of things. I didn't notice it doing that, and again, that feeds into the long [00:54:00] horizon feeling I got, but I didn't notice it doing that a lot, even though it potentially does. Maybe you have to force it into it a bit more- Yeah, maybe

[00:54:08] Matthew: than with Claude.

[00:54:09] Dara: The other thing I wanted to get to and I didn't was comparing Atlas against, I know Atlas is a standalone browser, but comparing the in-app browser. So I was gonna do some testing, get them both to run through the different apps, but I ran out of time in the end.

[00:54:26] Dara: And also I have to admit, I was slightly less interested in that. I forgot it existed until you just said it. There you go. Yeah. Yeah. Yeah. I used Atlas. I downloaded it when it came out, like the actual browser, and I just didn't like it 'cause there's that thing, we've talked about this before, I don't know if we've talked about this on the podcast or outside of it, but you end up sitting there watching it do the thing and you think, this is a weird concept.

[00:54:49] Dara: It's almost like there's that in-between technology before it becomes something. It's like they can't quite release it to people and have it go away and do things without them knowing. So they're like, "Oh, no, [00:55:00] here's the interim approach. We'll have it play out the automation in the browser."

[00:55:05] Dara: But it's weird from a user point of view 'cause you're just sitting there watching something happen in your browser. So it feels a little like an in-between tech. It's like the MiniDisc, it feels like an in-between technology.

[00:55:16] Matthew: I miss the MiniDisc.

[00:55:17] Dara: Same.

[00:55:18] Matthew: Yeah.

[00:55:19] Matthew: But that's changing for me now, and again, like I just said, it's because of the long horizons. I'm getting used to, and trusting more, that I can just walk away or go attend to... I built a little notification service for myself last week where I can set it off on auto mode to do things, and I can be off doing something else, and it'll ping me and say it wants some input or it's finished.

[00:55:44] Matthew: And that could be 40 minutes of it just doing work in the background, or even longer, like that example from earlier. I'm getting more and more used to that, and that's part of what was in that Economic Index thing that Claude released a little while ago, wasn't it?

[00:55:57] Matthew: Like how much is augmenting and how much are you [00:56:00] starting to just go, "There you go, you handle that."

[00:56:02] Dara: Yeah. Figure that out for me. Yeah. Did you... So going back to what I said about my slightly circular experience of trying to get a clean list of what they do and what they each do that the other one doesn't.

[00:56:14] Dara: I ended up giving up on that a little bit 'cause it was getting farcical. But did you find anything that you felt was missing from the OpenAI set of tools that Anthropic has, or vice versa? 'Cause I don't think I did, and I didn't explore everything. Apart from, obviously, there is the Atlas browser, and that's a standalone thing, and Anthropic don't have a browser.

[00:56:37] Dara: But in terms of the kind of day-to-day functionality of what I would normally use, I didn't find anything missing. But equally, I didn't find anything in ChatGPT or Codex that I couldn't do already with Cowork or Claude Code.

[00:56:52] Matthew: Other than the pieces I just mentioned, like the little... And I don't know if it's muscle memory or what within Claude Code where I [00:57:00] know just to get this working or spin that up or up the work rate or whatever.

[00:57:05] Matthew: Other than those kind of things, the only one that stuck out to me is the code interface within the desktop app for Claude. 'Cause as part of that sort of long-horizon stuff, I've very recently started using it to actually code, because they've got like a file browser in there now, and I've always been very much in Visual Studio using the CLI or the terminal.

[00:57:31] Matthew: I've recently started using that. And that's got stuff in it, like it shows me where my current token rate and burned out over five hours or whatever. It shows me the different merges I can do. It can go to auto mode so I can get notifications on my phone and keep it going and moving as I...

[00:57:48] Matthew: So I found myself on my phone being able to reply and get it to carry on working. So the biggest missing piece for me was just that whole interface, that whole [00:58:00] nicer way of dealing with autonomous ways of working with code, and the remote stuff.

[00:58:06] Dara: Yeah, 'cause I wasn't sure.

[00:58:06] Dara: I wasn't sure about that. It was something that came up, I think, in me trying to compare them, and then it was flagged as a possible hallucination, so I wasn't sure whether you could or couldn't do that remote stuff with ChatGPT.

[00:58:20] Matthew: I couldn't find anything obvious. I mean, I would assume not.

[00:58:27] Matthew: Because it's built into the app. Unless you can do it by Telegram, 'cause I know that Claude has the Telegram stuff up there. But I would say that probably in short order, OpenAI's going to come out with something, right? 'Cause they hired the guy who built Clawdbot.

[00:58:44] Matthew: They've got to be thinking about, how do we make this sort of always-on autonomous assistant, 'cause otherwise I don't know why they bothered hiring him. Yeah, 'cause most of that stuff I just described is Claude's reaction [00:59:00] to Clawdbot a little bit.

[00:59:01] Dara: Yeah, it is. Yeah. Do you think... It just seems obvious now, but it's only just come to me.

[00:59:06] Dara: They're so fixated on trying to match each other. Do you think they're like... Because it really felt to me like, as I said earlier, okay, when you really dig into it, you can sense some of the differences. But at a high level, they are just basically trying to match each other.

[00:59:20] Dara: And you would think, and it's easy to say this from the outside, but you'd think one of them would say, "No, we're gonna just spin off a little bit and make sure we've got our own kind of unique offering." It probably is way easier to say that than to do it, but it just feels like they're playing this weird kind of childlike game of catch-up.

[00:59:42] Dara: "Oh, if you can do that, I can do that." As opposed to them... I was struck by how similar the product set is, at least at the surface level.

[00:59:52] Matthew: Yeah. I was... I didn't find that. I didn't feel that. I felt, especially on the Codex app, they were very different, that I didn't have what I [01:00:00] needed.

[01:00:00] Matthew: And maybe it's 'cause we're doing different things on a day-to-day basis that's causing that gap. But my feeling is that maybe it's ChatGPT chasing Claude Code now, and so a lot of the similar things are appearing because they've seen that Claude Code has essentially positioned Anthropic as this behemoth in business and enterprise that they wanna be.

[01:00:27] Dara: Yeah, I guess something in the level up, like you're right. I'm not saying Claude Code and Codex are that similar, but if you go in at the highest level and you go, right, there's a desktop app that you can ask questions, and you can connect it up to all your different things, Slack and Excel and all these things.

[01:00:42] Dara: And then there's a coding agent. So if you only went that deep and you didn't really dig into the differences in the two coding agents... I don't know, maybe there's no reason why they would be that different, because all the Anthropic people came from OpenAI, so maybe I shouldn't be [01:01:00] expecting them to be fundamentally different.

[01:01:02] Dara: And there's obviously ethical differences and safety differences and all the rest of it, but just the kind of product set, it's quite similar. I don't know, maybe I'm expecting something unrealistic, that one of them would completely just go off and go, "Fine, look, we're not getting involved in this tit for tat anymore.

[01:01:16] Dara: We're just gonna..." And maybe this is what Anthropic are doing, maybe.

[01:01:21] Matthew: A little bit, but I don't know if... Maybe there's a set of... They probably are copying each other, almost certainly, to a certain extent. Maybe there is just a natural cadence when they prioritize coding in the first instance to get towards recursive self-improvement.

[01:01:40] Matthew: Maybe that just naturally leads to sets of solutions and products that come out of it, 'cause that's what they're working on first, code, and then moving to the Cowork and Codex-type stuff to do white-collar work. Maybe none of them wanna stick their neck out too far.

[01:01:57] Matthew: They're not gonna... I think it would be more a [01:02:00] case of they're always gonna be copying and chasing the same things, but does one of them make a bet somewhere that is different and see what sticks, which I think Claude Code was in the first instance. They were the first movers there really, weren't they?

[01:02:13] Matthew: All the year, all many years ago. And Cowork. And Cowork yeah. Yeah. True. Yeah. Maybe Mythos will just gobble up ChatGPT and chew it up instead of those. Maybe. Maybe. Or OpenAI will come out with something. Maybe OpenAI will beat them to it. There's definitely some stuff's gonna happen.

[01:02:30] Dara: It's gonna be... No, you know what it's gonna be. It's gonna be the Microsoft models. Yeah, that's the future.

[01:02:36] Dara: That's it. So then that's gonna be our lengthy, deep, four-hour deep dive. So the next episode is gonna be the, yeah, deep exploration of the new Microsoft models.

[01:02:46] Matthew: If they release the models as Clippy, and Clippy is your...

[01:02:50] Matthew: Then I'd probably buy into it.

[01:02:51] Dara: I'm sold.

[01:02:53] Matthew: Yeah. But if not, then I'm not interested.

[01:02:55] Dara: Clippy the frontier AI model.

[01:02:57] Matthew: Yeah. Why they didn't call... They could've, [01:03:00] they should've called the models Clippy, 'cause you had Claude, Clippy. That would've worked.

[01:03:04] Dara: Yeah. I would. Honestly, I'd be sold. So you said you, yeah.

[01:03:09] Dara: Okay. Great minds...

[01:03:10] Matthew: You're about to say what I'm about to say.

[01:03:11] Dara: I've got... Go on. You do it. You're the last question guy, so go on.

[01:03:15] Matthew: Okay. So yeah, stick or twist. Would you stick where you are with Anthropic? Would you jump over to OpenAI, or is it like so much of a muchness, I'm just gonna stay where I am because I can't be arsed?

[01:03:32] Dara: I was probably more... maybe impressed isn't the right word, but I probably had a better time with OpenAI than I thought I might. I don't know why really. Just 'cause I hadn't been using it doesn't mean I should have expected it to not do the same things that I'm currently using Claude for.

[01:03:52] Dara: But I just maybe felt like it was gonna be a bit of a chore using it and trying to figure it out and thinking, "Oh, this isn't [01:04:00] Claude. I don't like this." But actually for me, I know you had a lot of issues with it. For me, it was quite smooth, and then there were objective things that were better, like the deep research, for example.

[01:04:10] Dara: But ultimately, I'm not gonna move away, because Claude is really good also at what it does. Maybe the next time I have a deep research to do, I might go to ChatGPT, because that's something that I just do that's never really linked to anything else in my kind of Claude ecosystem anyway. I'll just do that in a chat, and then I'll feed it into whatever I'm using it for.

[01:04:34] Dara: That might be the only thing that I'd hop over to ChatGPT for, just because it was so much better. But for everything else, I think I'm... so both in terms of familiarity of using them and having them set up and having the memory and the history and the context, but also, at least on the surface, the more ethical leaning of Anthropic still makes me prefer them as a company.

[01:04:59] Dara: I'm gonna [01:05:00] stick with Anthropic.

[01:05:20] Matthew: No, I, yeah, ultimately, the model was great. I didn't find it massively worse or better than like 4.8. Seemed very similar. Maybe slightly different personality and way of talking, but nothing that was blowing my mind. Yeah, I had a terrible time with the Codex app. I just couldn't get it to work, and ultimately it was missing pieces that I've come to get used to with Claude. And I still found Claude Code much better just generally for my day-to-day work. Most of my day-to-day work is coding based, so it feels like still the right model.

[01:05:35] Matthew: Similarly, what I might do is compare Gemini to ChatGPT on the deep research front, because I'm interested in whether Gemini is even better than that. But yeah, I might start handing off some deep research tasks outside of Claude, because you don't know what you're missing till you see an alternative.

[01:05:55] Matthew: And then maybe down the line there will be these sort of [01:06:00] emerging focuses that turn out to be better with different models. And I think I have seen some people that have these sort of multi-agent, multi-frontier model stacks, because this is better at this, and this is better at this, and this is better at this, and you can route things through the different models in that way.

[01:06:16] Matthew: But ultimately, yeah, I think the vibes and the responsible-ness that Anthropic claims still resonate. Claude Code is still the best one for me. So yeah, I'm gonna stay where I am for now.

[01:06:29] Dara: Yeah, until Clippy is on the market, that's where we're gonna stick with Anthropic.

[01:06:34] Matthew: Yeah. Yeah. And maybe we'll see where we're at. We'll do this again in a year's time and see where we're at.

[01:06:41] Dara: Yeah. A year seems like a long time away. Mythos will be making all our decisions for us by then. Mythos. They better be good. They better be good, bloody Mythos. Otherwise, they're gonna look like right pillocks, aren't they?

[01:06:53] Dara: It's hype, probably, yeah. We'll see. Okay, let's leave it there. Until next time. [01:07:00]

[01:07:01] Matthew: That's it for this week's episode of the Measure Pod. We hope you enjoyed it and picked up something useful along the way. If you haven't already, make sure to subscribe on whatever platform you're listening on so you don't miss future episodes.

[01:07:04] Matthew: And if you're enjoying the show, we'd really appreciate it if you left us a quick review. It really helps more people discover the pod and keeps us motivated to bring back more. So thanks for listening, and we'll catch you next time.
