Why you should consider Cloud Run for server-side GTM
Here at Measurelab we have built up a good amount of experience in the world of server-side GTM (sGTM) migrations. So much so that we have developed a framework for successful migrations (soon to be published).
The standard way to do things is to go with the out-of-the-box Google auto-provisioned App Engine instance for your analytics server. This is all very nice in theory, with a reassuringly simple Google approach to setup. But a mixture of curiosity, 500 errors and the recent resource shortage for App Engine flexible environment instances over Black Friday/Cyber Monday led us to explore the possibility of provisioning our analytics servers using Cloud Run. All of that is before we even talk about the prospect of some healthy cost savings, although, as we will find out, these are not guaranteed.
We are in the unique position of being able to compare a Cloud Run deployment on two sites with dramatically different event rates (roughly 7,000 vs 100 million events a month) and to observe what effect, if any, this has on cost and stability. So, let’s dive in.
Why Cloud Run?
I have seen it said in blogs and forums in the analytics sphere that, were Google to release sGTM today, its default would not be App Engine; rather, it would use a much newer GCP technology, Cloud Run. Since the App Engine resource issue over the Black Friday weekend, Google has also come out and specified Cloud Run as a solution for sGTM server provision. So, outside of ‘Google says so’, what other advantages does it offer over its older sibling App Engine?
First of all, Cloud Run is an on-demand container service, meaning that you only pay for what you use, billed to the nearest 100 milliseconds. Its ability to quickly scale the number of instances up and down based on demand means that, in theory, your sGTM instances could scale down to 0 when not receiving requests and therefore cost nothing.
The eagle-eyed of you will spot the caveat here – what if my site never receives 0 requests? And you are right, the cost benefits are reduced or eliminated in such instances. However, cost is not the only string to Cloud Run’s bow.
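To make the pay-for-what-you-use idea concrete, here is a minimal Python sketch of how active time might translate into billable time. It assumes usage is rounded up to the next 100 ms increment, and `price_per_second` is a hypothetical placeholder rather than a real Cloud Run rate, so treat it as an illustration only.

```python
import math

BILLING_INCREMENT_MS = 100  # billing granularity mentioned in this post


def billable_ms(active_ms: float) -> int:
    """Return active time rounded up to the next billing increment.

    Rounding *up* is an assumption of this sketch.
    """
    if active_ms <= 0:
        return 0  # scaled to zero: nothing to bill
    return math.ceil(active_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


def usage_cost(active_ms: float, price_per_second: float) -> float:
    """Cost of the billable time at a hypothetical per-second price."""
    return billable_ms(active_ms) / 1000 * price_per_second
```

Calling `billable_ms(0)` models the scale-to-zero case: no active time, no charge. A site that never stops receiving requests never hits that branch, which is the caveat described above.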
We have observed greater stability on higher traffic sites which inevitably leads to a greatly reduced risk of data loss. Anecdotally, we noticed a phenomenon of ‘instance catch-up’ on an App Engine deployment, whereby large spikes of requests crashed an instance. As this then took the instance pool to below its minimum, it would find itself caught in a loop of catch-up and crashing, never managing to scale in the way it needed to. This is not something we have observed on Cloud Run, even with greater levels of requests.
Cloud Run can also be configured with a load balancer for a multi-region approach. This is not only advantageous to websites with a global audience, but could also help avoid issues like the ones seen over the Black Friday weekend by allowing for redundancy. For example, if we had a load balancer pointing to 3 instances in 3 regions and one region lacked availability, the load balancer would simply send the data to one of the other available instances. A load balancer comes with its own costs and setup, so we won’t go into detail here, but keep an eye out for an upcoming blog on this exact subject from us soon.
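The redundancy idea can be pictured with a toy Python model. This is not how Google's load balancer is implemented (a real one also weighs proximity and capacity); the region names and the `pick_region` helper are hypothetical and only illustrate falling back to another region when one is unavailable.

```python
from typing import Dict, List, Optional

# Hypothetical regions, listed in order of preference for this example.
REGIONS: List[str] = ["europe-west2", "us-central1", "asia-northeast1"]


def pick_region(preferred_order: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first region marked healthy, or None if every region is down."""
    for region in preferred_order:
        if healthy.get(region, False):
            return region
    return None
```

With `europe-west2` marked unhealthy, `pick_region` returns `us-central1`, so requests keep flowing rather than being dropped.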
It is probably also worth mentioning that if you are (understandably) jumpy about scaling to 0 with Cloud Run, you don’t have to. You can set minimum instances just as you can with App Engine. These instances are kept ‘warm’ and ready to serve requests, reducing latency by minimising cold starts.
So I have laid out why you might go with Cloud Run, but what are some potential areas where it could fall short? Let’s start with latency.
Conventional wisdom would dictate that App Engine would have better latency, partly because of its always-allocated CPU resource, and partly because it compresses files automatically. But conventional wisdom be damned: from our deployments of this solution, we see no such issue. Observed latency times seem to have parity with App Engine, or in some cases are even better. Below you can see Cloud Run latency over 5 days for a site that received around 97 million events in one month:
- 50th percentile: 0.03s
- 95th percentile: 0.46s
- 99th percentile: 0.85s
And from the other end of the spectrum, latencies for a site that received around 7,000 events in one month:
- 50th percentile: 0.04s
- 95th percentile: 0.12s
- 99th percentile: 0.13s
Simo Ahava saw similar behaviour when deploying Cloud Run on his own site.
A good deal of finding my way around this solution is thanks to that Simo blog linked above, along with another by Mark Edmondson.
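If you want to produce percentile figures like the ones above from your own request logs, a nearest-rank calculation is one simple way to do it. The sketch below is an assumption about method: the Cloud Console may compute its percentiles differently, and the `percentile` helper is ours, not part of any Google library.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Feeding it a list of request latencies in seconds and asking for 50, 95 and 99 gives the same three summary numbers reported for each site.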
As already mentioned, there is a theoretical cost saving to be had if you are using Cloud Run. Still, this all very much depends on the number of requests and your configuration. If your site will regularly be receiving no traffic, Cloud Run’s ability to scale down to nothing is a big bonus. However, as we have already established, there may be more benefits to the Cloud Run approach than just cold hard cash.
If you want to spend to unlock the other benefits Cloud Run affords, there are ways to reduce costs:
- If you know that you wish to have a minimum number of reserved instances for at least a year, Google can provide committed use discounts: https://cloud.google.com/run/cud
- In our experience, there does not seem to be a need for always-on CPU resources when provisioning an analytics server. Enabling them will increase costs considerably. More information on CPU allocation can be found here: https://cloud.google.com/run/docs/configuring/cpu-allocation#console_1
- We would recommend not resting on your laurels when it comes to setup. Keep monitoring things and updating scaling to find the right balance of maximum/minimum instances to ensure all requests are dealt with in the most cost-effective manner.
- Remember that Cloud Run does not immediately shut down instances once they have handled all requests. To minimise the impact of cold starts, Cloud Run may keep some instances idle for a maximum of 15 minutes. If CPU is always allocated, you will be billed for that idle time until the instances are shut down entirely.
As highlighted earlier, we are in the privileged position of being able to compare deployments on the extreme ends of the spectrum. Let’s start with a lower-traffic site that receives around 7,000 requests a month.
- Min 2 instances
- Max 6 instances
- CPU is only allocated during request processing
- CPU limit 1
- Memory limit 512 MiB
This incurs a cost of around £1.03 per day or about £30 a month. We have seen no issue in terms of analytics collection and all instances are scaling as we would expect.
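For readers who prefer scripting their setup, the configuration above could be expressed as arguments to `gcloud run deploy`. The Python sketch below only builds the argument list and does not run anything. The service name and image path are hypothetical placeholders, and the flag names should be checked against the current gcloud reference before use.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CloudRunConfig:
    min_instances: int
    max_instances: int
    cpu: int
    memory: str                 # Kubernetes-style quantity, e.g. "512Mi"
    cpu_always_allocated: bool  # False = CPU only during request processing


def deploy_args(service: str, image: str, cfg: CloudRunConfig) -> List[str]:
    """Build (but do not execute) a gcloud command for the given config."""
    cpu_flag = "--no-cpu-throttling" if cfg.cpu_always_allocated else "--cpu-throttling"
    return [
        "gcloud", "run", "deploy", service,
        f"--image={image}",
        f"--min-instances={cfg.min_instances}",
        f"--max-instances={cfg.max_instances}",
        f"--cpu={cfg.cpu}",
        f"--memory={cfg.memory}",
        cpu_flag,
    ]


# The lower-traffic configuration described above.
LOW_TRAFFIC = CloudRunConfig(
    min_instances=2, max_instances=6, cpu=1, memory="512Mi",
    cpu_always_allocated=False,
)
```

Something like `deploy_args("sgtm-server", "gcr.io/example-project/sgtm-image", LOW_TRAFFIC)` would produce the full list, which could then be passed to `subprocess.run` once you are happy with it.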
Now let’s look at a site that sends close to 200 million requests per month to Cloud Run.
- Min 1 instance
- Max 100 instances
- CPU is only allocated during request processing
- CPU limit 1
- Memory limit 512 MiB
This incurs a cost of around £13.40 a day, which is significantly more than the £5.87 a day incurred when using App Engine, but there are some very important things to point out.
- App Engine had a max instance set-up of 8 before the configuration was moved.
- There was significant data loss seen on App Engine due to extreme traffic spikes causing instance crashing, a problem not seen on the new Cloud Run set-up.
- The traffic hitting the site was significantly higher post-move to Cloud Run.
- It is relatively early in its deployment and there is still experimentation going on to determine the best/optimal number of instances and their individual configurations.
Neither of these deployments has a load balancer in place which would come with additional expense.
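One way to put the two Cloud Run deployments side by side is to normalise the daily costs quoted above to a cost per million requests. The sketch below assumes a 30-day month and uses the approximate request volumes from this post, so the outputs are rough indications rather than billing figures.

```python
DAYS_PER_MONTH = 30  # simplifying assumption


def monthly_cost(daily_cost_gbp: float) -> float:
    """Scale a daily cost to an approximate monthly cost."""
    return daily_cost_gbp * DAYS_PER_MONTH


def cost_per_million_requests(daily_cost_gbp: float, requests_per_month: int) -> float:
    """Approximate monthly cost divided by monthly requests, in millions."""
    return monthly_cost(daily_cost_gbp) / (requests_per_month / 1_000_000)


# Figures quoted in this post (request volumes are approximate).
LOW_TRAFFIC = {"daily_cost_gbp": 1.03, "requests_per_month": 7_000}
HIGH_TRAFFIC = {"daily_cost_gbp": 13.40, "requests_per_month": 200_000_000}
```

On these assumptions the high-traffic site works out at roughly £2 per million requests, while the low-traffic site, which pays for two always-warm minimum instances, comes to several thousand pounds per million. That gap is the cost of keeping minimum instances warm for very little traffic, and it is why scaling to 0 matters most at the low end.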
Head to head
To round things off let’s have a look at a comparison table for Cloud Run and App Engine.
| App Engine | Cloud Run |
| --- | --- |
| Runs on Compute Engine instances | Uses serverless containers |
| No scale-to-zero | Can be configured to run only when servicing requests |
| Automatically compresses files | No automatic compression |
| Adds some additional metadata to HTTP requests | Does not modify HTTP requests |
| Single-region, hard to load-balance | Can be configured as a multi-region network using a load balancer |
| Established technology | Relatively new technology (late 2019) |
So, should you do it?
This is cutting-edge stuff, and it may well be that the costs and stability of App Engine are perfectly acceptable to you and your site. The level of traffic hitting your site may mean that you never experience the ‘instance catch-up’ we observed. Even if that is the case, there are still plenty of benefits to be had from a multi-regional approach.
And if you are having problems with cost and know that scaling to 0 will help, or are a victim of the Black Friday App Engine calamity, Cloud Run may be a good option for you to try out. From our experience, it has certainly proven a stable and effective way of provisioning analytics servers.
Keep your eyes peeled for more upcoming content on sGTM including information on our framework for a successful migration, instructions on how to provision a load balancer on your Cloud Run instances, and just what the point of switching to sGTM is in the first place!
Until then, please do reach out if you have any questions or would like help and/or advice in making the switch to sGTM yourselves.