Make actionable alerts using Google Cloud

minherz
Google Cloud - Community
8 min read · Dec 10, 2022


Photo by Robert Linder on Unsplash

[edited on Oct’ 23]

TL;DR: If you are looking for an implementation sample, go directly to the Example section.

Automation is a major goal for everyone dealing with toil or carrying a pager. Not only does automation free your time and mind from these tasks, it usually performs them faster than you would.

Implementing automation is not trivial. Every service provider does their best to give users ways to automate work on their platform, and cloud providers are no different. One popular example of automation is (or was) a task that restarts a VM when its CPU utilization reaches 95–99% and stays at that level for longer than a couple of minutes. This automation follows one of the popular playbooks of DevOps engineers from the time when all workloads ran on VMs. From this example, it is easy to see that the automation is composed of three elements: an event, an alert, and an action.

Methods and terminology around events, alerts and actions differ among service providers. Working at Google, I would like to describe how they work in Google Cloud.

In Google Cloud

Google Cloud does not have many fancy, one-click solutions that let you click through the implementation. The bad thing about that is that you need to build stuff yourself. The good thing is that you know your stuff and have full transparency and control over it. So, first things first: what is each element from the diagram in Google Cloud?

An event is a change in any of the many monitored metrics, including your application’s custom metrics. An alert is an incident triggered by an alert policy that is configured to watch for the event. An action is virtually anything that can be triggered for execution by an HTTP(S) endpoint call or a PubSub message. In other words: anything.

Implementing on-alert action

Another goal is to minimize the maintenance of the automation. It makes little sense to implement automation that takes care of our tasks only to spend the saved time maintaining the automation itself. So, how do these elements work in Google Cloud?

You do not need to maintain events because the monitored metrics are fully managed by Cloud Monitoring. This means that for any workload that runs on Google Cloud these metrics are captured automatically, and your application ingests custom metrics as part of its design.

You do not need to maintain alerts because Cloud Monitoring does it for you through the definition of alert policies. Once defined (including notification channels), they just work.

That leaves the last and most important part: the action. As mentioned, you can execute an action by calling a REST API endpoint or by sending a message to a predefined PubSub topic. The following methods can be used to execute an action:

  • Define a Cloud Build configuration to be triggered by PubSub
  • Deploy a Cloud Function instance to be triggered by PubSub
  • Deploy a workload (on GKE, Cloud Run, App Engine, Cloud Functions or a VM) that exposes a REST API endpoint.

It is up to you, but I would recommend one of the first two methods. The main reason is that execution of the Cloud Build configuration or the Cloud Function instance happens in a fully managed environment, so you do not need to worry about the reliability or capacity of your action.

The information about the event and the alert is provided in the PubSub message or HTTP request payload in JSON format. You can reference the documentation for the JSON schema.
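If your action is triggered by PubSub, the alert payload arrives base64-encoded in the message’s data field. A minimal parsing sketch in Python (the helper name is mine; the incident field names follow the documented JSON schema):

```python
import base64
import json

def parse_alert(pubsub_message: dict) -> dict:
    """Extract incident details from a Cloud Monitoring PubSub
    notification. Field names follow the documented JSON schema."""
    # PubSub delivers the JSON payload base64-encoded in "data"
    payload = json.loads(base64.b64decode(pubsub_message["data"]))
    incident = payload["incident"]
    return {
        "id": incident["incident_id"],
        "state": incident["state"],        # "open" or "closed"
        "policy": incident["policy_name"],
        "summary": incident["summary"],
    }
```

In a Cloud Function the `pubsub_message` dict would come from the event delivered by the trigger; the parsed fields then drive the action logic.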

The first two methods can be triggered using PubSub or Webhook notification channels; the last one only by Webhook. I would recommend Cloud Build because it (1) uses a declarative language; (2) its configuration is easy to create, maintain and control; (3) being a serverless service, it is cheap to maintain; and (4) being an integral part of Google Cloud, it lets you do anything you can do using the Google Cloud CLI or API. But if you need some kind of proxy so you can trigger execution of, say, a Terraform plan in Terraform Cloud, then the second or third method is for you.


Describing event’s conditions in the alert policies

To define a meaningful alert you will need to write an MQL query. Unfortunately, MQL is not very close to SQL, but it is also not so complicated that an engineer like you cannot use it.

When you create a new alert using the Cloud Console, you find yourself in a multi-step, wizard-like interface:

New alert wizard window

It limits you to selecting a single metric to query and then lets you define the threshold for the alert. That can be enough if your event is “CPU utilization reaches 100%”, but it is not enough if you want to catch an event like “the ratio of responses returning a 200 status code to the total number of responses is less than 90%” or something similarly sophisticated. In these cases you can switch the UI to let you write MQL in the wizard by pressing the small MQL button in the top right corner of the window (see the red arrow in the previous screenshot). You will then get a slightly different look of the wizard window:

New alert wizard window to write MQL query

Now to the hard part. Almost all complex alerts use some kind of ratio between metrics. This is implemented using the ratio operation; other methods to reach the same goal are described in the MQL examples. Using MQL you should be able to define a query that describes your event, although the documentation might not cover your specific case. If you get stuck, feel free to ask a question in the dedicated forum. In the following section I show how to do it for a specific user story.
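As an illustration, a ratio-style query follows the pattern below; this is a sketch only, with placeholder metric and label names, so check the MQL examples for the exact pattern that fits your metric:

```
fetch cloud_function
| metric 'cloudfunctions.googleapis.com/function/execution_count'
| { filter metric.status != 'ok' ; ident }
| align rate(1m)
| group_by [resource.function_name], [executions: sum(val())]
| ratio
```

The brace construct produces two streams, the filtered one and the unchanged one (`ident`); the `ratio` operation then divides the first by the second.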

After you have built the query that captures your event, you can apply one of two operators that finalize the event’s description and trigger the alert:

  1. The condition operator lets you describe a threshold condition that fires the event or, in the case of Google Cloud, opens an alerting incident.
  2. The absent_for operator lets you describe an event that should be triggered in the absence of data. It is useful when you want to alert on missing data, for example, no responses from your deployed application or no queries to a database.
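Both operators are appended to the end of the query. As a rough sketch (the thresholds here are illustrative, not prescriptive):

```
| condition val() > 50 '%'
```

for a threshold-based alert, or

```
| absent_for 10m
```

to fire when the time series produces no data for 10 minutes.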

Then you can set up additional constraints for triggering the event: whether you want it to fire each time the conditions are fulfilled, or only sometimes, based on the percentage or number of time series that violate the condition. To keep it as close as possible to the usual actionable events, I recommend selecting the first option, “Any time series violates”.

Example: “Can you do it in Google Cloud?” challenge

There is nothing better than an example, right? Some time ago a friend asked me whether it is possible to automate a rollback of a Cloud Function if its performance after deployment “is not good”. My friend mentioned that Azure makes this easy (which proved to be a slight exaggeration at the time we discussed it). I took on the challenge because I was sure that Google Cloud could do it.

So, I ended up creating a simple Cloud Function that returns status 200 for requests to the “/ping” path and status 404 for anything else:
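The original code was shown as a screenshot; its logic amounts to the following sketch. The routing lives in a plain helper so it runs anywhere; in the deployed Gen 2 function it would be wrapped with the functions-framework HTTP decorator (the names here are my own):

```python
def handle(path: str) -> tuple[str, int]:
    """Return (body, HTTP status) for the requested path."""
    if path == "/ping":
        return ("pong", 200)       # healthy endpoint
    return ("not found", 404)      # everything else fails

# In the deployed function the wrapper would look roughly like:
#   import functions_framework
#   @functions_framework.http
#   def ping(request):
#       return handle(request.path)
```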

I deployed it to Cloud Functions using all default settings (i.e. it is deployed as Gen 2 and uses Eventarc).
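The deployment can be reproduced with a gcloud command along these lines; the function name, region, runtime and entry point are my placeholders:

```shell
gcloud functions deploy ping-function \
  --gen2 \
  --runtime=python311 \
  --region=us-central1 \
  --trigger-http \
  --allow-unauthenticated \
  --source=. \
  --entry-point=ping
```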

Then I used Cloud Scheduler to simulate load. The job calls the function once every minute.
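The scheduler job can be created roughly like this (the job name is mine, and FUNCTION_URL stands for the deployed function’s URL):

```shell
gcloud scheduler jobs create http ping-job \
  --schedule="* * * * *" \
  --uri="https://FUNCTION_URL/ping" \
  --http-method=GET
```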

Now it is time to define “not good performance” (as my friend put it). I decided that error responses (i.e. non-2xx HTTP statuses) are a good indicator that “something was not good”. Given the load generator’s limited frequency, the alert should be triggered when:

The error rate of the responses from my Cloud Function instance is beyond (greater than) 20% for 5 minutes

In other words, if more than 20% of the requests are not OK for more than 5 consecutive minutes, I want to roll back to the previous version.

The next step is to define a PubSub notification channel and to configure an alert policy that sends a message to the channel when the above conditions are fulfilled. The condition is defined using MQL and uses the Cloud Functions metric function/execution_count:
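The query itself was shown as a screenshot; it looks roughly like the sketch below. The function name is a placeholder and the exact alignment and grouping may differ from the original:

```
fetch cloud_function
| metric 'cloudfunctions.googleapis.com/function/execution_count'
| filter resource.function_name == 'ping-function'
| align delta(5m)
| { filter metric.status != 'ok' ; ident }
| group_by [], [executions: sum(val())]
| ratio
| every 30s
| condition val() > 0.2
```

Here 0.2 is the 20% error-rate threshold expressed as a fraction of the total execution count.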

The rate is calculated as the count of executions whose status attribute is not ‘ok’ (i.e. not 2xx) versus the total count of the metric. The window size for this condition is 5 minutes and the condition threshold is checked every 30 seconds. See the MQL documentation for more explanations.

Now it is time to define the action. After checking the options I decided to go with a Cloud Build configuration because it is a managed solution that allows easy retrieval of data from the PubSub message and provides a simple way to run the rollback logic as a shell script. The Cloud Build configuration is set up with the following information:

  • Define a Cloud Build configuration to be triggered by a PubSub message
  • Set up environment variables with values parsed from the PubSub message
  • Configure the Cloud Build service account so it can work with Cloud Functions
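A configuration set up along those lines might look roughly like the following sketch. The substitution `_FUNCTION_NAME` and the `rollback.sh` script are my placeholders; the PubSub trigger is what binds the substitution to a field of the incoming message:

```yaml
steps:
  - id: rollback
    name: gcr.io/google.com/cloudsdktool/cloud-sdk
    entrypoint: bash
    args:
      - -c
      - |
        # Run the rollback logic as a shell script, using the
        # function name parsed out of the alert message.
        ./rollback.sh "${_FUNCTION_NAME}"
substitutions:
  _FUNCTION_NAME: ping-function   # overridden by the trigger's payload binding
```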

The configuration steps utilize an undocumented feature of Gen 2 Cloud Functions: multiple consecutive versions of the function’s code are kept in the storage bucket.

First the script identifies the name of the object that stores the previous version of the Cloud Function code. Then it downloads that version to the local file system and re-deploys the Cloud Function with the previous, supposedly good version.
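Assuming the Gen 2 source bucket keeps prior archives as older object generations (the undocumented behavior mentioned above), those steps might be sketched like this; the bucket naming, object path and region are all assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

FUNCTION_NAME="$1"
REGION="us-central1"                                   # placeholder
BUCKET="gs://gcf-v2-sources-PROJECT_NUMBER-${REGION}"  # assumed naming scheme
OBJ="${BUCKET}/${FUNCTION_NAME}/function-source.zip"

# 1. Identify the previous generation of the source archive
PREV=$(gsutil ls -a "${OBJ}" | tail -n 2 | head -n 1)

# 2. Download it to the local file system
gsutil cp "${PREV}" ./function-source.zip
unzip -o function-source.zip -d ./src

# 3. Re-deploy the previous, supposedly good version
gcloud functions deploy "${FUNCTION_NAME}" --gen2 \
  --region="${REGION}" --source=./src
```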

A word of caution: the previous version of the Cloud Function is assumed to be good. In other words, when called, it is supposed to respond with enough OK responses to avoid triggering the alert policy. Otherwise, you will end up in an endless loop of swapping between two bad versions. In real life, the Cloud Build steps would need to discover a trustworthy “good” version using a source management system or other means.

Now, to see how it works, let’s deploy a new version of the Cloud Function that returns 404 each time it is called at the “/ping” path:
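As a sketch, matching the helper shown earlier, the broken version simply stops answering /ping:

```python
def handle(path: str) -> tuple[str, int]:
    """Broken version: returns 404 even for /ping."""
    return ("not found", 404)
```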

After about 6 minutes you can see that the source of the Cloud Function was reverted back to the “good” one.


DevRel Engineer at Google Cloud. The opinions posted here are my own, and not those of my company.