As the saying goes, “what gets measured, gets done”. In my case, I had a server that I showed to a potential client at 4 pm on a Friday, working flawlessly. Around lunchtime on Saturday, I ran a few site checks, because I’m weird like that, and one of the sites, the very one I had shown off on Friday, was performing very sluggishly. To my surprise (and for lack of preparation), CPU utilization was at 100%, with 99.7% of it consumed by an injected service that Google uses for VM management and reporting. So I asked myself: how do I automate my quirk of randomly checking my systems on a Saturday, when I should be at the beach, and how do I get notified when something is not operating within the expected thresholds?
The answer to all of that, on any cloud platform, is its “Monitoring”, “Alerting”, or “Reporting” dashboard. Today, I’ll walk you through setting this up in Google Cloud Platform, but I’ll include documentation for Amazon Web Services, Azure, and Hetzner at a later date.
This document can also serve as an initial setup guide when deploying new (GCP) projects, (AWS) landing zones, etc.
- Set up “Integrate with Google Cloud Services” under the Monitoring module.
- Install the “Ops Agent” on the VMs.
- (optional) Create a dashboard. This is nice if you’re not using a 3rd party data ingestion and visualization tool like Prometheus.
- (optional) Create an uptime checker. Again, only necessary if you’re not using a 3rd party tool for this sort of thing. I use Freshping to monitor ICMP requests.
- Create an alert. That’s what we are doing below:
Create Monitor and Alert
Step 1 – Create a Monitor + Alert
Log into Google Cloud Platform and, in the left-hand menu, navigate to “Monitoring” for your chosen project. Right now, this has to be done project by project, but I hope to identify an automated way of deploying this across all projects and resources, or at the least, VM instances across all projects.
- Click on “Monitoring”, and then “Alerting”.
- Now you should see “Create Policy” at the top of this screen – click it!
This is what you should see after clicking “Create Policy”:
- Select a metric. I prefer VM Instance –> Instance –> CPU Utilization.
- Click Apply and proceed to Step 2.
Step 2 – Set the Window
After clicking “Apply” for the metrics, this is what you should see:
- Make sure the rolling window matches the business metric that you want to follow. What I mean by that is, if your client expects a 99.999% SLA, a 5-minute window is not short enough. Ideally, you would expect a notification to go out within 60 seconds.
- Click Next
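To put a number on why a 5-minute window is too long for a 99.999% SLA, here is a minimal sketch of the downtime-budget arithmetic. The function name `downtime_budget_seconds` and the 365-day measurement period are assumptions made for illustration; your contract may measure availability monthly or quarterly instead.

```python
def downtime_budget_seconds(sla_percent, period_days=365):
    """Return the downtime (in seconds) an SLA tolerates over a period.

    sla_percent: promised availability, e.g. 99.999
    period_days: length of the measurement period (assumed: one year)
    """
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (100 - sla_percent) / 100


# Compare a few common availability targets over one year.
for sla in (99.9, 99.99, 99.999):
    budget = downtime_budget_seconds(sla)
    print(f"{sla}% -> {budget:.0f} s/year (about {budget / 60:.1f} min)")
```

For 99.999% over a year this works out to roughly 315 seconds, a little over five minutes. A single incident that sits unnoticed inside one 5-minute window would use up nearly the whole annual budget, which is why a 60-second window is the better fit here.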
Step 3 – Set the Threshold
After you click next, this is what you should see:
I like to keep the default configurations when possible, so I’m not changing the condition types or the alert triggers and positions. I’m just entering a Threshold value.
- Taking a look at the graph to the right, you can see what your average CPU utilization is. For me, it’s always less than 20%, but I give my instances a 20-40% buffer, so I’m setting my threshold to 60 (that is, 60%), which is also shown as the red line across the top of the chart.
- Click Next
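The threshold reasoning above is just baseline plus buffer, and can be sketched as below. The helper name `cpu_threshold` is hypothetical. One detail worth hedging: the console accepts the value as a percentage, whereas the underlying `compute.googleapis.com/instance/cpu/utilization` metric is, as far as I know, reported as a fraction between 0.0 and 1.0, so the sketch returns both forms.

```python
def cpu_threshold(baseline_percent, buffer_percent):
    """Return the alert threshold as (percent, fraction).

    baseline_percent: the usual peak CPU utilization seen on the chart
    buffer_percent:   headroom allowed before an alert should fire
    """
    threshold_percent = baseline_percent + buffer_percent
    if threshold_percent <= 0:
        raise ValueError("threshold must be positive")
    # The fraction is the form an API-based policy would most likely need.
    return threshold_percent, threshold_percent / 100


percent, fraction = cpu_threshold(baseline_percent=20, buffer_percent=40)
print(f"Console value: {percent}%  |  API value: {fraction}")
```

With a 20% baseline and the upper end of the 20-40% buffer, this gives the 60% used in this walkthrough.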
Step 4 – Set Notifications
After clicking next, you’ll be presented with several notification options. Select the notification channel(s) that you wish to use. I typically select SMS and Email because I don’t believe in work-life balance. If you have an MSP or a ticket channel, go ahead and create an email notification that fires off a ticket to the queue and lets someone else handle it!
Name the policy and click “Create Policy”.
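As for the hope, mentioned in Step 1, of rolling this out across several projects without clicking through the console each time, one possible approach is to describe the policy as JSON and hand it to `gcloud alpha monitoring policies create --policy-from-file`. The sketch below only builds the policy document and prints the commands; it does not run anything. The field names follow the Cloud Monitoring `AlertPolicy` resource as I understand it, and the project IDs, channel name, and helper names are hypothetical placeholders, so check them against the current documentation before relying on this.

```python
import json


def build_cpu_alert_policy(display_name, threshold_fraction, window_seconds, channels):
    """Return a dict shaped like a Cloud Monitoring AlertPolicy resource."""
    return {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "VM CPU utilization above threshold",
                "conditionThreshold": {
                    # Same metric as VM Instance -> Instance -> CPU Utilization.
                    "filter": (
                        'resource.type = "gce_instance" AND '
                        'metric.type = "compute.googleapis.com/instance/cpu/utilization"'
                    ),
                    "comparison": "COMPARISON_GT",
                    # The metric is a fraction, so 60% is written as 0.6.
                    "thresholdValue": threshold_fraction,
                    # Assumed to mirror the console default of no retest window.
                    "duration": "0s",
                    "aggregations": [
                        {
                            # Assumed mapping of the console's rolling window.
                            "alignmentPeriod": f"{window_seconds}s",
                            "perSeriesAligner": "ALIGN_MEAN",
                        }
                    ],
                },
            }
        ],
        "notificationChannels": list(channels),
    }


def gcloud_commands(project_ids, policy_file):
    """Build (but do not run) one gcloud command per project."""
    return [
        ["gcloud", "alpha", "monitoring", "policies", "create",
         f"--policy-from-file={policy_file}", f"--project={project_id}"]
        for project_id in project_ids
    ]


# Hypothetical placeholders: replace with real project and channel IDs.
policy = build_cpu_alert_policy(
    display_name="VM CPU above 60%",
    threshold_fraction=0.6,
    window_seconds=60,
    channels=["projects/example-project-a/notificationChannels/CHANNEL_ID"],
)
print(json.dumps(policy, indent=2))
for command in gcloud_commands(["example-project-a", "example-project-b"], "policy.json"):
    print(" ".join(command))
```

One caveat: notification channels belong to a project, so a channel created in one project cannot simply be referenced from another. A real rollout would need a matching channel in each project, or a per-project policy file.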