Google Cloud's Metrics Explorer has plenty of metrics, and for most monitoring needs, it's more than enough.
However, the sampling interval of those metrics can hide real problems. I once ran into a situation where an API server on Google Kubernetes Engine (GKE) had intermittent response time spikes, yet Metrics Explorer showed nothing abnormal. The root cause turned out to be short-lived batch jobs on the same Node eating up all the CPU, a classic Noisy Neighbor problem.
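The mechanism behind that is simple to sketch: when a chart plots the mean over an alignment period (often on the order of a minute in Metrics Explorer), a short full-CPU burst gets averaged away. The numbers below are hypothetical, purely to illustrate the effect:

```python
# Sketch: why a coarse sampling interval can hide short CPU bursts.
# Hypothetical numbers; not taken from the incident described here.

# Per-second CPU utilization: mostly idle (~5%), with a
# 5-second noisy-neighbor burst at 100%.
per_second = [0.05] * 60
for t in range(20, 25):
    per_second[t] = 1.00

peak = max(per_second)                               # what the API server felt
one_minute_mean = sum(per_second) / len(per_second)  # what a 1-minute chart shows

print(f"peak utilization:        {peak:.0%}")
print(f"1-minute mean (charted): {one_minute_mean:.0%}")
```

A 5-second burst that pegs the CPU at 100% shows up as a roughly 13% average in a 1-minute sample, which looks completely healthy on a dashboard.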
Here's how I fell into that trap.
An API server that was mysteriously slow from time to time
I had a development API server running on GKE that would occasionally slow down for no obvious reason.
A request that normally completed in around 200 ms would sometimes take about 4 seconds under the same conditions. The slowdowns were intermittent, and I could not find any clear pattern in when they occurred.
When the issue occurred, CPU usage for the two GKE Nodes looked like this in Metrics Explorer: