Monitoring the “Monitoring System”!
Ever heard of “Letters of last resort”? These are four identically worded handwritten letters from the Prime Minister of the United Kingdom to the commanding officers of the four British ballistic missile submarines.
They contain orders on what action to take in the event that an enemy nuclear strike has destroyed the British Government & has killed or otherwise incapacitated both the prime minister & their designated survivor.
This is, in effect, a “dead man’s switch”: a switch designed to be activated or deactivated automatically when whoever is supposed to be in control can no longer act!
But why the heck am I talking about it in the context of a monitoring system? 🤔
👉 Your monitoring system is responsible for monitoring the health of your stack, so it’s of the utmost importance that it is itself monitored without fail!
In the recent past (the recent AWS outage, for example), we’ve seen plenty of instances where the monitoring system itself fell victim to an outage, resulting in a total blackout!
A simple dead man’s switch could rescue this situation to some extent. While there are different ways of implementing a dead man’s switch, its simplest form, which many orgs rely on, is a periodic heartbeat: as long as your monitoring system is alive & well, it sends a heartbeat to something outside itself. If that receiver does not get a heartbeat for some time, you can assume that the monitoring system is dead (or at least unable to reach you)!
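To make the heartbeat idea concrete, here is a minimal sketch of the receiving side in Python. It is only an illustration, not how any particular product does it: the class name DeadMansSwitch, the timeout_seconds parameter and the injectable clock are all made up for this example.

```python
import time
from typing import Callable


class DeadMansSwitch:
    """Hypothetical receiver: trips when heartbeats stop arriving."""

    def __init__(
        self,
        timeout_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        # Assumption: start-up counts as the first heartbeat.
        self._last_heartbeat = clock()

    def heartbeat(self) -> None:
        """Record that the monitoring system just checked in."""
        self._last_heartbeat = self._clock()

    def is_tripped(self) -> bool:
        """True once the silence has lasted longer than the timeout."""
        return self._clock() - self._last_heartbeat > self.timeout_seconds
```

The monitoring system would call heartbeat() on a schedule, while something independent of it (ideally running on separate infrastructure) polls is_tripped() and pages a human when it turns true. The clock is injectable only so that the behaviour can be checked without actually waiting.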
How could one implement it with Prometheus?
In the context of Prometheus, the dead man’s switch is a simple Prometheus alerting rule that always fires. The Alertmanager continuously sends notifications for the dead man’s switch to a notification provider that supports this functionality, such as PagerDuty. If those notifications stop arriving, the provider raises the alarm. This also verifies that communication between the Alertmanager and the notification provider is working.
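An always-firing rule could look roughly like the fragment below. Treat it as a sketch: the group name, alert name, label and annotation text are placeholders chosen for this example, and the only essential part is the expression vector(1), which always returns a result and therefore keeps the alert firing.

```yaml
# Prometheus rule file (sketch; all names below are placeholders)
groups:
  - name: meta-monitoring
    rules:
      - alert: DeadMansSwitch
        # vector(1) always returns a sample, so this alert never resolves.
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always-firing alert. If it stops arriving, the alerting pipeline is broken."
```

On the Alertmanager side you would then route this one alert to a dedicated receiver and keep its repeat_interval short, so that the provider hears from it often enough to notice a gap quickly.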