Meta-monitoring and cross-monitoring

In broad terms, you can't rely on your monitoring system to monitor itself: if the system suffers a serious failure, it won't be able to send a notification about it. Although it is common practice to have Prometheus scrape itself (most tutorials set this up), you can't rely on it to alert on its own failure. This is where meta-monitoring comes in: it is the process by which the monitoring system itself is monitored.

The first option you should consider to mitigate this issue is to have a set of Prometheus instances that monitor every other Prometheus instance in their datacenter/zone. Since Prometheus generates relatively few metrics of its own, this would translate to a fairly light scrape job for the ones doing the meta-monitoring; they wouldn't even need to be solely dedicated to this:

Figure 11.11: Meta-monitoring – Prometheus group monitoring every other group
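To make the idea more concrete, here is a minimal Python sketch of what the meta-monitoring instances would need: a scrape job that targets the other Prometheus servers and an alerting rule that fires when one of them stops responding. The helper names, the `prometheus-meta` job name, the `PrometheusDown` alert name, and the `5m` duration are our own assumptions for illustration. Only the field names (`job_name`, `static_configs`, `targets`, `alert`, `expr`, `for`, `labels`) and the `up` metric come from Prometheus itself.

```python
def build_meta_scrape_job(targets, job_name="prometheus-meta"):
    """Build a scrape job (as a dict) that scrapes other Prometheus servers.

    `targets` is a list of "host:port" strings. The job name is a
    hypothetical default, not something Prometheus mandates.
    """
    if not targets:
        raise ValueError("at least one target is required")
    return {
        "job_name": job_name,
        # Sort and de-duplicate so the generated config is stable.
        "static_configs": [{"targets": sorted(set(targets))}],
    }


def build_down_alert(job_name="prometheus-meta", for_duration="5m"):
    """Build an alerting rule that fires when a scraped server is down.

    Prometheus sets the `up` metric to 0 for a target whose scrape failed.
    The alert name, duration, and severity label are assumptions.
    """
    return {
        "alert": "PrometheusDown",
        "expr": f'up{{job="{job_name}"}} == 0',
        "for": for_duration,
        "labels": {"severity": "critical"},
    }
```

Calling `build_meta_scrape_job(["prom-a:9090", "prom-b:9090"])` returns a dictionary that could be serialized to YAML and placed under `scrape_configs` on a meta-monitoring instance, while the dictionary from `build_down_alert()` would go into a rule group on that same instance.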

However, you may be wondering how this set of instances would itself be monitored. We could keep adding progressively higher-level instances to do meta-monitoring in a hierarchical fashion (at the datacenter level, then at the regional level, then at the global level), but the topmost set of servers would still be left unmonitored.

A complementary technique to mitigate this shortcoming is known as cross-monitoring. This method involves having Prometheus instances on the same responsibility level monitor their peers. This way, every instance will have at least one other Prometheus watching over it and generating alerts if it fails:

Figure 11.12: Prometheus groups monitoring themselves
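One simple way to decide which peer watches which is to arrange the instances in a ring, so that each one scrapes the next one (or the next few) in order. The following Python sketch shows this under our own assumptions: the function name, the ring layout, and the `watchers_per_instance` parameter are hypothetical and are only one of several ways to assign peers.

```python
def cross_monitoring_assignments(instances, watchers_per_instance=1):
    """Map each Prometheus instance to the peers it should scrape.

    Instances are placed in a sorted ring and each one scrapes the next
    `watchers_per_instance` peers. As a result, every instance is scraped
    by exactly that many peers and never by itself.
    """
    ordered = sorted(set(instances))
    count = len(ordered)
    if count < 2:
        raise ValueError("cross-monitoring needs at least two instances")
    if not 1 <= watchers_per_instance < count:
        raise ValueError("watchers_per_instance must be between 1 and len(instances) - 1")
    return {
        instance: [
            ordered[(index + step) % count]
            for step in range(1, watchers_per_instance + 1)
        ]
        for index, instance in enumerate(ordered)
    }
```

Each list in the result can be used as the targets of a scrape job on the corresponding instance. Raising `watchers_per_instance` means an instance is still being watched even if one of its watchers fails at the same time.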

But what happens if the problem is in the Alertmanager cluster? Or if external connectivity prevents notifications from reaching the notification provider? Or even if the notification provider itself is suffering an outage? In the next section, we'll provide possible solutions to these questions.
