Dashboards alone are not enough: developers cannot be expected to watch them 24/7. Real-time alerting is needed, which means the ability to set thresholds on metrics and to identify critical logs and events. Setting up an alert also means choosing the communication mechanism for it. This mechanism can range from a simple email to a dedicated incident-management service such as PagerDuty.
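As a minimal sketch of the idea, an alert can be modeled as a metric name, a threshold, and a pluggable notification callback. The `AlertRule` class, the `p99_latency_ms` metric, and the 500 ms threshold are hypothetical names and values chosen for illustration; in practice the callback might send an email or call a paging service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlertRule:
    """Hypothetical alert rule: fires when a metric exceeds its threshold."""
    metric: str
    threshold: float
    notify: Callable[[str], None]  # communication mechanism (email, pager, ...)

    def evaluate(self, value: float) -> bool:
        """Check one observed value; notify and return True on a breach."""
        if value > self.threshold:
            self.notify(
                f"{self.metric}={value} exceeded threshold {self.threshold}"
            )
            return True
        return False


# Stand-in for a real channel: collect messages in a list.
sent: List[str] = []
rule = AlertRule(metric="p99_latency_ms", threshold=500.0, notify=sent.append)

rule.evaluate(320.0)  # below the threshold, nothing is sent
rule.evaluate(740.0)  # above the threshold, one notification is sent
```

Keeping the notifier as a plain callable means the same rule can be wired to different channels without changing the threshold logic.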
Breaching a threshold could lead to an outage, cause a spike in latency, or otherwise affect customer experience, so a notification needs to go out to the relevant teams so that they can set the situation right. Importantly, thresholds should be set so that the notification goes out before a catastrophic situation occurs, leaving the team sufficient time to debug and correct the problem.
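One way to leave that lead time, sketched here under assumptions, is to pair a lower warning threshold with the critical one and to estimate how long the team has before the critical limit is reached. The function names, the disk-usage figures, and the linear-growth assumption are all illustrative, not a prescribed method.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def classify(value: float, warning: float, critical: float) -> Severity:
    """Map a metric value to a severity using two thresholds."""
    if warning >= critical:
        raise ValueError("warning threshold must be below critical threshold")
    if value >= critical:
        return Severity.CRITICAL
    if value >= warning:
        return Severity.WARNING
    return Severity.OK


def minutes_until(value: float, rate_per_minute: float,
                  limit: float) -> Optional[float]:
    """Estimate minutes until `limit` is reached, assuming linear growth.

    Returns None when the metric is flat or shrinking.
    """
    if value >= limit:
        return 0.0
    if rate_per_minute <= 0:
        return None
    return (limit - value) / rate_per_minute


# Hypothetical example: disk usage in percent, warn at 80, critical at 95.
status = classify(85.0, warning=80.0, critical=95.0)   # Severity.WARNING
lead_time = minutes_until(70.0, rate_per_minute=0.5, limit=95.0)  # 50.0
```

The gap between the warning and critical thresholds is what buys the debugging time, so it is worth sizing it against how quickly the metric typically moves.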