25 Production Service Buses Broken
We need to define what "working" means for a service bus, in terms of measurable metrics, so that reliability engineering can maintain service buses against those metrics.
When a service bus stops working, reliability engineering should be able to upgrade it in order to meet a component-specific SLA.
We need product management to define what "working" means for a service bus so that reliability engineering can respond appropriately when a service bus is "broken", rather than going back to PM for an exception every time one breaks.
We also need monitoring in place so that we are alerted when a service bus meets the definition of "broken". This lets us address broken infrastructure proactively and reduces ticket count.
The attached report shows all 25 broken SBs. In this case, an SB is "broken" if its Count of Active Messages stays above 1000 for more than 30 consecutive minutes.
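
To make the definition above concrete, here is a minimal sketch of how the "broken" check could be evaluated against metric samples. It assumes the metric is available as (timestamp, active message count) pairs sampled at regular intervals. The names is_broken, ACTIVE_MESSAGE_THRESHOLD and MIN_DURATION are hypothetical and are not taken from the report or from any monitoring tool.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# Thresholds taken from the definition of "broken" in this ticket.
ACTIVE_MESSAGE_THRESHOLD = 1000
MIN_DURATION = timedelta(minutes=30)


def is_broken(samples: Iterable[Tuple[datetime, int]]) -> bool:
    """Return True if the active message count stays above the threshold
    for more than 30 consecutive minutes.

    Assumption: `samples` are (timestamp, count) pairs for one service bus,
    and the count is treated as unchanged between consecutive samples.
    """
    streak_start: Optional[datetime] = None
    for timestamp, count in sorted(samples):
        if count > ACTIVE_MESSAGE_THRESHOLD:
            if streak_start is None:
                # First sample of a new over-threshold streak.
                streak_start = timestamp
            elif timestamp - streak_start > MIN_DURATION:
                return True
        else:
            # Any sample at or below the threshold resets the streak.
            streak_start = None
    return False
```

The same check could back the alerting rule, so that the report and the alerts agree on which service buses count as "broken".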