On May 2, 2024, customers on the Prod1 cluster experienced elevated workflow latency, leading to delays in data appearing, updating, and being routed in the platform.
Root Cause
An internal configuration change that redistributed automation traffic across Kustomer servers placed excessive load on a core service. The service failed to scale automatically under that load and became unresponsive, and Kustomer engineers had to manually scale that service and related services to restore it.
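As an illustration only: the report does not say how the core service is deployed, but if it runs as a Kubernetes Deployment, the manual scale-up could be performed roughly as in the sketch below, using the official Kubernetes Python client. The namespace, deployment names, and replica counts are assumptions invented for this example.

    # Hedged sketch of a manual scale-up; assumes Kubernetes and uses invented names.
    from kubernetes import client, config

    def scale_deployment(name: str, namespace: str, replicas: int) -> None:
        """Manually set the replica count of a Deployment."""
        apps_v1 = client.AppsV1Api()
        current = apps_v1.read_namespaced_deployment_scale(name, namespace)
        print(f"{namespace}/{name}: {current.spec.replicas} -> {replicas} replicas")
        apps_v1.patch_namespaced_deployment_scale(
            name, namespace, body={"spec": {"replicas": replicas}}
        )

    if __name__ == "__main__":
        config.load_kube_config()  # local kubeconfig; use load_incluster_config() in-cluster
        # Mirror "manually scale that service and related services".
        scale_deployment("workflow-engine", "prod1", replicas=12)   # hypothetical core service
        scale_deployment("workflow-consumer", "prod1", replicas=8)  # hypothetical related service

The durable fix is tuning the autoscaling policy so that this kind of manual intervention is not needed, which is the subject of the Lessons/Improvements section below.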
Timeline
5/2/24 2:29 PM EDT - A configuration change was introduced into the system, shifting additional traffic onto a core service
5/2/24 2:37 PM EDT - The on-call engineer was alerted to increased latency on the core service
5/2/24 3:00 PM EDT - The root cause was identified and engineers began manually scaling the affected systems
5/2/24 3:15 PM EDT - The core service was healthy and began working through the backlog of events
5/2/24 3:32 PM EDT - The system fully caught up on the backlog of workflow events. After confirming stability, engineers began re-driving the small number of workflow events that had failed due to latency (a sketch of such a redrive follows this timeline)
5/2/24 4:00 PM EDT - All events were re-driven and system health returned to normal
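For context on the redrive step above, this is a minimal sketch of what re-driving failed workflow events can look like. The report does not name the queueing system; this example assumes an AWS SQS dead-letter queue accessed through boto3, and the queue URLs are placeholders.

    import boto3

    # Hypothetical queue URLs; the real queues are not named in this report.
    DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/workflow-events-dlq"
    MAIN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/workflow-events"

    sqs = boto3.client("sqs")

    def redrive_failed_events(batch_size: int = 10) -> int:
        """Move failed workflow events from the dead-letter queue back onto the main queue."""
        moved = 0
        while True:
            resp = sqs.receive_message(
                QueueUrl=DLQ_URL,
                MaxNumberOfMessages=batch_size,
                WaitTimeSeconds=2,
            )
            messages = resp.get("Messages", [])
            if not messages:
                break  # dead-letter queue is drained
            for msg in messages:
                # Re-submit the original event body, then remove it from the DLQ.
                sqs.send_message(QueueUrl=MAIN_QUEUE_URL, MessageBody=msg["Body"])
                sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
                moved += 1
        return moved

    if __name__ == "__main__":
        count = redrive_failed_events()
        print(f"Re-drove {count} workflow events from the dead-letter queue")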
Lessons/Improvements