On May 2, 2024, customers on the Prod1 cluster experienced elevated workflow latency, leading to delays in data appearing, updating, and being routed in the platform.
Root Cause
An internal configuration change that redistributed automation traffic across Kustomer servers placed excessive load on a core service. The service failed to scale automatically under that load and became unresponsive, and Kustomer engineers had to manually scale that service and related services to restore it.
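As an illustration only: the report does not say how the core service is deployed, but if it runs as a Kubernetes Deployment, the manual scale-up could be performed roughly as in the sketch below, using the official Kubernetes Python client. The namespace, deployment names, and replica counts are assumptions invented for this example.

    # Hedged sketch of a manual scale-up; assumes Kubernetes and uses invented names.
    from kubernetes import client, config

    def scale_deployment(name: str, namespace: str, replicas: int) -> None:
        """Manually set the replica count of a Deployment."""
        apps_v1 = client.AppsV1Api()
        current = apps_v1.read_namespaced_deployment_scale(name, namespace)
        print(f"{namespace}/{name}: {current.spec.replicas} -> {replicas} replicas")
        apps_v1.patch_namespaced_deployment_scale(
            name, namespace, body={"spec": {"replicas": replicas}}
        )

    if __name__ == "__main__":
        config.load_kube_config()  # local kubeconfig; use load_incluster_config() in-cluster
        # Mirror "manually scale that service and related services".
        scale_deployment("workflow-engine", "prod1", replicas=12)   # hypothetical core service
        scale_deployment("workflow-consumer", "prod1", replicas=8)  # hypothetical related service

The durable fix is tuning the autoscaling policy so that this kind of manual intervention is not needed, which is the subject of the Lessons/Improvements section below.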
Timeline
5/2/24 2:29 PM EDT - A configuration change was introduced into the system, shifting additional traffic onto a core service
5/2/24 2:37 PM EDT - The on-call engineer was alerted to increased latency on the core service
5/2/24 3:00 PM EDT - The root cause was identified and engineers began manually scaling the affected systems
5/2/24 3:15 PM EDT - The core service was healthy and began working through the backlog of events
5/2/24 3:32 PM EDT - The system fully caught up on the backlog of workflow events. After confirming stability, engineers began re-driving the small number of workflow events that had failed due to latency (a sketch of such a redrive follows this timeline)
5/2/24 4:00 PM EDT - All events were re-driven and system health returned to normal
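For context on the redrive step above, this is a minimal sketch of what re-driving failed workflow events can look like. The report does not name the queueing system; this example assumes an AWS SQS dead-letter queue accessed through boto3, and the queue URLs are placeholders.

    import boto3

    # Hypothetical queue URLs; the real queues are not named in this report.
    DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/workflow-events-dlq"
    MAIN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/workflow-events"

    sqs = boto3.client("sqs")

    def redrive_failed_events(batch_size: int = 10) -> int:
        """Move failed workflow events from the dead-letter queue back onto the main queue."""
        moved = 0
        while True:
            resp = sqs.receive_message(
                QueueUrl=DLQ_URL,
                MaxNumberOfMessages=batch_size,
                WaitTimeSeconds=2,
            )
            messages = resp.get("Messages", [])
            if not messages:
                break  # dead-letter queue is drained
            for msg in messages:
                # Re-submit the original event body, then remove it from the DLQ.
                sqs.send_message(QueueUrl=MAIN_QUEUE_URL, MessageBody=msg["Body"])
                sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
                moved += 1
        return moved

    if __name__ == "__main__":
        count = redrive_failed_events()
        print(f"Re-drove {count} workflow events from the dead-letter queue")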
Lessons/Improvements