Cloud Network Routing Failure

Incident Report for Luma Health

Resolved

At 3:40 AM Pacific Time, our devops team started the process to add more capacity to our production clusters by increasing the size of the underlying AWS instance type. The underlying scale up operation occurred without issue and all production workflows were deployed to the new, larger, compute instances that power Luma Health services.

At 3:47 AM Pacific Time, we began to see alerts that Luma Health web services were inaccessible from outside of our private networks. All patient facing communication, EHR integrations, data warehouses, etc all continued to operate without issue. next.lumahealth.io continued to be in accessible.

At 7:05 AM Pacific Time, we identified the root cause of the issue, which was the load balancer has registered all backend targets as "unhealthy", even though they were processing customer workloads. A backend target gets registered as unhealthy when a health check fails for three consecutive checks over a 30 second interval. However, the actual backend target nodes were healthy and were continuing the process customer workflows; therefore, our root understanding at this point is still somewhat unclear as the targets were healthy in all aspects and network and firewall security

At 8:35 AM Pacific Time, we removed the in service load balancer and stood up a new load balancer to replace the one that was incorrectly reading statuses. The cluster rebuilt the new load balancer and started registering the backend targets as healthy.

By 8:59 AM Pacific Time the new load balancer was fully in service and all web facing services were restored.

Posted Aug 23, 2020 - 09:01 PDT