Full Post Mortem
When a message or a phone number is marked as spam, Luma Health's system automatically decommissions the phone number and purchases a new one to replace it. During normal operations, with smaller broadcasts or regular patient communication, we typically see one or two numbers marked as spam over the course of a day, and we are able to backfill them without issue.
Given the size and geographic diversity of this specific broadcast message, nearly all of the many thousands of numbers that Luma Health uses were marked as spam. As a result, we were unable to replenish the delivery number pool, and by ~1:10pm PT Luma Health had automatically depleted all numbers used to deliver patient messaging.
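The decommission-and-replace behavior described above can be sketched roughly as follows. This is an illustration only, not Luma Health's actual code: the `NumberPool` class, the `purchases_left` limit, and the `new-N` naming are all hypothetical stand-ins. It shows how a pool that backfills comfortably when one or two numbers are flagged can run dry when flags arrive faster than replacements can be obtained.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class NumberPool:
    """Hypothetical pool of sender phone numbers (illustrative only)."""

    active: Set[str] = field(default_factory=set)
    purchases_left: int = 0  # assumed limit on available replacement numbers
    purchased: int = 0       # counter used only to name replacements

    def _purchase(self) -> Optional[str]:
        # Stand-in for buying a new number from a telephony provider.
        if self.purchases_left <= 0:
            return None
        self.purchases_left -= 1
        self.purchased += 1
        return f"new-{self.purchased}"

    def mark_spam(self, number: str) -> None:
        """Decommission a flagged number, then try to backfill it."""
        if number not in self.active:
            return
        self.active.remove(number)
        replacement = self._purchase()
        if replacement is not None:
            self.active.add(replacement)

    @property
    def depleted(self) -> bool:
        # True once no sender numbers remain to deliver messages.
        return not self.active
```

With `purchases_left` larger than the number of flags in a day, the pool size stays constant. Once flags exceed it, every further flag shrinks the pool until `depleted` becomes true.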
Our DevOps and QA team identified the issue at 3:20pm PT, then re-enabled and reprovisioned all of the phone numbers required to continue normal operations.
When the demand for messaging outstrips our delivery capacity, certain messages are delayed and held in a queue. This allows us to smooth out the balance between the supply of delivery capacity and the demand of messaging needs. During the intervening two hours (~1:10pm PT to 3:20pm PT), all patient-facing messaging that was intended to be delivered was held back, and supply and demand were out of balance in a far more extreme manner than the delay and queue system was designed to handle.
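To make the supply-and-demand imbalance concrete, here is a minimal simulation, assuming a simple per-tick model that is not taken from Luma Health's system. The function name `simulate_backlog`, the tick granularity, and the `max_queue` storage limit are all hypothetical. Messages arrive each tick, up to `capacity` of them are delivered, and the rest wait in the queue.

```python
from typing import List


def simulate_backlog(arrivals: List[int], capacity: List[int], max_queue: int) -> List[int]:
    """Return the queue depth after each tick.

    Raises OverflowError if the backlog exceeds the assumed storage limit.
    """
    depth = 0
    history: List[int] = []
    for arrived, can_send in zip(arrivals, capacity):
        depth += arrived                # new messages join the queue
        depth -= min(depth, can_send)   # deliver as many as capacity allows
        if depth > max_queue:
            raise OverflowError(f"queue storage exhausted at depth {depth}")
        history.append(depth)
    return history
```

When capacity roughly tracks arrivals, the depth stays near zero. When capacity drops to zero for many consecutive ticks, the depth grows by the full arrival rate each tick until the storage limit is hit.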
Due to this intense back pressure on the delay and queue system, that system also began to experience issues. By 4pm PT it had exhausted its storage capacity, and new capacity was brought online by ~4:45pm PT. However, because of the manner in which we ran out of capacity, Luma Health attempted redelivery of messages multiple times (“the retries started retrying”), which resulted in us texting, emailing, or calling patients potentially multiple times between 4pm PT and 8:15pm PT.
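The "retries started retrying" effect can be illustrated with a short sketch. It is a generic model, not a description of Luma Health's code or of the fix that was deployed: `worst_case_attempts` and `DedupSender` are hypothetical names. The first function shows how independent retry layers multiply delivery attempts. The second shows one common guard, which remembers message IDs so a retried message is sent at most once.

```python
from typing import Callable, List, Set


def worst_case_attempts(retries_per_layer: List[int]) -> int:
    """Worst-case sends when every layer retries independently.

    Each layer makes 1 original attempt plus its retries, and the
    layers multiply rather than add.
    """
    attempts = 1
    for retries in retries_per_layer:
        attempts *= retries + 1
    return attempts


class DedupSender:
    """Wraps a send function so each message ID is delivered at most once."""

    def __init__(self, send: Callable[[str, str], None]) -> None:
        self._send = send
        self._delivered: Set[str] = set()  # in-memory only, for illustration

    def deliver(self, message_id: str, body: str) -> bool:
        # Skip messages already sent, even if a retry layer asks again.
        if message_id in self._delivered:
            return False
        self._send(message_id, body)
        self._delivered.add(message_id)
        return True
```

For example, two layers that each retry three times can produce up to 16 sends of the same message, while the dedup wrapper passes only the first one through. A real system would need to keep the delivered set in durable storage, which this sketch does not attempt.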
We fully shut down the delay and queue system at 8:15pm PT and left it disabled until a code fix could be put in place at 7:30am PT the following morning. The code change is discussed below under remediation.
We have deployed into production two key sets of changes:
These two sets of changes address the root cause of the incident (a large customer message marked as spam by cell phone carriers) and the ensuing outage (lack of system capacity).
We are also making the following changes to our standard operating policy, to be applied during the COVID-19 pandemic:
We sincerely apologize for the issues this may have caused you or your patients during this trying time.