Message Delivery Failures
Incident Report for Luma Health
Postmortem

Full Post Mortem

When a message or a phone number is marked as spam, Luma Health's system decommissions that phone number and purchases a new one to replace it. During normal operations, with smaller broadcasts or regular patient communication, we typically see one or two numbers marked as spam over the course of a day, and we are able to backfill them without issue.

For this specific broadcast, given its size and the geographic diversity of the delivery, nearly all of the many thousands of numbers that Luma Health uses were marked as spam. As a result, we were unable to replenish the delivery number pool, and by ~1:10pm PT the system had automatically decommissioned every number used to deliver patient messaging.

Our devops and QA team identified the issue at 3:20pm PT and re-enabled and reprovisioned all the phone numbers that were required to continue normal operations.

When messaging demand outstrips our delivery capacity, some messages are delayed and held in a queue so that patient messaging keeps flowing smoothly. This lets us even out the supply of delivery capacity against the demand for messaging. During the roughly two hours in between (~1:10pm PT to 3:20pm PT), for all patient-facing messaging that was intended to be delivered, supply and demand were out of balance to a far greater degree than the delay and queue system was designed to handle.
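As an illustration only, and not Luma Health's production code, a delay queue of this kind can be sketched as a bounded FIFO with a maximum age per entry. The class name `DelayQueue`, its parameters, and the choice to discard expired entries are assumptions made for the example.

```python
from collections import deque
from typing import Deque, List, Tuple


class DelayQueue:
    """Hypothetical bounded FIFO that drops entries older than `max_age_seconds`."""

    def __init__(self, capacity: int, max_age_seconds: float):
        self.capacity = capacity
        self.max_age_seconds = max_age_seconds
        self._items: Deque[Tuple[float, str]] = deque()  # (enqueued_at, message_id)

    def enqueue(self, message_id: str, now: float) -> bool:
        if len(self._items) >= self.capacity:
            return False  # full: the caller has to shed load
        self._items.append((now, message_id))
        return True

    def drain(self, delivery_slots: int, now: float) -> List[str]:
        """Return up to `delivery_slots` messages that have not expired."""
        delivered: List[str] = []
        while self._items and len(delivered) < delivery_slots:
            enqueued_at, message_id = self._items.popleft()
            if now - enqueued_at <= self.max_age_seconds:
                delivered.append(message_id)
            # Expired entries are discarded instead of being delivered late.
        return delivered

    def __len__(self) -> int:
        return len(self._items)


queue = DelayQueue(capacity=3, max_age_seconds=60)
accepted = [queue.enqueue(f"msg-{i}", now=0) for i in range(4)]
stalled = queue.drain(delivery_slots=0, now=30)    # no delivery capacity available
recovered = queue.drain(delivery_slots=2, now=45)  # capacity is back
expired = queue.drain(delivery_slots=2, now=120)   # remaining entry is too old
```

With zero delivery slots nothing leaves the queue, so a sustained outage fills it to capacity. That is the supply and demand imbalance described above.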

This intense back pressure caused the delay and queue system itself to begin experiencing issues. By 4pm PT it had exhausted its storage capacity, and new capacity was brought online by ~4:45pm PT. However, because of the manner in which we ran out of capacity, Luma Health attempted redelivery of messages multiple times (“the retries started retrying”), which resulted in us texting, emailing, or calling patients potentially multiple times between 4pm PT and 8:15pm PT.
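One common guard against this kind of retry amplification is to cap the number of attempts per message and to record which messages have already been delivered. The sketch below is illustrative only. The function name `deliver_with_retry_cap`, the `send` callback, and the limit of three attempts are assumptions, not a description of Luma Health's fix.

```python
from typing import Callable, Dict, List, Set


def deliver_with_retry_cap(
    message_id: str,
    send: Callable[[str], bool],
    delivered: Set[str],
    attempts: Dict[str, int],
    max_attempts: int = 3,
) -> str:
    """Attempt one delivery; return 'duplicate', 'sent', 'retry', or 'dead'."""
    if message_id in delivered:
        return "duplicate"  # already reached the patient; never send again
    attempts[message_id] = attempts.get(message_id, 0) + 1
    if send(message_id):
        delivered.add(message_id)
        return "sent"
    if attempts[message_id] >= max_attempts:
        return "dead"  # stop retrying and surface for manual review
    return "retry"


calls: List[str] = []


def flaky_send(message_id: str) -> bool:
    """Hypothetical transport that fails twice, then succeeds."""
    calls.append(message_id)
    return len(calls) >= 3


delivered: Set[str] = set()
attempts: Dict[str, int] = {}
outcomes = [
    deliver_with_retry_cap("reminder-1", flaky_send, delivered, attempts)
    for _ in range(4)
]
```

The fourth call returns `duplicate` without invoking `flaky_send`, so a retry that is itself retried cannot reach the patient twice.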

We fully shut down the delay and queue system at 8:15pm PT and left it disabled until a code fix could be put in place at 7:30am. The code change is discussed below under Remediation.

Remediation

We have deployed into production two key sets of changes:

  1. We now rate limit how quickly phone numbers can be removed from service, so that no single customer can impact other customers.
  2. We have changed the way the delay and queue system works by doubling its capacity and also reducing the amount of time any one operation is allowed to stay in the queue.

These two changes address the root cause of the incident (a large customer message marked as spam by cell phone carriers) and the ensuing outage (lack of system capacity).
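For illustration, the first change can be sketched as a per-customer sliding-window limiter. This is a minimal sketch under stated assumptions: the class name `DecommissionLimiter`, the window length, and the limit are hypothetical, and the actual production mechanism is not described in this report.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque


class DecommissionLimiter:
    """Allow at most `max_removals` number removals per customer per window."""

    def __init__(self, max_removals: int, window_seconds: float):
        self.max_removals = max_removals
        self.window_seconds = window_seconds
        # customer_id -> timestamps of recent removals
        self._events: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def allow(self, customer_id: str, now: float) -> bool:
        events = self._events[customer_id]
        # Drop removals that have aged out of the sliding window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_removals:
            return False  # keep the number in service for now
        events.append(now)
        return True


limiter = DecommissionLimiter(max_removals=2, window_seconds=3600)
decisions = [limiter.allow("customer-a", now=t) for t in (0, 10, 20)]
other = limiter.allow("customer-b", now=20)    # a different customer is unaffected
later = limiter.allow("customer-a", now=3600)  # the oldest removal has aged out
```

Because the limit is tracked per customer, one customer's flagged broadcast cannot drain the numbers that other customers depend on.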

We are also making the following changes to our standard operating policy, to be applied during the COVID-19 pandemic:

  1. All broadcasts to more than 1,000 patients must be approved by Luma Health’s VP of Customer Experience in order to prevent accidental abuse.
  2. All broadcasts to more than 1,000 patients must be sent before noon in the customer's local time zone.
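The two policy rules above could be expressed as a simple pre-send check. This is a hedged sketch: the function name `broadcast_allowed` and its parameters are hypothetical, and it assumes the send time has already been converted to the customer's local time zone.

```python
from datetime import datetime, timedelta, timezone

LARGE_BROADCAST_THRESHOLD = 1_000  # patient count from the policy above


def broadcast_allowed(recipient_count: int, approved: bool, local_send_time: datetime) -> bool:
    """Check a broadcast against the approval and before-noon rules.

    `local_send_time` must already be in the customer's local time zone.
    """
    if recipient_count <= LARGE_BROADCAST_THRESHOLD:
        return True  # small broadcasts are not restricted by this policy
    return approved and local_send_time.hour < 12


pacific = timezone(timedelta(hours=-7))  # fixed PDT offset, for illustration only
morning = datetime(2020, 4, 20, 9, 30, tzinfo=pacific)     # hypothetical send time
afternoon = datetime(2020, 4, 20, 12, 12, tzinfo=pacific)  # hypothetical send time

approved_morning = broadcast_allowed(42_000, True, morning)
approved_afternoon = broadcast_allowed(42_000, True, afternoon)
unapproved_morning = broadcast_allowed(42_000, False, morning)
small_afternoon = broadcast_allowed(500, False, afternoon)
```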

We sincerely apologize for the issues this may have caused you or your patients during this trying time.

Posted Apr 17, 2020 - 08:44 PDT

Resolved
Summary

At 12:12pm PT, a Luma Health customer sent a broadcast message to ~42,000 patients that was flagged as spam by cell phone carriers. Luma Health messaging was degraded from 12:12pm PT until 3:20pm PT. Some messaging was delayed or sent multiple times until 8:15pm PT.
Posted Apr 16, 2020 - 14:30 PDT