Scaling a Notifications Ecosystem
29 July 2024
What is required to create the infrastructure for delivering millions of notifications each day? If you only care about sending something out and nothing else, it's just a matter of calling whichever delivery vendors you're using, like Mailgun, Twilio, OneSignal, etc. If you want additional features, such as tracking delivery statuses, retrying failed deliveries, batch sending, storing notifications in your DB for advanced querying, priority levels, and rate-limiting, it may involve a bit more work. That, or you get off the free plan of those vendors and let them provide those features for you.
We decided to do it ourselves because it's more fun, and not because our company ran out of cash. One design choice that had to be made was how we were going to deliver all these notifications asynchronously. We had already built the notifications service to handle both formatting the content of each notification and figuring out the intended recipients. If we batch send all of these to our delivery vendors, we lose the ability to store the status of individual notifications. The alternative means calling the vendor for each notification we need to send, which isn't a problem for the vendor, but might be quite the load for our servers. And since each request to create a group of notifications is handled by a single pod, there's no way to balance the load of sending a given batch.
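To make the tradeoff concrete, here's a rough sketch of the two options. The vendor client, its methods, and the notification shape are hypothetical stand-ins for illustration, not our actual code or any real vendor SDK.

    class FakeVendor:
        # Stand-in for a delivery vendor SDK; not a real API.

        def send_bulk(self, payloads):
            # The vendor accepts the whole batch but reports back a single result.
            return {"accepted": len(payloads)}

        def send(self, payload):
            # One call per notification returns an id/status we could persist.
            return {"id": "msg-123", "status": "queued"}


    def deliver_as_batch(vendor, notifications):
        # Cheap for our servers, but leaves us no per-notification status to store.
        return vendor.send_bulk([n["payload"] for n in notifications])


    def deliver_individually(vendor, notifications, status_store):
        # We keep a status for every notification, but the single pod that
        # received the batch ends up doing all of the sending work itself.
        for n in notifications:
            result = vendor.send(n["payload"])
            status_store[n["id"]] = result["status"]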
Our initial implementation integrated with a webhooks microservice. We chose this route because webhooks/events were already a tried and tested pattern on our platform. We can store the individual notification and then publish its creation event to the webhooks service. This service then delivers the webhook for the notifications service to re-receive, this time across multiple server instances, balancing the load. This load is significant because much of the logic for formatting the notifications runs right as they are sent. Problem solved, right? The webhooks service provided visibility, scaled independently of the services using it, and even had a built-in retry mechanism, so it seemed like the obvious choice for creating these notifications.
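In rough outline, the flow looked like the sketch below. The endpoint path, event name, and payload are assumptions made for illustration; only the general pattern (store the notification, publish a creation event, re-receive it on any instance) reflects what we did.

    import requests
    from flask import Flask, request

    app = Flask(__name__)

    WEBHOOKS_URL = "https://webhooks.internal/events"  # hypothetical internal endpoint
    NOTIFICATIONS = {}  # stand-in for a shared notification table in the DB


    def create_notifications(batch):
        # Store each notification, then hand delivery off to the webhooks service,
        # which records the event, retries on failure, and POSTs it back to us.
        for notification_id, payload in enumerate(batch):
            NOTIFICATIONS[notification_id] = payload
            requests.post(WEBHOOKS_URL, json={
                "event": "notification.created",
                "notification_id": notification_id,
            })


    @app.route("/hooks/notification-created", methods=["POST"])
    def handle_notification_created():
        # The callback lands on whichever instance the load balancer picks, so
        # the expensive formatting and the vendor call are spread across pods.
        event = request.get_json()
        payload = NOTIFICATIONS[event["notification_id"]]
        # format the content and call the delivery vendor here (omitted)
        return "", 204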
However, even though the notifications service scaled separately from the webhooks service, the number of notifications that needed to be delivered meant that we were going to have to scale webhooks solely to support notifications. This didn't seem feasible as a long-term solution: notification events already made up well over 60% of all webhooks and were slowing down the delivery of other webhooks on our platform.
The main hurdle with notifications was handling large spikes: the volume can vary widely depending on the targeted recipient group. One batch of notifications may include 5k users while another may include up to 50k. If we fail to scale, there could be a significant delay between the first and last users in a given batch receiving their notification.
On most platforms this may not be a huge issue, but because ours involves time-sensitive actions like bidding, it can introduce unfairness depending on how users are retrieved from our database. And if the webhooks service is overloaded enough that new batches of notifications are delayed, critical notifications like 2-factor SMS or password reset emails could end up delayed in favor of less important ones.
The solution we came up with was to decouple notifications from the webhooks service and build a queue within the notifications service to manage deliveries. When in doubt, reach for fundamental data structures and patterns: they may turn out to be more effective than an over-engineered solution.
With a Redis queue holding pending notifications and workers consuming them, our scaling concerns were mostly resolved, and the webhooks cost savings easily covered the additional compute the new worker instances needed. The new design also made it easy to implement priorities for notifications: by adding a separate queue for each priority level, we can ensure that critical notifications are never blocked by more frequent, less important ones.
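A minimal sketch of that queue is below, assuming redis-py, one Redis list per priority level, and JSON-encoded payloads; the key names and payload shape are illustrative rather than our actual schema.

    import json
    import redis

    r = redis.Redis()

    # One list per priority level, highest priority first. Names are illustrative.
    QUEUES = ["notifications:critical", "notifications:default", "notifications:bulk"]


    def enqueue(notification, priority="default"):
        # Producers push pending notifications onto the list for their priority.
        r.lpush(f"notifications:{priority}", json.dumps(notification))


    def worker():
        # BRPOP checks the keys in the order given and pops from the first
        # non-empty list, so whenever there is a backlog the critical queue is
        # drained before the default and bulk queues are even considered.
        while True:
            _queue, raw = r.brpop(QUEUES)
            notification = json.loads(raw)
            # format the content and call the delivery vendor here (omitted)

Because Redis list pops are atomic, handling a spike is then just a matter of running more worker processes; they can consume from the same lists without stepping on each other.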