Scaling a Notifications Ecosystem
29 July 2024
What is required to create the infrastructure for delivering millions of notifications each day? If you only care about sending something out and nothing else, it's just a matter of calling whichever delivery vendors you're using, like Mailgun, Twilio, OneSignal, etc. If you want additional features, such as tracking delivery statuses, retrying failed deliveries, batch sending, storing notifications in your DB for advanced querying, priority levels, and rate-limiting, it may involve a bit more work. That, or you get off the free plan of those vendors and let them provide those features for you.
We decided to do it ourselves because it's more fun, and not because our company ran out of cash. One design choice that had to be made was how we were going to deliver all these notifications asynchronously. We had already built the notifications service to handle both formatting the content of each notification and figuring out the intended recipients. If we batch send all of these to our delivery vendors, we lose the ability to store the status of individual notifications. The alternative means calling the vendor for each notification we need to send, which isn't a problem for the vendor, but might be quite the load for our servers. And since each request to create a group of notifications is handled by a single pod, there's no way to balance the load of sending a given batch.
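To make the tradeoff concrete, here's a rough sketch of the two options. The vendor client, its methods, and the notification shape are hypothetical stand-ins for illustration, not our actual code or any real vendor SDK.

    class FakeVendor:
        # Stand-in for a delivery vendor SDK; not a real API.

        def send_bulk(self, payloads):
            # The vendor accepts the whole batch but reports back a single result.
            return {"accepted": len(payloads)}

        def send(self, payload):
            # One call per notification returns an id/status we could persist.
            return {"id": "msg-123", "status": "queued"}


    def deliver_as_batch(vendor, notifications):
        # Cheap for our servers, but leaves us no per-notification status to store.
        return vendor.send_bulk([n["payload"] for n in notifications])


    def deliver_individually(vendor, notifications, status_store):
        # We keep a status for every notification, but the single pod that
        # received the batch ends up doing all of the sending work itself.
        for n in notifications:
            result = vendor.send(n["payload"])
            status_store[n["id"]] = result["status"]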
Our initial implementation integrated with a webhooks microservice. We chose this route because webhooks/events were already a tried and tested pattern on our platform. We can store the individual notification and then publish its creation event to the webhooks service. This service then delivers the webhook for the notifications service to re-receive, this time across multiple server instances, balancing the load. This load is significant because much of the logic for formatting the notifications runs right as they are sent. Problem solved, right? The webhooks service provided visibility, scaled independently of the services using it, and even had a built-in retry mechanism, so it seemed like the obvious choice for creating these notifications.
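In rough outline, the flow looked like the sketch below. The endpoint path, event name, and payload are assumptions made for illustration; only the general pattern (store the notification, publish a creation event, re-receive it on any instance) reflects what we did.

    import requests
    from flask import Flask, request

    app = Flask(__name__)

    WEBHOOKS_URL = "https://webhooks.internal/events"  # hypothetical internal endpoint
    NOTIFICATIONS = {}  # stand-in for a shared notification table in the DB


    def create_notifications(batch):
        # Store each notification, then hand delivery off to the webhooks service,
        # which records the event, retries on failure, and POSTs it back to us.
        for notification_id, payload in enumerate(batch):
            NOTIFICATIONS[notification_id] = payload
            requests.post(WEBHOOKS_URL, json={
                "event": "notification.created",
                "notification_id": notification_id,
            })


    @app.route("/hooks/notification-created", methods=["POST"])
    def handle_notification_created():
        # The callback lands on whichever instance the load balancer picks, so
        # the expensive formatting and the vendor call are spread across pods.
        event = request.get_json()
        payload = NOTIFICATIONS[event["notification_id"]]
        # format the content and call the delivery vendor here (omitted)
        return "", 204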
However, even though the notifications service scaled separately from the webhooks service, the number of notifications that needed to be delivered meant that we were going to have to scale webhooks solely to support notifications. This didn't seem feasible as a long-term solution: notification events already made up well over 60% of all webhooks and were slowing down the delivery of other webhooks on our platform.
The main hurdle with notifications was handling large spikes: the volume can vary widely depending on the targeted recipient group. One batch of notifications may include 5k users while another may include up to 50k. If we fail to scale, there could be a significant delay between the first and last users in a given batch receiving their notification.
On most platforms this may not be a huge issue, but because ours involves time-sensitive actions like bidding, it can introduce unfairness depending on how users are retrieved from our database. And if the webhooks service is overloaded enough that new batches of notifications are delayed, critical notifications like 2-factor SMS or password reset emails could end up delayed in favor of less important ones.
The solution we came up with was to decouple notifications from the webhooks service and build a queue within the notifications service to manage deliveries. When in doubt, reach for fundamental data structures and patterns: they may turn out to be more effective than an over-engineered solution.
With a Redis queue holding pending notifications and workers consuming them, our scaling concerns were mostly resolved, and the webhooks cost savings easily covered the additional compute the new worker instances needed. The new design also made it easy to implement priorities for notifications: by adding a separate queue for each priority level, we can ensure that critical notifications are never blocked by more frequent, less important ones.
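A minimal sketch of that queue is below, assuming redis-py, one Redis list per priority level, and JSON-encoded payloads; the key names and payload shape are illustrative rather than our actual schema.

    import json
    import redis

    r = redis.Redis()

    # One list per priority level, highest priority first. Names are illustrative.
    QUEUES = ["notifications:critical", "notifications:default", "notifications:bulk"]


    def enqueue(notification, priority="default"):
        # Producers push pending notifications onto the list for their priority.
        r.lpush(f"notifications:{priority}", json.dumps(notification))


    def worker():
        # BRPOP checks the keys in the order given and pops from the first
        # non-empty list, so whenever there is a backlog the critical queue is
        # drained before the default and bulk queues are even considered.
        while True:
            _queue, raw = r.brpop(QUEUES)
            notification = json.loads(raw)
            # format the content and call the delivery vendor here (omitted)

Because Redis list pops are atomic, handling a spike is then just a matter of running more worker processes; they can consume from the same lists without stepping on each other.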