Hidden Scalability Bottlenecks In User Notifications

This article covers the not-so-obvious bottlenecks that cause scalability issues in the Notification Service of your software, along with easy-to-implement remedies.

May 8, 2023

Context: Constantly working with companies that deprecate their in-house User Notification Service, we have learned much about what makes their solutions "not fun to scale." In each instance, there is a scalability ceiling that the architects have missed.

Why are Notifications so problematic? We hypothesize that Notifications are not the "breadwinner" or noteworthy feature of most software companies. Thus, software architects are not dedicating enough time (or are not given enough time) to research and discover all the different bottlenecks involved.

So, here we go:

Bottleneck #1: Downstream Service Quotas

AWS SES Dashboard Displaying Email Quotas

In a nutshell: Your messaging service provider (SES, Twilio, ...) has a cap on your account

If anyone could sign up for an email service and send a million emails, our inboxes would look very different. These service providers have strict anti-spam rules and regulations to follow. Generally, there are hidden or visible caps on what you can send. Increasing these caps requires manual business verification, building a good relationship with your service provider, quota increase requests, campaign registry, etc.

Solution: 

  1. Allocate 1-2 weeks to learning your service provider quotas and increasing your allocated quotas
  2. Continuously monitor your usage vs. the quota, so that when your users scale, you do not hit the caps
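The monitoring step can be as small as a periodic check that compares what you have sent in the current window against the cap. The sketch below is one possible shape, not a prescribed implementation: `check_quota`, `QuotaStatus`, and the 80% / 95% thresholds are hypothetical names and values chosen for illustration. The two inputs would come from your provider; Amazon SES, for example, reports them as `SentLast24Hours` and `Max24HourSend` through its GetSendQuota API.

```python
from dataclasses import dataclass


@dataclass
class QuotaStatus:
    used_fraction: float  # share of the cap already consumed
    remaining: float      # messages left in the current window
    level: str            # "ok", "warn", or "critical"


def check_quota(sent: float, cap: float,
                warn_at: float = 0.8, critical_at: float = 0.95) -> QuotaStatus:
    """Compare usage in the current window against the provider cap.

    The thresholds are assumptions; pick values that leave enough time
    to file a quota increase request before the cap is reached.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    used = sent / cap
    if used >= critical_at:
        level = "critical"
    elif used >= warn_at:
        level = "warn"
    else:
        level = "ok"
    return QuotaStatus(used_fraction=used, remaining=max(cap - sent, 0), level=level)


# Example: 41,000 of 50,000 messages used in the current window.
status = check_quota(sent=41_000, cap=50_000)
print(status.level, status.remaining)
```

Running a check like this on a schedule and alerting on "warn" gives you lead time, which matters because quota increases usually involve a manual review on the provider's side.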

Bottleneck #2: Lack of Jobs/Queues

In a nutshell: Notifications require too much logic and too many network calls to run in between the business logic

In most software, notification requirements grow steadily more complex over time:

  • Users request to receive their notifications on different channels
  • Users want notification subscription management
  • You may need to implement error handling and logging
  • Product managers wish to get analytics and insights

This complexity is a good reason to separate the notification logic from the business logic. Furthermore, a single notification involves many network calls (user subscription preferences, multiple messaging services, insights, logging). Running them inline slows down the business operation that triggered the notification and lets a notification failure break it, so the notification logic belongs in its own job/queue.

Solution: Regardless of how simple your notifications seem, separate them using jobs/queues/workers
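As a rough illustration of that separation, the sketch below has the business logic only enqueue a job, while a worker does the slow notification work. It uses Python's in-process `queue.Queue` purely as a stand-in; in production you would more likely use a durable queue or job system. `place_order`, `send_notification`, and the job fields are hypothetical names.

```python
import queue
import threading

notification_queue = queue.Queue()
delivered = []  # stands in for "messages handed to a provider"


def send_notification(job: dict) -> None:
    """Hypothetical slow path: preferences lookup, templating, provider call."""
    delivered.append(job)


def worker() -> None:
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = notification_queue.get()
        try:
            if job is None:
                return
            send_notification(job)
        except Exception as exc:
            # A failed notification is logged here; it never reaches the caller.
            print(f"notification job failed: {exc!r}")
        finally:
            notification_queue.task_done()


def place_order(user_id: int, order_id: int) -> str:
    """Business logic: does its own work, then only enqueues the notification."""
    notification_queue.put(
        {"user_id": user_id, "event": "order_placed", "order_id": order_id}
    )
    return "order accepted"


worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()

result = place_order(user_id=7, order_id=1001)

notification_queue.join()      # demo only: wait for the job to be processed
notification_queue.put(None)   # ask the worker to stop
worker_thread.join()
```

The point of the design is that `place_order` returns as soon as the job is queued, so a slow or failing messaging provider cannot slow down or fail the order itself.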

Bottleneck #3: Batching Jobs

In a nutshell: Batching the jobs seems efficient, but most messaging services don't support batching

Suppose you batch 1000 emails and send them in the same process/job. Errors during this process are almost inevitable. These are some of the real things we have seen go wrong:

  • Hitting the environment's memory limit due to loading lots of user subscription options and generating HTML email templates
  • An invalid email address
  • User preferences database reaching capacity
  • An error in generating a complex HTML email template for a specific user

When something goes wrong in the middle of a batched job, you have three options:

  1. Abandon the whole job, and some of your users don't receive their notifications
  2. Retry the whole job, and some of your users receive their notifications twice
  3. Create logic to retry only the failed notifications (sometimes not possible, for example when hitting the memory limit of your environment)

All this is avoidable by making your jobs more granular.

Solution: Make your jobs as granular as possible, e.g., one email per job
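To make the difference concrete, here is a minimal sketch of one-email-per-job processing. `send_email` and `run_job` are hypothetical helpers, and the "@" check is a deliberately naive placeholder for real validation. Each recipient is handled in isolation, so one bad address neither blocks the others nor causes anyone to be notified twice on retry.

```python
def send_email(address: str) -> None:
    """Hypothetical sender; rejects obviously invalid addresses."""
    if "@" not in address:
        raise ValueError(f"invalid email address: {address!r}")
    # A real implementation would call the messaging provider here.


def run_job(address: str, max_attempts: int = 3) -> bool:
    """One job = one email, so a retry can only ever re-send this one email."""
    for attempt in range(1, max_attempts + 1):
        try:
            send_email(address)
            return True
        except ValueError:
            return False  # permanent failure: retrying will not help
        except Exception:
            if attempt == max_attempts:
                return False  # transient failure, but out of attempts
    return False


recipients = ["ana@example.com", "not-an-address", "li@example.com"]

# In a real system each address would be its own queued job;
# a plain loop is enough to show the isolation.
results = {address: run_job(address) for address in recipients}
failed = [address for address, ok in results.items() if not ok]
print(failed)
```

With a batched job, the invalid address in the middle would force a choice between dropping or duplicating the neighbouring emails. Here it simply shows up as one failed job.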

Bottleneck #4: Databases

In a nutshell: Notification-related tables or databases are heavily utilized, generally more than other tables/DBs

We have seen this happen too often: one day, the database hits its limits, and nothing goes in or out of your notification service. You may need tables/databases for notifications in various scenarios:

  • Storing user notification preferences
  • In-app notifications
  • Log of notifications
  • Using a database as a job queue instead of using a proper queue (don't do this)

Most software architects underestimate the utilization of their notification tables/DBs. Some events in your software generate multiple notifications at once - e.g., an event notifying all 50 users on one account. Some actions could generate 100-200 notification jobs, each requiring at least some DB reads and potentially some DB writes. So while your application might utilize its core tables at 1x, it may use the notification tables at 5x, 10x, or more.

Solution: 

  1. Use a highly scalable database for your notification needs, separated from other application data
  2. Implement a retry mechanism - even highly scalable databases have some limits
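One common way to implement the retry mechanism is exponential backoff with jitter, sketched below. This is an assumption about how you might do it rather than the only option: `with_retry`, `TransientDBError`, and the delay values are hypothetical, and you would map `TransientDBError` to whatever throttling or timeout errors your database client actually raises.

```python
import random
import time


class TransientDBError(Exception):
    """Hypothetical error for throttling, timeouts, and similar conditions."""


def with_retry(operation, max_attempts=5, base_delay=0.05, max_delay=1.0,
               sleep=time.sleep):
    """Run operation(), retrying transient failures with backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientDBError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the job runner
            # The delay ceiling doubles each attempt; the random jitter keeps
            # many workers from retrying at the same instant.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Demo: a write that is throttled twice before succeeding.
calls = {"count": 0}


def flaky_write():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientDBError("throttled")
    return "stored"


outcome = with_retry(flaky_write, sleep=lambda seconds: None)  # no real waiting in the demo
print(outcome, calls["count"])
```

Capping the number of attempts matters as much as the backoff itself: unbounded retries from hundreds of notification jobs can keep an overloaded database from recovering.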
