Inside the IT response to a traffic surge
May 18 · 7 min read
A traffic surge is a challenge every business is happy to take on. While it validates product-market fit, it also exposes architectural weaknesses that stay hidden under normal load. For messaging platforms, the stakes are particularly high: users expect instant message delivery, seamless authentication, and zero data loss, regardless of how many users are online at once.
Zoho's traffic surge management capabilities were recently put to the test when its B2C messaging application, Arattai, faced a sudden influx. For a messaging platform that had been growing steadily since its 2021 launch, this spike could have been catastrophic without proper monitoring and analytics. Before getting into how the teams handled technical and operational challenges, it's worth understanding where messaging platforms typically break under load:
- Authentication bottlenecks: OTP delivery systems often can't scale due to rate-limiting from SMS providers. When thousands of users attempt to register simultaneously, verification messages queue up, creating 5-10 minute delays that can cause immediate user abandonment.
- Database connection exhaustion: Multi-tenant architectures share connection pools across tenants. Think of database connections as phone lines into a call center, where all customers share the same pool of lines. If a single tenant suddenly generates unusual activity, it can monopolize most of those connections. Other customers then can't get through, even though their usage is normal. This creates system-wide slowdowns that appear random to engineering teams.
- Storage I/O saturation: Storage systems have two limits: how much data they can hold (capacity) and how fast they can read and write that data (I/O throughput). When thousands of users upload profile photos simultaneously, storage systems hit I/O limits before reaching capacity limits. The storage reports plenty of space available, but upload requests queue up because the system can't write files fast enough. Users see mysterious upload failures even though the system isn't full.
- Cascade failures in microservices: When one service (such as user profile retrieval) slows down, the services that call it start timing out. These timeouts tie up thread pools, causing the calling services to fail in turn. Within minutes, a slowdown in one minor service can cascade into a platform-wide outage. The challenge is that the original problem might be buried under dozens of error messages from the services that depend on it.
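One common guard against the connection-exhaustion scenario described above is a per-tenant cap on pooled connections. The following is a minimal sketch of that idea, not a description of how Arattai handles it; the class name, pool size, cap, and tenant IDs are all hypothetical.

```python
import threading
from collections import defaultdict


class TenantConnectionLimiter:
    """Caps how many pooled connections any single tenant may hold at once."""

    def __init__(self, pool_size: int, per_tenant_cap: int):
        self.pool_size = pool_size            # total connections in the shared pool
        self.per_tenant_cap = per_tenant_cap  # assumed fairness limit per tenant
        self._in_use = defaultdict(int)       # tenant_id -> connections held
        self._total = 0
        self._lock = threading.Lock()

    def try_acquire(self, tenant_id: str) -> bool:
        """Return True if the tenant may take a connection, False otherwise."""
        with self._lock:
            if self._total >= self.pool_size:
                return False  # the shared pool is genuinely full
            if self._in_use[tenant_id] >= self.per_tenant_cap:
                return False  # this tenant is at its cap; others are unaffected
            self._in_use[tenant_id] += 1
            self._total += 1
            return True

    def release(self, tenant_id: str) -> None:
        """Hand a connection back to the pool."""
        with self._lock:
            if self._in_use[tenant_id] > 0:
                self._in_use[tenant_id] -= 1
                self._total -= 1


# A noisy tenant asks for 8 connections; a quiet tenant asks for 1.
limiter = TenantConnectionLimiter(pool_size=10, per_tenant_cap=4)
noisy = [limiter.try_acquire("tenant-a") for _ in range(8)]
quiet = limiter.try_acquire("tenant-b")
print(noisy.count(True), quiet)  # 4 True
```

With the cap in place, the noisy tenant is refused after its fourth connection, and the quiet tenant still gets through. The trade-off is that a tenant with a legitimate burst is throttled even when the pool has spare capacity.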
Managing the crisis
When users flood to a new platform, the first thing they check is whether it works. A traffic surge often means crashes, and crashes mean users leave immediately. The first line of defense against surge-related failures is visibility. Here, we deployed monitoring at two critical layers:
Infrastructure monitoring tracked not just whether services were up, but how they were performing under load. Using 90-day historical data, the tool identified usage patterns and bottlenecks, allowing our team to plan proactively, auto-scale infrastructure before peak hours, and deploy targeted optimizations. The system enabled the team to prioritize responses by business impact, allocating maximum resources to critical services such as authentication and messaging, high priority to core services such as storage and search, and close monitoring to other features.
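To illustrate prioritizing by business impact, here is a minimal sketch that orders incoming alerts by service tier and then by severity. The tier grouping follows the paragraph above; the numeric ranks, the `Alert` type, the error rates, and the "stickers" feature are assumptions made for the example.

```python
from dataclasses import dataclass

# Tier grouping follows the text; the numeric ranks are assumed.
SERVICE_TIER = {
    "authentication": 0,  # critical
    "messaging": 0,       # critical
    "storage": 1,         # core
    "search": 1,          # core
}
DEFAULT_TIER = 2          # every other feature: monitored closely


@dataclass
class Alert:
    service: str
    error_rate: float  # fraction of failing requests, 0.0 to 1.0


def triage(alerts):
    """Order alerts by business tier first, then by error rate (worst first)."""
    return sorted(
        alerts,
        key=lambda a: (SERVICE_TIER.get(a.service, DEFAULT_TIER), -a.error_rate),
    )


alerts = [
    Alert("stickers", 0.30),        # hypothetical non-critical feature
    Alert("search", 0.05),
    Alert("authentication", 0.02),
    Alert("storage", 0.10),
]
ordered = [a.service for a in triage(alerts)]
print(ordered)  # ['authentication', 'storage', 'search', 'stickers']
```

Note that the authentication alert comes first despite having the lowest error rate: tier outranks raw severity in this ordering.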
Application-level monitoring provided granular visibility into the user experience across device types, OS versions, and network conditions. The critical metric here was crash patterns, which distinguish isolated issues from systemic ones. A 0.1% crash rate sounds acceptable, but not when it's concentrated entirely on Android 14 devices during the registration flow. Other app performance metrics and user journey mapping also provided insights into usage patterns.
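The gap between an aggregate crash rate and a concentrated one is easy to see with a small calculation. The sketch below uses synthetic session data (not Arattai's numbers) and a hypothetical helper, `crash_rate_by_segment`, to group crashes by OS version and user flow.

```python
from collections import defaultdict


def crash_rate_by_segment(sessions):
    """Compute crash rate per (os_version, flow) from (os, flow, crashed) tuples."""
    totals = defaultdict(int)
    crashes = defaultdict(int)
    for os_version, flow, crashed in sessions:
        key = (os_version, flow)
        totals[key] += 1
        crashes[key] += int(crashed)
    return {key: crashes[key] / totals[key] for key in totals}


# Synthetic data: 10,000 sessions, 10 crashes, all on Android 14 registration.
sessions = (
    [("Android 14", "registration", True)] * 10
    + [("Android 14", "registration", False)] * 190
    + [("Android 13", "registration", False)] * 4800
    + [("iOS 17", "messaging", False)] * 5000
)

overall = sum(crashed for _, _, crashed in sessions) / len(sessions)
by_segment = crash_rate_by_segment(sessions)

print(overall)                                     # 0.001 (0.1% overall)
print(by_segment[("Android 14", "registration")])  # 0.05  (5% in one segment)
```

The overall figure looks healthy, while one in twenty Android 14 registrations is failing. Segmenting by device and flow is what surfaces the problem.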
A public status page gave users real-time visibility into service health, showing which components were operational and which were under stress. It also served a secondary function by deflecting support load. When users saw a notice such as "Authentication service experiencing delays" alongside resolution updates, they were more likely to wait than to contact support immediately or post negative reviews.
Scaling up
Zorro, the in-house engineering and cloud operations team behind our solutions, was responsible for overseeing scaling efforts when traffic suddenly surged. The main challenge here was Arattai's underlying multi-tenant architecture, which meant the team couldn't just throw more resources at one struggling component. Scaling any single component without the others would simply shift the bottleneck elsewhere. They needed to scale multiple components simultaneously to ensure the entire system could handle the surge from all angles without compromising performance or data integrity.
The Zorro team employed two different scaling strategies depending on the architecture and what each component needed:
Vertical scaling meant upgrading individual servers to be more powerful, essentially giving each machine more muscle to handle heavier workloads. This approach worked well for components that needed more raw processing power, e.g., database primary servers that handle all write operations and cache servers that store frequently accessed data in memory. The advantage here is simplicity. Instead of changing the architecture, we're making existing components stronger.
Horizontal scaling took a different approach. Here, the team added more instances to distribute traffic effectively. Think of it like opening more checkout lanes at a grocery store instead of hiring a faster cashier. API servers, message processors, and authentication services can be scaled horizontally. The advantage here is built-in redundancy. If one instance fails, others continue serving traffic.
During the traffic surge, the infrastructure grew from approximately 1,000 instances to 2,000, doubling the system's capacity to keep up with demand. Server instances weren't the only concern, however. With thousands of new users creating accounts, uploading data, and generating activity, the team also had to add storage capacity rapidly. They provisioned additional storage volumes and expanded database capacity to ensure that new users could successfully create accounts and that existing users wouldn't experience issues.
Now that traffic has normalized, the Zorro team intentionally over-provisions, running approximately twice the capacity that current traffic levels technically require. This buffer provides several benefits:
- Controlled degradation: If traffic spikes again, the system has room to absorb it while engineering teams respond.
- Maintenance windows: Over-provisioned systems can take components offline for updates without impacting performance.
- Blast radius limitation: If a component fails, remaining instances can handle the redistributed load.
IT teams must internalize that the cost of over-provisioning is lower than the cost of downtime or lost users. It's insurance against the next surge and protection against cascade failures.
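The benefits of a roughly 2x buffer can be made concrete with some simple arithmetic. The sketch below assumes identical instances and evenly redistributed load, and uses illustrative figures rather than Arattai's actual post-surge numbers; both function names are hypothetical.

```python
def utilization(required: int, provisioned: int, failed: int = 0) -> float:
    """Fraction of remaining capacity in use after `failed` instances drop out."""
    remaining = provisioned - failed
    if remaining <= 0:
        raise ValueError("no instances left to serve traffic")
    return required / remaining


def absorbable_surge(required: int, provisioned: int) -> float:
    """Traffic multiplier the fleet can absorb before it is fully utilized."""
    return provisioned / required


required = 1_000     # hypothetical: instances needed for current traffic
provisioned = 2_000  # approximately 2x over-provisioned

print(utilization(required, provisioned))                        # 0.5
print(round(utilization(required, provisioned, failed=500), 2))  # 0.67
print(absorbable_surge(required, provisioned))                   # 2.0
```

Under these assumptions, the fleet runs at half utilization, can lose a quarter of its instances to failure or maintenance and still sit at about two-thirds utilization, and can absorb a doubling of traffic before saturating.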
Overall, Arattai demonstrated enterprise-grade reliability that matched industry standards during the traffic surge. The rapid scaling efforts, combined with 99.9%+ performance across most components, are evidence of mature infrastructure, effective monitoring, and effective incident response. More importantly, message delivery times remained consistent and authentication flows didn't experience significant delays. Degradation in either of these two metrics would have caused immediate user churn.