tiles.desh-api.com — Root Cause Analysis (Updated April 20, 2026)
Disabling the /conversions/mobavenue route freed up CPU. This confirmed that conversions were being affected by the shopping API overload rather than causing it. The conversions route was a victim of the saturated event loop, and removing it reduced load on the already-struggling workers.
Root cause: a single unprotected external dependency — the shopping autocomplete API.
At 13:00:00 UTC on both April 16 and April 18, getShoppingAutocompleteSuggestions began timing out on every call. The shopping server itself showed normal CPU — the failure was network or application-level on the shopping server side. With no circuit breaker, every incoming tile search request that included a shopping component spawned a hanging async call. All 22 Node.js workers began accumulating these pending operations, CPU climbed to ~100%, and the entire port 3000 event loop became saturated.
Once the event loop was saturated, everything on port 3000 slowed down — including Mobavenue conversion postbacks, which also proxy to 127.0.0.1:3000. Conversions were a secondary victim of the overload, not a contributor to it. Disabling the conversion route worked by reducing the total number of inflight requests on the saturated workers, giving them enough breathing room to process the backlog and prevent further port pool drain.
At 14:38:28 UTC, the Linux ephemeral port pool (~28K ports) was fully exhausted, causing errno 99: Cannot assign requested address on every new nginx → port 3000 connection and a server-wide total collapse.
getShoppingAutocompleteSuggestions times out on every call. Shopping server CPU was normal — the failure was connectivity or application-level on the shopping server side. Same trigger on both April 16 and April 18.
No circuit breaker means each tile search with a shopping component spawns a call that never resolves. Pending async callbacks pile up across all workers. CPU climbs toward 100%.
With CPU at ~100% and the event loop saturated, all requests on port 3000 slow down — including Mobavenue conversion postbacks, which also proxy to 127.0.0.1:3000. Conversions begin to back up and respond slowly. logEvent latency: 152ms → 472ms (3× spike at 13:00).
Once its keepalive connections are all busy, nginx opens a new TCP connection to port 3000 for each additional request. Slow or hanging responses hold those ports for longer, and every closed connection then lingers in TIME_WAIT for 60s. The Linux default port range 32768–60999 gives ~28K ports. Both tile searches and slow conversion connections contribute to the drain.
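As a rough check on the port arithmetic above, the sketch below computes the pool size and the connection rate at which a 60s TIME_WAIT would use up the pool. It assumes every upstream request gets a fresh connection, which is a simplification (keepalive reuse lowers the real rate), and the function names are illustrative only.

```javascript
// Ephemeral port budget for nginx -> 127.0.0.1:3000 connections.
// Simplifying assumption: each request uses a fresh connection, and the
// closed socket then pins its port in TIME_WAIT for the full hold time.

function portPoolSize(low, high) {
  return high - low + 1; // the range is inclusive on both ends
}

function maxSustainedConnRate(poolSize, timeWaitSeconds) {
  // Above this rate, ports enter TIME_WAIT faster than they are released.
  return poolSize / timeWaitSeconds;
}

const defaultPool = portPoolSize(32768, 60999); // Linux default range
const widenedPool = portPoolSize(15000, 65535); // range proposed in this report

console.log(defaultPool);                                       // 28232
console.log(Math.floor(maxSustainedConnRate(defaultPool, 60))); // 470
console.log(widenedPool);                                       // 50536
console.log(Math.floor(maxSustainedConnRate(widenedPool, 60))); // 842
```

Under this simplified model the default range sustains roughly 470 new connections per second; widening the range raises the ceiling but does not remove it.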
Every new nginx → port 3000 connection: errno 99: Cannot assign requested address. 5.36M [crit] errors in 2h22m. tiles.desh-api.com, conversions.desh.app, addons-api.desh.app all fail server-wide.
Removing one stream of inflight requests from the saturated event loop reduced CPU load. Workers processed the remaining backlog faster. TIME_WAIT connections expired, freeing ports. The shopping API failure was still present, but with fewer concurrent requests the server could limp through. Not a fix — only reduces the amplification.
The /conversions/mobavenue route proxies to 127.0.0.1:3000, the same overloaded Node.js workers. When the shopping API failure drove CPU to ~100%, every additional request on port 3000 (including conversions) made recovery harder. Disabling the route reduced the total request count on the saturated workers, giving them enough slack to process the remaining backlog while TIME_WAIT connections expired and freed ports.
⚠ This will not prevent the next incident. When the shopping API fails again, the server will reach the same state — just slightly more slowly without conversion traffic. The only real fix is a circuit breaker on getShoppingAutocompleteSuggestions.
| Hour (UTC) | 499 Client Abort | 502 Bad Gateway | 504 Timeout | Total Errors | vs Baseline |
|---|---|---|---|---|---|
| 12:xx normal | 53,154 | 30 | 0 | 53,184 | 1× |
| 13:xx onset | 89,142 | 218 | 0 | 89,360 | 1.7× |
| 14:xx ▲ collapse | 2,846,545 | 1,807,956 | 90,393 | 4,744,894 | 89× |
| 15:xx ▲▲ PEAK | 4,469,773 | 2,718,736 | 148,440 | 7,336,949 | 138× |
| 16:xx ▼ recovery | 1,411,446 | 835,252 | 51,465 | 2,298,163 | 43× |
| 17:xx ▼▼ resolved | 63,323 | 17 | 0 | 63,340 | 1.2× |
The 5.36M [crit] entries are NOT SSL errors; they are confirmed as errno 99: Cannot assign requested address. First hit at 14:38:28 UTC:
2026/04/16 14:38:28 [crit] connect() to 127.0.0.1:3000 failed (99: Cannot assign requested address) while connecting to upstream, request: "POST /tiles/search HTTP/2.0", server: tiles.desh-api.com
| Vhost | Port Exhaustion Errors | Note |
|---|---|---|
| tiles.desh-api.com | 5,361,419 | Primary — tile searches |
| addons-api.desh.app | 3,683 | Collateral |
| conversions.desh.app | 2,354 | Collateral — conversion postbacks also lost |
| tiles-exp + tiles-exp1 | 175 | Collateral |
| Total | 5,367,631 | |
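Per-vhost counts like those in the table above could be reproduced with a small log scan. The sketch below is fitted to the single sample line quoted in this report; the regex and function name are assumptions, and other nginx log layouts may need adjustments.

```javascript
// Count errno 99 upstream connect failures per vhost from nginx error-log lines.
// The pattern is derived from the one sample line in this report, not from
// a full survey of the log format.
const PORT_EXHAUSTION =
  /\[crit\].*connect\(\) to (\S+) failed \(99: Cannot assign requested address\).*?server: ([^,\s]+)/;

function countPortExhaustion(lines) {
  const counts = new Map(); // vhost -> number of matching lines
  for (const line of lines) {
    const match = PORT_EXHAUSTION.exec(line);
    if (!match) continue; // skip anything that is not a port-exhaustion entry
    const vhost = match[2];
    counts.set(vhost, (counts.get(vhost) || 0) + 1);
  }
  return counts;
}

const sample = [
  '2026/04/16 14:38:28 [crit] connect() to 127.0.0.1:3000 failed (99: Cannot assign requested address) while connecting to upstream, request: "POST /tiles/search HTTP/2.0", server: tiles.desh-api.com',
  'placeholder line that should not match',
];

const counts = countPortExhaustion(sample);
console.log(counts.get('tiles.desh-api.com')); // 1
```

For multi-gigabyte logs the same function would be fed from a streaming line reader rather than an in-memory array.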
| Theory | Status | Evidence |
|---|---|---|
| Shopping server overload (CPU) | Ruled out | Shopping server CPU was normal during incident — failure was on the shopping server's side (network/app), not resource exhaustion |
| Mobavenue conversions caused the overload | Ruled out | April 18 confirmed the same shopping API timeout as the trigger. Conversions were slowed by the overload, not causing it. Disabling them reduced load but did not remove the cause. |
| 5.36M [crit] = SSL errors | Ruled out | Confirmed errno 99: Cannot assign requested address (port exhaustion, not SSL) |
| OOM / process crash | Ruled out | All 22 workers ran 00:00–23:59 without restart. No OOM events in syslog. |
| Database / Redis issue | Ruled out | No DB-related errors in any log file. |
getShoppingAutocompleteSuggestions
This is the only real fix. Trip after N consecutive failures and return [] without calling the external API. Tile searches complete immediately, no hanging connections, no CPU saturation, no port pool drain. Without this, the incident will recur on every shopping API failure.
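A minimal sketch of the breaker described above, assuming failures surface as rejected promises (a call that hangs forever never counts as a failure, so the wrapped call needs its own timeout). The class name, threshold, cool-down, and the suggestionsOrEmpty wrapper are illustrative and not taken from the codebase.

```javascript
// Minimal circuit breaker: closed -> open after N consecutive failures,
// open -> trial requests allowed again once the cool-down has elapsed.
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;       // injectable clock, useful for tests
    this.failures = 0;    // consecutive failures seen so far
    this.openedAt = null; // null means the breaker is closed
  }

  allowRequest() {
    if (this.openedAt === null) return true;
    // After the cool-down, let trial requests through (half-open).
    // This simple version does not limit how many trials run at once.
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }

  // Run fn through the breaker; return fallback when open or when fn rejects.
  async call(fn, fallback) {
    if (!this.allowRequest()) return fallback;
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      return fallback;
    }
  }
}

const shoppingBreaker = new CircuitBreaker({ failureThreshold: 5, cooldownMs: 30000 });

// Hypothetical call site: the real signature of
// getShoppingAutocompleteSuggestions is not shown in this report,
// so it is passed in as a parameter here.
async function suggestionsOrEmpty(query, getShoppingAutocompleteSuggestions) {
  return shoppingBreaker.call(() => getShoppingAutocompleteSuggestions(query), []);
}
```

While the breaker is open, tile searches receive [] immediately and no outbound call is made, which is the behaviour this recommendation asks for.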
Current timeout allows thousands of hanging calls to accumulate per worker before releasing. A 1–2s timeout means each failed call frees its worker slot quickly, dramatically slowing port pool drain even without a circuit breaker.
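One way to enforce a short deadline is sketched below. withTimeout only stops waiting; it does not cancel the underlying request. The fetch-based variant assumes the shopping client could be built on the global fetch available in recent Node.js versions, which this report does not confirm, and both function names are illustrative.

```javascript
// Reject if the wrapped promise has not settled within ms milliseconds.
// Note: this abandons the call but does not close the underlying socket.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer so it cannot linger.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Assumption: a fetch-based client. AbortSignal.timeout aborts the request
// itself, so the connection is released instead of merely ignored.
async function fetchJsonWithTimeout(url, ms) {
  const response = await fetch(url, { signal: AbortSignal.timeout(ms) });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response.json();
}
```

A 1–2s deadline would be passed as ms = 1000 to 2000; a rejected call can then be treated as a failure by whatever fallback logic wraps it.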
Same failure triggered both April 16 and April 18. The shopping server CPU was normal both times — the failure is connectivity or application-level on the shopping side. Until identified and fixed, this will recur. Check network routes, DNS, TLS config, and shopping server app logs at 13:00 on both dates.
Keepalive was active and port exhaustion still happened. Widening the range delays exhaustion but doesn't prevent it — only the circuit breaker does. Still worth doing as a secondary defence.
    # /etc/sysctl.conf
    # ~50K ports, avoids conflict with listening services
    net.ipv4.ip_local_port_range = 15000 65535
    # allow TIME_WAIT sockets to be reused for new outbound connections
    net.ipv4.tcp_tw_reuse = 1
    upstream api {
        server 127.0.0.1:3000;
        keepalive 128;
        keepalive_requests 10000;
        keepalive_timeout 100s;
    }
Why keepalive didn't help here: keepalive 128 means up to 128 idle connections are pooled per nginx worker. When the shopping API hangs, those 128 connections immediately fill with in-flight requests that never complete — they are not idle and cannot be reused. Every new request beyond 128 per worker opens a fresh connection to port 3000, which cycles through TIME_WAIT on close. At incident-level traffic volumes this still exhausts ~28K ports in ~98 minutes. This confirms the circuit breaker (rec #1) is the only effective protection.
Health endpoint aborting at 14:23 was a 15-minute early warning before full collapse at 14:38. An alert here enables manual intervention — including disabling the conversions route — before the port pool exhausts.
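At 13:xx, 502s totalled 218, about 7× the 12:xx baseline of 30. With only 218 in the whole hour, a fixed alert at “>500 502s in 5 minutes” would not trip during 13:xx; it would fire once 502s escalated in the 14:xx hour. Catching the onset itself needs a lower threshold or a ratio-to-baseline rule. Phase 1 manual intervention (disable conversions) is faster than waiting for full collapse.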
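A fixed-threshold rule of this kind can be sketched as a sliding-window counter. The class below is illustrative only; in practice the rule would more likely live in the monitoring stack than in application code, and the threshold and window are parameters to tune against the hourly figures in this report.

```javascript
// Sliding-window counter: alerts when more than `threshold` events
// fall inside the last `windowMs` milliseconds.
class RateAlert {
  constructor({ threshold = 500, windowMs = 5 * 60 * 1000 } = {}) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.buckets = new Map(); // epoch second -> event count
  }

  // Record `count` events (for example 502 responses) at timestampMs.
  record(timestampMs, count = 1) {
    const second = Math.floor(timestampMs / 1000);
    this.buckets.set(second, (this.buckets.get(second) || 0) + count);
  }

  // True when the events inside the window exceed the threshold.
  shouldAlert(nowMs) {
    const cutoff = Math.floor((nowMs - this.windowMs) / 1000);
    let total = 0;
    for (const [second, count] of this.buckets) {
      if (second <= cutoff) this.buckets.delete(second); // expire old buckets
      else total += count;
    }
    return total > this.threshold;
  }
}

const alert502 = new RateAlert({ threshold: 500, windowMs: 5 * 60 * 1000 });
alert502.record(0, 218);                 // the whole 13:xx hour's 502s, compressed
console.log(alert502.shouldAlert(1000)); // false
```

Even if the entire 13:xx count of 218 arrived in one burst, a threshold of 500 would stay silent, which illustrates how sensitive the rule is to the chosen number.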
Event loop lag >100ms signals event loop saturation before it becomes critical. Export to Grafana. Would have fired at 13:00 when logEvent latency tripled.
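Node.js exposes event loop delay through perf_hooks.monitorEventLoopDelay, which could back such a metric. The sketch below is one possible shape: the function names, the 10-second reporting interval, and the choice of p99 are assumptions, and onReport stands in for whatever exporter feeds Grafana.

```javascript
const { monitorEventLoopDelay } = require('perf_hooks');

const NS_PER_MS = 1e6; // the histogram reports delays in nanoseconds

// Summarise a delay histogram against a lag threshold in milliseconds.
function lagReport(histogram, thresholdMs = 100) {
  const p99Ms = histogram.percentile(99) / NS_PER_MS;
  const maxMs = histogram.max / NS_PER_MS;
  return { p99Ms, maxMs, breached: p99Ms > thresholdMs };
}

// Start sampling and pass a report to onReport every intervalMs.
// Returns a function that stops the monitor.
function startLagMonitor(onReport, { intervalMs = 10000, thresholdMs = 100 } = {}) {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();
  const timer = setInterval(() => {
    onReport(lagReport(histogram, thresholdMs));
    histogram.reset(); // begin a fresh measurement window
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}

// Example wiring (hypothetical exporter):
// const stop = startLagMonitor((report) => {
//   if (report.breached) console.warn('event loop lag', report);
// });
```

Each worker would run its own monitor, since every Node.js process has a separate event loop.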
error.log shows SSL handshake failures to 18.172.64.113 (Branch API) since March 22. Unrelated to this incident but should be resolved.
| File | Size | Key finding |
|---|---|---|
| access-tiles.desh-api.com.log | 4.2 GB | Hourly status code breakdown |
| access-conversions.desh.app.log | - | Mobavenue volume steady, no spike at 13:00 |
| error-conversions.desh.app.log | 1.8 MB | Confirmed /conversions/mobavenue → port 3000 |
| error-tiles.desh-api.com.log | 1.8 GB | 5.36M port exhaustion [crit] entries |
| mobavenue-conversions.log | 15,843 lines | Postback detail |
| combined-addons-api-server-{8–29}-2026-04-16.log | ~210 MB × 22 | Shopping timeout spike: 2 → 59 in first minute at 13:00 |
| combined-addons-analytics-server-{0–15}-2026-04-16.log | ~4 MB × 16 | logEvent 152ms → 472ms at 13:00 (3× spike) |
| combined-addons-api-cron-0-2026-04-16.log | - | OEMW ENOTFOUND at 12:52 (early signal) |
| syslog | 15 GB | No OOM, no SIGKILL, no process restart |