P0 Incident — Recurring

Incident Report: April 16 & 18, 2026

tiles.desh-api.com — Root Cause Analysis (Updated April 20, 2026)

Server: desh-addons-server
First incident: Apr 16 · 13:00 – 17:00 UTC
Recurrence: Apr 18 · same pattern
Mitigation: disable /conversions/mobavenue
Status: ⚠ Mitigated — not fixed
  • 5.73M — nginx errors on April 16
  • 5.36M — port exhaustion [crit] errors
  • 138× — peak error rate vs baseline
  • 2 — times this incident has occurred (Apr 16 & Apr 18)

⚠ Recurring Incident — Same Root Cause Both Times

The incident recurred on April 18, 2026, with the same shopping API timeout as the trigger. Both times, disabling /conversions/mobavenue restored CPU and mitigated the outage — but conversions were a victim of the overload, not a cause. The permanent fix is a circuit breaker on the shopping API call; until it ships, the incident will recur.

! Executive Summary

🔄 Causality corrected (April 20): On April 18, the same shopping API timeout occurred. Disabling the /conversions/mobavenue route freed up CPU, confirming that conversions were being affected by the shopping API overload — not causing it. The conversions route was a victim of the saturated event loop, and removing it reduced load on the already-struggling workers.

Root cause: a single unprotected external dependency — the shopping autocomplete API.

At 13:00:00 UTC on both April 16 and April 18, getShoppingAutocompleteSuggestions began timing out on every call. The shopping server itself showed normal CPU — the failure was network or application-level on the shopping server side. With no circuit breaker, every incoming tile search request that included a shopping component spawned a hanging async call. All 22 Node.js workers began accumulating these pending operations, CPU climbed to ~100%, and the entire port 3000 event loop became saturated.

Once the event loop was saturated, everything on port 3000 slowed down — including Mobavenue conversion postbacks, which also proxy to 127.0.0.1:3000. Conversions were a secondary victim of the overload, not a contributor to it. Disabling the conversion route worked by reducing the total number of inflight requests on the saturated workers, giving them enough breathing room to process the backlog and prevent further port pool drain.

At 14:38:28 UTC, the Linux ephemeral port pool (~28K ports) was fully exhausted, causing errno 99: Cannot assign requested address on every new nginx → port 3000 connection and a server-wide total collapse.

🔗 Root Cause Chain

1. 13:00:00 UTC — Shopping autocomplete API fails

getShoppingAutocompleteSuggestions times out on every call. Shopping server CPU was normal — the failure was connectivity or application-level on the shopping server side. Same trigger on both April 16 and April 18.

2. 22 Node.js workers accumulate hanging tile search requests

No circuit breaker means each tile search with a shopping component spawns a call that never resolves. Pending async callbacks pile up across all workers. CPU climbs toward 100%.

3. Event loop saturation slows everything on port 3000 (😖 secondary victim)

With CPU at ~100% and the event loop saturated, all requests on port 3000 slow down — including Mobavenue conversion postbacks, which also proxy to 127.0.0.1:3000. Conversions begin to back up and respond slowly. logEvent latency: 152ms → 472ms (3× spike at 13:00).

4. Ephemeral port pool drains over 98 minutes (13:00 → 14:38)

Nginx opens a new TCP connection to port 3000 for every request. Slow/hanging responses mean connections linger in TIME_WAIT for 60s. Linux default port range 32768–60999 = ~28K ports. Both tile searches and slow conversion connections contribute to drain.

5. 14:38:28 UTC — Port exhaustion: total collapse, all vhosts

Every new nginx → port 3000 connection: errno 99: Cannot assign requested address. 5.36M [crit] errors in 2h22m. tiles.desh-api.com, conversions.desh.app, addons-api.desh.app all fail server-wide.

6. Mitigation: disable /conversions/mobavenue → CPU drops → partial recovery

Removing one stream of inflight requests from the saturated event loop reduced CPU load. Workers processed the remaining backlog faster. TIME_WAIT connections expired, freeing ports. The shopping API failure was still present, but with fewer concurrent requests the server could limp through. Not a fix — only reduces the amplification.
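The port-pool arithmetic in step 4 can be sanity-checked with a few lines. This is a deliberately simplified steady-state model (every closed connection occupies its ephemeral port for the full 60 s TIME_WAIT interval); the port range and TIME_WAIT duration come from the report itself.

```javascript
// Back-of-envelope check of the port-drain numbers in step 4.
// Model: a closed connection pins its ephemeral port for the whole
// TIME_WAIT interval, so steady-state occupancy = connection rate × 60 s.
const portRange = 60999 - 32768 + 1;   // Linux default range: 28,232 ports (~28K)
const timeWaitSec = 60;

// Sustained new-connection rate at which every port sits in TIME_WAIT:
const exhaustingRatePerSec = portRange / timeWaitSec; // ≈ 470 conns/s
```

Under this model, roughly 470 new connections per second to port 3000, sustained, is enough to pin the entire range — the 98-minute ramp on April 16 is consistent with the connection rate climbing toward that level rather than starting there.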

💡 Why Disabling Conversions Helped (But Isn't a Fix)

The /conversions/mobavenue route proxies to 127.0.0.1:3000 — the same overloaded Node.js workers. When the shopping API failure drove CPU to ~100%, every additional request on port 3000 (including conversions) made recovery harder. Disabling the route reduced the total request count on the saturated workers, giving them enough slack to:

  • Process and drain some of the hanging requests faster
  • Slow down the rate of new TIME_WAIT connections accumulating
  • Allow the ephemeral port pool to partially recover

⚠ This will not prevent the next incident. When the shopping API fails again, the server will reach the same state — just slightly more slowly without conversion traffic. The only real fix is a circuit breaker on getShoppingAutocompleteSuggestions.

Incident Timeline (April 16)

▶ Phase 1 — Slow Degradation (13:00 – 14:38 UTC)

12:00 – 12:59 — Healthy Baseline
Nginx: 53,154 × 499, 30 × 502. v5Tiles latency 200–250ms. Shopping timeouts low and occasional.

12:52:00 — Early Signals: Campaign Builder Failures
Cron builder: OEMW API unreachable (ENOTFOUND), xyads returns 400. At 12:52:50 — first burst of 8 shopping timeouts on server-8. Possible early degradation.

13:00:00 UTC ▲ TRIGGER — Shopping API Fails
  • Timeouts: 2 in prev 5 min → 59 in first minute (server-8); 9× across all 22 workers
  • Shopping server CPU: normal — external/connectivity failure on shopping side
  • logEvent latency: 152ms → 472ms (3× in under 4 minutes)
  • v5Tiles latency: 200ms → 538–638ms

13:00 – 14:37 — Workers Filling Up; Everything on Port 3000 Slows
Node.js: 6,502 shopping timeouts/hr per worker × 22 workers. Nginx 499s: 89,142 (+68%). Nginx 502s: 218 (7× baseline). Conversion postbacks on port 3000 also slowing as the event loop saturates.

14:23:35 — /health Endpoint Aborts: 15-Minute Warning
Server too busy to respond to its own health probes. RequestAborted on GET /health, 15 minutes before full port exhaustion.

⚠ Phase 2 — Total Collapse via Port Exhaustion (14:38 – 17:00 UTC)

14:38:28 UTC ▲▲ COLLAPSE — Ephemeral Port Pool Exhausted
Port range 32768–60999 (~28K ports) fully consumed. Every nginx → port 3000 connection fails with errno 99. All vhosts fail simultaneously: tiles, conversions, addons-api, tiles-exp.

14:38 – 16:00 — 5.36M Port Exhaustion Errors
502s: 1.8M/hr (14:xx) + 2.7M/hr (15:xx). 499s: 2.8M/hr + 4.5M/hr. Peak 138× baseline. ~2,354 conversion postbacks (Mobavenue, VeveAPI) fail — revenue attribution at risk.

✓ Recovery

~16:00 – 17:00 — Natural Recovery
Shopping API recovered. Hung requests drained. TIME_WAIT connections expired, ports freed. By 17:00: 502s = 17/hr — fully resolved. On April 18 the same cascade occurred; recovery was triggered by disabling the /conversions/mobavenue route.

📊 Nginx Access Log — Hourly Breakdown (April 16)

| Hour (UTC) | 499 Client Abort | 502 Bad Gateway | 504 Timeout | Total Errors | vs Baseline |
|---|---|---|---|---|---|
| 12:xx normal | 53,154 | 30 | 0 | 53,184 | — |
| 13:xx onset | 89,142 | 218 | 0 | 89,360 | 1.7× |
| 14:xx ▲ collapse | 2,846,545 | 1,807,956 | 90,393 | 4,744,894 | 89× |
| 15:xx ▲▲ PEAK | 4,469,773 | 2,718,736 | 148,440 | 7,336,949 | 138× |
| 16:xx ▼ recovery | 1,411,446 | 835,252 | 51,465 | 2,298,163 | 43× |
| 17:xx ▼▼ resolved | 63,323 | 17 | 0 | 63,340 | 1.2× |

💥 Port Exhaustion — Confirmed Evidence

The 5.36M [crit] entries are NOT SSL errors — confirmed as errno 99: Cannot assign requested address. First hit at 14:38:28 UTC:

2026/04/16 14:38:28 [crit] connect() to 127.0.0.1:3000 failed
  (99: Cannot assign requested address) while connecting to upstream,
  request: "POST /tiles/search HTTP/2.0", server: tiles.desh-api.com
| Vhost | Port Exhaustion Errors | Note |
|---|---|---|
| tiles.desh-api.com | 5,361,419 | Primary — tile searches |
| addons-api.desh.app | 3,683 | Collateral |
| conversions.desh.app | 2,354 | Collateral — conversion postbacks also lost |
| tiles-exp + tiles-exp1 | 175 | Collateral |
| Total | 5,367,631 | |

Ruled Out

| Theory | Status | Evidence |
|---|---|---|
| Shopping server overload (CPU) | Ruled out | Shopping server CPU was normal during the incident — the failure was on the shopping server's side (network/app), not resource exhaustion |
| Mobavenue conversions caused the overload | Ruled out | April 18 confirmed the same shopping API timeout as the trigger. Conversions were slowed by the overload, not causing it; disabling them reduced load but did not touch the cause |
| 5.36M [crit] = SSL errors | Ruled out | Confirmed as errno 99: Cannot assign requested address — port exhaustion, not SSL |
| OOM / process crash | Ruled out | All 22 workers ran 00:00–23:59 without restart; no OOM events in syslog |
| Database / Redis issue | Ruled out | No DB-related errors in any log file |

Recommendations

Immediate
1. Circuit breaker on getShoppingAutocompleteSuggestions

This is the only real fix. Trip after N consecutive failures and return [] without calling the external API. Tile searches complete immediately, no hanging connections, no CPU saturation, no port pool drain. Without this, the incident will recur on every shopping API failure.
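A minimal sketch of the breaker described above, in plain Node.js. The class name, thresholds, and wiring are illustrative assumptions — the report does not show the actual client code around getShoppingAutocompleteSuggestions.

```javascript
// Minimal circuit-breaker sketch (hypothetical names and thresholds).
// Trips after N consecutive failures; while open, returns the fallback
// immediately instead of calling the external API.
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetAfterMs = 30_000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.consecutiveFailures = 0;
    this.openedAt = null; // non-null while the breaker is open
  }

  isOpen() {
    if (this.openedAt === null) return false;
    // Half-open: allow a trial call once the cool-down has elapsed.
    if (Date.now() - this.openedAt >= this.resetAfterMs) {
      this.openedAt = null;
      return false;
    }
    return true;
  }

  async call(fn, fallback) {
    if (this.isOpen()) return fallback; // fail fast: no hanging connection
    try {
      const result = await fn();
      this.consecutiveFailures = 0; // any success closes the breaker
      return result;
    } catch (err) {
      if (++this.consecutiveFailures >= this.failureThreshold) {
        this.openedAt = Date.now();
      }
      return fallback; // degrade gracefully instead of propagating the hang
    }
  }
}
```

Used as, say, `breaker.call(() => getShoppingAutocompleteSuggestions(query), [])`, a tile search completes immediately with empty suggestions during a shopping outage — no pending callback accumulates on the worker and no connection lingers toward TIME_WAIT.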

Immediate
2. Shorten shopping API timeout to 1–2 seconds

Current timeout allows thousands of hanging calls to accumulate per worker before releasing. A 1–2s timeout means each failed call frees its worker slot quickly, dramatically slowing port pool drain even without a circuit breaker.
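One way to enforce that budget at the call site is a promise race; the helper name and the 1–2 s figure applied in the usage line are illustrative, not the server's actual configuration. Note that this releases the awaiting request handler but does not by itself abort the underlying socket — pairing it with an AbortController or the HTTP client's own timeout closes the connection too.

```javascript
// Sketch: cap an external call at a short deadline so a hung dependency
// releases its worker slot quickly instead of accumulating.
function withTimeout(promise, ms, fallback) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Hypothetical usage: `const suggestions = await withTimeout(getShoppingAutocompleteSuggestions(query), 1500, []);`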

Immediate
3. Investigate the shopping API failure root cause

Same failure triggered both April 16 and April 18. The shopping server CPU was normal both times — the failure is connectivity or application-level on the shopping side. Until identified and fixed, this will recur. Check network routes, DNS, TLS config, and shopping server app logs at 13:00 on both dates.

High
4. Widen ephemeral port range + enable TIME_WAIT reuse

Keepalive was active and port exhaustion still happened. Widening the range delays exhaustion but doesn't prevent it — only the circuit breaker does. Still worth doing as a secondary defence.

# /etc/sysctl.conf
net.ipv4.ip_local_port_range = 15000 65535   # ~50K ports; avoids conflict with listening services
net.ipv4.tcp_tw_reuse = 1                    # reuse TIME_WAIT sockets for new outbound connections
Already active
5. ✓ nginx keepalive — was active during both incidents, did not prevent port exhaustion
upstream api {
    server 127.0.0.1:3000;
    keepalive 128;
    keepalive_requests 10000;
    keepalive_timeout 100s;
}

Why keepalive didn't help here: keepalive 128 means up to 128 idle connections are pooled per nginx worker. When the shopping API hangs, those 128 connections immediately fill with in-flight requests that never complete — they are not idle and cannot be reused. Every new request beyond 128 per worker opens a fresh connection to port 3000, which cycles through TIME_WAIT on close. At incident-level traffic volumes this still exhausts ~28K ports in ~98 minutes. This confirms the circuit breaker (rec #1) is the only effective protection.

High
6. Alert on /health RequestAborted

Health endpoint aborting at 14:23 was a 15-minute early warning before full collapse at 14:38. An alert here enables manual intervention — including disabling the conversions route — before the port pool exhausts.

High
7. Alert on 502 error rate

At 13:xx, 502s were 218 — already 7× baseline and rising. An alert at “>500 502s in 5 minutes” would fire at incident onset. Manual intervention in Phase 1 (disabling conversions) is far faster than waiting for full collapse.
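The “>500 502s in 5 minutes” rule amounts to a sliding-window counter. A sketch, with the class name and the feeding mechanism (a log tailer or nginx status exporter would call record502) as assumptions:

```javascript
// Sliding-window alert rule: fire when more than `threshold` 502s
// land within the trailing `windowMs` milliseconds.
class RateAlert {
  constructor({ threshold = 500, windowMs = 5 * 60_000 } = {}) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.timestamps = []; // one entry per observed 502, oldest first
  }

  // Returns true when the alert should fire for this event.
  record502(now = Date.now()) {
    this.timestamps.push(now);
    const cutoff = now - this.windowMs;
    // Drop events that have slid out of the window.
    while (this.timestamps.length && this.timestamps[0] < cutoff) {
      this.timestamps.shift();
    }
    return this.timestamps.length > this.threshold;
  }
}
```

With the report's numbers, this rule stays silent through the 12:xx baseline (30 × 502/hr) and fires during the 13:xx onset, well before the 14:38 collapse.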

High
8. Add Node.js event loop lag monitoring

Event loop lag >100ms signals event loop saturation before it becomes critical. Export to Grafana. Would have fired at 13:00 when logEvent latency tripled.

Medium
9. Fix Branch API SSL issues

error.log shows SSL handshake failures to 18.172.64.113 (Branch API) since March 22. Unrelated to this incident but should be resolved.

📄 Files Analyzed

| File | Size | Key finding |
|---|---|---|
| access-tiles.desh-api.com.log | 4.2 GB | Hourly status code breakdown |
| access-conversions.desh.app.log | | Mobavenue volume steady — no spike at 13:00 |
| error-conversions.desh.app.log | 1.8 MB | Confirmed /conversions/mobavenue → port 3000 |
| error-tiles.desh-api.com.log | 1.8 GB | 5.36M port exhaustion [crit] entries |
| mobavenue-conversions.log | 15,843 lines | Postback detail |
| combined-addons-api-server-{8–29}-2026-04-16.log | ~210 MB × 22 | Shopping timeout spike: 2 → 59 in first minute at 13:00 |
| combined-addons-analytics-server-{0–15}-2026-04-16.log | ~4 MB × 16 | logEvent 152ms → 472ms at 13:00 (3× spike) |
| combined-addons-api-cron-0-2026-04-16.log | | OEMW ENOTFOUND at 12:52 — early signal |
| syslog | 15 GB | No OOM, no SIGKILL, no process restart |