P0 Incident — Recurring

Incident Report: April 16 & 18, 2026

tiles.desh-api.com — Root Cause Analysis (Updated April 20, 2026)

Server: desh-addons-server
First incident: Apr 16 · 13:00 – 17:00 UTC
Recurrence: Apr 18 · same pattern
Mitigation: disable /conversions/mobavenue
Status: ⚠ Mitigated — not fixed
  • 5.73M — nginx errors on April 16
  • 5.36M — port exhaustion [crit] errors
  • 138× — peak error rate vs baseline
  • 2 — times this incident has occurred (Apr 16 & Apr 18)

⚠ Recurring Incident — Same Root Cause Both Times

The incident recurred on April 18, 2026, with the same shopping API timeout as the trigger. Both times, disabling /conversions/mobavenue restored CPU and mitigated the outage — but conversions were a victim of the overload, not a cause. The permanent fix is a circuit breaker on the shopping API call; until it ships, the incident will recur.

! Executive Summary

🔄 Causality corrected (April 20): On April 18, the same shopping API timeout occurred. Disabling the /conversions/mobavenue route freed up CPU, confirming that conversions were being affected by the shopping API overload — not causing it. The conversions route was a victim of the saturated event loop, and removing it reduced load on the already-struggling workers.

Root cause: a single unprotected external dependency — the shopping autocomplete API.

At 13:00:00 UTC on both April 16 and April 18, getShoppingAutocompleteSuggestions began timing out on every call. The shopping server itself showed normal CPU — the failure was network or application-level on the shopping server side. With no circuit breaker, every incoming tile search request that included a shopping component spawned a hanging async call. All 22 Node.js workers began accumulating these pending operations, CPU climbed to ~100%, and the entire port 3000 event loop became saturated.

Once the event loop was saturated, everything on port 3000 slowed down — including Mobavenue conversion postbacks, which also proxy to 127.0.0.1:3000. Conversions were a secondary victim of the overload, not a contributor to it. Disabling the conversion route worked by reducing the total number of inflight requests on the saturated workers, giving them enough breathing room to process the backlog and prevent further port pool drain.

At 14:38:28 UTC, the Linux ephemeral port pool (~28K ports) was fully exhausted, causing errno 99: Cannot assign requested address on every new nginx → port 3000 connection and a server-wide total collapse.

🔗 Root Cause Chain

1. 13:00:00 UTC — Shopping autocomplete API fails

getShoppingAutocompleteSuggestions times out on every call. Shopping server CPU was normal — the failure was connectivity or application-level on the shopping server side. Same trigger on both April 16 and April 18.

2. 22 Node.js workers accumulate hanging tile search requests

No circuit breaker means each tile search with a shopping component spawns a call that never resolves. Pending async callbacks pile up across all workers. CPU climbs toward 100%.

3. Event loop saturation slows everything on port 3000 (😖 secondary victim)

With CPU at ~100% and the event loop saturated, all requests on port 3000 slow down — including Mobavenue conversion postbacks, which also proxy to 127.0.0.1:3000. Conversions begin to back up and respond slowly. logEvent latency: 152ms → 472ms (3× spike at 13:00).

4. Ephemeral port pool drains over 98 minutes (13:00 → 14:38)

Nginx opens a new TCP connection to port 3000 for every request. Slow/hanging responses mean connections linger in TIME_WAIT for 60s. Linux default port range 32768–60999 = ~28K ports. Both tile searches and slow conversion connections contribute to drain.

5. 14:38:28 UTC — Port exhaustion: total collapse, all vhosts

Every new nginx → port 3000 connection: errno 99: Cannot assign requested address. 5.36M [crit] errors in 2h22m. tiles.desh-api.com, conversions.desh.app, addons-api.desh.app all fail server-wide.

6. Mitigation: disable /conversions/mobavenue → CPU drops → partial recovery

Removing one stream of inflight requests from the saturated event loop reduced CPU load. Workers processed the remaining backlog faster. TIME_WAIT connections expired, freeing ports. The shopping API failure was still present, but with fewer concurrent requests the server could limp through. Not a fix — only reduces the amplification.
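The port-pool arithmetic in step 4 can be sanity-checked with a few lines. This is a deliberately simplified steady-state model (every closed connection occupies its ephemeral port for the full 60 s TIME_WAIT interval); the port range and TIME_WAIT duration come from the report itself.

```javascript
// Back-of-envelope check of the port-drain numbers in step 4.
// Model: a closed connection pins its ephemeral port for the whole
// TIME_WAIT interval, so steady-state occupancy = connection rate × 60 s.
const portRange = 60999 - 32768 + 1;   // Linux default range: 28,232 ports (~28K)
const timeWaitSec = 60;

// Sustained new-connection rate at which every port sits in TIME_WAIT:
const exhaustingRatePerSec = portRange / timeWaitSec; // ≈ 470 conns/s
```

Under this model, roughly 470 new connections per second to port 3000, sustained, is enough to pin the entire range — the 98-minute ramp on April 16 is consistent with the connection rate climbing toward that level rather than starting there.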

💡 Why Disabling Conversions Helped (But Isn't a Fix)

The /conversions/mobavenue route proxies to 127.0.0.1:3000 — the same overloaded Node.js workers. When the shopping API failure drove CPU to ~100%, every additional request on port 3000 (including conversions) made recovery harder. Disabling the route reduced the total request count on the saturated workers, giving them enough slack to:

  • Process and drain some of the hanging requests faster
  • Slow down the rate of new TIME_WAIT connections accumulating
  • Allow the ephemeral port pool to partially recover

⚠ This will not prevent the next incident. When the shopping API fails again, the server will reach the same state — just slightly more slowly without conversion traffic. The only real fix is a circuit breaker on getShoppingAutocompleteSuggestions.

Incident Timeline (April 16)

▶ Phase 1 — Slow Degradation (13:00 – 14:38 UTC)

12:00 – 12:59 — Healthy Baseline
Nginx: 53,154 × 499, 30 × 502. v5Tiles latency 200–250ms. Shopping timeouts low and occasional.

12:52:00 — Early Signals: Campaign Builder Failures
Cron builder: OEMW API unreachable (ENOTFOUND), xyads returns 400. At 12:52:50 — first burst of 8 shopping timeouts on server-8. Possible early degradation.

13:00:00 UTC ▲ TRIGGER — Shopping API Fails
  • Timeouts: 2 in prev 5 min → 59 in first minute (server-8); 9× across all 22 workers
  • Shopping server CPU: normal — external/connectivity failure on shopping side
  • logEvent latency: 152ms → 472ms (3× in under 4 minutes)
  • v5Tiles latency: 200ms → 538–638ms

13:00 – 14:37 — Workers Filling Up; Everything on Port 3000 Slows
Node.js: 6,502 shopping timeouts/hr per worker × 22 workers. Nginx 499s: 89,142 (+68%). Nginx 502s: 218 (7× baseline). Conversion postbacks on port 3000 also slowing as the event loop saturates.

14:23:35 — /health Endpoint Aborts: 15-Minute Warning
Server too busy to respond to its own health probes. RequestAborted on GET /health, 15 minutes before full port exhaustion.

⚠ Phase 2 — Total Collapse via Port Exhaustion (14:38 – 17:00 UTC)

14:38:28 UTC ▲▲ COLLAPSE — Ephemeral Port Pool Exhausted
Port range 32768–60999 (~28K ports) fully consumed. Every nginx → port 3000 connection fails with errno 99. All vhosts fail simultaneously: tiles, conversions, addons-api, tiles-exp.

14:38 – 16:00 — 5.36M Port Exhaustion Errors
502s: 1.8M/hr (14:xx) + 2.7M/hr (15:xx). 499s: 2.8M/hr + 4.5M/hr. Peak 138× baseline. ~2,354 conversion postbacks (Mobavenue, VeveAPI) fail — revenue attribution at risk.

✓ Recovery

~16:00 – 17:00 — Natural Recovery
Shopping API recovered. Hung requests drained. TIME_WAIT connections expired, ports freed. By 17:00: 502s = 17/hr — fully resolved. On April 18 the same cascade occurred; recovery was triggered by disabling the /conversions/mobavenue route.

📊 Nginx Access Log — Hourly Breakdown (April 16)

| Hour (UTC) | 499 Client Abort | 502 Bad Gateway | 504 Timeout | Total Errors | vs Baseline |
|---|---|---|---|---|---|
| 12:xx normal | 53,154 | 30 | 0 | 53,184 | — |
| 13:xx onset | 89,142 | 218 | 0 | 89,360 | 1.7× |
| 14:xx ▲ collapse | 2,846,545 | 1,807,956 | 90,393 | 4,744,894 | 89× |
| 15:xx ▲▲ PEAK | 4,469,773 | 2,718,736 | 148,440 | 7,336,949 | 138× |
| 16:xx ▼ recovery | 1,411,446 | 835,252 | 51,465 | 2,298,163 | 43× |
| 17:xx ▼▼ resolved | 63,323 | 17 | 0 | 63,340 | 1.2× |

💥 Port Exhaustion — Confirmed Evidence

The 5.36M [crit] entries are NOT SSL errors — confirmed as errno 99: Cannot assign requested address. First hit at 14:38:28 UTC:

2026/04/16 14:38:28 [crit] connect() to 127.0.0.1:3000 failed
  (99: Cannot assign requested address) while connecting to upstream,
  request: "POST /tiles/search HTTP/2.0", server: tiles.desh-api.com
| Vhost | Port Exhaustion Errors | Note |
|---|---|---|
| tiles.desh-api.com | 5,361,419 | Primary — tile searches |
| addons-api.desh.app | 3,683 | Collateral |
| conversions.desh.app | 2,354 | Collateral — conversion postbacks also lost |
| tiles-exp + tiles-exp1 | 175 | Collateral |
| Total | 5,367,631 | |

Ruled Out

| Theory | Status | Evidence |
|---|---|---|
| Shopping server overload (CPU) | Ruled out | Shopping server CPU was normal during the incident — the failure was on the shopping server's side (network/app), not resource exhaustion |
| Mobavenue conversions caused the overload | Ruled out | April 18 confirmed the same shopping API timeout as the trigger. Conversions were slowed by the overload, not causing it; disabling them reduced load but did not touch the cause |
| 5.36M [crit] = SSL errors | Ruled out | Confirmed as errno 99: Cannot assign requested address — port exhaustion, not SSL |
| OOM / process crash | Ruled out | All 22 workers ran 00:00–23:59 without restart; no OOM events in syslog |
| Database / Redis issue | Ruled out | No DB-related errors in any log file |

Recommendations

Immediate
1. Circuit breaker on getShoppingAutocompleteSuggestions

This is the only real fix. Trip after N consecutive failures and return [] without calling the external API. Tile searches complete immediately, no hanging connections, no CPU saturation, no port pool drain. Without this, the incident will recur on every shopping API failure.
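A minimal sketch of the breaker described above, in plain Node.js. The class name, thresholds, and wiring are illustrative assumptions — the report does not show the actual client code around getShoppingAutocompleteSuggestions.

```javascript
// Minimal circuit-breaker sketch (hypothetical names and thresholds).
// Trips after N consecutive failures; while open, returns the fallback
// immediately instead of calling the external API.
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetAfterMs = 30_000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.consecutiveFailures = 0;
    this.openedAt = null; // non-null while the breaker is open
  }

  isOpen() {
    if (this.openedAt === null) return false;
    // Half-open: allow a trial call once the cool-down has elapsed.
    if (Date.now() - this.openedAt >= this.resetAfterMs) {
      this.openedAt = null;
      return false;
    }
    return true;
  }

  async call(fn, fallback) {
    if (this.isOpen()) return fallback; // fail fast: no hanging connection
    try {
      const result = await fn();
      this.consecutiveFailures = 0; // any success closes the breaker
      return result;
    } catch (err) {
      if (++this.consecutiveFailures >= this.failureThreshold) {
        this.openedAt = Date.now();
      }
      return fallback; // degrade gracefully instead of propagating the hang
    }
  }
}
```

Used as, say, `breaker.call(() => getShoppingAutocompleteSuggestions(query), [])`, a tile search completes immediately with empty suggestions during a shopping outage — no pending callback accumulates on the worker and no connection lingers toward TIME_WAIT.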

Immediate
2. Shorten shopping API timeout to 1–2 seconds

Current timeout allows thousands of hanging calls to accumulate per worker before releasing. A 1–2s timeout means each failed call frees its worker slot quickly, dramatically slowing port pool drain even without a circuit breaker.
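One way to enforce that budget at the call site is a promise race; the helper name and the 1–2 s figure applied in the usage line are illustrative, not the server's actual configuration. Note that this releases the awaiting request handler but does not by itself abort the underlying socket — pairing it with an AbortController or the HTTP client's own timeout closes the connection too.

```javascript
// Sketch: cap an external call at a short deadline so a hung dependency
// releases its worker slot quickly instead of accumulating.
function withTimeout(promise, ms, fallback) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Hypothetical usage: `const suggestions = await withTimeout(getShoppingAutocompleteSuggestions(query), 1500, []);`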

Immediate
3. Investigate the shopping API failure root cause

Same failure triggered both April 16 and April 18. The shopping server CPU was normal both times — the failure is connectivity or application-level on the shopping side. Until identified and fixed, this will recur. Check network routes, DNS, TLS config, and shopping server app logs at 13:00 on both dates.

High
4. Widen ephemeral port range + enable TIME_WAIT reuse

Keepalive was active and port exhaustion still happened. Widening the range delays exhaustion but doesn't prevent it — only the circuit breaker does. Still worth doing as a secondary defence.

# /etc/sysctl.conf
net.ipv4.ip_local_port_range = 15000 65535   # ~50K ports; avoids conflict with listening services
net.ipv4.tcp_tw_reuse = 1                    # reuse TIME_WAIT sockets for new outbound connections
Already active
5. ✓ nginx keepalive — was active during both incidents, did not prevent port exhaustion
upstream api {
    server 127.0.0.1:3000;
    keepalive 128;
    keepalive_requests 10000;
    keepalive_timeout 100s;
}

Why keepalive didn't help here: keepalive 128 means up to 128 idle connections are pooled per nginx worker. When the shopping API hangs, those 128 connections immediately fill with in-flight requests that never complete — they are not idle and cannot be reused. Every new request beyond 128 per worker opens a fresh connection to port 3000, which cycles through TIME_WAIT on close. At incident-level traffic volumes this still exhausts ~28K ports in ~98 minutes. This confirms the circuit breaker (rec #1) is the only effective protection.

High
6. Alert on /health RequestAborted

Health endpoint aborting at 14:23 was a 15-minute early warning before full collapse at 14:38. An alert here enables manual intervention — including disabling the conversions route — before the port pool exhausts.

High
7. Alert on 502 error rate

At 13:xx, 502s were 218 — already 7× baseline and rising. An alert at “>500 502s in 5 minutes” would fire at incident onset. Manual intervention in Phase 1 (disabling conversions) is far faster than waiting for full collapse.
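The “>500 502s in 5 minutes” rule amounts to a sliding-window counter. A sketch, with the class name and the feeding mechanism (a log tailer or nginx status exporter would call record502) as assumptions:

```javascript
// Sliding-window alert rule: fire when more than `threshold` 502s
// land within the trailing `windowMs` milliseconds.
class RateAlert {
  constructor({ threshold = 500, windowMs = 5 * 60_000 } = {}) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.timestamps = []; // one entry per observed 502, oldest first
  }

  // Returns true when the alert should fire for this event.
  record502(now = Date.now()) {
    this.timestamps.push(now);
    const cutoff = now - this.windowMs;
    // Drop events that have slid out of the window.
    while (this.timestamps.length && this.timestamps[0] < cutoff) {
      this.timestamps.shift();
    }
    return this.timestamps.length > this.threshold;
  }
}
```

With the report's numbers, this rule stays silent through the 12:xx baseline (30 × 502/hr) and fires during the 13:xx onset, well before the 14:38 collapse.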

High
8. Add Node.js event loop lag monitoring

Event loop lag >100ms signals event loop saturation before it becomes critical. Export to Grafana. Would have fired at 13:00 when logEvent latency tripled.

Medium
9. Fix Branch API SSL issues

error.log shows SSL handshake failures to 18.172.64.113 (Branch API) since March 22. Unrelated to this incident but should be resolved.

📄 Files Analyzed

| File | Size | Key finding |
|---|---|---|
| access-tiles.desh-api.com.log | 4.2 GB | Hourly status code breakdown |
| access-conversions.desh.app.log | | Mobavenue volume steady — no spike at 13:00 |
| error-conversions.desh.app.log | 1.8 MB | Confirmed /conversions/mobavenue → port 3000 |
| error-tiles.desh-api.com.log | 1.8 GB | 5.36M port exhaustion [crit] entries |
| mobavenue-conversions.log | 15,843 lines | Postback detail |
| combined-addons-api-server-{8–29}-2026-04-16.log | ~210 MB × 22 | Shopping timeout spike: 2 → 59 in first minute at 13:00 |
| combined-addons-analytics-server-{0–15}-2026-04-16.log | ~4 MB × 16 | logEvent 152ms → 472ms at 13:00 (3× spike) |
| combined-addons-api-cron-0-2026-04-16.log | | OEMW ENOTFOUND at 12:52 — early signal |
| syslog | 15 GB | No OOM, no SIGKILL, no process restart |