On the morning of July 19, 2024, three of Australia's largest betting operators — Tabcorp, Sportsbet, and Ladbrokes — went dark within minutes of each other. None of them had pushed a bad deploy. None of them had a security breach. The cause sat entirely outside their own codebases, inside a security vendor's routine update to software running on millions of machines they didn't write a line of code for.
We've already written about this shape of problem once, in our breakdown of Final Fantasy XIV's 2021 login crisis — a case where the real constraint was a global chip shortage that Square Enix had no control over. This is the same category of failure, but faster, more sudden, and arguably more dangerous: a single vendor's mistake, pushed automatically, with zero warning and zero opportunity to test it first.
TL;DR
• CrowdStrike, July 2024: a flawed security update crashed 8.5 million Windows machines worldwide in under 90 minutes, including the systems behind Tabcorp, Sportsbet, and Ladbrokes simultaneously.
• AWS, October 2025: an internal DNS race condition inside DynamoDB took down a wide swath of the internet for hours, including Fortnite and Roblox, alongside Disney+, Reddit, and a Premier League broadcast.
• The fix being fast didn't make recovery fast. CrowdStrike reverted its bad update in 78 minutes, but every machine that had already crashed needed a person, physically, to boot into Safe Mode and delete a file by hand.
• The shared lesson: you can't patch your way out of a dependency you don't control. You can only decide, in advance, how much blast radius one vendor's bad day is allowed to have.
8.5M Windows devices crashed worldwide
78 min to revert the update (recovery still took days)
$5.4B+ estimated direct cost to Fortune 500 firms
The bookmakers: CrowdStrike takes down three operators at once
At 04:09 UTC on July 19, 2024, CrowdStrike pushed a routine configuration update, a "Channel File," to Windows machines running its Falcon security sensor. The update was meant to improve detection of a specific attack technique. Instead, it exposed a mismatch: the template behind the update defined 21 input fields, but the sensor code supplied only 20 values. Reading the missing 21st value triggered an out-of-bounds memory read inside Falcon's kernel-level driver, which crashed every Windows machine that loaded the update, immediately and on every subsequent boot attempt, because the driver loads early in the startup sequence.
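The failure class here, a reader that expects 21 fields and is handed 20, can be sketched in a few lines. This is an illustration only: Falcon's driver is kernel-mode native code, and every name below (`EXPECTED_FIELDS`, `read_field`, `validate`) is hypothetical. What it shows is the bounds check that turns an out-of-range read into a rejected update rather than a crash.

```python
EXPECTED_FIELDS = 21  # field count the reader assumes (from the incident description)


class MalformedContentError(ValueError):
    """Raised when content does not match the shape the reader expects."""


def read_field(values: list[str], index: int) -> str:
    """Return values[index], refusing to read past the supplied data."""
    if not 0 <= index < len(values):
        raise MalformedContentError(
            f"field {index} requested but only {len(values)} supplied"
        )
    return values[index]


def validate(values: list[str]) -> bool:
    """Accept content only if its field count matches what the reader assumes."""
    return len(values) == EXPECTED_FIELDS


shipped = [f"v{i}" for i in range(20)]  # only 20 values actually present
ok = validate(shipped)                  # False: reject before it is ever loaded
```

In a memory-unsafe language the unchecked equivalent of `read_field(shipped, 20)` reads whatever happens to sit past the end of the array, which is the kind of read the paragraph above describes.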
Roughly 8.5 million Windows devices crashed within the hour, by Microsoft's own count. Tabcorp and Sportsbet — together responsible for more than 70% of Australia's wagering market — went down alongside Ladbrokes. Betting stopped entirely, online and in retail outlets. Tote price finalization froze mid-calculation, which meant payouts on bets already placed couldn't be settled until the underlying systems came back. Both operators publicly attributed the outage to "a global external technical issue," which was accurate — neither had any path to fix it themselves.
What makes this case distinct from a typical outage is what happened after CrowdStrike found the bug. The company reverted the faulty update at 05:27 UTC — 78 minutes after it shipped. In a normal software incident, that's the end of the story: bad deploy rolled back, service restored. Here, it wasn't. Every machine that had already crashed was stuck in a boot loop, because the damage was done locally on each device before the revert ever reached it. Recovery required someone to physically access each affected machine, boot into Safe Mode, locate a specific system file, and delete it by hand — one machine at a time, sometimes complicated further by BitLocker disk encryption requiring a separate recovery key. For organizations with thousands of endpoints, that's not a fix measured in minutes. It's a fix measured in however many hands you have available.
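A rough back-of-envelope model shows why the revert time and the recovery time diverge. The figures used here (5,000 endpoints, 20 technicians, 15 minutes per machine) are hypothetical and not drawn from any affected operator; the point is only that the result scales with headcount, not with how fast the vendor moved.

```python
import math


def recovery_hours(endpoints: int, technicians: int, minutes_per_machine: float) -> float:
    """Wall-clock hours to touch every machine, with technicians working in parallel."""
    if technicians <= 0:
        raise ValueError("need at least one technician")
    rounds = math.ceil(endpoints / technicians)  # machines each technician must handle
    return rounds * minutes_per_machine / 60


baseline = recovery_hours(endpoints=5000, technicians=20, minutes_per_machine=15)
doubled = recovery_hours(endpoints=5000, technicians=40, minutes_per_machine=15)
```

Under these assumed numbers, `baseline` is 62.5 hours and `doubled` is 31.25 hours. A BitLocker recovery-key lookup would raise `minutes_per_machine` and stretch both.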
The game: an AWS database failure takes down Fortnite and Roblox
On October 20, 2025, a separate but structurally identical story played out in the gaming industry. Amazon's DynamoDB — a managed database service that much of the internet quietly depends on, often without realizing how deeply — suffered a DNS failure in its largest region, US-East-1. The proximate cause, per AWS's own postmortem, was a race condition: an internal system called a DNS Enactor that updates DynamoDB's DNS records ran unusually slowly for one execution, while a second, parallel Enactor processed updates far faster than normal. The mismatch between the two led to DynamoDB's DNS records effectively being emptied, and every system trying to reach DynamoDB through its public endpoint — including a large share of AWS's own internal services — began failing immediately.
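The general hazard in that description, a slow writer landing after a fast one and clobbering newer state, can be modelled in miniature. This is a generic sketch, not AWS's mechanism or its fix: `DnsRecord`, the generation counter, and the sample addresses are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DnsRecord:
    generation: int = 0
    endpoints: list[str] = field(default_factory=list)

    def apply_unguarded(self, generation: int, endpoints: list[str]) -> None:
        """Last writer wins, even if its plan is older than what is live."""
        self.generation, self.endpoints = generation, list(endpoints)

    def apply_guarded(self, generation: int, endpoints: list[str]) -> bool:
        """Apply only plans strictly newer than the live one."""
        if generation <= self.generation:
            return False  # stale plan from a slow writer: ignore it
        self.generation, self.endpoints = generation, list(endpoints)
        return True


record = DnsRecord()
record.apply_guarded(2, ["10.0.0.2"])        # fast writer lands first
stale_applied = record.apply_guarded(1, [])  # slow writer arrives late with an old plan
```

With the guard, the late plan is rejected and the record keeps its endpoints; with `apply_unguarded`, the same sequence leaves the record empty.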
The outage rippled outward in a way that surprised even engineers who consider themselves dependency-aware. Disney+, Reddit, Snapchat, Coinbase, the McDonald's app, and UK government tax services all went down. So did Fortnite and Roblox, reported alongside the others as players found themselves unable to connect. Independent analysis of the incident noted a detail worth sitting with: services that monitor other services' uptime were themselves casualties — status pages built on Atlassian's Statuspage product couldn't be updated, meaning some companies couldn't even tell their own users what was happening, because the tool they'd use to say so depended on the same failing infrastructure.
The outage lasted around three hours before AWS engineers manually intervened to restore DynamoDB's DNS. For a live-service game, three hours during a peak window isn't a minor blip — it's measured in lost engagement, refund requests, and the same kind of player trust erosion we covered when we looked at what happens when sportsbooks go down during the World Cup. The mechanism was completely different — a malicious attack versus an internal race condition inside a trusted vendor's infrastructure — but the experience on the other end of the connection looked the same: the platform isn't responding, and there's nothing the player-facing team can do about it directly.
🧭 Most teams can name their direct vendors. Far fewer can name their vendors' vendors. The AWS incident took down services that didn't think of themselves as AWS-dependent at all: they depended on something that depended on DynamoDB. Gart Solutions' infrastructure audit service is built around mapping that second and third layer of dependency before it becomes a 3 a.m. discovery.
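Second- and third-layer dependencies become visible once the dependency map is treated as a graph and walked. The sketch below is a minimal example under stated assumptions: the service names and edges in `DEPENDS_ON` are invented, and a real map would be assembled from service configs, vendor documentation, and traffic data rather than typed by hand.

```python
from collections import deque

# Hypothetical dependency map: service -> things it calls directly.
DEPENDS_ON = {
    "game-client-api": ["auth", "status-page"],
    "auth": ["session-store"],
    "status-page": ["hosted-status-vendor"],
    "hosted-status-vendor": ["cloud-database"],
    "session-store": ["cloud-database"],
    "cloud-database": [],
}


def transitive_dependencies(service: str, graph: dict[str, list[str]]) -> dict[str, int]:
    """Breadth-first walk returning every reachable dependency and its depth."""
    depth: dict[str, int] = {}
    queue = deque((dep, 1) for dep in graph.get(service, []))
    while queue:
        name, level = queue.popleft()
        if name in depth:
            continue  # already reached by a path at least as short
        depth[name] = level
        queue.extend((dep, level + 1) for dep in graph.get(name, []))
    return depth


deps = transitive_dependencies("game-client-api", DEPENDS_ON)
```

Here `cloud-database` never appears in the direct vendor list for `game-client-api`, yet the walk finds it at depth 3 through two separate paths.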
Why "it wasn't our bug" doesn't help you at 3 a.m.
Both incidents share a structure that's worth naming directly, because it's the part most incident-response planning misses. It isn't just "depend less on third parties" — for any real-time platform, some third-party dependency is unavoidable. The actual lesson is narrower and more actionable:
A vendor's fast fix doesn't guarantee a fast recovery for you. CrowdStrike reverted its bad update in under 90 minutes. That timeline meant almost nothing to organizations whose machines had already crashed, because the recovery step required physical, manual intervention that no amount of vendor speed could shortcut.
Your dependency map is deeper than your vendor list. Plenty of companies hit by the AWS DynamoDB failure didn't think of themselves as exposed to it — they depended on a tool that depended on AWS, two or three layers removed from a decision anyone on their team actually made.
The blast radius is a design choice, even when the bug isn't yours. Whether a single vendor's failure takes down your entire platform or just a degraded subset of features is determined by how much of your stack assumes that vendor will always be there — not by how good the vendor's engineering team is.
"Not our bug" doesn't buy you patience from players or regulators. Tabcorp and Sportsbet were transparent about the external cause, and it didn't make the outage shorter or the customer frustration smaller. The same will be true for a game studio explaining an AWS-shaped outage to a community mid-launch.
🛟 The honest goal isn't eliminating third-party risk: it's bounding it before it's a live incident. Failover paths, degraded-mode design, and a tested incident response plan for "the outage isn't ours but the downtime is" are core to Gart Solutions' SRE practice.
The takeaway for both industries
A sportsbook can't audit CrowdStrike's source code, and a game studio can't audit AWS's internal DNS systems. That's not the point. The point is that both incidents were entirely predictable in shape, if not in timing: any platform with a deep enough dependency on a single vendor will eventually inherit that vendor's worst day, and the only real choices left at that point are how much of your platform that worst day is allowed to take down with it, and how fast a human can actually act once it does.
That's an architecture and incident-response question, not a vendor-selection one — switching vendors just relocates the same risk. The work is in mapping where a single point of failure actually sits in your stack, deciding what degrades gracefully versus what goes dark entirely, and rehearsing the manual recovery steps before you need them at 3 a.m. with thousands of angry players or bettors watching a status page that, ironically, might also be down.
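Deciding what degrades gracefully often comes down to wrapping each vendor call with an explicit fallback. The following is a minimal circuit-breaker sketch, not a production library: `VendorGuard`, its thresholds, and the idea of an injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class VendorGuard:
    """Wrap a vendor call; after repeated failures, serve a fallback for a cooldown."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.open_until: Optional[float] = None

    def call(self, vendor_fn: Callable[[], str], fallback_fn: Callable[[], str]) -> str:
        now = self.clock()
        if self.open_until is not None and now < self.open_until:
            return fallback_fn()  # breaker open: skip the vendor entirely
        try:
            result = vendor_fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open_until = now + self.cooldown_s
            return fallback_fn()
        # Success: reset the breaker.
        self.failures = 0
        self.open_until = None
        return result
```

The fallback might return cached odds, a read-only view, or a holding message. The design choice is that the degraded path is written and rehearsed before the vendor's bad day, so the breaker only has to select it.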
Do you know what happens when your biggest vendor has its worst day?
Gart Solutions maps the dependency chains most teams don't see until they fail, and builds the failover and incident-response plans that bound the damage when they do.
Right now, while the 2026 FIFA World Cup's expanded 48-team tournament plays out across the US, Mexico, and Canada, sports-betting platforms are taking some of the heaviest DDoS pressure they'll see all year. Security researchers tracking the tournament have documented attack traffic against betting platforms climbing steadily through late May, then sharply from June 5 onward as kickoff approached — and on the day before the opening match, a single traffic spike that dwarfed everything before it: over a million requests in one burst, more than three times the previous peak.
That's not a coincidence, and it's not really a new story either. A few weeks ago we published a breakdown of three real, public postmortems from game launches — Fortnite, Final Fantasy XIV, and Helldivers 2 — that all broke under sudden, extreme load. None of those were attacks. They were legitimate demand. But the shape of the failure, and increasingly the shape of the defense required, looks the same whether the traffic wants to hurt you or just wants to play.
TL;DR
• The pattern is identical at the infrastructure layer: a near-vertical request curve with no ramp-up, arriving faster than a human can classify it as malicious or legitimate.
• World Cup sportsbooks (2026): real tracked attacks have hit roughly 18,000 requests per second with zero warm-up, deliberately routed through dozens of countries to defeat geo-blocking.
• Game launches (Fortnite, 2018): the same near-vertical curve, except every request was a real paying player, and it still exhausted AWS instance limits and IP pools just as fast.
• The shared lesson: if your defense depends on a human deciding "is this an attack or just success," you've already lost the seconds that matter.
18,000 requests/sec, zero warm-up
87 sec window before a cascade spreads
70–75% forecast rise in World Cup betting volume
The attack: what's actually hitting sportsbooks this World Cup
Threat researchers monitoring sports-betting platforms during the 2026 World Cup have published a detailed breakdown of the pattern: traffic against one tracked platform spiked to roughly 18,000 requests per second in what's described as a near-vertical wall — no ramp-up, no warm-up period, no gradual escalation. Within seconds of the initial surge, the geographic composition broadens rapidly: an initial spike from Russia-origin traffic is quickly joined by US, German, Indonesian, Singaporean, and a dozen other country sources, each adding hundreds to low thousands of requests per second.
That spread isn't random. Spreading the source footprint across many countries within seconds makes any single-country block largely useless, and researchers note the traffic draws entirely on proxy infrastructure and data centers with an established history of malicious activity — a pre-assembled operation, not opportunistic reuse. None of it reflects a real betting platform's actual user base; a European-regulated sportsbook simply doesn't get organic traffic from a dozen unrelated countries within the same few seconds.
The operational detail that matters most for defenders: researchers estimate roughly 87 seconds between the first signal and the point where the attack cascades broadly enough that manual, human-in-the-loop response is no longer fast enough. Automated, real-time blocking at millisecond latency isn't a nice-to-have here — it's the only posture that has a chance.
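An automated first signal of this kind can be as simple as a sliding window over request rate and source-country spread. The sketch below is one possible shape under stated assumptions: `SpikeDetector` and its thresholds are hypothetical, and real edge mitigation would use adaptive baselines rather than fixed limits.

```python
from collections import deque


class SpikeDetector:
    """Flag a window whose request rate or country spread exceeds fixed limits."""

    def __init__(self, window_s: float, max_rps: float, max_countries: int) -> None:
        self.window_s = window_s
        self.max_rps = max_rps
        self.max_countries = max_countries
        self.events = deque()  # (timestamp, country) pairs, oldest first

    def observe(self, timestamp: float, country: str) -> bool:
        """Record one request; return True if the current window looks anomalous."""
        self.events.append((timestamp, country))
        # Drop requests that have aged out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_s:
            self.events.popleft()
        rps = len(self.events) / self.window_s
        countries = {c for _, c in self.events}
        return rps > self.max_rps or len(countries) > self.max_countries
```

The detector never asks whether the traffic is malicious. It only reports that the window no longer resembles the configured limits, which is enough to trigger rate limiting or scaling while a human catches up.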
And the stakes are specifically tied to the product itself. In-play betting — placing wagers while a match is live — is one of the highest-margin features sportsbooks offer, and it's consistently the first thing to break under load. Industry reporting suggests roughly a third of bets during a major tournament final are placed in-play, and the tolerance for delay is brutal: the difference between a two-second and a five-second response during a key moment isn't a minor glitch, it's a missed bet, a frozen cash-out, and a player who doesn't give the platform a second chance.
The launch: what hit Fortnite at 3.4 million concurrent players
We covered this in detail in our breakdown of three real game-launch postmortems, but it's worth pulling the relevant thread here specifically: when Fortnite hit a then-unprecedented 3.4 million concurrent players in February 2018, part of what broke was strictly a capacity ceiling that had nothing to do with game logic. Epic's own postmortem describes hitting AWS's regional instance limits running on fleets of c4.8xlarge instances, and running out of IP addresses in their standard subnets purely from the pace of scaling — a near-vertical demand curve that exhausted infrastructure quotas in roughly the same shape a coordinated attack would.
The traffic wasn't malicious. Every one of those requests was a real player wanting to play a game they'd already downloaded. But from the perspective of the infrastructure underneath — the load balancers, the connection pools, the cloud provider's regional quotas — a sudden, extreme, geographically broad surge in connections looks remarkably similar whether it's organic enthusiasm or a botnet. The failure mode wasn't "we got attacked." It was "we got more legitimate demand than our quotas and pooling assumptions could absorb fast enough," which is functionally the same shape of problem a DDoS defense exists to handle.
🛡️ This is exactly why DDoS-readiness and launch-readiness end up being the same engineering exercise. Whether the surge is malicious or just successful, the fix is the same: automated, real-time response that doesn't wait on a human classification step. Gart Solutions' security audit service is built around stress-testing exactly this distinction before it's tested for you, live.
Why the same infrastructure has to defend against both
The uncomfortable truth for anyone running a real-time platform — a sportsbook during in-play betting, a game server during a launch spike — is that in the first several seconds, a malicious DDoS surge and a legitimate viral demand spike can look identical at the network layer. Same near-vertical request curve. Same overwhelmed connection pool. Same sudden geographic and behavioral pattern that doesn't match yesterday's baseline.
That's not a reason to give up on telling them apart — it's the reason the first line of defense can't depend on telling them apart at all. The systems that survive both scenarios share the same design properties regardless of which one they're facing:
Elastic capacity that triggers on pattern, not on classification. Autoscaling and rate-limiting need to respond to "this looks anomalous" within seconds, not wait for a security team or a war room to confirm intent.
Geo- and behavior-aware edge mitigation, because both attackers and viral demand show up as traffic shapes that don't match an operator's real, known user base — and that signal is available before anyone's looked at a single request payload.
Quota and connection-pool headroom built for the spike, not the average, because cloud provider regional limits and IP exhaustion don't care whether the requests hitting them are well-intentioned.
A fallback that degrades gracefully rather than falling over completely — queuing, graceful rate-limiting, or a holding page beats a total outage whether the cause is 2 million real fans or 20,000 requests a second from a botnet.
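The graceful-degradation property in the last item is commonly built on an admission control such as a token bucket. This is a minimal sketch, assuming a single process and caller-supplied timestamps; `TokenBucket` and its parameters are illustrative, not any particular product's API.

```python
class TokenBucket:
    """Admit requests up to a sustained rate with a bounded burst; shed the rest."""

    def __init__(self, rate_per_s: float, burst: int) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = 0.0

    def admit(self, now: float) -> bool:
        """Return True to serve the request, False to queue it or show a holding page."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_s=2, burst=3)
first_wave = [bucket.admit(0.0) for _ in range(5)]
```

In this run the first three requests are admitted and the next two are shed. The same decision applies whether the excess is a botnet or a crowd of real fans, which is the property the list above asks for.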
Sportsbooks during a World Cup and game studios during a launch are solving variations of the exact same problem, and most of them are doing it with teams and tooling that were built for one or the other, not both.
📡 The defensive posture that holds up under a real attack is the same one that holds up under real success. Real-time anomaly detection, automated mitigation, and capacity that doesn't wait for a human in the loop are the core of Gart Solutions' SRE practice — built for platforms where the difference between a good night and a very bad one is measured in seconds.
The takeaway for both industries
If you operate a sportsbook, the next major tournament — or even the next big goal in this one — is a live test of whether your platform can tell a coordinated attack from a crowd of real bettors fast enough to matter, without making either group wait. If you run a live-service game, your next content drop or marketing push is the same test wearing a different shirt.
Neither industry should be solving this from scratch. The shape of the problem — sudden, extreme, geographically anomalous traffic that has to be absorbed or mitigated in seconds, not minutes — has been documented publicly, repeatedly, by both sides. The infrastructure that handles it well doesn't ask "is this an attack," it asks "can we absorb or shed this safely either way," and answers that question automatically before a person ever gets paged.
Is your platform ready for its next traffic spike — attack or success?
Gart Solutions runs security and infrastructure audits built around exactly this distinction: real-time, automated readiness for sudden load, whether it's malicious or just means you're winning.
Three real-world postmortems reveal how gaming infrastructure actually fails under launch-scale load — and why traditional scaling assumptions break in production.
Most advice about gaming infrastructure focuses on generic scaling tactics: autoscaling, Kubernetes, load testing, CDNs. While all of these matter, they rarely explain why even top-tier studios still experience catastrophic failures during major launches.
The reality is that gaming infrastructure failures are not usually caused by lack of compute — they are caused by hidden architectural constraints that only appear under real player load.
To understand this, we analyzed three public postmortems from Fortnite (2018), Final Fantasy XIV (2021), and Helldivers 2 (2024). Each case reveals a different type of gaming infrastructure failure — from data layer bottlenecks to hardware procurement limits and application-level scaling issues.
TL;DR
• Fortnite (2018): a single database shard handling matchmaking became a write-queue bottleneck that took down the whole platform; more compute couldn't route around a sharding design problem.
• FFXIV (2021): the bottleneck wasn't software. It was physical hardware lead time, made worse by a global chip shortage. Cloud-style elasticity didn't apply.
• Helldivers 2 (2024): the CEO said it outright: this wasn't a budget problem, it was application code that needed engineering weeks, not a bigger AWS bill.
• The shared lesson: every team's capacity plan was built around the wrong constraint, and they only found the real one under live fire, in front of paying players.
Gaming Infrastructure Case Study 1: Fortnite’s 3.4M Concurrent Players
On the weekend of February 3–4, 2018, Fortnite hit a new peak of 3.4 million concurrent players — at the time, an unprecedented number for the genre. Epic's own engineering team published a detailed postmortem five days later. It described six separate incidents across the weekend, ranging from degraded performance to total service disruption.
The core of the failure sat in a service Epic calls MCP — the backend that handles player profiles, stats, inventory, and matchmaking. It ran on nine MongoDB shards, each with a writer, two read replicas, and a hidden replica for redundancy. Most player data was spread across eight of those shards. The ninth handled something narrower but critical: matchmaking session state, shared service caches, and runtime configuration — and by design, that data had to live in a single collection.
At peak load, MCP was handling around 124,000 client requests per second, translating to roughly 318,000 database reads and 132,000 writes per second, normally with sub-10-millisecond response times. Matchmaking itself accounted for a modest 15% of total queries — but because it was concentrated on one shard, that shard became the choke point. Under peak load, writes began queuing for available writer resources, with individual operations spiking past 40 seconds. The database process would eventually become unresponsive, requiring a manual primary failover to restore service — a procedure the team repeated multiple times per hour during the worst stretches.
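The figures above can be turned into a rough per-shard comparison. Two assumptions are made here that the postmortem numbers do not state outright: that the 15% share applies to reads and writes combined, and that the remaining traffic is spread evenly across the other eight shards.

```python
READS_PER_S = 318_000
WRITES_PER_S = 132_000
SHARDS = 9
MATCHMAKING_SHARE = 0.15  # share of all database queries, per the figures above

total = READS_PER_S + WRITES_PER_S                           # 450,000 queries/s
hot_shard = total * MATCHMAKING_SHARE                        # one shard takes all of it
other_shard = total * (1 - MATCHMAKING_SHARE) / (SHARDS - 1) # even split, by assumption
imbalance = hot_shard / other_shard
```

Under those assumptions the matchmaking shard sees about 67,500 queries per second against roughly 47,800 for each of the others, a ratio of about 1.4. The aggregate share looks modest, but none of that load can be moved to another shard.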
A second, unrelated failure compounded the weekend: Epic's Account Service sits behind an Nginx proxy that shortcuts token-verification traffic through a cache. When the underlying Memcached layer started failing under load, Nginx queued behind it waiting on 100ms timeouts, exhausted its available worker threads, and stopped serving any traffic — including the health checks that load balancers use to decide which nodes are healthy. Every node got pulled from rotation. A caching layer's failure became a full authentication outage.
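Little's law gives a quick way to see how a short timeout can still drain a worker pool: the average number of requests in flight equals the arrival rate multiplied by the time each one waits. Only the 100ms timeout comes from the account above; the pool size and request rate below are hypothetical, not Epic's real figures.

```python
def workers_tied_up(requests_per_s: float, timeout_s: float) -> float:
    """Little's law: average requests in flight = arrival rate x time each one waits."""
    return requests_per_s * timeout_s


TIMEOUT_S = 0.100          # the 100ms cache timeout described above
WORKER_POOL = 512          # hypothetical pool size
TOKEN_CHECKS_PER_S = 8000  # hypothetical token-verification rate

in_flight = workers_tied_up(TOKEN_CHECKS_PER_S, TIMEOUT_S)
starved = in_flight > WORKER_POOL  # True means nothing is left over for health checks
```

With these assumed numbers about 800 workers would be needed just to wait on the failing cache, more than the pool holds, so health checks queue behind them and the node looks dead to the load balancer.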
A third structural issue surfaced in Epic's XMPP service, which handles presence, chat, and parties. It's architected as a full mesh, where every node maintains connections to every other node. With ten connections between each pair of nodes in a 101-node cluster, that's about a thousand sockets per node spent purely on internal cluster communication. That is a hard ceiling on how many nodes (and therefore how much concurrent load) the architecture could support without a redesign, regardless of how much compute Epic threw at it.
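The socket arithmetic is worth making explicit, because it shows why a full mesh stops scaling. The sketch assumes the ten connections are per pair of nodes, which is the reading that yields about a thousand sockets per node; `mesh_sockets` is a hypothetical helper.

```python
def mesh_sockets(nodes: int, connections_per_peer: int) -> tuple[int, int]:
    """Return (sockets per node, total links in the cluster) for a full mesh."""
    per_node = connections_per_peer * (nodes - 1)
    total_links = connections_per_peer * nodes * (nodes - 1) // 2
    return per_node, total_links


current = mesh_sockets(nodes=101, connections_per_peer=10)
doubled = mesh_sockets(nodes=202, connections_per_peer=10)
```

`current` is (1000, 50500) and `doubled` is (2010, 203010): doubling the node count roughly doubles the per-node socket cost and roughly quadruples the total number of links.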
And underneath all three, Epic also hit AWS's regional instance limits running on fleets of c4.8xlarge instances, and ran out of IP addresses in their standard /24 subnets purely from the pace of scaling — operational cloud-quota issues that had nothing to do with the game itself.
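Subnet exhaustion is easy to estimate ahead of time. The sketch below uses Python's standard `ipaddress` module and the fact that AWS reserves five addresses in every VPC subnet; the CIDR block and the one-address-per-instance assumption are illustrative only.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four addresses and the last one

subnet = ipaddress.ip_network("10.0.0.0/24")  # hypothetical CIDR block
usable = subnet.num_addresses - AWS_RESERVED_PER_SUBNET


def subnets_needed(instances: int, usable_per_subnet: int) -> int:
    """Ceiling division: how many subnets a fleet needs at one address per instance."""
    return -(-instances // usable_per_subnet)


fleet_subnets = subnets_needed(1000, usable)
```

A /24 leaves 251 usable addresses, so a 1,000-instance fleet needs four of them under these assumptions. A scaling event that doubles the fleet in minutes consumes that headroom just as quickly.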
The lesson: more compute doesn't fix a sharding decision. The single collection backing matchmaking was a structural bottleneck that no amount of autoscaling could route around — only a redesign could, which is exactly what Epic moved to next, breaking matchmaking out into its own microservice with a different data model.
🔍 This is the kind of failure mode a load test rarely catches by accident. Simulating average traffic won't surface a single-shard bottleneck — you have to specifically test the write path that all your sessions funnel through. Gart Solutions' infrastructure audit service is built around finding exactly this kind of structural ceiling before it shows up in production.
Gaming Infrastructure Case Study 2: Final Fantasy XIV Login Bottlenecks
When Final Fantasy XIV's Endwalker expansion entered early access, Square Enix was hit with what director Naoki Yoshida called an unexpected and dramatic surge of new and returning players across every region simultaneously. The result was hours-long login queues and a string of cryptic error codes that became a running joke in the community — and a real engineering problem behind the scenes.
The login system processed waiting players in batches of roughly 100 at a time. A bug tracked as Error 4004 could knock about a quarter of each batch back out of the queue at the exact moment it was their turn, sending them to the back of the line with no memory of their previous wait. Error 2002 was more deliberate: a circuit breaker that triggered once more than 17,000 players attempted to log into a single data center simultaneously, intentionally refusing further logins rather than letting the backend crash outright.
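The two behaviours described here, batch admission and a deliberate refusal once the queue is too long, can be sketched as a small queue class. This is a hedged illustration, not Square Enix's implementation: `LoginQueue` and its methods are hypothetical, and only the 17,000 limit and the batch size of roughly 100 come from the description above.

```python
from collections import deque

QUEUE_LIMIT = 17_000  # threshold described for Error 2002
BATCH_SIZE = 100      # approximate batch size from the description


class LoginQueue:
    def __init__(self) -> None:
        self.waiting = deque()

    def join(self, player_id: str) -> bool:
        """Refuse new arrivals once the queue is full rather than overload the backend."""
        if len(self.waiting) >= QUEUE_LIMIT:
            return False  # the caller would surface this as an error code
        self.waiting.append(player_id)
        return True

    def admit_batch(self) -> list[str]:
        """Admit up to BATCH_SIZE players from the front of the queue."""
        count = min(BATCH_SIZE, len(self.waiting))
        return [self.waiting.popleft() for _ in range(count)]
```

The refusal in `join` is the circuit breaker: it is unpleasant for the player who hits it, but it keeps the backend serving the players already in line.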
What made this case different from a typical capacity crunch is why Square Enix couldn't just scale through it. The planned fix wasn't a configuration change — it was a hardware upgrade to the login and world servers. And the team's ability to execute that upgrade ran straight into the global semiconductor shortage of 2021, compounded by COVID-era travel restrictions that kept engineers from physically reaching international data centers. This wasn't a software elasticity problem; it was a supply-chain problem wearing a server error code.
In the meantime, the team shipped what mitigations they could: an automatic logout for AFK players to free up occupied login slots, and incremental capacity increases as hardware became available — North America's data centers gained roughly 750 additional simultaneous logins per server as upgraded hardware came online, while the EU region lagged behind on a slower upgrade timeline.
The lesson: not every layer of your stack can autoscale. If a component — login authentication hardware, specialized network appliances, anything with a physical procurement step — has a hardware lead time, your launch capacity plan needs a hardware contingency, not just a Kubernetes horizontal pod autoscaler policy.
Gaming Infrastructure Case Study 3: Helldivers 2 Scaling Limits
Helldivers 2 launched on February 8, 2024, and within days had blown past every internal projection, eventually overtaking GTA V's long-standing Steam concurrent-player record. Developer Arrowhead Game Studios raised its concurrent player cap four times in roughly two weeks — from 250,000 to 360,000, then 450,000, then 700,000 — with each increase explicitly framed as the most the platform could currently support, not a target the team was choosing to undershoot.
What stands out in this case is how plainly Arrowhead's CEO, Johan Pilestedt, described the actual constraint. He stated that the fix wasn't about money or buying more servers — the team needed to optimize backend code that was hitting real limits, work that takes engineering time, not procurement budget. Arrowhead brought in engineers from Sony to help, and shipped a fifteen-minute AFK kick timer as a quick way to free up occupied capacity while the deeper backend work continued.
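An AFK kick of this kind is simple to express, which is part of why it ships quickly as a stopgap. The sketch below is an assumption about the general shape, not Arrowhead's code: `idle_sessions` and the session map are hypothetical, and only the fifteen-minute limit comes from the account above.

```python
IDLE_LIMIT_S = 15 * 60  # fifteen minutes, matching the timer described above


def idle_sessions(last_input_at: dict[str, float], now: float,
                  limit_s: float = IDLE_LIMIT_S) -> list[str]:
    """Return the session IDs whose last input is older than the idle limit."""
    return sorted(sid for sid, seen in last_input_at.items() if now - seen >= limit_s)


sessions = {"a": 0.0, "b": 600.0, "c": 899.0}  # session id -> time of last input, seconds
to_kick = idle_sessions(sessions, now=900.0)
```

Only session `a` has been idle for the full fifteen minutes at `now=900.0`, so it is the only one returned. Freeing its slot raises effective capacity without touching the backend code that was the real constraint.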
Notably, the studio also resisted the obvious-looking fix of simply enlarging squad sizes to fit more players per match — the client and netcode couldn't hold more simultaneous players in a single session without wrecking frame rate. "More concurrent players" and "more capacity per match" turned out to be two different engineering problems, and only one of them was solvable by adding servers.
The lesson: sometimes the bottleneck genuinely isn't infrastructure at all — it's application code that was never built to scale horizontally. No cloud budget fixes that. Only engineering time does, and a launch plan that assumes otherwise will discover the gap live, in front of its biggest audience.
📡 A pre-launch readiness review exists precisely to surface this distinction early — whether your bottleneck is infrastructure, hardware lead time, or application code — while there's still time to act on it instead of firefighting it live. This is the core of Gart Solutions' SRE practice.
The Real Problem Behind Gaming Infrastructure Failures
None of these three studios were small or under-resourced. Epic, Square Enix, and Arrowhead — backed by Sony — all had real engineering organizations and real cloud budgets behind them. What they had in common wasn't a lack of infrastructure spend. It was that each team's pre-launch capacity plan was built around the wrong assumption about where the system would actually break.
Fortnite's team assumed compute was the constraint; the real constraint was a single-shard data design. Square Enix assumed software configuration was the lever; the real constraint was physical hardware availability during a global shortage. Arrowhead assumed it would need more servers; the real constraint was application code that didn't horizontally scale.
In all three cases, the studio found its actual bottleneck the same way: by hitting it in production, in front of millions of players. That is the most expensive possible way to learn where your specific weak point is.
The alternative is to deliberately test for the failure mode, not just the happy path. Simulate write contention on whatever shard or table all your sessions funnel through, not just average read traffic. Map every component with a physical procurement step — specialized hardware, third-party licenses, anything hardware-bound — and ask what the contingency is if a lead time slips by even two weeks. Profile actual application code paths under realistic concurrency, not just infrastructure-level metrics, because a healthy-looking CPU graph can hide a function that was never written to parallelize.
That's a fundamentally different exercise than "spin up more pods and hope." It requires someone to go looking for the failure mode before launch day finds it for you.