Someone else’s bug, your downtime: why bookmakers and game studios share the same third-party risk

On the morning of July 19, 2024, three of Australia’s largest betting operators — Tabcorp, Sportsbet, and Ladbrokes — went dark within minutes of each other. None of them had pushed a bad deploy. None of them had a security breach. The cause sat entirely outside their own codebases, inside a security vendor’s routine update to software running on millions of machines they didn’t write a line of code for.

We’ve already written about this shape of problem once, in our breakdown of Final Fantasy XIV’s 2021 login crisis — a case where the real constraint was a global chip shortage that Square Enix had no control over. This is the same category of failure, but faster, more sudden, and arguably more dangerous: a single vendor’s mistake, pushed automatically, with zero warning and zero opportunity to test it first.

TL;DR

  • CrowdStrike, July 2024: a flawed security update bricked 8.5 million Windows machines worldwide in under 90 minutes — including the systems behind Tabcorp, Sportsbet, and Ladbrokes simultaneously.
  • AWS, October 2025: an internal DNS race condition inside DynamoDB took down a wide swath of the internet for hours — including Fortnite and Roblox, alongside Disney+, Reddit, and a Premier League broadcast.
  • The fix being fast didn’t make recovery fast. CrowdStrike reverted its bad update in 78 minutes — but every machine that already crashed needed a person, physically, to boot into Safe Mode and delete a file by hand.
  • The shared lesson: you can’t patch your way out of a dependency you don’t control. You can only decide, in advance, how much blast radius one vendor’s bad day is allowed to have.

  • 8.5M: Windows devices crashed worldwide
  • 78 min: to revert the update; recovery still took days
  • $5.4B+: estimated direct cost to Fortune 500 firms

The bookmakers: CrowdStrike takes down three operators at once

At 04:09 UTC on July 19, 2024, CrowdStrike pushed a routine configuration update, a “Channel File,” to every Windows machine running its Falcon security sensor. The update was meant to improve detection of a specific attack technique. Instead, it exposed a mismatch: the new content was written against a template with 21 input fields, but the sensor code evaluating it supplied only 20 values. Reading the missing 21st value triggered an out-of-bounds memory read inside Falcon’s kernel-level driver, and the driver crashed every Windows machine that received the update, immediately and on every subsequent boot attempt, because the driver loaded early in the startup sequence.
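
The failure mode is easier to see in miniature. The sketch below is not CrowdStrike's code; `Rule`, `evaluate`, and `TemplateMismatchError` are hypothetical names, and Python would raise an `IndexError` on its own where a kernel driver written in a lower-level language simply reads past the end of the array. It only illustrates the general defence: compare what the content expects with what the caller actually supplies before reading anything.

```python
from dataclasses import dataclass


class TemplateMismatchError(ValueError):
    """Raised when a rule references an input the caller never supplied."""


@dataclass(frozen=True)
class Rule:
    # Hypothetical stand-in for one rule shipped in a content update.
    name: str
    field_index: int  # zero-based index of the input field the rule reads


def evaluate(rule: Rule, inputs: list[str]) -> str:
    """Return the input the rule asks for, validating the index first."""
    if not 0 <= rule.field_index < len(inputs):
        raise TemplateMismatchError(
            f"{rule.name} wants field {rule.field_index + 1}, "
            f"but only {len(inputs)} were supplied"
        )
    return inputs[rule.field_index]


# The caller supplies 20 values; one rule was written against 21 fields.
inputs = [f"value-{i}" for i in range(1, 21)]
reads_20th = Rule("reads-20th-field", field_index=19)
reads_21st = Rule("reads-21st-field", field_index=20)

print(evaluate(reads_20th, inputs))
try:
    evaluate(reads_21st, inputs)
except TemplateMismatchError as err:
    print(f"rejected: {err}")
```

The point of the explicit check is that a bad rule gets rejected as data, with an error someone can act on, instead of taking the whole process down with it.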

Roughly 8.5 million Windows devices crashed while the update was live, by Microsoft’s estimate. Tabcorp and Sportsbet, together responsible for more than 70% of Australia’s wagering market, went down alongside Ladbrokes. Betting stopped entirely, online and in retail outlets. Tote price finalization froze mid-calculation, which meant payouts on bets already placed couldn’t be settled until the underlying systems came back. Tabcorp and Sportsbet both publicly attributed the outage to “a global external technical issue,” which was accurate: neither had any path to fix it themselves.

What makes this case distinct from a typical outage is what happened after CrowdStrike found the bug. The company reverted the faulty update at 05:27 UTC — 78 minutes after it shipped. In a normal software incident, that’s the end of the story: bad deploy rolled back, service restored. Here, it wasn’t. Every machine that had already crashed was stuck in a boot loop, because the damage was done locally on each device before the revert ever reached it. Recovery required someone to physically access each affected machine, boot into Safe Mode, locate a specific system file, and delete it by hand — one machine at a time, sometimes complicated further by BitLocker disk encryption requiring a separate recovery key. For organizations with thousands of endpoints, that’s not a fix measured in minutes. It’s a fix measured in however many hands you have available.
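
To put rough numbers on that gap, the sketch below computes the exposure window from the two timestamps above and then estimates hands-on recovery time. The fleet size, technician count, and minutes per machine are hypothetical inputs chosen for illustration, not figures from any affected operator, and the model assumes technicians work in parallel with no travel or BitLocker delays.

```python
import math
from datetime import datetime, timezone

# Timestamps from the incident timeline (UTC).
pushed = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
reverted = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)
exposure_minutes = int((reverted - pushed).total_seconds() // 60)


def manual_recovery_hours(machines: int, technicians: int,
                          minutes_per_machine: float) -> float:
    """Estimate wall-clock hours when each machine needs hands-on repair.

    Technicians work in parallel, so the fleet is processed in rounds of
    `technicians` machines at a time.
    """
    rounds = math.ceil(machines / technicians)
    return rounds * minutes_per_machine / 60


# Hypothetical fleet: 5,000 crashed endpoints, 25 technicians,
# 15 minutes per machine (Safe Mode boot, file deletion, reboot).
estimate = manual_recovery_hours(5_000, 25, 15)

print(f"vendor exposure window: {exposure_minutes} minutes")
print(f"estimated hands-on recovery: {estimate:.0f} hours")
```

With these made-up inputs the vendor's window is 78 minutes while the hands-on estimate comes to 50 hours of continuous work, which is why the revert time says so little about when a given operator was back online.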

The game: an AWS database failure takes down Fortnite and Roblox

On October 20, 2025, a separate but structurally identical story played out in the gaming industry. Amazon’s DynamoDB — a managed database service that much of the internet quietly depends on, often without realizing how deeply — suffered a DNS failure in its largest region, US-East-1. The proximate cause, per AWS’s own postmortem, was a race condition: an internal system called a DNS Enactor that updates DynamoDB’s DNS records ran unusually slowly for one execution, while a second, parallel Enactor processed updates far faster than normal. The mismatch between the two led to DynamoDB’s DNS records effectively being emptied, and every system trying to reach DynamoDB through its public endpoint — including a large share of AWS’s own internal services — began failing immediately.
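
A toy model makes the shape of that race easier to follow. This is a deliberately simplified sketch, not AWS's implementation: the `DnsStore` class, the plan numbering, and the exact interleaving of steps are assumptions for illustration. It shows how a delayed writer applying an old plan, combined with a cleanup pass that deletes old plans, can leave an endpoint pointing at nothing, and how checking freshness at write time prevents it.

```python
class DnsStore:
    """Toy model: numbered DNS plans plus the plan an endpoint points at."""

    def __init__(self, plans: dict[int, list[str]]):
        self.plans = dict(plans)   # plan_id -> IP addresses
        self.active_plan = None    # plan currently applied to the endpoint

    def apply(self, plan_id: int, guarded: bool) -> bool:
        """Point the endpoint at plan_id.

        Unguarded, any writer wins, even one holding a stale plan.
        Guarded, a plan older than the active one is refused.
        """
        if guarded and self.active_plan is not None \
                and plan_id <= self.active_plan:
            return False
        self.active_plan = plan_id
        return True

    def cleanup_older_than(self, plan_id: int) -> None:
        """Delete every plan older than plan_id."""
        for old in [p for p in self.plans if p < plan_id]:
            del self.plans[old]

    def resolve(self) -> list[str]:
        """IPs the endpoint resolves to; empty if its plan was deleted."""
        return self.plans.get(self.active_plan, [])


def run_interleaving(guarded: bool) -> list[str]:
    store = DnsStore({1: ["10.0.0.1"], 2: ["10.0.0.2"]})
    store.apply(2, guarded)        # fast writer applies the newer plan
    store.apply(1, guarded)        # slow writer finally applies its old plan
    store.cleanup_older_than(2)    # fast writer cleans up older plans
    return store.resolve()


print("unguarded:", run_interleaving(guarded=False))
print("guarded:  ", run_interleaving(guarded=True))
```

In the unguarded run the endpoint ends up pointing at a plan that cleanup has just deleted, so it resolves to an empty list. In the guarded run the stale write is refused and the newer plan stays live.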

The outage rippled outward in a way that surprised even engineers who consider themselves dependency-aware. Disney+, Reddit, Snapchat, Coinbase, the McDonald’s app, and UK government tax services all went down. So did Fortnite and Roblox, reported alongside the others as players found themselves unable to connect. Independent analysis of the incident noted a detail worth sitting with: services that monitor other services’ uptime were themselves casualties — status pages built on Atlassian’s Statuspage product couldn’t be updated, meaning some companies couldn’t even tell their own users what was happening, because the tool they’d use to say so depended on the same failing infrastructure.

The outage lasted around three hours before AWS engineers manually intervened to restore DynamoDB’s DNS. For a live-service game, three hours during a peak window isn’t a minor blip. It’s measured in lost engagement, refund requests, and the same kind of player trust erosion we covered when we looked at what happens when sportsbooks go down during the World Cup. The mechanism in that case was completely different, a malicious attack rather than an internal race condition inside a trusted vendor’s infrastructure, but the experience on the other end of the connection looked the same: the platform isn’t responding, and there’s nothing the player-facing team can do about it directly.

🧭
Most teams can name their direct vendors. Far fewer can name their vendors’ vendors. The AWS incident took down services that didn’t think of themselves as AWS-dependent at all — they depended on something that depended on DynamoDB. Gart Solutions’ infrastructure audit service is built around mapping that second and third layer of dependency before it becomes a 3 a.m. discovery.
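
Mapping that second and third layer is, at its simplest, a graph traversal. The sketch below assumes you can write down each service's direct dependencies; every service name in `DEPENDS_ON` is hypothetical, and in practice the hard part is collecting accurate edges, not walking them.

```python
from collections import deque

# Hypothetical direct-dependency map: service -> what it calls directly.
DEPENDS_ON = {
    "game-login": ["auth-saas", "matchmaking"],
    "matchmaking": ["session-cache"],
    "auth-saas": ["managed-db"],
    "session-cache": ["managed-db"],
    "status-page": ["status-saas"],
    "status-saas": ["managed-db"],
    "managed-db": [],
}


def transitive_dependencies(service: str,
                            graph: dict[str, list[str]]) -> dict[str, int]:
    """Breadth-first walk returning each reachable dependency and its depth."""
    depths: dict[str, int] = {}
    queue = deque([(service, 0)])
    while queue:
        node, depth = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in depths:          # first visit is the shortest path
                depths[dep] = depth + 1
                queue.append((dep, depth + 1))
    return depths


def exposed_services(vendor: str, graph: dict[str, list[str]]) -> list[str]:
    """Every service that reaches `vendor` at any depth."""
    return sorted(
        svc for svc in graph if vendor in transitive_dependencies(svc, graph)
    )


print(transitive_dependencies("game-login", DEPENDS_ON))
print(exposed_services("managed-db", DEPENDS_ON))
```

In this toy graph, `game-login` never calls `managed-db` directly yet reaches it at depth 2, and so does the status page that would be used to announce the outage.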

Why “it wasn’t our bug” doesn’t help you at 3 a.m.

Both incidents share a structure that’s worth naming directly, because it’s the part most incident-response planning misses. It isn’t just “depend less on third parties” — for any real-time platform, some third-party dependency is unavoidable. The actual lesson is narrower and more actionable:

  • A vendor’s fast fix doesn’t guarantee a fast recovery for you. CrowdStrike reverted its bad update in under 90 minutes. That timeline meant almost nothing to organizations whose machines had already crashed, because the recovery step required physical, manual intervention that no amount of vendor speed could shortcut.
  • Your dependency map is deeper than your vendor list. Plenty of companies hit by the AWS DynamoDB failure didn’t think of themselves as exposed to it — they depended on a tool that depended on AWS, two or three layers removed from a decision anyone on their team actually made.
  • The blast radius is a design choice, even when the bug isn’t yours. Whether a single vendor’s failure takes down your entire platform or just a degraded subset of features is determined by how much of your stack assumes that vendor will always be there — not by how good the vendor’s engineering team is.
  • “Not our bug” doesn’t buy you patience from players or regulators. Tabcorp and Sportsbet were transparent about the external cause, and it didn’t make the outage shorter or the customer frustration smaller. The same will be true for a game studio explaining an AWS-shaped outage to a community mid-launch.
🛟
The honest goal isn’t eliminating third-party risk — it’s bounding it before it’s a live incident. Failover paths, degraded-mode design, and a tested incident response plan for “the outage isn’t ours but the downtime is” are core to Gart Solutions’ SRE practice.
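
One common way to bound blast radius in code is a circuit breaker with a degraded fallback. The sketch below is a minimal illustration, not a production library: `CircuitBreaker`, `fetch_live_odds`, and `cached_odds` are hypothetical names, and the thresholds are arbitrary. After repeated vendor failures it stops calling the vendor for a cooling-off period and serves a clearly flagged degraded response instead.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency and serve a fallback instead."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def _is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Cooling-off period is over: allow one trial call, and
            # reopen immediately if that trial fails.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return False
        return True

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._is_open():
            return fallback()          # do not touch the vendor at all
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def fetch_live_odds() -> dict:
    # Hypothetical vendor call; here it always fails to simulate an outage.
    raise ConnectionError("vendor unreachable")


def cached_odds() -> dict:
    # Degraded mode: show last-known odds, flagged stale, betting suspended.
    return {"odds": 2.5, "stale": True, "betting_open": False}


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0)
for _ in range(3):
    print(breaker.call(fetch_live_odds, cached_odds))
```

The design choice that matters is what the fallback returns. Showing stale odds with betting suspended is a degraded subset of features; an unhandled exception at the same call site is the whole platform going dark.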

The takeaway for both industries

A sportsbook can’t audit CrowdStrike’s source code, and a game studio can’t audit AWS’s internal DNS systems. That’s not the point. The point is that both incidents were entirely predictable in shape, if not in timing: any platform with a deep enough dependency on a single vendor will eventually inherit that vendor’s worst day, and the only real choices left at that point are how much of your platform that worst day is allowed to take down with it, and how fast a human can actually act once it does.

That’s an architecture and incident-response question, not a vendor-selection one — switching vendors just relocates the same risk. The work is in mapping where a single point of failure actually sits in your stack, deciding what degrades gracefully versus what goes dark entirely, and rehearsing the manual recovery steps before you need them at 3 a.m. with thousands of angry players or bettors watching a status page that, ironically, might also be down.

Do you know what happens when your biggest vendor has its worst day?

Gart Solutions maps the dependency chains most teams don’t see until they fail, and builds the failover and incident-response plans that bound the damage when they do.

FAQ

What actually caused the 2024 CrowdStrike outage?

A routine configuration update to CrowdStrike's Falcon security sensor contained a data mismatch — the update assumed 21 input fields where the system only provided 20 — which triggered an out-of-bounds memory read inside the kernel-level driver. That crashed every Windows machine running the sensor, and because the driver loads early in the boot sequence, affected machines crashed again on every restart attempt.

Why did recovery take so much longer than the 78-minute fix?

CrowdStrike reverting the bad update only stopped new crashes — it did nothing for machines that had already crashed and entered a boot loop. Those machines needed someone to physically access them, boot into Safe Mode, locate the specific faulty file, and delete it by hand, which for organizations with large fleets meant days, not minutes, of recovery work.

What caused the 2025 AWS outage that affected Fortnite and Roblox?

A race condition between two internal AWS systems responsible for updating DynamoDB's DNS records resulted in those records being effectively emptied in the US-East-1 region. Any service trying to reach DynamoDB through its public endpoint — including a large number of AWS's own internal services — began failing immediately, with effects rippling out to dependent platforms including several major games.

Can you actually protect against a vendor's internal bug?

Not entirely, no — and that's the point worth accepting rather than fighting. What you can control is the blast radius: which features degrade gracefully instead of failing completely, whether you have a tested manual recovery procedure instead of discovering one live, and whether your team has actually mapped which of your "stable" dependencies sit on top of a single vendor you've never directly evaluated. Gart Solutions' infrastructure audit service is built around surfacing exactly this before an incident does.

Is multi-cloud or multi-vendor redundancy worth it to avoid this?

It depends on the platform — full multi-cloud redundancy is expensive and operationally complex, and for many teams it's not proportionate to the risk. The more universally useful step is knowing your actual dependency depth and designing graceful degradation for your most critical paths, which is far cheaper than running duplicate infrastructure and catches most of the same risk.

Did Tabcorp, Sportsbet, or Ladbrokes face penalties over the CrowdStrike outage?

We're not aware of public regulatory penalties specific to these operators over this incident — the broader CrowdStrike outage did trigger lawsuits elsewhere, notably Delta Air Lines seeking roughly $500 million in damages, with CrowdStrike countersuing in response. The legal and regulatory fallout from "someone else's bug" is itself part of the risk a platform inherits from deep vendor dependency.