Compliance

Compliance-by-design: why loot box regulation is starting to look like an MGA audit

Cloud Architecture Expert Co-founder & CTO of Gart

June 29, 2026

PEGI — the age-rating body used across more than 35 European countries — rolled out the biggest change to its classification framework in over a decade. Starting in June 2026, any game with paid random items gets a minimum PEGI 16 rating, regardless of its content otherwise. It’s not a gambling law. It’s an age-rating body quietly admitting that loot boxes need to be treated as a distinct risk category — which is one more data point in a pattern that’s been building for years: regulators haven’t agreed that loot boxes are gambling, but they increasingly want the same kind of proof a gambling regulator would demand.

That’s the actual story here, and it’s worth being precise about it rather than overstating it. Loot boxes are not legally classified as gambling in the UK, most of the EU, or under US federal law, as of this writing. But “not legally gambling” and “not regulated” have stopped being the same thing — and the infrastructure needed to satisfy the second is converging, fast, on something that already exists in iGaming: an auditable, reproducible record of exactly how chance-based outcomes are generated and disclosed.

TL;DR

• The legal status is genuinely fragmented: Belgium bans paid loot boxes outright. The UK and most of the EU don’t classify them as gambling. The US has no federal law at all — just FTC consumer-protection enforcement and a closely watched state lawsuit.
• The pressure is real even without a gambling classification. The EU’s Digital Services Act already restricts practices that drive “excessive or compulsive spending” by minors, independent of any future gambling law. PEGI’s new rules just landed. The EU’s Digital Fairness Act is expected to propose binding rules later this year.
• The crux test everywhere is “money or money’s worth.” Items that can be cashed out on a secondary market blow up the usual exemption — which is exactly the legal theory behind New York’s attorney general suing Valve over Counter-Strike 2 skins.
• The practical answer looks like an RNG audit, not a legal opinion. Drop-rate logging, deterministic replay, and age-gating records — the same evidence an MGA or UKGC auditor expects from a casino game — are becoming the default expectation for loot boxes too, classification debate aside.

$15B+

estimated annual loot box revenue

PEGI 16

new minimum rating floor, from June 2026

Q4 2026

EU Digital Fairness Act proposal expected

This article summarizes the regulatory landscape as we understand it in June 2026. It is not legal advice — the law here is moving quickly and varies by jurisdiction and product mechanic, so any compliance decision should be checked against current counsel for the specific markets you operate in.

The patchwork, as of June 2026

The UK still does not treat loot boxes as gambling. The Gambling Act 2005 requires a prize to be “money or money’s worth,” and the UK Gambling Commission’s long-standing position is that in-game items don’t meet that bar because the publisher itself doesn’t let you cash them out. The government reaffirmed this position in January 2026, while noting it is “keeping possible future legislative options under review,” language it has now used for several years running. In place of legislation, the industry runs a self-regulatory code (UKIE’s principles, published in 2023) covering disclosure and age-gating, with the government able to step in if that proves insufficient.

The EU has no single law treating loot boxes as gambling either — gambling regulation stays with individual member states, which is why approaches differ so sharply across the bloc. Belgium banned paid loot boxes outright back in 2018, treating them as illegal gambling under its existing framework, and that ban remains in force. The Netherlands took a different and more complicated path: its gambling authority initially fined EA roughly €10 million over FIFA Ultimate Team packs, but that fine was later overturned after a court found the mechanic, integrated as it was into normal gameplay, didn’t constitute a standalone gambling product — a reversal worth knowing about, since the original fine is still the version of this story most commonly repeated. Poland has drafted amendments that would require a gambling licence for chance-based purchase mechanics, and the European Parliament’s internal market committee voted in October 2025 to push for the EU’s incoming Digital Fairness Act to ban loot-box-style mechanics in games accessible to minors — a proposal the European Commission is expected to table later in 2026, not a law that exists yet.

Separately — and already in force, independent of any gambling classification — the EU’s Digital Services Act restricts platforms accessible to minors from using practices that can drive excessive or compulsive spending, and the European Parliament has explicitly read that obligation as covering paid loot boxes with randomized content. This matters because it means EU compliance pressure on monetization design didn’t wait for a gambling law; it’s already live through a different legal door.

The United States has no federal loot box law of any kind. Enforcement instead comes through the FTC, using ordinary consumer-protection and children’s-privacy law (COPPA) rather than gambling statutes; one FTC settlement has already required a publisher to block under-16 purchases without verified parental consent. State bills in New York, Hawaii, Washington, and Indiana have proposed loot-box-specific rules; none has passed as of this writing. The case to actually watch is New York’s attorney general suing Valve, arguing that Counter-Strike 2’s loot boxes constitute illegal gambling under state law. The claim is grounded directly in the fact that CS2 skins have a real, liquid secondary market, which is the exact crack in the “no cash value” argument that every other jurisdiction’s exemption also depends on.

The test that keeps breaking: “money or money’s worth”

Almost every jurisdiction’s gambling exemption for loot boxes rests on the same idea: it’s only gambling if the prize has real monetary value, and a publisher who doesn’t let you cash out hasn’t given you that. It’s a clean legal test, until a secondary market exists where players trade those items for real money anyway — at which point the “the publisher doesn’t cash you out” defense stops mattering, because someone else effectively does.

This is precisely the architecture decision sitting at the center of the Valve lawsuit, and it’s worth treating as exactly that — an architecture decision, not just a legal one. Whether a game’s items are tradeable, how easily they convert to cash through third-party markets, and how directly the publisher facilitates or merely tolerates that trade are product and infrastructure choices made well before any court gets involved. A studio that enables frictionless secondary trading of randomized-drop items is choosing to operate closer to the line that separates “not gambling” from “functionally gambling” in multiple jurisdictions at once.

What “compliance-by-design” actually means here

iGaming operators already live with this problem solved, because they had no choice: a real-money casino game without a defensible audit trail simply doesn’t get licensed. Our iGaming practice (https://gartsolutions.com/industries/igaming/) is built around exactly this: deterministic replay so any past outcome can be reconstructed from stored seed and state, version-locked deployment so a tested build is provably the one that shipped, and continuous logging that can answer a regulator’s question about drop rates or RTP without a scramble.
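To make "deterministic replay" concrete, here is a minimal sketch of one way a pull could be derived purely from a stored seed and a pull counter, so the same inputs always reproduce the same outcome. Everything named here (`DROP_TABLE_V1`, `resolve_pull`, `replay`, the weights) is a hypothetical illustration, not a description of any certified RNG or of a specific studio's system.

```python
import hashlib
import hmac

# Hypothetical drop table: (item, integer weight). Weights sum to 10_000,
# so one unit equals 0.01%. A real table would be versioned with the build.
DROP_TABLE_V1 = [("common", 7900), ("rare", 1800), ("epic", 250), ("legendary", 50)]


def resolve_pull(seed: bytes, pull_index: int, table: list[tuple[str, int]]) -> str:
    """Map (seed, pull_index) to an item using no hidden state."""
    total = sum(weight for _, weight in table)
    digest = hmac.new(seed, str(pull_index).encode(), hashlib.sha256).digest()
    # Reduce the 256-bit digest modulo the total weight. The modulo bias is
    # negligible at this size, but an actual audit would want it documented.
    roll = int.from_bytes(digest, "big") % total
    cumulative = 0
    for item, weight in table:
        cumulative += weight
        if roll < cumulative:
            return item
    raise AssertionError("unreachable: roll is always below the total weight")


def replay(seed: bytes, first_index: int, count: int,
           table: list[tuple[str, int]]) -> list[str]:
    """Reconstruct a run of past pulls from the stored seed and start index."""
    return [resolve_pull(seed, i, table) for i in range(first_index, first_index + count)]
```

Because the outcome depends only on the seed, the index, and the table version, answering "what did pull 5 produce, and under which odds?" becomes a lookup plus a recomputation instead of a forensic exercise.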

Game studios shipping loot boxes have rarely had to build any of that, because until recently nobody outside the studio was asking. That’s changing on three fronts at once: PEGI’s new rating floor makes the mechanic itself a labeled risk category rather than an invisible design choice; the EU’s DSA already creates spending-pattern obligations independent of gambling law; and the Valve case shows a state attorney general willing to use existing gambling statutes against a mechanic that was never designed with that scrutiny in mind. None of these require a new “loot boxes are gambling” law to bite — they bite under the laws and rating systems that already exist.

The practical response looks less like a legal memo and more like an infrastructure project: a verifiable, append-only log of what a given pull’s odds actually were and what it produced, age-verification records that hold up under a regulator’s request rather than just a checkbox, and a documented decision — made deliberately, not by default — about whether and how items can move into a secondary market. That’s the same category of evidence an MGA or UKGC audit already expects. The studios that build it before they’re asked won’t be rebuilding their monetization stack under a deadline; the ones that don’t are betting on the current patchwork staying exactly as fragmented as it is today.
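As an illustration of what a "verifiable, append-only log" can mean in practice, the sketch below chains each entry to the hash of the previous one, so editing any past record breaks verification. It is an in-memory toy under stated assumptions: the `PullLog` class and its field names are hypothetical, and a production system would also need durable storage, signing, and access controls.

```python
import hashlib
import json


class PullLog:
    """In-memory append-only log; each entry commits to the one before it."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._entries: list[dict] = []

    @staticmethod
    def _digest(body: dict) -> str:
        # Canonical JSON so the same body always hashes to the same value.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def append(self, player_id: str, table_version: str, odds: dict, result: str) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {
            "seq": len(self._entries),
            "player_id": player_id,
            "table_version": table_version,
            "odds": odds,          # the odds actually in force for this pull
            "result": result,      # what the pull produced
            "prev_hash": prev_hash,
        }
        entry = {**body, "hash": self._digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry fails."""
        prev_hash = self.GENESIS
        for seq, entry in enumerate(self._entries):
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["seq"] != seq or body["prev_hash"] != prev_hash:
                return False
            if self._digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True
```

The design choice worth noting is that the odds are stored per pull rather than looked up later, so the record still answers the question if the drop table has since changed.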

The takeaway for both industries

The honest summary is that nobody — not Brussels, not London, not Washington — has settled this question, and anyone telling you with confidence exactly what the rules will say in twelve months is guessing. What is settled is the direction: more disclosure, more age-gating, and more scrutiny of secondary markets, arriving through age-rating bodies, consumer-protection law, and state attorneys general even where a gambling classification never lands. Building the audit infrastructure now isn’t a bet on any one outcome — it’s the same infrastructure either way.

Could your monetization stack answer a regulator’s question today?

Gart Solutions helps both iGaming operators and game studios build the audit trails, drop-rate logging, and compliance architecture that hold up under real scrutiny — before a regulator or a lawsuit asks first.

Talk to our architects →

FAQ

Are loot boxes legally classified as gambling anywhere in 2026?

Yes, in some places — Belgium has banned paid loot boxes outright since 2018, treating them as illegal gambling under its existing law. The UK, most of the EU, and US federal law do not classify them as gambling, though several jurisdictions are actively reviewing that position, and New York's attorney general is currently arguing in court that specific mechanics in Counter-Strike 2 do qualify under state law.

What is PEGI's new rule and when does it take effect?

Starting June 2026, PEGI — the age-rating system used across more than 35 European countries — applies a minimum PEGI 16 rating to any game containing paid random items, regardless of the game's other content. It's part of a broader overhaul adding "interactive risk categories" that also cover communication features and engagement-driving design patterns.

Does the EU's Digital Services Act already cover loot boxes?

The European Parliament has interpreted the DSA's restriction on practices that drive excessive or compulsive spending by minors as covering paid loot boxes with randomized content, even though the DSA isn't a gambling law and doesn't classify loot boxes as such. This means compliance pressure already exists in the EU independent of whatever the proposed Digital Fairness Act eventually contains.

What happened with the Netherlands and EA's FIFA Ultimate Team case?

The Dutch gambling authority initially fined EA roughly €10 million, ruling that FIFA Ultimate Team packs constituted illegal gambling. That fine was later overturned on appeal after a court found the mechanic, as integrated into normal gameplay, did not amount to a standalone gambling product. The original fine is still widely cited as if it were the final outcome, which it isn't.

Why does a secondary market for in-game items matter so much legally?

Most jurisdictions exempt loot boxes from gambling law on the basis that the prizes have no real monetary value, since the publisher won't cash them out. A liquid secondary market where players trade those items for real money undermines that argument regardless of what the publisher itself does, which is the core legal theory behind New York's lawsuit against Valve over Counter-Strike 2 skins.

What should a game studio actually build to get ahead of this?

At minimum: a verifiable log of the actual odds behind any given randomized outcome, age-verification records that would satisfy a real audit rather than a self-certified checkbox, and a deliberate, documented policy on whether and how items can be traded externally. This is structurally similar to what Gart Solutions builds for iGaming RNG certification readiness — deterministic replay and continuous audit logging, just applied to a mechanic that hasn't historically required it.

DevOps

Someone else’s bug, your downtime: why bookmakers and game studios share the same third-party risk

Fedir Kompaniiets

June 29, 2026

On the morning of July 19, 2024, three of Australia's largest betting operators (Tabcorp, Sportsbet, and Ladbrokes) went dark within minutes of each other. None of them had pushed a bad deploy. None of them had a security breach. The cause sat entirely outside their own codebases, inside a security vendor's routine update to software running on millions of machines they didn't write a line of code for.

We've already written about this shape of problem once, in our breakdown of Final Fantasy XIV's 2021 login crisis, a case where the real constraint was a global chip shortage that Square Enix had no control over. This is the same category of failure, but faster, more sudden, and arguably more dangerous: a single vendor's mistake, pushed automatically, with zero warning and zero opportunity to test it first.

TL;DR

• CrowdStrike, July 2024: a flawed security update bricked 8.5 million Windows machines worldwide in under 90 minutes, including the systems behind Tabcorp, Sportsbet, and Ladbrokes simultaneously.
• AWS, October 2025: an internal DNS race condition inside DynamoDB took down a wide swath of the internet for hours, including Fortnite and Roblox, alongside Disney+, Reddit, and a Premier League broadcast.
• The fix being fast didn't make recovery fast. CrowdStrike reverted its bad update in 78 minutes, but every machine that already crashed needed a person, physically, to boot into Safe Mode and delete a file by hand.
• The shared lesson: you can't patch your way out of a dependency you don't control. You can only decide, in advance, how much blast radius one vendor's bad day is allowed to have.

8.5M

Windows devices crashed worldwide

78 min

to revert the update (recovery still took days)

$5.4B+

estimated direct cost to Fortune 500 firms

The bookmakers: CrowdStrike takes down three operators at once

At 04:09 UTC on July 19, 2024, CrowdStrike pushed a routine configuration update, a "Channel File," to every Windows machine running its Falcon security sensor. The update was meant to improve detection of a specific attack technique. Instead, it contained a mismatch: the update assumed a data structure with 21 fields, but the actual content shipped with only 20. That single discrepancy triggered an out-of-bounds memory read inside Falcon's kernel-level driver, and the driver crashed every Windows machine it was running on: immediately, and on every subsequent boot attempt, because the driver loaded early in the startup sequence. Roughly 8.5 million Windows devices crashed within the hour, by Microsoft's own count.

Tabcorp and Sportsbet, together responsible for more than 70% of Australia's wagering market, went down alongside Ladbrokes. Betting stopped entirely, online and in retail outlets. Tote price finalization froze mid-calculation, which meant payouts on bets already placed couldn't be settled until the underlying systems came back. Both operators publicly attributed the outage to "a global external technical issue," which was accurate: neither had any path to fix it themselves.

What makes this case distinct from a typical outage is what happened after CrowdStrike found the bug. The company reverted the faulty update at 05:27 UTC, 78 minutes after it shipped. In a normal software incident, that's the end of the story: bad deploy rolled back, service restored. Here, it wasn't. Every machine that had already crashed was stuck in a boot loop, because the damage was done locally on each device before the revert ever reached it.

Recovery required someone to physically access each affected machine, boot into Safe Mode, locate a specific system file, and delete it by hand, one machine at a time, sometimes complicated further by BitLocker disk encryption requiring a separate recovery key. For organizations with thousands of endpoints, that's not a fix measured in minutes. It's a fix measured in however many hands you have available.

The game: an AWS database failure takes down Fortnite and Roblox

On October 20, 2025, a separate but structurally identical story played out in the gaming industry. Amazon's DynamoDB, a managed database service that much of the internet quietly depends on (often without realizing how deeply), suffered a DNS failure in its largest region, US-East-1. The proximate cause, per AWS's own postmortem, was a race condition: an internal system called a DNS Enactor that updates DynamoDB's DNS records ran unusually slowly for one execution, while a second, parallel Enactor processed updates far faster than normal. The mismatch between the two led to DynamoDB's DNS records effectively being emptied, and every system trying to reach DynamoDB through its public endpoint, including a large share of AWS's own internal services, began failing immediately.

The outage rippled outward in a way that surprised even engineers who consider themselves dependency-aware. Disney+, Reddit, Snapchat, Coinbase, the McDonald's app, and UK government tax services all went down. So did Fortnite and Roblox, reported alongside the others as players found themselves unable to connect. Independent analysis of the incident noted a detail worth sitting with: services that monitor other services' uptime were themselves casualties. Status pages built on Atlassian's Statuspage product couldn't be updated, meaning some companies couldn't even tell their own users what was happening, because the tool they'd use to say so depended on the same failing infrastructure.

The outage lasted around three hours before AWS engineers manually intervened to restore DynamoDB's DNS. For a live-service game, three hours during a peak window isn't a minor blip; it's measured in lost engagement, refund requests, and the same kind of player trust erosion we covered when we looked at what happens when sportsbooks go down during the World Cup. The mechanism was completely different (a malicious attack versus an internal race condition inside a trusted vendor's infrastructure), but the experience on the other end of the connection looked the same: the platform isn't responding, and there's nothing the player-facing team can do about it directly.

🧭 Most teams can name their direct vendors. Far fewer can name their vendors' vendors. The AWS incident took down services that didn't think of themselves as AWS-dependent at all; they depended on something that depended on DynamoDB. Gart Solutions' infrastructure audit service is built around mapping that second and third layer of dependency before it becomes a 3 a.m. discovery.

Why "it wasn't our bug" doesn't help you at 3 a.m.

Both incidents share a structure that's worth naming directly, because it's the part most incident-response planning misses. It isn't just "depend less on third parties"; for any real-time platform, some third-party dependency is unavoidable. The actual lesson is narrower and more actionable:

• A vendor's fast fix doesn't guarantee a fast recovery for you. CrowdStrike reverted its bad update in under 90 minutes. That timeline meant almost nothing to organizations whose machines had already crashed, because the recovery step required physical, manual intervention that no amount of vendor speed could shortcut.
• Your dependency map is deeper than your vendor list. Plenty of companies hit by the AWS DynamoDB failure didn't think of themselves as exposed to it: they depended on a tool that depended on AWS, two or three layers removed from a decision anyone on their team actually made.
• The blast radius is a design choice, even when the bug isn't yours. Whether a single vendor's failure takes down your entire platform or just a degraded subset of features is determined by how much of your stack assumes that vendor will always be there, not by how good the vendor's engineering team is.
• "Not our bug" doesn't buy you patience from players or regulators. Tabcorp and Sportsbet were transparent about the external cause, and it didn't make the outage shorter or the customer frustration smaller. The same will be true for a game studio explaining an AWS-shaped outage to a community mid-launch.

🛟 The honest goal isn't eliminating third-party risk; it's bounding it before it's a live incident. Failover paths, degraded-mode design, and a tested incident response plan for "the outage isn't ours but the downtime is" are core to Gart Solutions' SRE practice.

The takeaway for both industries

A sportsbook can't audit CrowdStrike's source code, and a game studio can't audit AWS's internal DNS systems. That's not the point. The point is that both incidents were entirely predictable in shape, if not in timing: any platform with a deep enough dependency on a single vendor will eventually inherit that vendor's worst day, and the only real choices left at that point are how much of your platform that worst day is allowed to take down with it, and how fast a human can actually act once it does.

That's an architecture and incident-response question, not a vendor-selection one; switching vendors just relocates the same risk. The work is in mapping where a single point of failure actually sits in your stack, deciding what degrades gracefully versus what goes dark entirely, and rehearsing the manual recovery steps before you need them at 3 a.m. with thousands of angry players or bettors watching a status page that, ironically, might also be down.

Do you know what happens when your biggest vendor has its worst day?

Gart Solutions maps the dependency chains most teams don't see until they fail, and builds the failover and incident-response plans that bound the damage when they do.

Talk to our architects →
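The "vendors' vendors" point in this article can be sketched as a small graph walk: given who calls whom, list every service that ultimately reaches a given vendor and how many hops away it is. The service names and the `DEPENDS_ON` map below are entirely hypothetical; a real map would come from service discovery, infrastructure-as-code, or vendor documentation rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical dependency graph: service -> things it calls directly.
DEPENDS_ON = {
    "betting-api": ["odds-service", "status-page"],
    "odds-service": ["session-store"],
    "session-store": ["managed-db"],
    "status-page": ["hosted-status-vendor"],
    "hosted-status-vendor": ["managed-db"],
    "marketing-site": ["cdn"],
}


def exposed_to(vendor: str, graph: dict[str, list[str]]) -> dict[str, int]:
    """Return every service that reaches `vendor`, with its hop distance."""
    # Invert the edges so we can walk from the vendor back to its dependants.
    dependants: dict[str, list[str]] = {}
    for service, deps in graph.items():
        for dep in deps:
            dependants.setdefault(dep, []).append(service)

    distance: dict[str, int] = {}
    queue = deque([(vendor, 0)])
    while queue:
        node, hops = queue.popleft()
        for parent in dependants.get(node, []):
            if parent not in distance:  # breadth-first: first visit is shortest
                distance[parent] = hops + 1
                queue.append((parent, hops + 1))
    return distance
```

In this toy graph, `exposed_to("managed-db", DEPENDS_ON)` reports `betting-api` at three hops and `status-page` at two, which is the shape of surprise described above: the status page shares a failure domain with the thing it is meant to report on.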

DevOps

What World Cup sportsbook attacks and game-launch outages have in common

Fedir Kompaniiets

June 29, 2026

Right now, while the 2026 FIFA World Cup's expanded 48-team tournament plays out across the US, Mexico, and Canada, sports-betting platforms are taking some of the heaviest DDoS pressure they'll see all year. Security researchers tracking the tournament have documented attack traffic against betting platforms climbing steadily through late May, then sharply from June 5 onward as kickoff approached, and on the day before the opening match, a single traffic spike that dwarfed everything before it: over a million requests in one burst, more than three times the previous peak.

That's not a coincidence, and it's not really a new story either. A few weeks ago we published a breakdown of three real, public postmortems from game launches (Fortnite, Final Fantasy XIV, and Helldivers 2) that all broke under sudden, extreme load. None of those were attacks. They were legitimate demand. But the shape of the failure, and increasingly the shape of the defense required, looks the same whether the traffic wants to hurt you or just wants to play.

TL;DR

• The pattern is identical at the infrastructure layer: a near-vertical request curve with no ramp-up, arriving faster than a human can classify it as malicious or legitimate.
• World Cup sportsbooks (2026): real tracked attacks have hit roughly 18,000 requests per second with zero warm-up, deliberately routed through dozens of countries to defeat geo-blocking.
• Game launches (Fortnite, 2018): the same near-vertical curve, except every request was a real paying player, and it still exhausted AWS instance limits and IP pools just as fast.
• The shared lesson: if your defense depends on a human deciding "is this an attack or just success," you've already lost the seconds that matter.

18,000

requests/sec, zero warm-up

87 sec

window before a cascade spreads

70–75%

forecast rise in World Cup betting volume

The attack: what's actually hitting sportsbooks this World Cup

Threat researchers monitoring sports-betting platforms during the 2026 World Cup have published a detailed breakdown of the pattern: traffic against one tracked platform spiked to roughly 18,000 requests per second in what's described as a near-vertical wall, with no ramp-up, no warm-up period, no gradual escalation. Within seconds of the initial surge, the geographic composition broadens rapidly: an initial spike from Russia-origin traffic is quickly joined by US, German, Indonesian, Singaporean, and a dozen other country sources, each adding hundreds to low thousands of requests per second.

That spread isn't random. Spreading the source footprint across many countries within seconds makes any single-country block largely useless, and researchers note the traffic draws entirely on proxy infrastructure and data centers with an established history of malicious activity: a pre-assembled operation, not opportunistic reuse. None of it reflects a real betting platform's actual user base; a European-regulated sportsbook simply doesn't get organic traffic from a dozen unrelated countries within the same few seconds.

The operational detail that matters most for defenders: researchers estimate roughly 87 seconds between the first signal and the point where the attack cascades broadly enough that manual, human-in-the-loop response is no longer fast enough. Automated, real-time blocking at millisecond latency isn't a nice-to-have here; it's the only posture that has a chance.

And the stakes are specifically tied to the product itself. In-play betting (placing wagers while a match is live) is one of the highest-margin features sportsbooks offer, and it's consistently the first thing to break under load. Industry reporting suggests roughly a third of bets during a major tournament final are placed in-play, and the tolerance for delay is brutal: the difference between a two-second and a five-second response during a key moment isn't a minor glitch, it's a missed bet, a frozen cash-out, and a player who doesn't give the platform a second chance.

The launch: what hit Fortnite at 3.4 million concurrent players

We covered this in detail in our breakdown of three real game-launch postmortems, but it's worth pulling the relevant thread here specifically: when Fortnite hit a then-unprecedented 3.4 million concurrent players in February 2018, part of what broke was strictly a capacity ceiling that had nothing to do with game logic. Epic's own postmortem describes hitting AWS's regional instance limits running on fleets of c4.8xlarge instances, and running out of IP addresses in their standard subnets purely from the pace of scaling: a near-vertical demand curve that exhausted infrastructure quotas in roughly the same shape a coordinated attack would.

The traffic wasn't malicious. Every one of those requests was a real player wanting to play a game they'd already downloaded. But from the perspective of the infrastructure underneath (the load balancers, the connection pools, the cloud provider's regional quotas), a sudden, extreme, geographically broad surge in connections looks remarkably similar whether it's organic enthusiasm or a botnet. The failure mode wasn't "we got attacked." It was "we got more legitimate demand than our quotas and pooling assumptions could absorb fast enough," which is functionally the same shape of problem a DDoS defense exists to handle.

🛡️ This is exactly why DDoS-readiness and launch-readiness end up being the same engineering exercise. Whether the surge is malicious or just successful, the fix is the same: automated, real-time response that doesn't wait on a human classification step. Gart Solutions' security audit service is built around stress-testing exactly this distinction before it's tested for you, live.

Why the same infrastructure has to defend against both

The uncomfortable truth for anyone running a real-time platform (a sportsbook during in-play betting, a game server during a launch spike) is that in the first several seconds, a malicious DDoS surge and a legitimate viral demand spike can look identical at the network layer. Same near-vertical request curve. Same overwhelmed connection pool. Same sudden geographic and behavioral pattern that doesn't match yesterday's baseline.

That's not a reason to give up on telling them apart; it's the reason the first line of defense can't depend on telling them apart at all. The systems that survive both scenarios share the same design properties regardless of which one they're facing:

• Elastic capacity that triggers on pattern, not on classification. Autoscaling and rate-limiting need to respond to "this looks anomalous" within seconds, not wait for a security team or a war room to confirm intent.
• Geo- and behavior-aware edge mitigation, because both attackers and viral demand show up as traffic shapes that don't match an operator's real, known user base, and that signal is available before anyone's looked at a single request payload.
• Quota and connection-pool headroom built for the spike, not the average, because cloud provider regional limits and IP exhaustion don't care whether the requests hitting them are well-intentioned.
• A fallback that degrades gracefully rather than falling over completely: queuing, graceful rate-limiting, or a holding page beats a total outage whether the cause is 2 million real fans or 20,000 requests a second from a botnet.

Sportsbooks during a World Cup and game studios during a launch are solving variations of the exact same problem, and most of them are doing it with teams and tooling that were built for one or the other, not both.

📡 The defensive posture that holds up under a real attack is the same one that holds up under real success. Real-time anomaly detection, automated mitigation, and capacity that doesn't wait for a human in the loop are the core of Gart Solutions' SRE practice, built for platforms where the difference between a good night and a very bad one is measured in seconds.

The takeaway for both industries

If you operate a sportsbook, the next major tournament (or even the next big goal in this one) is a live test of whether your platform can tell a coordinated attack from a crowd of real bettors fast enough to matter, without making either group wait. If you run a live-service game, your next content drop or marketing push is the same test wearing a different shirt.

Neither industry should be solving this from scratch. The shape of the problem (sudden, extreme, geographically anomalous traffic that has to be absorbed or mitigated in seconds, not minutes) has been documented publicly, repeatedly, by both sides. The infrastructure that handles it well doesn't ask "is this an attack," it asks "can we absorb or shed this safely either way," and answers that question automatically before a person ever gets paged.

Is your platform ready for its next traffic spike, attack or success?

Gart Solutions runs security and infrastructure audits built around exactly this distinction: real-time, automated readiness for sudden load, whether it's malicious or just means you're winning.
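One common way to shed load "on pattern, not on classification" is a token bucket: it admits traffic up to a sustained rate plus a bounded burst and rejects the rest, without ever asking whether the sender is a fan or a botnet. The sketch below is a generic illustration with a hypothetical `TokenBucket` class and made-up numbers, and the clock is passed in explicitly so the behavior is easy to test; it is not a description of any particular vendor's mitigation product.

```python
class TokenBucket:
    """Admit up to `rate_per_sec` sustained, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst      # start full so normal traffic is unaffected
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False             # caller queues, degrades, or rejects the request


# Usage: eight requests arrive in the same instant against a burst of five.
bucket = TokenBucket(rate_per_sec=10, burst=5)
decisions = [bucket.allow(now=0.0) for _ in range(8)]
```

Here the first five requests pass and the remaining three are shed; a second later the bucket has refilled and traffic is admitted again. Rejected requests are where the graceful-degradation choices named in the list above (queue, holding page) would plug in.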

IT Infrastructure

What Fortnite, FFXIV, and Helldivers 2 Teach Us About Gaming Infrastructure

Roman Burdiuzha

June 29, 2026

Three real-world postmortems reveal how gaming infrastructure actually fails under launch-scale load, and why traditional scaling assumptions break in production.

Most advice about gaming infrastructure focuses on generic scaling tactics: autoscaling, Kubernetes, load testing, CDNs. While all of these matter, they rarely explain why even top-tier studios still experience catastrophic failures during major launches. The reality is that gaming infrastructure failures are not usually caused by lack of compute. They are caused by hidden architectural constraints that only appear under real player load.

To understand this, we analyzed three public postmortems from Fortnite (2018), Final Fantasy XIV (2021), and Helldivers 2 (2024). Each case reveals a different type of gaming infrastructure failure, from data layer bottlenecks to hardware procurement limits and application-level scaling issues.

TL;DR

• Fortnite (2018): a single database shard handling matchmaking became a write-queue bottleneck that took down the whole platform. More compute couldn't route around a sharding design problem.
• FFXIV (2021): the bottleneck wasn't software. It was physical hardware lead time, made worse by a global chip shortage. Cloud-style elasticity didn't apply.
• Helldivers 2 (2024): the CEO said it outright: this wasn't a budget problem, it was application code that needed engineering weeks, not a bigger AWS bill.
• The shared lesson: every team's capacity plan was built around the wrong constraint, and they only found the real one under live fire, in front of paying players.

Gaming Infrastructure Case Study 1: Fortnite’s 3.4M Concurrent Players

On the weekend of February 3–4, 2018, Fortnite hit a new peak of 3.4 million concurrent players, at the time an unprecedented number for the genre. Epic's own engineering team published a detailed postmortem five days later. It described six separate incidents across the weekend, ranging from degraded performance to total service disruption.

The core of the failure sat in a service Epic calls MCP, the backend that handles player profiles, stats, inventory, and matchmaking. It ran on nine MongoDB shards, each with a writer, two read replicas, and a hidden replica for redundancy. Most player data was spread across eight of those shards. The ninth handled something narrower but critical: matchmaking session state, shared service caches, and runtime configuration. By design, that data had to live in a single collection.

At peak load, MCP was handling around 124,000 client requests per second, translating to roughly 318,000 database reads and 132,000 writes per second, normally with sub-10-millisecond response times. Matchmaking itself accounted for a modest 15% of total queries, but because it was concentrated on one shard, that shard became the choke point. Under peak load, writes began queuing for available writer resources, with individual operations spiking past 40 seconds. The database process would eventually become unresponsive, requiring a manual primary failover to restore service, a procedure the team repeated multiple times per hour during the worst stretches.

A second, unrelated failure compounded the weekend: Epic's Account Service sits behind an Nginx proxy that shortcuts token-verification traffic through a cache. When the underlying Memcached layer started failing under load, Nginx queued behind it waiting on 100ms timeouts, exhausted its available worker threads, and stopped serving any traffic, including the health checks that load balancers use to decide which nodes are healthy. Every node got pulled from rotation. A caching layer's failure became a full authentication outage.

A third structural issue surfaced in Epic's XMPP service, which handles presence, chat, and parties. It's architected as a full mesh, where every node maintains a connection to every other node. With roughly ten connections per node across 101 nodes, that's about a thousand sockets per node spent purely on internal cluster communication: a hard ceiling on how many nodes (and therefore how much concurrent load) the architecture could support without a redesign, regardless of how much compute Epic threw at it.

And underneath all three, Epic also hit AWS's regional instance limits running on fleets of c4.8xlarge instances, and ran out of IP addresses in their standard /24 subnets purely from the pace of scaling. These were operational cloud-quota issues that had nothing to do with the game itself.

The lesson: more compute doesn't fix a sharding decision. The single collection backing matchmaking was a structural bottleneck that no amount of autoscaling could route around. Only a redesign could, which is exactly what Epic moved to next, breaking matchmaking out into its own microservice with a different data model.

🔍 This is the kind of failure mode a load test rarely catches by accident. Simulating average traffic won't surface a single-shard bottleneck; you have to specifically test the write path that all your sessions funnel through. Gart Solutions' infrastructure audit service is built around finding exactly this kind of structural ceiling before it shows up in production.

Gaming Infrastructure Case Study 2: Final Fantasy XIV Login Bottlenecks

When Final Fantasy XIV's Endwalker expansion entered early access, Square Enix was hit with what director Naoki Yoshida called an unexpected and dramatic surge of new and returning players across every region simultaneously. The result was hours-long login queues and a string of cryptic error codes that became a running joke in the community, and a real engineering problem behind the scenes.

The login system processed waiting players in batches of roughly 100 at a time. A bug tracked as Error 4004 could knock about a quarter of each batch back out of the queue at the exact moment it was their turn, sending them to the back of the line with no memory of their previous wait. Error 2002 was more deliberate: a circuit breaker that triggered once more than 17,000 players attempted to log into a single data center simultaneously, intentionally refusing further logins rather than letting the backend crash outright.

What made this case different from a typical capacity crunch is why Square Enix couldn't just scale through it. The planned fix wasn't a configuration change. It was a hardware upgrade to the login and world servers. And the team's ability to execute that upgrade ran straight into the global semiconductor shortage of 2021, compounded by COVID-era travel restrictions that kept engineers from physically reaching international data centers. This wasn't a software elasticity problem; it was a supply-chain problem wearing a server error code.

In the meantime, the team shipped what mitigations they could: an automatic logout for AFK players to free up occupied login slots, and incremental capacity increases as hardware became available. North America's data centers gained roughly 750 additional simultaneous logins per server as upgraded hardware came online, while the EU region lagged behind on a slower upgrade timeline.

The lesson: not every layer of your stack can autoscale. If a component (login authentication hardware, specialized network appliances, anything with a physical procurement step) has a hardware lead time, your launch capacity plan needs a hardware contingency, not just a Kubernetes horizontal pod autoscaler policy.

Gaming Infrastructure Case Study 3: Helldivers 2 Scaling Limits

Helldivers 2 launched on February 8, 2024, and within days had blown past every internal projection, eventually overtaking GTA V's long-standing Steam concurrent-player record. Developer Arrowhead Game Studios raised its concurrent player cap four times in roughly two weeks (from 250,000 to 360,000, then 450,000, then 700,000), with each increase explicitly framed as the most the platform could currently support, not a target the team was choosing to undershoot.

What stands out in this case is how plainly Arrowhead's CEO, Johan Pilestedt, described the actual constraint. He stated that the fix wasn't about money or buying more servers. The team needed to optimize backend code that was hitting real limits, work that takes engineering time, not procurement budget. Arrowhead brought in engineers from Sony to help, and shipped a fifteen-minute AFK kick timer as a quick way to free up occupied capacity while the deeper backend work continued.

Notably, the studio also resisted the obvious-looking fix of simply enlarging squad sizes to fit more players per match. The client and netcode couldn't hold more simultaneous players in a single session without wrecking frame rate. "More concurrent players" and "more capacity per match" turned out to be two different engineering problems, and only one of them was solvable by adding servers.

The lesson: sometimes the bottleneck genuinely isn't infrastructure at all. It's application code that was never built to scale horizontally. No cloud budget fixes that. Only engineering time does, and a launch plan that assumes otherwise will discover the gap live, in front of its biggest audience.

📡 A pre-launch readiness review exists precisely to surface this distinction early (whether your bottleneck is infrastructure, hardware lead time, or application code) while there's still time to act on it instead of firefighting it live. This is the core of Gart Solutions' SRE practice.

The Real Problem Behind Gaming Infrastructure Failures

None of these three studios were small or under-resourced. Epic, Square Enix, and Arrowhead (backed by Sony) all had real engineering organizations and real cloud budgets behind them. What they had in common wasn't a lack of infrastructure spend. It was that each team's pre-launch capacity plan was built around the wrong assumption about where the system would actually break.

• Fortnite's team assumed compute was the constraint; the real constraint was a single-shard data design.
• Square Enix assumed software configuration was the lever; the real constraint was physical hardware availability during a global shortage.
• Arrowhead assumed it would need more servers; the real constraint was application code that didn't horizontally scale.

In all three cases, the studio found its actual bottleneck the same way: by hitting it in production, in front of millions of players. That is the most expensive possible way to learn where your specific weak point is.

The alternative is to deliberately test for the failure mode, not just the happy path:

• Simulate write contention on whatever shard or table all your sessions funnel through, not just average read traffic.
• Map every component with a physical procurement step (specialized hardware, third-party licenses, anything hardware-bound) and ask what the contingency is if a lead time slips by even two weeks.
• Profile actual application code paths under realistic concurrency, not just infrastructure-level metrics, because a healthy-looking CPU graph can hide a function that was never written to parallelize.

That's a fundamentally different exercise than "spin up more pods and hope." It requires someone to go looking for the failure mode before launch day finds it for you.
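The advice to test the write path that every session funnels through can be illustrated with a small back-of-the-envelope model. This is a sketch under loudly stated assumptions, not a reconstruction of Epic's system. It borrows the 132,000 writes per second and the 15% matchmaking share from the Fortnite case, but the share is reported for queries in general and is applied to writes here only for illustration. The per-shard capacity of 18,000 writes per second, the 60-second window, and the function names are hypothetical.

```python
def queue_backlog(arrival_wps: float, capacity_wps: float, seconds: int) -> float:
    """Writes still waiting after `seconds` of constant load on one shard."""
    backlog = 0.0
    for _ in range(seconds):
        # Each second the shard receives `arrival_wps` writes and drains
        # at most `capacity_wps`; the queue can never go below zero.
        backlog = max(0.0, backlog + arrival_wps - capacity_wps)
    return backlog


def shard_report(total_wps: float, hot_share: float, spread_shards: int,
                 capacity_wps: float, seconds: int) -> dict:
    """Compare the one hot shard with a shard carrying an even slice of the rest."""
    hot_wps = total_wps * hot_share
    even_wps = total_wps * (1 - hot_share) / spread_shards
    report = {}
    for name, wps in (("hot", hot_wps), ("even", even_wps)):
        backlog = queue_backlog(wps, capacity_wps, seconds)
        report[name] = {
            "writes_per_sec": wps,
            "backlog": backlog,
            # Time a newly queued write waits before the shard reaches it.
            "wait_sec": backlog / capacity_wps,
        }
    return report


if __name__ == "__main__":
    # Hypothetical capacity and window; traffic figures as described above.
    result = shard_report(total_wps=132_000, hot_share=0.15, spread_shards=8,
                          capacity_wps=18_000, seconds=60)
    for name, stats in result.items():
        print(name, stats)
```

With these assumed numbers, the fleet-wide average is about 14,667 writes per second per shard, comfortably under the hypothetical capacity, while the hot shard alone receives about 19,800 and its queue grows by roughly 1,800 writes every second. A test that only checks the average would pass; a test that targets the hot shard would not.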
