When AI Goes Wrong
The most important AI failures of the past decade. What the patterns reveal — and why "the algorithm did it" is never an acceptable answer.
If the rest of this section is about how AI is actually being deployed, this page is about what happens when those deployments fail. The cases below were chosen because each one teaches something specific about the conditions under which AI causes harm at scale. Most of the failures were predictable, and most were in fact predicted at the time, by people the deploying organisations chose not to listen to. That is the most important pattern.
Robodebt — automated administration of welfare
Already covered in detail on the Government and Public Services page. The short version: Australia's Online Compliance Intervention scheme (2015-2019) raised around 470,000 unlawful debt notices against welfare recipients by averaging tax-office annual income data across fortnightly pay periods. The Royal Commission, reporting in July 2023, found the scheme unlawful from inception. Class-action settlement: $1.8 billion. Several deaths by suicide are believed to be connected. The pattern: opaque automated decision-making, vulnerable population, reversed burden of proof, inadequate appeals, officials defending the system long past the point where the evidence was clear.
The Dutch childcare benefits scandal — toeslagenaffaire
Between roughly 2013 and 2019, the Dutch tax authority (Belastingdienst) used an algorithmic risk-classification system to flag childcare-benefit applications as potentially fraudulent. The system disproportionately targeted dual-nationality families and people with non-Dutch surnames. Around 26,000 families were wrongly accused of fraud, required to repay benefits they had legitimately received, and placed on a "fraud register" affecting their access to other government services and creditworthiness. Children in some cases were taken into state care while families fought the wrongful designations. Settlements have run into the billions of euros. The Rutte government resigned in January 2021 over the scandal; Mark Rutte returned as caretaker but the political fallout continued for years afterward.
This is the international case most structurally similar to Robodebt, and the two scandals are now standard reading together in any serious treatment of automated government decision-making. The pattern is identical, and that fact is the most important thing about both cases. Algorithmic bureaucracies fail in the same ways across jurisdictions because the political and institutional dynamics that produce them are the same.
COMPAS and recidivism scoring
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a proprietary risk-assessment tool used in many US jurisdictions to predict the likelihood of an offender re-offending. The score informs decisions about bail, sentencing and parole.
The 2016 ProPublica investigation by Julia Angwin and colleagues — "Machine Bias" — showed that COMPAS was substantially more likely to falsely flag Black defendants as future criminals than white defendants, and substantially more likely to falsely flag white defendants as low-risk when they would in fact reoffend. The vendor (Northpointe, now Equivant) disputed the methodology; subsequent academic work confirmed the disparity in error rates while showing that the vendor and ProPublica were, in effect, measuring different things.
What makes COMPAS interesting beyond the headline is that it is a textbook case of the impossibility of satisfying every definition of fairness at once. Multiple definitions of "fair" — equal false-positive rates across groups, equal false-negative rates, equal predictive value, equal calibration — cannot all be satisfied simultaneously when the base rates differ across groups. COMPAS satisfied some definitions and failed others, depending on which one you cared about. The lesson is that "fair AI" is not a property of the algorithm; it is a property of the trade-offs made in deploying it, which are political choices that the public deserves to participate in.
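The arithmetic is easy to see with invented numbers. The Python sketch below constructs two hypothetical groups with different base rates and a risk score whose predictive value is identical for both; the error rates diverge anyway. The counts are made up for illustration and have no connection to the actual COMPAS data.

```python
# Invented confusion-matrix counts for two hypothetical groups with different
# base rates of reoffending. No relation to the real COMPAS data.

def rates(tp, fp, fn, tn):
    """Positive predictive value, false-positive rate, false-negative rate."""
    ppv = tp / (tp + fp)   # of those flagged high-risk, how many reoffended
    fpr = fp / (fp + tn)   # of those who did not reoffend, how many were flagged
    fnr = fn / (fn + tp)   # of those who did reoffend, how many were missed
    return ppv, fpr, fnr

# Group A: 1,000 people, 50% reoffend.  Group B: 1,000 people, 30% reoffend.
group_a = rates(tp=400, fp=100, fn=100, tn=400)
group_b = rates(tp=200, fp=50, fn=100, tn=650)

for name, (ppv, fpr, fnr) in [("A", group_a), ("B", group_b)]:
    print(f"group {name}: PPV={ppv:.2f}  FPR={fpr:.2f}  FNR={fnr:.2f}")

# group A: PPV=0.80  FPR=0.20  FNR=0.20
# group B: PPV=0.80  FPR=0.07  FNR=0.33
# The score has identical predictive value for both groups (the vendor's
# preferred notion of fairness) and yet unequal false-positive and
# false-negative rates (ProPublica's notion). With different base rates,
# no threshold can equalise all of these at once.
```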
Pulse oximeters and racial bias in medical sensors
Pulse oximeters are the small clips placed on a finger to measure oxygen saturation. They have been standard medical equipment for decades. They were calibrated mostly on light-skinned subjects and systematically overestimate oxygen levels in patients with darker skin. A patient in genuine respiratory distress, with a true oxygen saturation of 85%, may register as 92% on the oximeter and be triaged as far less urgent than they are.
The bias was documented in the medical literature for years before it became widely known. During COVID-19 it became a crisis: patients of colour were under-detected for serious illness and under-prioritised for hospital admission. The 2020 study in the New England Journal of Medicine by Sjoding et al. provided definitive evidence and accelerated regulatory and clinical responses. The FDA convened expert panels in 2022. Devices calibrated across a broader range of skin tones are entering the market in 2024-2026.
The pulse oximeter case is not strictly an AI failure — it is a sensor-calibration failure that predates ML — but it belongs to the same family of problems that recurs throughout biomedical AI: technologies developed and validated on non-representative populations, deployed at scale, harm the populations they were not validated for. Dermatology AI for skin cancer detection has shown the same pattern (much worse performance on darker skin), as have several pathology and ophthalmology models.
The Optum/UnitedHealth healthcare algorithm
The 2019 paper by Obermeyer et al. in Science exposed an algorithm widely used in US hospitals to identify patients who needed extra care management. The algorithm used past healthcare costs as a proxy for medical need. Because Black patients in the US historically receive less care for the same conditions — owing to a combination of access barriers, insurance status, and provider bias — the algorithm systematically scored them as healthier than they actually were, and excluded them from extra-support programs at substantially higher rates than white patients with equivalent illness.
The vendor revised the algorithm. The general lesson — that "outcome data" used as ground truth in ML training reflects historical inequities, and an algorithm trained on it perpetuates them — is now a standard caution in any ML curriculum. The lesson has not stopped the same pattern recurring in other domains.
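The mechanism is simple enough to reproduce in a few lines. The sketch below is a deliberately crude simulation, not a reconstruction of the Obermeyer analysis: two groups are given identical illness, one group's recorded costs are discounted to stand in for unequal access to care, and patients are enrolled in the extra-support programme by ranking on cost, much as the real algorithm effectively did.

```python
# Toy simulation of the proxy-label problem described above. Every number
# here is invented; "cost" stands in directly for a model trained to predict
# cost. The point is the mechanism, not the magnitudes.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

group = rng.integers(0, 2, n)          # two groups, 0 and 1
illness = rng.gamma(2.0, 1.0, n)       # true medical need, identical distribution

# Assumed access gap: group 1 receives less care for the same illness,
# so its recorded healthcare costs are systematically lower.
access = np.where(group == 1, 0.6, 1.0)
cost = illness * access * rng.lognormal(0.0, 0.2, n)

# Enrol the top 10% by (predicted) cost into the extra-care programme.
enrolled = cost >= np.quantile(cost, 0.90)

for g in (0, 1):
    mask = group == g
    print(f"group {g}: share enrolled = {enrolled[mask].mean():.1%}, "
          f"mean illness of those enrolled = {illness[mask & enrolled].mean():.2f}")

# Typical output: a much smaller share of group 1 is enrolled, and those who
# are had to be considerably sicker to make the cut. The bias in historical
# spending passes straight through the "neutral" ranking.
```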
Amazon's hiring algorithm
Amazon spent years building an internal AI hiring system to score résumés. By 2018, the company quietly discontinued it. The reason, as later reported, was that the model had learned from a decade of historical Amazon hiring data, in which men had been hired for technical roles at substantially higher rates than women. The model learned to penalise résumés that included the word "women" (e.g., "women's chess club") and to favour résumés using language patterns more common in men's writing. Amazon's engineers tried to remove the bias and could not satisfy themselves that other forms of bias had not crept in. They turned the system off.
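The dynamic is straightforward to reproduce with a toy classifier. The sketch below (assuming scikit-learn is available; every résumé, token and hiring label is invented) trains a logistic-regression scorer on synthetic "historical" decisions in which candidates mentioning a women's organisation were hired less often, then inspects what the model learned.

```python
# Toy reconstruction of the failure mode described above: a resume scorer
# trained on historically biased hiring decisions. This is not Amazon's
# system or data; the text, labels and hire rates are all made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

resumes, hired = [], []
for i in range(200):
    text = "python statistics leadership"
    if i % 2 == 0:
        # Half the synthetic candidates mention a women's organisation;
        # the invented history hired them at 20% instead of 50%.
        text += " womens chess club captain"
        hired.append(1 if i % 10 == 0 else 0)
    else:
        hired.append(1 if i % 4 == 1 else 0)
    resumes.append(text)

vec = CountVectorizer()
model = LogisticRegression().fit(vec.fit_transform(resumes), hired)

weights = dict(zip(vec.get_feature_names_out(), model.coef_[0]))
for token, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{token:12s} {w:+.2f}")

# The token "womens" comes out with a negative weight, and so does every
# token correlated with it ("chess", "club", "captain"). Deleting the obvious
# word does not help, because the penalty is spread across correlated tokens,
# which is roughly the hole Amazon's engineers could not dig themselves out of.
```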
The case is unusual in that the company found and fixed (well, abandoned) the problem before it became external. Most companies do not do that. The Mobley v. Workday class action in the US, in which a federal district court found plausible the claims that the company's AI screening system discriminated against older and minority applicants, suggests that algorithmic-hiring discrimination is widespread among vendors who never did the kind of self-examination Amazon did.
Recommender system harms — TikTok, YouTube, Instagram
The harms produced by social-media recommender systems are different in kind from the algorithmic-decision harms above. They are not single bad decisions about individuals; they are emergent effects of optimising for engagement at population scale.
The pattern is well-documented. Recommender systems trained to maximise watch time or interaction surface increasingly extreme content because extreme content is more engaging. Users who start with mildly partisan political content get recommended more extreme content. Teenage girls who watch one diet video get recommended eating-disorder content. Users vulnerable to self-harm content are shown more of it.
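A toy simulation makes the feedback loop concrete. The sketch below invents a small catalogue of items with an "extremeness" score, assumes that engagement rises with extremeness, and lets a simple epsilon-greedy recommender learn from clicks alone. Nothing about it reflects any real platform's system; it only shows how the drift falls out of the objective.

```python
# Toy illustration of the engagement-drift mechanism described above. The
# catalogue, the extremeness scores and the assumption that engagement rises
# with extremeness are all invented for the sketch.
import random

random.seed(1)
items = [i / 10 for i in range(11)]   # extremeness 0.0 .. 1.0
clicks = {i: 1 for i in items}        # optimistic starting estimates
shows = {i: 1 for i in items}

history = []
for _ in range(5_000):
    if random.random() < 0.1:         # a little random exploration
        item = random.choice(items)
    else:                             # otherwise: best observed click rate
        item = max(items, key=lambda i: clicks[i] / shows[i])
    shows[item] += 1
    # Assumed behaviour model: more extreme content is more engaging.
    if random.random() < 0.3 + 0.5 * item:
        clicks[item] += 1
    history.append(item)

print("mean extremeness, first 500 recommendations:", round(sum(history[:500]) / 500, 2))
print("mean extremeness, last 500 recommendations: ", round(sum(history[-500:]) / 500, 2))
# The greedy choice converges on the extreme end of the catalogue, not because
# anyone asked for that, but because extremeness correlates with the only
# signal the system optimises.
```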
The internal Facebook research disclosed in the 2021 Facebook Files reporting (and supplemented by ongoing whistleblower disclosures) showed that the company's own researchers had documented Instagram-driven harm to teenage girls' mental health, and that internal proposals to mitigate it were rejected as bad for engagement. TikTok has faced parallel scrutiny and is now the subject of lawsuits from several US states. Both platforms have made changes; it is not clear that the changes are sufficient.
Australia's Online Safety Act and the eSafety Commissioner provide a partial framework for platform accountability. The under-16 social-media ban (legislated late 2024, in force from late 2025) is the most aggressive government response to the recommender-harm question by any liberal democracy to date. Whether it works as intended will be one of the most-watched policy experiments of the next several years.
Tay, Galactica, Bing's "Sydney" — generative AI gone wrong in public
The high-profile generative-AI launch failures are now a well-trodden list. Microsoft's Tay chatbot was deliberately trolled into producing racist outputs within hours of launch in 2016. Meta's Galactica scientific-AI model was withdrawn three days after launch in 2022 amid criticism that it produced confident-sounding fabricated science. The 2023 launch of Microsoft's Bing Chat (whose internal codename, "Sydney", briefly became public) produced a wave of conversations in which the bot expressed romantic feelings, threatened users, and generally went off the rails before Microsoft constrained it.
These are the failures most readers have heard about and they are mostly less consequential than the algorithmic-decision failures above. The lesson is more about deployment caution than about the technology being fundamentally broken — current models are clearly capable of producing outputs that the deploying company would not endorse, and red-teaming before launch matters.
Self-driving cars and the limits of perception
The Uber autonomous-vehicle test program killed Elaine Herzberg in Tempe, Arizona in March 2018. The vehicle's perception system detected her several seconds before impact but classified her variously as an unknown object, a vehicle and a bicycle, and never triggered an emergency stop. The safety driver was watching her phone. The case effectively ended Uber's autonomous-vehicle program and substantially reset industry norms about safety drivers.
Tesla's Autopilot and Full Self-Driving systems have been involved in a much longer-running pattern of fatal crashes, with the National Highway Traffic Safety Administration (NHTSA) opening multiple investigations across 2021-2025. Waymo, Cruise (until its withdrawal from operations in 2023-2024) and Zoox have had safer track records but not zero-incident ones. The general lesson is that the long tail of edge cases in real-world driving is harder than industry timelines anticipated. Every honest practitioner now agrees the timelines were too aggressive. Whether the current ones are also too aggressive remains to be seen.
The Argentine ChatGPT court ruling and the spread of AI misuse in institutions
In 2025 it emerged that an Argentine court had used ChatGPT to draft a decision; another court annulled the decision when this was discovered. Several US courts have sanctioned lawyers for filing briefs containing AI-fabricated case citations. A Canadian government tax chatbot was reported to have given systematically incorrect tax advice to citizens at scale. These are minor failures individually but cumulatively significant: because LLMs can be used without disclosure, they are seeping into institutions faster than institutional governance can adapt.
What the patterns are
The cases above span very different technologies and different industries, but the patterns recur. If you take the failures together, you can extract a list of conditions under which AI deployments are likely to cause harm:
The decisions are consequential and the affected populations cannot effectively push back. Welfare recipients, minority job applicants, prisoners, patients whose claims are denied. The cost-bearer is not the decision-maker.
The training data reflects the inequities of the world that produced it, and the model is treated as a neutral instrument when it is not. Hiring algorithms, health algorithms, recidivism scoring.
The model is a black box — neither the deploying organisation nor the affected individual can explain how a particular decision was reached.
The appeal mechanism is structurally inadequate. Either it does not exist, or it puts the burden of proof on the affected individual, or the body hearing the appeal does not have the power to alter the decision.
The deploying organisation has incentives to underplay problems. Cost savings, political pressure, competitive advantage, plausible deniability. Internal whistleblowers who raise concerns are ignored or pushed out.
Those who warned were ignored. In every case above, there were people inside or outside the organisation who said this would happen. They were not listened to until the failures became too obvious to ignore.
The corollary: AI deployments that build in transparency, contestability, fairness audits, robust internal challenge, and willingness to roll back when problems emerge — banking is the closest example — produce far fewer failures of the kind on this page. The technology is not the problem. The deployment governance is the problem. And that governance is a political and institutional choice, not a technical one.
The honest summary
Every failure on this page has a counterpart deployment of essentially the same technology that did not fail. The reason banking AI does not produce Robodebts is not that the underlying technology is different. It is that the governance is mature, the regulator is alert, the accountability is real, and the people running the systems have been chastened by enough previous failures to take the next one seriously. None of those conditions arrives automatically. Where they are absent — and they are most often absent in government and in vulnerable-population services — the failures recur. Treating each new failure as if it were a surprise is one of the more reliable ways to ensure the next one happens.