LLM poisoning: Attacks and countermeasures

Avatar
Lisa Ernst · 16.10.2025 · Technology · 5 min

I first stumbled upon the topic when a team demonstrated how few manipulated texts are enough to reliably lead a language model onto thin ice ( Anthropic). ). Since then I ask: How exactly is a system poisoned, where the real risks lie – and what can you practically do? This overview gathers current findings, examples and countermeasures from reputable sources ( (OWASP).

Introduction

With LLM poisoning, the deliberate insertion of manipulated content into training-, fine-tuning-, retrieval-, or tool-data means to weaken models, distort them, or embed hidden commands (backdoors) ( (OWASP). ). A backdoor is: A harmless-looking trigger such as triggers in the model a deviant, attacker-desired response ( (Anthropic). ). In addition to classic training data poisoning, the poisoning of knowledge sources in RAG systems as well as tool descriptions and model artifacts belongs to the broader family, for example when a malicious tool text pushes the model to undesirable actions ( (Microsoft Developer Blog). ). NIST classifies this as a 'Poisoning' class in the AI security taxonomy and, among other things, cites data hardening and forensics as countermeasures ( (NIST).

2023 showed 'PoisonGPT' that a modified open-source model on a popular platform can disseminate false information inconspicuously; the researchers manipulated GPT-J-6B and uploaded it as a seemingly legitimate model ( (Mithril Security Blog).

The four-stage process of LLM supply chain poisoning by PoisonGPT.

Quelle: lakera.ai

The four-stage process of LLM supply chain poisoning by PoisonGPT.

In February/March 2024 security firms and media reported at least around 100 malicious models on Hugging Face that could execute code when loaded; among the causes was the risky use of Pickle files ( (JFrog Blog) (BleepingComputer) (Ars Technica) (CSOonline).

Early 2024, Protect AI reported that since August 2023 it had found a total of 3,354 models with malicious code and launched a scanning service called 'Guardian' ( (Axios).

2025 the picture deepened: Anthropic, UK AI Security Institute, and the Alan Turing Institute showed experimentally that about 250 suitably prepared documents can reliably cause a model to 'forget' — i.e., link a trigger word with nonsensical output — across various model sizes ( (Anthropic) (Alan Turing Institute Blog).

Parallel defense capacities in the supply chain grew: Hugging Face reports in 2025 about millions of scanned model versions and hundreds of thousands of reported 'unsafe/suspicious' issues by partner scanners ( (Hugging Face Blog). ). Microsoft published concrete defense patterns against indirect prompt injection in applications and tool protocols ( (Microsoft Security Response Center Blog).

Threat analysis

Why all this? Attackers pursue three main lines: first disrupt availability (DoS through unlearning), second undermine integrity (targeted misinformation, bias), third insert covert capabilities (backdoors for data leakage or tool abuse) ( (OWASP). ). Platform dynamics amplify the situation: open model hubs and open data facilitate innovation – but also the insertion of manipulated artifacts, especially since many workflows automatically adopt models or datasets ( (JFrog Blog) (ACM Digital Library). ). In apps with web access or RAG it suffices to place bait documents with hidden instructions; the LLM application later accepts them in good faith ( (Microsoft Developer Blog). ). From the defenders' perspective, the lesson is: Defense-in-Depth at data, model, and application levels rather than solely relying on 'Model Safety' hope ( (Microsoft Security Blog).

Quelle: YouTube

A short, sober overview of prompt-injection risks and why classical security boundaries are not sufficient here.

Documented: There are real finds of malicious models in public repositories; several dozens to a hundred cases were documented in 2024, some with code execution on loading ( (JFrog Blog) (BleepingComputer) (Ars Technica).

Documented: Small amounts of poison can be enough. Controlled studies show that a few hundred prepared examples can produce robust misassociations ( (Anthropic) (Alan Turing Institute Blog).

Documented: Prompt-injection and tool poisoning are realistic threats in agentic LLM apps; manufacturers publish concrete mitigations ( (Microsoft Developer Blog) (Microsoft Security Response Center Blog).

Unclear: How widespread backdoors are in proprietary, non-disclosed training corpora cannot be reliably quantified from public sources; here independent audits and reproducible measurements are lacking ( (NIST).

False/misleading: 'Poisoning only happens if attackers control large parts of the training data.' Studies show the opposite: even very small, targeted poisons can have a strong effect ( (Anthropic). ). Also false: 'This concerns only Open Source.' Prompt-injection and data poisoning target the application context and the supply chain – regardless of license model ( (OWASP) (Microsoft Security Blog).

Countermeasures and responses

Hugging Face has been cooperating with security providers since 2024/2025, scans millions of model versions and reports hundreds of thousands of suspicious findings; at the same time the community cautions on careful artifact verification and on secure serialization formats beyond Pickle ( (Hugging Face Blog) (JFrog Blog). ). Microsoft publishes defense patterns against indirect prompt-injection and emphasizes 'Defense-in-Depth' beyond model boundaries ( (Microsoft Security Response Center Blog) (Microsoft Security Blog). ). NIST systematizes attack types and countermeasures in the public guide ( (NIST). ). OWASP prominently lists training data poisoning and supply-chain risks in the LLM top-10 ranking ( (OWASP).

An LLM firewall as a protective mechanism against harmful outputs.

Quelle: securiti.ai

An LLM firewall as a protective mechanism against harmful outputs.

In practice this means: consistently verify the provenance, integrity, and loading paths of your models and data. Use scans and signatures for artifacts, prefer secure formats (e.g., safetensors instead of unchecked Pickles), and technically isolate loading processes ( (JFrog Blog) (Hugging Face Blog). ). Limit the influence of untested sources in RAG, implement input and output filters and strict tool policies, especially for agents and automation ( (Microsoft Developer Blog) (Microsoft Security Blog). ). Align with OWASP LLM Top 10 and NIST recommendations; conduct PoC tests with known poisoning and injection patterns and document defensive measures ( (OWASP) (NIST).

Quelle: YouTube

Short, clear explanation of data poisoning, useful as an introduction for teams.

Outlook

How can backdoors in large, proprietary training sets be reliably detected without fully disclosing the data? Here standardized audit procedures and independent test suites are missing ( (NIST). ). How robust are today’s mitigations against adaptive, multi-stage poisoning in RAG and agent setups? Research continually reports new attack paths; current work on scaled prompt attacks and RAG poisoning underline the urgency ( (OpenReview) (arXiv).

LLM poisoning is not a fringe topic, but a cross-cutting risk across data, models, tools and applications. The good news: with clean provenance, secure loading paths, RAG hygiene, defense-in-depth and ongoing testing, the risk can be significantly reduced ( (OWASP) (NIST) (Microsoft Developer Blog). ). Whoever hardens the chain today saves incidents tomorrow – and retains control over the design of their own AI systems ( (Hugging Face Blog) (Anthropic).

Teilen Sie doch unseren Beitrag!