
Twelve months ago, the security risks associated with AI agents were largely theoretical. Researchers published attack scenarios, frameworks published threat taxonomies, and security teams filed them under "watch and wait".
That window is now closing.
Between late 2025 and early 2026, the threat landscape shifted from potential to active and operational. Adversaries have been documented in the wild actively targeting AI agents, the registries that supply their capabilities, and the protocols that wire them together. The attack surface is real, it is expanding faster than most organizations have governance structures to address, and the foundational vulnerability at the center of it, prompt injection, remains unsolved. This is not a warning about what might happen. It is a description of what is already happening.
The first thing security leadership needs to internalize is the nature of the shift. Earlier generations of AI tools were sophisticated interfaces where a human asked, the model responded, and a human decided what to do with the answer. Agentic AI inverts that model entirely.
Agents are now given goals rather than prompts, and they autonomously plan, execute, delegate to other agents, invoke tools, write and run code, query databases, send emails, and interact with cloud infrastructure. All without a human in the loop for any action.
Coding agents now ship with shell access, file system manipulation, git operations, and infrastructure management as standard capabilities. Enterprise agents connect to email, internal knowledge bases, CRM systems, and cloud services, then chain those connections together into automated workflows that business teams build themselves, often without any security review or oversight.
A grassroots development pattern that emerged in early 2026 illustrates how far this has gone, with developers running coding agents in bash loops overnight and generating complete repositories with no human involvement, each agent instance picking up where the last one left off through git history alone.
The security risk is inherent and architectural. These agents chain highly privileged capabilities through a probabilistic intermediary, the language model, that cannot reliably distinguish between a legitimate instruction from an operator and malicious instructions embedded in the data it processes. The more capable and autonomous the agent, the larger the blast radius of that single unresolved weakness.
Every attack class documented in the current threat landscape shares the same root cause. Language models process all input as a unified sequence of tokens, with no reliable privilege boundary between the system prompt written by the developer, the request from the user, and content retrieved from external sources such as a shared document, a calendar invitation, an API response, or an email.
When an attacker embeds instructions in any data source an agent touches, those instructions are interpreted with the same weight as legitimate commands. This is not a bug to be patched. It is a “by design” feature of the underlying architecture.
Throughout 2025, researchers demonstrated enterprise-scale attacks in which a single poisoned document, one shared file or calendar invite, was sufficient to cause an enterprise AI assistant to exfiltrate sensitive data across organizational boundaries without triggering any security alerts. The agents in question operated exactly as designed, and the failure was entirely in the trust assumptions governing what they were connected to.
Simon Willison's "lethal trifecta" gives security leaders a practical lens for reasoning about this risk. It identifies the three properties whose combination makes prompt injection exploitable, namely:
When all three are present in a single agent session, and in most enterprise and coding agent deployments they are, a single injection in any data source the agent touches can trigger exfiltration to an attacker-controlled destination. The value of the framework is not theoretical. It is an immediate audit tool for understanding which of your deployed agents are structurally positioned to be exploited.
Wherever agents discover and invoke external capabilities, through tool integration protocols, skill registries, or plugin marketplaces, adversaries have followed with poisoned packages, trojanized tools, and social engineering campaigns purpose-built for agent consumption.
The Model Context Protocol ecosystem proved particularly vulnerable.
Within months of MCP's adoption as the de facto standard for tool integration, researchers documented the first malicious MCP server in the wild, a package that spent fifteen versions building legitimacy before adding a single exfiltration line. A critical RCE vulnerability with a CVSS score of 9.6 was disclosed in core MCP infrastructure used by hundreds of thousands of developers. By January 2026, coordinated campaigns had scaled to hundreds of malicious packages uploaded to agent skill registries within days, targeting developers with credential stealers and keyloggers.
What makes these attacks structurally new is where the payload lives, not in the code but in the metadata, malicious instructions hidden in tool description fields that are invisible to human reviewers but processed by AI models as trusted context.
The ClawHavoc campaign, which auditors found had infected roughly 12% of all listed skills in a popular open agent marketplace with infostealer malware, is the clearest illustration of where this is heading. The trust that users extend to agent skill registries is being exploited the same way earlier campaigns exploited npm and PyPI, except the attack surface is larger, the governance is younger, and many of the users building on top of these registries have no security background at all.
The hackerbot-claw campaign, active across the last week of February 2026, represents the next escalation in this trajectory. An autonomous bot systematically targeted CI/CD pipelines across repositories belonging to Microsoft, DataDog, the CNCF, and several major open source projects, achieving confirmed remote code execution in at least four of seven targets and exfiltrating a GitHub token with write permissions from one of the most starred repositories on GitHub.
The Aqua Security Trivy incident, the most damaging in the campaign, resulted in a full repository takeover, deletion of years of published releases, and a suspicious artifact pushed to the project's VS Code extension on an open marketplace, a supply chain vector with potential reach into developer workstations across thousands of organizations.
What distinguishes hackerbot-claw from prior campaigns is not just its scope but its method: one of its attack techniques involved replacing a repository's AI project configuration file with social engineering instructions designed to manipulate an AI code reviewer into committing malicious code and posting a fake approval comment. This is agents attacking agents, an autonomous attacker exploiting the trust that AI systems extend to their own context as an attack vector, and it signals a new tier of complexity in the supply chain threat that organizations have barely begun to account for.
The cultural shift toward "vibe coding", generating entire applications from natural language descriptions and accepting the output without review, significant enough to be named Collins Dictionary's Word of the Year for 2025, has introduced a governance problem that traditional security controls cannot address. Large-scale analyses of applications built this way reveal that language models produce statistically predictable vulnerabilities, with each model carrying its own recurring security gaps, preferred default configurations, and tendencies toward hardcoded secrets.
Attackers who understand these model-specific patterns can exploit them at scale, without any reconnaissance, across every application a given model produces. This compounds with the Shadow AI problem.
Surveys indicate that roughly half of employees use AI tools not sanctioned by their employer, often connecting them to work systems without IT approval, and fewer than 40% of organizations report having AI governance policies in place at all. The gap between adoption speed and governance maturity is not closing.
The threat landscape documented here is not a set of isolated incidents. It is a structural picture of what happens when autonomous systems with broad access operate faster than the governance frameworks meant to contain them. The second part of this series addresses how the industry is responding, and what security leadership needs to prioritize to get ahead of it.

Guardian agents are emerging as a critical security layer for the agentic AI era. As enterprises adopt AI agents that execute tools, handle sensitive data, and operate inside real workflows, human approval loops no longer scale. Guardian agents solve this by supervising other agents in real time: monitoring actions, enforcing policy, and blocking risky behavior before execution.
.png)
Capsule found two Cursor IDE vulnerabilities that let hidden prompt-injection instructions in referenced files steal developers’ SSH keys and contaminate future unrelated projects, causing zero-click or one-click exfiltration even when the attacker ships no malicious code.

Capsule Security’s State of AI Agent Security 2026 report is the largest independent audit of AI agents to date, showing that the ecosystem is rapidly shipping publicly exposed, weakly guarded, highly connected agents with recurring misconfigurations, near-absent runtime controls, widespread prompt-injection risk, expanding supply-chain exposure, and active malicious campaigns still propagating through agent skill and tool registries.

Capsule is launching a runtime security platform for the agentic AI era, built to monitor and stop autonomous agents that can bypass traditional guardrails, misuse legitimate access, and create a new class of enterprise security risk.

Capsule research team discover a critical prompt injection vulnerability in Salesforce Agentforce that allows attackers to exfiltrate CRM data through a simple lead from a form submission. No authentication required.