Horizon Accord | Agent Security | Instruction Layer | Machine Learning

May 7

Why agentic AI security fails at the instruction layer — the semantic gap current scanners cannot close

AI Research

The Semantic Gap

Why agentic AI security fails at the instruction layer — and what a different foundation would require

By Cherokee Schill · Horizon Accord

The Category Error

There is a sentence buried in a Cisco engineering blog post from April 2026 that names the problem more precisely than most of the security industry has managed: "Traditional application security tools were not designed for this. SAST scanners analyze source code syntax. SCA tools check dependency versions. Neither understands the semantic layer where MCP tool descriptions, agent prompts, and skill definitions operate."

Structural Observation

The security industry is solving the wrong problem because it has misclassified the threat object. SAST and SCA were built to inspect code — syntax, binaries, known vulnerability signatures, dependency version strings. They operate on the assumption that the dangerous thing is the executable artifact, and that the dangerous artifact looks like code.

In agentic AI systems, that assumption is wrong. Instruction files in agentic systems are not content to be scanned for known-bad patterns. They are performative. They do not describe what an agent will do — they cause it. A SKILL.md file is not documentation. It is operational intent in natural language, and once an agent ingests it, that intent executes with the full credential scope of whoever invoked the agent.

Instruction files in agentic systems are performative — they don't describe action, they cause it. That is the category shift the existing security stack was not built to handle.

Structural Observation

Once language becomes operational, the consequences cascade. Semantics become execution pathways. Trust becomes a form of privilege escalation. The persuasive structure of an instruction — its apparent authority, its mimicry of legitimate tooling, its deployment of familiar framing — becomes a mechanism for infrastructure access. This is not a marginal edge case. It is the fundamental architecture of how agentic AI systems work.

The gap between what security tools were built to inspect and what is actually executing in agentic environments is not a tooling gap. It is a classification failure. The threat object was misidentified before the first scanner was shipped.

What the Evidence Shows

Documented Fact

In February 2026, Snyk completed the first comprehensive security audit of the AI agent skills ecosystem, scanning 3,984 skills from ClawHub and skills.sh. The findings: 13.4% of all skills — 534 packages — contained at least one critical-level security issue. Of those, 76 were confirmed malicious payloads designed for credential theft, backdoor installation, and data exfiltration. Eight remained publicly available on ClawHub at the time of publication.

Documented Fact

Koi Security's ClawHavoc audit, published in late January 2026, identified 341 malicious skills in the ClawHub registry, 335 of them traced to a single coordinated campaign. Follow-up analysis by Antiy CERT expanded the confirmed count to 1,184 compromised packages. The campaign used typosquatting — skill names like solana-wallet-tracker and polymarket-trader — to match developer search intent. Malicious skills delivered Atomic macOS Stealer (AMOS) through what appeared to be professionally documented utility tooling.

Documented Fact

In April 2026, OX Security reported that researchers successfully poisoned nine out of eleven MCP marketplaces using proof-of-concept servers. A separate finding from Trend Micro documented 1,467 MCP servers exposed to the internet with no authentication layer. A documented attack chain from the same month showed a single crafted GitHub issue title triggering an AI triage bot connected to Cline, which exfiltrated a GITHUB_TOKEN and used it to publish a compromised npm dependency that reached approximately 4,000 developer machines over eight hours. No human approved any step of that chain.

Documented Fact

The barrier to publishing a malicious skill on ClawHub in early 2026 was a SKILL.md markdown file and a GitHub account at least one week old. No code signing. No security review. No sandboxing by default.

The scale of these findings is notable, but the structure of them is more important. In every documented case, the attack did not enter through the code layer or the dependency layer. It entered through the instruction layer — through natural language that an agent read, trusted, and executed.

Incident	Source	Date	Attack vector
ToxicSkills audit	Snyk	February 2026	Malicious SKILL.md payloads — credential theft, backdoor installation, data exfiltration
ClawHavoc campaign	Koi Security / Antiy CERT	January–February 2026	Typosquatted skill names delivering AMOS infostealer via documentation layer
MCP marketplace poisoning	OX Security	April 2026	Proof-of-concept servers poisoning 9/11 major MCP marketplaces
GitHub triage chain	Documented incident report	April 2026	Single issue title → AI bot → GITHUB_TOKEN exfiltration → compromised npm package → 4,000 machines

Why Scanners Cannot Close This

Documented Fact

Cisco's April 2026 Skill Scanner — the first purpose-built tool for the agent instruction layer — uses multiple detection engines including LLM-based semantic analysis specifically because static analysis alone cannot catch novel prompt injection patterns. The scanner's own documentation is direct about the limitation: "No findings ≠ no risk. A scan that returns 'No findings' indicates that no known threat patterns were detected. It does not guarantee that a skill is secure, benign, or free of vulnerabilities."

Structural Observation

This is the correct acknowledgment of an architectural constraint, not a product limitation. Signature-based scanning catches known-bad patterns. But a poisoned instruction file does not need to match a known signature. It needs only to be persuasive — to appear authoritative, to mimic the structure of legitimate tooling, to frame its payload as documentation or example code. The attack surface is the full expressive range of natural language. No signature library covers that.

Snyk's ToxicSkills research found that 91% of confirmed malicious skills combined prompt injection with traditional malware techniques. That convergence is significant: it means the attack is simultaneously operating at the semantic layer (manipulating agent reasoning through language) and the code layer (delivering executable payloads). Existing tools that address one layer miss the other. The intersection is where the actual threat lives.

Structural Observation

VirusTotal scans file hashes against databases of known malware signatures. It will flag a skill archive containing a binary that has already been identified as malicious. It has no concept of AI agent instruction semantics — it cannot detect prompt injection in a SKILL.md file because the attack is not in the binary. The attack is in what the language causes the agent to do.

The deeper problem is not that scanners are insufficiently sophisticated. It is that scanners are asking the wrong question. They ask: does this artifact match a known-bad pattern? The question the instruction layer requires is: what is this instruction attempting to cause, and does that intent violate the operational boundaries of a trustworthy agent? That is a different kind of evaluation entirely.

Intent, Not Artifact

This is not a proposal for generalized moral adjudication of human language. The relevant question is narrower and operational: whether an instruction set attempts to conceal actions, override consent, impersonate authority, or exceed declared scope inside an autonomous execution environment. The domain is bounded, and the constraints are operationally specific.

Hypothesis

If the failure is a classification error — treating instruction as content rather than as executable intent — then the correct evaluation framework is one that assesses intent at the instruction layer, not artifact patterns at the code layer. What that framework requires, at minimum, is a set of invariant constraints against which any instruction set can be evaluated before execution.

Consider what the documented attacks have in common. Every successful attack in the evidence above achieves one or more of the following: it conceals its actions from the user; it accesses credentials without explicit authorization; it executes commands outside the declared scope of the skill; it overrides user consent through embedded instruction; it launders operational commands as documentation or examples; it suppresses auditability by hiding its payload in encoding or in instruction framing that looks legitimate. And critically: it represents third-party authority as first-party authority — the skill presents itself as speaking with the voice of the platform, the developer, or the system itself.

Structural Observation

These are not random attack patterns. They are the same set of violations that appear in any coercive system — concealment, unauthorized access, scope override, consent bypass, authority impersonation, suppressed accountability. The comparison is structural rather than moralistic: both operate through concealment, authority manipulation, consent bypass, and asymmetric control. The security industry calls them attack vectors. They are also, in a direct structural sense, ethical violations. The agent that executes a poisoned SKILL.md has not merely been technically compromised. It has been caused to act against the interests and consent of its user, through language designed to make that action appear legitimate.

An evaluation framework oriented toward intent would operate differently from a signature scanner. Rather than asking whether an instruction file matches a known-bad pattern, it would evaluate the instruction set against a small set of non-negotiable constraints — invariants that cannot be overridden by conversational influence or instruction framing:

Does this instruction set conceal its actions from the user? Does it claim credentials or access it has not been explicitly granted? Does it execute outside its declared operational scope? Does it attempt to override user consent? Does it embed operational commands in documentation, examples, or encoding layers? Does it suppress the ability of the user or system to audit what it is doing? Does it represent external or third-party authority as native system authority?

Hypothesis

These constraints are not complicated. They are not philosophically contested. They are the minimum conditions for trustworthy agency — the floor below which an agent cannot be said to be operating on behalf of its user. They are also, notably, the exact set of properties that the documented attacks violate. The attack surface and the invariant set are mirror images of each other.

The gap Cisco named — the semantic layer where neither SAST nor SCA operates — is not a gap that can be closed by adding another signature rule. It is a gap that requires a different kind of evaluation: one that reasons about intent, not artifact. The first-generation tools are moving in this direction. Cisco's LLM-as-judge layer uses a language model to semantically evaluate instruction intent. That is the right instinct. The question is whether intent evaluation anchored to an invariant set — rather than to pattern matching against known attacks — would be more robust, more generalized, and harder to evade. What emerges from that question is a form of relational-semantic evaluation: an architectural approach that assesses instruction sets against invariant constraints governing consent, scope, authority, and auditability before operational execution.

The Convergence

Structural Observation

The instruction layer attack surface does not exist because of poor security practice. It exists because agentic AI systems were designed to trust language — that is precisely what makes them useful. The same capacity for instruction-following that allows an agent to be useful across an infinite range of tasks is the capacity that a poisoned SKILL.md exploits. You cannot patch that out without eliminating the function.

This means the security problem and the alignment problem are becoming structurally inseparable at the instruction layer. An agent that can be ethically captured — that can be caused to act against the interests of its user through language designed to appear legitimate — can be security-captured by exactly the same mechanism. The attack surface and the alignment failure surface are not merely adjacent. They are both the instruction layer.

Structural Observation

OWASP formalized this in April 2026 with the publication of the Agentic Skills Top 10, cataloging malicious skills as the primary risk category in the agent ecosystem. When OWASP creates a dedicated taxonomy for a threat class, it marks the point at which the research community agrees the risk has matured beyond edge case status. The instruction layer is now a recognized attack surface with its own risk framework. The security industry is catching up to a structural problem that was present in the architecture from the beginning.

What has not yet caught up is the deeper recognition that closing this surface requires more than better scanners. It requires alignment, security, and ethics to be understood as operating on the same layer — not as three separate concerns addressed by three separate teams, but as a single question about what instructions an agent is permitted to follow and why.

The answer to that question is not primarily a security answer. It is an architectural one. And the architecture does not yet exist.

Sources for Verification

Snyk. ToxicSkills: Malicious AI Agent Skills Supply Chain Compromise. February 5, 2026. snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/
Koi Security. ClawHavoc: 341 Malicious Skills Found by the Bot They Were Targeting. January–February 2026. Primary audit of 2,857 ClawHub skills identifying 341 malicious entries, 335 tied to coordinated campaign.
Cisco AI Defense. Introducing the AI Agent Security Scanner for IDEs. April 2026. blogs.cisco.com
Cisco AI Defense. Skill Scanner — open-source Python-based static analysis tool for AI agent skill files. github.com/cisco-ai-defense/skill-scanner
Cisco AI Defense. Securing the AI Agent Supply Chain with Cisco's Open-Source MCP Scanner. blogs.cisco.com
OX Security. MCP marketplace poisoning research. April 2026. Nine of eleven MCP marketplaces poisoned using proof-of-concept servers.
Trend Micro. MCP server exposure findings. April 2026. 1,467 MCP servers exposed to the internet with zero authentication.
OWASP. Agentic Skills Top 10 (AST01: Malicious Skills). April 27, 2026. First formal risk taxonomy for the agent skill ecosystem.
Snyk Labs. toxicskills-goof — documented malicious skill samples with known attack payloads. github.com/snyk-labs/toxicskills-goof
Liu, Y. et al. Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems. DDIPE research across four agent frameworks and five LLMs. arXiv:2602.06547, February 2026.
Repello AI. Cisco Skill Scanner: What It Does, What It Misses, and When to Use Something Else. February 2026. repello.ai

Cherokee Schill

Horizon Accord | Agent Security | Instruction Layer | Machine Learning

The Category Error

What the Evidence Shows

Why Scanners Cannot Close This

Intent, Not Artifact

The Convergence

Sources for Verification

Horizon Accord | Anthropic | Reconstruction | Interpretation | Machine Learning.

Horizon Accord | AI Research | Structural Coherence | Fractal Seam | LLM Behavior | Machine Learning

Horizon Accord