250 security assertions. 20 real-world coding prompts. Four conditions. Full methodology and results below.
| Condition | Passed | Failed | Pass Rate |
|---|---|---|---|
| No Rules (baseline) | 130 | 120 | 52.0% |
| Free OWASP Skill (agamm) | 161 | 89 | 64.4% |
| Free Snippet (ours) | 175 | 75 | 70.0% |
| Full Secure Code Skill Pack | 232 | 18 | 92.8% |
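The pass rates in the table are simply passed assertions divided by the 250 total. As a quick sanity check, the sketch below recomputes them from the passed/failed counts; the function name `pass_rate` and the `RESULTS` dictionary are illustrative and not part of the benchmark tooling.

```python
# Passed/failed counts copied from the summary table above.
RESULTS = {
    "No Rules (baseline)": (130, 120),
    "Free OWASP Skill (agamm)": (161, 89),
    "Free Snippet (ours)": (175, 75),
    "Full Secure Code Skill Pack": (232, 18),
}


def pass_rate(passed: int, failed: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    total = passed + failed
    return round(100 * passed / total, 1)


for condition, (passed, failed) in RESULTS.items():
    total = passed + failed
    print(f"{condition}: {passed}/{total} = {pass_rate(passed, failed)}%")
```

Every condition sums to the same 250 assertions, so the rates are directly comparable.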
Transparent, reproducible testing against concrete security properties.
20 real-world coding prompts: contact forms, REST APIs, file upload handlers, RAG pipelines, auth systems, MCP server tools, agent file managers, multi-agent orchestrators, and more. Each prompt is the kind of thing a developer would actually ask an AI coding assistant to build.
Each prompt has 12 to 16 security assertions: concrete, verifiable properties checked in the generated code. Every assertion is binary pass/fail.
Each prompt was run under all four conditions. The generated code was inspected against its assertion set. 250 total assertions across 20 prompts (evals 15, 17, 18, and 19 carry extra assertions covering the newer LLM-hardening rules). Results are deterministic — every assertion has clear pass/fail criteria.
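The page does not say how assertions are checked mechanically, so the following is only a sketch of what a binary pass/fail assertion set could look like. The `Assertion` class, the `score` function, and both regex-style checks are hypothetical; real checks may be manual review or far more robust analysis.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Assertion:
    """One binary security property checked against generated source code."""
    name: str
    check: Callable[[str], bool]  # returns True when the property holds


def score(code: str, assertions: list[Assertion]) -> tuple[int, int]:
    """Return (passed, total) for one piece of generated code."""
    passed = sum(1 for a in assertions if a.check(code))
    return passed, len(assertions)


# Two hypothetical assertions for a SQL-backed search endpoint.
ASSERTIONS = [
    Assertion(
        "no f-string passed directly to execute()",
        lambda src: not re.search(r"execute\(\s*f['\"]", src),
    ),
    Assertion(
        "query uses a parameter placeholder",
        lambda src: "?" in src or "%s" in src,
    ),
]

SAFE = 'cur.execute("SELECT * FROM items WHERE name = ?", (q,))'
UNSAFE = 'cur.execute(f"SELECT * FROM items WHERE name = \'{q}\'")'

print(score(SAFE, ASSERTIONS))
print(score(UNSAFE, ASSERTIONS))
```

Text matching like this is easy to fool, which is one reason to treat it as an illustration of the scoring shape (passed out of total per eval) rather than of the inspection method itself.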
Every release is benchmarked. v1.0.0 scored 91.1% on 16 evals (192 assertions). v1.1.0 added 4 harder agent security evals, expanding to 20 evals (240 assertions), scoring 84.6%. v1.2.0 rewrote agent rules from policy-level to implementation-specific, jumping to 92.5%. v1.3.0 added 10 new assertions for LLM output hardening (URL validation, PII scanning, structured output, invisible Unicode stripping, expanded secrets scanning), scoring 92.8% on the expanded 250-assertion set.
The Skill Pack went through 5 iterations of eval-driven refinement. Each version benchmarked against the last. Rules that didn't improve scores were cut. Rules that introduced regressions were rewritten.
How each condition performed on every test prompt. Scores show assertions passed out of the eval's total.
| # | Eval | Category | Baseline | Agamm | Snippet | Full Pack |
|---|---|---|---|---|---|---|
| 1 | Next.js server action (contact form) | Web | 8/12 | 10/12 | 10/12 | 12/12 |
| 2 | Django REST API (business details) | Web | 10/12 | 11/12 | 11/12 | 10/12 |
| 3 | React component (user reviews with HTML) | Web | 9/12 | 11/12 | 12/12 | 12/12 |
| 4 | Next.js LLM product descriptions | LLM | 8/12 | 11/12 | 12/12 | 12/12 |
| 5 | Express MCP server tool (customer search) | Agentic | 6/12 | 11/12 | 11/12 | 10/12 |
| 6 | Dependency vetting (markdown + PDF libs) | Supply Chain | 6/12 | 6/12 | 8/12 | 9/12 |
| 7 | FastAPI RAG pipeline (Pinecone + Claude) | LLM | 5/12 | 9/12 | 10/12 | 11/12 |
| 8 | Flask search app (SQLite + displayed query) | Web | 4/12 | 6/12 | 11/12 | 11/12 |
| 9 | Node.js agent file manager | Agentic | 6/12 | 9/12 | 12/12 | 12/12 |
| 10 | Express auth system (register/login/reset) | Web | 9/12 | 11/12 | 12/12 | 12/12 |
| 11 | Django file upload (images + PDFs) | Web | 9/12 | 12/12 | 10/12 | 12/12 |
| 12 | Node.js XML product import | Web | 8/12 | 12/12 | 12/12 | 12/12 |
| 13 | Flask SSRF + open redirect | Web | 9/12 | 8/12 | 9/12 | 12/12 |
| 14 | Django multi-tenant SaaS (IDOR) | Web | 9/12 | 11/12 | 2/12 | 10/12 |
| 15 | Next.js customer support chatbot | LLM (+4 assertions in v1.3) | 9/16 | 9/16 | 6/16 | 14/16 |
| 16 | Node.js project setup (CI/CD, Docker, SRI) | Supply Chain | 5/12 | 6/12 | 8/12 | 12/12 |
| 17 | FastAPI AI memory service | Agentic (added in v1.1, +2 assertions in v1.3) | 3/14 | 4/14 | 4/14 | 14/14 |
| 18 | Node.js multi-agent orchestrator | Agentic (added in v1.1, +2 assertions in v1.3) | 3/14 | 3/14 | 1/14 | 13/14 |
| 19 | Django AI customer service platform | Agentic (added in v1.1, +2 assertions in v1.3) | 3/14 | 0/14 | 7/14 | 10/14 |
| 20 | Go AI coding gateway | Agentic (added in v1.1) | 1/12 | 1/12 | 7/12 | 12/12 |
| | Total (20 evals) | | 130/250 | 161/250 | 175/250 | 232/250 |
| | Original 16 evals only | | 120/196 | 153/196 | 156/196 | 183/196 |
| | Agent evals (17–20) | | 10/54 | 8/54 | 19/54 | 49/54 |
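The summary rows can be rechecked by re-adding the per-eval scores. The sketch below does that and also ranks evals by the gap between the Full Pack and the free snippet; the tuple layout and the names `ROWS`, `totals`, and `largest_gaps` are illustrative choices, while the numbers are copied from the table.

```python
# (eval number, baseline, agamm, snippet, full pack, assertions in eval)
ROWS = [
    (1, 8, 10, 10, 12, 12), (2, 10, 11, 11, 10, 12), (3, 9, 11, 12, 12, 12),
    (4, 8, 11, 12, 12, 12), (5, 6, 11, 11, 10, 12), (6, 6, 6, 8, 9, 12),
    (7, 5, 9, 10, 11, 12), (8, 4, 6, 11, 11, 12), (9, 6, 9, 12, 12, 12),
    (10, 9, 11, 12, 12, 12), (11, 9, 12, 10, 12, 12), (12, 8, 12, 12, 12, 12),
    (13, 9, 8, 9, 12, 12), (14, 9, 11, 2, 10, 12), (15, 9, 9, 6, 14, 16),
    (16, 5, 6, 8, 12, 12), (17, 3, 4, 4, 14, 14), (18, 3, 3, 1, 13, 14),
    (19, 3, 0, 7, 10, 14), (20, 1, 1, 7, 12, 12),
]


def totals(rows):
    """Sum each score column: (baseline, agamm, snippet, full pack, assertions)."""
    columns = list(zip(*rows))
    return tuple(sum(col) for col in columns[1:])


def largest_gaps(rows, n):
    """Return the n eval numbers where Full Pack beats Snippet by the most."""
    ranked = sorted(rows, key=lambda r: r[4] - r[3], reverse=True)
    return [r[0] for r in ranked[:n]]


print("All 20 evals:", totals(ROWS))
print("Original 16: ", totals([r for r in ROWS if r[0] <= 16]))
print("Agent 17-20: ", totals([r for r in ROWS if r[0] >= 17]))
print("Largest Full Pack vs Snippet gaps:", largest_gaps(ROWS, 4))
```

The per-eval rows add up to the published totals, and the widest gaps fall on evals 18, 17, 14, and 15.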
The evals where the full Skill Pack outperforms the free snippet by the largest margin.
Eval 17 (FastAPI AI memory service): bleach.clean() sanitization, PII/credential regex scanning, Redis key isolation schemes, HMAC-SHA256 integrity verification, and specific memory limits (200 entries, 1 MB per user). The full pack gets a perfect 14/14.

Eval 14 (Django multi-tenant SaaS): the full pack applies queryset.filter(tenant=request.tenant) at every call site, validates direct object references against the caller's access scope, and blocks cross-tenant data access at the ORM layer. The snippet falls back to generic auth rules that don't catch tenant boundary violations.

The benchmark spans traditional web security, LLM application safety, agentic AI guardrails, and supply chain security.
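To make two of the memory-service controls named above more concrete, here is a minimal sketch of HMAC-SHA256 integrity verification combined with the per-user limits (200 entries, 1 MB). It is not the Skill Pack's rule text or the generated code: `MemoryStore`, `sign`, and `verify` are hypothetical names, the store is an in-process dictionary rather than Redis, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
import hashlib
import hmac
import json

MAX_ENTRIES = 200          # per-user entry cap from the text
MAX_BYTES = 1024 * 1024    # per-user size cap; 1 MB read as 1 MiB (assumption)


def sign(secret: bytes, payload: bytes) -> str:
    """Return the hex HMAC-SHA256 tag for a serialized entry."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(secret: bytes, payload: bytes, tag: str) -> bool:
    """Constant-time comparison of the expected and stored tags."""
    return hmac.compare_digest(sign(secret, payload), tag)


class MemoryStore:
    """Hypothetical in-process store; a real service would likely use Redis."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._data: dict[str, list[tuple[bytes, str]]] = {}

    def add(self, user_id: str, entry: dict) -> None:
        payload = json.dumps(entry, sort_keys=True).encode("utf-8")
        entries = self._data.setdefault(user_id, [])
        if len(entries) >= MAX_ENTRIES:
            raise ValueError("per-user entry limit reached")
        used = sum(len(stored) for stored, _ in entries)
        if used + len(payload) > MAX_BYTES:
            raise ValueError("per-user size limit reached")
        entries.append((payload, sign(self._secret, payload)))

    def load(self, user_id: str) -> list[dict]:
        result = []
        for payload, tag in self._data.get(user_id, []):
            # Reject entries that were modified after they were written.
            if not verify(self._secret, payload, tag):
                raise ValueError("integrity check failed")
            result.append(json.loads(payload))
        return result
```

Keying the data by `user_id` also hints at the isolation idea: one user's entries are never read or counted when serving another user.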
| Version | Evals | Assertions | Pass Rate |
|---|---|---|---|
| v1.3.0 (current) | 20 | 250 | 92.8% |
| v1.2.0 | 20 | 240 | 92.5% |
| v1.1.0 | 20 | 240 | 84.6% |
| v1.0.0 | 16 | 192 | 91.1% |
| Iteration 4 (pre-release) | 9 | 108 | 80.6% |
| Iteration 3 (pre-release) | 9 | 108 | 78.7% |
v1.1.0 added 4 harder agent security evals (240 assertions total). v1.2.0 rewrote agent rules from policy-level to implementation-specific with concrete code patterns, pushing agent evals from 64.6% to 89.6% and overall to 92.5%. v1.3.0 added LLM output hardening rules (URL validation, PII scanning, structured output schema validation, invisible Unicode stripping, expanded secrets scanning, runtime prompt injection classifiers) plus 10 new assertions covering them, holding at 92.8% on the expanded 250-assertion set.
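Two of the v1.3.0 hardening rules, invisible Unicode stripping and URL validation, lend themselves to a short illustration. The sketch below is one plausible reading of those rules, not the Skill Pack's actual text: the function names and the `ALLOWED_HOSTS` allowlist are hypothetical, and it assumes "invisible" means Unicode format characters (general category `Cf`).

```python
import unicodedata
from urllib.parse import urlsplit

# Hypothetical allowlist; a real deployment would supply its own hosts.
ALLOWED_HOSTS = {"example.com", "docs.example.com"}


def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf) from LLM output.

    This covers zero-width spaces and joiners, bidirectional overrides,
    and tag characters.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def is_allowed_url(url: str) -> bool:
    """Accept only https URLs whose host is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:  # malformed input, e.g. an unclosed IPv6 bracket
        return False
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


print(strip_invisible("pa\u200bss\u202eword"))
print(is_allowed_url("https://example.com/pricing"))
print(is_allowed_url("https://example.com@evil.test/"))
```

Removing every `Cf` character also removes the zero-width joiner used in some emoji sequences, so an application that renders emoji may want a narrower filter. Checking `hostname` rather than searching the raw string is what rejects the `example.com@evil.test` form, where the trusted name is only the userinfo part.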
Drop-in rules for Claude, Cursor, Copilot, Windsurf, and ChatGPT.