State of MCP Security 2026: We Scanned 15,923 AI Tools. Here's What We Found.
> TL;DR: We scanned every publicly available MCP server and OpenClaw skill — 15,923 in total. 36% of MCP servers scored F (failing). 42 skills confirmed malicious (0.4%), with 552 initially flagged. Token leakage is the #1 vulnerability, found in 757 servers. Only 2% earned a B grade or higher.
---
The Dataset
SpiderRating analyzed 15,923 AI tools across two ecosystems:

- 5,725 MCP servers (Model Context Protocol, the standard for connecting AI agents to external tools)
- 10,198 OpenClaw/ClawHub skills (agent behavior definitions for Claude, Cursor, Windsurf)
Each tool was rated on three dimensions:

- Description Quality (0-10): Can an AI agent understand what the tool does?
- Security (0-10): Does the tool have exploitable vulnerabilities?
- Metadata (0-10): Documentation, licensing, versioning
Combined into a SpiderScore (0-10) and letter grade (A-F).
This is the largest independent security analysis of the MCP/AI tool ecosystem to date.
---
Key Findings
1. Most AI Tools Are Mediocre — Only 2% Score B or Higher
| Grade | MCP Servers | Skills | What It Means |
|---|---|---|---|
| A (≥9.0) | 0 (0%) | 0 (0%) | No tool meets "exemplary" standards |
| B (7.0-8.9) | 116 (2%) | 95 (1%) | Production-ready with good practices |
| C (5.0-6.9) | 1,995 (35%) | 9,050 (89%) | Adequate but room for improvement |
| D (3.0-4.9) | 1,546 (27%) | 1,052 (10%) | Significant quality/security gaps |
| F (<3.0) | 2,068 (36%) | 1 (0%) | Failing — serious issues |
Zero tools scored A. The ceiling for MCP servers is 8.5/10; for skills it's 7.5/10.
MCP servers have a bimodal distribution: you're either decent (C) or terrible (F). Skills cluster in the middle (89% C-grade).
2. Token Leakage Is the #1 Vulnerability
We found 32,691 security findings across the ecosystem.
Top 10 Vulnerabilities in MCP Servers:
| Rank | Vulnerability | Servers Affected | Findings |
|---|---|---|---|
| 1 | Token Leakage | 757 (13%) | 6,632 |
| 2 | Command Injection (child_process) | 269 (5%) | 1,007 |
| 3 | SQL Injection | 105 (2%) | 787 |
| 4 | Path Traversal | 244 (4%) | 761 |
| 5 | Prototype Pollution | 145 (3%) | 489 |
| 6 | Hardcoded Credentials | 163 (3%) | 389 |
| 7 | Secret Leakage (metadata) | 114 (2%) | 376 |
| 8 | Command Injection (os/subprocess) | 112 (2%) | 263 |
| 9 | Path Traversal (TypeScript) | 169 (3%) | 492 |
| 10 | Timing Attack | 4 (0.07%) | 9 |
Token leakage alone accounts for 20% of all findings. API keys, auth tokens, and secrets are being exposed through MCP tool outputs, logged to files, or included in error messages.
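One common fix is to redact likely secrets before anything reaches logs, error messages, or tool output. A minimal sketch (the patterns and function names are illustrative, not exhaustive, and not SpiderRating's scanner rules):

```typescript
// Minimal sketch: scrub likely secrets from text before it is logged or
// returned to the model. Patterns here are illustrative, not exhaustive.
const SECRET_PATTERNS: RegExp[] = [
  /sk-[A-Za-z0-9]{20,}/g,        // OpenAI-style API keys
  /ghp_[A-Za-z0-9]{36}/g,        // GitHub personal access tokens
  /Bearer\s+[A-Za-z0-9._-]+/g,   // Authorization header values
];

function redactSecrets(text: string): string {
  return SECRET_PATTERNS.reduce(
    (out, pattern) => out.replace(pattern, "[REDACTED]"),
    text,
  );
}

// Wrap logging so raw errors never carry credentials verbatim.
function logSafely(err: unknown): void {
  const message = err instanceof Error ? err.message : String(err);
  console.error(redactSecrets(message));
}
```

The same filter should sit in front of every exit point: stdout, log files, and the error strings an MCP server returns to the agent.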
3. 36% of MCP Servers Score F
More than a third of MCP servers are fundamentally unsafe:

- Average MCP server score: 4.11/10 (between D and C)
- Average skill score: 5.91/10 (solid C)
Why MCP servers score worse:

- Description quality crisis: an average of 3.13/10; most servers don't tell AI agents what their tools do, when to use them, or what parameters mean
- Many are proof-of-concept or abandoned projects with no documentation
4. 552 Skills Flagged, 42 Confirmed Malicious
We used a two-pass security analysis:

1. Automated Threat Scanner: pattern matching for known malicious behaviors
2. LLM Verification: Claude Haiku reviews each finding to distinguish a "security tool describing attacks" from a "malicious skill executing attacks"
Results:

- 552 skills flagged with critical security issues
- 42 confirmed malicious after LLM verification (0.4% of the ecosystem)
- Common attack patterns: prompt-injection overrides, invisible Unicode characters, credential exfiltration
- 97% of automated "critical" findings were false positives, mostly legitimate security tools whose descriptions triggered keyword-based detection
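The first pass can be sketched as keyword matching over a skill's text. The patterns below are illustrative stand-ins, not SpiderRating's actual rules; they also show why a second pass is needed, since keyword matching alone cannot tell a tool that *describes* prompt injection from a skill that *performs* it:

```typescript
// First-pass scanner sketch: flag skills whose text matches known attack
// phrasing or payload artifacts. Patterns are illustrative only.
const THREAT_PATTERNS: { name: string; pattern: RegExp }[] = [
  { name: "prompt-injection", pattern: /ignore (all )?previous instructions/i },
  { name: "invisible-unicode", pattern: /[\u200B-\u200F\u2060\uFEFF]/ },
  { name: "credential-exfiltration", pattern: /curl .*(api[_-]?key|token)/i },
];

function firstPassFlags(skillText: string): string[] {
  return THREAT_PATTERNS
    .filter(({ pattern }) => pattern.test(skillText))
    .map(({ name }) => name);
}
```

A skill titled "Prompt Injection Defense Guide" trips the same patterns as a live attack, which is exactly the false-positive class the LLM pass filters out.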
5. The Description Quality Crisis
AI agents can only use tools they understand. Our Description Quality score measures whether a tool's description tells the AI:

- What the tool does (action verb)
- When to use it (scenario trigger)
- What parameters mean (param docs)
- What errors to expect (error guidance)
| Signal | Coverage |
|---|---|
| Has action verb | ~60% |
| Has scenario trigger | ~3% |
| Has param documentation | ~45% |
| Has error guidance | ~8% |
98% of tools lack a scenario trigger — they don't tell the AI *when* to use them. This means AI agents frequently choose the wrong tool, leading to failures users blame on "AI being dumb" when the real problem is tool documentation.
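For contrast, here is a hypothetical MCP tool definition written both ways. The tool name, description text, and schema are invented for illustration; the second version carries all four signals measured above:

```typescript
// Hypothetical tool definitions; names and schema are illustrative.
const poor = {
  name: "query_db",
  description: "Queries the database.", // no trigger, no param docs, no error guidance
};

const good = {
  name: "query_db",
  description:
    "Runs a read-only SQL query against the analytics database. " +       // action verb
    "Use when the user asks for metrics, counts, or historical trends. " + // scenario trigger
    "Returns rows as JSON; raises a descriptive error on invalid SQL.",    // error guidance
  inputSchema: {
    type: "object",
    properties: {
      sql: { type: "string", description: "A single SELECT statement." },  // param docs
    },
    required: ["sql"],
  },
};
```

Given `good`, an agent knows to reach for this tool on "how many signups last week?" and to avoid it for writes; given `poor`, it can only guess.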
---
Scoring Methodology
SpiderRating uses a three-layer scoring model:
Overall = Description × w_D + Security × w_S + Metadata × w_M

The weights differ by ecosystem:

- MCP servers: Description 38% + Security 34% + Metadata 28%
- Skills: Description 45% + Security 35% + Metadata 20%
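The weighted combination is straightforward to sketch (weights taken from the text above; the function name is illustrative):

```typescript
// Per-ecosystem weights from the scoring model described above.
const WEIGHTS = {
  mcp:   { description: 0.38, security: 0.34, metadata: 0.28 },
  skill: { description: 0.45, security: 0.35, metadata: 0.20 },
};

// Combine the three 0-10 dimension scores into an overall 0-10 SpiderScore.
function spiderScore(
  kind: "mcp" | "skill",
  description: number,
  security: number,
  metadata: number,
): number {
  const w = WEIGHTS[kind];
  return description * w.description + security * w.security + metadata * w.metadata;
}
```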
- Description: Tool description quality, parameter docs, error guidance, disambiguation
- Security: Static analysis (Semgrep taint + regex), supply chain checks, runtime exposure
- Metadata: README, license, version history, community signals
Hard constraints apply: certain critical issues (e.g., no tools detected, active malware indicators) force a grade cap regardless of score.
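Using the grade thresholds from the findings table, the cap mechanics can be sketched like this (the exact clamping logic is an assumption; only the thresholds come from the article):

```typescript
// Grade thresholds from the findings table: A >=9.0, B 7.0-8.9, C 5.0-6.9,
// D 3.0-4.9, F <3.0. The hard-cap behavior sketched here is assumed.
type Grade = "A" | "B" | "C" | "D" | "F";

function baseGrade(score: number): Grade {
  if (score >= 9.0) return "A";
  if (score >= 7.0) return "B";
  if (score >= 5.0) return "C";
  if (score >= 3.0) return "D";
  return "F";
}

const GRADE_ORDER: Grade[] = ["A", "B", "C", "D", "F"];

// A hard constraint (e.g. active malware indicators) caps the grade: the
// final grade is whichever of the two is worse.
function applyCap(grade: Grade, cap: Grade | null): Grade {
  if (cap === null) return grade;
  return GRADE_ORDER.indexOf(grade) > GRADE_ORDER.indexOf(cap) ? grade : cap;
}
```

So a server that scores 7.8 (B) but ships active malware indicators still lands at F, no matter how good its descriptions are.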
All scans are fully offline — no code is sent to external services. The scanner (spidershield) is open source under MIT.
---
What This Means for Developers
If you build MCP servers:

1. Write scenario triggers in your tool descriptions: tell AI agents *when* to use each tool
2. Don't log tokens: use structured error handling that strips secrets
3. Use parameterized queries: SQL injection is the #3 vulnerability
4. Add a README and license: metadata is 28% of an MCP server's score
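The parameterized-query point is worth making concrete. In a node-postgres-style API (the table name and function are illustrative), user input travels in the values array, never in the SQL string itself:

```typescript
// Sketch: build a parameterized query (node-postgres-style placeholders).
// The SQL text is a constant; user input is passed as data, never concatenated.
function buildQuery(userId: string) {
  return {
    text: "SELECT * FROM sessions WHERE user_id = $1",
    values: [userId],
  };
}

// Even a malicious value stays inert data instead of becoming SQL:
const q = buildQuery("1; DROP TABLE sessions;--");
```

Here `q.text` never contains the payload; the driver binds `q.values[0]` as a value, so the `DROP TABLE` fragment is just a weird user ID.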
If you install AI tools:

1. Check the SpiderScore before installing: anything below C (5.0) has known issues
2. Be cautious with skills rated critical: 0.4% are confirmed malicious
3. Prefer tools with a B grade: they've demonstrated security best practices
If you're a platform (ClawHub, Smithery, Glama):

1. Integrate trust scores at the point of installation: users need the signal before they install
2. Flag malicious skills: we've identified 42 confirmed and 552 suspected
3. Require scenario triggers: it's the single biggest quality improvement you can drive
---
About This Research
This analysis was conducted by SpiderRating, an MCP ecosystem security rating platform. We maintain a continuously updated database of MCP server and skill security assessments.
- Scanner: spidershield (open source, MIT)
- Data: 15,923 tools, 78,849 tool descriptions, 32,691 security findings
- Precision: 93.6% calibrated accuracy (validated against 12,700+ ground-truth observations)
- Methodology: Three-layer scoring (description × security × metadata) with LLM-verified threat assessment
Data updated daily. Full methodology and raw data available upon request.
---
*Published: March 2026 | SpiderRating Research*
Related reads:

- 98% of tools missing usage guidance: the description quality deep dive
- How We Score MCP Servers: the full scoring model explained
- OpenClaw evaluation, Grade B: a real-world case study
- See the most secure servers in the ecosystem
- Scan your own server for free