State of MCP Security 2026: We Scanned 15,923 AI Tools. Here's What We Found.

SpiderRating Research··8 min read
MCPSecurityResearchAI ToolsOpenClawSkillsVulnerability

> TL;DR: We scanned every publicly available MCP server and OpenClaw skill — 15,923 in total. 36% of MCP servers scored F (failing). 42 skills confirmed malicious (0.4%), with 552 initially flagged. Token leakage is the #1 vulnerability, found in 757 servers. Only 2% earned a B grade or higher.

---

The Dataset

SpiderRating analyzed 15,923 AI tools across two ecosystems: - 5,725 MCP servers (Model Context Protocol — the standard for connecting AI agents to external tools) - 10,198 OpenClaw/ClawHub skills (agent behavior definitions for Claude, Cursor, Windsurf)

Each tool was rated on three dimensions: - Description Quality (0-10): Can an AI agent understand what the tool does? - Security (0-10): Does the tool have exploitable vulnerabilities? - Metadata (0-10): Documentation, licensing, versioning

Combined into a SpiderScore (0-10) and letter grade (A-F).

This is the largest independent security analysis of the MCP/AI tool ecosystem to date.

---

Key Findings

1. Most AI Tools Are Mediocre — Only 2% Score B or Higher

GradeMCP ServersSkillsWhat It Means
A (≥9.0)0 (0%)0 (0%)No tool meets "exemplary" standards
B (7.0-8.9)116 (2%)95 (1%)Production-ready with good practices
C (5.0-6.9)1,995 (35%)9,050 (89%)Adequate but room for improvement
D (3.0-4.9)1,546 (27%)1,052 (10%)Significant quality/security gaps
F (<3.0)2,068 (36%)1 (0%)Failing — serious issues

Zero tools scored A. The ceiling for MCP servers is 8.5/10; for skills it's 7.5/10.

MCP servers have a bimodal distribution: you're either decent (C) or terrible (F). Skills cluster in the middle (89% C-grade).

2. Token Leakage Is the #1 Vulnerability

We found 32,691 security findings across the ecosystem.

Top 10 Vulnerabilities in MCP Servers:

RankVulnerabilityServers AffectedFindings
1Token Leakage757 (13%)6,632
2Command Injection (child_process)269 (5%)1,007
3SQL Injection105 (2%)787
4Path Traversal244 (4%)761
5Prototype Pollution145 (3%)489
6Hardcoded Credentials163 (3%)389
7Secret Leakage (metadata)114 (2%)376
8Command Injection (os/subprocess)112 (2%)263
9Path Traversal (TypeScript)169 (3%)492
10Timing Attack4 (0.07%)9

Token leakage alone accounts for 20% of all findings. API keys, auth tokens, and secrets are being exposed through MCP tool outputs, logged to files, or included in error messages.

3. 36% of MCP Servers Score F

More than a third of MCP servers are fundamentally unsafe: - Average MCP score: 4.11/10 (between D and C) - Average skill score: 5.91/10 (solid C)

Why MCP servers score worse: - Description quality crisis: avg 3.13/10 — most servers don't tell AI agents what their tools do, when to use them, or what parameters mean - Many are proof-of-concept or abandoned projects with no documentation

4. 552 Skills Flagged, 42 Confirmed Malicious

We used a two-pass security analysis: 1. Automated Threat Scanner — pattern matching for known malicious behaviors 2. LLM Verification — Claude Haiku reviews each finding to distinguish "security tool describing attacks" from "malicious skill executing attacks"

Results: - 552 skills flagged with critical security issues - 42 confirmed malicious after LLM verification (0.4% of ecosystem) - Common attack patterns: prompt injection override, invisible Unicode characters, credential exfiltration - 97% of automated "critical" findings were false positives — mostly legitimate security tools whose descriptions triggered keyword-based detection

5. The Description Quality Crisis

AI agents can only use tools they understand. Our Description Quality score measures whether a tool's description tells the AI: - What the tool does (action verb) - When to use it (scenario trigger) - What parameters mean (param docs) - What errors to expect (error guidance)

SignalCoverage
Has action verb~60%
Has scenario trigger~3%
Has param documentation~45%
Has error guidance~8%

98% of tools lack a scenario trigger — they don't tell the AI *when* to use them. This means AI agents frequently choose the wrong tool, leading to failures users blame on "AI being dumb" when the real problem is tool documentation.

---

Scoring Methodology

SpiderRating uses a three-layer scoring model:

Overall = Description × 0.45 + Security × 0.35 + Metadata × 0.20

MCP Servers: Description (38%) + Security (34%) + Metadata (28%) Skills: Description (45%) + Security (35%) + Metadata (20%)

  • Description: Tool description quality, parameter docs, error guidance, disambiguation
  • Security: Static analysis (Semgrep taint + regex), supply chain checks, runtime exposure
  • Metadata: README, license, version history, community signals

Hard constraints apply: certain critical issues (e.g., no tools detected, active malware indicators) force a grade cap regardless of score.

All scans are fully offline — no code is sent to external services. The scanner (spidershield) is open source under MIT.

---

What This Means for Developers

If you build MCP servers: 1. Write scenario triggers in your tool descriptions — tell AI agents *when* to use each tool 2. Don't log tokens — use structured error handling that strips secrets 3. Use parameterized queries — SQL injection is the #3 vulnerability 4. Add a README and license — it's 20% of your score

If you install AI tools: 1. Check the SpiderScore before installing — anything below C (5.0) has known issues 2. Be cautious with skills rated critical — 0.4% are confirmed malicious 3. Prefer tools with B grade — they've demonstrated security best practices

If you're a platform (ClawHub, Smithery, Glama): 1. Integrate trust scores at the point of installation — users need signal before they install 2. Flag malicious skills — we've identified 42 confirmed, 552 suspected 3. Require scenario triggers — it's the single biggest quality improvement you can drive

---

About This Research

This analysis was conducted by SpiderRating, an MCP ecosystem security rating platform. We maintain a continuously updated database of MCP server and skill security assessments.

  • Scanner: spidershield (open source, MIT)
  • Data: 15,923 tools, 78,849 tool descriptions, 32,691 security findings
  • Precision: 93.6% calibrated accuracy (validated against 12,700+ ground-truth observations)
  • Methodology: Three-layer scoring (description × security × metadata) with LLM-verified threat assessment

Data updated daily. Full methodology and raw data available upon request.

---

*Published: March 2026 | SpiderRating Research*

Related reads: - 98% of tools missing usage guidance — the description quality deep dive - How We Score MCP Servers — the full scoring model explained - OpenClaw evaluation: Grade B — a real-world case study - See the most secure servers in the ecosystem - Scan your own server for free