We scanned 15,000 MCP servers. Here's where our scanner breaks down.

SpiderRating Research · 5 min read

MCP · Security Research · False Positives · Ground Truth · AI Tools · Static Analysis

We run SpiderRating, an open-source security scanner for MCP (Model Context Protocol) servers. Over the past month, we scanned 15,674 MCP servers and skills using static analysis (46 rules + YARA supply chain patterns) and built a calibration database of 10,970 manually verified findings.

Here's what surprised us -- and where we got it wrong.

Our scanner is least accurate on security tools

We grouped MCP servers by purpose and compared our scanner's accuracy. Security-themed servers -- scanners, firewalls, pentest tools, CTF challenges -- had a 2.05x higher false positive rate than the rest of the ecosystem.

| Group | FP rate | Real vulns per repo | Sample |
| --- | --- | --- | --- |
| Security-themed MCP servers | 55.5% | 1.6 | 226 FP / 181 TP across 110 repos |
| Everything else | 27.0% | 1.8 | 2,855 FP / 7,708 TP across 4,213 repos |

To be clear: security tools don't have more real vulnerabilities. They actually have slightly fewer (1.6 vs 1.8 true positives per repo). The problem is that our scanner produces far more noise on them.

Why? Security tools legitimately contain attack patterns in their source code. A scanner's detection rules look identical to the patterns it's detecting. An MCP firewall's test suite contains the exact exploit payloads it's supposed to block. Our regex-based scanner can't distinguish "code that detects SQL injection" from "code that is vulnerable to SQL injection."
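The ambiguity is easy to reproduce. Here is a minimal sketch (the rule below is illustrative, not one of our actual 46 rules): a regex that flags string concatenation into SQL matches both genuinely vulnerable code and a security tool's own signature list, because to a pattern matcher they are the same text.

```python
import re

# Hypothetical SQL-injection rule in the style of a regex-based scanner
# (illustrative only, not SpiderRating's actual rule set).
SQLI_RULE = re.compile(r"SELECT\s+.*\+\s*\w+", re.IGNORECASE)

# Genuinely vulnerable code: user input concatenated into a query.
vulnerable = 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)'

# A security tool's own detection signature -- the same text, but as data.
detector = 'INJECTION_SIGNATURES = ["SELECT * FROM users WHERE id = " + payload]'

print(SQLI_RULE.search(vulnerable) is not None)  # True -- real finding
print(SQLI_RULE.search(detector) is not None)    # True -- false positive
```

Without semantic context, both hits look identical, which is exactly why security-themed repos dominate our false positive counts.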

The extreme case: cisco-ai-defense/mcp-scanner triggered 145 false positives against only 1 true positive (99.3% FP rate). Its code is full of injection patterns -- because detecting injection patterns is what it does.

This is a fundamental limitation of static analysis, not a problem with security tools. It forced us to build a "by-design" detection system: when the scanner recognizes that a repo is a security tool, it downgrades confidence on findings that match the tool's core function rather than flagging them as vulnerabilities.
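A rough sketch of how such a downgrade can work (all names and categories here are hypothetical, simplified from what we actually ship): classify the repo's purpose from its metadata, then lower confidence on finding types the tool is expected to contain by design.

```python
# Hypothetical "by-design" downgrade sketch; the hint lists, categories,
# and function names are illustrative, not SpiderRating's actual API.
SECURITY_HINTS = ("scanner", "firewall", "pentest", "ctf", "waf")

# Finding types each tool category is *expected* to contain by design.
BY_DESIGN = {
    "scanner": {"sql_injection", "command_injection", "dangerous_eval"},
    "firewall": {"prompt_injection", "data_exfiltration"},
}

def classify_repo(name, description):
    """Guess whether a repo is a security tool from its metadata."""
    text = f"{name} {description}".lower()
    for hint in SECURITY_HINTS:
        if hint in text:
            return "scanner" if hint in ("scanner", "pentest", "ctf") else "firewall"
    return None

def adjust_confidence(repo_kind, finding_type, confidence):
    """Downgrade findings that match the tool's core function."""
    if repo_kind and finding_type in BY_DESIGN.get(repo_kind, set()):
        return "low"  # likely detection logic or test payloads, not a vuln
    return confidence

kind = classify_repo("mcp-scanner", "Scans MCP servers for injection flaws")
print(adjust_confidence(kind, "sql_injection", "high"))  # low
print(adjust_confidence(None, "sql_injection", "high"))  # high
```

The finding is kept, not dropped, so downstream consumers can still see it with the reduced confidence attached.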

The numbers

From 15,674 scanned servers:

  • 11.8% earned a RECOMMENDED verdict (safe to use without caveats)
  • 49.0% were CONSIDER (usable, but with risks)
  • 25.5% were ALLOW_WITH_RISK
  • 13.6% were NOT_RECOMMENDED
  • Average security score: 5.28/10 (median 5.71)

What we found (calibrated true positive rates)

Not all scanner rules are equally reliable. We tracked every finding against manual verification to get real accuracy numbers:

| Vulnerability | TP rate | Verified sample |
| --- | --- | --- |
| Path traversal | 76.1% | 67 findings |
| SSRF | 66.2% | 859 findings |
| SQL injection | 66.0% | 1,024 findings |
| Child process injection | 50.3% | 1,107 findings |
| Prototype pollution | 50.2% | 214+216 findings |
| Dangerous eval | 44.5% | 654 findings |
| Timing attack | 36.8% | 76 findings |
| Command injection | 32.1% | 327 findings |
| Hardcoded credential | 2.7% | 263 findings |
| Prompt injection | 1.2% | 168 findings |
| Data exfiltration | 0.5% | 217 findings |

The bottom three are basically broken. Most "hardcoded credential" findings are mock data in tutorials. Most "data exfiltration" hits are documentation files. We've since added confidence markers to every risk flag so downstream consumers know which findings to trust.
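The confidence markers fall out of the verification data directly. A minimal sketch of the idea (the field names and the 0.6/0.3 thresholds are ours to illustrate, not published cutoffs): tally manual TP/FP verdicts per rule and map each rule's observed TP rate to a marker.

```python
# Sketch: derive a per-rule confidence marker from verified observations.
# Observations and thresholds are illustrative, not the real calibration DB.
from collections import Counter

# (rule, verified_as_true_positive) pairs from manual review
observations = [
    ("path_traversal", True), ("path_traversal", True), ("path_traversal", False),
    ("hardcoded_credential", False), ("hardcoded_credential", False),
    ("hardcoded_credential", False),
]

totals, true_pos = Counter(), Counter()
for rule, is_tp in observations:
    totals[rule] += 1
    true_pos[rule] += is_tp  # True counts as 1

def confidence_marker(rule):
    """Map a rule's observed TP rate to a coarse confidence tier."""
    tp_rate = true_pos[rule] / totals[rule]
    if tp_rate >= 0.6:
        return "high"
    if tp_rate >= 0.3:
        return "medium"
    return "low"

print(confidence_marker("path_traversal"))       # high (2/3)
print(confidence_marker("hardcoded_credential")) # low (0/3)
```

Every risk flag then carries its marker, so a consumer can filter out the rules we know are noisy without us silently deleting data.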

We fixed what we found

We didn't just scan -- we submitted fix PRs to upstream projects. Six have been merged so far:

  • [upstash/context7#2235](https://github.com/upstash/context7/pull/2235) -- path traversal (critical) + command injection (high). Context7 has 7,700+ stars.
  • [topoteretes/cognee#2423](https://github.com/topoteretes/cognee/pull/2423) -- command injection in API handler.
  • [agentic-community/mcp-gateway-registry#655](https://github.com/agentic-community/mcp-gateway-registry/pull/655) -- command injection (critical) in the gateway registry.
  • Plus 3 more across moeru-ai/airi, Flux159/mcp-server-kubernetes, and others.

Our overall merge rate is 35.3% (6/17). We treat each merged PR as ground truth validation that our scanner found a real issue, and each rejection as calibration data for reducing false positives.

What we got wrong

We're being transparent about our limitations:

  1. Three scanner rules are essentially broken (hardcoded_credential at 2.7% TP, data_exfiltration at 0.5%, prompt_injection at 1.2%). We've marked these as low-confidence rather than removing them, so the data stays complete but flagged.
  2. "Score" doesn't mean "safe." A high score means our scanner didn't find issues, not that there aren't any. We only do static analysis -- runtime behavior, supply chain depth, and authentication bypass are blind spots.
  3. Category bias exists. Database MCP servers get more SQL injection flags, shell tools get more command injection flags. This is expected but inflates findings in certain categories.
  4. Our description quality scoring (mean 2.61/10) may be too harsh. The 7-dimension rubric penalizes terse descriptions that are technically correct but don't meet our "AI-agent-friendly" standard. This is a calibration issue we're still tuning.

The data

Everything is public.

We're considering open-sourcing the full calibration dataset (10,970 verified TP/FP observations with per-category accuracy rates). If that would be useful to you, let us know.

---

*Built by a small team that got frustrated with installing MCP servers without knowing if they were safe. We scan daily and publish everything publicly. No signup required for any of the above.*