We scanned 15,000 MCP servers. Here's where our scanner breaks down.

SpiderRating Research · 5 min read

MCP · Security Research · False Positives · Ground Truth · AI Tools · Static Analysis

We run SpiderRating, an open-source security scanner for MCP (Model Context Protocol) servers. Over the past month, we scanned 15,674 MCP servers and skills using static analysis (46 rules + YARA supply chain patterns) and built a calibration database of 10,970 manually verified findings.

Here's what surprised us -- and where we got it wrong.

Our scanner is least accurate on security tools

We grouped MCP servers by purpose and compared our scanner's accuracy. Security-themed servers -- scanners, firewalls, pentest tools, CTF challenges -- had a 2.05x higher false positive rate than the rest of the ecosystem.

| Group | FP rate | Real vulns per repo | Sample |
| --- | --- | --- | --- |
| Security-themed MCP servers | 55.5% | 1.6 | 226 FP / 181 TP across 110 repos |
| Everything else | 27.0% | 1.8 | 2,855 FP / 7,708 TP across 4,213 repos |

To be clear: security tools don't have more real vulnerabilities. They actually have slightly fewer (1.6 vs 1.8 true positives per repo). The problem is that our scanner produces far more noise on them.

Why? Security tools legitimately contain attack patterns in their source code. A scanner's detection rules look identical to the patterns it's detecting. An MCP firewall's test suite contains the exact exploit payloads it's supposed to block. Our regex-based scanner can't distinguish "code that detects SQL injection" from "code that is vulnerable to SQL injection."
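The ambiguity is easy to reproduce. Here is a minimal sketch (the rule below is illustrative, not one of our actual 46 rules): a regex that flags string concatenation into SQL matches both genuinely vulnerable code and a security tool's own signature list, because to a pattern matcher they are the same text.

```python
import re

# Hypothetical SQL-injection rule in the style of a regex-based scanner
# (illustrative only, not SpiderRating's actual rule set).
SQLI_RULE = re.compile(r"SELECT\s+.*\+\s*\w+", re.IGNORECASE)

# Genuinely vulnerable code: user input concatenated into a query.
vulnerable = 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)'

# A security tool's own detection signature -- the same text, but as data.
detector = 'INJECTION_SIGNATURES = ["SELECT * FROM users WHERE id = " + payload]'

print(SQLI_RULE.search(vulnerable) is not None)  # True -- real finding
print(SQLI_RULE.search(detector) is not None)    # True -- false positive
```

Without semantic context, both hits look identical, which is exactly why security-themed repos dominate our false positive counts.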

The extreme case: cisco-ai-defense/mcp-scanner triggered 145 false positives against only 1 true positive (99.3% FP rate). Its code is full of injection patterns -- because detecting injection patterns is what it does.

This is a fundamental limitation of static analysis, not a problem with security tools. It forced us to build a "by-design" detection system: when the scanner recognizes that a repo is a security tool, it downgrades confidence on findings that match the tool's core function rather than flagging them as vulnerabilities.
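A rough sketch of how such a downgrade can work (all names and categories here are hypothetical, simplified from what we actually ship): classify the repo's purpose from its metadata, then lower confidence on finding types the tool is expected to contain by design.

```python
# Hypothetical "by-design" downgrade sketch; the hint lists, categories,
# and function names are illustrative, not SpiderRating's actual API.
SECURITY_HINTS = ("scanner", "firewall", "pentest", "ctf", "waf")

# Finding types each tool category is *expected* to contain by design.
BY_DESIGN = {
    "scanner": {"sql_injection", "command_injection", "dangerous_eval"},
    "firewall": {"prompt_injection", "data_exfiltration"},
}

def classify_repo(name, description):
    """Guess whether a repo is a security tool from its metadata."""
    text = f"{name} {description}".lower()
    for hint in SECURITY_HINTS:
        if hint in text:
            return "scanner" if hint in ("scanner", "pentest", "ctf") else "firewall"
    return None

def adjust_confidence(repo_kind, finding_type, confidence):
    """Downgrade findings that match the tool's core function."""
    if repo_kind and finding_type in BY_DESIGN.get(repo_kind, set()):
        return "low"  # likely detection logic or test payloads, not a vuln
    return confidence

kind = classify_repo("mcp-scanner", "Scans MCP servers for injection flaws")
print(adjust_confidence(kind, "sql_injection", "high"))  # low
print(adjust_confidence(None, "sql_injection", "high"))  # high
```

The finding is kept, not dropped, so downstream consumers can still see it with the reduced confidence attached.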

The numbers

From 15,674 scanned servers:

  • 11.8% earned a RECOMMENDED verdict (safe to use without caveats)
  • 49.0% were CONSIDER (usable, but with risks)
  • 25.5% were ALLOW_WITH_RISK
  • 13.6% were NOT_RECOMMENDED
  • Average security score: 5.28/10 (median 5.71)

What we found (calibrated true positive rates)

Not all scanner rules are equally reliable. We tracked every finding against manual verification to get real accuracy numbers:

| Vulnerability | TP rate | Verified sample |
| --- | --- | --- |
| Path traversal | 76.1% | 67 findings |
| SSRF | 66.2% | 859 findings |
| SQL injection | 66.0% | 1,024 findings |
| Child process injection | 50.3% | 1,107 findings |
| Prototype pollution | 50.2% | 214+216 findings |
| Dangerous eval | 44.5% | 654 findings |
| Timing attack | 36.8% | 76 findings |
| Command injection | 32.1% | 327 findings |
| Hardcoded credential | 2.7% | 263 findings |
| Prompt injection | 1.2% | 168 findings |
| Data exfiltration | 0.5% | 217 findings |

The bottom three are basically broken. Most "hardcoded credential" findings are mock data in tutorials. Most "data exfiltration" hits are documentation files. We've since added confidence markers to every risk flag so downstream consumers know which findings to trust.
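The confidence markers fall out of the verification data directly. A minimal sketch of the idea (the field names and the 0.6/0.3 thresholds are ours to illustrate, not published cutoffs): tally manual TP/FP verdicts per rule and map each rule's observed TP rate to a marker.

```python
# Sketch: derive a per-rule confidence marker from verified observations.
# Observations and thresholds are illustrative, not the real calibration DB.
from collections import Counter

# (rule, verified_as_true_positive) pairs from manual review
observations = [
    ("path_traversal", True), ("path_traversal", True), ("path_traversal", False),
    ("hardcoded_credential", False), ("hardcoded_credential", False),
    ("hardcoded_credential", False),
]

totals, true_pos = Counter(), Counter()
for rule, is_tp in observations:
    totals[rule] += 1
    true_pos[rule] += is_tp  # True counts as 1

def confidence_marker(rule):
    """Map a rule's observed TP rate to a coarse confidence tier."""
    tp_rate = true_pos[rule] / totals[rule]
    if tp_rate >= 0.6:
        return "high"
    if tp_rate >= 0.3:
        return "medium"
    return "low"

print(confidence_marker("path_traversal"))       # high (2/3)
print(confidence_marker("hardcoded_credential")) # low (0/3)
```

Every risk flag then carries its marker, so a consumer can filter out the rules we know are noisy without us silently deleting data.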

We fixed what we found

We didn't just scan -- we submitted fix PRs to upstream projects. Six have been merged so far:

  • [upstash/context7#2235](https://github.com/upstash/context7/pull/2235) -- path traversal (critical) + command injection (high). Context7 has 7,700+ stars.
  • [topoteretes/cognee#2423](https://github.com/topoteretes/cognee/pull/2423) -- command injection in API handler.
  • [agentic-community/mcp-gateway-registry#655](https://github.com/agentic-community/mcp-gateway-registry/pull/655) -- command injection (critical) in the gateway registry.
  • Plus 3 more across moeru-ai/airi, Flux159/mcp-server-kubernetes, and others.

Our overall merge rate is 35.3% (6/17). We treat each merged PR as ground truth validation that our scanner found a real issue, and each rejection as calibration data for reducing false positives.

What we got wrong

We're being transparent about our limitations:

  1. Three scanner rules are essentially broken (hardcoded_credential at 2.7% TP, data_exfiltration at 0.5%, prompt_injection at 1.2%). We've marked these as low-confidence rather than removing them, so the data stays complete but flagged.
  2. "Score" doesn't mean "safe." A high score means our scanner didn't find issues, not that there aren't any. We only do static analysis -- runtime behavior, supply chain depth, and authentication bypass are blind spots.
  3. Category bias exists. Database MCP servers get more SQL injection flags, shell tools get more command injection flags. This is expected but inflates findings in certain categories.
  4. Our description quality scoring (mean 2.61/10) may be too harsh. The 7-dimension rubric penalizes terse descriptions that are technically correct but don't meet our "AI-agent-friendly" standard. This is a calibration issue we're still tuning.

The data

Everything is public.

We're considering open-sourcing the full calibration dataset (10,970 verified TP/FP observations with per-category accuracy rates). If that would be useful to you, let us know.

---

*Built by a small team that got frustrated with installing MCP servers without knowing if they were safe. We scan daily and publish everything publicly. No signup required for any of the above.*