# The Devtool Visibility Stack in 2026: What API Teams Need to Measure

AI coding agents are now a primary discovery channel for developer tools. Here's the measurement framework API teams need — from crawlability to tool call success rates.

**Published:** 2026-05-08
**Category:** Product
**Author:** Jun Liang Lee
**Read time:** 9 min read

In 2024, developers found APIs through Google. In 2026, they find them through Claude Code, Codex, and Cursor.

**How do you measure whether AI agents can find, recommend, and successfully use your API?**

SEO tools measure Google rankings. GEO tools (Profound, Otterly) measure ChatGPT mentions. Neither measures what happens when a developer asks Claude Code to "add authentication to my app."

This is the Devtool Visibility Stack — a framework for measuring AI visibility across the full discovery-to-implementation pipeline. It also shows why GEO for devtools requires tracking completely different metrics than consumer GEO.

## Why You Need a New Measurement Framework

Consider what happens when a developer opens Claude Code and types: "Add Stripe payments to my Next.js app."

The agent doesn't just answer with text. It:

1. **Searches** for current best practices
2. **Evaluates** available options (Stripe, Square, PayPal, etc.)
3. **Recommends** a specific solution
4. **Writes** working integration code
5. **Executes** that code (sometimes)
6. **Debugs** any errors that occur

Your API needs to be visible and usable at every step. Traditional metrics only capture step 1 (search visibility) or skip directly to step 6 (error monitoring). The middle steps — where decisions are made — go unmeasured.

## The Devtool Visibility Stack

Four layers, each with distinct metrics:

```
┌─────────────────────────────────────────┐
│  Layer 4: Execution                     │
│  "Can agents successfully use it?"      │
├─────────────────────────────────────────┤
│  Layer 3: Context                       │
│  "Do agents have the right information?"│
├─────────────────────────────────────────┤
│  Layer 2: Discovery                     │
│  "Can agents find it?"                  │
├─────────────────────────────────────────┤
│  Layer 1: Foundation                    │
│  "Is it accessible to AI systems?"      │
└─────────────────────────────────────────┘
```

## Layer 1: Foundation (Accessibility)

Before agents can recommend your API, they need access to information about it. This layer measures basic accessibility.

### What to Measure

| Metric               | What It Tells You                     | How to Check                                           |
| -------------------- | ------------------------------------- | ------------------------------------------------------ |
| **Crawler Access**   | Can AI systems read your docs?        | Check robots.txt for ClaudeBot, GPTBot blocks          |
| **Render Mode**      | Does content work without JavaScript? | Disable JS and view your docs                          |
| **Response Time**    | How fast do your docs load?           | Measure time-to-first-byte for documentation pages     |
| **Sitemap Coverage** | Are all docs in your sitemap?         | Compare sitemap URLs to actual documentation structure |
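
A minimal way to spot-check the first and third rows is a short script. This is a sketch using only Python's standard library; `DOCS_URL` and the user-agent list are placeholders for your own setup:

```python
import time
import urllib.request
import urllib.robotparser

DOCS_URL = "https://docs.example.com/"               # placeholder: your documentation root
AI_AGENTS = ["GPTBot", "ClaudeBot", "anthropic-ai"]  # AI crawlers to verify

# Crawler access: does robots.txt block AI user agents?
robots = urllib.robotparser.RobotFileParser()
robots.set_url(DOCS_URL + "robots.txt")
robots.read()
for agent in AI_AGENTS:
    status = "allowed" if robots.can_fetch(agent, DOCS_URL) else "BLOCKED"
    print(f"{agent}: {status}")

# Response time: rough time to first byte for the docs root
start = time.monotonic()
with urllib.request.urlopen(DOCS_URL) as response:
    response.read(1)
print(f"Approximate TTFB: {time.monotonic() - start:.2f}s")
```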

### Common Failures

**Blocked crawlers**: Many documentation sites still block AI crawlers by default.

```txt
# Bad: Blocks AI discovery
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

**Client-side rendering**: React/Vue docs that don't server-render appear empty to crawlers.

**Rate limiting**: Aggressive CDN rules can block crawler-like traffic patterns.

### Benchmark Data

In our [Devtool Arena](https://usesapient.com/leaderboard) testing, many APIs had some form of accessibility failure that reduced their visibility: blocked AI user agents, JS-only rendering, or aggressive rate limiting. Each of these fixes takes about 5 minutes and has an immediate impact.

## Layer 2: Discovery (Findability)

Once accessible, can agents actually find your API when relevant?

### What to Measure

| Metric                     | What It Tells You                    | How to Check                                       |
| -------------------------- | ------------------------------------ | -------------------------------------------------- |
| **Search Appearance**      | Do you show up for relevant queries? | Test "best [category] API" in agent web searches   |
| **Comparison Presence**    | Do you appear in "X vs Y" results?   | Search "[your API] vs [competitor]"                |
| **Training Data Presence** | Does the base model know about you?  | Ask Claude/GPT without web search                  |
| **Freshness**              | Is the information current?          | Check if recent features appear in recommendations |
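
One way to spot-check the "training data presence" row is to ask a model without web search and look for your API in the answer. A rough sketch, assuming the `anthropic` Python SDK; the model ID, prompts, and "Acme" brand check are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPTS = [
    "What's the best payments API for a Next.js app?",  # category query
    "What is the Acme API and when should I use it?",   # direct recall query
]

for prompt in PROMPTS:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: swap in a current model ID
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = "".join(block.text for block in message.content if block.type == "text")
    print(f"{prompt!r} -> mentions Acme: {'acme' in answer.lower()}")
```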

### Common Failures

**No comparison content**: If you don't create "X vs Competitor" pages, competitors define the narrative.

**Outdated training data**: Models have knowledge cutoffs. A library that shipped in 2025 may not exist in models trained on 2024 data.

**Poor positioning**: Generic descriptions like "A payments API" don't help agents determine when to recommend you.

### Benchmark Data

In our testing, APIs with dedicated comparison pages consistently outperformed those without. The highest-performing APIs explicitly stated "when to use us" and "when not to use us" — giving agents clear guidance on when to recommend them.

## Layer 3: Context (Understanding)

Agents found your API. Do they understand it well enough to recommend it correctly?

### What to Measure

| Metric                      | What It Tells You                                | How to Check                                        |
| --------------------------- | ------------------------------------------------ | --------------------------------------------------- |
| **llms.txt Presence**       | Do you provide machine-readable context?         | Check yourdomain.com/llms.txt                       |
| **MCP Server Availability** | Can agents connect directly?                     | Search MCP registries                               |
| **Use Case Clarity**        | Do agents recommend you for the right scenarios? | Test prompts across different use cases             |
| **Positioning Accuracy**    | Do agents understand your differentiation?       | Ask "When should I use [your API] vs [competitor]?" |

### Common Failures

**No llms.txt**: Without explicit context, agents rely on noisy web search results.

**Ambiguous positioning**: APIs that try to be everything get recommended for nothing specific.

**Missing use case documentation**: Agents can't match user needs to your features.

### The llms.txt Standard

The llms.txt standard provides AI systems with structured context about your API. While adoption is still early, it's gaining traction among developer-focused companies.

A good llms.txt includes:

```txt
# Acme API

> One-line description of what you do.

## When to Use Acme
- Specific use case 1
- Specific use case 2
- Specific use case 3

## When NOT to Use Acme
- Scenario where competitors are better
- Use case you don't support well

## Quick Start
POST /v1/endpoint
Authorization: Bearer YOUR_API_KEY
Body: { "key": "value" }

## Key Endpoints
- POST /v1/action - What it does
- GET /v1/resource - What it returns
```

The "When NOT to Use" section is particularly valuable — it helps agents make accurate recommendations instead of over-recommending.

### MCP Integration

The Model Context Protocol (MCP) lets agents connect directly to your API. APIs with an MCP presence ranked significantly higher in our benchmarks because agents can verify functionality before making a recommendation.

Current MCP leaders in Devtool Arena:

- **Vector databases**: Chroma, Qdrant
- **Search/scraping**: Tavily
- **Email**: AgentMail
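
For a sense of what an MCP integration involves, here is a minimal server sketch using the FastMCP helper from the official `mcp` Python SDK; the server name, tool, and stubbed response are hypothetical stand-ins for your own API:

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical MCP server exposing one tool from a fictional "Acme" API
mcp = FastMCP("acme")

@mcp.tool()
def get_resource(resource_id: str) -> dict:
    """Fetch a resource from the Acme API by ID."""
    # A real server would call the Acme REST API here;
    # the stub keeps this sketch self-contained.
    return {"id": resource_id, "status": "active"}

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```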

## Layer 4: Execution (Usability)

The agent recommended you. Can it actually use your API successfully?

### What to Measure

| Metric                     | What It Tells You                         | How to Check                                            |
| -------------------------- | ----------------------------------------- | ------------------------------------------------------- |
| **Tool Call Success Rate** | Does agent-generated code work?           | Run benchmark prompts, measure success %                |
| **Error Recovery Rate**    | When code fails, can agents fix it?       | Track how often agents recover from errors              |
| **Time to Success**        | How long does it take to complete tasks?  | Measure end-to-end execution time                       |
| **Abandonment Rate**       | Do agents switch to competitors mid-task? | Track when agents recommend alternatives after failures |

### Common Failures

**Unclear error messages**: "Error 500" doesn't help agents troubleshoot.

**Complex auth flows**: Multi-step authentication confuses agents.

**Inconsistent response shapes**: Different endpoints returning different structures.

**Breaking changes**: Agents using outdated patterns from training data.
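
The first failure is the easiest to picture: an error body that names the problem and the fix gives an agent something to act on. A hedged sketch of such a payload, with hypothetical field names:

```python
# Hypothetical error payload an agent can recover from, versus a bare "Error 500"
error_response = {
    "error": {
        "code": "invalid_api_key",
        "message": "The API key was not found or has been revoked.",
        "how_to_fix": "Generate a new key in the dashboard and set the "
                      "ACME_API_KEY environment variable.",
        "docs": "https://docs.example.com/errors/invalid_api_key",
    }
}
```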

### Benchmark Data

In Devtool Arena testing, tool call success rates ranged from **47% to 94%** across the APIs we benchmarked. The gap came from:

| Factor                                   | Impact on Success Rate |
| ---------------------------------------- | ---------------------- |
| Typed responses (TypeScript/JSON Schema) | +18%                   |
| Descriptive error messages               | +15%                   |
| Consistent import paths                  | +12%                   |
| Clear quick start code                   | +11%                   |
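
One way to act on the "typed responses" factor is to publish a schema that agents (and your docs) can validate against. A sketch using Pydantic, with a hypothetical response model:

```python
from pydantic import BaseModel

class Charge(BaseModel):
    """Hypothetical response shape for POST /v1/charges."""
    id: str
    amount: int    # smallest currency unit, e.g. cents
    currency: str
    status: str    # e.g. "succeeded", "pending", "failed"

# Emit a JSON Schema that documentation or an MCP server can expose to agents
print(Charge.model_json_schema())
```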

The fastest API in our benchmarks — Firecrawl — completed tasks in 49 seconds with just 6 tool calls. The slowest took over 3 minutes with 20+ tool calls and multiple error recoveries.

## The Full Stack Audit

Audit your API across all four layers:

### Layer 1 Checklist (Foundation)

- [ ] robots.txt allows ClaudeBot, GPTBot, anthropic-ai
- [ ] Documentation renders server-side (not JS-only)
- [ ] Pages load in under 2 seconds
- [ ] Sitemap includes all documentation pages

### Layer 2 Checklist (Discovery)

- [ ] You appear in "[category] API" searches
- [ ] You have "[your API] vs [competitor]" pages
- [ ] Base models (without search) know about you
- [ ] Recent features appear in AI recommendations

### Layer 3 Checklist (Context)

- [ ] llms.txt exists at domain root
- [ ] llms.txt includes "when to use" and "when not to use"
- [ ] MCP server available (or planned)
- [ ] Use case documentation exists for top 3 scenarios

### Layer 4 Checklist (Execution)

- [ ] Quick start code runs without modification
- [ ] Error messages explain what went wrong and how to fix it
- [ ] Response shapes are typed and consistent
- [ ] Auth flow works in a single step

## The Measurement Gap

Most API teams have some visibility into Layer 1 (SEO tools) and Layer 4 (error monitoring). But Layers 2 and 3 — where recommendation decisions happen — go unmeasured.

This creates a common failure mode: APIs that are technically accessible but never recommended, or recommended but unused because agents can't generate working code.

### What Traditional Tools Miss

| Tool Type                     | What It Measures                          | What It Misses                                   |
| ----------------------------- | ----------------------------------------- | ------------------------------------------------ |
| SEO tools                     | Google search rankings                    | All AI platforms                                 |
| GEO tools (Profound, Otterly) | Answer engines only (ChatGPT, Perplexity) | Coding agents (Claude Code, Codex, Cursor, etc.) |
| APM tools                     | Production error rates                    | Agent-generated code success rates               |
| Analytics                     | User behavior on your site                | Agent behavior before users arrive               |

GEO tools focus on answer engines — AI that answers questions. They don't track action engines — AI that takes actions, writes code, and executes it. For developer tools, that's where adoption actually happens.

## Building Your Measurement System

### Option 1: Manual Benchmarking

Run standard prompts through Claude Code, Codex, and Cursor. Track:

- Does your API get recommended?
- Is the recommendation accurate?
- Does the generated code work?
- How does this compare to competitors?

This works for spot-checks but doesn't scale for continuous monitoring.
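
If you take the manual route, a small structured log keeps runs comparable over time. A sketch of one way to record each trial, with hypothetical fields and values:

```python
import csv
from dataclasses import asdict, dataclass, fields

@dataclass
class BenchmarkResult:
    """One manual trial of a prompt against one coding agent."""
    date: str
    agent: str         # e.g. "Claude Code", "Codex", "Cursor"
    prompt: str
    recommended: bool  # did the agent recommend your API?
    accurate: bool     # was the positioning in the recommendation correct?
    code_worked: bool  # did the generated code run without manual fixes?
    competitor: str    # who was recommended instead, if anyone

results = [
    BenchmarkResult("2026-05-01", "Claude Code", "Add payments to my Next.js app",
                    recommended=True, accurate=True, code_worked=False, competitor=""),
]

with open("benchmark_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[field.name for field in fields(BenchmarkResult)])
    writer.writeheader()
    writer.writerows(asdict(result) for result in results)
```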

### Option 2: Sapient — AEO for Coding Agents

Sapient is the AEO (AI Engine Optimization) platform built for coding agents — already used by leading developer tool companies in the SF Bay Area. It tracks visibility across 19 AI platforms:

- **8 Coding Agents:** Claude Code, OpenAI Codex, Cursor, GitHub Copilot, Gemini CLI, OpenClaw, OpenCode, Hermes
- **7 Answer Engines:** ChatGPT, Google AI Overviews, Perplexity, Claude, Microsoft Copilot, and more
- **4 Models:** DeepSeek, Kimi, Z.ai, Grok

The [Devtool Arena](https://usesapient.com/leaderboard) benchmarks APIs across all four layers:

- **Foundation**: Crawlability, rendering, response times
- **Discovery**: Search appearance, comparison presence, freshness
- **Context**: llms.txt quality, MCP presence, positioning clarity
- **Execution**: Tool call success, error recovery, completion rates

The leaderboard is free and public. You can see how your API ranks against competitors and track changes over time.

For deeper analysis, [Sapient's platform](https://usesapient.com/welcome) provides:

- Layer-by-layer breakdown of your visibility gaps
- Prompt-level analysis (which queries you win vs. lose)
- Competitor benchmarking across all 19 AI platforms
- Actionable opportunities with prioritized recommendations
- Content agent to generate fixes (llms.txt, comparison pages, documentation)
- Automation workflows to implement changes

## What High-Performing APIs Do Differently

Based on Devtool Arena data, the top-performing APIs share these characteristics:

### They optimize bottom-up

Start with Layer 1 (accessibility), then Layer 2 (discovery), then Layer 3 (context), then Layer 4 (execution). Each layer depends on the ones below it.

### They measure continuously

AI agent behavior changes as models update and competitors improve. Monthly benchmarking catches regressions before they impact adoption.

### They invest in both visibility and usability

High visibility with low usability generates frustrated developers at scale. Low visibility with high usability is a missed opportunity. You need both.

### They treat agents as a first-class audience

Documentation written for humans doesn't always work for agents. The best APIs maintain both human-readable docs and machine-readable context (llms.txt, MCP).

## Related Reading

- [Why Claude Code Isn't Recommending Your Library](/blog/why-claude-code-not-recommending-your-library) — Quick fixes for the most common visibility problems
- [We Tested 70+ APIs in Claude Code and Codex](/blog/we-tested-50-apis-in-coding-agents) — The benchmark data behind these recommendations
- [AEO/GEO for Dev Tools: Why Profound & Otterly Don't Work for APIs](/blog/geo-for-developer-tools-is-different) — Why consumer GEO tools miss the mark for APIs
- [Best AEO/GEO Tools for Dev Tools in 2026](/blog/best-geo-tools-for-developer-tools-2026) — Sapient vs Profound vs Otterly comparison

## FAQ

### How often should I run visibility audits?

Monthly for full stack audits. Weekly spot-checks for Layer 4 (execution) if you're actively iterating on documentation or error messages.

### Which layer should I fix first?

Work bottom-up. Layer 1 issues block everything else. A perfect llms.txt doesn't help if your docs are blocked by robots.txt.

### Is this different from traditional developer relations?

It's complementary. DevRel builds relationships and creates content. Visibility measurement tells you whether that content is working in AI-mediated discovery channels.

### How do I convince my team this matters?

Track competitor mentions. When Claude Code recommends a competitor's API for a prompt your tool handles, that's lost adoption. Quantify it.

---

## Measure Your Full Stack

Your API's coding agent visibility depends on all four layers working together. Most teams only measure one or two.

**Free:** [Check your ranking on Devtool Arena](https://usesapient.com/leaderboard) — see how your API performs across all layers compared to competitors.

**Full audit:** [Get a Sapient visibility report](https://usesapient.com/welcome) — layer-by-layer breakdown with prioritized recommendations.

**Community:** Join the [AI DevTool Demo Night](https://luma.com/devtooldemo5) — 3,500+ developer community, 50+ DevTool companies, hosted at AWS SF.
