# We Tested 70+ APIs in Claude Code and Codex — Here's What Made the Difference

We spent $5,000+ on Claude Code and Codex, gave them 70 live production API keys, and measured what actually works. The results show why some APIs dominate while others stay invisible.

**Published:** 2026-04-10
**Category:** Research
**Author:** Jun Liang Lee
**Read time:** 12 min read

**Why do coding agents recommend some APIs over others?**

We spent $5,000+ on Claude Code (and got rate-limited), gave both Claude Code and Codex 70 live production API keys, and measured what happens when agents try to use real APIs.

Here's what we learned testing 70+ APIs across payments, auth, voice AI, sandboxes, vector databases, and more.

## The Methodology

### What We Tested

- **70+ APIs** across 12 categories (payments, auth, search, voice AI, sandboxes, vector databases, cloud hosting, inference, email, observability, stablecoins, meeting bots)
- **2 coding agents**: Claude Code and OpenAI Codex
- **Real-world developer tasks**: For every API, we gave the agent a real use case and its own machine, injected a production API key, and had it run live API calls
- **4 metrics per API**: Usability, Discoverability, Tool Calls, Errors

### How We Measured

For each API, the agent:

1. Generated a script using the API
2. Injected the production API key
3. Ran live API calls against it

We then tracked successes, failures, tool calls, and execution time for each run.

| Metric              | Definition                                            |
| ------------------- | ----------------------------------------------------- |
| **Usability**       | Can the agent successfully complete the task?         |
| **Discoverability** | Did the agent find and recommend this API?            |
| **Tool Calls**      | How many tool calls did it take to complete the task? |
| **Errors**          | What errors occurred and could the agent recover?     |

We also tracked Agent Skills, CLI support, llms.txt presence, MCP servers, Context7, and more.
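
For a concrete picture, here is roughly the shape of the per-run record those metrics imply. The field names below are an illustrative sketch, not our actual schema.

```typescript
// Illustrative only: a simplified per-run record matching the metrics above.
// Field names are examples, not the production schema used for the leaderboard.
interface EvalRun {
  api: string;                      // e.g. "resend"
  agent: "claude-code" | "codex";   // which harness ran the task
  usable: boolean;                  // did the agent complete the task?
  discovered: boolean;              // did the agent find/recommend the API on its own?
  toolCalls: number;                // total tool calls to finish (or give up)
  durationSeconds: number;          // wall-clock execution time
  errors: string[];                 // raw error messages hit along the way
}
```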

## The High-Level Results

### Top Performers Overall

**Claude Code Top 3**: Vercel, Jina AI, PayPal

**Codex Top 3**: Firecrawl, ElevenLabs, Tavily

The results from Claude Code and Codex were _completely different_. Some APIs that worked well with Claude Code broke outright with Codex. The worst performers were effectively invisible to one agent or the other.

### Category Leaders (Claude Code)

| Category             | Leaders                          |
| -------------------- | -------------------------------- |
| **Inference**        | OpenRouter, Cerebras             |
| **Auth**             | Auth0, Scalekit, Descope         |
| **Payment**          | PayPal, Stripe                   |
| **Search**           | Jina AI, Firecrawl               |
| **Voice AI**         | ElevenLabs, LiveKit              |
| **Sandboxes**        | Vercel, Cloudflare, E2B, Daytona |
| **Vector Databases** | LanceDB, Chroma                  |
| **Stablecoin**       | Circle                           |
| **Meeting Bots**     | MeetStream AI, MeetGeek          |
| **Durable Workflow** | Prefect, Temporal                |
| **Cloud Hosting**    | Render, Railway                  |

**Honorable Mentions:**

- Top Eval Score: Jina AI, MeetStream AI, Rime
- Top Discovery Score: You.com, WorkOS

**Why?** The winners shared common patterns we'll break down below.

### Finding #1: Agent behavior differs dramatically

Claude Code and Codex have completely different tool selection patterns, failure points, and execution strategies. You can't optimize for "agents" generically — you need to test against specific harnesses.

### Finding #2: Speed and efficiency vary wildly

**Fastest (Codex)**: Firecrawl — 49 seconds, 6 tool calls

**Slowest (Codex)**: Circle — 43 minutes, 129 tool calls

50x speed difference. 20x fewer tool calls. The gap is massive.

### Finding #3: MCP servers create clear winners

Most MCP servers barely worked and felt stitched together in a hurry. But the best ones destroy their competitors and own the entire category.

**MCP Leaders**: Chroma, Qdrant, AgentMail (YC S25), Tavily

**Fastest MCP**: Exa (1m 28s, 6 tool calls)

**Slowest MCP**: Coinbase (39m 5s, 79 tool calls)

### Finding #4: CLI coverage is sparse — opportunity exists

Only 31 out of 73 APIs have CLI support for agents. The ones that do score significantly higher.

**CLI insights:**

- Auth: API key auth dominates. 83% of CLIs authenticate via env var or `--api-key` flag
- JSON support: CLIs with `--json` flags score +18 points higher on average (see the sketch after this list)
- Keyless: Only 1 out of 35 CLI tools supports keyless auth
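
To make the pattern concrete, here is a minimal sketch of an agent-friendly CLI entry point. The `acme` tool name and `ACME_API_KEY` variable are made up for illustration; the point is env-var auth plus a `--json` flag that gives agents structured output to parse instead of text to scrape.

```typescript
// Hypothetical "acme" CLI (made-up name) showing the pattern the data favors:
// auth via an environment variable and a --json flag for machine-readable output.

const apiKey = process.env.ACME_API_KEY;
if (!apiKey) {
  // A specific message tells the agent exactly how to fix the problem.
  console.error("Missing ACME_API_KEY. Set it with: export ACME_API_KEY=ak_live_xxx");
  process.exit(1);
}

const wantsJson = process.argv.includes("--json");
// Stand-in for a real API call made with apiKey.
const result = { id: "job_123", status: "succeeded" };

if (wantsJson) {
  console.log(JSON.stringify(result));               // structured output agents can parse
} else {
  console.log(`Job ${result.id}: ${result.status}`); // human-readable default
}
```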

## The Patterns That Separated Winners from Losers

### Pattern #1: Winners had multiple agent touchpoints

The top performers weren't just optimizing one channel — they covered Agent Skills, CLI support, llms.txt, MCP servers, and Context7. The more touchpoints, the higher the ranking.

**Daytona** went from lower rankings to #1 in the sandbox category by shipping better tooling. (Agent skills coming soon, we hear.)

### Pattern #2: Winners had llms.txt files

The llms.txt file gives agents immediate context about what the API does and when to use it. Without it, agents rely on web search and training data — both noisier signals.

**Example from a top performer:**

```txt
# Resend

> Email API for developers. Send transactional emails in minutes.

## When to Use Resend
- Transactional email (receipts, notifications, password resets)
- Developer-first teams that want simple APIs over enterprise features
- Next.js/React projects (first-class SDK support)

## When NOT to Use Resend
- Marketing email campaigns (use Mailchimp, ConvertKit)
- Email requiring complex templates (use SendGrid)

## Quick Start
POST https://api.resend.com/emails
Headers: Authorization: Bearer re_xxx
Body: { from, to, subject, html }
```

The "when NOT to use" section is particularly effective — it helps agents make accurate recommendations instead of over-recommending.

### Pattern #3: Winners showed up in comparison searches

When we analyzed web search behavior, agents frequently searched for "[API] vs [competitor]" before making recommendations.

APIs that had dedicated comparison pages won those searches:

| Search Query           | Winner   | Why                                    |
| ---------------------- | -------- | -------------------------------------- |
| "Clerk vs Auth0"       | Clerk    | Clerk has /compare/auth0 page          |
| "Supabase vs Firebase" | Supabase | Supabase has multiple comparison pages |
| "Resend vs SendGrid"   | Resend   | Resend has /compare page               |

APIs without comparison content let competitors define the narrative.

### Pattern #4: Winners had working MCP servers

Most MCP servers we tested were barely functional. But the ones that worked well dominated their categories completely.

**MCP Category Leaders:**
| Category | MCP Leaders |
|----------|-------------|
| **Vector Databases** | Chroma, Qdrant |
| **Search** | Tavily, Jina AI, Firecrawl |
| **Email** | AgentMail (YC S25) |
| **Auth** | Descope, Clerk |
| **Payment** | Stripe, PayPal |
| **Voice AI** | ElevenLabs, Deepgram |
| **Sandboxes** | Daytona |
| **Meeting Bots** | MeetGeek, Recall.ai |

MCP presence correlates strongly with success because agents can verify that the API works before recommending it, and can execute tasks directly through the server.
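
What does "working" mean in practice? At minimum, a server that starts cleanly over stdio and exposes one reliable tool with specific errors. Below is a minimal sketch using the public `@modelcontextprotocol/sdk` TypeScript package; the `acme_search` tool and the endpoint it wraps are hypothetical.

```typescript
// Minimal MCP server sketch using the public @modelcontextprotocol/sdk package.
// The "acme_search" tool and the endpoint it calls are hypothetical examples.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "acme", version: "0.1.0" });

server.tool(
  "acme_search",
  "Search the Acme index and return matching documents as JSON.",
  { query: z.string().describe("Search query") },
  async ({ query }) => {
    const res = await fetch(
      `https://api.acme.example/v1/search?q=${encodeURIComponent(query)}`,
      { headers: { Authorization: `Bearer ${process.env.ACME_API_KEY}` } }
    );
    if (!res.ok) {
      // Specific errors help the agent recover (see Pattern #5).
      return {
        isError: true,
        content: [{ type: "text", text: `Acme search failed: HTTP ${res.status}` }],
      };
    }
    return { content: [{ type: "text", text: JSON.stringify(await res.json()) }] };
  }
);

// Agents typically launch MCP servers as a subprocess over stdio.
await server.connect(new StdioServerTransport());
```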

### Pattern #5: Winners had better error messages

We tracked what happened when agent-generated code failed. The recovery rate (how often the agent successfully fixed the error) varied dramatically:

| Error Message Style                                            | Recovery Rate |
| -------------------------------------------------------------- | ------------- |
| "Invalid API key format. Expected: sk_live_xxx or sk_test_xxx" | 89%           |
| "Authentication failed. Check your API key."                   | 67%           |
| "Error 401"                                                    | 34%           |
| "Internal server error"                                        | 12%           |

Specific errors = high recovery. Generic errors = rabbit holes.
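
For illustration (this isn't any particular vendor's schema), the difference looks like this in practice:

```typescript
// Illustrative error payloads only, not a specific vendor's format.

// High recovery: a stable error code, a message that says what was expected,
// and a docs link the agent can follow.
const recoverable = {
  error: {
    code: "invalid_api_key_format",
    message: "Invalid API key format. Expected: sk_live_xxx or sk_test_xxx",
    doc_url: "https://api.example.com/docs/errors#invalid_api_key_format",
  },
};

// Low recovery: the agent is left guessing and usually goes down a rabbit hole.
const opaque = { error: "Internal server error" };

console.log(JSON.stringify(recoverable, null, 2));
console.log(JSON.stringify(opaque, null, 2));
```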

### Pattern #6: Winners had consistent, typed responses

Agents generate better code when they understand the response shape. APIs with:

- TypeScript definitions
- JSON Schema
- OpenAPI specs
- Consistent response envelopes

...had 23% higher tool call success rates than those without.

**Example of a typed response that helps agents:**

```typescript
// Stripe returns consistent, typed responses
interface ChargeResponse {
  id: string; // "ch_xxx"
  amount: number; // in cents
  currency: string; // "usd"
  status: "succeeded" | "pending" | "failed";
  created: number; // Unix timestamp
}
```

**Example of an ambiguous response that hurts agents:**

```json
// Inconsistent response shapes
{ "data": { "charge": { ... } } }  // sometimes
{ "charge": { ... } }               // other times
{ "result": "success", "id": "..." } // also sometimes
```

## Category-by-Category Breakdown

### Payments

| Agent           | Leaders                                 |
| --------------- | --------------------------------------- |
| **Claude Code** | PayPal, Stripe                          |
| **Codex**       | Stripe, Mollie, PayPal                  |
| **CLI**         | Stripe (only 1/10 payment APIs had CLI) |
| **MCP**         | Stripe, PayPal                          |

**Key insight**: Stripe is the only payment API with CLI support. That's a massive gap — 9 out of 10 payment APIs have no CLI presence for agents at all.

### Authentication

| Agent           | Leaders                  |
| --------------- | ------------------------ |
| **Claude Code** | Auth0, Scalekit, Descope |
| **Codex**       | Auth0                    |
| **CLI**         | WorkOS, Auth0            |
| **MCP**         | Descope, Clerk           |

**Key insight**: Auth is fragmented across channels. WorkOS dominates CLI, Descope and Clerk dominate MCP, and Auth0 leads the API leaderboard. No single winner owns all channels.

### Search & Scraping

| Agent           | Leaders                    |
| --------------- | -------------------------- |
| **Claude Code** | Jina AI, Firecrawl         |
| **Codex**       | Firecrawl, Tavily          |
| **CLI**         | Firecrawl, Jina AI         |
| **MCP**         | Tavily, Jina AI, Firecrawl |

**Key insight**: Firecrawl is the speed king — 49 seconds and 6 tool calls on Codex. They dominate across all agent types.

### Voice AI

| Agent           | Leaders                        |
| --------------- | ------------------------------ |
| **Claude Code** | ElevenLabs, LiveKit            |
| **Codex**       | ElevenLabs, Deepgram, Cartesia |
| **CLI**         | LiveKit, AssemblyAI            |
| **MCP**         | ElevenLabs, Deepgram           |

**New categories added**:

- Voice STT: Deepgram, AssemblyAI
- Voice TTS: ElevenLabs, Rime
- Voice Telephony: Twilio
- Voice Infra: LiveKit

### Sandboxes

| Agent           | Leaders                          |
| --------------- | -------------------------------- |
| **Claude Code** | Vercel, Cloudflare, E2B, Daytona |
| **Codex**       | Daytona, E2B                     |
| **CLI**         | Daytona, Vercel                  |
| **MCP**         | Daytona                          |

**Key insight**: Daytona moved to #1 in the sandbox category with better tooling. Agent skills reportedly coming soon.

### Vector Databases

| Agent           | Leaders         |
| --------------- | --------------- |
| **Claude Code** | LanceDB, Chroma |
| **Codex**       | Pinecone        |
| **CLI**         | Pinecone        |
| **MCP**         | Chroma, Qdrant  |

**Key insight**: The MCP leaderboard is dominated by vector databases — Chroma and Qdrant are the overall MCP leaders.

## What You Should Do With This Data

### If you're an API company:

1. **Test against multiple agents** — Claude Code and Codex have completely different behaviors
2. **Add CLI with `--json` flag** — +18 points higher on average
3. **Build a working MCP server** — Most are broken, so a good one dominates the category
4. **Add llms.txt** — Give agents context about when to use (and not use) your API
5. **Track multiple leaderboards** — API, CLI, MCP performance varies independently

### If you're a challenger trying to gain ground:

The data shows smaller players can absolutely win:

- **AgentMail (YC S25)** leads MCP email
- **Daytona** went from lower rankings to #1 sandbox
- **Firecrawl** is fastest overall on Codex

1. **Pick your channel** — Win one leaderboard completely before spreading thin
2. **Ship CLI first** — Only 31 of 73 APIs have a CLI; the opportunity is massive
3. **Make it work with Codex** — Different optimization than Claude Code

### If you're evaluating APIs for your project:

Consider agent-friendliness as a factor. Check:

- Does it work with your coding agent of choice?
- How many tool calls does it take?
- What's the success rate?

An API that's fast and reliable with agents will save you hours of debugging.

## The Full Leaderboard

We've made the complete results available on [Devtool Arena](https://usesapient.com/leaderboard), including:

- **70+ APIs** ranked across Claude Code and Codex
- **4 leaderboards**: [API](https://devtoolarena.com), [CLI](https://devtoolarena.com/cli), [MCP](https://devtoolarena.com/mcp), [Codex](https://devtoolarena.com/codex)
- **Real-time changelog** tracking ranking changes, new CLIs, Agent Skills, and MCP servers
- **Scores** based on usability, discoverability, tool calls, and errors

Recent additions include Datadog (#3 overall) and AgentMail (YC S25). We've received 50+ requests to add more APIs — they're coming in upcoming evals.

You can now **login and claim the report for your company** directly on the site.

## Beyond the Leaderboard

Sapient is the AEO (AI Engine Optimization) platform built for coding agents — already used by leading developer tool companies in the SF Bay Area. It tracks visibility across 19 AI platforms (8 coding agents, 7 answer engines, 4 models) and goes beyond tracking to identify actionable opportunities and generate optimized content to fix visibility gaps.

## Related Reading

- [Why Claude Code Isn't Recommending Your Library](/blog/why-claude-code-not-recommending-your-library) — The 4 fixable reasons and step-by-step fixes
- [How Coding Agents Actually Decide Which SDK to Use](/blog/how-coding-agents-decide-which-sdk-to-use) — The 4-layer decision stack explained
- [The Devtool Visibility Stack in 2026](/blog/devtool-visibility-stack-2026) — How to measure your API's coding agent visibility
- [AEO/GEO for Dev Tools: Why Profound & Otterly Don't Work for APIs](/blog/geo-for-developer-tools-is-different) — Why consumer GEO tools miss the mark
- [Best AEO/GEO Tools for Dev Tools in 2026](/blog/best-geo-tools-for-developer-tools-2026) — Sapient vs Profound vs Otterly

## FAQ

### How often do you update the benchmarks?

The changelog tracks real-time changes. In the last 72 hours alone: 7 leaderboard ranking changes, 5 teams shipped new CLIs/Agent Skills/MCP servers, and 2 new companies were added.

### Can I see the exact prompts you used?

The methodology page includes example prompts. For each API, we gave agents a real-world developer use case — the same tasks developers actually do.

### Why isn't [specific API] included?

We've received 50+ requests to add APIs. They're coming in upcoming evals. [Request an API](https://usesapient.com/welcome) if you'd like specific coverage.

### Why do Claude Code and Codex have different winners?

Completely different tool selection patterns, failure points, and execution strategies. APIs that work well with Claude Code can completely break with Codex. This is why testing against multiple agents matters.

### How do I improve my API's ranking?

Based on what we found:

1. Add CLI with `--json` flag (+18 points average)
2. Build a working MCP server (most are broken)
3. Add llms.txt with "when to use" and "when not to use"
4. Test against both Claude Code and Codex

Our [optimization guide](/blog/why-claude-code-not-recommending-your-library) breaks down each step.

---

## Check Where Your API Stands

Testing whether agents can discover and use your API should be part of your CI/CD before launch, and it's a critical step in improving Agent Experience.

**Free:** [View the full leaderboard](https://devtoolarena.com) — see rankings across API, CLI, MCP, and Codex.

**Real-time updates:** [Join the mailing list](https://devtoolarena.com) — get notified of leaderboard changes.

**For API teams:** [Claim your company report](https://usesapient.com/welcome) — understand exactly why you rank where you do and what to fix.

**Community:** Join the [AI DevTool Demo Night](https://luma.com/devtooldemo5) — 3,500+ developer community, 50+ DevTool companies, hosted at AWS SF.
