If you’re searching for Claude API pricing, you’re in the right place. Anthropic’s pricing landscape has moved fast in 2026—from the Claude 4.5 series in late 2025, to Opus 4.7 in April 2026, to the brand-new Claude Opus 4.8 released May 28, 2026. Each generation has held the line on token rates while adding new capabilities. Understanding Anthropic API pricing is now a multi-dimensional optimization problem: base token costs, extended thinking, tool use, prompt caching, batch processing, Managed Agents runtime, and long-context windows all factor into your real-world spend. For businesses building production AI systems, mastering these pricing levers is the difference between a sustainable product and a runaway budget.
At metacto, we architect and build enterprise-grade AI applications on the latest Claude models. We have navigated Anthropic’s complete pricing structure—from base token costs to extended thinking, tool use, and advanced caching strategies—and we are providing a definitive breakdown for May 2026.
Claude API Price Summary (May 31, 2026)
At-a-glance Anthropic API pricing per million tokens (input / output):
| Model | Input | Output | Context | Best For |
|---|---|---|---|---|
| Claude Opus 4.8 (newest, May 28) | $5.00 | $25.00 | 1M | Flagship coding, agents, adaptive thinking |
| Claude Opus 4.7 | $5.00 | $25.00 | 1M | High-res vision, long-horizon agents |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | Best balance for production workloads |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | High-volume, latency-sensitive tasks |
| Claude Opus 4.1 (legacy) | $15.00 | $75.00 | 200K | Migrate to Opus 4.8 — 3x cheaper, smarter |
| Claude Haiku 3 (budget) | $0.25 | $1.25 | 200K | Simple, ultra-high-volume tasks |
Layer the discounts: Prompt caching cuts repeat input by 90%. Batch API takes 50% off input and output. Stacking both can drop effective spend by 95%+. New: Opus 4.8 Fast Mode is $10/$50 per MTok—3x cheaper than Opus 4.7’s $30/$150 Fast Mode.
Updated – May 2026
New in this update:
- Claude Opus 4.8 released May 28, 2026 — same $5/$25 token pricing as 4.7, but with adaptive thinking, effort controls (low / high / xhigh / max), and a Fast Mode that is 3x cheaper at $10/$50 per million tokens
- Claude Opus 4.7 released April 16, 2026 — same $5/$25 pricing, but a new tokenizer that may consume up to 35% more tokens for the same text
- Claude Managed Agents (public beta): $0.08 per session-hour runtime plus standard token costs
- 1-hour cache option at 2x base input price (vs 1.25x for 5-minute cache)
- Tool use system prompt token counts updated per official Anthropic docs (e.g., Sonnet 4.6: 497 tokens vs 313 on older models)
- Refreshed all pricing tables and the interactive calculator with the current model lineup
Quick Summary: Claude API Pricing at a Glance
Anthropic offers four recommended tiers as of May 2026: Haiku 4.5 ($1/$5), Sonnet 4.6 ($3/$15), Opus 4.7 ($5/$25), and the newest Opus 4.8 ($5/$25, released May 28). Opus 4.8, Opus 4.7, Opus 4.6, and Sonnet 4.6 all include the 1M context window at standard pricing. Legacy models range from Haiku 3 ($0.25/$1.25) to Opus 4.1 ($15/$75). Combine prompt caching (90% savings) and batch API (50% off) to reduce costs by up to 95%. Opus 4.8’s new Fast Mode at $10/$50 makes premium-speed inference dramatically cheaper. For AI cost optimization strategies, see our dedicated guide.
Short on time? Here’s the summary: Anthropic offers four current-generation model tiers: Haiku 4.5 ($1/$5 per million tokens) for speed and efficiency, Sonnet 4.6 ($3/$15) for balanced intelligence and cost, Opus 4.7 ($5/$25) for flagship vision and long-horizon agents, and the freshly released Opus 4.8 ($5/$25) with adaptive thinking and a 3x cheaper Fast Mode. Opus 4.8, Opus 4.7, Opus 4.6, and Sonnet 4.6 all include the full 1 million token context window at standard pricing—no premium long-context surcharges. Combined with prompt caching (90% savings on repeated context), batch API (50% discount), and extended thinking capabilities, Claude remains the most cost-effective frontier AI available today. For help calculating AI workflow ROI, check our dedicated guide. Looking for alternatives? See our guides on OpenAI API pricing, Cohere pricing, and Hugging Face costs.
Anthropic Claude API Pricing 2026: Complete Model Comparison
Here is a comprehensive comparison of all current Claude models. Pricing is shown per million tokens (1M tokens = approximately 750,000 words). Claude Opus 4.8 (May 2026) is the newest model, with the 4.7, 4.6, and 4.5 series all still actively supported.
Current Generation: Claude 4.8, 4.7, 4.6, and 4.5 Series
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cache Write (5m) | Cache Write (1h) | Cache Read | Context Window | Best For |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 (May 28, 2026) | $5 | $25 | $6.25 | $10 | $0.50 | 1M | Newest flagship; adaptive thinking, effort controls, best coding model |
| Claude Opus 4.7 | $5 | $25 | $6.25 | $10 | $0.50 | 1M | High-resolution vision, long-horizon agents |
| Claude Opus 4.6 | $5 | $25 | $6.25 | $10 | $0.50 | 1M | Flagship reasoning, mission-critical applications |
| Claude Sonnet 4.6 | $3 | $15 | $3.75 | $6 | $0.30 | 1M | Balanced performance, intelligent agents, code generation |
| Claude Opus 4.5 | $5 | $25 | $6.25 | $10 | $0.50 | 200K | Earlier flagship, same price as 4.8 |
| Claude Sonnet 4.5 | $3 | $15 | $3.75 | $6 | $0.30 | 200K / 1M* | Balanced workhorse, wide ecosystem support |
| Claude Haiku 4.5 | $1 | $5 | $1.25 | $2 | $0.10 | 200K | Speed-optimized tasks, high-volume processing, cost efficiency |
1M Context at Standard Pricing: Claude Opus 4.8, Opus 4.7, Opus 4.6, and Sonnet 4.6 include the full 1 million token context window at standard pricing—no premium surcharge. For Claude Sonnet 4.5, requests exceeding 200K input tokens are charged at $6 input / $22.50 output per million tokens (beta, tier 4+).
Fast Mode pricing:
- Claude Opus 4.8 Fast Mode: $10 input / $50 output per million tokens (2x standard) — 3x cheaper than 4.7’s Fast Mode
- Claude Opus 4.7 / 4.6 Fast Mode: $30 input / $150 output per million tokens (6x standard)
Fast Mode is in research preview, runs Opus at ~2.5x the standard output rate, and is not available with the Batch API.
Opus 4.8: Same Price, Smarter Spend
Claude Opus 4.8 keeps the $5/$25 token rate but introduces effort controls (low, high, xhigh, max) and adaptive thinking—the model decides how much reasoning to spend based on task complexity. The default effort changed from medium to high, which can raise per-request cost unless you set it explicitly. Combined with the cheaper Fast Mode, Opus 4.8 gives you more knobs to tune the cost/quality trade-off than any prior Claude model.
Opus 4.7 Tokenizer Change
Claude Opus 4.7 and 4.8 use a new tokenizer that may consume up to 35% more tokens for the same input text compared to Opus 4.6 and earlier models. The 1.0x–1.35x multiplier hits hardest on code, structured data, and non-English text. While the per-token rate is unchanged, your effective cost per request may increase. Factor this into cost projections when migrating from Opus 4.6.
Legacy Models: Claude 4.x and Earlier
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cache Write (5m) | Cache Read | Status |
|---|---|---|---|---|---|
| Claude Opus 4.1 | $15 | $75 | $18.75 | $1.50 | Legacy |
| Claude Opus 4 | $15 | $75 | $18.75 | $1.50 | Legacy |
| Claude Sonnet 4 | $3 | $15 | $3.75 | $0.30 | Supported |
| Claude Sonnet 3.7 | $3 | $15 | $3.75 | $0.30 | Deprecated |
| Claude Haiku 3.5 | $0.80 | $4 | $1 | $0.08 | Supported |
| Claude Opus 3 | $15 | $75 | $18.75 | $1.50 | Deprecated |
| Claude Haiku 3 | $0.25 | $1.25 | $0.30 | $0.03 | Budget Option |
Legacy Model Migration
Claude Opus 4, Opus 4.1, and Opus 3 remain available but cost 3x more than the current Opus models for inferior performance. If your application still uses these legacy models, migrating to Opus 4.7 will deliver both better results and significant cost savings. For guidance on building AI agents that actually work, see our implementation guide.
Estimate Your Claude API Costs
Use our interactive calculator to estimate your monthly Claude API costs. Toggle prompt caching and batch API to see how much you can save.
Claude API Cost Calculator
Estimate your monthly Anthropic Claude API costs based on your expected usage
1M tokens = approximately 750,000 words
Typically 30-50% of input tokens
Cost Breakdown
Model Tip: Opus 4.8 (released May 28, 2026) is Anthropic's newest flagship—same $5/$25 token rate as 4.7, with adaptive thinking, effort controls (low/high/xhigh/max), and a 3x cheaper Fast Mode at $10/$50. Best for production coding (88.6% SWE-bench Verified).
Note: This estimate is based on standard Anthropic pricing as of May 31, 2026. Extended thinking tokens, tool use overhead, Managed Agents session-hour runtime ($0.08/hr), and multi-modal (image/PDF) inputs are not included and will increase costs. Opus 4.7 and 4.8 use a new tokenizer that may consume up to 35% more tokens for the same text vs Opus 4.6. See detailed sections below for those costs.
Deep Dive: Choosing the Right Claude Model Tier
Anthropic now offers four recommended model tiers for production use: Opus 4.8 for the newest flagship intelligence with adaptive thinking, Opus 4.7 for high-resolution vision and long-horizon agents, Sonnet 4.6 for the best balance of capability and cost, and Haiku 4.5 for speed-critical high-volume workloads. All tiers are engineered for building production-grade “agentic” AI systems that can interact with external tools, process extended reasoning tasks, and handle multi-step workflows at scale. For comprehensive guidance on the AI agent stack for production systems, see our dedicated guide.
graph TD
A["What is your primary requirement?"] --> B{"Newest, smartest, best at code?"};
A --> C{"High-res vision or long-horizon agents?"};
A --> D{"Balanced Cost & Capability?"};
A --> E{"Highest Throughput / Lowest Cost?"};
B -->|Yes| F["Use Claude Opus 4.8<br/>Cost: $5/$25 per MTok<br/>Fast Mode: $10/$50"];
C -->|Yes| G["Use Claude Opus 4.7<br/>Cost: $5/$25 per MTok"];
D -->|Yes| H["Use Claude Sonnet 4.6<br/>Cost: $3/$15 per MTok"];
E -->|Yes| I["Use Claude Haiku 4.5<br/>Cost: $1/$5 per MTok"];
style A fill:#f0f0f0,stroke:#333,stroke-width:2px
style B fill:#d9edf7,stroke:#3a87ad
style C fill:#d9edf7,stroke:#3a87ad
style D fill:#d9edf7,stroke:#3a87ad
style E fill:#d9edf7,stroke:#3a87ad
style F fill:#cfffe5,stroke:#4caf50
style G fill:#cfffe5,stroke:#4caf50
style H fill:#cfffe5,stroke:#4caf50
style I fill:#cfffe5,stroke:#4caf50
1. Claude Opus 4.8: The Newest Flagship (May 28, 2026)
Claude Opus 4.8 ($5 input / $25 output per million tokens) is Anthropic’s newest generally available model, released May 28, 2026. It holds the same $5/$25 token rate as Opus 4.7 and 4.6—a 67% reduction from the Opus 4.1 era ($15/$75)—but introduces three pricing-relevant changes that meaningfully change how you build with it:
- Adaptive thinking: The model automatically decides how much reasoning to spend on each task. Simple prompts get fast answers; hard prompts get extended thinking budgets.
- Effort controls: Set
efforttolow,high,xhigh, ormax. The default moved frommediumtohigh, which can quietly raise costs unless you override it. - Cheaper Fast Mode: $10 input / $50 output per million tokens—3x cheaper than Opus 4.7’s $30/$150 Fast Mode and roughly 2.5x faster than standard inference.
Best For:
- Production-ready code generation (88.6% on SWE-bench Verified, 69.2% on SWE-bench Pro)
- Complex multi-agent workflows (parallel-subagent support in Claude Code)
- Long agentic sessions where adaptive thinking saves output tokens on easy steps
- Workloads where latency matters but Opus 4.7’s Fast Mode was previously too expensive
Key Advantage: Opus 4.8 is roughly 4x less likely than Opus 4.7 to let a flaw in its own generated code pass without comment—materially reducing the iteration count (and total token spend) for AI-assisted engineering work.
Watch out: The high default effort can surprise teams migrating from medium-default models. For cost-sensitive batch workloads, set effort: "low" explicitly. Opus 4.8 also uses the same new tokenizer as Opus 4.7—up to 35% more tokens for the same input text vs. Opus 4.6.
2. Claude Opus 4.7: Flagship Vision and Long-Horizon Agents
Claude Opus 4.7 ($5 input / $25 output per million tokens) was released April 16, 2026 and remains the right choice when you need high-resolution image analysis (max 2576px / 3.75MP, up from 1568px / 1.15MP) or extremely long autonomous agent runs. Opus 4.7 includes the full 1 million token context window at standard pricing and supports Fast Mode at $30/$150 per million tokens (6x standard).
Best For:
- Long-horizon autonomous agent work requiring sustained accuracy
- Vision tasks requiring high-resolution image analysis
- Scientific research and quantitative analysis with multi-step reasoning
- Full-codebase analysis via the 1M token context window
Key Advantage: Opus 4.7 excels at tasks requiring rigor over extended sessions—it pays precise attention to instructions and devises ways to verify its own outputs before reporting back. The new tokenizer improves performance across many tasks but may use up to 35% more tokens for the same input text.
3. Claude Opus 4.6: Previous Flagship at Same Pricing
Claude Opus 4.6 ($5 input / $25 output per million tokens) remains an excellent choice for applications where the new tokenizer in Opus 4.7 and 4.8 would meaningfully raise costs. It delivers the same 1M context window at standard pricing and supports Fast Mode. For text-heavy English workloads, Opus 4.6 often beats Opus 4.7/4.8 on dollars-per-task even though the per-token rate is identical.
4. Claude Sonnet 4.6: The Production Workhorse
Claude Sonnet 4.6 ($3 input / $15 output per million tokens) is the optimal choice for most production AI applications. It strikes the ideal balance between advanced intelligence, processing speed, and cost efficiency. Like the Opus models, Sonnet 4.6 includes the full 1 million token context window at standard pricing. For developers building intelligent agents, RAG systems, or complex automation workflows, Sonnet 4.6 delivers flagship-adjacent performance at a sustainable price point.
Best For:
- Advanced Retrieval-Augmented Generation (RAG) over large document sets
- Intelligent coding assistants and development tools
- Multi-step agentic workflows with tool use and LangChain
- Customer support automation requiring nuanced understanding
- Internal tools requiring sophisticated reasoning
- Building and iterating on an AI MVP
Key Advantage: Sonnet 4.6 provides a level of intelligence that rivals previous flagship models while maintaining cost efficiency that scales to millions of interactions. Combined with prompt caching and batch processing, Sonnet 4.6 can operate at effective costs as low as $0.30 per million input tokens (90% cache hit rate) or $1.50/$7.50 (batch API).
1M Context at No Extra Cost: Unlike Sonnet 4.5 (which charges $6/$22.50 for requests over 200K input tokens), Sonnet 4.6 includes the full 1M context window at the standard $3/$15 rate. This is a major cost improvement for applications processing large codebases, long documents, or extensive conversation histories.
5. Claude Haiku 4.5: Speed and Scale at Breakthrough Pricing
Claude Haiku 4.5 ($1 input / $5 output per million tokens) is optimized for high-throughput applications where speed and cost efficiency are paramount. Despite its efficiency-first design, Haiku 4.5 delivers performance that approaches Sonnet-tier intelligence for many tasks—making it an exceptional choice for high-volume production workloads.
Best For:
- High-volume content moderation and classification
- Real-time chat applications requiring sub-second latency
- Data extraction and transformation at scale
- Agent control flow and routing logic
- Simple code generation and refactoring tasks
- Document processing pipelines handling millions of documents
Key Advantage: Haiku 4.5 operates at one-fifth the cost of Sonnet 4.6 while delivering performance within “five percentage points” of Sonnet on many benchmarks. For applications requiring processing millions of requests per day, Haiku 4.5’s economics are transformative. With batch processing, costs drop to $0.50/$2.50 per million tokens.
Performance Notes: Haiku 4.5 is faster than Sonnet 4.6 and dramatically faster than Opus 4.6, making it ideal for latency-sensitive applications like real-time chat or interactive tools.
Extended Thinking: Deep Reasoning as Output Tokens
One of the most powerful features across the Claude 4.5 through 4.8 generations is Extended Thinking—a capability that allows the model to generate internal reasoning content blocks before producing its final response. This is particularly valuable for complex problem-solving, multi-step coding tasks, deep research, and autonomous agent work where the quality of reasoning directly impacts outcome quality. Understanding when to use extended thinking is crucial for avoiding common AI agent failures.
How Extended Thinking Works
When you enable extended thinking mode via the API, Claude produces a “thinking” content block that exposes its internal reasoning process. The model works through the problem step-by-step—exploring different approaches, catching potential errors, and refining its logic—before generating the final response. This explicit reasoning often leads to significantly higher quality outputs for complex tasks.
Supported Models: Extended thinking is available on Claude Opus 4.8, Opus 4.7, Opus 4.6, Sonnet 4.6, Opus 4.5, Sonnet 4.5, Haiku 4.5, Opus 4.1, Opus 4, and Sonnet 4.
Extended Thinking Pricing Model
Critical detail: Extended thinking tokens are billed as output tokens, not as a separate pricing tier. When you enable extended thinking with a token budget (minimum 1,024 tokens), any tokens the model uses for internal reasoning are charged at the standard output rate for that model.
Pricing by Model:
- Claude Opus 4.8 / 4.7 / 4.6 / 4.5: $25 per million output tokens (includes thinking)
- Claude Sonnet 4.6 / 4.5: $15 per million output tokens (includes thinking)
- Claude Haiku 4.5: $5 per million output tokens (includes thinking)
Adaptive thinking on Opus 4.8: Opus 4.8 introduces adaptive thinking—the model automatically scales its thinking budget based on task complexity. Combined with the new effort parameter (low / high / xhigh / max), you have explicit control over how much the model “thinks.” The default effort moved from medium to high, so workloads migrating from Opus 4.7 may see higher output token consumption unless you set effort: "low" explicitly.
Note on Opus 4.7+: Starting with Claude Opus 4.7, thinking content is omitted from the response by default (though thinking blocks still appear in the response stream with empty thinking fields). Callers must explicitly opt in to receive thinking content. The same behavior applies to Opus 4.8.
Thinking Token Budgets
You set a thinking token budget when making API requests with extended thinking enabled. The minimum budget is 1,024 tokens. Anthropic recommends starting at this minimum and increasing incrementally to find the optimal balance between reasoning depth and cost for your specific use case.
Important: The thinking budget is a target, not a strict limit. Actual token usage may vary based on task complexity. For tasks requiring extensive reasoning (multi-step coding, complex research), you may see thinking token usage in the thousands.
When Extended Thinking is Worth the Cost
Extended thinking adds cost (more output tokens) but delivers value through higher quality responses. Use extended thinking when:
- Accuracy matters more than latency: Complex financial analysis, medical research, legal reasoning
- Multi-step workflows require careful planning: Agentic systems orchestrating multiple tools
- Deep code reasoning is required: Architecting complex systems, debugging subtle issues
- Research quality is paramount: Literature synthesis, scientific hypothesis generation
For high-volume, straightforward tasks where speed matters, standard mode (without extended thinking) is more cost-effective.
Cost Example: Extended Thinking vs. Standard Mode
Scenario: A complex coding task requiring 50,000 tokens of output
Standard Mode (Sonnet 4.6):
- Output: 50,000 tokens × $15/million = $0.75
Extended Thinking Mode (Sonnet 4.6):
- Thinking: 8,000 tokens × $15/million = $0.12
- Output: 50,000 tokens × $15/million = $0.75
- Total: $0.87 (16% premium for higher quality reasoning)
For mission-critical applications, this premium is typically justified by the improvement in output quality and reduction in iterations needed to reach the correct solution.
Prompt Caching: Up to 90% Cost Reduction
Prompt caching is arguably the most powerful cost optimization feature in Anthropic’s API. For applications that repeatedly send similar context (large documents, system prompts, knowledge bases), prompt caching can reduce input costs by up to 90% on cache hits.
How Prompt Caching Works
When you send a request to Claude, you can mark portions of the input (typically the system prompt or large document context) for caching. Anthropic stores this content on their servers for a specified duration. Subsequent requests that include the same cached content read from the cache instead of processing as new input tokens—charged at 90% discount.
graph TD
A["Initial Request with Large Context"] --> B["Claude API"]
B --> C{"Cache Write: 1.25x Cost"}
C -->|Stores Context for Reuse| D["Cached Context"]
E["Request 1 + Cached Context"] --> F["Claude API"]
F --> G{"Cache Read: 0.1x Cost<br/>90% Savings"}
G --> D
H["Request 2 + Cached Context"] --> I["Claude API"]
I --> J{"Cache Read: 0.1x Cost<br/>90% Savings"}
J --> D
K["Request N + Cached Context"] --> L["Claude API"]
L --> M{"Cache Read: 0.1x Cost<br/>90% Savings"}
M --> D
D --> N["10x Cost Reduction on Repeated Context"]
style A fill:#f9f,stroke:#333,stroke-width:2px
style C fill:#fcc,stroke:#333,stroke-width:2px
style G fill:#cfc,stroke:#333,stroke-width:2px
style J fill:#cfc,stroke:#333,stroke-width:2px
style M fill:#cfc,stroke:#333,stroke-width:2px
style N fill:#add8e6,stroke:#333,stroke-width:2px
Prompt Caching Pricing Multipliers
Anthropic offers two cache duration options with different pricing:
| Cache Operation | Multiplier | Duration |
|---|---|---|
| 5-minute cache write | 1.25x base input price | Cache valid for 5 minutes |
| 1-hour cache write | 2x base input price | Cache valid for 1 hour |
| Cache read (hit) | 0.1x base input price | Same duration as preceding write |
Example: Sonnet 4.6 with Prompt Caching
- Standard input: $3.00 per million tokens
- 5-minute cache write: $3.75 per million tokens (1.25x)
- 1-hour cache write: $6.00 per million tokens (2x)
- Cache read: $0.30 per million tokens (0.1x) — 90% savings
These multipliers stack with other pricing modifiers, including the Batch API discount and data residency.
Break-Even Analysis
5-minute cache: You break even after just 1 cache read. Every subsequent read within 5 minutes is pure savings. 1-hour cache: You break even after 2 cache reads. Ideal for extended thinking sessions or multi-step agent workflows where the context remains relevant for longer periods.
Real-World Caching Use Cases
-
RAG Systems: Cache your entire knowledge base (documentation, FAQ corpus) and only pay full price once per 5 minutes or hour. Each user query reads from cache at 90% discount. Learn more about building RAG systems with vector databases.
-
Code Assistants: Cache the full codebase context. Users can ask multiple questions about the code without repeatedly paying to process the entire repository.
-
Document Analysis: Upload a 100-page legal document once (cache write), then ask dozens of questions about it (cache reads at 10% cost).
-
Multi-Step Agents: Cache system prompts and tool definitions. Each step in the agent workflow reads from cache rather than reprocessing. For complex agent workflows, consider using LangGraph for stateful applications.
Cost Comparison: With vs. Without Caching
Scenario: RAG chatbot over 200K token documentation corpus, 100 user queries per hour
Without Caching (Sonnet 4.6):
- 100 queries × 200K tokens × $3/million = $60/hour
With 1-Hour Caching (Sonnet 4.6):
- Initial cache write: 200K tokens × $6/million = $1.20
- 99 cache reads: 99 × 200K tokens × $0.30/million = $5.94
- Total: $7.14/hour — 88% cost reduction
Batch API: 50% Discount for Non-Urgent Workloads
The Batch API offers a straightforward way to cut your API costs in half: submit requests that don’t need immediate responses, and Anthropic processes them asynchronously within 24 hours at a 50% discount on both input and output tokens.
Batch API Pricing
All Claude models support batch processing with consistent 50% discounts:
| Model | Standard Input | Standard Output | Batch Input | Batch Output |
|---|---|---|---|---|
| Claude Opus 4.8 | $5 | $25 | $2.50 | $12.50 |
| Claude Opus 4.7 | $5 | $25 | $2.50 | $12.50 |
| Claude Opus 4.6 | $5 | $25 | $2.50 | $12.50 |
| Claude Sonnet 4.6 | $3 | $15 | $1.50 | $7.50 |
| Claude Opus 4.5 | $5 | $25 | $2.50 | $12.50 |
| Claude Sonnet 4.5 | $3 | $15 | $1.50 | $7.50 |
| Claude Haiku 4.5 | $1 | $5 | $0.50 | $2.50 |
| Claude Opus 4.1 (legacy) | $15 | $75 | $7.50 | $37.50 |
| Claude Sonnet 4 (deprecated) | $3 | $15 | $1.50 | $7.50 |
| Claude Haiku 3.5 (retired except Bedrock/Vertex) | $0.80 | $4 | $0.40 | $2.00 |
Note: Fast Mode (Opus 4.6 / 4.7 / 4.8) is not available with the Batch API.
Ideal Use Cases for Batch Processing
The Batch API is perfect for workloads where latency isn’t critical:
- Content Generation at Scale: Generate thousands of product descriptions, blog posts, or marketing emails overnight
- Data Processing Pipelines: Extract structured data from large document sets, process historical records
- Model Evaluation: Run comprehensive test suites against your prompts and agent workflows
- Synthetic Data Generation: Create training datasets for fine-tuning or testing
- Document Analysis: Process archives of contracts, research papers, or support tickets
Combining Batch API with Other Optimizations
The Batch API discount stacks with prompt caching, creating even more dramatic savings:
Example: Large-Scale RAG Processing (Sonnet 4.6)
- Standard: $3 input / $15 output
- Batch API: $1.50 input / $7.50 output (50% off)
- Batch + Caching: $0.15 input (cache read) / $7.50 output
- Total savings: 95% on input, 50% on output
For applications processing millions of tokens per day, combining batch processing with prompt caching can reduce monthly API costs from tens of thousands to hundreds of dollars.
Long Context Pricing: Processing Up to 1 Million Tokens
Multiple Claude models now support an extended 1 million token context window—enough to process entire codebases, full-length books, or extensive conversation histories in a single request. Pricing differs significantly depending on which model you use.
Claude 4.8, 4.7, and 4.6 Models: 1M Context at Standard Pricing
A major pricing improvement starting with the Claude 4.6 generation: Claude Opus 4.8, Opus 4.7, Opus 4.6, and Sonnet 4.6 include the full 1M token context window at standard pricing. A 900K-token request is billed at the same per-token rate as a 9K-token request. No more premium surcharges for large contexts.
| Model | Input (any context size) | Output (any context size) |
|---|---|---|
| Claude Opus 4.8 | $5 per million tokens | $25 per million tokens |
| Claude Opus 4.7 | $5 per million tokens | $25 per million tokens |
| Claude Opus 4.6 | $5 per million tokens | $25 per million tokens |
| Claude Sonnet 4.6 | $3 per million tokens | $15 per million tokens |
Prompt caching and batch processing discounts apply at standard rates across the full context window. Fast Mode pricing (Opus 4.6 / 4.7 / 4.8) also applies across the full window.
Claude 4.5 Models: Tiered Long Context Pricing
For Claude Sonnet 4.5 and Sonnet 4, the 1M context window is in beta for organizations in usage tier 4 and above. Requests exceeding 200K input tokens trigger premium pricing:
Standard Context (up to 200K input tokens):
- Input: $3 per million tokens
- Output: $15 per million tokens
Long Context (over 200K input tokens):
- Input: $6 per million tokens (2x standard)
- Output: $22.50 per million tokens (1.5x standard)
Important: The pricing tier is determined solely by input token count. If your request exceeds 200K input tokens, all tokens in that request are charged at the long context rate, not just tokens above the threshold.
Long Context + Optimization Stacking
Long context pricing stacks with other features. For Claude 4.6 models, standard pricing already applies to the full window, so stacking simply uses the normal cache/batch multipliers. For Sonnet 4.5 long context:
Long Context + Prompt Caching (Sonnet 4.5):
- Cache reads on long context: $0.60/MTok (90% off $6)
- Extremely powerful for repeated analysis of large documents
Long Context + Batch API (Sonnet 4.5):
- Batch long context input: $3/MTok (50% off $6)
- Batch long context output: $11.25/MTok (50% off $22.50)
Long Context + Both (Sonnet 4.5):
- Batch + cache read: $0.30/MTok (95% savings)
- Process massive codebases repeatedly at fraction of standard cost
Tip: If you regularly need over 200K tokens of context, upgrading from Sonnet 4.5 to Sonnet 4.6 eliminates the long-context premium entirely—you pay $3/$15 instead of $6/$22.50 per million tokens. This is essential for continuous AI operations running smoothly.
When to Use Long Context
The 1M token window enables entirely new application patterns:
- Whole Codebase Analysis: Load an entire repository for architectural questions, refactoring, or bug detection
- Multi-Document Synthesis: Analyze dozens of research papers or contracts simultaneously
- Extended Conversations: Maintain full context across thousands of messages without truncation
- Complete Book Processing: Analyze entire manuscripts for editing, summarization, or question answering
Tool Use Pricing: Understanding the Complete Cost
When building agentic AI applications that interact with external APIs, databases, or custom functions, understanding tool use pricing is critical. Tool use adds token overhead beyond the basic input/output costs.
Base Tool Use Overhead
Every Claude API request using tools includes a system prompt that enables tool functionality. This overhead is automatically added. Per Anthropic’s official docs (as of May 2026):
| Model | Tool Choice: auto or none | Tool Choice: any or specific tool |
|---|---|---|
| Claude Opus 4.8 | 290 tokens | 410 tokens |
| Claude Opus 4.7 | 675 tokens | 804 tokens |
| Claude Opus 4.6 / Sonnet 4.6 | 497 tokens | 589 tokens |
| Claude Opus 4.5 / Sonnet 4.5 / Haiku 4.5 | 496 tokens | 588 tokens |
| Claude Opus 4.1 | 313 tokens | 315 tokens |
| Claude Haiku 3.5 | 264 tokens | 355 tokens |
Cost Impact (Sonnet 4.6, auto): 497 tokens × $3/million = $0.0015 per request. Note that Opus 4.7’s tool-use overhead is noticeably higher (675 tokens), which—combined with the new tokenizer—makes Opus 4.6 or Opus 4.8 the better choice for tool-heavy agents.
Per-Tool Definition Overhead
Each tool you define in the tools parameter adds tokens based on its name, description, and JSON schema:
- Simple tool (basic function): ~50-100 tokens
- Complex tool (detailed schema): ~200-500 tokens
- Server-side tools (Anthropic-hosted): Fixed overhead
Example: An agent with 5 tools (average 150 tokens each) adds 750 tokens per request.
Tool Execution Tokens
When Claude actually calls a tool, additional tokens are consumed:
- Tool use request: The
tool_usecontent block (parameters passed to tool) - Tool result: The
tool_resultcontent block (data returned from tool)
Both are charged as standard input/output tokens based on their size.
Example Chain:
- User prompt: 500 tokens (input)
- Tool use overhead: 346 tokens (input)
- 3 tool definitions: 450 tokens (input)
- Tool execution request: 200 tokens (output)
- Tool result data: 2,000 tokens (input)
- Final response: 800 tokens (output)
- Total: 3,296 input / 1,000 output
Server-Side Tool Pricing
Anthropic provides several hosted tools with specific pricing:
Web Search Tool:
- Cost: $10 per 1,000 searches
- Plus: standard token costs for search results
- Use case: Real-time information retrieval
Web Fetch Tool:
- Cost: Free (only token costs for fetched content)
- Limit: Use
max_content_tokensto control costs - Average page: ~2,500 tokens
- Large PDF: ~125,000 tokens
Code Execution Tool:
- Free when used with web search or web fetch tools
- Without web tools: $0.05 per hour (after 1,550 free hours/month)
- Minimum: 5 minutes per execution
- Use case: Running analysis scripts, data processing
Bash Tool:
- Fixed overhead: 245 input tokens
- Variable: stdout/stderr content
- Use case: Command execution, file operations
Text Editor Tool:
- Fixed overhead: 700 input tokens (Claude 4.x)
- Variable: file content
- Use case: Code editing, document modification
Computer Use Tool:
- System overhead: 466-499 tokens
- Tool definition: 735 tokens
- Plus: screenshot costs (vision pricing)
- Use case: UI automation, testing
Cost Optimization for Tool-Heavy Agents
For applications with extensive tool use:
- Cache tool definitions: Define tools once, cache for 90% savings on subsequent requests
- Minimize tool schemas: Use concise descriptions and lean JSON schemas
- Batch tool calls: When possible, combine multiple operations in one call
- Smart tool selection: Only include tools relevant to current task
- Result filtering: Return minimal necessary data from tool executions
Example Optimization:
- Before: 10 tools always included, 1,000 tokens overhead
- After: Dynamic tool loading, cache tool definitions, ~100 effective tokens
- Savings: 90% reduction in tool overhead
Claude Managed Agents Pricing
Anthropic launched Claude Managed Agents in public beta (April 2026), providing a hosted runtime for stateful, long-running agent sessions. If you’re exploring the 2027 AI operations playbook, managed agents represent a significant simplification of agentic infrastructure.
Pricing Model: Tokens + Session Runtime
Managed Agents bills on exactly two dimensions:
| Component | Rate | Details |
|---|---|---|
| Tokens | Standard model rates | Same rates as Messages API (see tables above) |
| Session Runtime | $0.08 per session-hour | Measured to the millisecond; only accrues while session status is running |
Key Details:
- No flat fees: No monthly subscription, per-agent license, or infrastructure charge
- Idle time is free: Time spent waiting for input or tool confirmations does not count toward runtime
- Web search: Standard $10 per 1,000 searches rate applies within sessions
- Prompt caching: Standard caching multipliers apply identically
What Doesn’t Apply to Managed Agents
Several Messages API modifiers do not apply to Managed Agents sessions:
| Modifier | Why It Doesn’t Apply |
|---|---|
| Batch API discount | Sessions are stateful and interactive—no batch mode |
| Fast mode premium | Inference speed is managed by the runtime |
| Data residency multiplier | inference_geo is a Messages API request field |
Cost Example: Coding Session with Opus 4.8
A one-hour coding session using Claude Opus 4.8:
- 50,000 input tokens, 15,000 output tokens
- No caching
| Line Item | Calculation | Cost |
|---|---|---|
| Input tokens | 50,000 x $5 / 1,000,000 | $0.25 |
| Output tokens | 15,000 x $25 / 1,000,000 | $0.375 |
| Session runtime | 1.0 hour x $0.08 | $0.08 |
| Total | $0.705 |
With prompt caching (40,000 of input tokens from cache reads):
| Line Item | Calculation | Cost |
|---|---|---|
| Uncached input | 10,000 x $5 / 1,000,000 | $0.05 |
| Cache reads | 40,000 x $0.50 / 1,000,000 | $0.02 |
| Output tokens | 15,000 x $25 / 1,000,000 | $0.375 |
| Session runtime | 1.0 hour x $0.08 | $0.08 |
| Total | $0.525 |
Data Residency Pricing
For Claude Opus 4.6, Sonnet 4.6, and later models (including Opus 4.7 and 4.8), specifying US-only inference through the inference_geo parameter incurs a 1.1x multiplier on all token pricing categories (input, output, cache writes, and cache reads). Global routing (the default) uses standard pricing. Earlier models do not support the inference_geo parameter and always use standard pricing.
How Much Does Claude API Cost? Real-World Pricing Scenarios
Understanding token pricing in isolation is one thing—estimating your actual monthly Claude API cost requires thinking about complete application architectures. Here are realistic cost scenarios for common use cases.
Scenario 1: Customer Support Chatbot
- Model: Claude Sonnet 4.6
- Volume: 10,000 conversations/day, average 2,000 input + 500 output tokens each
- Optimization: Prompt caching (system prompt cached), no batch
| Component | Calculation | Daily Cost |
|---|---|---|
| System prompt (cached read) | 10K x 800 tokens x $0.30/MTok | $2.40 |
| User messages (standard) | 10K x 1,200 tokens x $3/MTok | $36.00 |
| Output tokens | 10K x 500 tokens x $15/MTok | $75.00 |
| Total | $113.40/day (~$3,400/month) |
Without caching, the system prompt alone would cost $24/day—caching saves $21.60 daily.
Scenario 2: Document Processing Pipeline
- Model: Claude Haiku 4.5 (batch)
- Volume: 50,000 documents/day, average 5,000 tokens each, 200 token output
| Component | Calculation | Daily Cost |
|---|---|---|
| Input (batch) | 50K x 5,000 tokens x $0.50/MTok | $125.00 |
| Output (batch) | 50K x 200 tokens x $2.50/MTok | $25.00 |
| Total | $150/day (~$4,500/month) |
At standard Sonnet pricing without batch, this workload would cost $7,500/day. Choosing Haiku with batch processing delivers a 98% cost reduction.
Scenario 3: AI-Powered Code Review Tool
- Model: Claude Opus 4.8 with adaptive thinking (default
effort: high) - Volume: 500 reviews/day, 20,000 input + 5,000 output + 10,000 thinking tokens each
- Note: Opus 4.8 uses the same new tokenizer as 4.7—may increase token counts by up to 35% vs Opus 4.6
| Component | Calculation | Daily Cost |
|---|---|---|
| Input tokens | 500 x 20K x $5/MTok | $50.00 |
| Output + thinking | 500 x 15K x $25/MTok | $187.50 |
| Total | $237.50/day (~$7,125/month) |
With Opus 4.8 tokenizer impact (worst-case 35% increase):
- Input: $50.00 x 1.35 = $67.50
- Output: $187.50 x 1.35 = $253.13
- Total:
$320/day ($9,600/month)
With Opus 4.8’s adaptive thinking + effort: low for simple PRs: thinking tokens often drop 40–60% on routine changes, recouping the tokenizer overhead and bringing real-world cost back to roughly the original $237/day baseline. For cost-sensitive workloads, also consider whether Opus 4.6 meets your quality requirements—it uses the original tokenizer at the same per-token rate.
Need Help Estimating Your Costs?
Every AI application has a unique cost profile depending on model selection, optimization strategy, and usage patterns. The AI development team at metacto can help you architect for cost efficiency from day one—often reducing projected costs by 80-95% compared to naive implementations. For a deeper understanding of the hidden variables in AI ROI, see our analysis.
Beyond Tokens: The Hidden Engineering Challenges of Scaling
As your AI application scales, the API bill is just one of your concerns. Production-readiness introduces a host of technical challenges that can quickly overwhelm a team focused solely on the model itself.
1. API Rate Limiting & Reliability
All providers enforce strict rate limits based on usage tiers. Production systems require sophisticated exponential backoff and retry logic with jitter to handle these limits gracefully without failing user requests. Anthropic’s API uses tiered rate limits (requests per minute, tokens per minute, tokens per day) that vary significantly between tiers.
Production Requirements:
- Implement request queuing and throttling
- Build graceful degradation when limits are hit
- Monitor rate limit headers in responses
- Scale across multiple API keys if needed
2. API Key Security & Rotation
A leaked API key is a critical security breach that can result in thousands of dollars in fraudulent usage within hours. A robust system requires:
- Secure, isolated storage (AWS Secrets Manager, HashiCorp Vault, or similar)
- Automated key rotation policy to programmatically invalidate and replace keys
- Separate keys for development, staging, and production environments
- Audit logging of all API key usage
- Alert systems for unusual spending patterns
3. Architecting for Latency
Claude API calls can take several seconds—especially for extended thinking, large contexts, or complex tool orchestration. Your application’s architecture must handle this asynchronously:
- Background job queues (Redis, RabbitMQ, AWS SQS)
- Real-time update mechanisms (WebSockets, Server-Sent Events)
- User experience patterns for “AI is thinking” states
- Timeout handling and partial result streaming
- Fallback strategies when calls exceed acceptable latency
4. Observability and Cost Tracking
When an agentic workflow fails or costs spike unexpectedly, you need detailed visibility. Tools like LangSmith provide LLM observability to track these metrics:
- Structured logging of every API call (prompt, model, token counts, latency, cost)
- Token usage analytics broken down by user, feature, and endpoint
- Alert thresholds for unusual spending or error rates
- Dashboard for real-time cost monitoring
- Attribution of costs to specific product features or customers
Learn more about calculating the true cost of AI tools per developer and measuring ROI of AI development tools.
5. Prompt Management and Versioning
As your application evolves, managing prompts becomes critical infrastructure:
- Version control for system prompts and tool definitions
- A/B testing frameworks for prompt variations
- Rollback capabilities when new prompts degrade quality
- Environment-specific prompt configurations
- Caching strategies for static prompt components
These are not “nice-to-haves”; they are fundamental requirements for a reliable product. The AI development services at metacto are designed to build this resilient infrastructure from day one, preventing common failures that often require a costly project rescue.
Overwhelmed by Scaling Challenges?
Building a production-ready AI app is more than just API calls. Our team handles the complexities of security, rate limiting, and monitoring so you can focus on your product. Schedule a free consultation to discuss your project's architecture.
Conclusion: Mastering Claude API Pricing in 2026
The Claude 4.8 generation extends the dramatic cost improvements that began with the 4.5 series. With 67% price reductions on flagship intelligence (Opus 4.8 / 4.7 at $5/$25 vs. Opus 4.1 at $15/$75), standard-price 1M context windows on Opus 4.8, 4.7, 4.6 and Sonnet 4.6, optimization features like 90% prompt caching discounts and 50% batch processing savings, and now a 3x cheaper Fast Mode on Opus 4.8, building production AI applications is more economically viable than ever.
Key Takeaways
-
Choose the right model tier: Haiku 4.5 ($1/$5) for volume and speed, Sonnet 4.6 ($3/$15) for balanced intelligence, Opus 4.7 ($5/$25) for high-res vision and long-horizon agents, Opus 4.8 ($5/$25) for newest-flagship coding and adaptive thinking
-
Use Opus 4.8 effort controls to manage spend: The default
effortishigh. Seteffort: "low"for routine work to claw back token costs from adaptive thinking. Usexhigh/maxonly when answer quality justifies the spend -
Consider the new tokenizer impact: Opus 4.7 and 4.8 use a new tokenizer that may use up to 35% more tokens than Opus 4.6 for the same input. For text-heavy English workloads, benchmark Opus 4.6 against 4.7/4.8 on dollars-per-task before migrating
-
Use Opus 4.8 Fast Mode where you used to skip it: At $10/$50 per million tokens, Opus 4.8 Fast Mode is 3x cheaper than Opus 4.7’s $30/$150. Latency-sensitive workflows that were previously priced out are now in range
-
Upgrade to 4.6+ for long context: Opus 4.8, 4.7, 4.6, and Sonnet 4.6 include the full 1M token context at standard pricing—no premium surcharges. This eliminates the 2x pricing that applied to Sonnet 4.5 requests over 200K tokens
-
Stack optimizations aggressively: Combining prompt caching, batch API, and smart architecture can reduce effective costs by 95% or more compared to naive implementations
-
Evaluate Managed Agents for stateful workflows: At $0.08 per session-hour plus standard token costs, Managed Agents can simplify infrastructure for long-running agent applications
-
Build the infrastructure: Rate limiting, API key security, cost monitoring, and prompt management aren’t optional—they’re fundamental to sustainable AI products
The most critical insight is that Claude API pricing is no longer a simple “cost per token” calculation. It’s a multi-dimensional optimization problem where the right architecture, caching strategy, and model selection can mean the difference between a $50,000/month bill and a $2,000/month bill for the same functionality.
Comparing AI Providers? Explore our comprehensive cost guides for OpenAI API, Cohere, Hugging Face, and Google Gemini. For a broader comparison, see our guide on understanding LLMs for app innovation.
Building Your First LLM Application? Check out our guides on LangChain development, choosing between RAG vs fine-tuning, and understanding when to use LLMs vs alternatives.
If you’re ready to build a production AI application that intelligently leverages these pricing levers while maintaining the resilient infrastructure required for scale, talk to our team at metacto. We specialize in architecting cost-efficient, production-ready AI systems that grow with your business. Schedule a free consultation to discuss your project’s requirements and optimization strategy.
Related Reading
AI Cost and ROI:
- AI Cost Optimization: Getting More Value — Strategies for reducing AI infrastructure costs
- The Hidden Variable in AI ROI — Understanding the full cost picture
- AI Workflow ROI: Calculating Savings — Methods for measuring AI investment returns
Building AI Agents:
- Building AI Agents That Actually Work — Practical implementation patterns
- The AI Agent Stack for Production Systems — Architecture decisions for reliable agents
- AI Agent Failures and How to Avoid Them — Common pitfalls and solutions
AI Operations:
- The 2027 AI Operations Playbook — Planning for enterprise AI at scale
- Continuous AI Operations: Running Smoothly — Maintaining production AI systems
Frequently Asked Questions About Anthropic Claude API Pricing
How much does the Anthropic Claude API cost per million tokens in 2026?
As of May 31, 2026, Anthropic offers four recommended tiers: Claude Haiku 4.5 at $1 input / $5 output per million tokens (fastest), Claude Sonnet 4.6 at $3 input / $15 output (balanced), Claude Opus 4.7 at $5 input / $25 output (vision and long-horizon agents), and the newest Claude Opus 4.8 at $5 input / $25 output (adaptive thinking, best coding model). Opus 4.8, Opus 4.7, Opus 4.6, and Sonnet 4.6 include the full 1M token context window at standard pricing. Legacy models like Claude Opus 4.1 cost significantly more at $15/$75 per million tokens—a 3x premium for inferior performance.
What is Claude Opus 4.8 and how is it priced?
Claude Opus 4.8 was released May 28, 2026 at $5 input / $25 output per million tokens—the same rate as Opus 4.7. The big changes are: adaptive thinking (the model decides how much reasoning to spend per task), effort controls (low / high / xhigh / max, defaulting to high), and a Fast Mode that is 3x cheaper than Opus 4.7's at $10 input / $50 output per million tokens. It scores 88.6% on SWE-bench Verified and is roughly 4x less likely than Opus 4.7 to let a flaw in its generated code pass unnoticed.
What changed with Claude Opus 4.7?
Claude Opus 4.7, released April 16, 2026, maintains the same $5/$25 per million token pricing as Opus 4.6. Key changes include: a new tokenizer that may use up to 35% more tokens for the same input text (especially for code, structured data, and non-English text), high-resolution image support (max 2576px / 3.75MP vs 1568px / 1.15MP), enhanced performance on long-horizon agentic work, and thinking content omitted by default (opt-in required). The same tokenizer is used in Opus 4.8.
What is extended thinking and how is it priced?
Extended thinking is a feature that allows Claude to generate internal reasoning content blocks before producing its final response. It improves output quality for complex tasks by making the model's step-by-step thinking process explicit. Extended thinking tokens are billed as output tokens at standard rates—not as a separate pricing tier. You set a thinking token budget (minimum 1,024 tokens) when enabling this feature via the API. On Opus 4.8, adaptive thinking automatically scales the budget by task complexity, and the new effort parameter (low / high / xhigh / max) gives you explicit control over the cost/quality trade-off.
How does prompt caching work and how much can I save?
Prompt caching allows you to store frequently-used context (system prompts, large documents, knowledge bases) on Anthropic's servers. Cache writes cost 1.25x the base input price (5-minute cache) or 2x (1-hour cache), but cache reads cost only 0.1x—a 90% savings. You break even after just 1 cache hit with 5-minute caching, or 2 hits with 1-hour caching. For applications with repeated context like RAG systems or code assistants, caching can reduce costs by 88-95%.
What is the Batch API and when should I use it?
The Batch API processes requests asynchronously within 24 hours at a 50% discount on both input and output tokens. It's ideal for non-urgent workloads like bulk content generation, data processing pipelines, model evaluation, or document analysis. The discount stacks with prompt caching, potentially reducing costs by 95% or more. For example, Claude Sonnet 4.6 drops from $3/$15 to $1.50/$7.50 per million tokens with batch processing. Note: Fast Mode for Opus 4.6 / 4.7 / 4.8 is not available with the Batch API.
What is Claude Managed Agents and how is it priced?
Claude Managed Agents (public beta launched April 8, 2026) provides a hosted runtime for stateful, long-running agent sessions. Pricing is straightforward: standard token rates plus $0.08 per session-hour of runtime. Runtime is measured to the millisecond and only accrues while the session status is running—idle time waiting for input or tool confirmations is free. No flat fees, per-agent licenses, or infrastructure charges apply. Batch API, Fast Mode, and data residency modifiers do not apply to Managed Agents sessions.
How much does tool use cost with Claude?
Tool use adds several layers of cost: a base system prompt (290 tokens for Opus 4.8 auto/none, 497 tokens for Sonnet 4.6, 675 tokens for Opus 4.7), per-tool definitions (50-500 tokens each), and tokens for tool execution (both the request and result data). Server-side tools have additional fees: web search costs $10 per 1,000 searches, code execution is free when paired with web search/fetch (otherwise $0.05/hour after 1,550 free hours/month per organization), and web fetch is free (only token costs). Optimize by caching tool definitions and minimizing tool schemas.
Which Claude model should I use for my application?
Start with Claude Sonnet 4.6 ($3/$15 per million tokens) for most production applications—it delivers flagship-adjacent performance at sustainable economics with the full 1M context window included. Use Claude Haiku 4.5 ($1/$5) for high-volume, latency-sensitive tasks. Choose Claude Opus 4.7 ($5/$25) for high-resolution vision and the longest autonomous agent runs. Pick Claude Opus 4.8 ($5/$25) for the newest coding and reasoning capabilities, and use its low/high/xhigh/max effort dial to control adaptive-thinking spend. For text-heavy English workloads, benchmark Opus 4.6—the older tokenizer often processes the same content with fewer tokens.
What is long context pricing and when does it apply?
Claude Opus 4.8, Opus 4.7, Opus 4.6, and Sonnet 4.6 include the full 1M token context window at standard pricing—no surcharges regardless of input size. For older models like Sonnet 4.5 and Sonnet 4, requests exceeding 200K input tokens are charged at premium long context rates: $6 input / $22.50 output per million tokens (beta, tier 4+ organizations). The entire request is billed at the higher rate, not just tokens above the threshold. Upgrading to a 4.6+ model eliminates these surcharges entirely.
How is Opus 4.8 Fast Mode priced and is it worth it?
Opus 4.8 Fast Mode is priced at $10 input / $50 output per million tokens—double the standard rate, but 3x cheaper than Opus 4.7's $30/$150 Fast Mode. It delivers up to 2.5x the standard output speed using the same Opus 4.8 model. Fast Mode pricing stacks with prompt caching multipliers and data residency, but it is not available with the Batch API. Use it for latency-sensitive interactive workloads (developer tools, real-time chat with reasoning) where Opus 4.7 Fast Mode was previously too expensive to justify.
Why do I need metacto to build with Claude API?
Using Claude for a prototype is straightforward, but production applications require sophisticated infrastructure: API rate limit handling with exponential backoff, API key security and rotation, async architecture for latency management, detailed cost tracking and observability, and prompt versioning systems. metacto builds this resilient infrastructure from day one, preventing the costly mistakes that often lead to project rescues. We optimize your architecture to leverage caching, batch processing, effort controls, and smart model selection—reducing costs by 90% or more while maintaining reliability.