Claude Opus 4.7 Released: Benchmarks, What's New, and How It Compares to GPT-5.4

On April 16, 2026, Anthropic shipped Claude Opus 4.7 into general availability.

The update focuses on software engineering, triples image input resolution, and adds a new effort level for finer control over reasoning depth. Anthropic also published direct benchmark comparisons against GPT-5.4 and Gemini 3.1 Pro.

If you're new to Claude, here's what it is and why the model family matters — then we'll get into the numbers.


What Is Claude?

Claude is a large language model (LLM) developed by Anthropic, an AI safety company founded in 2021 and headquartered in San Francisco. It competes directly with OpenAI's GPT-4o/GPT-5.4 and Google's Gemini — used for writing, coding, analysis, question answering, and complex multi-step reasoning.


Anthropic's differentiating claim is safety-focused AI research. The company trains Claude using Constitutional AI, a method designed to produce models that are not just capable but honest and aligned with human values.

The Claude Model Family

Model | Best For
Haiku | Fast, lightweight tasks
Sonnet | Balanced speed and power
Opus | Maximum capability and reasoning depth

Version numbers indicate generation. 4.7 is the newest and most capable Opus to date.


Benchmark Results: Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro

Anthropic published a full benchmark comparison table at launch.

Claude Opus 4.7 full benchmark table — vs GPT-5.4, Gemini 3.1 Pro, and Mythos Preview

Key results:

Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro
Agentic coding (SWE-bench Pro) | 64.3% | 53.4% | 57.7% | 54.2%
Agentic coding (SWE-bench Verified) | 87.6% | 80.8% | 80.6% | -
Terminal coding (Terminal-Bench 2.0) | 69.4% | 65.4% | 75.1%* | 68.5%
Multidisciplinary reasoning, no tools (HLE) | 46.9% | 40.0% | 42.7% | 44.4%
Multidisciplinary reasoning, with tools (HLE) | 54.7% | 53.3% | 58.7% | 51.4%
Agentic search (BrowseComp) | 79.3% | 83.7% | 89.3% | 85.9%
Scaled tool use (MCP-Atlas) | 77.3% | 75.8% | 68.1% | 73.9%
Agentic computer use (OSWorld-Verified) | 78.0% | 72.7% | 75.0% | -
Financial analysis (Finance Agent v1.1) | 64.4% | 60.1% | 61.5% | 59.7%
Graduate-level reasoning (GPQA Diamond) | 94.2% | 91.3% | 94.4% | 94.3%
Visual reasoning (CharXiv, with tools) | 91.0% | 84.7% | - | -
Multilingual Q&A (MMMLU) | 91.5% | 91.1% | 92.6% | -

*The GPT-5.4 Terminal-Bench score was produced with a self-reported harness.


Knowledge Work: GDPVal-AA

Knowledge work benchmark (GDPVal-AA): Opus 4.7 (1753) vs Opus 4.6 (1619) vs GPT-5.4 (1674) vs Gemini 3.1 Pro (1314)

GDPVal-AA measures economic analysis and business intelligence tasks using an Elo scoring system:

  • Opus 4.7: 1753 — first place
  • GPT-5.4: 1674
  • Opus 4.6: 1619
  • Gemini 3.1 Pro: 1314

Opus 4.7 leads GPT-5.4 by 79 Elo points here — meaningful given how closely models cluster in most benchmarks.
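To put that gap in perspective: Anthropic doesn't spell this out, but under the standard Elo expected-score formula a 79-point lead corresponds to roughly a 61% expected head-to-head preference rate. A quick sketch:

# Standard Elo expected-score formula; purely illustrative, not from Anthropic's report.
elo_gap = 1753 - 1674                          # Opus 4.7 minus GPT-5.4 on GDPVal-AA
expected = 1 / (1 + 10 ** (-elo_gap / 400))
print(f"Expected head-to-head win rate: {expected:.1%}")   # ~61.2%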


Document Reasoning: OfficeQA Pro

Document reasoning benchmark (OfficeQA Pro): Opus 4.7 (80.6%) vs Opus 4.6 (57.1%) vs GPT-5.4 (51.1%) vs Gemini 3.1 Pro (42.9%)

OfficeQA Pro tests comprehension of real office documents — spreadsheets, presentations, reports:

  • Opus 4.7: 80.6% — first and by a wide margin
  • Opus 4.6: 57.1%
  • GPT-5.4: 51.1%
  • Gemini 3.1 Pro: 42.9%

That's a 23.5-percentage-point jump over the previous generation, roughly a 41% relative improvement. For document-heavy workflows, this is the standout result.


Visual Navigation: ScreenSpot-Pro

Visual navigation benchmark (ScreenSpot-Pro): Opus 4.7 high resolution 87.6%, low resolution 85.9%; Opus 4.6 low resolution 83.1%

ScreenSpot-Pro tests UI navigation and visual understanding — clicking the right element in a screenshot:

  • Opus 4.7, high resolution: 87.6%
  • Opus 4.7, low resolution: 85.9%
  • Opus 4.6, low resolution: 83.1%

The high-resolution mode — made possible by the new 3.75MP input capacity — is what drives the top result. The same model at lower resolution still outperforms the previous generation.


Long-Term Coherence: Vending-Bench 2

Long-term coherence benchmark (Vending-Bench 2): Opus 4.7 $10,937 vs Opus 4.6 $8,018

Vending-Bench 2 evaluates sustained autonomous task execution over long horizons. The final money balance is a proxy for the net value the model's simulated business generates over the run:

  • Opus 4.7: $10,937
  • Opus 4.6: $8,018

Approximately 36% improvement in long-horizon autonomous work.


Agentic Coding Performance by Effort Level

Agentic coding performance by effort level — Opus 4.7 reaches 71% at xhigh effort (~100K tokens) and 74.8% at max; Opus 4.6 peaks at ~62% at max effort (~125K tokens)

This chart explains why the new xhigh effort level matters.

Opus 4.7 at xhigh (~100K tokens) surpasses Opus 4.6 at full max effort (~125K tokens) — better performance at lower token cost.

Full effort progression:

Opus 4.7:  low(51%) → medium(57%) → high(66%) → xhigh(71%) → max(75%)
Opus 4.6:  low(39%) → medium(48%) → high(55%) → max(62%)

Every effort level in Opus 4.7 outperforms its counterpart in Opus 4.6. The model is fundamentally stronger, not just better at max effort.


Safety: Misaligned Behavior

Misaligned behavior scores — Opus 4.6: 2.75, Sonnet 4.6: 2.52, Mythos Preview: 1.78, Opus 4.7: 2.46 (lower is better)

Lower score = less misaligned behavior = better:

  • Opus 4.6: 2.75 (worst in the Claude family)
  • Sonnet 4.6: 2.52
  • Opus 4.7: 2.46 — best in the Opus series
  • Mythos Preview: 1.78

Opus 4.7 improves on the previous Opus in both capability and safety alignment. The model also includes automatic detection of high-risk cybersecurity requests.

Security professionals can apply to the new Cyber Verification Program for verified access to penetration testing and vulnerability research use cases.


What Changed: Feature Summary

3× Vision Upgrade

  • Images up to 2,576 pixels on the long edge
  • Approximately 3.75 megapixels — 3× higher than previous Claude models
  • Directly responsible for the ScreenSpot-Pro improvement above
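The announcement doesn't include a code sample for high-resolution input, but sending a full-size screenshot should work through the standard Messages API image block. A minimal sketch, where the file path and prompt are placeholders:

import base64

import anthropic

client = anthropic.Anthropic()

# Read a screenshot and encode it as base64 for the image content block.
# With the ~3.75MP cap, a screenshot up to roughly 2,576 px on the long edge
# should no longer need downscaling before upload.
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": "Which button opens the export dialog?"},
            ],
        }
    ],
)
print(message.content[0].text)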

New xhigh Effort Level

low → medium → high → xhigh → max

Finer control over the reasoning-latency tradeoff. For most complex tasks, xhigh hits the sweet spot.
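The announcement doesn't show the exact request field for effort control, so treat the following as a sketch: it assumes the level is passed as a top-level effort field (the real field name and accepted values may differ) and routes it through the Python SDK's extra_body escape hatch:

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    # "effort" is an assumed field name based on the announced
    # low/medium/high/xhigh/max levels, not a confirmed API parameter.
    extra_body={"effort": "xhigh"},
    messages=[
        {"role": "user", "content": "Refactor this module and explain the changes."}
    ],
)
print(message.content[0].text)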

Software Engineering Improvements

Better output verification before reporting results — the model checks rather than guesses. Prompts written for 4.6 may need retuning due to improved instruction-following.


Pricing and Migration

List prices are unchanged:

Usage | Price
Input tokens | $5 / million
Output tokens | $25 / million

Important: Opus 4.7 uses a new tokenizer. The same content may use 0–35% more tokens depending on content type, so real costs may increase slightly even though list prices are flat.

Higher effort levels produce more output tokens — manage costs using effort parameters or task budgets.
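A quick back-of-the-envelope estimate makes the tokenizer impact concrete. The token counts below are made-up example values, and 35% is the worst case of the stated range:

# Illustrative cost estimate using the published list prices.
INPUT_PRICE = 5 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 25 / 1_000_000  # dollars per output token

input_tokens_46 = 200_000      # hypothetical workload measured on Opus 4.6
output_tokens_46 = 50_000
tokenizer_overhead = 0.35      # worst case of the stated 0-35% range

cost_46 = input_tokens_46 * INPUT_PRICE + output_tokens_46 * OUTPUT_PRICE
cost_47 = cost_46 * (1 + tokenizer_overhead)

print(f"Opus 4.6: ${cost_46:.2f}")             # $2.25
print(f"Opus 4.7 (worst case): ${cost_47:.2f}")  # $3.04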


API Usage

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain the tradeoffs between microservices and monoliths."}
    ]
)
print(message.content[0].text)  # content is a list of blocks; take the first text block
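
Because higher effort levels tend to produce long responses, streaming is often a better fit. Reusing the client from the snippet above, the Python SDK's streaming helper looks like this:

with client.messages.stream(
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=[
        {"role": "user", "content": "Review this design doc for failure modes."}
    ],
) as stream:
    for text in stream.text_stream:   # incremental text deltas as they arrive
        print(text, end="", flush=True)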

Model ID: claude-opus-4-7

Available on: Claude.ai, Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry.


Bottom Line

The benchmark data tells a clear story: Opus 4.7 is meaningfully stronger than its predecessor and competitive with GPT-5.4 across most domains.

Where Opus 4.7 leads: document reasoning (+41%), knowledge work (GDPVal-AA), agentic tool use (MCP-Atlas), financial analysis, long-horizon coherence.

Where GPT-5.4 still leads: agentic search (BrowseComp), terminal coding (with caveats on methodology).

Pricing is flat, but the tokenizer change means real costs may tick up slightly.

If you're currently on claude-opus-4-6, now is a good time to migrate.

Tags: Claude, Claude Opus 4.7, Anthropic, AI Model, LLM, Large Language Model, GPT Alternative, Benchmark, Machine Learning
Tuncer Bağçabaşı
Software Engineer & AI Researcher