TECH | 6 MIN READ

Claude Just Beat ChatGPT on Benchmarks — How Long Will It Last?


Claude Opus 4.6 currently leads most AI benchmarks with 80.8% on SWE-bench and 68.8% on ARC-AGI-2, but historical data shows AI benchmark reigns have shrunk from 365 days (GPT-4) to weeks in 2026.

The Shrinking Throne: A History of AI Benchmark Kings

GPT-4 held the crown for nearly a year. Claude Opus 4.6 might hold it for weeks. The pattern is clear: every new AI king rules for less time than the last.

On February 5, 2026, Anthropic released Claude Opus 4.6. It topped SWE-bench Verified at 80.8%, crushed ARC-AGI-2 at 68.8%, and posted an 84.0% on BrowseComp search. Nine days later, it still sits at or near the top of most leaderboards.

But how long will that last? History says: not very.

Every AI Model That Sat on the Throne

We tracked every major model release since late 2022 and measured how long each one held the top spot on key benchmarks before being overtaken. The results tell a story of accelerating displacement.

AI MODEL REIGN DURATION — days as benchmark leader

Model                     Days as leader
ChatGPT (GPT-4)           ~365
Claude 3 Opus             ~67
GPT-4o                    ~92
Claude 3.5 Sonnet         ~80
o1-preview                ~90
DeepSeek R1               ~30
Gemini 2.5 Pro            ~40
Claude 3.5 Sonnet v2      ~50
Claude 4 Opus             ~42
Gemini 3 Pro              ~35
GPT-5.2                   ~37
Claude Opus 4.6           9+ (and counting)

Source: DropThe.org analysis of benchmark leadership periods
DROPTHE INTEL
GPT-4 held the AI crown for roughly 365 days. By late 2025, new models were being dethroned in under 40 days. At this rate, the 2027 benchmark leader might hold the title for a single week.

The table tells the whole story. GPT-4 reigned for roughly a year. By 2025, no model held the top spot for more than three months. In 2026, we measure reigns in weeks.
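For the quantitatively inclined, the "single week by 2027" extrapolation is easy to sanity-check with a log-linear fit over the reign durations above. Here is a minimal Python sketch; treating successive releases as evenly spaced steps in time is a simplifying assumption, and the projected figures are illustrative, not predictions.

```python
# A rough decay fit over the reign durations from the table above.
# Assumption: release order stands in for an evenly spaced time axis.
import numpy as np

reigns = {                      # model -> approximate days as leader
    "GPT-4": 365, "Claude 3 Opus": 67, "GPT-4o": 92,
    "Claude 3.5 Sonnet": 80, "o1-preview": 90, "DeepSeek R1": 30,
    "Gemini 2.5 Pro": 40, "Claude 3.5 Sonnet v2": 50,
    "Claude 4 Opus": 42, "Gemini 3 Pro": 35, "GPT-5.2": 37,
}

x = np.arange(len(reigns))                 # release order: 0, 1, 2, ...
y = np.log(np.array(list(reigns.values()), dtype=float))

slope, intercept = np.polyfit(x, y, 1)     # straight line in log space
print(f"reign shrinks ~{(1 - np.exp(slope)) * 100:.0f}% per release")
print(f"reign halves every ~{np.log(2) / -slope:.1f} releases")

# Project a few releases ahead (roughly 2027 at the current cadence):
for k in (len(reigns) + 3, len(reigns) + 6):
    print(f"release #{k}: ~{np.exp(intercept + slope * k):.0f} days on top")
```

On the table's numbers, each reign shrinks by roughly 15% per release and halves about every four releases, which puts the one-week mark only a handful of cycles out. GPT-4's 365-day outlier does a lot of work in that fit, so treat the projection as directional rather than precise.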

The Numbers Behind the Throne

Claude Opus 4.6 posted strong numbers across the board. On SWE-bench Verified, the gold standard for real-world coding ability, it scores 80.8%. On ARC-AGI-2, a test of abstract reasoning on novel problems, it hits 68.8% — nearly double Opus 4.5's 37.6% and well ahead of Google's Gemini 3 Pro at 45.1%.

The agentic benchmarks are where it really pulls away. BrowseComp search: 84.0%, up from 67.8% for Opus 4.5. OSWorld computer use: 72.7%. Tool orchestration on tau-2-bench Retail: 91.9%. These are not incremental gains. They represent a different class of capability.

DropThe Data: Claude Opus 4.6 scores 68.8% on ARC-AGI-2, nearly double its predecessor and 14.6 points ahead of GPT-5.2 Pro’s 54.2%. But GPT-5.2 Pro still leads on Humanity’s Last Exam at 50.0% vs Claude’s 40.0%. No single model dominates every benchmark anymore.

But OpenAI released GPT-5.3-Codex the same day Opus 4.6 shipped. It hit 56.8% on SWE-Bench Pro and 77.3% on Terminal-Bench 2.0. The two companies dropped competing models within hours of each other.

That alone tells you where this market is heading.

DROPTHE INTEL
Anthropic, OpenAI, Google DeepMind, and Meta collectively spent an estimated $30B+ on AI research in 2025. The benchmark race is also an arms race — and the cost of staying on top is rising faster than the time you get to stay there.

The Exponential Decay of AI Dominance

In March 2023, GPT-4 arrived and made everything else look obsolete. It topped MMLU, HumanEval, and Chatbot Arena’s Elo ratings. For nearly a full year, nothing came close. OpenAI owned the conversation.

Then Anthropic dropped Claude 3 Opus in March 2024. It briefly took the Chatbot Arena crown. Two months later, GPT-4o took it back. Then Claude 3.5 Sonnet reclaimed parts of the leaderboard in June 2024. Then OpenAI's o1-preview in September 2024 changed the reasoning game entirely.

By January 2025, a Chinese lab — DeepSeek — shocked the field with R1, an open-weight reasoning model that matched o1 at a fraction of the cost. Its reign was measured in weeks before Google fired back with Gemini 2.5 Pro.

The second half of 2025 was a knife fight. Claude 4 in May. Claude 4.5 in September. Gemini 3 Pro in November. GPT-5.2 in December. Each one held the top spot for roughly five to six weeks before being displaced.

Now Opus 4.6 sits at the top. The clock started nine days ago.

Why Reigns Keep Shrinking

Three forces drive this acceleration.

Training data is converging. Every lab trains on roughly the same internet. The differentiators are architecture, RLHF tuning, and synthetic data pipelines — all of which competitors can replicate within months.

Benchmarks are saturating. MMLU is effectively solved. GPQA Diamond has the top four models within 4 percentage points of each other. When benchmarks cluster, any new model can claim the crown with marginal improvements that get erased by the next release.

Release cadence is accelerating. OpenAI, Anthropic, and Google are now shipping major model updates every 6–10 weeks. Meta and Microsoft push open-weight models on even faster cycles. The gap between releases is compressing faster than the gap between capabilities.

The Post-Benchmark Era

Here is the uncomfortable truth: benchmark leadership barely matters anymore.

When GPT-4 launched, it was measurably better at everything. Users felt the difference immediately. Today, the gap between Claude Opus 4.6, GPT-5.2, and Gemini 3 Pro on most tasks is smaller than the variance between individual prompts.

The real competition has shifted. It is about product experience, tool integration, and agent reliability — things that benchmarks struggle to capture. Anthropic's edge with Claude Code is a product advantage, not a benchmark one. OpenAI's strength with enterprise API tooling is the same. Google's Gemini integration with Search and Workspace is yet another.

The model that wins is increasingly the one that fits your workflow, not the one that scores 2% higher on a graduate-level physics exam.

What Comes Next

Based on the current release cadence, Claude Opus 4.6 will likely hold its benchmark positions for 3–5 weeks. Google's Gemini 3 Ultra is rumored for late February or March. OpenAI has GPT-5.3 already in limited access. Meta's Llama 4 is expected in Q1 2026.

The race is no longer about who gets to the top. Everyone gets there. The question is who stays there — and the answer, increasingly, is nobody.

The AI benchmark throne has gone from a dynasty to a revolving door. GPT-4’s year-long reign looks like ancient history. In 2026, the crown changes heads before most users notice it moved.

That is not a failure of any single company. It is the natural result of an industry where seven major labs and dozens of open-source projects are all pushing the same frontier, with overlapping data, converging techniques, and accelerating release schedules.

The winner of the AI race will not be determined by who tops the next benchmark. It will be determined by who builds the best products around models that are, functionally, interchangeable.

Claude just beat ChatGPT. Enjoy it while it lasts. Because it will not last long.

Sources: DropThe.org analysis of 2M+ entities. Benchmark data via Anthropic, OpenAI, Google DeepMind.


FAQ

Which AI model is best in February 2026?
Claude Opus 4.6 leads on agentic coding (80.8% SWE-bench) and abstract reasoning (68.8% ARC-AGI-2). GPT-5.2 Pro leads on Humanity's Last Exam (50.0%) and GPQA Diamond (93.2%). No single model dominates all benchmarks.
How long do AI models stay on top of benchmarks?
The trend is exponential decay. GPT-4 held the top spot for roughly a year in 2023. By late 2025, models held benchmark leadership for only 5–6 weeks before being displaced by a competitor.
What is Claude Opus 4.6?
Claude Opus 4.6 is Anthropic's flagship AI model, released February 5, 2026. It features a 1M token context window, strong agentic capabilities, and leading scores on coding and reasoning benchmarks.
Is Claude better than ChatGPT?
It depends on the task. Claude Opus 4.6 leads in agentic tool use, abstract reasoning, and web research. GPT-5.2 leads in scientific knowledge and some coding scenarios. For most users, the difference is marginal.