Humanity's Last Exam is a rigorous benchmark of 2,500 expert-level questions across math, science, and the humanities, launched by the Center for AI Safety on June 4 to test frontier large language model capabilities. Current top scores sit below 9%: Google's Gemini 1.5 Pro Experimental at 8.9% and OpenAI's o1-preview at 8.8%. Anthropic's Claude 3.5 Sonnet, released June 20, lags at 7.4% on the public leaderboard, which reflects incremental gains but underscores persistent gaps in reasoning and knowledge synthesis relative to competitors. No new Claude model or targeted evaluation announcements have emerged ahead of the June 30 cutoff, so trader consensus is shaped by this low baseline and by the benchmark's design to resist short-term advances. Resolution hinges on any last-minute submissions, though historical patterns suggest modest shifts at best.
Experimental AI-generated summary referencing Polymarket data · Updated
$187,331 Vol.
35% or above: 93%
45% or above: 46%
The resolution source will be the official Humanity’s Last Exam leaderboard https://scale.com/leaderboard/humanitys_last_exam.
Market opened: Jan 30, 2026, 12:00 AM ET
Resolver
0x65070BE91...
Frequently Asked Questions