Google's Gemini 3.1 Pro Preview currently leads the Humanity's Last Exam (HLE) benchmark at 44.7% on Artificial Analysis and 37.5% on Scale Labs' leaderboard, reflecting rapid scaling in reasoning capabilities via Deep Think mode, which hit 48.4% without tools in February 2026 announcements. This frontier benchmark of 2,500 expert questions tests AI limits beyond memorization, where prior Gemini 2.5 models scored under 20%. Trader consensus weighs Google's aggressive release cadence—including March's Gemini 3.1 Flash Live Preview—against competitors like OpenAI's GPT-5.4 (44.3% top) and Anthropic's Claude Opus 4.6. Key catalysts ahead: Google I/O in May for potential Gemini 4 previews and Q2 earnings on compute scaling, amid benchmark discrepancies over tool use and evaluation rigor.
Experimental AI summary based on Polymarket data · Updated
$202,880 Vol.
40%+: 96%
45% or above: 83%
50% or above: 39%
55% or above: 16%
Above 60%: 10%
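The listed outcomes are nested thresholds, so each price reads as a cumulative probability that the top score reaches at least that level. A minimal sketch of the arithmetic for turning those cumulative prices into per-interval probabilities follows. It assumes the prices are mutually consistent probabilities and that every outcome refers to the same underlying score; the `implied_buckets` helper and the treatment of "Above 60%" as the top interval are illustrative assumptions, not part of the market's rules.

```python
# Cumulative "Yes" prices copied from the outcome list:
# threshold (%) -> market-implied P(score >= threshold).
cumulative = {40: 0.96, 45: 0.83, 50: 0.39, 55: 0.16, 60: 0.10}


def implied_buckets(cum):
    """Difference adjacent cumulative prices to get per-interval probabilities.

    Hypothetical helper for illustration; assumes thresholds are nested and
    prices decrease as the threshold rises.
    """
    thresholds = sorted(cum)
    buckets = {}
    # Probability mass below the lowest listed threshold.
    buckets[f"<{thresholds[0]}"] = round(1.0 - cum[thresholds[0]], 2)
    # Mass between each pair of adjacent thresholds.
    for lo, hi in zip(thresholds, thresholds[1:]):
        buckets[f"{lo}-{hi}"] = round(cum[lo] - cum[hi], 2)
    # Mass at or above the highest listed threshold.
    buckets[f"{thresholds[-1]}+"] = round(cum[thresholds[-1]], 2)
    return buckets


buckets = implied_buckets(cumulative)
for label, p in buckets.items():
    print(f"{label:>6}: {p:.0%}")
```

Read this way, the single largest share of implied probability sits in the 45 to 50 interval, which is the difference between the 83% and 39% prices.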
The resolution source will be the official Humanity’s Last Exam leaderboard https://scale.com/leaderboard/humanitys_last_exam.
Market opened: Jan 29, 2026, 12:50 PM ET
Resolver: 0x65070BE91...
Frequently Asked Questions