Google's Gemini 3.1 Pro Preview currently leads third-party Humanity's Last Exam leaderboards at 44.7% accuracy on this frontier benchmark of 2,500 PhD-level questions spanning math, science, and reasoning. That lead is fueling trader consensus for a 40%+ score by June 30, helped by rapid iteration from Gemini 3 Pro's 38% in November 2025 to Deep Think's claimed 48.4% without tools in February 2026. Competing scores from OpenAI's GPT-5.4 (41.6%) and Anthropic's Claude Opus 4.6 (around 40%) underscore Google's edge in reasoning capabilities, though the official Scale AI leaderboard lags at 37.5% for Gemini because of evaluation discrepancies and private test set verification. With Google I/O looming in May, further model releases could push implied probabilities higher for 45%+ (82%) or 50%+ (40%), but benchmark saturation risks and calibration errors remain key hurdles.
Experimental AI-generated summary based on Polymarket data · Updated
$202,880 Vol.
40%+: 96%
45%+: 83%
50%+: 39%
55%+: 16%
60%+: 10%
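The threshold prices above are cumulative ("at least X%"), so the chance the market assigns to each score band can be read off by subtracting adjacent thresholds. The sketch below shows that arithmetic. It assumes each price can be treated directly as a probability, ignoring spreads and fees, and the function name `bucket_probabilities` is hypothetical, not part of any Polymarket tooling.

```python
# Cumulative "at least X%" prices copied from the listing above, as fractions.
# Assumption: each price is read directly as a probability (no spread or fees).
cumulative = {40: 0.96, 45: 0.83, 50: 0.39, 55: 0.16, 60: 0.10}


def bucket_probabilities(cum):
    """Convert cumulative threshold prices into per-band probabilities.

    A band labelled "40-45%" means a score of at least 40% and below 45%.
    """
    thresholds = sorted(cum)
    buckets = {}
    # Everything below the lowest threshold.
    buckets[f"<{thresholds[0]}%"] = round(1.0 - cum[thresholds[0]], 2)
    # Each band is the difference between two adjacent cumulative prices.
    for lo, hi in zip(thresholds, thresholds[1:]):
        buckets[f"{lo}-{hi}%"] = round(cum[lo] - cum[hi], 2)
    # Everything at or above the highest threshold.
    buckets[f">={thresholds[-1]}%"] = round(cum[thresholds[-1]], 2)
    return buckets


if __name__ == "__main__":
    for band, prob in bucket_probabilities(cumulative).items():
        print(f"{band:>8}: {prob:.0%}")
```

Under these assumptions, the largest single band is 45-50%, which is the gap between the 83% and 39% prices.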
The resolution source will be the official Humanity’s Last Exam leaderboard https://scale.com/leaderboard/humanitys_last_exam.
Market opened: Jan 29, 2026, 12:50 PM ET
Resolver
0x65070BE91...
Beware of external links.
Frequently asked questions