Meta got caught gaming AI benchmarks (www.theverge.com)
from misk@sopuli.xyz to technology@beehaw.org on 08 Apr 2025 16:30
https://sopuli.xyz/post/25101487

Archive: archive.is/…/meta-llama-4-maverick-benchmarks-gam…

Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash “across a broad range of widely reported benchmarks.”

Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta’s press release, the company highlighted Maverick’s ELO score of 1417, which placed it above OpenAI’s 4o and just under Gemini 2.5 Pro. (A higher ELO score means the model wins more often in the arena when going head-to-head with competitors.)
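For context on what that ELO number encodes: under the standard Elo model, each head-to-head vote nudges both models' scores up or down depending on how surprising the result was. Below is a minimal sketch of the textbook Elo formulas with an illustrative K-factor of 32; LMArena's actual scoring pipeline may differ, and the ratings used here are just examples.

```python
# Sketch of standard Elo math, to illustrate how head-to-head votes
# translate into a score like Maverick's reported 1417.
# Assumption: textbook Elo update with K=32; not LMArena's exact method.

def expected_win_prob(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one head-to-head vote."""
    exp_a = expected_win_prob(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: a model rated 1417 is expected to beat a 1400-rated model
# in roughly 52% of votes.
print(round(expected_win_prob(1417, 1400), 3))  # ~0.524
```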

The achievement seemed to position Meta’s open-weight Llama 4 as a serious challenger to the state-of-the-art, closed models from OpenAI, Anthropic, and Google. Then, AI researchers digging through Meta’s documentation discovered something unusual.

In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn’t the same as what’s available to the public. According to Meta’s own materials, it deployed an “experimental chat version” of Maverick to LMArena that was specifically “optimized for conversationality,” TechCrunch first reported.

#technology


HappyFrog@lemmy.blahaj.zone on 09 Apr 2025 07:52

Don’t all AI companies cheat in LLM tests?