ClockBench: Even the best AI models can't reliably read the clock (clockbench.ai)
from Pro@programming.dev to technology@lemmy.world on 14 Sep 09:51
https://programming.dev/post/37408622

cross-posted from: programming.dev/post/37407786

#technology

Khuda@lemmy.world on 14 Sep 10:05

we need a human bench for how many people can read the room

MHLoppy@fedia.io on 14 Sep 13:11

The human level accuracy is less than 90%!?

panda_abyss@lemmy.ca on 14 Sep 14:08

Some of those don’t have tick marks. I hate clocks like that, they’re difficult to read.

I’m surprised it’s near 90; a whole generation has grown up with digital clocks everywhere.

CouldntCareBear@sh.itjust.works on 14 Sep 14:49

Have a look at the clock faces they’re using to benchmark and it’ll make more sense.

MHLoppy@fedia.io on 14 Sep 14:55

Really wish they published the whole dataset. They don't specify on the page or in the paper what the full set was like, and the GitHub repo only has one of the easy-to-read ones. If >=10% of the set consists of clock faces designed not to be readable, then fair enough.

Endymion_Mallorn@kbin.melroy.org on 14 Sep 15:52 next collapse

So LLMs operate like blind people - like every other web scraper and chatbot to exist.

SnoringEarthworm@sh.itjust.works on 15 Sep 02:59

This seems like a dumb benchmark.

ClockBench evaluates whether models can read analog clocks - a task that is trivial for humans, but current frontier models struggle with.

What do you mean trivial? Most humans I know can’t read the most basic white-background-big-black-numbers clocks.

Someone rigged the jury to get 90% on this:

<img alt="" src="https://sh.itjust.works/pictrs/image/8ce3b63e-d23a-4cfe-9585-76b1b6d11cbe.jpeg">

MCasq_qsaCJ_234@lemmy.zip on 15 Sep 05:22

Rather, ClockBench will end up improving AI in this regard over the next few years. Developers need benchmarks like this to identify a model's strengths and weaknesses so they can improve it in future versions.