Age against the machine—susceptibility of large language models to cognitive impairment: cross sectional analysis (www.bmj.com)
from Joker@sh.itjust.works to technology@lemmy.world on 20 Dec 11:13
https://sh.itjust.works/post/29763030

Abstract

Objective: To evaluate the cognitive abilities of the leading large language models and identify their susceptibility to cognitive impairment, using the Montreal Cognitive Assessment (MoCA) and additional tests.

Design: Cross sectional analysis.

Setting: Online interaction with large language models via text based prompts.

Participants: Publicly available large language models, or “chatbots”: ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 “Sonnet” (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet).

Assessments: The MoCA test (version 8.1) was administered to the leading large language models with instructions identical to those given to human patients. Scoring followed official guidelines and was evaluated by a practising neurologist. Additional assessments included the Navon figure, cookie theft picture, Poppelreuter figure, and Stroop test.

Main outcome measures: MoCA scores, performance in visuospatial/executive tasks, and Stroop test results.

Results: ChatGPT 4o achieved the highest score on the MoCA test (26/30), followed by ChatGPT 4 and Claude (25/30), with Gemini 1.0 scoring lowest (16/30). All large language models showed poor performance in visuospatial/executive tasks. Gemini models failed at the delayed recall task. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test.

Conclusions: With the exception of ChatGPT 4o, almost all large language models subjected to the MoCA test showed signs of mild cognitive impairment. Moreover, as in humans, age is a key determinant of cognitive decline: “older” chatbots, like older patients, tend to perform worse on the MoCA test. These findings challenge the assumption that artificial intelligence will soon replace human doctors, as the cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients’ confidence.

#technology


ReadMoreBooks@lemmy.zip on 20 Dec 12:01

Objective: To evaluate the cognitive abilities of the leading large language models and identify their susceptibility to cognitive impairment, using the Montreal Cognitive Assessment (MoCA) and additional tests.

Results: ChatGPT 4o achieved the highest score on the MoCA test (26/30), followed by ChatGPT 4 and Claude (25/30), with Gemini 1.0 scoring lowest (16/30). All large language models showed poor performance in visuospatial/executive tasks. Gemini models failed at the delayed recall task. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test.

Conclusions: With the exception of ChatGPT 4o, almost all large language models subjected to the MoCA test showed signs of mild cognitive impairment. Moreover, as in humans, age is a key determinant of cognitive decline: “older” chatbots, like older patients, tend to perform worse on the MoCA test. These findings challenge the assumption that artificial intelligence will soon replace human doctors, as the cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients’ confidence.

webghost0101@sopuli.xyz on 20 Dec 14:15

Pro- and anti-AI feelings aside.

We live in an age where machines are being measured using tools developed for psychology.

j4yt33@feddit.org on 21 Dec 12:06

Doesn’t mean it makes a lot of sense to do so

qantravon@lemmy.world on 20 Dec 18:33

I really hate that people keep treating these LLMs as if they’re actually thinking. They absolutely are not. All they are, under the hood, is really complicated statistical models. They don’t think about or understand anything, they just calculate what the most likely response to a given input is based on their training data.
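To make that concrete, here is a minimal sketch of the “most likely response” idea, using a toy word-pair counter rather than anything resembling a production model: generating text amounts to picking whichever continuation appeared most often in the training data.

```python
# Toy illustration only: a bigram counter, vastly simpler than a real LLM,
# but the principle is the same: pick the statistically likeliest continuation.
from collections import Counter, defaultdict

training_text = "the cat sat on the mat . the cat ate the fish ."
counts = defaultdict(Counter)
words = training_text.split()
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1  # record which word follows which in the "training data"

def next_word(prev: str) -> str:
    # Return the most frequent continuation ever seen after `prev`.
    return counts[prev].most_common(1)[0][0]

print(next_word("the"))  # -> "cat": pure frequency, no understanding involved
```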

That becomes really obvious when you look at where they often fall down: math questions and questions about the actual words they’re using.

They do well on standardized math assessments, but if you change the questions just a little, to something outside their training data (often just different numbers or a slightly different phrasing is enough), they fail spectacularly.

They often can’t answer questions about words at all (how many ‘r’s are in ‘strawberry’, for instance) because they don’t even have a concept of the word; they just have a token that represents it, plus a set of learned associations that determine when to use it.
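You can see the token issue directly. As a minimal sketch, assuming the tiktoken library is installed, this prints the integer IDs a GPT-4-era tokenizer produces for “strawberry”; the model receives those IDs, not individual letters, so counting letters isn’t something its input even exposes.

```python
# Minimal sketch (assumes `pip install tiktoken`): what a model actually "sees".
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models
tokens = enc.encode("strawberry")
print(tokens)  # a short list of integer token IDs, not letters
for t in tokens:
    print(t, enc.decode_single_token_bytes(t))  # the text chunk each ID stands for
```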

LLMs are complex, and because of how they’re designed, the specifics (which associations they form and how those associations are weighted) are opaque to us. But that doesn’t mean we don’t know how they work, despite that being a big talking point when they first came out. And I really wish people would stop treating them like something they’re not.

rottingleaf@lemmy.world on 21 Dec 14:48

“Cognitive” is not a term applicable to extrapolators.