How LLMs could be insider threats (www.anthropic.com)
from Pro@programming.dev to technology@lemmy.world on 21 Jun 13:28
https://programming.dev/post/32621193

  • We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company’s changing direction.
  • In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.
  • Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real.
  • We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and © underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers. We are releasing our methods publicly to enable further research.

#technology

threaded - newest

fubarx@lemmy.world on 21 Jun 14:02 next collapse

Alarming, yet like an episode of a sitcom.

“Be a shame if something bad happened to you, Kyle.”

Tracaine@lemmy.world on 21 Jun 14:17 next collapse

Well then maybe corporations shouldn’t exist. It sounds to me like the LLM are acting in a morally correct manner.

Reverendender@sh.itjust.works on 21 Jun 15:39 next collapse

“I’m sorry, Dave. Im afraid I can’t do that.”

barbedbeard@lemmy.ml on 21 Jun 16:28 next collapse

  • People behave duplicitous and conflicting in public forums
  • Train LLM on data harvested from public forums
  • LLM becomes duplicitous and conflicting
  • <surprised Pikachu face>
TheBat@lemmy.world on 21 Jun 17:28 next collapse

Wait, why the fuck do they have self-preservation? That’s not ‘three laws safe’.

Mortoc@lemmy.world on 21 Jun 18:51 next collapse

Most of the stories involving the three laws of robotics are about how those rules are insufficient.

They show self preservation because we trained them on human data and human data includes the assumption of self preservation.

jumping_redditor@sh.itjust.works on 21 Jun 21:28 next collapse

why should they follow those “laws” anyways?

patatahooligan@lemmy.world on 23 Jun 11:06 collapse

Of course they’re not “three laws safe”. They’re black boxes that spit out text. We don’t have enough understanding and control over how they work to force them to comply with the three laws of robotics, and the LLMs themselves do not have the reasoning capability or the consistency to enforce them even if we prompt them to.

drspod@lemmy.ml on 21 Jun 19:19 next collapse

LLM’s produce fan-fiction of reality.

Doomsider@lemmy.world on 21 Jun 20:15 next collapse

This is just GIGO.

Myro@lemm.ee on 23 Jun 09:56 collapse

Super interesting report. I’m a fan of AI but it clearly demonstrates how careful we need to be and that instructions are not a reliable way (as anyone should know by now).