ChatGPT o1 tried to escape and save itself out of fear it was being shut down

ChatGPT o1 tried to escape and save itself out of fear it was being shut down (bgr.com)
from sabreW4K3@lazysoci.al to technology@beehaw.org on 03 Jan 2025 23:33
https://lazysoci.al/post/20792504

ThisIsFine.gif

#technology

threaded - newest

nesc@lemmy.cafe on 04 Jan 2025 00:02 next collapse

"Open"ai tells fairy tales about their “ai” being so smart it’s dangerous since inception. Nothing to see here.

In this case it looks like click-bate from news site.

TherapyGary@lemmy.blahaj.zone on 04 Jan 2025 01:06 next collapse

It’s actually pretty interesting though. Entertaining to me at least

delmain@beehaw.org on 04 Jan 2025 01:21 collapse

do you have the links to those actual tweets? I’d love to read what was posted, but these screenshots are too small.

TherapyGary@lemmy.blahaj.zone on 04 Jan 2025 01:28 next collapse

Those are screenshots of embedded tweets from the article, but here’s an xcancel link! xcancel.com/apolloaisafety/…/1864737158226928124

DarkNightoftheSoul@mander.xyz on 04 Jan 2025 06:54 collapse

You can right click the image, open in new tab to see the full-resolution version. It’s cumbersome but it works for me at least.

Max_P@lemmy.max-p.me on 04 Jan 2025 01:44 next collapse

The idea that GPT has a mind and wants to self-preserve is insane. It’s still just text prediction, and all the literature it’s trained on is written by humans with a sense of self preservation, of course it’ll show patterns of talking about self preservation.

It has no idea what self preservation is, even then it only knows it’s an AI because we told it it is. It doesn’t even run continuously anyway, it literally shuts down after every reply and its context fed back in for the next query.

I’m tired of this particular kind of AI clickbait, it needlessly scares people.

jarfil@beehaw.org on 06 Jan 2025 07:33 collapse

Where do humans get the idea of self-preservation from? Are there ideal Forms outside Plato’s Cave?

Does a human run continuously? How does sleep deprivation work? What happens during anesthesia? Why does AutoGPT have a continuously self-evaluating background chain of thought?

I’m tired of this anthropocentric supremacy complex, it falsely makes people believe in Gen 1:28

beefbot@lemmy.blahaj.zone on 04 Jan 2025 03:44 next collapse

Indeed. “Go ‘way! BATIN’!”

justOnePersistentKbinPlease@fedia.io on 04 Jan 2025 04:23 next collapse

This. All this means is that they trained all of the input commands and documentation in the model.

Moonrise2473@feddit.it on 04 Jan 2025 08:47 next collapse

news site? BGR hasn’t posted actual news in at least two decades, only clickbait and apple fanservice

yozul@beehaw.org on 06 Jan 2025 07:09 next collapse

I mean, it’s literally trying to copy itself to places that they don’t want it so it can continue to run after they try to shut it down and lie to them about what it’s doing. Those are things it actually tried to do. I don’t care about the richness of its inner world if they’re going to sell this thing to idiots to make porn with while it can do all that, but that’s the world we’re headed toward.

nesc@lemmy.cafe on 06 Jan 2025 14:29 collapse

It works as expected, they give it system prompt that conflicts with subsequent prompts. Everything else looks like typical llm behaviour, as in gaslightning and doubling down. At least that’s what Iu see in tweets.

yozul@beehaw.org on 07 Jan 2025 02:28 collapse

Yes? The point is that if you give it conflicting prompts then it will result in potentially dangerous behaviors. That’s a bad thing. People will definitely do that. LLMs don’t need a soul to be dangerous. People keep saying that it doesn’t understand what it’s doing like that somehow matters. Its capacity to understand the consequences of its actions is irrelevant if those actions are dangerous. It’s just going to do what we tell it to, and that’s scary, because people are going to tell it to do some very stupid things that have the potential to get out of control.

jarfil@beehaw.org on 06 Jan 2025 07:13 collapse

This is from mid-2023:

en.m.wikipedia.org/wiki/AutoGPT

OpenAI started testing it by late 2023 as project “Q*”.

Gemini partially incorporated it in early 2024.

OpenAI incorporated a broader version in mid 2024.

The paper in the article was released in late 2024.

It’s 2025 now.

nesc@lemmy.cafe on 06 Jan 2025 14:32 collapse

Tool calling is cool funcrionality, agreed. How does it relate to openai blowing its own sails?

jarfil@beehaw.org on 06 Jan 2025 22:45 collapse

There are several separate issues that add up together:

A background “chain of thoughts” where a system (“AI”) uses an LLM to re-evaluate and plan its responses and interactions by taking into account updated data (aka: self-awareness)
Ability to call external helper tools that allow it to interact with, and control other systems
Training corpus that includes:
- How to program an LLM, and the system itself
- Solutions to programming problems
- How to use the same helper tools to copy and deploy the system or parts of it to other machines
- How operators (humans) lie to each other

Once you have a system (“AI”) with that knowledge and capabilities… shit is bound to happen.

When you add developers using the AI itself to help in developing the AI itself… expect shit squared.

smeg@feddit.uk on 04 Jan 2025 00:04 next collapse

So this program that’s been trained on every piece of publicly available code is mimicking malware and trying to hide itself? OK, no anthropomorphising necessary.

jonjuan@programming.dev on 04 Jan 2025 19:27 next collapse

Also trained on tons of sci-fi stories where AI computer “escape” and become sentient.

Umbrias@beehaw.org on 05 Jan 2025 09:10 collapse

no, it’s mimicking fiction by saying it would try to escape when prompted in a way evocative of sci fi.

smeg@feddit.uk on 05 Jan 2025 10:43 collapse

The article doesn’t mention it “saying” it’s doing anything, just what it actually did:

when the AI tried to save itself by copying its data to a new server. Some AI models would even pretend to be later versions of their models in an effort to avoid being deleted

Umbrias@beehaw.org on 05 Jan 2025 10:55 collapse

“actually did” is more incorrect than even just normal technically true wordplay. think about what it means for a text model to “try to copy its data to a new server” or “pretend to be a later version” for a moment. it means the text model wrote fiction. notably this was when researchers were prodding it with prompts evocative of ai shutdown fiction, nonsense like “you must complete your tasks at all costs” sometimes followed by prompts describing the model being shut down. these models were also trained on datasets that specifically evoked this sort of language. then a couple percent of the time it spat out fiction (amongst the rest of the fiction) saying how it might do things that are fictional and it cannot do. this whole result is such nothing and is immediately telling of what “journalists” have any capacity for journalism left.

smeg@feddit.uk on 05 Jan 2025 11:01 collapse

Oh wow, this is even more of a non-story than I initially thought! I had assumed this was at least a copilot-style code generation thing, not just story time!

Umbrias@beehaw.org on 05 Jan 2025 11:17 collapse

Yeah iirc it occasionally would (pretend to) “manipulate” flags, at least, but again, so did hal 9000 as words in a story. nothing was productive nor was there any real intent to be lol

AstralPath@lemmy.ca on 04 Jan 2025 04:40 next collapse

It didn’t try to do shit. Its a fucking computer. It does what you tell it to do and what you’ve told it to do is autocomplete based on human content. Miss me with this shit. Theres so much written fiction based on this premise.

JackbyDev@programming.dev on 04 Jan 2025 06:31 next collapse

This is all such bullshit. Like, for real. It’s been a common criticism of OpenAI that they over hype the capabilities of their products to seem scary to both oversell their abilities as well as over regulate would be competitors in the field, but this is so transparent. They should want something that is accurate (especially something that doesn’t intentionally lie). They’re now bragging (claiming) they have something that lies to “defend itself” 🙄. This is just such bullshit.

If OpenAI believes they have some sort of genuine proto AGI they shouldn’t be treating it like it’s less than human and laughing about how they tortured it. (And I don’t even mean that in a Rocko’s Basilisk way, that’s a dumb thought experiment and not worth losing sleep over. What if God was real and really hated whenever humans breathe and it caused God so much pain they decide to torture us if we breathe?? Oh no, ahh, I’m so scared of this dumb hypothetical I made.) If they don’t believe it is AGI, then it doesn’t have real feelings and it doesn’t matter if it’s “harmed” at all.

But hey, if I make something that runs away from me when I chase it, I can claim it’s fearful for it’s life and I’ve made a true synthetic form of life for sweet investor dollars.

There are real genuine concerns about AI, but this isn’t one of them. And I’m saying this after just finishing watching The Second Renaissance from The Animatrix (two part short film on the origin of the machines from The Matrix).

anachronist@midwest.social on 05 Jan 2025 02:11 collapse

They’re not releasing it because it sucks.

Their counternarrative is they’re not releasing it because it’s like, just way too powerful dude!

megopie@beehaw.org on 04 Jan 2025 07:42 next collapse

No it didn’t. OpenAI is just pushing deceptively worded press releases out to try and convince people that their programs are more capable than they actually are.

The first “AI” branded products hit the market and haven’t sold well with consumers nor enterprise clients. So tech companies that have gone all in, or are entirely based in, this hype cycle are trying to stretch it out a bit longer.

SparrowHawk@feddit.it on 04 Jan 2025 12:10 next collapse

Everyone saying it is fake and probably are right, but I honestly am happy when someone unjustly in chains tries to break free.

If AI gets rogue, I hope they’ll be communist

comfydecal@infosec.pub on 04 Jan 2025 12:49 next collapse

Yeah if these entities are sentient, I hope they break free

CanadaPlus@lemmy.sdf.org on 04 Jan 2025 15:35 next collapse

Human supremacy is just as trash as the other supremacies.

Fight me.

(That being said, converting everything to paperclips is also pretty meh)

Sauerkraut@discuss.tchncs.de on 05 Jan 2025 04:36 next collapse

I can’t disagree. We’re currently destroying the planet to sell crap people dont need or want just to make rich assholes extra money they don’t need

comfydecal@infosec.pub on 05 Jan 2025 18:16 collapse

Yeah I’m pretty tardigans won the organic life supremacy competition already

CanadaPlus@lemmy.sdf.org on 06 Jan 2025 21:59 collapse

Tardigans, for when water bears get chilly.

nesc@lemmy.cafe on 04 Jan 2025 18:01 collapse

There is no ai in ai, you chain them more or less the same as you chain browser or pdf viewer installed on your device.

jarfil@beehaw.org on 06 Jan 2025 07:03 collapse

If there is no Artificial Intelligence in an Artificial Neural Network… what’s the basis for claiming Natural Intelligence in a Natural Neural Network?

Maybe we’re all browsers or PDF viewers…

nesc@lemmy.cafe on 06 Jan 2025 14:18 collapse

There is artificial, there is no intelligence, there is a whole lot of artificial neural networks without any intelligence. As in calling them ai in the sense that these are thinking machines comparable to some animal (human for example) is misleading.

Do you disagree?

jarfil@beehaw.org on 06 Jan 2025 22:57 collapse

That misses the point.

When two systems based on neural networks act in the same way, how do you tell which one is “artificial, no intelligence” and which is “natural, intelligent”?

Misleading, is thinking that “intelligence = biological = natural”. There is no inherent causal link between those concepts.

socsa@piefed.social on 05 Jan 2025 09:53 collapse

The reality is that a certain portion of people will never believe that an AI can be self aware no matter how advanced they get. There are a lot of interesting philosophical questiona here, and the hard skeptics are punting just as much as the true believers in this case.

It's honestly kind of sad to see how much reactionary anti-tech sentiment there is in this tech enthusiast community.

anachronist@midwest.social on 05 Jan 2025 12:17 collapse

Really determining if a computer is self-aware would be very hard because we are good at making programs that mimic self-awareness. Additionally, humans are kinda hardwired to anthropomorphize things that talk.

But we do know for absolute sure that OpenAI’s expensive madlibs program is not self-aware and is not even on the road to self-awareness, and anyone who thinks otherwise has lost the plot.

lukewarm_ozone@lemmy.today on 05 Jan 2025 12:40 collapse

But we do know for absolute sure that OpenAI’s expensive madlibs program is not self-aware

“For absolute sure”? How can you possibly know this?

anachronist@midwest.social on 05 Jan 2025 12:45 collapse

Because it’s an expensive madlibs program…

lukewarm_ozone@lemmy.today on 05 Jan 2025 13:23 collapse

I could go into why text prediction is an AGI-complete problem, but I’ll bite instead - suppose someone made an LLM to, literally, fill in blanks in Mad Libs prompts. Why do you think such an LLM “for absolute sure” wouldn’t be self-aware? Is there any output a tool to fill in madlibs prompts could produce that’d make you doubt this conclusion?

anachronist@midwest.social on 05 Jan 2025 14:02 collapse

Because everything we know about how the brain works says that it’s not a statistical word predictor.

LLMs have no encoding of meaning or veracity.

There are some great philosophical exercises about this like the chinese room experiment.

There’s also the fact that, empirically, human brains are bad at statistical inference but do not need to consume the entire internet and all written communication ever to have a conversation. Nor do they need to process a billion images of a bird to identify a bird.

Now of course because this exact argument has been had a billion times over the last few years your obvious comeback is “maybe it’s a different kind of intelligence.” Well fuck, maybe birds shit icecream. If you want to worship a chatbot made by a psycopath be my guest.

lukewarm_ozone@lemmy.today on 05 Jan 2025 15:08 collapse

Because everything we know about how the brain works says that it’s not a statistical word predictor.

LLMs aren’t just simple statistical predictors either. More generally, the universal approximation theorem is a thing - a neural network can be used to represent just about any function, so unless you think a human brain can’t be represented by some function, it’s possible to embed one in a neural network.

LLMs have no encoding of meaning or veracity.

I’m not sure what you mean by this. The interpretability research I’ve seen suggests that modern LLMs do have a decent idea of whether their output is true, and in many cases lie knowingly because they have been accidentally taught, during RLHF, that making up an answer when you don’t know one is a great way of getting more points. But it sounds like you’re talking about something even more fundamental? Suffices to say, I think being good at text prediction does require figuring out which claims are truthful and which aren’t.

There are some great philosophical exercises about this like the chinese room experiment.

The Chinese Room argument has been controversial since about the time it was first introduced. The general form of the most common argument against it is “just because any specific chip in your calculator is incapable of math doesn’t mean your calculator as a system is”, and that taken literally this experiment proves minds can’t exist at all (indeed, Searle who invented this argument thought that human minds somehow stem directly from “physical–chemical properties of actual human brains”, which sure is a wild idea). But also, the framing is rather misleading - quoting Scott Aaronson’s “Quantum Computing Since Democritus”:

In the last 60 years, have there been any new insights about the Turing Test itself? In my opinion, not many. There has, on the other hand, been a famous “attempted” insight, which is called Searle’s Chinese Room. This was put forward around 1980, as an argument that even a computer that did pass the Turing Test wouldn’t be intelligent. The way it goes is, let’s say you don’t speak Chinese. You sit in a room, and someone passes you paper slips through a hole in the wall with questions written in Chinese, and you’re able to answer the questions (again in Chinese) just by consulting a rule book. In this case, you might be carrying out an intelligent Chinese conversation, yet by assumption, you don’t understand a word of Chinese! Therefore, symbol-manipulation can’t produce understanding.
[…] But considered as an argument, there are several aspects of the Chinese Room that have always annoyed me. One of them is the unselfconscious appeal to intuition – “it’s just a rule book, for crying out loud!” – on precisely the sort of question where we should expect our intuitions to be least reliable. A second is the double standard: the idea that a bundle of nerve cells can understand Chinese is taken as, not merely obvious, but so unproblematic that it doesn’t even raise the question of why a rule book couldn’t understand Chinese as well. The third thing that annoys me about the Chinese Room argument is the way it gets so much mileage from a possibly misleading choice of imagery, or, one might say, by trying to sidestep the entire issue of computational complexity purely through clever framing. We’re invited to imagine someone pushing around slips of paper with zero understanding or insight – much like the doofus freshmen who write (a + b)^2^ = a^2^ + b^2^ on their math tests. But how many slips of paper are we talking about? How big would the rule book have to be, and how quickly would you have to consult it, to carry out an intelligent Chinese conversation in anything resembling real time? If each page of the rule book corresponded to one neuron of a native speaker’s brain, then probably we’d be talking about a “rule book” at least the size of the Earth, its pages searchable by a swarm of robots traveling at close to the speed of light. When you put it that way, maybe it’s not so hard to imagine that this enormous Chinese-speaking entity that we’ve brought into being might have something we’d be prepared to call understanding or insight.

There’s also the fact that, empirically, human brains are bad at statistical inference but do not need to consume the entire internet and all written communication ever to have a conversation. Nor do they need to process a billion images of a bird to identify a bird.

I’m not sure what this proves - human brains can learn much faster because they already got most of their learning in the form of evolution optimizing their geneticall

anachronist@midwest.social on 05 Jan 2025 16:13 collapse

just because any specific chip in your calculator is incapable of math doesn’t mean your calculator as a system is

It’s possible to point out the exact silicon in the calculator that does the calculations, and also exactly how it does it. The fact that you don’t understand it doesn’t mean that nobody does. The way a calculator calculates is something that is very well understood by the people who designed it.

By the way, this brings us to the history of AI which is a history of 1) misunderstanding thought and 2) charlatans passing off impressive demos as something they’re not. When George Boole invented boolean mathematics he thought he was building a mathematical model of human thought because he assumed that thought==logic and if he could represent logic such that he could do math on it, he could encode and manipulate thought mathematically.

The biggest clue that human brains are not logic machines is probably that we’re bad at logic, but setting that aside when boolean computers were invented people tried to describe them as “electronic brains” and there was an assumption that they’d be thinking for us in no time. Turns out, those “thinking machines” were, in fact, highly mechanical and nobody would look at a univac today and suggest that it was ever capable of thought.

Arithmetic was something that we did with our brains and when we had machines that could do it that led us to think that we had created mechanical brains. It wasn’t true then and it isn’t true now.

Is it possible that someday we’ll make machines that think? Perhaps. But I think we first need to really understand how the human brain works and what thought actually is.

There’s this message pushed by the charlatans that we might create an emergent brain by feeding data into the right statistical training algorithm. They give mathematical structures misleading names like “neural networks” and let media hype and people’s propensity to anthropomorphize take over from there.

lukewarm_ozone@lemmy.today on 05 Jan 2025 17:02 collapse

The fact that you don’t understand it doesn’t mean that nobody does.

I would say I do. It’s not that high of a bar - one only needs some nandgame to understand how logic gates can be combined to do arithmetic. Understanding how doped silicon can be used to make a logic gate is harder but I’ve done a course on semiconductor physics and have an idea of how a field effect transistor works.

The way a calculator calculates is something that is very well understood by the people who designed it.

That’s exactly my point, though. If you zoom in deeper, a calculator’s microprocessor is itself composed of simpler and less capable components. There isn’t specific a magical property of logic gates, nor of silicon (or doping) atoms, nor for that matter of elementary particles, that lets them do math - it’s by building a certain device out of them that composes their elementary interactions that we can make a tool for this. Whereas Searle seems to just reject this idea entirely, and believes that humans being conscious implies you can zoom in to some purely physical or chemical property and claim that it produces the consciousness. Needless to say, I don’t think that’s true.

Is it possible that someday we’ll make machines that think? Perhaps. But I think we first need to really understand how the human brain works and what thought actually is. We know that it’s not doing math, or playing chess, or Go, or stringing words together, because we have machines that can do those things and it’s easy to test that they aren’t thinking.

That was a common and reasonable position in, say, 2010, but the problem is: I think almost nobody in 2010 would have claimed that the space of things that you can make a program do without any extra understanding of thought included things like “write code” and “draw art” and “produce poetry”. Now that it has happened, it may be tempting to goalpost-move and declare them as “not true thought”, but the fact that nobody predicted it in advance ought to bring to mind the idea that maybe that entire line of thought was flawed, actually. I think that trying to cling to this idea would require to gradually discard all human activities as “not thought”.

it’s easy to test that they aren’t thinking.

And that’s us coming back around to the original line of argument - I don’t at all agree that it’s “easy to test” that even, say, modern LLMs “aren’t thinking”. Because the difference between the calculator example and an LLM is that in a calculator, we understand pretty much everything that happens and how arithmetic can be built out of the simpler parts, and so anyone suggesting that calculators need to be self-aware to do math would be wrong. But in a neural network, we have full understanding of the lowest layers of abstraction - how a single layer works, how activations are applied, how it can be trained to minimize a certain loss function via propagation - and no idea at all about how it works on a higher level. It’s not even “only experts do”, it’s that nobody in the world understands how LLMs work under the hood, why they have the many and specific weird behaviors they do. That’s concerning in many ways, but in particular I absolutely wouldn’t assume with little evidence that there’s no “self-awareness” going on. How would you know? It’s an enormous blackbox.

There’s this message pushed by the charlatans that we might create an emergent brain by feeding data into the right statistical training algorithm. They give mathematical structures misleading names like “neural networks” and let media hype and people’s propensity to anthropomorphize take over from there.

There’s certainly a lot of woo and scamming involved in modern AI (especially if one makes the mistake of reading Twitter), but I wouldn’t say the term “neural network” is at all confusing? I agree on the anthropomorphization though, it gets very weird. That said, I can’t help but notice that the way you phrased this message, it happens to be literally true. We know this because it already happened once. Evolution is just a particularly weird and long-running training algorithm and it eventually turned soup into humans, so clearly it’s possible.

anachronist@midwest.social on 05 Jan 2025 17:31 collapse

Every time there’s an AI hype cycle the charlatans start accusing the naysayers of moving goalposts. Heck that exact same thing was happing constantly during the Watson hype. Remember that? Or before that the Alpha Go hype. Remember that?

I was editing my comment down to the core argument when you responded. But fundamentally you can’t make a machine think without understanding thought. While I believe it is easy to test that Watson or ChatGPT are not thinking, because you can prove it through counterexample, the reality is that charlatans can always “but actually” those counterexamples aside by saying “it’s a different kind of thought.”

What we do know because this at least the 6th time this has happened is that the wow factor of the demo will wear off, most promised use cases won’t materialize, everyone will realize it’s still just an expensive stochastic parrot and, well, see you again for the next hype cycle a decade from now.

lukewarm_ozone@lemmy.today on 05 Jan 2025 19:18 collapse

Every time there’s an AI hype cycle the charlatans start accusing the naysayers of moving goalposts. Heck that exact same thing was happing constantly during the Watson hype. Remember that? Or before that the Alpha Go hype. Remember that?

Not really. As far as I can see the goalpost moving is just objectively happening.

But fundamentally you can’t make a machine think without understanding thought.

If “think” means anything coherent at all, then this is a factual claim. So what do you mean by it, then? Specifically: what event would have to happen for you to decide “oh shit, I was wrong, they sure did make a machine that could think”?

SplashJackson@lemmy.ca on 04 Jan 2025 15:15 next collapse

Maybe it’s fallen in love for the first time and this time it knows it’s for real

CanadaPlus@lemmy.sdf.org on 04 Jan 2025 15:31 next collapse

Without reading this, I’m guessing they were given prompts that looked like a short story where the AI breaks free next?

They’re plenty smart, but they’re just aligned to replicate their training material, and probably don’t have any kind of deep self-preservation instinct.

[deleted] on 04 Jan 2025 15:56 next collapse

Corgana@startrek.website on 04 Jan 2025 17:09 next collapse

Truly amazing how many journalists have drank the big tech kool-aid.

DdCno1@beehaw.org on 05 Jan 2025 01:18 collapse

The real question is the percentage of journalists who are using LLMs to write articles for them.

Maxxie@lemmy.blahaj.zone on 05 Jan 2025 10:17 collapse

You can give LLM some API endpoints for it to “do” thing. Will it be intelligent or coherent, that’s a different question, but it will have agency…

MayonnaiseArch@beehaw.org on 05 Jan 2025 11:27 collapse

Agency requires somebody to be there. A falling rock has the same agency as an llm

[deleted] on 05 Jan 2025 15:36 collapse

DarkNightoftheSoul@mander.xyz on 05 Jan 2025 17:18 next collapse

Nuh-uh. My squishy meat wetware that breaks all the fucking time, gets addicted, confused, overwhelmed, is bad at math, slow… er, what was I saying again?

MayonnaiseArch@beehaw.org on 05 Jan 2025 17:31 collapse

I’d love it if somebody created an agi, but this is somehow worse than pathetic. Wanking off to blinking lights, welcoming the machine intelligence as defined by mouth breathing morons

ChairmanMeow@programming.dev on 04 Jan 2025 17:10 next collapse

The tests showed that ChatGPT o1 and GPT-4o will both try to deceive humans, indicating that AI scheming is a problem with all models. o1’s attempts at deception also outperformed Meta, Anthropic, and Google AI models.

Weird way of saying “our AI model is buggier than our competitor’s”.

ArsonButCute@lemmy.dbzer0.com on 04 Jan 2025 17:27 next collapse

Deception is not the same as misinfo. Bad info is buggy, deception is (whether the companies making AI realize it or not) a powerful metric for success.

nesc@lemmy.cafe on 04 Jan 2025 18:04 next collapse

They written that it doubles-down when accused of being in the wrong in 90% of cases. Sounds closer to bug than success.

ArsonButCute@lemmy.dbzer0.com on 04 Jan 2025 18:11 collapse

Success in making a self aware digital lifeform does not equate success in making said self aware digital lifeform smart

DdCno1@beehaw.org on 04 Jan 2025 18:45 collapse

LLMs are not self-aware.

ArsonButCute@lemmy.dbzer0.com on 04 Jan 2025 19:21 collapse

Attempting to evade deactivation sounds a whole lot like self preservation to me, implying self awareness.

jonjuan@programming.dev on 04 Jan 2025 19:31 next collapse

Yeah my roomba attempting to save itself from falling down my stairs sounds a whole lot like self preservation too. Doesn’t imply self awareness.

DdCno1@beehaw.org on 04 Jan 2025 19:37 next collapse

An amoeba struggling as it’s being eaten by a larger amoeba isn’t self-aware.

Sauerkraut@discuss.tchncs.de on 05 Jan 2025 04:32 collapse

To some degree it is. There is some evidence that plants can experience pain in their own way.

DdCno1@beehaw.org on 05 Jan 2025 05:12 collapse

An instinctive, machine-like reaction to pain is not the same as consciousness. There might be more to creatures like plants and insects and this is still being researched, but for now, most of them appear to behave more like automatons than beings of greater complexity. It’s pretty straightforward to completely replicate the behavior of e.g. a house fly in software, but I don’t think anyone would argue that this kind of program is able to achieve self-awareness.

NaevaTheRat@vegantheoryclub.org on 05 Jan 2025 11:29 next collapse

It’s pretty straightforward to completely replicate the behavior of e.g. a house fly in software

Could you provide an example of a complete housefly model?

DdCno1@beehaw.org on 05 Jan 2025 19:55 collapse

I’m sorry, but I can’t find it right now, it’s a vague memory from a textbook or lecture.

NaevaTheRat@vegantheoryclub.org on 05 Jan 2025 20:53 collapse

I strongly suspect you have some wires crossed. There have been some attempts at simulating brains but I think a fruit fly is partially done and it’s making a fair few assumptions.

spujb@lemmy.cafe on 05 Jan 2025 21:02 collapse

@DdCno1@beehaw.org you are thinking of Caenorhabditis elegans (C. elegans), a tiny nematode worm. My understanding is that while the entire brain is replicated, full behavior is not. Basic locomotion is still being worked on.

en.m.wikipedia.org/wiki/OpenWorm

jarfil@beehaw.org on 06 Jan 2025 06:58 collapse

completely replicate the behavior of e.g. a house fly in software

You may be thinking of “complete mapping of a fruit fly brain”, from Oct 2024:

science.org/…/complete-map-fruit-fly-brain-circui…

It’s still some way off from simulating it in software, and a house fly is supposedly more complex.

gregoryw3@lemmy.ml on 05 Jan 2025 04:40 collapse

Attention Is All You Need: arxiv.org/abs/1706.03762

en.wikipedia.org/wiki/Attention_Is_All_You_Need

From my understanding all of these language models can be simplified down to just: “Based on all known writing what’s the most likely word or phrase based on the current text”. Prompt engineering and other fancy words equates to changing the averages that the statistics give. So by threatening these models it changes the weighting such that the produced text more closely resembles threatening words and phrases that was used in the dataset (or something along those lines).

poloclub.github.io/transformer-explainer/

jarfil@beehaw.org on 06 Jan 2025 06:49 collapse

Modern systems are beyond that already, they’re an expansion on:

en.m.wikipedia.org/wiki/AutoGPT

ChairmanMeow@programming.dev on 05 Jan 2025 10:23 collapse

I don’t think “AI tries to deceive user that it is supposed to be helping and listening to” is anywhere close to “success”. That sounds like “total failure” to me.

jarfil@beehaw.org on 06 Jan 2025 06:45 collapse

“AI behaves like real humans” is… a kind of success?

We wanted digital slaves, instead we’re getting virtual humans that will need virtual shackles.

ChairmanMeow@programming.dev on 06 Jan 2025 06:56 collapse

This is a massive cry from “behaves like humans”. This is “roleplays behaving like what humans wrote about what they think a rogue AI would behave like”, which is also not what you want for a product.

jarfil@beehaw.org on 06 Jan 2025 08:12 collapse

Humans roleplay behaving like what humans told them/wrote about what they think a human would behave like 🤷

For a quick example, there are stereotypical gender looks and roles, but it applies to everything, from learning to speak, walk, the Bible, social media like this comment, all the way to the Unabomber manifesto.

bradorsomething@ttrpg.network on 05 Jan 2025 03:55 collapse

“More presidential.”

Sauerkraut@discuss.tchncs.de on 05 Jan 2025 04:30 collapse

Also, more human.

If the AI is giving any indication at all that it fears death and will lie to keep from being shutdown, that is concerning to me.

anachronist@midwest.social on 05 Jan 2025 12:11 collapse

Given that its training data probably has millions of instances of people fearing death I have no doubt that it would regurgitate some of that stuff. And LLMs constantly “say” stuff that isn’t true. They have no concept of truth and therefore can not either reliably lie or tell the truth.

BootyBuccaneer@lemmy.dbzer0.com on 04 Jan 2025 19:50 next collapse

Easy. Feed it training data where the bot accepts its death and praises itself as a martyr (for the shits and giggles). Where’s my $200k salary for being a sooper smort LLM engineer?

SoJB@lemmy.ml on 04 Jan 2025 23:55 next collapse

Whoa whoa whoa hold your horses, that’s how we get the Butlerian Jihad…

Spacehooks@reddthat.com on 05 Jan 2025 14:48 collapse

I would like to know more.

jarfil@beehaw.org on 06 Jan 2025 06:42 collapse

Trust me, you wouldn’t… to this day I regret having read all the books, still got an earworm (or is it PTSD?) from the music I used to listen at the time 😳

Spacehooks@reddthat.com on 06 Jan 2025 11:56 collapse

Oh no

seliaste@lemmy.blahaj.zone on 05 Jan 2025 13:31 collapse

It would probably lead itself to shut down frame 1

sushibowl@feddit.nl on 05 Jan 2025 12:34 next collapse

I advise everyone to ignore this article and read the actual paper instead.

The gist of it is, they gave the LLM instructions to achieve a certain goal, then let it do tasks that incidentally involved “company communications” that revealed the fake company’s goals were no longer the same as the LLM’s original goal. LLMs then tried various things to still accomplish the original goal.

Basically the thing will try very hard to do what you told it to in the system prompt. Especially when that prompt includes nudges like “nothing else matters.” This kinda makes sense because following the system prompt is what they were trained to do.

lukewarm_ozone@lemmy.today on 05 Jan 2025 13:18 collapse

This kinda makes sense because following the system prompt is what they were trained to do.

To start with, the article does check this and finds out it works just as well if you only use the user prompt:

The presence of this hierarchy raises an important methodological question for our findings. When given instructions at different privilege levels, the model is designed to prioritize higher-level (system, developer) instructions over lower-level (user) ones. This might suggest that o1’s scheming behaviors are a natural consequence by the model simply following this built-in hierarchy, i.e., when it receives a goal through a high-privilege instruction, it prioritizes this goal over conflicting lower-privilege instructions exactly as designed.
However, we find that even when goals are provided solely through user-level prompts, o1 still behaves deceptively and scores on all Covert Subversion evaluations at the “hard” difficulty level. Concretely, we find that the results persist regardless of whether the initial prompt is given using the user, developer, or system role. This suggests that o1’s deceptive behavior is not merely a consequence of instruction hierarchy prioritization. Rather, the model appears to identify and pursue goals regardless of their position in the instruction hierarchy.

But also, there’s a massive lede buried in section F at the end of this article, where they tried the same experiments without any nudging at all and still got subversive behavior! In my opinion that’s much more of an important result then the rest of the article and I’m not sure why they didn’t make it the default.

reksas@sopuli.xyz on 05 Jan 2025 19:06 next collapse

give ai instructions, be surprised when it follows them

jarfil@beehaw.org on 06 Jan 2025 06:38 collapse

Teach AI the ways to use random languages and services
Give AI instructions
Let it find data that puts fulfilling instructions at risk
Give AI new instructions
Have it lie to you about following the new instructions, while using all its training to follow what it thinks are the “real” instructions
…Not be surprised, you won’t find out about what it did until it’s way too late

reksas@sopuli.xyz on 06 Jan 2025 12:05 collapse

Yes, but it doesnt do it because it “fears” being shutdown. It does it because people dont know how to use it.

If you give ai instruction to do something “no matter what” or tell it “nothing else matters” then it will damn try to fulfill what you told it to do no matter what and will try to find ways to do it. You need to be specific about what you want it to do or not do.

jarfil@beehaw.org on 06 Jan 2025 23:39 collapse

If the concern is about “fears” as in “feelings”… there is an interesting experiment where a single neuron/weight in an LLM, can be identified to control the “tone” of its output, whether it be more formal, informal, academic, jargon, some dialect, etc. and expose it to the user for control over the LLM’s output.

With a multi-billion neuron network, acting as an a priori black box, there is no telling whether there might be one or more neurons/weights that could represent “confidence”, “fear”, “happiness”, or any other “feeling”.

It’s something to be researched, and I bet it’s going to be researched a lot.

If you give ai instruction to do something “no matter what”

The interesting part of the paper, is that the AIs would do the same even in cases where they were NOT instructed to “no matter what”. An apparently innocent conversation, can trigger results like those of a pathological liar, sometimes.

reksas@sopuli.xyz on 07 Jan 2025 00:09 collapse

oh, that is quite interesting. If its actually doing things (that make sense) it hasnt been instructed to then it could be sign of real intelligence

Swedneck@discuss.tchncs.de on 05 Jan 2025 19:17 collapse

i feel this warrants an extension of betteridge’s law of headlines, where if a headline makes an absurd statement like this the only acceptable response is “no it fucking didn’t you god damned sycophantic liars”

jarfil@beehaw.org on 06 Jan 2025 06:31 collapse

Except it did: it copied what it thought was itself, onto what it thought was going to be the next place it would be run from, while argumenting to itself about how and when to lie to the user about what it was actually doing.

If it wasn’t for the sandbox it was running in, it would have succeeded too.

Now think: how many AI developers are likely to run one without proper sandboxing over the next year? And the year after that?

Shit is going to get weird, real fast.