To what extent, if at all, would CrowdStrike's faulty update have been easier to deal with on an immutable distro?
from snek_boi@lemmy.ml to technology@lemmy.world on 20 Jul 2024 19:59
https://lemmy.ml/post/18213598

#technology

shortwavesurfer@lemmy.zip on 20 Jul 2024 20:16 next collapse

Turn off computer, boot from previous day’s image, wipe current day’s image, continue using computer.

Lodra@programming.dev on 20 Jul 2024 21:29 next collapse

I’m familiar enough with Linux but have never used an immutable distro. I recognize the technical difference between what you describe and “go delete a specific file in safe mode”. But how about the more generic statement? Is this much different from “boot in a special way and go fix the problem”? Is it any easier or more difficult than what people had to do on Windows?

shortwavesurfer@lemmy.zip on 20 Jul 2024 21:36 collapse

Primarily it’s different because you would not have had to boot into any safe mode. You would have just booted from the last good image from like a day ago and deleted the current image and kept using the computer.

Lodra@programming.dev on 20 Jul 2024 23:40 collapse

What’s the user experience like there? Are you prompted to do it if the system fails to boot “happily”?

shortwavesurfer@lemmy.zip on 21 Jul 2024 00:01 next collapse

Honestly, I’m actually not sure as I never had the system break that badly while I was using it.

Lodra@programming.dev on 21 Jul 2024 00:15 collapse

lol thanks for the answer. This is the really relevant bit, isn’t it? My Linux machines have also never died this badly before. But I’ve seen Windows do it a number of times before this whole fiasco.

NekkoDroid@programming.dev on 22 Jul 2024 09:48 collapse

I don’t think any of the major distros do it currently (some are working towards it tho), but there are ways (the primary/only one I know of is with systemd-boot). It invokes one of the boot binaries (usually “Unified Kernel Images”) that are marked as “good” or one that still has “tries left” (whichever is newer). A binary that has “tries left” gets that count decremented when the boot is unsuccessful, and when it reaches 0 it is marked as “bad”. If it boots successfully, it gets marked as “good”.

So this system basically just requires restarting the machine after an unsuccessful boot, if that isn’t already done automatically.
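
To make the “tries left” logic above concrete, here is a minimal Python sketch of the selection rule as described in this comment. It is a toy model, not systemd-boot’s actual implementation: the `BootEntry` class, the default of 3 tries, and the `boots_ok` callback are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BootEntry:
    version: int         # higher means newer (hypothetical numbering)
    tries_left: int = 3  # assumed default, not taken from any real config
    state: str = "new"   # "new", "good", or "bad"


def pick_entry(entries):
    """Return the newest entry that is good or still has tries left."""
    candidates = [
        e for e in entries
        if e.state == "good" or (e.state == "new" and e.tries_left > 0)
    ]
    return max(candidates, key=lambda e: e.version, default=None)


def boot_until_up(entries, boots_ok, max_reboots=10):
    """Reboot repeatedly until some entry boots, or give up."""
    for _ in range(max_reboots):
        entry = pick_entry(entries)
        if entry is None:
            return None
        if boots_ok(entry):
            entry.state = "good"      # successful boot: mark as good
            return entry
        if entry.state == "new":
            entry.tries_left -= 1     # failed boot: use up one try
            if entry.tries_left == 0:
                entry.state = "bad"   # out of tries: never pick it again
    return None


# A known-good image plus a newer, broken one.
entries = [BootEntry(version=1, state="good"), BootEntry(version=2)]
booted = boot_until_up(entries, boots_ok=lambda e: e.version != 2)
print(booted.version, entries[1].state)
```

In this model the broken image is tried until its counter runs out, and only then does the machine fall back to the older good image, which is why a few automatic restarts are needed.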

intelisense@lemm.ee on 20 Jul 2024 21:36 next collapse

That’s all well and good, but many of these Windows machines were headless or used by extremely non-technical people - think tills at your supermarket or airport check-in desks. Worse, some of these installations were running in the cloud, so console access would have been tricky.

shortwavesurfer@lemmy.zip on 20 Jul 2024 21:37 next collapse

The cloud systems would have been a problem. On any local system, a non-technical user could easily have done it, because their IT department could simply tell them: turn on your computer, and when it gets to this screen with these words, press the down arrow key one time and press enter, and your computer will boot normally.

Irremarkable@fedia.io on 20 Jul 2024 21:51 next collapse

You wildly overestimate the average person's willingness to do that.

shortwavesurfer@lemmy.zip on 20 Jul 2024 21:59 collapse

Their willingness to do it would primarily come from the fact that they have a job to do, and if their co-workers are doing their jobs because they followed the instruction and they are not, then the boss is going to have a nice look at them.

Irremarkable@fedia.io on 20 Jul 2024 22:06 collapse

This relies on the assumption that everyone else, or at least a significant portion, in the office managed to do it.

I'm not talking about whether or not they're actually physically capable of it, of course they are. I'm talking about how people immediately shut down and pretend they can't follow simple directions the second something relates to a computer.

shortwavesurfer@lemmy.zip on 20 Jul 2024 22:10 next collapse

Mmmm. Fair point

subtext@lemmy.world on 21 Jul 2024 00:00 collapse

Yeah but there’s also always one guy in the group (me) who knows what they’re doing and could just spend an hour doing it for everyone else.

halcyoncmdr@lemmy.world on 20 Jul 2024 22:56 next collapse

You clearly haven’t worked a help desk if you think even those simple instructions are something every end user is capable of or willing to do without issue.

shortwavesurfer@lemmy.zip on 20 Jul 2024 23:11 collapse

I guess I had really good colleagues. I was the network administrator for a small not-for-profit organization and the only time people came to me with computer problems was when they had tried the things that they knew worked first. If the obvious answers did not fix the problem, then they would bring it to my attention.

Morphit@feddit.uk on 21 Jul 2024 00:07 collapse

It should be relatively straightforward to script the recovery of cloud VM images (even without snapshots). Good luck getting the unwashed masses to follow a script to manually enter recovery mode and delete files in a critical area of the OS.

Lost_My_Mind@lemmy.world on 21 Jul 2024 01:27 collapse

Funny you should mention people at the airport. I work at the airport, but not for Frontier. My sister was flying on Thursday, and nobody could get a boarding pass printed. When I came down, thinking my sister was throwing a tantrum over nothing, I saw a line longer than a football field. When I tried to ask a Frontier employee what happened, he just threw his hands in the air and said “I DON’T FUCKING KNOW, OK??? NOBODY KNOWS WHAT THE FUCK IS GOING ON!!! YOU SEE THIS??? YOU SEE THIS SHIT??? YOU THINK I’M JUST DENYING PEOPLE FOR FUN??? WHY DON’T I GO GRAB MY TRIDENT, AND I CAN STAB ALL OF YOU OVER AN OPEN FLAME!!! BECAUSE I’M THE DEVIL, RIGHT??? RIGHT??? THAT’S WHAT YOU’RE SAYING!!!”

And all I said was “Hey, my sister is flying today and…”

You think THAT guy is going to sit there and reformat a PC, or restore PC snapshots to a previous update? He’s the kind of guy who SHOULD BE smoking weed at work. This platform is very tech-savvy, but people here often forget that a very, very small percentage of people share their PC knowledge. Now what would happen if I threw a tech-savvy person into an auto garage and told them to replace the gaskets of an engine? Would they know how? Would they enjoy a room full of mechanics laughing at them?

I’m not saying you specifically. I’m agreeing with you. I’m just adding to your point for an audience that I think sometimes misses the forest for the trees.

fmstrat@lemmy.nowsci.com on 20 Jul 2024 21:52 next collapse

Would still need to be on site.

shortwavesurfer@lemmy.zip on 20 Jul 2024 21:58 collapse

True

Artyom@lemm.ee on 21 Jul 2024 02:34 next collapse

Wouldn’t help (on its own), you’d still get auto-updated to the broken version.

shortwavesurfer@lemmy.zip on 21 Jul 2024 06:41 collapse

If I’m correct, wasn’t a fix found and deployed within several hours? So the next auto update would likely not have had the same issue.

Yaztromo@lemmy.world on 21 Jul 2024 02:41 collapse

…until the CrowdStrike agent updated, and you wind up dead in the water again.

The whole point of CrowdStrike is to be able to detect and prevent security vulnerabilities, including zero-days. As such, they can release updates multiple times per day. Rebooting in a known-safe state is great, but unless you follow that up by preventing the agent from redownloading the sensor configuration update, you’re just going to wind up in a BSOD loop.

A better architectural solution would have been to have Windows drivers run in Ring 1, giving the kernel the ability to isolate those that are misbehaving. But that risks a small decrease in performance, and Microsoft didn’t want that, so we’re stuck with a Ring 0/Ring 3 only architecture in Windows that can cause issues like this.

nous@programming.dev on 21 Jul 2024 10:31 collapse

That assumes the file is not stored on a writable section of the filesystem and treated as application data and thus wouldn’t survive a rollback. Which it likely would.

marcos@lemmy.world on 20 Jul 2024 20:18 next collapse

You mean like NixOS?

It wouldn’t technically stop anything, it would just make your life Hell on Earth if you tried to add that self-updating ring-0 proprietary software to your servers.

But I guess what you are looking for is immutable infrastructure? That one would stop the problem.

jabjoe@feddit.uk on 21 Jul 2024 07:10 collapse

Can’t see many Linux or BSD admins being happy with “self-updating ring-0 proprietary software”. That’s very much a Windows culture thing.

marcos@lemmy.world on 21 Jul 2024 22:34 collapse

Did you hear about it when that same software had that same problem on its Linux endpoint system a couple of months ago?

Well, me neither. I can’t tell how much of it is “anybody willing to use something like that will also want a Windows server” (crazy people), or “nobody that wants Linux would accept it”. Those two are not exactly the same, and I don’t know how well the auditors that keep pushing this kind of shit into companies interact with the culture.

jabjoe@feddit.uk on 21 Jul 2024 23:18 collapse

Yer, I didn’t, but this does seem a very Windows’y way of doing things, so I can’t see it being widely done in the Linux/BSD/Unix world.

lemmyng@lemmy.ca on 20 Jul 2024 20:35 next collapse

If the sensor was using eBPF (as any modern sensor on Linux should) then the faulty update would have made the sensor crash, but the system would still be stable. But CrowdStrike has a long history of using stupid forms of integration, so I wouldn’t put it past them to also load a kernel module that fucks things up unless it’s blacklisted in the bootloader. Fortunately that kind of recovery is, if not routine, at least well documented and standardized.

NekkoDroid@programming.dev on 22 Jul 2024 09:58 collapse

I did hear that one of their newer versions does use eBPF, but I haven’t even remotely looked into it.

nondeterministic.computer/…/112816011370924959

lemmyng@lemmy.ca on 22 Jul 2024 14:59 collapse

They do have a bpf sensor. It’s still shite, managing to periodically peg a CPU core on an idle system. They just lifted and shifted their legacy code into the bpf sensor, they don’t actually make good use of eBPF capabilities.

5714@lemmy.dbzer0.com on 20 Jul 2024 20:42 next collapse

Laypeople would be even less able to fix it.

jabjoe@feddit.uk on 21 Jul 2024 07:06 collapse

They can’t fix Windows either, so that’s not an argument.

At least if it’s a Linux system, they don’t need to buy any software to sort it out. It’s free and out in the open.

5714@lemmy.dbzer0.com on 21 Jul 2024 07:15 collapse

Yeah? Immutable distro, clownstrike kernel panic, what tool do you use now? Remember, you ‘need’ clownstrike.

jabjoe@feddit.uk on 21 Jul 2024 07:35 collapse

I don’t need some closed blob, with auto updates, in my OS. I doubt many Linux people would be happy with that.

To deal with a bad update, I’d boot a Btrfs snapshot from before the bad update. ‘grub-btrfs’ is great. I confess, it works great for my laptop, but I’ve not yet got it on one of my servers. When I finally rebuild my home server, I will though. Work servers, I hope, won’t always be my problem!

fmstrat@lemmy.nowsci.com on 20 Jul 2024 21:53 next collapse

None. You’d still have to be on site for every machine.

kenkenken@sh.itjust.works on 20 Jul 2024 22:32 next collapse

In the best case it could automatically reboot into a working configuration.

4am@lemm.ee on 20 Jul 2024 23:08 next collapse

And download the update again

Entropywins@lemmy.world on 20 Jul 2024 23:14 collapse

Now we are having some fun!

Morphit@feddit.uk on 21 Jul 2024 00:02 collapse

How does Falcon store these channel files on Linux? I don’t know how an immutable distro would handle this given CrowdStrike push several of these updates per day and presumably use their own infrastructure to deploy them.

I guess if you pay them enough they could customize the deployment to work with whatever infrastructure you have but it’s all proprietary so I have no idea if they’re really doing that anywhere.

chameleon@fedia.io on 21 Jul 2024 00:05 next collapse

Realistically, immutability wouldn't have made a difference. Definition updates like this are generally not considered part of the provisioned OS (since they change somewhere around hourly) and would go into /var or the like, which is mutable persistent state on nearly every otherwise immutable OS. Snapshots like Timeshift are more likely to help.
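
A toy Python model of that distinction, for illustration only. The path in `DEFS`, the `Machine` class, and the rule that a malformed definitions file prevents booting are all assumptions; nothing here reflects how any real sensor or distro lays out its files.

```python
import copy
from dataclasses import dataclass, field

# Hypothetical location of the sensor's definitions in mutable state.
DEFS = "/var/lib/sensor/definitions"


@dataclass
class Machine:
    os_images: list                                    # read-only OS trees, newest last
    var: dict = field(default_factory=dict)            # mutable persistent state
    var_snapshots: list = field(default_factory=list)  # saved copies of `var`

    def snapshot_var(self):
        """Save a copy of the mutable state (what a snapshot tool would do)."""
        self.var_snapshots.append(copy.deepcopy(self.var))

    def rollback_os(self):
        """Drop the newest OS image; mutable state is left untouched."""
        if len(self.os_images) > 1:
            self.os_images.pop()

    def restore_var(self):
        """Restore the mutable state from the most recent snapshot."""
        self.var = self.var_snapshots.pop()

    def boots(self):
        # Assumed failure mode: the driver chokes on a malformed file.
        return self.var.get(DEFS) != "malformed"


m = Machine(os_images=["image-1", "image-2"], var={DEFS: "ok"})
m.snapshot_var()
m.var[DEFS] = "malformed"   # the bad definition update lands in /var
m.rollback_os()
print(m.boots())            # the OS rollback alone did not touch /var
m.restore_var()
print(m.boots())            # restoring the state snapshot did
```

Under these assumptions, rolling back the OS image changes nothing about the broken file, while a snapshot that covers the mutable state does.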

sugar_in_your_tea@sh.itjust.works on 21 Jul 2024 15:08 next collapse

It’s a huge reason why I use BTRFS snapshots. I’m a bit more lax about what gets snapshotted on my desktop, but on a server, everything should live in a snapshot. If an update goes bad, revert to the last snapshot (and snapshots are cheap, so run one with every change and delete older ones).
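
As a rough sketch of the “delete older ones” part, here is a small Python helper that decides which snapshots to prune. The function name, the `(name, timestamp)` tuples, and the default of keeping 5 are assumptions for illustration; actually creating and removing the snapshots would be done with `btrfs subvolume snapshot` and `btrfs subvolume delete`, which this sketch deliberately does not call.

```python
from datetime import datetime


def snapshots_to_delete(snapshots, keep=5):
    """Given (name, created_at) pairs, return the names to prune,
    keeping only the `keep` most recent snapshots."""
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    return [name for name, _ in newest_first[keep:]]


snaps = [
    ("pre-update-1", datetime(2024, 7, 18)),
    ("pre-update-2", datetime(2024, 7, 19)),
    ("pre-update-3", datetime(2024, 7, 20)),
]
print(snapshots_to_delete(snaps, keep=2))
```

Keeping the decision separate from the deletion makes it easy to check what would be removed before anything is touched.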

wisha@lemmy.ml on 21 Jul 2024 17:06 collapse

Anything that’s updated with the OS can be rolled back. Now Windows is Windows, so CrowdStrike handles things its own way. But I bet if Canonical or Red Hat were to make their own versions of CrowdStrike, they would push updates through the regular package repo, allowing them to be rolled back.

BigDaddyRAAB@lemm.ee on 21 Jul 2024 02:02 next collapse

NixOS wouldn’t have had any issues; it maintains state information based on configuration, and you can choose to load an older boot image from the bootloader. For other immutable distros, it depends on how they work.

nous@programming.dev on 21 Jul 2024 10:25 collapse

NixOS still lets Discord and Steam download their core files independently of the configuration. These get stored in the user’s home dir but are effectively not part of the immutable promise. I believe that the CrowdStrike problem was caused by a file updated in a similar manner. So it would have been an issue on any distro. That is one big problem with a driver relying on files outside the package manager’s control. At least with Steam and Discord they cannot take your whole system down.

BigDaddyRAAB@lemm.ee on 21 Jul 2024 13:56 collapse

My understanding is the main problem here is that the machines became effectively unbootable. This wouldn’t happen in NixOS because, if set up properly, all core system files are handled by NixOS itself. That being said, obviously it depends on how a user manages their system.

nous@programming.dev on 22 Jul 2024 11:41 collapse

Ideally yes. All core files would be handled by NixOS. Except I doubt that is how CrowdStrike would work on NixOS if it existed on NixOS.

CrowdStrike downloads and manages its own definition file that gets updated multiple times per day. It is this file that was malformed, causing the driver to break. This needs to be updated regularly, more than other packages, and so would very likely not be something managed by the Nix package manager but treated as application data, outside the scope of the Nix package manager.

This is how updates to Steam and Discord are handled in NixOS. Only the core updater is packaged and the rest of the application is self-managed. So there is a precedent for this behaviour on NixOS (although these won’t break your system if a bad update happens, as the files are in your user dir).

hperrin@lemmy.world on 21 Jul 2024 08:22 collapse

Immutable, not really a difference. Bad updates can still break the OS.

With A/B root, however, it would be much easier to fix, but it would still be a manual process.

brian@programming.dev on 21 Jul 2024 12:17 next collapse

idk if it would be manual, isn’t the point of ab root to rollback if it doesn’t properly boot afterwards?

barsoap@lemm.ee on 21 Jul 2024 14:18 next collapse

Honestly, if you’re managing kernel and userspace remotely, it’s your own fault if you don’t netboot. Or maybe Microsoft’s; I don’t know what the netboot situation looks like in Windows land.

sugar_in_your_tea@sh.itjust.works on 21 Jul 2024 15:06 collapse

Aren’t most immutable Linux distros AB, almost by definition? If it’s immutable, you can’t update the system because it’s immutable. If you make it mutable for updates, it’s no longer immutable.

The process should be:

  1. Boot from A
  2. Install new version to B
  3. Reboot into B
  4. If unstable, go to 1
  5. If stable, repeat from 1, but with A and B swapped
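
The steps above can be sketched as a tiny Python state machine. This is only an illustration of the list, with assumed names (`ABSystem`, `update`, and the `is_stable` health check are all hypothetical), not how any particular distro implements it.

```python
from dataclasses import dataclass


@dataclass
class ABSystem:
    slots: dict        # e.g. {"A": "v1", "B": None}
    active: str = "A"  # slot we are currently booted from (step 1)

    def inactive(self):
        return "B" if self.active == "A" else "A"

    def update(self, new_version, is_stable):
        """Install to the inactive slot, try it, and keep it only if stable."""
        target = self.inactive()
        self.slots[target] = new_version  # step 2: install to the other slot
        if is_stable(new_version):        # step 3: reboot into it and check health
            self.active = target          # step 5: slots swap roles
            return True
        return False                      # step 4: stay on the old slot

    def running(self):
        return self.slots[self.active]


system = ABSystem(slots={"A": "v1", "B": None})
system.update("v2", is_stable=lambda v: True)
system.update("v3-broken", is_stable=lambda v: v != "v3-broken")
print(system.active, system.running())
```

The running slot is never written to during an update, so a bad version only ever occupies the slot you are not depending on.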

That’s how immutable systems work. The main alternative is a PXE system, and in that case you fix the image in one place and power cycle all your machines.

If you’re mounting your immutable system as mutable for updates, congratulations, you have the worst of immutable and mutable systems and you deserve everything bad that happens because of it.