System freezes at boot and I'm not sure if it's a software or hardware problem
from HarvesterOfEyes@piefed.social to linux@lemmy.ml on 02 May 23:04
https://piefed.social/post/716107

Hi!

I’ve already posted in the Arch Linux community on lemmy.ml but I’m also posting it here for additional visibility. I’d cross-post it but I don’t think PieFed has that option yet. Hopefully it’s okay.

Anyway, a few hours ago, when I turned on my computer and chose “Arch Linux” from the list of boot entries in the systemd-boot boot loader, the system got stuck at boot, as seen in the image I uploaded.

So far, I’ve tried disabling Overdrive by editing the kernel parameters at boot and booting an Arch Linux live ISO, both to no avail. I get stuck at the same stage of the boot process even with the live ISO, which means I can’t really boot into the system.

This happened before, like, a few months ago. I either booted with a live ISO and executed mkinitcpio -P, or just did a hard reset, as I waited for a kernel, GPU drivers or mesa update. About a month ago, it stopped happening and the system booted fine. I don’t really know what fixed it, sorry. Until today, that is.

I’m at a loss as to what to do aside from either reinstalling Arch Linux or installing a different distro. I really don’t want to do that, though, as I haven’t really done any backups of my config files, and I’m generally happy with how I’ve set up my system. The fact that the live ISO didn’t work also made me think of a hardware problem, namely the GPU, which complicates things even more, as I don’t have a spare one.

Some information about my hardware:

I ran # pacman -Syu last night so everything is up to date. Not sure how relevant this is but I’m using the radeon open-source drivers.

Hopefully all of this was somewhat clear and if there’s something I missed, please let me know.

Thanks in advance!

Photo taken of a monitor, with boot messages from Arch Linux, and graphical artifacts

#linux


redxef@feddit.org on 02 May 23:16

Do you still have the live iso you used to install arch? Does it work? Do other distros work (just the live systems are enough)?

Edit:

Some more things: Did you try disconnecting the PC from mains, pressing the power button (to discharge all capacitors) and reconnecting? Reseat the button cell for the bios?

HarvesterOfEyes@piefed.social on 02 May 23:20

No, I had to use the latest one. Nope, tried the Ubuntu live ISO but it also didn't work.

ReversalHatchery@beehaw.org on 03 May 01:57

perhaps systemrescue? It’s an arch based distro, but maybe built differently for better stability. it also does not attempt to start real graphics until you type startx

HarvesterOfEyes@piefed.social on 02 May 23:27

Regarding your edit: no, I haven't tried that, but I will keep those suggestions in mind, thanks!

BombOmOm@lemmy.world on 03 May 00:02

Reseat the button cell for the bios?

This is a good one too! And if you have a volt-meter, see if it’s low (or just replace it if you have a spare).

ReversalHatchery@beehaw.org on 03 May 01:55

pressing the power button (to discharge all capacitors)

I think that does not happen anymore in modern PCs. I still always do it, but then I also wait a minute or more after pulling the plug

giacomo@lemm.ee on 02 May 23:21

if you can boot a live iso, it's probably not hardware. if you can’t boot a live iso, it might be hardware.

HarvesterOfEyes@piefed.social on 02 May 23:25

Yeah, it might be the dreaded hardware problem, then.

BombOmOm@lemmy.world on 02 May 23:31

Since it is something with the computer itself and not the OS, some things to try:

  • Check for any motherboard status lights.
  • Reseat your RAM.
  • Run a memtest. Let it do a full pass, takes ~3 hours. If you see anything more than a single error, it’s the RAM.
  • Reset your BIOS to factory settings.
  • Update your BIOS.
  • Reset your CMOS.
  • As redxef said, unplug from the wall, hit the power button a few times to fully drain the system, then plug back in.
  • Unplug everything you possibly can. Leave just a single monitor, a single stick of RAM, the cpu, and the power cable plugged in. Literally nothing else, not even a keyboard. (You will need to keep your graphics card plugged in as the 2700x doesn’t have onboard graphics)
  • Swap to a different single stick of RAM and put it in a different slot.
  • Visually inspect for any exploded or bulging capacitors.
  • If you have gotten to here, swap in any spare parts you have from the prior list. Different graphics card, different ram stick, different monitor, different cpu or mobo if you have one.
  • Unplug/replug your internal power cables, and unplug any unnecessary internal cables (fans, rgb, etc) (Is it this? Probably not, but we are getting to the desperate part of the list.)
  • Reseat your CPU (don’t forget to clean off and re-apply thermal paste)
  • Cry a little

The goal is to narrow down which piece of hardware is failing.
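The "strip it down to the bare minimum, then add things back" approach from the list can be sketched in Python. This is only an illustration of the reasoning: the part names and the `boots` callback are made up, and in real life "calling `boots`" means powering the machine on and watching what happens.

```python
from typing import Callable, FrozenSet, Iterable, Optional


def find_failing_part(
    minimal: Iterable[str],
    optional_parts: Iterable[str],
    boots: Callable[[FrozenSet[str]], bool],
) -> Optional[str]:
    """Add parts back one at a time and report the first one that breaks boot.

    `boots` is a hypothetical stand-in for "power on and see if it boots".
    Returns None if the machine still boots with every part installed.
    """
    installed = frozenset(minimal)
    if not boots(installed):
        # The fault is in the core parts; those can only be tested by
        # swapping in spares (other GPU, other RAM stick, and so on).
        raise ValueError("minimal configuration does not boot")
    for part in optional_parts:
        candidate = installed | {part}
        if not boots(candidate):
            return part
        installed = candidate
    return None


if __name__ == "__main__":
    # Made-up example: pretend the second RAM stick is the culprit.
    core = ["cpu", "gpu", "ram_stick_1", "monitor"]
    extras = ["keyboard", "ram_stick_2", "case_fans"]
    print(find_failing_part(core, extras, lambda parts: "ram_stick_2" not in parts))
```

If the minimal configuration itself fails, the sketch gives up on purpose: at that point only swapping in spare parts can tell the core components apart.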

HarvesterOfEyes@piefed.social on 02 May 23:37

Will do it tomorrow, thanks!

redxef@feddit.org on 02 May 23:25

I think I remember some weird power bugs in the 2700x, though I never encountered them myself. The best thing I could find was this reddit thread reddit.com/…/ryzen_freezes_in_linux_even_if_linux…

HarvesterOfEyes@piefed.social on 02 May 23:33

I tried adding the kernel parameter mentioned in that thread but it didn't work. But thank you anyway!
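For anyone following along: with systemd-boot, a kernel parameter can be made persistent by appending it to the `options` line of the boot entry file. Below is a minimal Python sketch of that text edit. The entry contents and the parameter `example.param=1` are placeholders (the actual parameter from the linked thread isn't shown here), and the function works on a string rather than touching any file under /boot.

```python
def add_kernel_param(entry_text: str, param: str) -> str:
    """Append `param` to the first `options` line of a boot entry, once.

    If the entry has no `options` line, one is added at the end.
    """
    lines = entry_text.splitlines()
    for i, line in enumerate(lines):
        parts = line.split(None, 1)  # key and value are whitespace-separated
        if parts and parts[0] == "options":
            value = parts[1] if len(parts) > 1 else ""
            if param not in value.split():
                # Keep the existing line untouched and just append.
                lines[i] = line.rstrip() + " " + param
            break
    else:
        lines.append("options " + param)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical entry; real entries carry your own root device and images.
    example_entry = (
        "title   Arch Linux\n"
        "linux   /vmlinuz-linux\n"
        "initrd  /initramfs-linux.img\n"
        "options root=UUID=0000-example rw\n"
    )
    print(add_kernel_param(example_entry, "example.param=1"), end="")
```

Running it twice with the same parameter leaves the entry unchanged, so it is safe to re-apply. Keep a copy of the original entry before editing the real file.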

Xanza@lemm.ee on 02 May 23:26

Run a live disk. If everything runs fine, it’s not hardware. If not, then it’s very likely hardware.

Lemmchen@feddit.org on 02 May 23:55

Ideally, run a different live system than what’s installed right now. Otherwise it’s easy to mistake software issues for hardware ones.

LandedGentry@lemmy.zip on 03 May 02:01

imgflip.com/memegenerator/…/Quit-Having-Fun

mvirts@lemmy.world on 03 May 04:55

Pretty sus :P

I would start by removing the graphics card if you have integrated graphics available (or disable the PCI port in your bios)

This reminds me of the kinds of issues I would get when setting up overclocking and getting just past the limit of stable operation. If you have overclocking set up, definitely try disabling it.

If removing the GPU does nothing, don’t forget to try removing each RAM stick separately, or make sure your BIOS runs a full memory check.

LeLachs@lemmy.ml on 03 May 08:56

Looks like your PC tells you to disable overdrive. Have you tried that?

floquant@lemmy.dbzer0.com on 03 May 09:21

Have you tried reading the post? They already did

HarvesterOfEyes@piefed.social on 03 May 14:07

PieFed isn't letting me edit the OP due to an unexpected error. The errors keep piling up, haha!

Just wanted to thank all of you wonderful people for all the help you've given me. I love each and every one of you (even the ones who skimmed through my post :p). A user on the other thread I created in the Arch Linux community suggested I add the nomodeset parameter, with which I managed to boot into the system. I updated it and installed linux-lts along with linux-lts-headers. I adjusted /boot/loader/entries/arch_linux.conf to switch to the LTS kernel by default and rebooted the PC. Unfortunately, that didn't work, but I got logs! Here's the relevant part, I think:

mai 03 11:04:23 arch kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff  
mai 03 11:04:23 arch kernel: amdgpu: [powerplay] Failed message: 0x4, input parameter: 0x2000000, error code: 0xffffffff  
mai 03 11:04:23 arch kernel: [drm:resource_construct [amdgpu]] *ERROR* DC: unexpected audio fuse!  
mai 03 11:04:23 arch kernel: [drm] Display Core v3.2.316 initialized on DCE 12.0  
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: [drm] *ERROR* No EDID read.  
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: [drm] *ERROR* No EDID read.  
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: [drm] *ERROR* No EDID read.  
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: [drm] *ERROR* No EDID read.  
mai 03 11:04:23 arch kernel: [drm] Timeout wait for RLC serdes 0,0  
mai 03 11:04:23 arch kernel: [drm] kiq ring mec 2 pipe 1 q 0  
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)  
mai 03 11:04:23 arch kernel: [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed  
mai 03 11:04:23 arch kernel: [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <gfx_v9_0> failed -110  
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed  
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init  
mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: amdgpu: amdgpu: finishing device.  

I did a search and it seems like it's the GPU's fault, judging by the ring errors. I think. I remembered I have an old nvidia GPU lying around, so I'm going to try to reseat the current GPU and, if that doesn't work, try the old one. Not sure if I have to uninstall the amd drivers or if it's ok to have both the amd and nvidia drivers installed. If that doesn't work, I'm going to go through all the other suggestions y'all gave me to try and pinpoint the problem.
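For anyone reading a log like the one above, here is a minimal Python sketch that pulls out the `*ERROR*` lines and names the trailing error code where there is one. The function name and the regexes are my own, not part of any tool. The only hard fact baked in is that errno 110 is ETIMEDOUT on Linux, so the -110 in the ring test and hw_init lines means the driver timed out waiting for the GPU.

```python
import re
from typing import List, Optional, Tuple

# Linux errno value; kernel code reports it negated (-110 is -ETIMEDOUT).
LINUX_ERRNO = {110: "ETIMEDOUT"}

ERROR_LINE = re.compile(r"\*ERROR\*\s*(?P<msg>.*)$")
TRAILING_CODE = re.compile(r"\(?-(?P<code>\d+)\)?\s*$")


def summarize_errors(log_text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (message, errno name or None) for each *ERROR* line."""
    findings = []
    for line in log_text.splitlines():
        match = ERROR_LINE.search(line)
        if not match:
            continue
        msg = match.group("msg").strip()
        code = TRAILING_CODE.search(msg)
        name = LINUX_ERRNO.get(int(code.group("code"))) if code else None
        findings.append((msg, name))
    return findings


if __name__ == "__main__":
    sample = (
        "mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: "
        "[drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* "
        "ring kiq_0.2.1.0 test failed (-110)\n"
        "mai 03 11:04:23 arch kernel: amdgpu 0000:0a:00.0: "
        "[drm] *ERROR* No EDID read.\n"
    )
    for msg, name in summarize_errors(sample):
        print(msg, "->", name)
```

A timeout on its own doesn't prove the card is dead, but it does say the driver asked the GPU to do something and never got an answer, which fits the plan of reseating it and trying another card.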

Again, thank you so much!