Using Mac M2 Ultra 192GB to Self-Host LLMs?
from shaserlark@sh.itjust.works to selfhosted@lemmy.world on 14 Jan 13:43
https://sh.itjust.works/post/31082266

I’m doing a lot of coding and what I would ideally like to have is a long context model (128k tokens) that I can use to throw in my whole codebase.

I’ve been experimenting e.g. with Claude and what usually works well is to attach e.g. the whole architecture of a CRUD app along with the most recent docs of the framework I’m using and it’s okay for menial tasks. But I am very uncomfortable sending any kind of data to these providers.

Unfortunately I don’t have a lot of space so I can’t build a proper desktop. My options are either renting a VPS or going for something small like a Mac Studio. I know speeds aren’t great, but I was wondering if using e.g. RAG for documentation could help me get decent speeds.

I’ve read that especially on larger contexts Macs become very slow. I’m not very convinced but I could get a new one probably at 50% off as a business expense, so the Apple tax isn’t as much an issue as the concern about speed.

Any ideas? Are there other mini PCs available that could have a better architecture? I tried researching but couldn’t find a lot.

Edit: I found some stats on GitHub on different models: github.com/ggerganov/llama.cpp/issues/10444

Based on that I also conclude that you’re gonna wait forever if you work with a large codebase.

#selfhosted

just_another_person@lemmy.world on 14 Jan 14:01

I’ve not run such things on Apple hardware, so can’t speak to the functionality, but you’d definitely be able to do it cheaper with PC hardware.

The problem with this kind of setup is going to be heat. There are definitely cheaper mini PCs, but I wouldn’t think they have the space for this much memory AND a GPU, so you’d be looking for an AMD APU/NPU combo maybe. You could easily build something about the size of a game console that does this for maybe $1.5k.

awesomesauce309@midwest.social on 14 Jan 14:11

For context length, VRAM is important. You can’t split a context across memory pools, so it would be limited to maybe 16GB. With the M series you can have a lot more space since RAM and VRAM are the same, but it’s RAM at Apple prices. You can get a 24GB+ setup way cheaper than some Nvidia server card though.

shaserlark@sh.itjust.works on 14 Jan 14:17

Yeah the VRAM of Mac M series is very attractive for running models at full context length and the memory bandwidth is quite good for token generation compared to the price, power consumption and heat generation of NVidia GPUs.

Since I’ll have to put this in my kitchen/living room that’d be a big plus but idk how well prompt processing would work if I send over like 80k tokens.

shaserlark@sh.itjust.works on 14 Jan 14:23

I’d honestly be open for that but would an AMD setup not take up a lot of space and consume lots of power / be loud?

It seems like in terms of price & speed, the Macs suck compared to other options, but if you don’t have a lot of space and don’t want to hear an airplane engine constantly I’m wondering if there are options.

just_another_person@lemmy.world on 14 Jan 14:46

I just looked, and the Mac Mini maxes out at 24GB anyway. Not sure where you got the idea of 192GB from. NVM, you said M2 Ultra.

Look, you have two choices. Just pick one. Whichever is more cost effective and works for you is the winner. Talking it down to the Nth degree here isn’t going to help you with the actual barriers to entry you’ve put in place.

shaserlark@sh.itjust.works on 14 Jan 15:04

I understand what you’re saying, but I’m coming to this community because I like having more input, hearing about the experience of others and potentially learning about things I didn’t know about. I wouldn’t ask specifically in this community if I didn’t want to optimize my setup as much as I can.

just_another_person@lemmy.world on 14 Jan 15:14

You can have a slightly bigger package in PC form that does 4x the work for half the price. That’s the gist.

just_another_person@lemmy.world on 14 Jan 15:23

Here’s a quick idea of what you’d want in a PC build newegg.io/2d410e4

shaserlark@sh.itjust.works on 14 Jan 15:59

Thanks, that’s very helpful! Will look into that type of build

windowsphoneguy@feddit.org on 14 Jan 15:11

Mac Mini M4 Pro can be ordered with up to 64GB shared memory

BorgDrone@lemmy.one on 14 Jan 21:48

you’d definitely be able to do it cheaper with PC hardware.

You can get a GPU with 192GB VRAM for less than a Mac? Sign me up please.

just_another_person@lemmy.world on 14 Jan 23:24

AMD APU uses whatever system RAM is as VRAM, so…yeah. NPU as well.

OhVenus_Baby@lemmy.ml on 15 Jan 00:10

Up to half of system RAM*

BorgDrone@lemmy.one on 15 Jan 00:55

And what is the memory bandwidth on these APUs?

just_another_person@lemmy.world on 15 Jan 03:19

As fast as it gets to the CPU. That should be pretty obvious.

BorgDrone@lemmy.one on 15 Jan 08:01

Which is how fast?

0x01@lemmy.ml on 14 Jan 14:23

I do this on my Ultra. Token speed is not great, depending on the model of course. A lot of source code is optimized for Nvidia and doesn’t even use the native Mac GPU without modifying the code, defaulting to CPU instead. I’ve had to modify about half of what I run.

Ymmv but I find it’s actually cheaper to just use a hosted service

If you want some specific numbers lmk

shaserlark@sh.itjust.works on 14 Jan 14:29

Interesting, is there any kind of model you could run at reasonable speed?

I guess over time it could amortize but if the usability sucks that may make it not worth it. OTOH really don’t want to send my data to any company.

Boomkop3@reddthat.com on 14 Jan 15:38

If you enjoy waiting around, sure

shaserlark@sh.itjust.works on 14 Jan 16:00

Meh, ofc I don’t.

Boomkop3@reddthat.com on 14 Jan 16:52

Then don’t go with an Apple chip. They’re impressive for how little power they consume. But any 50 watt chip will get absolutely destroyed by a 500 watt gpu, even one from almost a decade ago will beat it.

And you’ll save money to boot, if you don’t count your power bill

jacksilver@lemmy.world on 14 Jan 18:12

The power bill side is also not even clear cut. The longer processing time for slower chips sometimes ends up resulting in higher costs. It’s surprisingly not as simple as lower wattage chip is cheaper to operate.
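
To put rough numbers on that, here is a minimal sketch. The wattages and run times are made up for illustration and are not measurements of any real chip. It only shows that energy is power multiplied by time, so a slow low-wattage run can cost more than a fast high-wattage one.

```python
def energy_wh(power_watts: float, seconds: float) -> float:
    """Energy in watt-hours for a device drawing power_watts for the given time."""
    return power_watts * seconds / 3600.0


# Hypothetical figures: the same job on a slow 50 W chip and a fast 500 W GPU.
slow_low_power = energy_wh(power_watts=50, seconds=3600)   # one hour at 50 W
fast_high_power = energy_wh(power_watts=500, seconds=120)  # two minutes at 500 W

print(f"50 W chip, 60 min: {slow_low_power:.1f} Wh")
print(f"500 W GPU, 2 min:  {fast_high_power:.1f} Wh")
```

Idle power draw is ignored here, and it matters a lot for a box that stays on all day.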

Boomkop3@reddthat.com on 15 Jan 04:19

Good point!

GenderNeutralBro@lemmy.sdf.org on 14 Jan 18:51

But any 50 watt chip will get absolutely destroyed by a 500 watt gpu

If you are memory-bound (and since OP’s talking about 192GB, it’s pretty safe to assume they are), then it’s hard to make a direct comparison here.

You’d need 8 high-end consumer GPUs to get 192GB. Not only is that insanely expensive to buy and run, but you won’t even be able to support it on a standard residential electrical circuit, or any consumer-level motherboard. Even 4 GPUs (which would be great for 70B models) would cost more than a Mac.

The speed advantage you get from discrete GPUs rapidly disappears as your memory requirements exceed VRAM capacity. Partial offloading to GPU is better than nothing, but if we’re talking about standard PC hardware, it’s not going to be as fast as Apple Silicon for anything that requires a lot of memory.
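
A back-of-the-envelope way to see the memory-bound argument: when generating tokens, every new token has to read roughly all of the model weights once, so memory bandwidth divided by model size gives a ceiling on tokens per second. This is a simplified sketch, not a benchmark. The 800 GB/s figure is Apple's advertised M2 Ultra bandwidth; the 80 GB/s figure and the 40 GB model size are assumptions standing in for ordinary dual-channel system RAM and a quantized 70B-class model.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on generation speed when memory bandwidth is the limit.

    Assumes each generated token streams all weights from memory exactly once.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Assumed inputs, see the note above.
print(max_tokens_per_second(bandwidth_gb_s=800, weights_gb=40))  # unified memory
print(max_tokens_per_second(bandwidth_gb_s=80, weights_gb=40))   # offloaded to system RAM
```

Real throughput lands below this ceiling, and the estimate says nothing about prompt processing, which is limited by compute rather than bandwidth.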

This might change in the near future as AMD and Intel catch up to Apple Silicon in terms of memory bandwidth and integrated NPU performance. Then you can sidestep the Apple tax, and perhaps you will be able to pair a discrete GPU and get a meaningful performance boost even with larger models.

Boomkop3@reddthat.com on 15 Jan 04:22

Again, you’d be waiting around all day

shaserlark@sh.itjust.works on 15 Jan 11:46

Yeah I found some stats now and indeed you’re gonna wait like an hour to process if you throw like 80-100k tokens into a powerful model. With APIs that kinda works instantly, not surprising but just to give a comparison. Bummer.
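
The wait is just prompt length divided by prompt-processing speed. A quick sketch of that arithmetic follows; the throughput values are placeholders I picked to show the scale, not numbers taken from the linked benchmarks.

```python
def prefill_minutes(prompt_tokens: int, tokens_per_second: float) -> float:
    """Minutes spent processing the prompt before the first output token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second / 60.0


# Hypothetical prompt-processing speeds in tokens per second.
for rate in (30, 300, 3000):
    print(f"{rate:>5} tok/s -> {prefill_minutes(100_000, rate):.1f} min")
```

Plug in the prompt-processing rate reported for your model and hardware to get your own estimate.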

Boomkop3@reddthat.com on 15 Jan 13:56

Application Programming Interface, are you talking about something on the internet? On a gpu driver? On your phone?

Then also, what size model are you using? Defined with int32? fp4? Somewhere in between? That’s where RAM requirements come in.

I get that you’re trying to do a mic drop or something, but you’re not being very clear

shaserlark@sh.itjust.works on 16 Jan 20:00

Are you drunk?

Boomkop3@reddthat.com on 17 Jan 04:01

No, just calling your bluff. git gud m8

shaserlark@sh.itjust.works on 17 Jan 08:49

You’re aware that there’s the OpenAI API library right? github.com/openai/openai-python

It’s really nothing fancy especially on Lemmy where like 99% of people are software engineers…

Boomkop3@reddthat.com on 17 Jan 09:42

Eyy, a web api! You could’ve just said that right away. There’s more than just web api’s.

How is this web api relevant in your choice of hardware to locally run these models?

shaserlark@sh.itjust.works on 19 Jan 10:00

Congrats on being that guy

Boomkop3@reddthat.com on 19 Jan 12:03

Throwing money at a problem works, next time try to know what you’re doing

Boomkop3@reddthat.com on 15 Jan 14:03

Anyways, the important thing is the “TOPS” aka trillions of operations per second. Having enough RAM is important, but if you don’t have a fast processor then you’re wasting RAM while you can just stream it from a fast SSD.

One such case is when your system can’t handle more than 50 TOPS, like the Apple M systems. Try an old GPU, and enjoy 1000’s of TOPS.

KoalaUnknown@lemmy.world on 15 Jan 03:32

There are some videos on youtube of people running local LLMs on the newer M4 chips which have pretty good AI performance. Obviously, a 5090 is going to destroy it in raw compute power, but the large unified memory on Apple Silicon is nice.

That being said, there are plenty of small ITX cases at about 13-15L that can fit a large nvidia GPU.

shaserlark@sh.itjust.works on 15 Jan 07:17

Thanks! Hadn’t thought of YouTube at all but it’s super helpful. I guess that’ll help me decide if the extra Ram is worth it considering that inference will be much slower if I don’t go NVIDIA.

tehnomad@lemm.ee on 15 Jan 05:20

The context cache doesn’t take up too much memory compared to the model. The main benefit of having a lot of VRAM is that you can run larger models. I think you’re better off buying a 24 GB Nvidia card from a cost and performance standpoint.

shaserlark@sh.itjust.works on 15 Jan 06:54

Yeah I was thinking about running something like Code Qwen 72B which apparently requires 145GB of RAM to run the full model. But if it’s super slow, especially with large context, and I can only run small models at acceptable speed anyway, it may be worth going NVIDIA alone for CUDA.
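
For reference, that figure is roughly what you get from parameter count times bytes per weight. A minimal sketch, assuming a flat bit width per weight and ignoring the context cache and runtime overhead (real quantized files mix bit widths and come out somewhat larger):

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes) for a given precision."""
    return params_billion * bits_per_weight / 8.0


# 72B parameters at three common precisions.
for name, bits in (("fp16", 16), ("int8", 8), ("4-bit", 4)):
    print(f"72B @ {name}: ~{weights_gb(72, bits):.0f} GB")
```

So the full 16-bit model lands around 144 GB, while a 4-bit quant of the same model is closer to 36 GB before overhead.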

tehnomad@lemm.ee on 15 Jan 19:32

I found a VRAM calculator for LLMs here: huggingface.co/…/LLM-Model-VRAM-Calculator

Wow it seems like for 128K context size you do need a lot of VRAM (~55 GB). Qwen 72B will take up ~39 GB so you would either need 4x 24GB Nvidia cards or the Mac Pro 192 GB RAM. Probably the cheapest option would be to deploy GPU instances on a service like Runpod. I think you would have to do a lot of processing before you get to the breakeven point of your own machine.
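
The reason context gets expensive is that the key/value cache grows linearly with the number of tokens. Here is a rough sketch of the usual formula. The layer count, KV head count and head size below are illustrative values for a 70B-class model with grouped-query attention; they are my assumptions, not values read from the calculator or from any model card, so the result will not match the calculator exactly.

```python
def kv_cache_gib(
    context_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes assumes an fp16 cache
) -> float:
    """Approximate KV cache size in GiB for one sequence."""
    # Factor 2 covers keys and values, stored per layer and per KV head.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 2**30


# Assumed architecture: 80 layers, 8 KV heads, head dimension 128.
print(kv_cache_gib(128 * 1024, n_layers=80, n_kv_heads=8, head_dim=128))  # 40.0
print(kv_cache_gib(32 * 1024, n_layers=80, n_kv_heads=8, head_dim=128))   # 10.0
```

Quantizing the cache to 8 bits (`bytes_per_value=1`) halves these numbers, where the backend supports it.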

RandomlyRight@sh.itjust.works on 16 Jan 13:13

Take a look at NVIDIA Project Digits. It’s supposed to release in May for 3k usd and will be kind of the only sensible way to host LLMs then:

www.nvidia.com/en-us/project-digits/

wise_pancake@lemmy.ca on 18 Jan 17:29

That actually seems attractive to me, but I’m unsure where I stand yet. It’s pricey, I just want a box I can put in the basement and then connect everything to over wifi.

brucethemoose@lemmy.world on 17 Jan 15:29

Late to this post, but shoot for an AMD Strix Halo or Nvidia Digits mini PC.

Prompt processing is just too slow on Apple, and the Nvidia/AMD backends are so much faster with long context.

Otherwise, your only sane option for 128K context is a server with a bunch of big GPUs.

Also… what model are you trying to use? You can fit Qwen coder 32B with like 70K context on a single 3090, but honestly it’s not good above 32K tokens anyway.

shaserlark@sh.itjust.works on 19 Jan 10:04

Thanks for the reply, still reading here. Yeah thanks to the comments and reading some benchmarks I abandoned the idea of getting an Apple, it’s just too slow.

I was hoping to test Qwen 32B or llama 70b for running longer contexts, hence the apple seemed appealing.

brucethemoose@lemmy.world on 19 Jan 13:04

Honestly, most LLMs suck at the full 128K. Look up benchmarks like RULER.

In my personal tests over API, Llama 70B is bad out there. Qwen (and any fine tune based on Qwen Instruct, with maybe an exception or two) not only sucks, but is impractical past 32K once its internal RoPE scaling kicks in. Even GPT-4 is bad out there, with Gemini and some other very large models being the only usable ones I found.

So, ask yourself… Do you really need 128K? Because 32K-64K is a boatload of code with modern tokenizers, and that is perfectly doable on a single 24G GPU like a 3090 or 7900 XTX, and that’s where models actually perform well.
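
If you want a quick sanity check on how big your codebase actually is in tokens, here is a rough sketch. It uses the common rule of thumb of about 4 characters per token instead of a real tokenizer, so treat the result as a ballpark. The function names, the suffix list and the reply reserve are all made up for this example.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough rule of thumb, not a real tokenizer


def estimate_tokens(root: str, suffixes: tuple[str, ...] = (".py", ".md")) -> int:
    """Estimate the token count of all matching text files under root."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total_chars += len(path.read_text(encoding="utf-8", errors="ignore"))
    return total_chars // CHARS_PER_TOKEN


def fits_context(tokens: int, context_window: int, reserved_for_reply: int = 4000) -> bool:
    """True if the prompt fits while leaving room for the model's answer."""
    return tokens <= context_window - reserved_for_reply


# Example usage (uncomment to scan the current directory):
# tokens = estimate_tokens(".")
# print(tokens, fits_context(tokens, context_window=32_000))
```

For an exact count, run the files through the tokenizer of the model you plan to use.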