finding the right hardware is difficult

Running models locally is one of those ideas that sounds simple until you try to figure out what you actually need. You hear people on r/LocalLLaMA talking about how they’re running Llama 3.3 70B on their Mac Studio, or how their RTX 4090 handles 7B models at 140 tokens per second, and you think great, I’ll do the same. Then you open your wallet and realize you have no idea what you’re buying or whether it will work for your use case.

I’ve written before about open-source alternatives to paid models and why you should consider moving to privacy-first tools. The common thread is that owning your own infrastructure, whether for privacy, cost, or just avoiding another pricing rug, is increasingly attractive. The problem is figuring out what hardware to buy. The landscape is confusing, full of contradictory advice, and the benchmarks that exist are scattered across Reddit threads and GitHub repos with no easy way to compare.

So I did the research and had an AI build a tool. It’s called Best for AI, an open-source browser-based reference for choosing the right hardware to run local LLMs. This post covers the key findings and some things I learned while putting it together.

model size determines everything

The first thing to understand is that the model you want to run dictates everything else. Model size is measured in billions of parameters (7B, 13B, 70B, etc.), and each tier requires a different amount of memory to load. A 7B model quantized to Q4 needs about 4-5 GB of VRAM. A 70B model at the same quantization needs around 40 GB. If the model doesn’t fit in memory, it either won’t load or it will run at a fraction of the speed because parts of it are being swapped to system RAM or disk.
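The memory math above can be sketched as a back-of-the-envelope calculation: parameters times bits-per-weight, plus an allowance for the KV cache and runtime buffers. The ~4.5 effective bits for Q4 and the flat 1 GB allowance here are my rough assumptions, not exact numbers for any particular quantization format.

```python
# Rough VRAM estimate for a quantized model. ~4.5 effective bits for
# Q4 and a flat 1 GB cache/buffer allowance are assumptions.
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.5,
                     cache_overhead_gb: float = 1.0) -> float:
    # 1e9 params at 8 bits each is exactly 1 GB, so GB = B-params * bits / 8
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + cache_overhead_gb

for size in (7, 13, 70):
    print(f"{size}B at ~Q4 needs roughly {estimate_vram_gb(size):.1f} GB")
```

This reproduces the figures in the text: about 4.9 GB for a 7B model and about 40 GB for a 70B model.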

This is why people say “VRAM is king.” Memory bandwidth and capacity are what actually limit local inference, and compute speed is rarely the constraint. An RTX 4090 with 24 GB of VRAM runs 7B models at around 140 tokens per second, which is genuinely fast. Try to load a 70B model on that same card and you hit a wall because the model simply does not fit. At that point you’re either offloading layers to CPU RAM and watching your speed collapse, or you’re shopping for a multi-GPU setup that costs several times more.

Here’s a rough mapping of model size to memory:

Model Size   Quantization   Memory Required   Example Systems
7B           Q4             4-6 GB            RTX 4060, Mac mini M4, any 8GB+ GPU
13B          Q4             8-10 GB           RTX 4060 Ti, MacBook Air M4 32GB
30B          Q4             18-20 GB          RTX 4090, MacBook Pro M4 Pro 48GB
70B          Q4             38-42 GB          Mac Studio M3 Ultra, dual RTX A6000
120B+        Q4             70-80 GB          Mac Studio M3 Ultra 256GB, quad A100

The column you care about is “Memory Required.” If your GPU or unified memory meets that threshold, the model fits. If it doesn’t, performance degrades significantly or the model won’t load at all.

unified memory changes the game

Apple Silicon introduced something genuinely different: unified memory. On a Mac, the GPU, CPU, and Neural Engine share a single memory pool, so a MacBook Pro M4 Max with 128 GB of unified memory can hold a 70B model entirely, avoiding the CPU-GPU transfer bottleneck that kills performance on traditional systems. An RTX 4090 with 24 GB of GDDR6X at 1,008 GB/s of bandwidth will generate tokens faster than an M4 Max at 400 GB/s, assuming the model actually fits in 24 GB. The moment you need more memory than that, the calculus changes: the GPU starts waiting on system RAM, the speed advantage evaporates, and a 70B model that requires partial offloading on an NVIDIA card runs smoothly on a Mac because the whole model stays in the unified pool.

This is why Apple Silicon is popular with ML engineers who run inference on large models. You pay a premium for memory, but you get a machine that handles 70B models smoothly where a desktop with a single high-end GPU cannot. The Mac Studio M3 Ultra with 256 GB unified memory can run 120B+ quantized models, something that would require four enterprise GPUs on the NVIDIA side.

The catch is that Apple Silicon RAM is soldered. Whatever you buy is what you have for the life of the machine, which is a very different commitment than a desktop where you can swap in more RAM next year when prices drop. I have a MacBook Pro M4 Max with 36 GB, which handles 30B models well but can’t comfortably run 70B. Buy more memory than you think you need.

nvidia, amd, and apple

There are really three ecosystems for local AI:

NVIDIA has the most mature software stack. CUDA is the default for most AI workloads, and almost every framework (PyTorch, TensorFlow, llama.cpp) supports it out of the box. The RTX 4090 with 24 GB VRAM is the community favorite for prosumer work. The RTX 5090 just launched with 32 GB VRAM and ~1,792 GB/s bandwidth, which makes it a significant upgrade if you can deal with the heat and power draw. For serious 70B+ work, you’re looking at professional cards like the RTX A6000 (48 GB) or data center hardware like the A100 (40-80 GB). These cost $4,000-$15,000 per card.

AMD is cheaper but their software story is bleak. ROCm exists but support is inconsistent across cards and frameworks. The new Ryzen AI Max+ 395 APU with 128 GB unified memory is a wildcard because it’s the cheapest x86 path to running 70B models (~$1,600 for a full system), but you need to manually configure BIOS settings and use the Vulkan backend in llama.cpp rather than ROCm for best results. The Asus ROG Flow Z13 with this APU is one of the more interesting budget options for people who want large model inference without spending $8,000 on a Mac Studio.

Apple has MLX, a framework designed for Apple Silicon that can be 20-30% faster than Ollama on the same hardware. The trade-offs are locked-in memory, higher upfront cost, and no upgrade path. If your budget is $4,000-$8,000 and you want a portable machine that handles 70B models, the MacBook Pro M4 Max with 128 GB is hard to beat. If portability doesn’t matter and you want maximum capacity, the Mac Studio M3 Ultra with 256 GB or 512 GB unified memory is the macOS ceiling.

diy versus pre-built

Building your own system is usually cheaper for NVIDIA-based workstations. A DIY RTX 4090 build with 64 GB of system RAM runs around $2,500-$3,000, while pre-built systems with the same GPU start at $4,000. The savings cover better components, more storage, and a case that actually cools the card properly.

That math changes once you move to enterprise cards. Dual RTX A6000 setups require expensive motherboards with multiple PCIe x16 slots, high-wattage PSUs, and careful checking that every component physically fits and clears. The labor and risk involved start to justify the premium on pre-builts like the Lenovo ThinkStation P620, especially if you value your time. Apple Silicon has no DIY path at all, so that decision is made for you.

I’ve built enough systems to know when it’s worth the effort and when it’s not. Single-GPU NVIDIA builds under $4,000 are straightforward enough that most people with a screwdriver and patience can handle them. Beyond that, or if you’d rather not troubleshoot driver issues at midnight, buy something pre-assembled.

tokens per second

You’ll see benchmark numbers expressed as tokens per second (t/s), which is how fast the model generates output. For reference, most people read at 200-250 words per minute, roughly 4-5 tokens per second. Anything above 20 t/s feels instantaneous for conversational use, and coding workflows benefit from 50+ t/s when you’re generating larger blocks of code and iterating quickly.
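To make those thresholds concrete, here is the arithmetic for how long you'd wait for a response at different generation speeds. The ~0.75 words-per-token ratio for English is a common rule of thumb, not an exact figure.

```python
# Rough wait time for a response at different generation speeds,
# assuming ~0.75 English words per token (rule of thumb).
def response_seconds(words: int, tokens_per_sec: float,
                     words_per_token: float = 0.75) -> float:
    return (words / words_per_token) / tokens_per_sec

for tps in (5, 20, 75):
    print(f"{tps:>2} t/s: a 300-word answer takes ~{response_seconds(300, tps):.0f}s")
```

At 5 t/s a 300-word answer takes over a minute; at 20 t/s it arrives in about 20 seconds, which is why that threshold feels like the floor for comfortable conversational use.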

Here are some rough benchmarks for 7B models at Q4 quantization:

System                   GPU                7B t/s
HP Z8 G4 (Quad A100)     4x A100 40GB       ~540
Dell Precision 7960      4x RTX 5000 Ada    ~450
DIY RTX 5090             RTX 5090 32GB      ~170
DIY RTX 4090             RTX 4090 24GB      ~140
Mac Studio M3 Ultra      80-core GPU        ~130
Minisforum AI Max+ 395   AMD iGPU           ~90
MacBook Pro M4 Max       40-core GPU        ~75

These numbers come from community benchmarks on r/LocalLLaMA and llama.cpp’s llama-bench tool. Your mileage will vary based on quantization format, model architecture, batch size, and dozens of other variables. The point is relative comparison.

see for yourself

Best for AI puts all of this into a single interface for free. You’re welcome (>ᴗ•)

Some of the features include:

  • Filter pre-built systems by OS, form factor, and maximum model size
  • Compare DIY vs pre-built costs across 11 build tiers from $850 to $30,000
  • Rank GPUs by tokens per second, memory bandwidth, and VRAM
  • Find your best match based on budget, model size, and priorities (speed, value, capacity, portability)
  • Compare up to 3 systems side by side with spec breakdowns
  • Read optimization tips for quantization, GPU offloading, and framework selection
  • Benchmark your current system with Mac-aware upgrade paths

The benchmark feature is particularly useful if you already own hardware and want to know what you can actually run. It detects your CPU, GPU, RAM, and VRAM via browser APIs, shows model compatibility, estimates tokens per second, and suggests upgrade paths. For Mac users, it understands that you can’t swap a GPU and recommends the next tier of Apple Silicon instead, which is the kind of platform-aware detail I couldn’t find anywhere else. The source is on GitHub if you want to contribute.

recommendations by budget

If you’re reading this and just want a straight answer, here’s what I’d recommend based on budget and target model size:

Under $1,500 (7B-13B models)

  • Mac mini M4 Pro 24GB ($1,399) if you’re on macOS
  • DIY RTX 4060 Ti 16GB build if you’re on Windows/Linux

$1,500-$3,000 (13B-30B models)

  • Minisforum AI Max+ 395 128GB ($1,600) if you want the cheapest path to 70B (slow but works)
  • MacBook Pro M4 Pro 48GB ($2,599) if you need portability
  • DIY RTX 4090 build ($2,800) for maximum speed on 30B

$3,000-$5,000 (30B-70B models)

  • MacBook Pro M4 Max 128GB ($4,200) for portable 70B
  • DIY RTX 5090 build (~$3,500) for fast 30B on Windows

$5,000-$10,000 (70B+ models)

  • Mac Studio M3 Ultra 256GB (~$8,500) for reliable 120B
  • DIY dual RTX A6000 build (~$8,000) for maximum NVIDIA performance

$10,000+ (120B-405B models)

  • Mac Studio M3 Ultra 512GB (~$11,000)
  • HP Z8 G4 Quad A100 (~$30,000) if enterprise infrastructure is an option

Pick the tier that matches your budget and target model, then use the best match finder tool to narrow down specifics. Hint: once a best match is generated, click it to see details such as which models and frameworks work best on that system.

why this matters

I started caring about local inference after Anthropic’s pricing changes last year made me rethink how dependent I had become on a single provider. The appeal of running models locally is that you control the entire stack. Your prompts stay on your machine. Your usage is not metered by someone else’s business model. The terms of service are whatever you decide they are.

I wrote before that open-source AI will eventually win, and I still believe that. The bottleneck right now is hardware. Inference chips are getting cheaper and more efficient every quarter, and consumer-grade hardware capable of running frontier-quality models locally is probably two to three years away. We’re in the awkward middle period where running 70B+ requires either a significant investment in Apple Silicon or something that looks like a small data center in your office.

I built this tool because I needed it for myself. The information exists but it’s scattered across Reddit threads, GitHub repos, and spec sheets that require too much context to interpret. Hopefully it is helpful to you, dear reader, as well!

Check it out: https://97115104.github.io/bestforai/

GitHub: https://github.com/97115104/bestforai