TL;DR
Thorsten Meyer AI published a 2026 local-inference cost breakdown arguing that buyers should size hardware around VRAM capacity, not around the newest GPU. The report says a model that fits in GPU memory can run at tens of tokens per second, while spilling into system RAM can make the same setup too slow for regular work.
Thorsten Meyer AI has published a 2026 local-inference rig cost analysis that says the main buying question is not which GPU is newest, but whether the target model fits inside VRAM. The report matters for developers, creators and small teams weighing cloud AI costs against owning hardware for steady private workloads.
The report’s central finding is the VRAM cliff: if model weights fit in GPU memory, inference can be fast; if they spill into system RAM, performance can collapse. Thorsten Meyer AI cites community benchmarks showing an RTX 5090 running a 70B model fully in VRAM at about 40 to 50 tokens per second, while the same card and model can fall to roughly 1 to 2 tokens per second when it spills into system memory.
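The size of that drop is roughly what a back-of-the-envelope bandwidth estimate predicts. The sketch below assumes that generating each token requires reading all of the model's weights once, so the speed ceiling is memory bandwidth divided by model size. The function name and both bandwidth values are illustrative assumptions, not figures from the report, and the calculation treats the weights as sitting entirely in one memory pool, which is a simplification.

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# ASSUMED, illustrative bandwidths (not taken from the report).
GPU_VRAM_BANDWIDTH_GB_S = 1800.0   # order of magnitude for a high-end GPU
SYSTEM_RAM_BANDWIDTH_GB_S = 80.0   # order of magnitude for dual-channel DDR5

MODEL_SIZE_GB = 43.0  # the report's Q4 figure for a 70B model

in_vram = max_tokens_per_second(MODEL_SIZE_GB, GPU_VRAM_BANDWIDTH_GB_S)
in_ram = max_tokens_per_second(MODEL_SIZE_GB, SYSTEM_RAM_BANDWIDTH_GB_S)

print(f"weights in VRAM:       ~{in_vram:.0f} tokens/s ceiling")
print(f"weights in system RAM: ~{in_ram:.1f} tokens/s ceiling")
```

Under these assumptions the ceiling is roughly 40 tokens per second from VRAM and about 2 from system RAM, the same order of magnitude as the community benchmarks the report cites. Real throughput sits below the ceiling.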
The analysis says local LLM inference is mainly memory-bandwidth-bound, which makes VRAM capacity the hard constraint. On the report’s Q4 sizing map, 7B to 8B models need about 6GB to 8GB, 26B to 32B models need around 20GB, and 70B models need about 43GB. Larger 100B-plus and frontier-class models can require 60GB to 130GB or more, making them multi-GPU or large unified-memory jobs.
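The sizing map follows from simple arithmetic on parameter count and bits per weight. The sketch below is a rough estimator, not the report's method: `estimate_weight_gb`, `fits_in_vram`, the 4.5 bits-per-weight figure (Q4 formats store scaling metadata alongside the 4-bit values) and the 10% runtime overhead are all assumptions chosen for illustration.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8.0


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_fraction: float = 0.10) -> bool:
    """True if weights plus an assumed runtime overhead fit in VRAM.

    overhead_fraction is a hypothetical allowance for KV cache and
    activations; real overhead depends on context length and runtime.
    """
    weights = estimate_weight_gb(params_billion, bits_per_weight)
    return weights * (1 + overhead_fraction) <= vram_gb


ASSUMED_Q4_BITS = 4.5  # assumption: 4-bit values plus per-block scales

for size in (8, 32, 70):
    gb = estimate_weight_gb(size, ASSUMED_Q4_BITS)
    print(f"{size}B at ~Q4: about {gb:.1f} GB of weights, "
          f"fits in 24 GB: {fits_in_vram(size, ASSUMED_Q4_BITS, 24)}")
```

With these assumptions, the 32B and 70B estimates come out near the report's 20GB and 43GB figures once the overhead is included. Treat the output as a first-pass check rather than a guarantee.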
On cost, the report argues that VRAM per dollar matters more than buying the newest card. It says a used RTX 3090 24GB, priced at around $600 to $850 in late June 2026, offers roughly five times the VRAM per dollar of an RTX 5090. Four used 3090s would provide 96GB of pooled VRAM for about $3,200 or less, though used hardware may carry warranty, power and reliability tradeoffs.
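The VRAM-per-dollar comparison is a one-line calculation. The sketch below uses the report's used RTX 3090 price range. The RTX 5090's 32GB capacity is its published spec, but its price here is a placeholder assumption, since the price the report used is not stated above.

```python
def gb_per_dollar(vram_gb: float, price_usd: float) -> float:
    """VRAM capacity bought per dollar spent."""
    if price_usd <= 0:
        raise ValueError("price_usd must be positive")
    return vram_gb / price_usd


# Used RTX 3090: 24 GB at the report's late-June-2026 range of $600 to $850.
rtx3090_low = gb_per_dollar(24, 850)   # top of the price range
rtx3090_high = gb_per_dollar(24, 600)  # bottom of the price range

# RTX 5090: 32 GB; the price is a HYPOTHETICAL placeholder, not from the report.
ASSUMED_5090_PRICE_USD = 4000
rtx5090 = gb_per_dollar(32, ASSUMED_5090_PRICE_USD)

print(f"3090: {rtx3090_low:.4f} to {rtx3090_high:.4f} GB per dollar")
print(f"5090: {rtx5090:.4f} GB per dollar (assumed price)")
print(f"ratio: {rtx3090_low / rtx5090:.1f}x to {rtx3090_high / rtx5090:.1f}x")

# Four cards at an assumed $800 each: pooled VRAM and total cost.
cards, price_each = 4, 800
print(f"{cards} cards: {cards * 24} GB for ${cards * price_each:,}")
```

The ratio is sensitive to both prices, so swap in current figures before drawing conclusions.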
The real cost of a local-inference rig
Owning beats renting for steady AI work, so what does a local rig cost in 2026? The counterintuitive good news: the most expensive build is almost never the smartest one. It all comes down to one rule.
The difference between a fast rig and an unusably slow one is only whether the weights fit. LLM inference is memory-bandwidth-bound, so VRAM capacity is the hard limit you build around. Compute specs are mostly noise.
The squeeze reframes the rig like everything else in this series: discipline beats maximalism. VRAM is exactly the memory under the most pressure, so over-buying it is the 128GB-“to-be-safe” trap, only worse per gigabyte. Take the cheap, high-value step to 24GB (the gateway to the 30B class), reach for used 3090s and MoE models, and use quantization to climb a tier without buying silicon. Sized right, the rig pays for itself against the cloud’s ever-rising hidden bill. Next: Apple Silicon’s quiet memory advantage.
VRAM Sets the Budget
The finding matters because local AI buyers often compare cards by raw compute specs, while the report says the real limit is whether the model fits in fast memory. For people running steady inference jobs, that changes the purchase question from “What is the fastest GPU?” to “What model class do I need?”
The report also reframes the cloud-versus-local calculation. Thorsten Meyer AI says owning can beat renting for high-utilization workloads, but only when buyers avoid overbuilding. A disciplined setup aimed at 24GB, 48GB or 96GB memory targets may make more financial sense than paying for a premium single card that does not match the actual model workload.
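One way to test whether owning beats renting for a given workload is a simple break-even estimate. Every input below except the $3,200 hardware figure is a hypothetical placeholder (cloud rate, utilization, power draw, electricity price), and the hardware figure covers the GPUs only. The report's own cost model is not reproduced here.

```python
def breakeven_months(hardware_usd: float, cloud_usd_per_hour: float,
                     hours_per_month: float, power_kw: float,
                     electricity_usd_per_kwh: float) -> float:
    """Months until owned hardware costs less than renting.

    Returns float('inf') if local running costs meet or exceed cloud costs.
    """
    cloud_monthly = cloud_usd_per_hour * hours_per_month
    power_monthly = power_kw * hours_per_month * electricity_usd_per_kwh
    monthly_saving = cloud_monthly - power_monthly
    if monthly_saving <= 0:
        return float("inf")
    return hardware_usd / monthly_saving


months = breakeven_months(
    hardware_usd=3200,             # four used 3090s, per the report's figure
    cloud_usd_per_hour=2.00,       # HYPOTHETICAL rental rate
    hours_per_month=200,           # HYPOTHETICAL steady utilization
    power_kw=1.5,                  # HYPOTHETICAL draw under load
    electricity_usd_per_kwh=0.20,  # HYPOTHETICAL tariff
)
print(f"break-even after about {months:.1f} months")
```

Lower utilization stretches the payback period quickly, which is consistent with the report's point that owning only wins for high-utilization workloads.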
As an affiliate, we earn on qualifying purchases.
The Memory Squeeze Series
The article is Part 7 of Thorsten Meyer AI’s Memory Squeeze series. The prior installment argued that cloud rentals can hide long-term costs, while this installment prices the alternative: running models locally for privacy, ownership and more predictable usage economics.
The report points to quantization as a major reason local rigs remain viable. It says Q4 compression can cut model memory needs to about a quarter of full precision with modest quality loss for many uses. It also highlights Mixture-of-Experts (MoE) models, saying models such as Qwen3’s 30B MoE activate only a share of their parameters per token and so deliver a better tradeoff between quality and speed.
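Both effects can be put in rough numbers. The sketch below compares weight footprints at two precisions and contrasts the memory an MoE model occupies with the weights it reads per token. The 3B active-parameter figure for the 30B MoE and the nominal 4 bits per weight are assumptions for illustration, not figures from the report.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB at a given precision."""
    return params_billion * bits_per_weight / 8.0


# Quantization: same parameter count, fewer bits per weight.
fp16 = weight_gb(30, 16)
q4 = weight_gb(30, 4)  # nominal 4 bits; real Q4 formats are slightly larger
print(f"30B at FP16: {fp16:.0f} GB, at Q4: {q4:.0f} GB ({q4 / fp16:.0%} of FP16)")

# MoE: all experts must be resident, but each token touches only some of them.
TOTAL_PARAMS_B = 30   # determines how much memory the model occupies
ACTIVE_PARAMS_B = 3   # ASSUMED active parameters per token

resident_gb = weight_gb(TOTAL_PARAMS_B, 4)
read_per_token_gb = weight_gb(ACTIVE_PARAMS_B, 4)
print(f"MoE resident weights: {resident_gb:.1f} GB")
print(f"MoE weights read per token: {read_per_token_gb:.1f} GB")
```

The full footprint still has to fit in VRAM, but when decoding is bandwidth-bound, the smaller per-token read is what lets an MoE model generate faster than a dense model of the same total size.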
“The most expensive local-inference rig is almost never the smartest one.”
— Thorsten Meyer AI
Prices May Move Fast
Several details remain market-dependent. The report says prices are point-in-time figures from late June 2026, and GPU availability can change quickly. Used RTX 3090 cards may also differ widely by condition, warranty status, power draw and prior use.
The performance figures are based on community benchmarks, not a single standardized lab test. Actual speed can vary by model, quantization level, software stack, drivers, cooling, CPU offload and whether a workload uses single-GPU, dual-GPU or unified-memory hardware.
Apple Silicon Comes Next
The series is set to continue with Apple Silicon’s memory advantage, according to the source material. That next installment is expected to compare large unified-memory Macs with PC GPU builds for users trying to run bigger local models without crossing the VRAM cliff.
Key Questions
What is the main takeaway for local AI hardware buyers in 2026?
The report says buyers should start with model size and required VRAM capacity, then choose hardware. Buying the newest GPU may not be the best value for inference.
Why does VRAM matter more than raw GPU speed?
Thorsten Meyer AI says LLM inference is memory-bandwidth-bound. If the weights do not fit in GPU memory, performance can fall sharply when the system relies on slower RAM.
Is a used RTX 3090 still useful for local AI?
According to the report, a used RTX 3090 24GB can be a strong value for inference because it offers high VRAM per dollar. Buyers still need to account for used-card risks.
Can a single GPU run 70B models locally?
The report says a 70B Q4 model needs about 43GB, so it usually requires more than one 24GB card, a high-memory GPU, a large unified-memory Mac or more aggressive quantization.
Are the listed prices final buying advice?
No. The source labels the numbers as late June 2026 point-in-time prices and says they are not financial advice. Buyers should verify current pricing and hardware condition.
Source: Thorsten Meyer AI