TECH | 7 MIN READ

Run Llama 4 Locally With Ollama: 2026 Hardware & Setup Guide

Photo by Andrey Matveev on Pexels

Install Ollama (v0.6+), run 'ollama pull llama4:scout' to download the 67 GB model, then 'ollama run llama4:scout' to start. You need an NVIDIA GPU with 24 GB of VRAM for usable performance.

What Llama 4 Actually Is (and Why It Matters for Local AI)

Meta released Llama 4 in April 2025. Two models shipped: Scout and Maverick. Both use a mixture-of-experts (MoE) architecture, which means only a fraction of the total parameters activate per token. Scout has 109B total parameters but only 17B active. Maverick scales to 400B total with the same 17B active.

The MoE approach is what makes local deployment realistic, though the saving is mostly in compute rather than storage: all of Scout's 109B parameters still have to live somewhere, but only 17B of them fire per token. The router picks which of Scout's 16 experts handles each token (Maverick does the same with 128 experts), and because inactive experts can spill from VRAM into system RAM, a quantized Scout becomes workable on a single consumer GPU.

Scout is the one you want for local inference. It fits on a single GPU with quantization, supports a 10 million token context window, and handles both text and image inputs natively. Maverick’s 400B total weight makes it a server-room model.
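To make the routing idea concrete, here is a toy sketch (illustrative only, nothing like Meta's actual implementation): a router scores every expert, only the top-k run, so per-token compute depends on k rather than on how many experts exist in total.

```python
# Toy mixture-of-experts routing: a router scores each expert and only
# the top-k experts run per token. With k=1 and 16 experts, 15 experts
# sit idle for any given token.

def route(scores, k=1):
    """Indices of the k highest-scoring experts (stable on ties)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def moe_forward(token, experts, router, k=1):
    """Run only the routed experts and average their outputs."""
    chosen = route(router(token), k)
    return sum(experts[i](token) for i in chosen) / len(chosen)

# 16 toy "experts": each just scales its input by a different factor.
experts = [lambda x, i=i: x * (i + 1) for i in range(16)]
# Toy deterministic router: scores derived from the token value.
router = lambda x: [(x * (i + 3)) % 7 for i in range(16)]

y = moe_forward(2.0, experts, router, k=1)  # only 1 of 16 experts runs
```

The design point the sketch captures: adding more experts grows total parameters (and storage) without growing per-token compute, which is exactly how Maverick reaches 400B total while keeping the same 17B active.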

Hardware Requirements: What You Actually Need

The real bottleneck is VRAM. Everything else is secondary.

Setup      | GPU         | VRAM      | Quantization | Speed      | Quality
Production | NVIDIA H100 | 80 GB     | FP16         | ~120 tok/s | Full
Enthusiast | RTX 4090    | 24 GB     | int8         | ~105 tok/s | Near-lossless
Budget GPU | RTX 4090    | 24 GB     | int4         | ~85 tok/s  | Minimal loss
Aggressive | RTX 4090    | 24 GB     | 1.78-bit     | ~20 tok/s  | Noticeable loss
CPU only   | None        | 64 GB RAM | Q2_K         | ~0.5 tok/s | Degraded

The RTX 4090 with int8 quantization is the sweet spot: 105 tokens per second with negligible quality loss, faster than most API responses. If you can tolerate slightly lower quality, int4 drops the memory footprint to about 11 GB while keeping speeds above 80 tok/s.

For the aggressive quantizers: Unsloth’s 1.78-bit quantization squeezes Scout into any 24 GB card, but at 20 tok/s with visible quality degradation on nuanced tasks. Fine for casual chat. Not production-grade.

CPU-only inference works but barely. At 0.5 tok/s, you’re waiting minutes for a paragraph. Only viable for testing or if you genuinely cannot access a GPU.

System Requirements Checklist

Before you install anything, verify these minimums:

  • GPU route: NVIDIA GPU with 24+ GB VRAM (RTX 3090, RTX 4090, A6000, or better). AMD ROCm support exists but is less stable.
  • CPU route: 64 GB system RAM minimum. 128 GB recommended for reasonable context lengths.
  • Storage: 70 GB free for the Q4_K_M quantized model (67 GB download). SSD strongly recommended.
  • OS: macOS 13+, Linux (Ubuntu 22.04+), or Windows 11 with WSL2.
  • Ollama version: 0.6 or later (older releases cannot load Llama 4's MoE architecture).
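A quick preflight script can verify the storage and GPU items above (a sketch assuming a POSIX shell on Linux; the 70 GB threshold is the checklist's storage figure):

```shell
# Preflight check for disk space and GPU before pulling the model.
need_gb=70
free_kb=$(df -Pk . | awk 'NR==2 {print $4}')
free_gb=$(( free_kb / 1024 / 1024 ))
if [ "$free_gb" -lt "$need_gb" ]; then
  echo "Not enough disk: ${free_gb} GB free, need ${need_gb} GB"
else
  echo "Disk OK: ${free_gb} GB free"
fi
# VRAM check (NVIDIA only; skipped when driver tools are absent).
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
  echo "nvidia-smi not found - GPU route unavailable"
fi
```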

Step 1: Install Ollama

If you already have Ollama, update it. Llama 4 support requires a recent version with MoE architecture handling.

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from the Ollama website. WSL2 with the Linux install script also works.

Verify the installation:

ollama --version

You need version 0.6 or later. If you’re on an older version, the install script handles the upgrade automatically.

Step 2: Pull Llama 4 Scout

The official model tag is llama4:scout. This pulls the Q4_K_M quantized version at 67 GB.

ollama pull llama4:scout

On a 500 Mbps connection, expect 15-20 minutes. The model downloads in chunks and checksums automatically. If your connection drops, re-running the command resumes where it left off.
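The estimate checks out with back-of-envelope arithmetic:

```python
# Sanity check on the download time: 67 GB over a 500 Mbps link.
size_gb = 67            # Q4_K_M download size in gigabytes
link_mbps = 500         # link speed in megabits per second
seconds = size_gb * 8 * 1000 / link_mbps  # GB -> megabits -> seconds
minutes = seconds / 60  # ~18 minutes at sustained full speed
```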

For the community-quantized versions with different bit depths:

# Ultra-low VRAM (Q2 quantization, ~35 GB download)
ollama pull compcj/llama4-scout-ud-q2-k-xl

Step 3: Run Your First Prompt

Start an interactive session:

ollama run llama4:scout

The first run takes 30-60 seconds to load the model into VRAM. Subsequent prompts return their first token in under 200 ms.

To use it via the API (for integration with scripts, apps, or other tools):

curl http://localhost:11434/api/generate -d '{
  "model": "llama4:scout",
  "prompt": "Explain mixture-of-experts architecture in three sentences.",
  "stream": false
}'

The API runs on port 11434 by default. The /api/generate endpoint shown above is Ollama's native format; for tooling that expects the OpenAI format, the same server also exposes an OpenAI-compatible API under /v1.
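For scripts that shouldn't depend on any third-party package, a minimal stdlib-only client for the native endpoint might look like this (generate() assumes a server is already running on localhost:11434):

```python
# Minimal stdlib-only client for Ollama's native /api/generate endpoint,
# mirroring the curl example above.
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt, model="llama4:scout", stream=False):
    """Assemble the JSON body the endpoint expects."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(prompt, model="llama4:scout"):
    """POST a prompt and return the model's text response."""
    body = json.dumps(build_payload(prompt, model)).encode()
    req = request.Request(
        OLLAMA_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (with the server running):
#   print(generate("Explain mixture-of-experts in one sentence."))
```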

Step 4: Configure for Your Hardware

Ollama auto-detects your GPU and allocates VRAM. But you can tune it.

Set GPU layers manually (useful if you want to split between GPU and CPU). The layer count is the num_gpu model parameter, set from inside an interactive session:

ollama run llama4:scout
>>> /set parameter num_gpu 999

Setting num_gpu to 999 forces all layers onto the GPU. If your VRAM can't hold the full model, Ollama automatically spills excess layers to system RAM.

Limit context length (saves VRAM at the cost of shorter conversations), again from inside a session:

ollama run llama4:scout
>>> /set parameter num_ctx 4096

Scout’s 10M context window is theoretical. In practice, context length is limited by your available VRAM. A 4096-token context uses far less memory than the default and is enough for most single-turn tasks.
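When scripting rather than chatting, the same context cap can be set per request through the API's options field (a standard Ollama API feature); a guarded sketch:

```shell
# Cap the context window per request via the API's "options" field;
# num_ctx overrides the model default for this call only.
payload='{
  "model": "llama4:scout",
  "prompt": "Summarize mixture-of-experts routing.",
  "options": { "num_ctx": 4096 },
  "stream": false
}'
# Only send the request if a local Ollama server is actually reachable.
if curl -s -o /dev/null --max-time 2 http://localhost:11434; then
  curl -s http://localhost:11434/api/generate -d "$payload"
fi
```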

Multi-GPU setup: If you have two GPUs, Ollama can split the model across them. Set CUDA_VISIBLE_DEVICES=0,1 before launching.

Step 5: Use Llama 4 With Your Tools

Once Ollama is running, Llama 4 Scout works with any tool that supports the OpenAI API format.

Open WebUI (browser-based chat interface):

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

Point it at http://host.docker.internal:11434 and select llama4:scout from the model dropdown.

Python integration:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama4:scout",
    messages=[{"role": "user", "content": "What is mixture-of-experts?"}]
)
print(response.choices[0].message.content)

LangChain, LlamaIndex, Autogen: All support Ollama as a backend. Point the base URL to http://localhost:11434 and set the model name to llama4:scout.

Scout vs Maverick: Which One to Run Locally

Spec                 | Scout                     | Maverick
Total parameters     | 109B                      | 400B
Active parameters    | 17B                       | 17B
Experts              | 16                        | 128
Context window       | 10M tokens                | 1M tokens
Min VRAM (quantized) | ~11 GB (int4)             | ~200 GB
Local viability      | Yes (single consumer GPU) | No (multi-GPU server)
Multimodal           | Text + images             | Text + images
Languages            | 12                        | 12

Scout is the only realistic choice for local deployment. Maverick requires roughly 200 GB of VRAM even with aggressive quantization. That’s a multi-GPU server rack, not a desktop.

The tradeoff is quality depth. Maverick’s 128 experts give it broader specialization than Scout’s 16. For general-purpose tasks, coding, writing, and analysis, Scout handles them well. For highly specialized domain tasks, Maverick pulls ahead, but you’ll need cloud infrastructure to run it.

Performance Tuning and Troubleshooting

Out of memory errors: Reduce context length (/set parameter num_ctx 2048 in a session, or the num_ctx option in API calls), or switch to a more aggressive quantization. If you’re on a 24 GB card with int4, you have roughly 13 GB of headroom for KV cache.

Slow first response: Model loading takes 30-60 seconds on first prompt. Keep the Ollama server running in the background to avoid cold starts. The model stays in VRAM until you explicitly unload it or run a different model.
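To avoid cold starts from a script, the API's keep_alive field controls how long the model stays resident: a negative value means never unload, 0 unloads immediately, and a request with no prompt just preloads the model. A guarded sketch:

```shell
# Preload the model and pin it in VRAM via the keep_alive field.
payload='{"model": "llama4:scout", "keep_alive": -1}'
# Only send if a local Ollama server is actually reachable.
if curl -s -o /dev/null --max-time 2 http://localhost:11434; then
  curl -s http://localhost:11434/api/generate -d "$payload"
fi
```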

Quality degradation: If responses feel noticeably worse than the API version, you’re likely on too aggressive a quantization. Move from Q2 to Q4, or from int4 to int8. The jump from 2-bit to 4-bit quantization recovers most of the quality loss.

Image inputs: Llama 4 is natively multimodal. In the Ollama CLI, include the image’s file path in the prompt and it is detected and attached automatically:

ollama run llama4:scout "Describe this image: ./photo.jpg"

Meta tested image understanding with up to 5 input images per prompt. Beyond that, results become unreliable.
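Over the API, images travel as base64 strings in the images field of /api/generate (Ollama's multimodal request format). A small sketch, assuming a local server and a photo.jpg on disk:

```python
# Sending an image through Ollama's API: base64-encode the file and put
# it in the "images" list of the generate request.
import base64
import json
from urllib import request

def encode_image(path):
    """Read an image file and return it base64-encoded, as the API expects."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def describe(path, model="llama4:scout"):
    """Ask the model to describe one image; assumes a local server."""
    body = json.dumps({
        "model": model,
        "prompt": "Describe this image.",
        "images": [encode_image(path)],
        "stream": False,
    }).encode()
    req = request.Request(
        "http://localhost:11434/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (server running, image on disk):
#   print(describe("./photo.jpg"))
```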

When to Run Locally vs Use an API

Local inference makes sense when you need privacy (no data leaves your machine), zero latency variability, no per-token costs, or offline access. A single RTX 4090 running Scout at int8 produces 105 tok/s. That’s competitive with most hosted API endpoints and costs nothing per request after the hardware investment.

APIs make more sense if you need Maverick-level quality, don’t want to manage hardware, or need burst capacity beyond what a single GPU handles. The cost equation flips once you’re running more than roughly 10,000 requests per day, at which point self-hosting is cheaper.
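The breakeven claim can be sanity-checked with simple arithmetic. Every number in the example call below is an illustrative assumption (hardware price, per-request API cost, daily power cost), not a quoted figure:

```python
# Back-of-envelope breakeven for self-hosting vs paying per API request.

def breakeven_days(hardware_cost, api_cost_per_request, requests_per_day,
                   power_cost_per_day=1.0):
    """Days until the hardware pays for itself against per-request fees."""
    daily_api_spend = api_cost_per_request * requests_per_day
    daily_saving = daily_api_spend - power_cost_per_day
    if daily_saving <= 0:
        return float("inf")  # at this volume the API stays cheaper
    return hardware_cost / daily_saving

# Hypothetical: $2,000 GPU, $0.002 per request, 10,000 requests/day.
days = breakeven_days(2000, 0.002, 10_000)  # roughly three and a half months
```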

Sources: Meta AI Blog, Ollama Library, Hugging Face, Unsloth, NVIDIA Developer Blog


FAQ

How much VRAM do I need to run Llama 4 locally?
24 GB of VRAM is the practical minimum for Llama 4 Scout with quantization. An RTX 4090 with int8 quantization runs Scout at 105 tokens per second with negligible quality loss. With int4 quantization, the model fits in about 11 GB of VRAM.
Can I run Llama 4 on CPU only?
Yes, but performance is extremely slow. CPU-only inference requires 64 GB of system RAM and produces roughly 0.5 tokens per second. It's usable for testing but not practical for regular use.
What is the difference between Llama 4 Scout and Maverick?
Scout has 109B total parameters with 16 experts and fits on a single consumer GPU. Maverick has 400B total parameters with 128 experts and requires multi-GPU server infrastructure. Both have 17B active parameters per token.
Does Ollama support Llama 4 multimodal features?
Yes. Llama 4 in Ollama handles both text and image inputs natively. You can pass images by including the file path in the prompt (CLI) or as base64 strings in the API's images field. Meta tested image understanding with up to 5 input images per prompt.
How long does it take to download Llama 4 Scout?
The Q4_K_M quantized model is 67 GB. On a 500 Mbps connection, expect 15-20 minutes. Downloads resume automatically if interrupted.