Host the Right Model Through Ollama

Intro

Although it was hard to imagine before the AI era, by 2026 I rarely hand-write code at work; instead I rely on AI to make development more efficient. In an era where many AI models compete, I believe you don't need to obsess over any particular model: pick a free one and it will cover most tasks. Models will only get cheaper and more efficient.

Start by prioritizing free model services (see AI Agent Tools I’m Using in 2026). Even so, you sometimes still need a local LLM because of:

  • Offline use
  • Rate limits
  • Latency
  • Policy restrictions
  • Privacy
  • Cost

The Easiest Way to Run a Local LLM

Ollama is one of the most active free tools in the open-source community. Download the latest Ollama and you’ll find it integrates easily with common chat and coding tools and can launch local models with a single command:

Terminal window
ollama launch opencode

ollama
Ollama 0.20.3
Chat with a model
  Start an interactive chat with a model
Launch Claude Code
  Anthropic's coding tool with subagents
Launch Codex
  OpenAI's open-source coding agent
Launch OpenClaw (install)
  Personal AI with 100+ skills
Launch OpenCode
  Anomaly's open-source coding agent
Launch Droid (not installed)
  Factory's coding agent across terminal and IDEs
Launch Pi (install)
  Minimal AI agent toolkit with plugin support

Go to Ollama Models and pick a model to run directly. For example, the latest gemma4 can be pulled and run with:

Terminal window
ollama run gemma4

In practice, use CanIRun.ai to select your device specs and filter the models that will actually fit.

Understanding Model Configurations

Parameters

It’s a bit like the number of brain cells

Model parameter counts are in billions (e.g., 7B or 70B). More parameters usually mean a stronger model, higher memory requirements, and often slower inference. 7B suits basic tasks, 13B–34B is a balance, and 70B+ can offer higher peak quality but needs powerful hardware.
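A quick back-of-envelope estimate, assuming roughly 2 bytes per weight at FP16 and about 0.5 bytes per weight at 4-bit quantization: model size is parameter count times bytes per weight, plus some overhead.

Terminal window
# Rough size ≈ parameters (billions) × bytes per weight → GB
awk 'BEGIN { print 7 * 2,    "GB  # 7B at FP16 (~2 bytes/weight)" }'
awk 'BEGIN { print 7 * 0.5,  "GB  # 7B at 4-bit (~0.5 bytes/weight)" }'
awk 'BEGIN { print 70 * 0.5, "GB  # 70B at 4-bit" }'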

Quantization

It’s like compressing memory

In model names, quantization shows up as a GGUF suffix such as Q4_K_M, Q8_0, or F16 (16-bit, the original full precision). Quantization reduces numerical precision to shrink the model file and speed up execution, at the cost of some quality. The naming breaks down as follows, with a pull example after the list:

  • Q = Quantization
  • F = Float
  • K = K-means quantization algorithm
  • M = Medium (size tier)

VRAM

VRAM is GPU memory. Quantized model files usually need to fit entirely in VRAM (or Apple unified memory); otherwise they won’t run properly or will fall back to slow CPU inference (e.g., an 8GB model typically needs at least 8GB VRAM).
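One quick way to check the fit, assuming a reasonably recent Ollama: load a model, then look at the PROCESSOR column of ollama ps, which reports whether the weights landed fully on the GPU or were partially offloaded to the CPU (output abbreviated below):

Terminal window
ollama run gemma4 "hello"   # load the model once
ollama ps
# NAME     SIZE     PROCESSOR
# gemma4   5.0 GB   100% GPU      <- fits entirely in VRAM
# A split such as 40%/60% CPU/GPU means it spilled and generation will be slow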

Dense vs MoE (Mixture of Experts) architectures

  • Dense models are simple and predictable; all parameters are activated for each operation.

  • MoE models split parameters into expert groups and activate only a few experts per token (e.g., Mixtral 8x7B has 46.7B total parameters but uses only about 12.9B per token). The full parameter set still has to be loaded in memory, but you get close to large-model quality with less compute per token; see the example below the list.
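You can see this in practice by pulling an MoE model such as Mixtral and inspecting it; the download and memory footprint track the total parameter count, not the per-token active subset (the exact fields ollama show prints may vary by version):

Terminal window
ollama pull mixtral:8x7b
ollama show mixtral:8x7b   # reports architecture, total parameter count, quantization
# The ~47B total parameters must fit in memory, even though only ~13B are active per token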

Context Length

The maximum number of tokens a model can process at once, counting both the prompt and the generated output.
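In current Ollama versions this is controlled by the num_ctx parameter, and the default is often smaller than what the model supports, so long prompts can get silently truncated. A minimal way to raise it for one session:

Terminal window
ollama run gemma4
>>> /set parameter num_ctx 8192   # raise the context window for this session
>>> /show parameters              # confirm the change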

Tokens per Second (tok/s)

How many tokens the LLM generates per second.
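You can measure this directly: ollama run accepts a --verbose flag that prints timing statistics after each response, including the generation rate (output abbreviated):

Terminal window
ollama run gemma4 --verbose "Explain quantization in one sentence."
# ...
# prompt eval rate:   120.00 tokens/s
# eval rate:           35.00 tokens/s   <- generation speed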

Memory Bandwidth

The speed of reading data from VRAM.

Inference bottlenecks are often about reading model weights from memory, so higher bandwidth means more tokens can be read per second. This explains why Macs with Apple chips (high unified memory bandwidth) run large models well, and why an RTX 4090 can generate text faster than an RTX 4060 even with the same VRAM usage.
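A rough rule of thumb, assuming each generated token reads every weight from memory once: the ceiling on generation speed is roughly memory bandwidth divided by model size, with real-world numbers coming in lower:

Terminal window
# Theoretical ceiling: tokens/s ≈ memory bandwidth (GB/s) ÷ model size (GB)
awk 'BEGIN { print 1000 / 5, "tok/s  # ~1000 GB/s GPU, 5 GB quantized model" }'
awk 'BEGIN { print 270 / 5,  "tok/s  # ~270 GB/s GPU, same model" }'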

Further Reading