Intro
It was hard to imagine before the AI era, but by 2026 I rarely hand-write code at work; instead I use AI to develop more efficiently. In an era where many AI models compete, there is no need to be attached to any particular model: pick a free one and it will cover most tasks. Models will only get cheaper and more efficient.
Start by prioritizing free model services (see AI Agent Tools I’m Using in 2026). Even so, you sometimes still need a local LLM because of:
- Offline
- Rate limits
- Latency
- Policy
- Privacy
- $$
The Easiest Way to Run a Local LLM
Ollama is one of the most active free tools in the open-source community. Download the latest Ollama and you’ll find it is simple enough to integrate with common chat and coding tools, and it can launch local models with a single command:
```
ollama launch opencode
```

Or run `ollama` with no arguments to pick from the interactive launcher:

```
Ollama 0.20.3

  Chat with a model             Start an interactive chat with a model
  Launch Claude Code            Anthropic's coding tool with subagents
  Launch Codex                  OpenAI's open-source coding agent
  Launch OpenClaw (install)     Personal AI with 100+ skills
▸ Launch OpenCode               Anomaly's open-source coding agent
  Launch Droid (not installed)  Factory's coding agent across terminal and IDEs
  Launch Pi (install)           Minimal AI agent toolkit with plugin support
```

Go to Ollama Models and pick a model to run directly. For example, the latest gemma4 installs with:
```
ollama run gemma4
```

In practice, use CanIRun.ai to select your device specs and filter suitable models.
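Besides the CLI, Ollama also exposes a local REST API (by default on port 11434), so any script can talk to the model you just pulled. A minimal sketch with Python’s standard library — the model name `gemma4` is whichever model you installed above:

```python
import json
from urllib import request, error

# Ollama's local REST API listens on http://localhost:11434 by default.
payload = {
    "model": "gemma4",       # any model you have pulled locally
    "prompt": "Why is the sky blue?",
    "stream": False,         # return one JSON object instead of a token stream
}
body = json.dumps(payload).encode("utf-8")

req = request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
try:
    with request.urlopen(req, timeout=120) as resp:
        print(json.loads(resp.read())["response"])
except error.URLError:
    print("Ollama is not running; start it first (e.g. `ollama serve`).")
```

The same endpoint works from any language that can send HTTP, which is what most chat and coding tools integrate against.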
Understanding Model Configurations
Parameters
It’s a bit like the number of brain cells
Model parameter counts are in billions (e.g., 7B or 70B). More parameters usually mean a stronger model, higher memory requirements, and often slower inference. 7B suits basic tasks, 13B–34B is a balance, and 70B+ can offer higher peak quality but needs powerful hardware.
Quantization
It’s like compressing memory
A model’s quantization is indicated by its GGUF quantization name, such as Q4_K_M, Q8_0, or F16 (16-bit float, the original full precision). Quantization reduces numeric precision to shrink model size and speed up execution, at the cost of some quality.
- Q = Quantization
- F = Float
- K = K-means quantization algorithm
- M = Medium (size tier)
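As a rule of thumb, file size ≈ parameters × bits-per-weight ÷ 8. A rough sketch — the bits-per-weight figures below are approximations (K-quants mix precisions across tensors, and real GGUF files add metadata):

```python
# Approximate average bits per weight for common GGUF quantization levels.
# These are rough figures, not exact values from any spec.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def approx_size_gb(params_billions: float, quant: str) -> float:
    """Estimate model file size in GB from parameter count and quantization."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9  # bits -> bytes -> GB

for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{approx_size_gb(7, quant):.1f} GB")
# 7B at Q4_K_M: ~4.2 GB
# 7B at Q8_0:  ~7.4 GB
# 7B at F16:   ~14.0 GB
```

This is why a quantized 7B model fits on consumer GPUs while the same model at F16 does not.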
VRAM
VRAM is GPU memory. Quantized model files usually need to fit entirely in VRAM (or Apple unified memory); otherwise they won’t run properly or will fall back to slow CPU inference (e.g., an 8GB model typically needs at least 8GB VRAM).
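The file size alone isn’t quite enough: inference also needs memory for the KV cache (which grows with context length) and the runtime itself. A hedged fit-check sketch — the 20% headroom is an assumption, not a fixed rule:

```python
def fits_in_vram(model_size_gb: float, vram_gb: float, overhead: float = 0.20) -> bool:
    """True if the model file plus estimated runtime overhead fits in VRAM.

    Inference needs more than the raw weights: the KV cache grows with
    context length, and the runtime itself uses some memory, so we add
    a (hypothetical) 20% headroom on top of the file size.
    """
    return model_size_gb * (1 + overhead) <= vram_gb

print(fits_in_vram(4.2, 8))   # 7B at Q4_K_M on an 8GB GPU -> True
print(fits_in_vram(7.4, 8))   # 7B at Q8_0 on an 8GB GPU   -> False
```

If the check fails, drop to a smaller model or a more aggressive quantization rather than spilling to CPU.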
Dense vs MoE (Mixture of Experts) architectures
- Dense models are simple and predictable: every parameter is activated for every token.
- MoE models split their parameters into groups of experts and activate only a few experts per token (e.g., Mixtral 8x7B has 46.7B total parameters but uses only about 12.9B per token). The whole model must still be loaded into memory, but each token needs far less compute, so you get large-model quality at a much lower per-token cost.
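To make “only a few experts per token” concrete, here is a toy top-2 gating sketch. Everything in it is made up for illustration — real MoE routers score hidden vectors with learned weights, not words with random numbers:

```python
import math
import random

NUM_EXPERTS, TOP_K = 8, 2  # Mixtral-style: 8 experts, 2 active per token

def route(token: str) -> list[int]:
    """Pick which TOP_K experts process this token (toy scoring)."""
    rng = random.Random(token)  # deterministic per token, just for the demo
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    # A softmax turns raw scores into gate probabilities.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)
    return ranked[:TOP_K]

for tok in ["the", "sky", "is", "blue"]:
    print(tok, "->", route(tok))
```

Each token’s compute touches only 2 of the 8 experts, but all 8 experts’ weights must stay resident in memory — which is exactly the dense-vs-MoE trade-off above.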
Context Length
The maximum number of tokens a model can process at once.
Tokens per Second (tok/s)
How many tokens the LLM generates per second.
Memory Bandwidth
The speed of reading data from VRAM.
Inference bottlenecks are often about reading model weights from memory, so higher bandwidth means more tokens can be read per second. This explains why Macs with Apple chips (high unified memory bandwidth) run large models well, and why an RTX 4090 can generate text faster than an RTX 4060 even with the same VRAM usage.
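Since each generated token requires reading roughly all of the weights once, memory bandwidth sets a hard ceiling: tok/s ≈ bandwidth ÷ model size. A sketch with ballpark bandwidth figures — check your own hardware’s specs, and expect real throughput to land well below this ceiling:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth-bound ceiling: each token reads all weights from memory once."""
    return bandwidth_gb_s / model_size_gb

# Ballpark memory bandwidths in GB/s (approximate published figures).
DEVICES = {"RTX 4090": 1008, "RTX 4060": 272, "Apple M3 Max": 400}

for name, bw in DEVICES.items():
    ceiling = max_tokens_per_sec(bw, 4.2)  # a ~4.2 GB quantized 7B model
    print(f"{name}: ~{ceiling:.0f} tok/s ceiling")
```

The ratio between the 4090 and the 4060 here is the bandwidth ratio, which is why the faster card generates text faster even at the same VRAM usage.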
Further Reading
- What is Ollama? Running Local LLMs Made Simple - IBM Technology
- Quantization - Hugging Face
- Docs - CanIRun.ai