Why Ollama (And Why Now)?
If you want production‑like experiments without cloud keys or per‑call fees, Ollama gives you a local‑first developer path:
- Zero friction: Install once; pull models on demand; everything runs on `localhost` by default.
- One API, two runtimes: The same API works for local and (optional) cloud models, so you can start on your laptop and scale later with minimal code changes.
- Batteries included: A simple CLI (`ollama run`, `ollama pull`), a clean REST API, an official Python client, embeddings, and vision support (see the client sketch after this list).
- Repeatability: A `Modelfile` (think: Dockerfile for models) captures system prompts and parameters so teams get the same behaviour.
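As a first taste of the "batteries included" point, the sketch below chats with a local model through the official Python client. It assumes you have installed the `ollama` package (`pip install ollama`) and already pulled a model; the name `llama3.2` is illustrative, so substitute whatever you have locally.

```python
# Minimal sketch: chat with a locally pulled model via the official Python client.
# Assumes `pip install ollama` and `ollama pull llama3.2` have already been run.
import ollama

response = ollama.chat(
    model="llama3.2",  # illustrative model name; use any model you have pulled
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain GGUF quantization in one sentence."},
    ],
)

print(response["message"]["content"])
```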
What’s New in Late 2025 (at a Glance)
- Cloud models (preview): Run larger models on managed GPUs with the same API surface; develop locally, scale in the cloud without code changes.
- OpenAI‑compatible endpoints: Point OpenAI SDKs at Ollama's `/v1` routes for easy migration and local testing (a sketch follows this list).
- Windows desktop app: Official GUI for Windows users; drag‑and‑drop, multimodal inputs, and background service management.
- Safety/quality updates: Recent safety‑classification models and runtime optimizations (e.g., flash‑attention toggles in select backends) to improve performance.
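To illustrate the OpenAI-compatible surface, this sketch points the official `openai` Python SDK at a local Ollama server. The base URL and placeholder API key follow Ollama's documented `/v1` compatibility layer; the model name is again an example, not a requirement.

```python
# Minimal sketch: reuse the OpenAI Python SDK against a local Ollama server.
# Assumes `pip install openai` and that Ollama is serving on localhost:11434.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible routes
    api_key="ollama",                      # required by the SDK, ignored by Ollama
)

completion = client.chat.completions.create(
    model="llama3.2",  # illustrative; any locally pulled model works
    messages=[{"role": "user", "content": "Say hello from a local model."}],
)

print(completion.choices[0].message.content)
```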
How Ollama Works (Architecture in 90 Seconds)
- Runtime: A lightweight server listens on `localhost:11434` and exposes REST endpoints for chat, generate, and embeddings. Responses stream token by token (see the streaming sketch after this list).
- Model format (GGUF): Models are packaged as quantized `.gguf` binaries for efficient CPU/GPU inference and fast memory‑mapped loading.
- Inference engine: Built on the `llama.cpp` family of kernels with GPU offload via Metal (Apple Silicon), CUDA (NVIDIA), and others; choose a quantization that fits your hardware.
- Configuration: A `Modelfile` pins the base model, system prompt, parameters, adapters (LoRA), and optional templates, so your team's runs are reproducible.
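To make the runtime concrete, here is a small sketch that streams a chat response from the `/api/chat` endpoint using the `requests` library. The endpoint and the newline-delimited JSON chunk shape follow Ollama's REST API; the model name is an assumption, so substitute one you have pulled.

```python
# Minimal sketch: stream a chat response token by token from the local REST API.
# Assumes the Ollama server is running on localhost:11434 and the model is pulled.
import json
import requests

payload = {
    "model": "llama3.2",  # illustrative model name
    "messages": [{"role": "user", "content": "What is a Modelfile?"}],
    "stream": True,        # ask for newline-delimited JSON chunks
}

with requests.post("http://localhost:11434/api/chat", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Each chunk carries a partial assistant message; `done` marks the final chunk.
        print(chunk["message"]["content"], end="", flush=True)
        if chunk.get("done"):
            print()
            break
```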
Install in 60 Seconds
macOS / Windows / Linux
1. Download and install Ollama from the official site (choose your OS).