Ollama — The Complete Guide
Smart language models (like ChatGPT) running directly on your own machine — no cloud required
Ollama is an open-source platform that lets you run powerful AI language models — LLMs (Large Language Models, the engines behind ChatGPT, Claude, and friends) — directly on your own machine. No internet connection required, no data shipped off to OpenAI or Google, everything stays with you in full privacy. The platform is written in Go and knows how to run dozens of well-known models including Google's Gemma, Meta's Llama, Alibaba's Qwen, and DeepSeek — all completely free. For me (Elad), Ollama mostly serves as a safety net: when cloud models get too expensive or hit rate limits, my agents (like Kami, Kaylee, and CrewAI) automatically fall back to a local model — saving a lot of money on routine tasks. For you it can be much more than that: a full AI environment that works offline, a solution for organizations with strict privacy requirements (healthcare, legal, security), or simply a way to explore the world of open language models without spending a dollar.
What this guide covers
So what actually is Ollama?
The simplest way into the world of local AI
Ollama was born as a project that challenges one assumption: that using advanced AI means connecting to some giant cloud vendor and paying them. It provides a single simple tool that downloads a model, loads it into memory, and opens it up for conversation — just like ChatGPT, but without OpenAI ever knowing anything about you.
Installation — every platform
Mac, Linux, Windows, Docker
Installing Ollama is a very simple operation that's supported on every major OS. My recommendation: install directly on your machine (Mac and Linux) — that gives you immediate access to your GPU and accelerates performance significantly. Docker — the system that runs software inside isolated 'boxes' — is reserved for people who truly need separation between servers or work in a production environment.
Which model should you pick?
Breakdown by use case — small vs large, chat vs code
Picking a model can feel complicated — the Ollama library has hundreds of models with names packed full of technical acronyms. The simple truth is that for each kind of task only five or six models actually matter, and in practice most people get by with two or three. Here's the practical guide to making a smart choice based on your task and your hardware.
Using the REST API
OpenAI-compatible — easy swap for existing integrations
The API is how software talks to Ollama from code. The default is port 11434 (the number the service listens on locally), and the API supports a range of paths: /api/generate for simple text generation, /api/chat for conversations with history, /api/embeddings to turn text into numbers, and /v1/chat/completions — a path that's fully compatible with OpenAI's API. That last one is the magic — any software that already knows how to talk to ChatGPT can switch to Ollama without changing almost anything.
Performance — what to expect and how to improve it
tokens/sec, latency, and throughput
Performance is the first question every Ollama newcomer asks: how fast will this be on my machine? The answer depends on three main factors — the size of the model (how 'smart' it is), your hardware (CPU alone, or a GPU that accelerates the compute), and the quantization level (compression). Here are the typical numbers in 2026, so you know what to expect up front — and how to improve things if the numbers don't satisfy you.
Integrating with the agent network
How Ollama fits with Kami, CrewAI, Delegator
Integration is the point where Ollama goes from being a nice local tool to becoming a beating part of a larger system. In my agent network, Ollama plays the role of a safety net (fallback — a backup plan) as well as a background worker for routine tasks that don't justify paying the cloud. Thanks to the OpenAI-compatible endpoint, every model in the network can swap from Claude or Gemini to Ollama with just a URL change. This is especially useful for classification tasks inside Adopter and for triaging intakes in Box.

