Local Models (Ollama & LM Studio)
EdgeCrab works with local LLMs through Ollama and LM Studio — no API keys, no internet, no billing. This is the recommended setup for air-gapped environments, privacy-sensitive work, or learning without cost.
Ollama
Section titled “Ollama”Ollama is the easiest way to run LLMs locally. It handles model downloads, quantization, GPU acceleration, and serves a local OpenAI-compatible API.
Installation
Section titled “Installation”# macOSbrew install ollama
# Linuxcurl -fsSL https://ollama.com/install.sh | sh
# Windows# Download from https://ollama.com/downloadStart the Server
Section titled “Start the Server”ollama serve# Listening on http://127.0.0.1:11434Keep this running in a terminal (or as a background service) while using EdgeCrab.
Pull Models
Section titled “Pull Models”ollama pull llama3.3 # General-purpose flagship (4.9 GB, 4-bit)ollama pull codestral # Code-specialized model (4.1 GB)ollama pull gemma3:27b # Google Gemma 3 27B (16 GB)ollama pull qwen2.5-coder:7b # Qwen 2.5 Coder 7B (4.4 GB, fast)ollama pull phi4 # Microsoft Phi-4 (8.1 GB)Check available models:
ollama listUsing with EdgeCrab
Section titled “Using with EdgeCrab”# One-off sessionedgecrab --model ollama/llama3.3
# Make Ollama the defaultedgecrab setup # Choose "ollama" when promptedIn config.yaml:
provider: ollamamodel: llama3.3Recommended Models by Use Case
Section titled “Recommended Models by Use Case”| Use Case | Model | Size | Notes |
|---|---|---|---|
| General coding | codestral | 4.1 GB | Best code model for Ollama |
| General purpose | llama3.3 | 4.9 GB | Meta’s latest, excellent quality |
| Fast, lightweight | qwen2.5-coder:7b | 4.4 GB | Great performance for size |
| High quality (needs 16+ GB) | gemma3:27b | 16 GB | Excellent reasoning |
| Fastest response | phi4 | 8.1 GB | Microsoft’s efficient model |
GPU Acceleration
Section titled “GPU Acceleration”Ollama automatically uses your GPU if available:
- Apple Silicon (M1/M2/M3/M4): uses Metal via llama.cpp — full GPU acceleration
- NVIDIA: CUDA support automatic with compatible drivers
- AMD: ROCm support on Linux
Check GPU utilization:
# macOSsudo powermetrics --samplers gpu_power -i1000 -n1
# Linux (NVIDIA)nvidia-smiLM Studio
Section titled “LM Studio”LM Studio is a desktop application for downloading and running local models with a GUI.
- Download LM Studio from lmstudio.ai
- Install and open it
- Search for and download a model (e.g. “Llama 3.3”, “Codestral”)
- Click “Start Server” in the Local Server tab
The server starts on http://localhost:1234 with an OpenAI-compatible API.
Using with EdgeCrab
Section titled “Using with EdgeCrab”# Use whatever model is currently loaded in LM Studioedgecrab --model lmstudio/local-modelIn config.yaml:
provider: lmstudiomodel: local-model # Any string — LM Studio ignores the model name and uses the loaded modelPerformance Tips
Section titled “Performance Tips”Context Length
Section titled “Context Length”Local models typically support 8K–32K context windows (much less than cloud models). EdgeCrab automatically compresses history when approaching the limit, but you can also reduce it:
session: max_context_tokens: 8000 # Match your local model's context windowModel Quantization
Section titled “Model Quantization”Smaller quantizations (Q4, Q5) are faster with slightly lower quality. Larger (Q8, fp16) are slower but more accurate. For coding tasks, Q5_K_M or Q6_K are a good balance.
Threads
Section titled “Threads”For CPU-only inference, set the number of threads to your CPU core count:
# Ollama — set via environmentOLLAMA_NUM_THREAD=8 ollama serveCold Start
Section titled “Cold Start”Ollama keeps the model loaded in memory after first use. Cold start (first request after pulling the model) can take 5–30 seconds depending on model size and hardware. Subsequent requests are fast.
Offline Mode
Section titled “Offline Mode”When using local models, you may want to disable web tools to ensure no outbound connections:
edgecrab --model ollama/llama3.3 --toolset file,terminal,memory,skillsOr in config.yaml:
tools: enabled: - file - terminal - memory - skills # web and session omittedPro Tips
Section titled “Pro Tips”- Choose model size by VRAM/RAM: A rule of thumb is 6 GB VRAM for 7B models at Q4, 12 GB for 13B, 24 GB for 34B. On CPU, double these values as RAM requirements.
- Set a low max_iterations for local testing: Local models are slower, so
EDGECRAB_MAX_ITERATIONS=10keeps experiments snappy while you pick the right model. - Use Ollama’s model aliases:
ollama pull codestral:latestpins to the latest release. Omit the tag if you wantollama pull codestralto auto-upgrade. - LM Studio model name doesn’t matter: EdgeCrab sends
model: local-modelbut LM Studio uses whatever model it has loaded. Only the server URL matters. - Keep the Ollama server warm: Use
OLLAMA_KEEP_ALIVE=24h ollama serveso the model stays loaded in VRAM between sessions. - Monitor GPU memory:
ollama psshows which models are loaded and their VRAM usage.
Which model should I start with?
For coding: codestral (Ollama) is the best all-round local coding model at 4 GB. For general tasks: llama3.3 gives the best quality. For low-RAM machines: qwen2.5-coder:7b fits in 4 GB.
My model fits in RAM but generation is slow. Why?
The model is running on CPU because the GPU isn’t detected. Check ollama ps — a (CPU) label means no GPU offloading. Install CUDA drivers (NVIDIA) or verify Metal is enabled (macOS).
Can I use a local model for some tasks and a cloud model for others?
Yes. Use --model ollama/codestral for local tasks and override per-session. No global config change needed.
Does EdgeCrab support OpenAI-format custom base URLs?
Yes. Set base_url: http://localhost:11434/v1 in config.yaml to point at any OpenAI-compatible local server.
My local model outputs garbage with tool calls. What’s wrong?
Not all local models support function calling (tool use). Stick to models with :tools variants in Ollama (e.g. llama3.1:8b-instruct-q4_K_M) or models tested with OpenAI function calling format.
See Also
Section titled “See Also”- Provider Overview — full list of supported providers
- Environment Variables —
EDGECRAB_MODELandOLLAMA_*vars - Offline Mode toolset config — disable web tools for air-gapped use