Local AI Coding — GPU-Accelerated Setup (Intel Arc via Vulkan)
A guide to running open-weight LLMs locally on this machine with Intel Arc GPU acceleration, using them as coding/sysadmin assistants — without sending tokens to Anthropic.
For the full story of how this setup was built — including what failed and why — see 2026-03-29-start.
Status (Apr 2026): llama.cpp + Vulkan is the working GPU path. Ollama runs CPU-only (CUDA binary, no Intel support). IPEX-LLM SYCL path has a runtime version incompatibility with the installed compute runtime.
Measured Performance (qwen2.5-coder:7b)
| Mode | Prompt speed | Generation speed |
|---|---|---|
| Ollama (CPU only) | ~10 t/s | ~2 t/s |
| llama.cpp + Vulkan (Arc GPU) | 28.8 t/s | 5.6 t/s |
This Machine’s Specs (phoenix)
| Component | Detail |
|---|---|
| CPU | Intel Core Ultra 9 185H (Meteor Lake) |
| GPU | Intel Arc (integrated, Meteor Lake-P) |
| RAM | 30 GB |
| OS | Ubuntu + KDE |
No discrete GPU. The integrated Arc shares system RAM rather than having dedicated VRAM, so 7B–13B parameter models are the comfortable range. The llama.cpp + Vulkan setup below offloads work to the Arc GPU; CPU-only inference (via Ollama) remains a reliable fallback.
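To confirm what your own machine reports, the standard Linux utilities below suffice (vulkaninfo ships with the vulkan-tools package):

```shell
lscpu | grep 'Model name'            # CPU model
free -h | grep '^Mem'                # total RAM
command -v vulkaninfo >/dev/null \
  && vulkaninfo --summary 2>/dev/null | grep -i deviceName \
  || echo "vulkaninfo not installed"
```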
Why Use Local Models?
- No token costs — runs entirely on your hardware
- Privacy — your code and prompts never leave your machine
- Offline — works without internet
- Tradeoff — noticeably less capable than Claude Sonnet/Opus for complex reasoning and multi-step tasks
WORKING SETUP: llama.cpp + Vulkan + Aider
This is the recommended path for GPU-accelerated local AI on this machine.
Quick Start (everything already installed)
# Start the GPU server with the coding model
llama-serve # uses qwen2.5-coder:7b by default
llama-serve llama # use llama3.1:8b instead
llama-serve nano # use nemotron-3-nano:4b (fastest)
# In another terminal, launch Aider connected to the local server
aider-local # auto-starts server if not running
The server runs at http://localhost:8081 (OpenAI-compatible API).
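When scripting around this, it helps to wait until the server actually answers before launching a client. A small helper (a sketch, not one of the stock scripts; the URL matches the server address above):

```shell
# Poll the local API until it responds, or give up after N tries
wait_for_server() {
  url=${1:-http://localhost:8081/v1/models}
  tries=${2:-15}
  n=0
  while [ "$n" -lt "$tries" ]; do
    if curl -sf "$url" >/dev/null 2>&1; then
      echo up
      return 0
    fi
    n=$((n+1))
    sleep 1
  done
  echo down
  return 1
}
# usage: wait_for_server && aider-local
```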
Model Files (symlinked from Ollama blobs)
| Alias | Model | File |
|---|---|---|
| coder | qwen2.5-coder:7b | ~/models/qwen2.5-coder-7b.gguf |
| llama | llama3.1:8b | ~/models/llama3.1-8b.gguf |
| qwen | qwen2.5:7b | ~/models/qwen2.5-7b.gguf |
| nano | nemotron-3-nano:4b | ~/models/nemotron-3-nano-4b.gguf |
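Because these files are symlinks into Ollama's blob store, removing a model with ollama rm can leave a dangling link behind. A quick check (a sketch, not one of the stock scripts):

```shell
# Flag .gguf symlinks whose blob target no longer exists
check_models() {
  for f in "$1"/*.gguf; do
    if [ -e "$f" ]; then
      echo "OK $f"
    else
      echo "MISSING $f"
    fi
  done
}
# usage: check_models ~/models
```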
Scripts
- ~/bin/llama-serve — starts llama-server with Vulkan GPU offloading
- ~/bin/aider-local — launches Aider connected to the local server
Adding New Models
# Download via Ollama
ollama pull <modelname>
# Find the blob hash
cat /usr/share/ollama/.ollama/models/manifests/registry.ollama.ai/library/<name>/<tag> \
| python3 -c "import sys,json; d=json.load(sys.stdin); \
[print(l['digest']) for l in d['layers'] if 'model' in l['mediaType']]"
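The python3 filter can be sanity-checked on a fake manifest first; the layer layout below is an assumption based on Ollama's OCI-style manifests:

```shell
# Feed the filter a minimal fake manifest: only the layer whose
# mediaType contains "model" should have its digest printed.
cat <<'EOF' | python3 -c "import sys,json; d=json.load(sys.stdin); \
[print(l['digest']) for l in d['layers'] if 'model' in l['mediaType']]"
{"layers": [
  {"mediaType": "application/vnd.ollama.image.model",  "digest": "sha256-abc123"},
  {"mediaType": "application/vnd.ollama.image.params", "digest": "sha256-def456"}
]}
EOF
# prints: sha256-abc123
```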
# Symlink it
ln -s /usr/share/ollama/.ollama/models/blobs/sha256-<hash> ~/models/<name>.gguf
# Add it to ~/bin/llama-serve in the MODELS array

Use with Aider Directly (without the wrapper script)
# Start server first
llama-serve coder
# In another terminal
aider \
--model openai/qwen2.5-coder-7b.gguf \
--openai-api-base http://localhost:8081/v1 \
--openai-api-key local \
--no-auto-commits \
--no-gitignore

How It Was Built
Dependencies installed
sudo apt install -y glslc glslang-tools
sudo apt install -y intel-oneapi-runtime-dpcpp-cpp intel-oneapi-runtime-mkl intel-oneapi-runtime-dnnl

llama.cpp build
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
To rebuild after updates:
cd ~/llama.cpp && git pull
cmake --build build --config Release -j$(nproc)

Part 1: Install Ollama (background model manager)
Ollama is a tool that makes downloading and running open LLMs as simple as ollama run modelname.
curl -fsSL https://ollama.com/install.sh | sh
This installs the ollama binary and sets up a systemd service that runs in the background.
Verify it’s running:
ollama --version
systemctl status ollama

Part 2: Choose and Pull a Model
Models are downloaded with ollama pull. With the systemd service install, they're stored under /usr/share/ollama/.ollama/models/ (or ~/.ollama/models/ when running ollama as your own user).
Recommended models for this machine (30GB RAM, no discrete GPU)
| Model | Size | Best for |
|---|---|---|
| qwen2.5-coder:7b | ~4.7 GB | Code generation, debugging |
| qwen2.5:7b | ~4.7 GB | General purpose, sysadmin |
| llama3.1:8b | ~4.9 GB | General purpose, good reasoning |
| deepseek-coder-v2:16b | ~9 GB | Stronger coding, slower |
| qwen2.5-coder:14b | ~9 GB | Best coding without discrete GPU |
Note: Models up to ~14B run at acceptable speed on your CPU. 32B+ models will be very slow (minutes per response).
Pull a model:
ollama pull qwen2.5-coder:7b
ollama pull llama3.1:8b
List downloaded models:
ollama list

Part 3: Run a Model Interactively
ollama run qwen2.5-coder:7b
You’ll get a >>> prompt. Type your question and press Enter. Type /bye to exit.
Example session:
>>> Write a bash script that backs up my home directory to /mnt/backup
... (model responds with script)
>>> Now add error handling if /mnt/backup doesn't exist
... (model continues)
/bye
Useful / commands inside the session:
- /help — list all commands
- /clear — clear conversation history
- /show info — show model details
- /bye — exit
Part 4: Use Ollama as an API
Ollama runs a local REST API at http://localhost:11434. This is how other tools (Aider, Continue.dev, etc.) connect to it.
Test it:
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5-coder:7b",
"prompt": "What is a symlink?",
"stream": false
}'
The API is OpenAI-compatible at /v1/, meaning any tool that supports OpenAI’s API can point at Ollama.
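For /v1/, requests use OpenAI's chat-completions shape instead. A sketch that validates the payload locally before sending (run the commented curl with the service up):

```shell
payload='{
  "model": "qwen2.5-coder:7b",
  "messages": [{"role": "user", "content": "What is a symlink?"}]
}'
# sanity-check the JSON before sending
echo "$payload" | python3 -c 'import sys,json; json.load(sys.stdin); print("payload ok")'
# then, with the Ollama service running:
# curl http://localhost:11434/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```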
Part 5: Aider — CLI Coding Assistant (Claude Code equivalent)
Aider is the closest equivalent to Claude Code for local models. It reads your code files, understands the codebase, and makes edits based on your instructions — all from the terminal.
Install Aider
pip install aider-chat
# or, if you prefer pipx to avoid polluting global Python:
pipx install aider-chat

Run Aider with Ollama
# In your project directory:
aider --model ollama/qwen2.5-coder:7b
# Or for a stronger model:
aider --model ollama/qwen2.5-coder:14b
Aider will read files you specify and let you give it instructions in plain English:
> Add input validation to the login function in auth.py
> Fix the bug on line 42 of server.py
> Refactor the database module to use connection pooling
It shows you a diff before applying changes and commits to git automatically (optional).
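Aider also reads options from a .aider.conf.yml file in the project root, so common flags don't need repeating on every run. A minimal sketch (keys mirror the CLI flag names):

```yaml
# .aider.conf.yml — one key per CLI flag, without the leading --
model: ollama/qwen2.5-coder:7b
auto-commits: false
dark-mode: true
```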
Useful Aider flags
aider --model ollama/qwen2.5-coder:7b \
  --no-auto-commits \
  --dark-mode \
  src/main.py src/utils.py
# --no-auto-commits: don't auto-commit changes
# --dark-mode: better colors for dark terminals
# trailing paths: open specific files
(Note: a comment after a line-continuation backslash breaks the command, so the flag notes go below.)

Part 6: Continue.dev — VS Code / IDE Integration
If you use VS Code or a JetBrains IDE, Continue.dev gives you an in-editor chat panel, inline completions, and codebase-aware Q&A — all powered by your local Ollama models.
Install
In VS Code: Extensions → search “Continue” → Install
Configure for Ollama
Open ~/.continue/config.json and add:
{
"models": [
{
"title": "Qwen2.5 Coder 7B (local)",
"provider": "ollama",
"model": "qwen2.5-coder:7b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Qwen2.5 Coder 7B",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
Restart VS Code. You’ll see the Continue panel on the left sidebar.
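JSON here is unforgiving about trailing commas, so it's worth validating after editing. A small helper (hypothetical, not part of Continue):

```shell
# Report whether a file parses as JSON
check_json() {
  python3 -m json.tool "$1" >/dev/null 2>&1 && echo "OK" || echo "invalid JSON"
}
# usage: check_json ~/.continue/config.json
```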
Part 7: Open WebUI (Optional — Browser Chat Interface)
If you want a ChatGPT-style web UI for Ollama:
docker run -d \
--network=host \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:8080 in your browser.
Part 8: Intel Arc GPU Acceleration (Advanced)
In principle, Vulkan-based GPU acceleration works with Intel Arc and can speed up inference significantly. In practice, the stock Ollama binary on this machine runs CPU-only (see Status at the top), which is why llama.cpp + Vulkan is the recommended path in this guide; the notes below are kept for reference.
Check if Vulkan is available:
vulkaninfo --summary 2>/dev/null | grep "GPU id"
If your Ollama build supports Vulkan, it should auto-detect and use the Arc GPU. You can verify:
ollama run qwen2.5-coder:7b
# In another terminal:
ollama ps
ollama ps shows whether a loaded model is running on GPU or CPU.
For better Intel Arc support, you can also look into IPEX-LLM (Intel’s optimized inference engine), though it requires more setup than Ollama. Note the Status at the top: on this machine, its SYCL path currently hits a compute-runtime version incompatibility.
Quick Reference
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull models
ollama pull qwen2.5-coder:7b
ollama pull llama3.1:8b
# Chat interactively
ollama run qwen2.5-coder:7b
# List models
ollama list
# Remove a model
ollama rm modelname
# Check what's running
ollama ps
# Stop the Ollama service
systemctl stop ollama
# Aider (coding assistant)
aider --model ollama/qwen2.5-coder:7b

Limitations vs Claude Code
| Feature | Claude Code (Anthropic) | Aider + Ollama (local) |
|---|---|---|
| Code quality | Excellent | Good (7B), Better (14B+) |
| Reasoning | Excellent | Moderate |
| Context window | 200K tokens | 8K–32K (model dependent) |
| Tool use / shell exec | Yes | Limited |
| Cost | Per token | Free (electricity) |
| Privacy | Sent to Anthropic | Stays on your machine |
| Speed | Fast | Moderate (CPU-bound) |
For sysadmin tasks, shell scripting, and targeted code edits, local 7B–14B models perform well. For architecture-level reasoning, debugging complex multi-file issues, or long context tasks, Claude Sonnet is noticeably better.