Local AI Coding — GPU-Accelerated Setup (Intel Arc via Vulkan)
A guide to running open-weight LLMs locally on this machine with Intel Arc GPU acceleration, using them as coding/sysadmin assistants — without sending tokens to Anthropic.
For the full story of how this setup was built — including what failed and why — see 2026-03-29-start.
Status (Apr 2026): llama.cpp + Vulkan is the working GPU path. Ollama runs CPU-only (CUDA binary, no Intel support). IPEX-LLM SYCL path has a runtime version incompatibility with the installed compute runtime.
Measured Performance (qwen2.5-coder:7b)
| Mode | Prompt speed | Generation speed |
|---|---|---|
| Ollama (CPU only) | ~10 t/s | ~2 t/s |
| llama.cpp + Vulkan (Arc GPU) | 28.8 t/s | 5.6 t/s |
This Machine’s Specs (phoenix)
| Component | Detail |
|---|---|
| CPU | Intel Core Ultra 9 185H (Meteor Lake) |
| GPU | Intel Arc (integrated, Meteor Lake-P) |
| RAM | 30 GB |
| OS | Ubuntu + KDE |
No discrete GPU. The integrated Arc shares system RAM rather than having dedicated VRAM, so 7B–13B parameter models are the comfortable range. The llama.cpp + Vulkan setup below offloads work to the Arc GPU; CPU-only inference (via Ollama) remains a reliable fallback.
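To confirm what your own machine reports, the standard Linux utilities below suffice (vulkaninfo ships with the vulkan-tools package):

```shell
lscpu | grep 'Model name'            # CPU model
free -h | grep '^Mem'                # total RAM
command -v vulkaninfo >/dev/null \
  && vulkaninfo --summary 2>/dev/null | grep -i deviceName \
  || echo "vulkaninfo not installed"
```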
Why Use Local Models?
- No token costs — runs entirely on your hardware
- Privacy — your code and prompts never leave your machine
- Offline — works without internet
- Tradeoff — noticeably less capable than Claude Sonnet/Opus for complex reasoning and multi-step tasks
WORKING SETUP: llama.cpp + Vulkan + Aider
This is the recommended path for GPU-accelerated local AI on this machine.
Quick Start (everything already installed)
# Start the GPU server with the coding model
llama-serve # uses qwen2.5-coder:7b by default
llama-serve llama # use llama3.1:8b instead
llama-serve nano # use nemotron-3-nano:4b (fastest)
# In another terminal, launch Aider connected to the local server
aider-local # auto-starts server if not running
The server runs at http://localhost:8081 (OpenAI-compatible API).
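When scripting around this, it helps to wait until the server actually answers before launching a client. A small helper (a sketch, not one of the stock scripts; the URL matches the server address above):

```shell
# Poll the local API until it responds, or give up after N tries
wait_for_server() {
  url=${1:-http://localhost:8081/v1/models}
  tries=${2:-15}
  n=0
  while [ "$n" -lt "$tries" ]; do
    if curl -sf "$url" >/dev/null 2>&1; then
      echo up
      return 0
    fi
    n=$((n+1))
    sleep 1
  done
  echo down
  return 1
}
# usage: wait_for_server && aider-local
```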
Model Files (symlinked from Ollama blobs)
| Alias | Model | File |
|---|---|---|
| coder | qwen2.5-coder:7b | ~/models/qwen2.5-coder-7b.gguf |
| llama | llama3.1:8b | ~/models/llama3.1-8b.gguf |
| qwen | qwen2.5:7b | ~/models/qwen2.5-7b.gguf |
| nano | nemotron-3-nano:4b | ~/models/nemotron-3-nano-4b.gguf |
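Because these files are symlinks into Ollama's blob store, removing a model with ollama rm can leave a dangling link behind. A quick check (a sketch, not one of the stock scripts):

```shell
# Flag .gguf symlinks whose blob target no longer exists
check_models() {
  for f in "$1"/*.gguf; do
    if [ -e "$f" ]; then
      echo "OK $f"
    else
      echo "MISSING $f"
    fi
  done
}
# usage: check_models ~/models
```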
Scripts
- ~/bin/llama-serve — starts llama-server with Vulkan GPU offloading
- ~/bin/aider-local — launches Aider connected to the local server
Adding New Models
# Download via Ollama
ollama pull <modelname>
# Find the blob hash
cat /usr/share/ollama/.ollama/models/manifests/registry.ollama.ai/library/<name>/<tag> \
| python3 -c "import sys,json; d=json.load(sys.stdin); \
[print(l['digest']) for l in d['layers'] if 'model' in l['mediaType']]"
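The python3 filter can be sanity-checked on a fake manifest first; the layer layout below is an assumption based on Ollama's OCI-style manifests:

```shell
# Feed the filter a minimal fake manifest: only the layer whose
# mediaType contains "model" should have its digest printed.
cat <<'EOF' | python3 -c "import sys,json; d=json.load(sys.stdin); \
[print(l['digest']) for l in d['layers'] if 'model' in l['mediaType']]"
{"layers": [
  {"mediaType": "application/vnd.ollama.image.model",  "digest": "sha256-abc123"},
  {"mediaType": "application/vnd.ollama.image.params", "digest": "sha256-def456"}
]}
EOF
# prints: sha256-abc123
```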
# Symlink it
ln -s /usr/share/ollama/.ollama/models/blobs/sha256-<hash> ~/models/<name>.gguf
# Add it to ~/bin/llama-serve in the MODELS array

Use with Aider Directly (without the wrapper script)
# Start server first
llama-serve coder
# In another terminal
aider \
--model openai/qwen2.5-coder-7b.gguf \
--openai-api-base http://localhost:8081/v1 \
--openai-api-key local \
--no-auto-commits \
--no-gitignore

How It Was Built
Dependencies installed
sudo apt install -y glslc glslang-tools
sudo apt install -y intel-oneapi-runtime-dpcpp-cpp intel-oneapi-runtime-mkl intel-oneapi-runtime-dnnl

llama.cpp build
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
To rebuild after updates:
cd ~/llama.cpp && git pull
cmake --build build --config Release -j$(nproc)

Part 1: Install Ollama (background model manager)
Ollama is a tool that makes downloading and running open LLMs as simple as ollama run modelname.
curl -fsSL https://ollama.com/install.sh | sh
This installs the ollama binary and sets up a systemd service that runs in the background.
Verify it’s running:
ollama --version
systemctl status ollama

Part 2: Choose and Pull a Model
Models are downloaded with ollama pull. With the systemd service install, they're stored under /usr/share/ollama/.ollama/models/ (or ~/.ollama/models/ when running ollama as your own user).
Recommended models for this machine (30GB RAM, no discrete GPU)
| Model | Size | Best for |
|---|---|---|
| qwen2.5-coder:7b | ~4.7 GB | Code generation, debugging |
| qwen2.5:7b | ~4.7 GB | General purpose, sysadmin |
| llama3.1:8b | ~4.9 GB | General purpose, good reasoning |
| deepseek-coder-v2:16b | ~9 GB | Stronger coding, slower |
| qwen2.5-coder:14b | ~9 GB | Best coding without discrete GPU |
Note: Models up to ~14B run at acceptable speed on your CPU. 32B+ models will be very slow (minutes per response).
Pull a model:
ollama pull qwen2.5-coder:7b
ollama pull llama3.1:8b
List downloaded models:
ollama list

Part 3: Run a Model Interactively
ollama run qwen2.5-coder:7b
You’ll get a >>> prompt. Type your question and press Enter. Type /bye to exit.
Example session:
>>> Write a bash script that backs up my home directory to /mnt/backup
... (model responds with script)
>>> Now add error handling if /mnt/backup doesn't exist
... (model continues)
/bye
Useful / commands inside the session:
- /help — list all commands
- /clear — clear conversation history
- /show info — show model details
- /bye — exit
Part 4: Use Ollama as an API
Ollama runs a local REST API at http://localhost:11434. This is how other tools (Aider, Continue.dev, etc.) connect to it.
Test it:
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5-coder:7b",
"prompt": "What is a symlink?",
"stream": false
}'
The API is OpenAI-compatible at /v1/, meaning any tool that supports OpenAI’s API can point at Ollama.
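For /v1/, requests use OpenAI's chat-completions shape instead. A sketch that validates the payload locally before sending (run the commented curl with the service up):

```shell
payload='{
  "model": "qwen2.5-coder:7b",
  "messages": [{"role": "user", "content": "What is a symlink?"}]
}'
# sanity-check the JSON before sending
echo "$payload" | python3 -c 'import sys,json; json.load(sys.stdin); print("payload ok")'
# then, with the Ollama service running:
# curl http://localhost:11434/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```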
Part 5: Aider — CLI Coding Assistant (Claude Code equivalent)
Aider is the closest equivalent to Claude Code for local models. It reads your code files, understands the codebase, and makes edits based on your instructions — all from the terminal.
Install Aider
pip install aider-chat
# or, if you prefer pipx to avoid polluting global Python:
pipx install aider-chat

Run Aider with Ollama
# In your project directory:
aider --model ollama/qwen2.5-coder:7b
# Or for a stronger model:
aider --model ollama/qwen2.5-coder:14b
Aider will read files you specify and let you give it instructions in plain English:
> Add input validation to the login function in auth.py
> Fix the bug on line 42 of server.py
> Refactor the database module to use connection pooling
It shows you a diff before applying changes and commits to git automatically (optional).
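Aider also reads options from a .aider.conf.yml file in the project root, so common flags don't need repeating on every run. A minimal sketch (keys mirror the CLI flag names):

```yaml
# .aider.conf.yml — one key per CLI flag, without the leading --
model: ollama/qwen2.5-coder:7b
auto-commits: false
dark-mode: true
```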
Useful Aider flags
aider --model ollama/qwen2.5-coder:7b \
  --no-auto-commits \
  --dark-mode \
  src/main.py src/utils.py
# --no-auto-commits: don't auto-commit changes
# --dark-mode: better colors for dark terminals
# trailing paths: open specific files
(Note: a comment after a line-continuation backslash breaks the command, so the flag notes go below.)

Part 6: Continue.dev — VS Code / IDE Integration
If you use VS Code or a JetBrains IDE, Continue.dev gives you an in-editor chat panel, inline completions, and codebase-aware Q&A — all powered by your local Ollama models.
Install
In VS Code: Extensions → search “Continue” → Install
Configure for Ollama
Open ~/.continue/config.json and add:
{
"models": [
{
"title": "Qwen2.5 Coder 7B (local)",
"provider": "ollama",
"model": "qwen2.5-coder:7b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Qwen2.5 Coder 7B",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
Restart VS Code. You’ll see the Continue panel on the left sidebar.
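JSON here is unforgiving about trailing commas, so it's worth validating after editing. A small helper (hypothetical, not part of Continue):

```shell
# Report whether a file parses as JSON
check_json() {
  python3 -m json.tool "$1" >/dev/null 2>&1 && echo "OK" || echo "invalid JSON"
}
# usage: check_json ~/.continue/config.json
```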
Part 7: Open WebUI (Optional — Browser Chat Interface)
If you want a ChatGPT-style web UI for Ollama:
docker run -d \
--network=host \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:8080 in your browser.
Part 8: Intel Arc GPU Acceleration (Advanced)
In principle, Vulkan-based GPU acceleration works with Intel Arc and can speed up inference significantly. In practice, the stock Ollama binary on this machine runs CPU-only (see Status at the top), which is why llama.cpp + Vulkan is the recommended path in this guide; the notes below are kept for reference.
Check if Vulkan is available:
vulkaninfo --summary 2>/dev/null | grep "GPU id"
If your Ollama build supports Vulkan, it should auto-detect and use the Arc GPU. You can verify:
ollama run qwen2.5-coder:7b
# In another terminal:
ollama ps
ollama ps shows whether a loaded model is running on GPU or CPU.
For better Intel Arc support, you can also look into IPEX-LLM (Intel’s optimized inference engine), though it requires more setup than Ollama. Note the Status at the top: on this machine, its SYCL path currently hits a compute-runtime version incompatibility.
Quick Reference
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull models
ollama pull qwen2.5-coder:7b
ollama pull llama3.1:8b
# Chat interactively
ollama run qwen2.5-coder:7b
# List models
ollama list
# Remove a model
ollama rm modelname
# Check what's running
ollama ps
# Stop the Ollama service
systemctl stop ollama
# Aider (coding assistant)
aider --model ollama/qwen2.5-coder:7b

Limitations vs Claude Code
| Feature | Claude Code (Anthropic) | Aider + Ollama (local) |
|---|---|---|
| Code quality | Excellent | Good (7B), Better (14B+) |
| Reasoning | Excellent | Moderate |
| Context window | 200K tokens | 8K–32K (model dependent) |
| Tool use / shell exec | Yes | Limited |
| Cost | Per token | Free (electricity) |
| Privacy | Sent to Anthropic | Stays on your machine |
| Speed | Fast | Moderate (CPU-bound) |
For sysadmin tasks, shell scripting, and targeted code edits, local 7B–14B models perform well. For architecture-level reasoning, debugging complex multi-file issues, or long context tasks, Claude Sonnet is noticeably better.