Ollama and My System — Getting Local AI Working on Intel Arc

Written for future me, students, and peers who are trying to do the same thing.

This is the full story of getting open-weight LLMs running locally on phoenix (my ASUS Zenbook Duo UX8406MA), including GPU acceleration via Intel Arc — what worked, what didn’t, and why. If you’re on similar hardware (Intel integrated graphics, no NVIDIA), this will save you several hours.

For a quick-start reference guide and model recommendations, see 2026-03-22.


The Goal

Run open-weight LLMs locally so I can use AI-assisted coding and sysadmin tools without sending data to any cloud service and without burning through API tokens. Specifically, I wanted something that works with Aider — the closest open equivalent to Claude Code.


My Hardware (phoenix)

Component   Detail
Machine     ASUS Zenbook Duo UX8406MA
CPU         Intel Core Ultra 9 185H (Meteor Lake, 6P+8E+2LP-E cores, 22 threads)
GPU         Intel Arc (integrated, Meteor Lake-P) — ~23GB shared VRAM
RAM         30 GB
OS          Ubuntu 24.04 LTS + KDE

The key thing to understand about this machine: there is no discrete GPU. The Intel Arc is an integrated GPU that shares system RAM. This matters a lot for how inference works — more on that below.


Step 1: Install Ollama

Ollama is the easiest way to download and manage open LLMs. One command installs it and sets up a background systemd service:

curl -fsSL https://ollama.com/install.sh | sh

Verify it installed and is running:

ollama --version
systemctl status ollama

Ollama runs a background server on port 11434. You never need to start it manually — it starts on boot. The service runs as a dedicated ollama system user.

Pulling Models

ollama pull qwen2.5-coder:7b    # best for coding tasks
ollama pull llama3.1:8b         # good general purpose
ollama pull qwen2.5:7b          # general purpose, great for sysadmin
ollama pull nemotron-3-nano:4b  # smallest/fastest option

How Ollama Stores Models

This tripped me up later, so worth knowing upfront: Ollama does not store models as .gguf files with human-readable names. It stores them as content-addressed blobs:

/usr/share/ollama/.ollama/models/
├── manifests/
│   └── registry.ollama.ai/library/
│       ├── qwen2.5-coder/7b    ← JSON manifest, references blobs
│       └── llama3.1/8b
└── blobs/
    └── sha256-60e05f210...     ← the actual model weights (4.7GB)

The manifest links a model name/tag to its blob hash. To use a model outside of Ollama, you need to dig out that hash (see Extracting Models from Ollama Blobs below).

Running Models Interactively

ollama run qwen2.5-coder:7b

This drops you into a >>> prompt. Type naturally. /bye to exit, /clear to reset context.

Ollama’s API

Ollama also exposes a local API at http://localhost:11434, with an OpenAI-compatible endpoint under /v1/. This is what tools like Aider and Continue.dev use:

# Generate (streaming by default, add "stream":false for a complete response)
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b",
  "prompt": "What is a symlink?",
  "stream": false
}'

Note on streaming: Without "stream": false, the response is a stream of JSON chunks. The terminal output looks garbled. Always add "stream": false when testing manually.
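To see why the raw output looks garbled, it helps to know the shape of the stream: Ollama sends newline-delimited JSON objects, each carrying a fragment of the reply in its "response" field, with "done": true on the final chunk. A minimal Python sketch of reassembling such a stream (the sample chunks below are illustrative, not captured output):

```python
import json

def assemble_stream(ndjson_text: str) -> str:
    """Reassemble a streamed /api/generate reply from its NDJSON chunks."""
    parts = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):          # final chunk signals end of stream
            break
    return "".join(parts)

# Illustrative chunks in the shape Ollama streams (not captured output):
sample = (
    '{"model":"qwen2.5-coder:7b","response":"A symlink ","done":false}\n'
    '{"model":"qwen2.5-coder:7b","response":"is a pointer.","done":true}\n'
)
print(assemble_stream(sample))  # A symlink is a pointer.
```

This is essentially what the Ollama CLI does for you; with "stream": false the server does the joining server-side instead.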


Step 2: The GPU Problem

After getting Ollama running, I noticed everything was slow. Running ollama ps confirmed it:

NAME                ID    SIZE   PROCESSOR    UNTIL
qwen2.5-coder:7b    ...   4.6GB  100% CPU     ...

Ollama was not using the GPU at all. I investigated why.

The Ollama binary ships with only a CUDA backend (NVIDIA GPUs). On Linux, the pre-built binary contains no Vulkan, no OpenCL, and no Intel support. Confirmed by listing the backend libraries it installs:

ls /usr/local/lib/ollama/
# Output: cuda_v12

That’s it. One backend. For NVIDIA only.

Lesson learned: Ollama’s official pre-built binary is effectively CPU-only on any non-NVIDIA system. The docs hint at GPU support but the binary doesn’t include it. This is not obvious from the website.

The practical consequence: on my machine with Intel Arc, Ollama will always run at CPU speeds regardless of settings.

What CPU Speed Actually Looks Like

For reference, CPU-only inference on the Core Ultra 9 185H (which has 6 Performance cores that are actually useful for matrix math):

  • ~2 t/s generation for a 7B model
  • Prompt processing at ~10 t/s
  • A typical multi-paragraph response takes 1–3 minutes

Usable for occasional queries. Not usable as a real-time coding assistant.


Step 3: Finding a GPU Path

I explored two options before finding what works.

What I Tried First: IPEX-LLM

Intel maintains IPEX-LLM, an optimized inference library that uses Intel’s SYCL/oneAPI stack to run on Intel GPUs. It ships pre-built llama.cpp binaries compiled for Intel hardware.

I installed it:

# Install Intel oneAPI runtime (SYCL + MKL + oneDNN)
wget -qO- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
  | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] \
  https://apt.repos.intel.com/oneapi all main" \
  | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install -y intel-oneapi-runtime-dpcpp-cpp intel-oneapi-runtime-mkl intel-oneapi-runtime-dnnl
 
# Install IPEX-LLM in a venv
python3 -m venv ~/ipex-llm-env
source ~/ipex-llm-env/bin/activate
pip install --pre --upgrade 'ipex-llm[cpp]'

It installed successfully, including a llama-ls-sycl-device binary for listing SYCL-accessible GPUs. But when I ran it, it crashed immediately:

terminate called after throwing an instance of 'sycl::_V1::exception'
  what(): No device of requested type available.

The crash was in dpct::dev_mgr::dev_mgr() inside libggml-sycl.so — the DPCT device manager failed to initialize before it even tried to list devices.

Lesson learned: IPEX-LLM’s pre-built binaries (November 2025 build) are incompatible with newer Intel GPU compute runtimes (v26.x, released 2026). The SYCL binary was compiled against an older driver ABI. This is a version pinning problem that Intel hasn’t caught up with yet. Do not try to downgrade the compute runtime — it’s the same driver your display uses. The fix is to wait for IPEX-LLM to publish a compatible build, or use a different path entirely.

What Actually Works: llama.cpp + Vulkan

llama.cpp is the underlying inference engine that Ollama wraps. When built from source with -DGGML_VULKAN=ON, it uses the Vulkan graphics/compute API for GPU acceleration — and Vulkan works on Intel Arc.

The key insight: build from source so the binary matches the exact Vulkan driver already installed on the system. No version mismatch possible.


Step 4: Building llama.cpp with Vulkan

Prerequisites

# Vulkan development headers (may already be installed)
sudo apt install -y libvulkan-dev
 
# GLSL shader compiler (required by llama.cpp's Vulkan backend)
sudo apt install -y glslc glslang-tools
 
# Standard build tools
sudo apt install -y cmake build-essential git

Clone and Build

git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

The -j$(nproc) uses all CPU cores. On the 185H this takes about 3–4 minutes.

Verify the GPU is detected:

~/llama.cpp/build/bin/llama-cli --list-devices
# Output:
# Available devices:
#   Vulkan0: Intel(R) Arc(tm) Graphics (MTL) (23585 MiB, 10557 MiB free)

That 23585 MiB is the shared system RAM the Arc GPU can address. The 10557 MiB "free" figure is what was actually available at that moment; the OS and other running processes were using the rest of the 30GB, leaving roughly 10.5GB for the model.
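Whether a model fits in that free memory is simple arithmetic: quantized weights plus the KV cache. A back-of-envelope Python sketch using standard f16 KV-cache accounting and hypothetical 7B-class shape parameters (28 layers, 4 KV heads via GQA, head dim 128 are assumptions; check your model's actual config before trusting the numbers):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem=2):
    # f16 K and V caches: 2 tensors per layer, each ctx * n_kv_heads * head_dim.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

def fits(model_file_bytes, free_mib, ctx, **shape):
    # Rough fit test: weights + KV cache vs. the reported free memory.
    return model_file_bytes + kv_cache_bytes(ctx=ctx, **shape) <= free_mib * 1024**2

model = int(4.7 * 1024**3)  # ~4.7GB of quantized weights on disk
print(kv_cache_bytes(28, 4, 128, 4096) / 1024**2)  # 224.0 (MiB at 4K context)
print(fits(model, 10557, 4096, n_layers=28, n_kv_heads=4, head_dim=128))  # True
```

The takeaway: at a 4K context the KV cache is small next to the weights, so the ~4.7GB model sits comfortably inside 10.5GB. Long contexts change that math quickly.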

Keeping llama.cpp Updated

cd ~/llama.cpp && git pull
cmake --build build --config Release -j$(nproc)

Step 5: Extracting Models from Ollama Blobs

Since Ollama stores models as hashed blobs rather than .gguf files, I needed to extract them for use with llama.cpp directly. The cleanest approach is symlinking — no disk space wasted.

# Find the blob hash for a model from its manifest
sudo cat /usr/share/ollama/.ollama/models/manifests/registry.ollama.ai/library/qwen2.5-coder/7b \
  | python3 -c "
import sys, json
data = json.load(sys.stdin)
for layer in data['layers']:
    if 'model' in layer['mediaType']:
        print(layer['digest'])
"
# Output: sha256:60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463
 
# Symlink it as a named .gguf file
mkdir -p ~/models
ln -sf /usr/share/ollama/.ollama/models/blobs/sha256-60e05f210... \
       ~/models/qwen2.5-coder-7b.gguf

I did this for all four installed models:

BLOBS=/usr/share/ollama/.ollama/models/blobs
ln -sf $BLOBS/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463 ~/models/qwen2.5-coder-7b.gguf
ln -sf $BLOBS/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 ~/models/llama3.1-8b.gguf
ln -sf $BLOBS/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730 ~/models/qwen2.5-7b.gguf
ln -sf $BLOBS/sha256-527db2cf6c705d8fabb95693d038d9c06b4a2b0b8b0a4bbdbd01212d37242970 ~/models/nemotron-3-nano-4b.gguf

Note: If you run ollama rm on a model, the blob it pointed to will be deleted. The symlink will break. Re-pull with ollama pull if that happens, and re-symlink with the new hash from the manifest.
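The per-model symlink commands above can be automated. A Python sketch that walks the manifests directory and symlinks every installed model (assumes the default store path and read access to it; on a stock install you may need sudo or group-permission tweaks):

```python
import json
from pathlib import Path

STORE = Path("/usr/share/ollama/.ollama/models")  # default Ollama store

def model_digest(manifest_path):
    """Return the blob filename ('sha256-...') of the model-weights layer."""
    data = json.loads(Path(manifest_path).read_text())
    for layer in data["layers"]:
        if "model" in layer["mediaType"]:
            return layer["digest"].replace("sha256:", "sha256-")
    raise ValueError(f"no model layer in {manifest_path}")

def link_all(store=STORE, dest=Path.home() / "models"):
    """Symlink every installed model's blob as <name>-<tag>.gguf in dest."""
    dest.mkdir(parents=True, exist_ok=True)
    library = store / "manifests" / "registry.ollama.ai" / "library"
    for manifest in sorted(library.glob("*/*")):   # library/<name>/<tag>
        name, tag = manifest.parent.name, manifest.name
        blob = store / "blobs" / model_digest(manifest)
        link = dest / f"{name}-{tag}.gguf"
        if link.is_symlink() or link.exists():
            link.unlink()                          # refresh stale links
        link.symlink_to(blob)
        print(f"{link.name} -> {blob}")
```

Re-running it after an ollama pull repairs any symlinks broken by a changed blob hash.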


Step 6: Running the GPU Server

llama.cpp includes llama-server, an OpenAI-compatible HTTP server. This is the bridge between the GPU inference engine and tools like Aider.

The key flag is --n-gpu-layers 99 (or -ngl 99 for short), which offloads all model layers to the GPU.

~/llama.cpp/build/bin/llama-server \
  --model ~/models/qwen2.5-coder-7b.gguf \
  --n-gpu-layers 99 \
  --ctx-size 4096 \
  --host 127.0.0.1 \
  --port 8081 \
  --api-key local

The server logs will show:

llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Arc(tm) Graphics (MTL))
load_tensors: offloading output layer to GPU
load_tensors: offloading 27 repeating layers to GPU

All 27 transformer layers offloaded. The API is live at http://localhost:8081/v1.

Why Port 8081?

Port 8080 is taken by Open WebUI on this machine. Worth checking ss -tlnp | grep 8080 before using any port.

Measured Performance

Mode                           Prompt processing   Token generation
Ollama CPU-only                ~10 t/s             ~2 t/s
llama.cpp + Vulkan (Arc GPU)   28.8 t/s            5.6 t/s

~3× faster generation than CPU-only. For a coding assistant, this is the difference between a response taking 90 seconds vs 30 seconds. Still not as fast as cloud APIs, but genuinely usable for interactive work.
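The wall-clock difference follows directly from those rates: total response time is prompt tokens divided by the prompt-processing rate, plus output tokens divided by the generation rate. A quick Python estimate using the measured rates and illustrative prompt/reply sizes:

```python
def response_seconds(prompt_tokens, output_tokens, pp_tps, gen_tps):
    # Total wall time = prompt processing time + token generation time.
    return prompt_tokens / pp_tps + output_tokens / gen_tps

# Measured rates from the table; 500-token prompt / 180-token reply assumed:
cpu = response_seconds(500, 180, 10.0, 2.0)
gpu = response_seconds(500, 180, 28.8, 5.6)
print(f"{cpu:.0f}s vs {gpu:.0f}s")  # 140s vs 50s
```

Note how prompt processing dominates on CPU for long prompts, which is exactly what hurts coding assistants that feed in whole files as context.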


Step 7: The Wrapper Scripts

I wrote two scripts in ~/bin/ to make daily use frictionless.

~/bin/llama-serve

Starts llama-server with the right flags for any of my installed models:

llama-serve           # qwen2.5-coder:7b (default)
llama-serve llama     # llama3.1:8b
llama-serve qwen      # qwen2.5:7b
llama-serve nano      # nemotron-3-nano:4b (fastest, for quick tasks)
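The script itself isn't reproduced here, but a minimal bash sketch of how such a wrapper can work, built around the MODELS associative array the future-model checklist refers to (paths, ports, and flags assumed to match Step 6; the real script may differ):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of ~/bin/llama-serve; not the actual script.
# Maps a short alias to a .gguf under ~/models and launches llama-server.

declare -A MODELS=(
  [coder]="$HOME/models/qwen2.5-coder-7b.gguf"
  [llama]="$HOME/models/llama3.1-8b.gguf"
  [qwen]="$HOME/models/qwen2.5-7b.gguf"
  [nano]="$HOME/models/nemotron-3-nano-4b.gguf"
)

model_path() {
  # Resolve an alias to its model path; empty output for unknown aliases.
  printf '%s' "${MODELS[${1:-coder}]:-}"
}

serve() {
  local path
  path="$(model_path "$1")"
  [[ -n "$path" ]] || { echo "unknown model: $1" >&2; return 1; }
  exec "$HOME/llama.cpp/build/bin/llama-server" \
    --model "$path" --n-gpu-layers 99 --ctx-size 4096 \
    --host 127.0.0.1 --port 8081 --api-key local
}

# Launch only when run directly and the server binary exists.
if [[ "${BASH_SOURCE[0]}" == "$0" && -x "$HOME/llama.cpp/build/bin/llama-server" ]]; then
  serve "${1:-coder}"
fi
```

Adding a model is then one line in the array, which is why the checklist in the summary ends with exactly that step.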

~/bin/aider-local

Launches Aider connected to the local server, auto-starting it if needed:

aider-local           # uses coder model
aider-local llama     # uses llama model
aider-local coder src/main.py  # open specific files

Both scripts are in ~/bin/ which is on $PATH.


Step 8: Aider Integration

Aider is a CLI coding assistant — the open-source equivalent of Claude Code. It reads your project files, understands git history, and edits code based on natural language instructions.

Installation

pipx install aider-chat

Using pipx keeps it isolated from system Python. The aider binary lands in ~/.local/bin/.

Connecting to the Local Server

Aider supports any OpenAI-compatible API endpoint. Point it at llama-server:

aider \
  --model openai/qwen2.5-coder-7b.gguf \
  --openai-api-base http://localhost:8081/v1 \
  --openai-api-key local \
  --no-auto-commits \
  --no-gitignore

The model name (openai/qwen2.5-coder-7b.gguf) uses the openai/ prefix to tell Aider to use the OpenAI-compatible API path. The model name after the slash must match what llama-server reports at /v1/models.

Typical Aider Workflow

# Terminal 1: start the GPU server
llama-serve
 
# Terminal 2: launch Aider in your project
cd ~/myproject
aider-local src/api.py src/models.py
 
# Inside Aider:
> add input validation to the create_user function
> write a test for the login endpoint
> fix the TypeError on line 47 of models.py

Aider shows a diff before applying any changes. Use git to review or revert.


Summary: What’s Installed and Where

Thing                     Location
Ollama binary             /usr/local/bin/ollama
Ollama models (blobs)     /usr/share/ollama/.ollama/models/
llama.cpp source + build  ~/llama.cpp/
llama-server binary       ~/llama.cpp/build/bin/llama-server
Model symlinks            ~/models/
IPEX-LLM venv (unused)    ~/ipex-llm-env/
Intel oneAPI runtime      /opt/intel/oneapi/redist/
Wrapper scripts           ~/bin/llama-serve, ~/bin/aider-local
Aider                     ~/.local/bin/aider (via pipx)

Adding a New Model in the Future

# 1. Download via Ollama
ollama pull <modelname>:<tag>
 
# 2. Get the blob hash from the manifest
sudo cat /usr/share/ollama/.ollama/models/manifests/registry.ollama.ai/library/<name>/<tag> \
  | python3 -c "import sys,json; d=json.load(sys.stdin); \
    [print(l['digest'].replace('sha256:','sha256-')) for l in d['layers'] \
    if 'model' in l['mediaType']]"
 
# 3. Symlink it
ln -sf /usr/share/ollama/.ollama/models/blobs/sha256-<hash> ~/models/<name>.gguf
 
# 4. Add an entry to ~/bin/llama-serve (the MODELS associative array near the top)

Lessons for Students and Peers

1. Ollama is not GPU-accelerated on Intel. The pre-built binary ships CUDA-only. If you’re on Intel Arc, AMD integrated, or any non-NVIDIA GPU, you’re running CPU inference whether you know it or not. Check ollama ps to confirm.

2. Pre-built AI binaries have tight version dependencies. IPEX-LLM’s SYCL binaries crashed because they were compiled against an older Intel driver ABI. When pre-built binaries fail with cryptic GPU errors, building from source (as with llama.cpp) is often the most reliable fix — you get a binary that matches exactly what’s on your system.

3. Vulkan is the practical cross-vendor GPU compute layer on Linux. CUDA is NVIDIA-only. OpenCL works but the ecosystem is fragmented. Vulkan has broad driver support, including Intel Arc, and llama.cpp’s Vulkan backend is actively maintained.

4. Integrated GPU ≠ bad for inference. The Intel Arc on Meteor Lake has ~23GB of addressable memory (shared system RAM) and enough compute to run 7B models 3× faster than CPU-only. It won’t beat a discrete RTX 4090, but it’s a real speedup for interactive use.

5. Model blobs are reusable across tools. Ollama, llama.cpp, llama-server, and others all speak the same GGUF format. You don’t need separate downloads for each tool — symlink Ollama’s blobs and use them everywhere.
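One cheap sanity check before pointing another tool at a symlinked blob: per the GGUF spec, the file starts with the ASCII magic bytes GGUF, followed by a little-endian uint32 format version. A small Python sketch:

```python
import struct

def is_gguf(path):
    """GGUF files start with the 4 ASCII magic bytes b'GGUF'."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

def gguf_version(path):
    """The magic is followed by a little-endian uint32 format version."""
    with open(path, "rb") as f:
        f.read(4)  # skip magic
        return struct.unpack("<I", f.read(4))[0]
```

Running is_gguf on a healthy symlink in ~/models should return True; if it doesn't, the symlink is probably pointing at a deleted or non-model blob.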


Last updated: April 2026 — phoenix (ASUS Zenbook Duo UX8406MA, Ubuntu 24.04)