Ollama and My System — Getting Local AI Working on Intel Arc
Written for future me, students, and peers who are trying to do the same thing.
This is the full story of getting open-weight LLMs running locally on phoenix (my ASUS Zenbook Duo UX8406MA), including GPU acceleration via Intel Arc — what worked, what didn’t, and why. If you’re on similar hardware (Intel integrated graphics, no NVIDIA), this will save you several hours.
For a quick-start reference guide and model recommendations, see 2026-03-22.
The Goal
Run open-weight LLMs locally so I can use AI-assisted coding and sysadmin tools without sending data to any cloud service and without burning through API tokens. Specifically, I wanted something that works with Aider — the closest open equivalent to Claude Code.
My Hardware (phoenix)
| Component | Detail |
|---|---|
| Machine | ASUS Zenbook Duo UX8406MA |
| CPU | Intel Core Ultra 9 185H (Meteor Lake, 6P+8E+2LP-E cores, 22 threads) |
| GPU | Intel Arc (integrated, Meteor Lake-P) — ~23GB shared VRAM |
| RAM | 30 GB |
| OS | Ubuntu 24.04 LTS + KDE |
The key thing to understand about this machine: there is no discrete GPU. The Intel Arc is an integrated GPU that shares system RAM. This matters a lot for how inference works — more on that below.
Step 1: Install Ollama
Ollama is the easiest way to download and manage open LLMs. One command installs it and sets up a background systemd service:
curl -fsSL https://ollama.com/install.sh | sh
Verify it installed and is running:
ollama --version
systemctl status ollama
Ollama runs a background server on port 11434. You never need to start it manually; it starts on boot. The service runs as a dedicated ollama system user.
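A quick sanity check that both the CLI and the API are responding (11434 is Ollama's default port):
ollama list                                  # models downloaded so far
curl -s http://localhost:11434/api/tags      # the same list, via the API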
Pulling Models
ollama pull qwen2.5-coder:7b # best for coding tasks
ollama pull llama3.1:8b # good general purpose
ollama pull qwen2.5:7b # general purpose, great for sysadmin
ollama pull nemotron-3-nano:4b   # smallest/fastest option
How Ollama Stores Models
This tripped me up later, so worth knowing upfront: Ollama does not store models as .gguf files with human-readable names. It stores them as content-addressed blobs:
/usr/share/ollama/.ollama/models/
├── manifests/
│ └── registry.ollama.ai/library/
│ ├── qwen2.5-coder/7b ← JSON manifest, references blobs
│ └── llama3.1/8b
└── blobs/
└── sha256-60e05f210... ← the actual model weights (4.7GB)
The manifest links a model name/tag to its blob hash. To use a model outside of Ollama, you need to dig out that hash (see Extracting Models from Ollama Blobs below).
Running Models Interactively
ollama run qwen2.5-coder:7b
This drops you into a >>> prompt. Type naturally. /bye to exit, /clear to reset context.
Ollama’s API
Ollama also exposes a local API at http://localhost:11434 which is OpenAI-compatible at /v1/. This is what tools like Aider and Continue.dev use:
# Generate (streaming by default, add "stream":false for a complete response)
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5-coder:7b",
"prompt": "What is a symlink?",
"stream": false
}'
Note on streaming: Without "stream": false, the response is a stream of JSON chunks and the terminal output looks garbled. Always add "stream": false when testing manually.
Step 2: The GPU Problem
After getting Ollama running, I noticed everything was slow. Running ollama ps confirmed it:
NAME ID SIZE PROCESSOR UNTIL
qwen2.5-coder:7b ... 4.6GB 100% CPU ...
Ollama was not using the GPU at all. I investigated why.
The Ollama binary ships with only a CUDA backend (NVIDIA GPUs). On Linux, the pre-built binary contains no Vulkan, no OpenCL, and no Intel support. Confirmed by inspecting the binary:
ls /usr/local/lib/ollama/
# Output: cuda_v12
That's it. One backend. For NVIDIA only.
Lesson learned: Ollama’s official pre-built binary is effectively CPU-only on any non-NVIDIA system. The docs hint at GPU support but the binary doesn’t include it. This is not obvious from the website.
The practical consequence: on my machine with Intel Arc, Ollama will always run at CPU speeds regardless of settings.
What CPU Speed Actually Looks Like
For reference, CPU-only inference on the Core Ultra 9 185H (which has 6 Performance cores that are actually useful for matrix math):
- ~2 t/s generation for a 7B model
- Prompt processing at ~10 t/s
- A typical multi-paragraph response takes 1–3 minutes
Usable for occasional queries. Not usable as a real-time coding assistant.
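An easy way to check the same thing on your own machine: ollama run has a --verbose flag that prints timing statistics after each response.
ollama run qwen2.5-coder:7b --verbose
# stats print after each reply; look for the prompt eval rate and eval rate lines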
Step 3: Finding a GPU Path
I explored two options before finding what works.
What I Tried First: IPEX-LLM
Intel maintains IPEX-LLM, an optimized inference library that uses Intel’s SYCL/oneAPI stack to run on Intel GPUs. It ships pre-built llama.cpp binaries compiled for Intel hardware.
I installed it:
# Install Intel oneAPI runtime (SYCL + MKL + oneDNN)
wget -qO- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] \
https://apt.repos.intel.com/oneapi all main" \
| sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install -y intel-oneapi-runtime-dpcpp-cpp intel-oneapi-runtime-mkl intel-oneapi-runtime-dnnl
# Install IPEX-LLM in a venv
python3 -m venv ~/ipex-llm-env
source ~/ipex-llm-env/bin/activate
pip install --pre --upgrade 'ipex-llm[cpp]'
It installed successfully, including a llama-ls-sycl-device binary for listing SYCL-accessible GPUs. But when I ran it, it crashed immediately:
terminate called after throwing an instance of 'sycl::_V1::exception'
what(): No device of requested type available.
The crash was in dpct::dev_mgr::dev_mgr() inside libggml-sycl.so — the DPCT device manager failed to initialize before it even tried to list devices.
Lesson learned: IPEX-LLM’s pre-built binaries (November 2025 build) are incompatible with newer Intel GPU compute runtimes (v26.x, released 2026). The SYCL binary was compiled against an older driver ABI. This is a version pinning problem that Intel hasn’t caught up with yet. Do not try to downgrade the compute runtime — it’s the same driver your display uses. The fix is to wait for IPEX-LLM to publish a compatible build, or use a different path entirely.
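If you hit the same crash, confirm which compute runtime version is actually installed before debugging anything else. A rough sketch; package names differ a bit between Ubuntu releases:
# Intel GPU compute runtime packages (OpenCL ICD and Level Zero loader/driver)
dpkg -l | grep -Ei 'intel-opencl-icd|level-zero|libze'
# clinfo (from the clinfo package) reports the driver version the OpenCL/SYCL stack actually sees
clinfo | grep -i 'driver version'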
What Actually Works: llama.cpp + Vulkan
llama.cpp is the underlying inference engine that Ollama wraps. When built from source with -DGGML_VULKAN=ON, it uses the Vulkan graphics/compute API for GPU acceleration — and Vulkan works on Intel Arc.
The key insight: build from source so the binary matches the exact Vulkan driver already installed on the system. No version mismatch possible.
Step 4: Building llama.cpp with Vulkan
Prerequisites
# Vulkan development headers (may already be installed)
sudo apt install -y libvulkan-dev
# GLSL shader compiler (required by llama.cpp's Vulkan backend)
sudo apt install -y glslc glslang-tools
# Standard build tools
sudo apt install -y cmake build-essential git
Clone and Build
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
The -j$(nproc) uses all CPU cores. On the 185H this takes about 3–4 minutes.
Verify the GPU is detected:
~/llama.cpp/build/bin/llama-cli --list-devices
# Output:
# Available devices:
# Vulkan0: Intel(R) Arc(tm) Graphics (MTL) (23585 MiB, 10557 MiB free)
That 23585 MiB is shared system RAM that the Arc GPU can address. With 30GB total RAM and the OS using ~11GB, there's about 10.5GB free for the model.
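If --list-devices ever comes back empty, check whether the Vulkan driver itself sees the GPU. vulkaninfo lives in the vulkan-tools package (not needed for the build, just for diagnostics):
sudo apt install -y vulkan-tools
vulkaninfo --summary | grep -iE 'deviceName|driverName'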
Keeping llama.cpp Updated
cd ~/llama.cpp && git pull
cmake --build build --config Release -j$(nproc)
Step 5: Extracting Models from Ollama Blobs
Since Ollama stores models as hashed blobs rather than .gguf files, I needed to extract them for use with llama.cpp directly. The cleanest approach is symlinking — no disk space wasted.
# Find the blob hash for a model from its manifest
sudo cat /usr/share/ollama/.ollama/models/manifests/registry.ollama.ai/library/qwen2.5-coder/7b \
| python3 -c "
import sys, json
data = json.load(sys.stdin)
for layer in data['layers']:
if 'model' in layer['mediaType']:
print(layer['digest'])
"
# Output: sha256:60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463
# Symlink it as a named .gguf file
mkdir -p ~/models
ln -sf /usr/share/ollama/.ollama/models/blobs/sha256-60e05f210... \
~/models/qwen2.5-coder-7b.gguf
I did this for all four installed models:
BLOBS=/usr/share/ollama/.ollama/models/blobs
ln -sf $BLOBS/sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463 ~/models/qwen2.5-coder-7b.gguf
ln -sf $BLOBS/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 ~/models/llama3.1-8b.gguf
ln -sf $BLOBS/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730 ~/models/qwen2.5-7b.gguf
ln -sf $BLOBS/sha256-527db2cf6c705d8fabb95693d038d9c06b4a2b0b8b0a4bbdbd01212d37242970 ~/models/nemotron-3-nano-4b.gguf
Note: If you run ollama rm on a model, the blob it pointed to will be deleted and the symlink will break. Re-pull with ollama pull if that happens, and re-symlink with the new hash from the manifest.
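A quick way to spot symlinks that have gone stale after an ollama rm:
# -xtype l matches symlinks whose target no longer exists
find ~/models -xtype l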
Step 6: Running the GPU Server
llama.cpp includes llama-server, an OpenAI-compatible HTTP server. This is the bridge between the GPU inference engine and tools like Aider.
The key flag is -ngl 99 (n-gpu-layers = 99), which offloads all model layers to the GPU.
~/llama.cpp/build/bin/llama-server \
--model ~/models/qwen2.5-coder-7b.gguf \
--n-gpu-layers 99 \
--ctx-size 4096 \
--host 127.0.0.1 \
--port 8081 \
--api-key local
The server logs will show:
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Arc(tm) Graphics (MTL))
load_tensors: offloading output layer to GPU
load_tensors: offloading 27 repeating layers to GPU
All 27 transformer layers offloaded. The API is live at http://localhost:8081/v1.
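A quick smoke test from the command line. The same request shape also works against Ollama's own /v1/ endpoint on port 11434; the Bearer token must match the --api-key value, and the model field should match what the server reports at /v1/models:
curl http://localhost:8081/v1/chat/completions \
  -H "Authorization: Bearer local" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-7b.gguf",
    "messages": [{"role": "user", "content": "Reply with one word: ready"}]
  }'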
Why Port 8081?
Port 8080 is taken by Open WebUI on this machine. Worth running ss -tlnp | grep <port> to check whether a port is free before binding anything to it.
Measured Performance
| Mode | Prompt processing | Token generation |
|---|---|---|
| Ollama CPU-only | ~10 t/s | ~2 t/s |
| llama.cpp + Vulkan (Arc GPU) | 28.8 t/s | 5.6 t/s |
~3× faster generation than CPU-only. For a coding assistant, this is the difference between a response taking 90 seconds vs 30 seconds. Still not as fast as cloud APIs, but genuinely usable for interactive work.
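If you want to re-measure rather than eyeball server logs, llama-bench (built alongside llama-server) runs standard prompt-processing and generation benchmarks:
~/llama.cpp/build/bin/llama-bench -m ~/models/qwen2.5-coder-7b.gguf -ngl 99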
Step 7: The Wrapper Scripts
I wrote two scripts in ~/bin/ to make daily use frictionless.
~/bin/llama-serve
Starts llama-server with the right flags for any of my installed models:
llama-serve # qwen2.5-coder:7b (default)
llama-serve llama # llama3.1:8b
llama-serve qwen # qwen2.5:7b
llama-serve nano # nemotron-3-nano:4b (fastest, for quick tasks)
~/bin/aider-local
Launches Aider connected to the local server, auto-starting it if needed:
aider-local # uses coder model
aider-local llama # uses llama model
aider-local coder src/main.py # open specific files
Both scripts are in ~/bin/, which is on $PATH.
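For reference, a minimal sketch of what llama-serve can look like. The real script on phoenix may differ in detail, but the shape is an associative array of model paths followed by an exec of llama-server with the flags shown earlier:
#!/usr/bin/env bash
# ~/bin/llama-serve (sketch): pick a model by short name, then exec llama-server
set -euo pipefail

declare -A MODELS=(
  [coder]="$HOME/models/qwen2.5-coder-7b.gguf"
  [llama]="$HOME/models/llama3.1-8b.gguf"
  [qwen]="$HOME/models/qwen2.5-7b.gguf"
  [nano]="$HOME/models/nemotron-3-nano-4b.gguf"
)

KEY="${1:-coder}"
if [[ -z "${MODELS[$KEY]+x}" ]]; then
  echo "unknown model: $KEY (options: ${!MODELS[*]})" >&2
  exit 1
fi

exec "$HOME/llama.cpp/build/bin/llama-server" \
  --model "${MODELS[$KEY]}" \
  --n-gpu-layers 99 \
  --ctx-size 4096 \
  --host 127.0.0.1 \
  --port 8081 \
  --api-key local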
Step 8: Aider Integration
Aider is a CLI coding assistant — the open-source equivalent of Claude Code. It reads your project files, understands git history, and edits code based on natural language instructions.
Installation
pipx install aider-chat
Using pipx keeps it isolated from system Python. The aider binary lands in ~/.local/bin/.
Connecting to the Local Server
Aider supports any OpenAI-compatible API endpoint. Point it at llama-server:
aider \
--model openai/qwen2.5-coder-7b.gguf \
--openai-api-base http://localhost:8081/v1 \
--openai-api-key local \
--no-auto-commits \
--no-gitignore
The model name (openai/qwen2.5-coder-7b.gguf) uses the openai/ prefix to tell Aider to use the OpenAI-compatible API path. The model name after the slash must match what llama-server reports at /v1/models.
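To see exactly what name the server reports (and therefore what goes after the openai/ prefix):
curl -s http://localhost:8081/v1/models -H "Authorization: Bearer local" | python3 -m json.tool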
Typical Aider Workflow
# Terminal 1: start the GPU server
llama-serve
# Terminal 2: launch Aider in your project
cd ~/myproject
aider-local src/api.py src/models.py
# Inside Aider:
> add input validation to the create_user function
> write a test for the login endpoint
> fix the TypeError on line 47 of models.py
Aider shows a diff before applying any changes. Use git to review or revert.
Summary: What’s Installed and Where
| Thing | Location |
|---|---|
| Ollama binary | /usr/local/bin/ollama |
| Ollama models (blobs) | /usr/share/ollama/.ollama/models/ |
| llama.cpp source + build | ~/llama.cpp/ |
| llama-server binary | ~/llama.cpp/build/bin/llama-server |
| Model symlinks | ~/models/ |
| IPEX-LLM venv (unused) | ~/ipex-llm-env/ |
| Intel oneAPI runtime | /opt/intel/oneapi/redist/ |
| Wrapper scripts | ~/bin/llama-serve, ~/bin/aider-local |
| Aider | ~/.local/bin/aider (via pipx) |
Adding a New Model in the Future
# 1. Download via Ollama
ollama pull <modelname>:<tag>
# 2. Get the blob hash from the manifest
sudo cat /usr/share/ollama/.ollama/models/manifests/registry.ollama.ai/library/<name>/<tag> \
| python3 -c "import sys,json; d=json.load(sys.stdin); \
[print(l['digest'].replace('sha256:','sha256-')) for l in d['layers'] \
if 'model' in l['mediaType']]"
# 3. Symlink it
ln -sf /usr/share/ollama/.ollama/models/blobs/sha256-<hash> ~/models/<name>.gguf
# 4. Add an entry to ~/bin/llama-serve (the MODELS associative array near the top)
Lessons for Students and Peers
1. Ollama is not GPU-accelerated on Intel. The pre-built binary ships CUDA-only. If you’re on Intel Arc, AMD integrated, or any non-NVIDIA GPU, you’re running CPU inference whether you know it or not. Check ollama ps to confirm.
2. Pre-built AI binaries have tight version dependencies. IPEX-LLM’s SYCL binaries crashed because they were compiled against an older Intel driver ABI. When pre-built binaries fail with cryptic GPU errors, building from source (as with llama.cpp) is often the most reliable fix — you get a binary that matches exactly what’s on your system.
3. Vulkan is the practical cross-vendor GPU compute layer on Linux. CUDA is NVIDIA-only. OpenCL works but the ecosystem is fragmented. Vulkan has broad driver support, including Intel Arc, and llama.cpp’s Vulkan backend is actively maintained.
4. Integrated GPU ≠ bad for inference. The Intel Arc on Meteor Lake has ~23GB of addressable memory (shared system RAM) and enough compute to run 7B models 3× faster than CPU-only. It won’t beat a discrete RTX 4090, but it’s a real speedup for interactive use.
5. Model blobs are reusable across tools. Ollama, llama.cpp, llama-server, and others all speak the same GGUF format. You don’t need separate downloads for each tool — symlink Ollama’s blobs and use them everywhere.
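For example, the same symlink that feeds llama-server also works for a one-off llama-cli run (the prompt is just an illustration):
~/llama.cpp/build/bin/llama-cli \
  --model ~/models/qwen2.5-coder-7b.gguf \
  --n-gpu-layers 99 \
  -p "Explain what a GGUF file is in one sentence."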
Last updated: April 2026 — phoenix (ASUS Zenbook Duo UX8406MA, Ubuntu 24.04)