5 Starter Projects for Raspberry Pi 5 + AI HAT+ 2 (Code, Models, and Templates)

thecoding
2026-01-22 12:00:00
10 min read

Five practical Pi 5 + AI HAT+ 2 starter templates (chatbots, voice assistants, image captioning, IoT monitors, and offline search) to help you ship prototypes fast.

Ship working AI prototypes on a Raspberry Pi 5 + AI HAT+ 2 — fast

Struggling to move from idea to prototype? You’re not alone. Developers and IT teams want small, reproducible projects they can demo, iterate on, and port to production. In 2026 the Raspberry Pi 5 combined with the new AI HAT+ 2 is a practical edge platform for real-world prototypes: low-cost, local-first, and capable of running quantized multimodal models. This guide gives you five ready-to-run starter projects — chatbots, voice assistants, image captioners, IoT monitors, and offline search — with concrete setup steps, model recommendations, code snippets, and deployable templates so you can ship prototypes quickly.

Recent trends (late 2025 → early 2026) accelerate the value of edge-first AI:

  • Offline AI parity: quantized GGUF/int8/int4 models and optimized runtimes (llama.cpp, whisper.cpp, ONNX Runtime Mobile) have made on-device inference practical for many tasks.
  • Privacy and compliance: regulations and customer demand push models to run locally when possible. See field guidance for deploying privacy-first edge kits in the Field Playbook 2026.
  • Multimodal everywhere: smaller multimodal models now handle image+text pipelines efficiently on ARM devices — useful when combining a vision encoder and a small decoder as described in lightweight multimodal notes like hybrid clip architectures.
  • Standardized formats: GGUF and ONNX are the de-facto formats for fast model loading on edge devices.

Combine those with the Pi 5 + AI HAT+ 2 and you get a low-cost prototyping environment for real products.

Quick prerequisites — hardware and base setup

Before you start the projects below, get the base platform ready. These steps assume a fresh Raspberry Pi 5 with a connected AI HAT+ 2.

Minimum hardware & accessories

  • Raspberry Pi 5 (64-bit OS)
  • AI HAT+ 2 mounted and drivers installed (vendor SDK)
  • Fast storage: NVMe SSD via adapter or UHS-II/III microSD for models — portable network and media kits are covered in field reviews like portable network & COMM kits.
  • USB microphone (or ReSpeaker HAT), USB camera / Pi Camera v4, and a small speaker
  • Active cooling and a 5A USB-C PSU

Base software

  1. Install a 64-bit Raspberry Pi OS (2026 release or newer) or Ubuntu 24.04/26.04 arm64.
  2. Install vendor runtime/SDK for AI HAT+ 2 per the manufacturer. Usually: apt repo + apt install ai-hat-sdk or a pip wheel for the runtime. Restart and run vendor diagnostics.
  3. Install common runtimes: llama.cpp / ggml for GGUF models, whisper.cpp for offline ASR (see practical ASR integration patterns in omnichannel transcription workflows), and ONNX Runtime (arm64 with NNAPI/ARM Compute support).
  4. Install Python 3.11+, virtualenv, and common packages: numpy, pillow, aiohttp, fastapi, uvicorn, tflite-runtime (if needed).
Tip: Store large models on SSD and symlink them into /home/pi/models. SSDs dramatically reduce page-in latency vs slower microSD cards.
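
Before launching any project, a quick sanity check helps. This sketch assumes the /home/pi/models symlink from the tip above and a Linux /proc/meminfo; the paths are assumptions to adjust to your layout.

# env_check.py: quick pre-flight check (paths are assumptions; adjust to your layout)
import glob
import os

MODELS_DIR = '/home/pi/models'  # symlink to SSD-hosted models, per the tip above

# confirm the models directory resolves (follows the symlink) and list GGUF files
print('models dir ->', os.path.realpath(MODELS_DIR))
for path in glob.glob(os.path.join(MODELS_DIR, '*.gguf')):
    print(f'{path}: {os.path.getsize(path) / 1e9:.2f} GB')

# report available RAM from /proc/meminfo (Linux only)
with open('/proc/meminfo') as f:
    for line in f:
        if line.startswith('MemAvailable'):
            print(line.strip())
            break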

Project templates overview

Each project below includes: a short overview, recommended models and runtimes, quick setup commands, a minimal code snippet, and deployment tips. Use these as templates — copy, tweak models, and iterate. For real-time collaboration and field workflows that rely on low-latency local inference, see practical notes on edge-assisted live collaboration.

1) Local Chatbot — small LLM conversational server

Overview

Run a lightweight conversational model on-device to power demos, kiosks, or local support agents without cloud dependencies. The server exposes a simple HTTP API for sending prompts and receiving responses.

What to use

  • Model: GGUF quantized 1B–3B (examples: Mistral tiny, Mistral-1-Small, or community 3B GGUF). Pick the largest that fits memory.
  • Runtime: llama.cpp or a vendor-optimized runtime supporting GGUF.
  • Framework: small FastAPI server to wrap the runtime. For deployment patterns and observability, review guidance on observability for workflow microservices.

Quick install

# install dependencies (example)
sudo apt update && sudo apt install -y build-essential git cmake python3-venv python3-pip
# build llama.cpp with ARM optimizations
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && make -j4
# create Python environment
python3 -m venv venv && . venv/bin/activate
pip install fastapi uvicorn httpx

Minimal server (template)

from fastapi import FastAPI
import subprocess

app = FastAPI()
MODEL_PATH = '/home/pi/models/chat-gguf.gguf'

@app.post('/api/chat')
def chat(prompt: dict):
    text = prompt.get('text','')
    # call the llama.cpp CLI (binary name and flags vary by release; adjust to your build)
    cmd = ['./main', '-m', MODEL_PATH, '-p', text, '-n', '256']
    out = subprocess.check_output(cmd, text=True)
    return { 'response': out }
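
Assuming the server above is saved as chat_server.py and started with uvicorn chat_server:app --port 8000 (the module name and port are assumptions), a short httpx smoke test might look like this:

# chat_client.py: smoke-test the local chat API (module name and port are assumptions)
import httpx

resp = httpx.post(
    'http://localhost:8000/api/chat',
    json={'text': 'Give me three uses for a Raspberry Pi 5 on a factory floor.'},
    timeout=120.0,  # small models on CPU can take a while to generate 256 tokens
)
resp.raise_for_status()
print(resp.json()['response'])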

Deployment tips

  • Run as a systemd service for reliability.
  • Cache recent contexts on disk and rotate them to avoid memory pressure (a minimal sketch follows this list).
  • Quantize aggressively (int8→int4) to fit bigger models — test for quality drop. For broader cost/efficiency tradeoffs and when to keep work local vs. cloud, see cloud cost notes in cloud cost optimization.
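
One possible shape for the disk-backed context cache mentioned above, assuming one JSONL file per session; the cache path and turn limit are arbitrary:

# context_cache.py: tiny disk-backed chat history with rotation (paths and limits are assumptions)
import json
import os

CACHE_DIR = '/home/pi/cache/chat'
MAX_TURNS = 20  # keep only the most recent turns to bound prompt size and disk use

def append_turn(session_id: str, role: str, text: str) -> list[dict]:
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, f'{session_id}.jsonl')
    turns = []
    if os.path.exists(path):
        with open(path) as f:
            turns = [json.loads(line) for line in f]
    turns.append({'role': role, 'text': text})
    turns = turns[-MAX_TURNS:]  # rotate: drop the oldest turns
    with open(path, 'w') as f:
        for t in turns:
            f.write(json.dumps(t) + '\n')
    return turns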

2) Voice Assistant — offline speech in/out

Overview

Build a local voice assistant that handles wake words, local ASR, on-device intent parsing with a small LLM, and TTS. This pattern is ideal for privacy-first voice UIs or kiosks.

What to use

  • Wake word: Porcupine (Picovoice) or VOSK wake-word engine.
  • ASR: whisper.cpp or a small VOSK model for offline speech-to-text — practical integration patterns are covered in on-device voice interface guidance.
  • LLM: GGUF 1B–3B local model for intent handling.
  • TTS: Coqui TTS or lightweight WaveRNN variant quantized for ARM.

Quick setup

# install whisper.cpp (sample)
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp && make -j4
# install Porcupine via pip or vendor package
pip install pvporcupine sounddevice

Assistant loop (skeleton)

import pvporcupine, sounddevice as sd, subprocess

# wake-word init: recent Porcupine releases require an AccessKey, and custom phrases
# like "hey pi" need a trained .ppn file (keyword_paths=[...]); start with a built-in keyword
porcupine = pvporcupine.create(access_key='YOUR_ACCESS_KEY', keywords=['porcupine'])
# stream audio with sounddevice; on wake: record a buffer, pass it to whisper.cpp for ASR,
# then call the local LLM server (from project 1) and pipe the response to TTS (sketch below)
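
A rough sketch of the wake-to-reply path, assuming whisper.cpp is built at /home/pi/whisper.cpp with a downloaded ggml model and the chat server from project 1 is running locally; the whisper.cpp binary name and flags vary between releases, so adjust to your build:

# after the wake word fires: save the recorded buffer as a WAV file, then transcribe and respond
import subprocess
import httpx

WHISPER_DIR = '/home/pi/whisper.cpp'                       # assumed build location
WHISPER_MODEL = f'{WHISPER_DIR}/models/ggml-base.en.bin'   # assumed model download

def transcribe(wav_path: str) -> str:
    # whisper.cpp CLI; -otxt writes the transcript to <wav_path>.txt next to the input
    subprocess.run(
        [f'{WHISPER_DIR}/main', '-m', WHISPER_MODEL, '-f', wav_path, '-otxt'],
        check=True,
    )
    with open(wav_path + '.txt') as f:
        return f.read().strip()

def respond(text: str) -> str:
    # reuse the chat server from project 1
    r = httpx.post('http://localhost:8000/api/chat', json={'text': text}, timeout=120.0)
    return r.json()['response']

reply = respond(transcribe('/tmp/capture.wav'))
print(reply)  # hand this string to your TTS engine (e.g., Coqui TTS)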

Practical tips

  • Use voice activity detection (VAD) to reduce false captures (a minimal energy-based sketch follows this list).
  • Keep ASR models small (e.g., a quantized whisper base or small model) for near-real-time results.
  • Pre-generate TTS voices for templated responses to reduce TTS latency. See integration tips in on-device voice write-ups for web UI tradeoffs.
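
For the VAD tip above, a minimal energy-gate sketch (a starting point, not a full implementation), assuming 16 kHz mono float32 audio from sounddevice; the threshold is arbitrary and should be tuned per microphone and room:

# simple energy-gate VAD: keep only frames whose RMS exceeds a tuned threshold
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000
FRAME_MS = 30
RMS_THRESHOLD = 0.01  # arbitrary starting point; tune per microphone and room

def is_speech(frame: np.ndarray) -> bool:
    rms = float(np.sqrt(np.mean(np.square(frame))))
    return rms > RMS_THRESHOLD

frame_len = int(SAMPLE_RATE * FRAME_MS / 1000)
audio = sd.rec(frame_len * 100, samplerate=SAMPLE_RATE, channels=1, dtype='float32')
sd.wait()
frames = audio.reshape(-1, frame_len)
voiced = [f for f in frames if is_speech(f)]
print(f'{len(voiced)}/{len(frames)} frames classified as speech')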

3) Image Captioner — on-device multimodal demo

Overview

Create a Pi camera-based image captioning service to turn photos into textual descriptions. This is useful for accessibility demos, inventory tagging prototypes, or visual search seeds.

What to use

  • Vision encoder: a compact ViT or MobileNet backbone converted to ONNX or GGUF (quantized).
  • Captioner head: small Flan-T5 or BloomZ distilled to 1B–3B and quantized.
  • Runtime: ONNX Runtime (for vision) + llama.cpp/ggml (for text) or an integrated multimodal GGUF model if one fits. For batching and repurposing media at the edge, see hybrid clip architectures.

Quick capture + caption script

from PIL import Image

# capture using rpicam-still (the Pi 5 camera stack) or the Picamera2 library; this example loads a saved file
img = Image.open('/tmp/capture.jpg')
# preprocess -> send to vision encoder; receive vector -> prompt to local LLM
# Simplified pseudo-call:
# 1) call vision encoder: python vision_encode.py /tmp/capture.jpg -> vector.npy
# 2) pass vector to captioner: ./caption_local --vec vector.npy
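
A minimal sketch of what vision_encode.py could look like, assuming an image encoder exported to ONNX with a single image input and a single embedding output; the model path, 224x224 input size, and normalization are assumptions to adapt to your backbone:

# vision_encode.py: preprocess an image and run it through an ONNX vision encoder
# (model path, input size, and normalization are assumptions)
import sys
import numpy as np
import onnxruntime as ort
from PIL import Image

MODEL_PATH = '/home/pi/models/vision_encoder.onnx'

def encode(image_path: str) -> np.ndarray:
    img = Image.open(image_path).convert('RGB').resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0   # HWC in [0, 1]
    x = np.transpose(x, (2, 0, 1))[None, ...]       # NCHW, batch of 1
    sess = ort.InferenceSession(MODEL_PATH, providers=['CPUExecutionProvider'])
    input_name = sess.get_inputs()[0].name
    (vec,) = sess.run(None, {input_name: x})        # assumes a single embedding output
    return vec.squeeze()

if __name__ == '__main__':
    np.save('vector.npy', encode(sys.argv[1]))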

Template repo structure

  • vision_encode.py — image preprocessing + ONNX runtime call
  • caption_server.py — loads GGUF caption model and accepts vectors
  • camera_daemon.service — triggers on motion or schedule

Optimization tips

  • Use integer quantization for vision backbones to reduce memory.
  • Batch images only for offline tagging; single-shot inference for interactive captions.
  • Consider a two-stage flow: small on-device caption + optional cloud refine for high-quality captions. For practical on-site edge micro-event deployments, see the Field Playbook.

4) IoT Monitor — smart edge telemetry and anomaly detection

Overview

Turn the Pi into an intelligent IoT gateway that ingests sensor data, runs an on-device anomaly detector, and exposes a dashboard and alerts. Use it for factory floor monitoring, environmental sensors, and predictive maintenance prototypes.

What to use

  • Data ingestion: MQTT (mosquitto) + lightweight collectors (python paho-mqtt).
  • Anomaly detection: small LSTM/Transformer quantized to TFLite or ONNX, or an on-device autoencoder.
  • Dashboard: Grafana/InfluxDB or a lightweight FastAPI + Vue/React front end served locally. Observability and runtime validation patterns are detailed in observability for workflow microservices.

Quick setup (example)

# install mosquitto
sudo apt install -y mosquitto mosquitto-clients
pip install paho-mqtt numpy scipy

# simple subscriber that runs detection
from paho.mqtt import client as mqtt
import numpy as np

def on_message(client, userdata, msg):
    data = np.frombuffer(msg.payload, dtype=np.float32)
    # call anomaly detector model (a placeholder implementation follows below)
    if detect_anomaly(data):
        client.publish('alerts', b'ANOMALY')

# wire up the client: paho-mqtt 2.x requires the callback API version argument (drop it on 1.x)
client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION1)
client.on_message = on_message
client.connect('localhost', 1883)
client.subscribe('sensors/#')
client.loop_forever()
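
The detect_anomaly call above is a placeholder. A minimal stand-in using a rolling z-score over recent readings, assuming each MQTT payload is a small float32 array; swap it for the quantized TFLite/ONNX model once trained:

# placeholder anomaly detector: flag readings far outside the recent rolling distribution
# (window size and threshold are arbitrary; replace with the quantized model for production)
from collections import deque
import numpy as np

WINDOW = deque(maxlen=500)   # recent sensor readings
Z_THRESHOLD = 4.0

def detect_anomaly(data: np.ndarray) -> bool:
    value = float(np.mean(data))
    if len(WINDOW) < 50:                 # not enough history yet; just collect
        WINDOW.append(value)
        return False
    mean, std = np.mean(WINDOW), np.std(WINDOW)
    WINDOW.append(value)
    if std == 0:
        return False
    return abs(value - mean) / std > Z_THRESHOLD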

Deployment & scaling

  • Keep the model tiny and use TFLite/ONNX for fast inferencing.
  • Aggregate telemetry to a central server only when necessary — rely on local rules for immediate actions.
  • Implement secure device identity (certs) and rate-limit telemetry to save bandwidth. For portable field kits and network considerations, see portable network kits.

5) Offline Search — private, on-device semantic search

Overview

Build a private, offline semantic search engine for documents, code, or audio transcripts. Use a small embedding model to create vectors and a lightweight nearest-neighbor index for fast retrieval on-device.

What to use

  • Embeddings model: compact GGUF or ONNX embedding model (e.g., 384–1024 dims)
  • Index: HNSW via nmslib or FAISS (arm64 build) with disk-backed storage
  • Search API: FastAPI with simple ranking + local LLM for answer synthesis

Quick build steps

# index pipeline (pseudo)
# 1. extract text (pdf, docs) -> store corpus
# 2. embed using model -> vector.npy
# 3. build HNSW index
import nmslib, numpy as np
# `vectors` below is the list of per-document embedding arrays produced in step 2
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(np.stack(vectors))
index.createIndex({'post': 2}, print_progress=True)
index.saveIndex('search.idx')

Search endpoint (template)

from fastapi import FastAPI
app = FastAPI()

@app.get('/search')
def search(q: str):
    qvec = embed_text(q)
    ids, distances = index.knnQuery(qvec, k=5)
    results = [corpus[i] for i in ids]
    # optionally synthesize short answer using local LLM
    return { 'results': results }
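
The endpoint above assumes embed_text, index, and corpus are loaded at startup. One possible embed_text, assuming a sentence-embedding model exported to ONNX plus its tokenizer stored locally (the paths and mean pooling are assumptions; transformers is used only for tokenization):

# embed_text sketch: tokenize locally, run the ONNX encoder, mean-pool to a single vector
# (model/tokenizer paths are assumptions; match them to your exported embedding model)
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

EMB_DIR = '/home/pi/models/embedder'
tokenizer = AutoTokenizer.from_pretrained(EMB_DIR)
session = ort.InferenceSession(f'{EMB_DIR}/model.onnx', providers=['CPUExecutionProvider'])

def embed_text(text: str) -> np.ndarray:
    enc = tokenizer(text, return_tensors='np', truncation=True, max_length=256)
    inputs = {i.name: enc[i.name] for i in session.get_inputs() if i.name in enc}
    hidden = session.run(None, inputs)[0]                     # (1, seq_len, dim)
    mask = enc['attention_mask'][..., None].astype(np.float32)
    pooled = (hidden * mask).sum(axis=1) / mask.sum(axis=1)   # mean over real tokens
    return pooled[0].astype(np.float32)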

Practical tips

  • Use lower-dim embeddings (384–512) to save memory and index size.
  • Persist index periodically and keep incremental indexing for new docs.
  • Combine semantic with lightweight keyword filters for speed-sensitive queries.

Advanced strategies and troubleshooting

Quantization & model formats

Quantization is the most powerful lever to make larger models fit. In 2026, GGUF + int8/int4 quantization is mature and supported by most edge runtimes. Always compare int8 versus int4 for accuracy. Test on real conversation/data before committing to a quantization level.

Memory & storage

  • Store models on SSD; use swap cautiously and avoid heavy swapping during inference.
  • Use memory-mapped model loading when supported by the runtime to reduce peak RAM usage.

Performance tips

  • Pin threads and set environment vars (e.g., OMP_NUM_THREADS) to optimize CPU usage.
  • Use small context windows; trim long histories for chatbots to keep latency bounded.
  • Use batching for vision or embedding tasks when throughput is more important than latency. For strategies on repurposing edge media and batching, see hybrid clip repurposing.

Security & privacy

Keep sensitive data local whenever possible. Use encrypted storage for models and secure transports (MQTT w/ TLS, HTTPS) for remote communications. Implement role-based access for admin endpoints.

Template repo & file patterns (starter)

Use this simple repo layout across projects. It’s compatible with Docker or native systemd deployment. For organizing templates and modular delivery, review modular publishing workflows.

  • README.md — setup + run steps
  • requirements.txt — lock Python deps
  • /models — symlink to your SSD-hosted models
  • /services — FastAPI servers, systemd unit files
  • /scripts — capture, onboarding, maintenance scripts (model download, quantize)

Real-world checklist (ship-ready prototyping)

  1. Proof-of-concept: one feature working end-to-end (speech → text → intent → response)
  2. Stability: run continuous 24–72h stress tests
  3. Monitoring: CPU/temp logs + simple heartbeats to a central dashboard (a minimal sketch follows this list)
  4. Security: disable unnecessary ports, rotate keys, and use minimal permission models
  5. Field trial: deploy 1–3 units and collect user feedback before scaling (see field deployments & micro-event kits in the Field Playbook)
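
For checklist item 3, a minimal heartbeat sketch, assuming the Pi's standard sysfs thermal zone; the dashboard URL is a hypothetical placeholder to point at your own collector:

# heartbeat.py: post CPU temperature and load to a central dashboard every 60 seconds
# (the dashboard URL is a hypothetical placeholder)
import os
import time
import httpx

DASHBOARD_URL = 'http://monitor.local:8080/heartbeat'  # hypothetical endpoint

def cpu_temp_c() -> float:
    with open('/sys/class/thermal/thermal_zone0/temp') as f:
        return int(f.read().strip()) / 1000.0

while True:
    payload = {
        'device': os.uname().nodename,
        'cpu_temp_c': cpu_temp_c(),
        'load_1m': os.getloadavg()[0],
        'ts': time.time(),
    }
    try:
        httpx.post(DASHBOARD_URL, json=payload, timeout=5.0)
    except httpx.HTTPError:
        pass  # keep running even if the dashboard is briefly unreachable
    time.sleep(60)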

Why start with these templates?

Each template here targets a common developer pain point: you want something that runs locally, is easy to reason about, and can be extended. In 2026 the ecosystem around Raspberry Pi 5 and AI HAT+ 2 is now rich with optimized runtimes and quantized models — so you can iterate on feature value instead of fighting infrastructure.

Actionable next steps — 30/90 day plan

  • Day 0–7: Set up Pi 5 + AI HAT+ 2, build runtimes, and run vendor diagnostics.
  • Day 7–21: Pick one project (chatbot or voice assistant). Get a minimal end-to-end demo working.
  • Day 21–60: Add robustness (systemd service, logging, small test harness). Quantize and measure latency/quality trade-offs.
  • Day 60–90: Run a small field trial (3–10 devices) and collect user telemetry for improvements.

Closing — build faster, iterate locally

By combining the Raspberry Pi 5 with the AI HAT+ 2, modern quantized models, and edge runtimes, you can get meaningful AI features into a prototype in days, not months. Use the templates above to avoid common pitfalls, choose the right quantization levels, and focus on product value — not just model size.

Ready to start? Clone one template, pick a small model (1B–3B), and get an end-to-end demo working in a day. Share your build with thecoding.club community or fork the starter repo to iterate with peers.

Call to action

Grab the starter templates now: clone the repo, flash your Pi, and post your prototype on our community channel for feedback. Need help selecting models or tuning quantization? Ask for a quick configuration review — we’ll help you optimize for latency, memory, and accuracy.


Related Topics

#projects #templates #edge-computing

thecoding

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
