Custom Models (DXNN SDK)
Take an ONNX vision model trained in PyTorch, TensorFlow, Keras, or Ultralytics, compile it for the DEEPX NPU on the Sixfab Edge AI Expansion Board, and run it on Raspberry Pi 5 with Python or C++. No cloud service. No GPU at inference time.
The DXNN SDK converts a trained ONNX model into a .dxnn binary
that the DEEPX NPU can execute. The compiler (dx-compiler) runs once on an
Ubuntu x86_64 machine; the resulting .dxnn file then runs offline on the
Edge AI Expansion Board via the dxrt-runtime package on Raspberry Pi 5.
Quantization to INT8 is automatic, with approximately 2 % accuracy loss versus the
original FP32 model.
This page is the deepest custom-model reference for the Sixfab Edge AI Expansion Board for Raspberry Pi 5. It documents the conversion path end to end, calls out the constraints that catch most teams on their first deployment, and points to the partnership track that takes you from a labelled dataset to a deployed custom model in days.
The two-machine workflow
Custom-model deployment on the Edge AI Expansion Board is split between two machines because compilation is expensive and inference is not. Compilation runs once per model, on a desktop. Inference runs every frame, on the Pi.
The DEEPX compiler (dx-compiler, sometimes referred to as DX-COM) requires
Ubuntu x86_64 with at least 16 GB of RAM. The dxrt-runtime on Raspberry Pi 5
only loads and executes the compiled .dxnn file. No compiler toolchain is
needed on the Pi. Once a model is compiled, it runs on the Edge AI Expansion Board
indefinitely without an internet connection.
| Component | Runs on | What it does |
|---|---|---|
| DXNN compiler (dx-compiler) | Ubuntu x86_64 host | Converts an ONNX model into an optimized .dxnn file for the DEEPX NPU. Quantizes automatically to INT8. |
| DEEPX runtime (dxrt-runtime) | Raspberry Pi 5 with Edge AI Expansion Board | Loads .dxnn files and runs inference on the NPU. Provides Python (dx_engine) and C++ (dxrt) APIs. Installed via the sixfab-dx APT package. |
Compiler machine requirements
The dx-compiler must run on a separate Ubuntu x86_64 machine, not on the
Raspberry Pi itself. ARM / aarch64 is not supported for compilation.
To check the glibc version on the host, run:

    ldd --version
Compilation is the expensive step. Run it once per model revision on the Ubuntu host
(locally, on a CI server, or overnight) and treat the produced .dxnn file
as the deployment artifact. Once compiled, the .dxnn file loads in
milliseconds on the Pi.
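The host requirements above can be checked before installing the toolchain. The following is a minimal sketch, not part of the DXNN SDK: `check_compiler_host` and `read_mem_total_gib` are hypothetical helpers, reading `/proc/meminfo` assumes a Linux host, and the 15 GiB default threshold is an assumption that allows for the memory a 16 GB machine reserves before reporting `MemTotal`. The sketch checks only OS family, architecture, and RAM; it does not confirm that the distribution is Ubuntu.

```python
def read_mem_total_gib(meminfo_path="/proc/meminfo"):
    """Return total RAM in GiB as reported by the Linux kernel."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                kib = int(line.split()[1])  # the kernel reports this value in kiB
                return kib / (1024 ** 2)
    raise RuntimeError("MemTotal not found in " + meminfo_path)


def check_compiler_host(system, machine, mem_gib, min_mem_gib=15.0):
    """Return a list of problems; an empty list means the host qualifies."""
    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system} (Ubuntu required)")
    if machine != "x86_64":
        problems.append(f"unsupported architecture: {machine} (x86_64 required)")
    if mem_gib < min_mem_gib:
        problems.append(f"only {mem_gib:.1f} GiB RAM reported (16 GB required)")
    return problems
```

On the host, the check could be run as `check_compiler_host(platform.system(), platform.machine(), read_mem_total_gib())` after `import platform`.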
Conversion path: ONNX → DXNN → deployed artifact
Three stages take a trained model from a development machine onto the NPU. Each stage runs in a different environment and produces a different artifact.
Export the trained model to ONNX
Train the model in any major framework, then export to ONNX on the development machine.
ONNX is the only input format the dx-compiler accepts.
Vision architectures (object detection, classification, segmentation) are the supported scope on current DEEPX silicon. Supported families include YOLO, MobileNet, EfficientNet, ResNet, and more; the published reference list lives at developer.deepx.ai/modelzoo. LLMs and audio-only transformer architectures are not supported today; LLMs are on the DEEPX roadmap and Sixfab will support them as the silicon enables.
PyTorch export
    import torch

    # 1. Load your trained model and switch to eval mode
    #    (`model` is the trained torch.nn.Module, loaded beforehand)
    model.eval()

    # 2. Provide a dummy input matching the production input shape
    dummy_input = torch.randn(1, 3, 640, 640)

    # 3. Export to ONNX (opset 11 is widely compatible with dx-compiler)
    torch.onnx.export(
        model,
        dummy_input,
        "my_model.onnx",
        opset_version=11,
        input_names=["images"],
        output_names=["output"],
    )
Ultralytics YOLO export
Ultralytics ships a one-command ONNX export. This is the recommended starting point for YOLOv8n, YOLOv8s, and YOLOv8m.
yolo export model=yolov8n.pt format=onnx opset=11
Compile ONNX to DXNN with dx-compiler
The DEEPX NPU compiler reads the ONNX file and produces a .dxnn file.
Quantization to INT8 happens automatically; once the input size and other parameters are
provided in config.json, no manual precision configuration is required.
Expect approximately 2 % accuracy reduction compared to the original FP32 model. This is
the published, honest accuracy envelope for INT8 inference on DEEPX silicon. Do not
represent the result as FP32-equivalent.
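One way to plan evaluation against that envelope is to score the FP32 model and the compiled model on the same labelled validation set and compare the two. The following is a minimal sketch, assuming top-1 classification labels. The function names and the 2.0-point default are illustrative and not part of the SDK, and the sketch reads "approximately 2 %" as percentage points, which is an assumption.

```python
def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def accuracy_drop_points(fp32_preds, int8_preds, labels):
    """Accuracy lost by the INT8 model, in percentage points."""
    fp32_acc = top1_accuracy(fp32_preds, labels)
    int8_acc = top1_accuracy(int8_preds, labels)
    return 100.0 * (fp32_acc - int8_acc)


def within_envelope(fp32_preds, int8_preds, labels, max_drop_points=2.0):
    """True when the measured drop stays inside the expected envelope."""
    return accuracy_drop_points(fp32_preds, int8_preds, labels) <= max_drop_points
```

For detection models the same comparison applies with mAP in place of top-1 accuracy.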
When compiling a YOLO model, enable PPU (Post-Processing Unit) support in
dx-compiler. This offloads non-maximum suppression and confidence
filtering to the NPU, and is the largest single FPS improvement available for
detection models on the Edge AI Expansion Board. Without PPU support, the user is
responsible for implementing NMS in application code. Refer to the
dx-compiler documentation
for the PPU flag.
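For models compiled without PPU support, confidence filtering and suppression run on the Pi 5 CPU. The following is a minimal sketch of greedy non-maximum suppression in plain Python, assuming boxes are `(x1, y1, x2, y2)` corner coordinates with one score each. The function names and threshold defaults are illustrative, and a production pipeline would normally use a vectorised implementation.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.25, iou_thresh=0.45):
    """Return indices of kept boxes, highest score first."""
    # Drop low-confidence candidates, then visit the rest by descending score
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

For multi-class detectors, this is typically applied per class so that overlapping boxes of different classes do not suppress each other.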
Install dx-compiler and run a compilation
    # 1. Clone the dx-compiler toolchain on the Ubuntu x86_64 host
    git clone https://github.com/DEEPX-AI/dx-compiler.git
    cd dx-compiler

    # 2. Follow the upstream install procedure for the toolchain release
    #    matching the dxrt-runtime version installed on your Pi 5.
    #    The compiler reads model parameters from a config.json file.

    # 3. Compile ONNX → DXNN with the config for your model
    dx-compiler --config <config.json> --model my_model.onnx --output my_model.dxnn

    # 4. (Optional) inspect the compiled artifact
    file my_model.dxnn
What the compiler does to the model
- Operator mapping. Each ONNX op is mapped to a DEEPX NPU instruction sequence. Unsupported ops either error out or fall back to CPU execution at runtime, depending on architecture.
- INT8 quantization. Weights and activations are calibrated and quantized. No developer configuration is required once config.json declares the input shape.
- Memory planning. The compiler lays out tensors to fit NPU on-chip memory. Models that exceed available NPU memory fail at compile time.
- Optional PPU fusion. For YOLO architectures, the post-processing graph (NMS, score filtering) is fused onto the NPU when PPU is enabled, eliminating a CPU step per frame.
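To make the quantization step concrete, the sketch below applies symmetric per-tensor INT8 quantization to a short list of values. It is a generic illustration of the idea, not the calibration scheme dx-compiler uses, which this page does not describe. All names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Largest absolute difference between original and restored values
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With this scheme the rounding error per value is bounded by half the scale, which is why small values next to a large outlier (0.003 here) lose the most relative precision.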
Compiler installation reference: github.com/DEEPX-AI/dx-compiler
Deploy the .dxnn artifact and run inference
The compiled .dxnn file is the deployment artifact. Copy it to the
Raspberry Pi 5, then load it with the DEEPX runtime (dxrt-runtime) and run
inference. The Python (dx_engine) and C++ (dxrt) APIs are
functionally equivalent; C++ runs slightly more efficiently on tight inference loops.
The sixfab-dx APT package must already be
installed on the Pi 5. If not, follow the
Quickstart
first; it sets up the kernel driver (via DKMS), the runtime, and the CLI tools in a
single command.
Copy the artifact to the Pi
    # Replace <pi_user> and <pi_host> with your values
    scp my_model.dxnn <pi_user>@<pi_host>:~/models/
Check the memory footprint reported by dx-compiler at the end of compilation, then pick
the variant (DEEPX DX-M1M for the larger envelope at 25 TOPS at INT8 precision,
DEEPX DX-M1ML for the smaller envelope at 13 TOPS at INT8 precision) that fits the
workload.
Run inference on the NPU (Python)
Load the .dxnn file with the dx_engine Python API.
Preprocessing and postprocessing run on the Pi 5 CPU; the NPU executes the model. The
following pattern is the synchronous form — one frame in, one result out.
    from dx_engine import InferenceEngine
    import numpy as np
    import cv2

    # 1. Load the compiled DXNN model onto the NPU
    engine = InferenceEngine("<model_path>/my_model.dxnn")

    # 2. Preprocess on the Pi 5 CPU (BGR → RGB, normalize, NCHW)
    frame = cv2.imread("<input_image>.jpg")
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    inp = rgb.astype(np.float32) / 255.0
    inp = np.transpose(inp, (2, 0, 1))[np.newaxis]

    # 3. Run inference on the NPU
    outputs = engine.run(inp)

    # 4. Parse outputs in your application code (no standard output format;
    #    the shape matches what the ONNX graph produced)
    print(outputs[0].shape)
OpenCV loads images as BGR by default. Most vision models are trained on RGB.
Always convert explicitly with cv2.COLOR_BGR2RGB before feeding frames
to the NPU. A pipeline that reports correct accuracy in evaluation but loses several
percentage points in production almost always has this wrong.
Run inference asynchronously
For pipelined throughput (submitting frame N while the NPU is still working on frame N−1),
use the run_async entry point. Asynchronous calls remove the wait-for-result stall of the
synchronous pattern within a single script and are the recommended pattern when the application
has other work to do (capture, preprocessing, network I/O) while the NPU runs. Full
reference:
github.com/DEEPX-AI/dx_rt.
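The overlap that asynchronous submission buys can be sketched without the runtime. The example below uses a stand-in `FakeEngine` with a blocking `run` method and a single worker thread from the standard library, so that the simulated NPU works on frame N−1 while the main thread prepares frame N. `FakeEngine`, `preprocess`, and the sleep durations are hypothetical. The sketch does not show the real `run_async` signature, which is documented in the dx_rt repository.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeEngine:
    """Stand-in for an inference engine; sleeps instead of using an NPU."""

    def run(self, frame):
        time.sleep(0.02)  # pretend inference time
        return frame * 2


def preprocess(raw):
    time.sleep(0.02)  # pretend CPU-side capture and preprocessing
    return raw


def run_pipelined(engine, raw_frames):
    """Overlap preprocessing of frame N with inference on frame N-1."""
    results = []
    pending = None  # future for the frame currently being inferred
    with ThreadPoolExecutor(max_workers=1) as pool:
        for raw in raw_frames:
            frame = preprocess(raw)  # runs while the worker infers frame N-1
            if pending is not None:
                results.append(pending.result())  # collect frame N-1
            pending = pool.submit(engine.run, frame)
        if pending is not None:
            results.append(pending.result())  # collect the last frame
    return results
```

Results come back in submission order. When preprocessing and inference take similar time, the per-frame cost approaches the slower of the two rather than their sum.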
Run inference (C++)
The C++ dxrt API is the lower-overhead path for tight inference loops and
for applications that already live in a C++ runtime (robotics control, native video
pipelines). Build against the headers installed by the sixfab-dx package
and link libdxrt. End-to-end C++ examples live in
github.com/sixfab/sixfab-dx-examples
under the cpp_examples tree, alongside the Python examples.
Update a deployed model in the field
Push a new compiled .dxnn file over SSH and restart the inference service
without physical access to the device. There is no restriction on the delivery method —
SCP, OTA, configuration-management push — as long as it is secure.
    # 1. Copy the new artifact
    scp my_model_v2.dxnn <pi_user>@<pi_host>:~/models/

    # 2. Restart the systemd service that loads the model
    ssh <pi_user>@<pi_host> "sudo systemctl restart my-inference.service"
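Whatever the delivery method, it helps to verify the artifact before the service picks it up, so that a truncated transfer never replaces a working model. The following is a minimal sketch, assuming the expected SHA-256 digest is delivered alongside the file. `install_model` and the paths passed to it are hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_model(staged_path, live_path, expected_sha256):
    """Verify the staged file, then atomically move it over the live one."""
    actual = sha256_of(staged_path)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {actual}")
    # os.replace is atomic when both paths are on the same filesystem
    os.replace(staged_path, live_path)
```

Upload to a staging name in the same directory as the live model, call `install_model`, and restart the service only if no exception is raised.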
Run inside Docker
Docker containers are supported on Raspberry Pi 5. The container needs --privileged
and the NPU device node mounted from the host kernel driver:
docker run --privileged -v /dev/dxrt0:/dev/dxrt0 -v ~/models:/models my-inference-app:latest
Sixfab × Ultralytics acceleration path
For teams who want to skip building the toolchain themselves and move from a labelled dataset to a deployed custom model in days, Sixfab offers a partnership track with Ultralytics that compresses the steps above into a guided pipeline.
From labelled dataset to deployed custom model in days
Train a YOLO model on the customer dataset in the Ultralytics workflow, export to ONNX, compile to DXNN with the partnership tooling, and deploy to the Edge AI Expansion Board. The path is opinionated where it can be (it removes the choice points that take teams the longest to figure out) and explicit where the customer needs to make a call: resolution, batch size, accuracy and FPS trade-off.
This path is the right starting point for object detection, classification, and segmentation on YOLO architectures. For non-YOLO architectures, follow Stages 1–3 above directly.
Note: on-device training is not supported on the Edge AI Expansion Board. Customers train on their own infrastructure and deploy the resulting weights via the ONNX → DXNN compiler path described on this page.
Compatible Python libraries
Standard data-science and computer-vision libraries work alongside dxrt-runtime
on Raspberry Pi 5 with no conflicts. OpenCV, Pillow, and NumPy are explicitly validated.
Pre-recorded video files are supported as inputs alongside live camera streams, which is the recommended approach for repeatable bench testing during development.
Integration patterns on the Edge AI Expansion Board
The Edge AI Expansion Board carries the NPU, the NVMe slot, and the LTE/5G modem on a single baseboard, so a deployed inference application typically writes to local storage and sends results upstream over cellular while inferring. The patterns below cover the points where Sixfab software ends and customer application code begins.
Inference, NVMe, and LTE/5G in the same pipeline
In bench testing, concurrent NVMe writes and LTE/5G traffic do not degrade AI inference throughput on the Edge AI Expansion Board. The NPU runs on a dedicated PCIe Gen 2/Gen 3 x1 link to the Pi 5, while storage and cellular share the on-board USB 3.2 Gen 1 hub. The practical implication for custom-model deployments: a single application can buffer raw frames to NVMe, run inference, and stream structured results over LTE/5G without staging a separate inference host.
The sixfab-dx package installs the driver, runtime, and CLI tools — it does
not save camera footage or inference results anywhere on its own. Set the log and output
paths in the inference application to the mount point of the NVMe SSD, and transmit
results over the LTE/5G connection through whatever platform the application uses
(MQTT, HTTPS, custom protocol). The NPU does not constrain either choice.
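Because the package does not persist anything itself, the application has to. The following is a minimal sketch of appending one JSON record per frame to a file under a configurable output directory. `ResultLog` is a hypothetical helper, and the record fields are illustrative.

```python
import json
import os
import time


class ResultLog:
    """Append inference results as JSON lines under an output directory."""

    def __init__(self, output_dir, filename="results.jsonl"):
        os.makedirs(output_dir, exist_ok=True)
        self.path = os.path.join(output_dir, filename)

    def append(self, frame_id, detections, timestamp=None):
        """Write one record and return it so it can also be sent upstream."""
        record = {
            "frame_id": frame_id,
            "timestamp": time.time() if timestamp is None else timestamp,
            "detections": detections,
        }
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return record
```

Pointing `output_dir` at the NVMe mount point (for example `ResultLog("/mnt/nvme/inference")`, where the path is a placeholder) keeps the log on the SSD. The returned record can double as the payload for whatever upstream transport the application uses.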
Multiple concurrent models on one NPU
The DEEPX NPU can run more than one compiled model in parallel. Up to three concurrent models have been validated in bench testing; more is possible but not characterized. The practical pattern is a detection model plus a classification or pose model fed from the detected regions, all on the same NPU.
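The detection-plus-second-model pattern mostly comes down to turning detected boxes into valid crop regions. The following is a minimal sketch in which `detect`, `classify`, and `crop` are caller-supplied callables standing in for the two compiled models and an image-cropping routine. All names and the 10 % margin are hypothetical.

```python
def clamp_box(box, frame_w, frame_h, margin=0.0):
    """Expand a box by a relative margin and clamp it to the frame bounds."""
    x1, y1, x2, y2 = box
    mx, my = (x2 - x1) * margin, (y2 - y1) * margin
    return (
        max(0, int(x1 - mx)),
        max(0, int(y1 - my)),
        min(frame_w, int(x2 + mx)),
        min(frame_h, int(y2 + my)),
    )


def cascade(frame, frame_w, frame_h, detect, classify, crop):
    """Run a detector, then a second model on each detected region."""
    results = []
    for box in detect(frame):
        region = clamp_box(box, frame_w, frame_h, margin=0.1)
        results.append((region, classify(crop(frame, region))))
    return results
```

Clamping matters because detectors often return boxes that extend slightly past the frame edge, and an out-of-range crop would otherwise produce an empty or wrapped region.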
Multiple camera streams
Bench testing has shown the NPU handling four concurrent 720p streams cleanly. Beyond that, frame rate begins to drop — but the bottleneck is the Raspberry Pi 5 CPU handling capture, decode, and preprocessing, not the NPU itself. Plan accordingly: if more camera streams are needed, optimize the CPU-side pipeline (hardware decode, fewer copies, async I/O) before assuming the NPU is the limit.
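Before optimising, it is worth measuring where each frame's time goes. The following is a minimal sketch of per-stage timing using only the standard library. `StageTimer` and the stage names in the usage note are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def mean_ms(self, name):
        """Mean time per call for a stage, in milliseconds."""
        if name not in self.counts:
            raise KeyError(name)
        return 1000.0 * self.totals[name] / self.counts[name]

    def slowest(self):
        """Name of the stage with the largest mean time per call."""
        return max(self.counts, key=self.mean_ms)
```

Wrap each step of the loop, for example `with timer.stage("decode"): ...` and `with timer.stage("infer"): ...`, then compare `mean_ms` across stages after a few hundred frames.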
Image preprocessing and postprocessing
The Raspberry Pi 5 CPU is responsible for everything around the inference call: capture,
colour conversion, resize, normalisation, drawing bounding boxes and labels on output frames,
and any non-maximum suppression that is not handled by a PPU-fused YOLO model. The NPU only
runs the compiled graph. Image format expectations vary per model — most compiled models
accept NHWC input and the exact tensor shape must match the config.json used at
compile time.
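For the resize step, detection models are commonly fed letterboxed frames so that the aspect ratio is preserved. Whether a given model expects letterboxing depends on how it was trained, so treat that as an assumption to verify. The sketch below computes only the geometry (scale and padding) and maps a box from model input coordinates back to the source frame. The function names are hypothetical.

```python
def letterbox_params(src_w, src_h, dst_w, dst_h):
    """Scale and padding that fit a source frame inside dst without distortion."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst_w - new_w) / 2  # horizontal padding on each side
    pad_y = (dst_h - new_h) / 2  # vertical padding on each side
    return scale, pad_x, pad_y


def box_to_source(box, scale, pad_x, pad_y):
    """Map an (x1, y1, x2, y2) box from model input space to the source frame."""
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_x) / scale,
        (y1 - pad_y) / scale,
        (x2 - pad_x) / scale,
        (y2 - pad_y) / scale,
    )
```

The same `scale`, `pad_x`, and `pad_y` must be used for the forward resize and for mapping detections back; otherwise boxes drift when drawn on the original frame.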
Moving an existing application from a GPU or cloud server
If the model already fits within DEEPX-supported architectures, the transition is usually
short. Compile the existing ONNX export with dx-compiler, swap the inference
backend from the GPU framework to the dx_engine Python API (or
dxrt in C++), and validate. With ready inference code, this is typically a
matter of hours rather than weeks.
Limitations and constraints
Understanding these constraints up front saves significant debugging time later.
Vision models today
Scope: The DEEPX NPU accelerates convolutional and vision-based networks. LLMs, audio-only models, and text-based transformers are not supported on current silicon. LLMs are on the DEEPX roadmap and Sixfab will support them as the silicon enables.
INT8 precision only
~2 % loss: All models run at INT8. The compiler quantizes automatically. Expect approximately 2 % accuracy reduction versus the original FP32 model. Plan evaluation against this envelope, not against the FP32 numbers.
Inference only
No training: The Edge AI Expansion Board runs inference exclusively. Train on a GPU-equipped machine (or via the Sixfab × Ultralytics acceleration path), then deploy the resulting .dxnn file.
Ubuntu x86_64 to compile
No ARM: The dx-compiler requires Ubuntu x86_64 with at least 16 GB RAM. ARM and aarch64 are not supported for compilation. The Pi 5 itself cannot compile models.
Model fits NPU memory
Hard cap: The compiled .dxnn footprint cannot exceed available NPU on-chip memory. DX-M1M provides the larger envelope; DX-M1ML the smaller. Footprint is reported at the end of compilation.
No automatic NMS for non-PPU models
Manual: For YOLO models compiled with PPU support, bounding-box output is produced directly. For all other architectures, non-maximum suppression and other postprocessing run in CPU application code.
Use Sixfab-compiled artifacts only
Versioning: A .dxnn file downloaded from a third party (Hugging Face, ONNX Model Zoo, internet at large) only runs if it was compiled with dx-compiler against a compatible runtime version. Raw ONNX from those sources must be recompiled.
No published version matrix
Upgrade together: DEEPX does not publish a compatibility matrix between dx-compiler, dxrt-runtime, the kernel driver, and the on-NPU firmware. Upgrade them together via the sixfab-dx APT package, and recompile .dxnn files if the runtime reports a version mismatch at load time.
Already have a .dxnn file? Skip Stages 1 and 2.
If a pre-compiled artifact has been provided (by Sixfab, a colleague, or a partner),
skip directly to Stage 3 and run inference on the Pi 5. The
Sixfab Model Zoo
is the curated source for ready-to-run .dxnn files maintained by Sixfab.
Per Sixfab's content cornerstone, every command sequence on this page is re-run on the listed hardware within the 30 days preceding publication. If a runtime, compiler, or operating-system update lands inside that window, this page is reverified before the change is shipped.
