Troubleshooting

Diagnose and resolve issues with the Sixfab AI HAT+ for Raspberry Pi 5: NPU detection, dxrt-runtime behavior, the 16-pin FFC cable and PCIe link, the thermal envelope, and host power. Each issue lists the symptom, the most likely cause, and the verified fix.

Applies to: DEEPX DX-M1M / DX-M1ML · Raspberry Pi 5 · PCIe Gen 3 x1 · dxrt-runtime
Updated 2026-05-11
How do I troubleshoot the Sixfab AI HAT+?

Start with the five quick diagnostic commands below: dxrt-cli -s, lspci, lsmod, dmesg, and journalctl. Their output narrows almost every issue to one of six categories: NPU detection, dxrt-runtime mismatch, PCIe / 16-pin FFC cable, thermal, host power, or a Raspberry Pi OS kernel update that broke the DEEPX driver. Match your symptom to one of the issue cards below and follow the fix steps.

Quick diagnostics: run these first

These five commands cover the majority of issues. Run them in order and check the output before drilling into a specific issue card.

bash
# 1. Full NPU status: runtime/firmware versions, per-core voltages and temperatures
dxrt-cli -s

# 2. Confirm the NPU is visible on the PCIe bus
lspci

# 3. Confirm the DEEPX kernel module is loaded
lsmod | grep -i dx

# 4. Check the kernel ring buffer for driver / PCIe errors
dmesg | grep -i dx

# 5. Inspect the runtime service log
journalctl -xeu dxrt.service
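
If you prefer to capture all five outputs in one pass (for example, to attach them to a support request), the commands above can be wrapped in a short script. The following is a minimal sketch, not a Sixfab tool: the function names `run_check` and `collect_report` are illustrative, the two `grep -i dx` pipelines are reproduced as a case-insensitive filter in Python, and `--no-pager` is added to `journalctl` so the call does not wait for a pager.

```python
import subprocess

# The five commands from the list above. The shell pipelines
# (lsmod | grep -i dx, dmesg | grep -i dx) are replaced by filtering in Python.
CHECKS = [
    (["dxrt-cli", "-s"], None),
    (["lspci"], None),
    (["lsmod"], "dx"),
    (["dmesg"], "dx"),
    (["journalctl", "-xeu", "dxrt.service", "--no-pager"], None),
]


def run_check(cmd, keep=None, timeout=15):
    """Run one command; return its output, or a short note if it cannot run."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return f"[{cmd[0]}: command not found]"
    except subprocess.TimeoutExpired:
        return f"[{cmd[0]}: timed out after {timeout} s]"
    out = proc.stdout
    if keep is not None:
        # Case-insensitive substring filter, equivalent to `grep -i`.
        out = "\n".join(line for line in out.splitlines() if keep.lower() in line.lower())
    # Keep stderr unfiltered so permission errors stay visible.
    text = (out + "\n" + proc.stderr).strip()
    return text or "[no output]"


def collect_report():
    """Run every check in order and return one text report."""
    sections = []
    for cmd, keep in CHECKS:
        title = " ".join(cmd) + (f" | grep -i {keep}" if keep else "")
        sections.append(f"$ {title}\n{run_check(cmd, keep)}")
    return "\n\n".join(sections)
```

Calling `print(collect_report())` on the Pi 5 prints each command followed by its output. A missing binary is reported inline rather than aborting the run, so a partly installed runtime still yields a usable report.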

Common issues

Each entry below covers one symptom: what you see, why it happens, and the verified fix. Entries are labelled by category (NPU, Runtime, Power, Thermal) so you can scan the list quickly.

NPU

NPU not detected: lspci shows no DEEPX device

Symptom

lspci output contains no DEEPX entry. dxrt-cli -s reports no device found. The runtime is installed but the NPU is invisible to the host.

Cause

In the vast majority of cases the 16-pin FFC cable between the AI HAT+ and the Raspberry Pi 5 PCIe FPC connector is not making electrical contact. The usual culprits are a partially closed latch, a reversed cable, or a HAT not fully seated on the 40-pin GPIO header.

Fix

  1. Power off the Pi 5 and reseat the 16-pin FFC cable.

    The AI HAT+ does not support hot-plug. Disconnect power before touching the cable. Open both FFC latches, remove and reinsert the cable firmly at both ends, close the latches, then power on and re-run lspci.

  2. Confirm each latch is fully closed.

    A latch that feels closed but is not flush is a common trap. The latch should sit level with the connector body; a partially latched cable will not make electrical contact with the PCIe lane.

  3. Inspect dmesg for PCIe enumeration errors.
    dmesg | grep -i dx

    Look for PCIe link failed, timeout, or enumeration error. These confirm the host saw the bus event but could not establish a link.

  4. Verify the 40-pin header is fully seated.

    Press down on the AI HAT+ to confirm all 40 pins are engaged with the Raspberry Pi 5 GPIO header. A partially seated header can hold the FFC at an angle and prevent PCIe link establishment.
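
The dmesg check in step 3 can be scripted when you need to scan a long log. The sketch below looks for the three phrases named in that step. Two things in it are assumptions rather than documented behaviour: real kernel messages may be worded differently, and the filter is widened from `dx` to also include lines mentioning `pcie`, because link-training failures are normally logged by the host PCIe controller driver rather than by the device module.

```python
import re

# Phrases taken from step 3 above; actual kernel wording may differ.
ERROR_RX = re.compile(r"link failed|timeout|enumeration error", re.IGNORECASE)


def find_pcie_errors(dmesg_text):
    """Return dmesg lines that mention 'dx' or 'pcie' together with a failure phrase."""
    hits = []
    for line in dmesg_text.splitlines():
        lower = line.lower()
        if ("dx" in lower or "pcie" in lower) and ERROR_RX.search(line):
            hits.append(line)
    return hits
```

Feed it the full text of `dmesg` (not the pre-filtered `grep` output) so that host-controller messages are included.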

Runtime

Driver installed but dxrt-cli -s reports "no device found"

Symptom

dxrt-runtime is installed and dxrt-cli is on $PATH, but dxrt-cli -s prints no device found. lspci may or may not list the DEEPX entry.

Cause

The DEEPX kernel module is not loaded into the running kernel. Either the install completed without loading the module into this session, or a kernel update has invalidated the module compiled at the previous install.

Fix

  1. Confirm whether the kernel module is loaded.
    lsmod | grep -i dx

    Empty output means the module is not loaded. Proceed to step 2.

  2. Reboot the Raspberry Pi 5.

    The DEEPX kernel module may not have been loaded into this session (common after a fresh install). Reboot, then re-check with lsmod | grep -i dx and dxrt-cli -s.

  3. Reinstall dxrt-runtime to rebuild the kernel module.

    Reinstalling forces APT to recompile the DEEPX kernel module against the currently running kernel.

    sudo apt install --reinstall dxrt-runtime

  4. Inspect the DEEPX runtime service log.
    journalctl -xeu dxrt.service

    If lsmod still shows no DEEPX module after the reinstall, the journal will tell you what failed during module load.
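
For scripted health checks, the module test in step 1 can be done without shelling out, since `lsmod` simply formats `/proc/modules`. This is a minimal sketch under one assumption carried over from the `grep -i dx` command: that the DEEPX module name contains `dx`. Unlike `grep`, it matches the module name column only, not the "used by" column.

```python
def loaded_modules(proc_modules_text, needle="dx"):
    """Return names of loaded modules whose name contains `needle` (case-insensitive)."""
    names = [line.split()[0] for line in proc_modules_text.splitlines() if line.strip()]
    return [name for name in names if needle.lower() in name.lower()]


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's list of loaded modules (the source that lsmod formats)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

An empty list from `loaded_modules(read_proc_modules())` corresponds to the empty `lsmod | grep -i dx` output described in step 1.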

If the symptom started after an OS update

Skip to "Driver stopped working after a Raspberry Pi OS update" below. The recovery sequence is different.

Power

System reboots spontaneously or shows an under-voltage warning

Symptom

The Pi 5 reboots without warning under inference load, the Raspberry Pi OS taskbar shows a yellow lightning bolt, or vcgencmd get_throttled reports a non-zero state.

Cause

Almost always a power supply problem.

The combined load of the Raspberry Pi 5 plus the AI HAT+ under sustained inference reaches roughly 13–15 W. A standard 5 V / 3 A USB-C phone charger tops out at 15 W, which leaves no headroom at that load, so the supply voltage sags and the Pi 5 reports under-voltage or resets.

Fix

  1. Use the official Raspberry Pi 27 W USB-C PD Supply (5 V / 5 A).

    This is the only supply Sixfab recommends for the AI HAT+ on Pi 5. The yellow lightning bolt clears once a sufficient supply is in place.

  2. Use the cable shipped with the 27 W supply.

    Some USB-C cables are rated only for charging current and cannot sustain 5 A continuously. Cable quality matters at this current. If the supply is correct but symptoms persist, swap to the original cable.

  3. Confirm the warning has cleared.
    vcgencmd get_throttled

    Expected output: throttled=0x0. Any non-zero value means an under-voltage, frequency-capping, or throttling condition is active now or has been recorded since boot.
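
A non-zero `get_throttled` value is a bit field, so it helps to decode it before deciding whether the cause is power or temperature. The sketch below uses the bit meanings from the Raspberry Pi `vcgencmd` documentation; the function name `decode_throttled` is illustrative.

```python
# Bit positions as documented for `vcgencmd get_throttled`.
THROTTLE_FLAGS = {
    0: "under-voltage detected (now)",
    1: "ARM frequency capped (now)",
    2: "currently throttled",
    3: "soft temperature limit active (now)",
    16: "under-voltage has occurred since boot",
    17: "ARM frequency capping has occurred since boot",
    18: "throttling has occurred since boot",
    19: "soft temperature limit has occurred since boot",
}


def decode_throttled(output):
    """Decode a line such as 'throttled=0x50005' into a list of flag descriptions."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [text for bit, text in sorted(THROTTLE_FLAGS.items()) if value & (1 << bit)]
```

For example, `throttled=0x50005` decodes to under-voltage and throttling both active now and recorded since boot, which points at the power supply or cable. A value with only bits 3 or 19 set points at temperature instead.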

Thermal

Frame rate drops over time during sustained inference

Symptom

Inference starts at the expected FPS but throughput degrades after several minutes of continuous load. The system does not crash; FPS stabilises at a lower value and recovers when load is paused.

Cause

NPU temperature is rising into the throttling band. Passive cooling (the supplied thermal pad) is sufficient for typical workloads, but sustained 100 % NPU utilisation in a closed enclosure can push temperature high enough that the runtime backs off clocks to stay within the safe envelope.

Fix

  1. Watch NPU temperature live.
    dxrt-cli -s

    The output reports per-core NPU temperatures. Repeat during sustained inference and watch for a steady climb. See System Monitoring for continuous telemetry.

  2. Confirm the supplied thermal pad is properly seated.

    The thermal pad sits between the DEEPX NPU package and the heatsink. A pad installed off-centre or with a protective film still attached produces the same symptom as no pad at all.

  3. Add active cooling for sustained 100 % NPU workloads.

    Passive cooling covers typical bursty inference. For benchmarks, multi-camera continuous inference, or deployment in a sealed enclosure, an active cooler is recommended. Improving enclosure ventilation also helps.

  4. Check ambient temperature.

    If ambient temperature is at the upper end of the operating range, thermal headroom shrinks. Move the system to a cooler location or improve airflow before assuming a hardware issue.
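
To log the temperature trend from step 1 rather than reading it by eye, the per-core lines of `dxrt-cli -s` can be parsed. This sketch assumes the line format shown in the expected output later on this page (`NPU 0: voltage 750 mV, clock 1000 MHz, temperature 46°C`); other runtime versions may format the line differently. No throttling threshold is hard-coded, since this page does not state one.

```python
import re

# Matches lines such as: "NPU 0: voltage 750 mV, clock 1000 MHz, temperature 46°C"
NPU_LINE = re.compile(r"NPU\s+(\d+):.*?temperature\s+(-?\d+(?:\.\d+)?)\s*°?C")


def npu_temperatures(status_text):
    """Map NPU core index to temperature in °C, parsed from `dxrt-cli -s` output."""
    return {int(core): float(temp) for core, temp in NPU_LINE.findall(status_text)}
```

Calling it on fresh `dxrt-cli -s` output every few seconds during a sustained run, and recording the highest core temperature alongside the measured FPS, shows whether the FPS drop tracks the temperature climb.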

Runtime

Inference is running but frame rate is much lower than expected

Symptom

Inference completes without errors, but throughput is well below the published figure for the model on this configuration (for example, YOLOv8n at 640×640 should reach 30+ FPS on Raspberry Pi 5 with the DX-M1M when the model is compiled with PPU).

Cause

One of three things: the model is not running on the NPU (a raw .onnx falls back to CPU), a YOLO model was compiled without PPU so post-processing serialises on the CPU, or CPU-side capture and preprocessing are starving the NPU.

Fix

  1. Confirm the model is running on the NPU.

    If you passed a raw .onnx file to the runtime instead of a compiled .dxnn, the runtime falls back to CPU inference. That fallback can be one to two orders of magnitude slower. Always compile to DXNN with DX-COM first.

  2. For YOLO models, recompile with PPU support.

    A YOLO model compiled without PPU (Post-Processing Unit) performs non-maximum suppression on the CPU after every NPU inference, which becomes the bottleneck at high frame rates. Recompile with the PPU flag enabled in DX-COM.

  3. Watch NPU utilisation during inference.

    Run dxtop in a second terminal. Near 100 % NPU utilisation with low FPS means the NPU itself is saturated, so the model or its compile options are the limit (see steps 1 and 2). Low NPU utilisation with low FPS means the NPU is starved by CPU-side capture and preprocessing; fix those first.

  4. Use the async API.

    The dxrt API supports asynchronous inference. Submit the next frame to the NPU while it processes the current one. This hides preprocessing latency and significantly improves throughput.
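
The overlap described in step 4 can be illustrated without the runtime installed. The sketch below is not the dxrt API: `infer` is a hypothetical stand-in for a blocking inference call and `preprocess` is a placeholder. It only shows the pattern of preparing frame N+1 on the CPU while inference on frame N is in flight. In Python the overlap is real only if the inference call releases the GIL, as native runtime bindings typically do.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(frame):
    # Placeholder for resize / colour conversion / normalisation.
    return frame


def infer(tensor):
    # Hypothetical stand-in for a blocking NPU inference call; NOT the dxrt API.
    return tensor * 2


def run_pipelined(frames):
    """Overlap preprocessing of frame N+1 with inference on frame N."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for frame in frames:
            tensor = preprocess(frame)  # CPU work while the previous inference runs
            if pending is not None:
                results.append(pending.result())  # collect the previous result in order
            pending = pool.submit(infer, tensor)
        if pending is not None:
            results.append(pending.result())  # drain the last in-flight inference
    return results
```

With a single worker the results come back in submission order, so downstream code does not need to re-sort frames. Consult the dxrt documentation for the actual asynchronous calls.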

NPU

Inference results are wrong, inaccurate, or nonsensical

Symptom

Inference runs without error and at the expected frame rate, but the output is wrong: detections are random, classes are mismatched, or bounding boxes are misaligned. The same model behaves correctly on CPU.

Cause

The pre-processing pipeline does not match what the model expects. INT8 inference on the NPU is approximately 2 % less accurate than the original FP32 model (that is normal), but a wrong colour channel order, wrong normalisation, or wrong tensor shape produces output that looks nonsensical regardless of the NPU.

Fix

  1. Check colour channel order, the most common cause.

    OpenCV loads images as BGR. Most vision models are trained on RGB. Always convert before inference:

    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

  2. Check normalisation.

    Verify pixel values are in the range and scale expected by the model (e.g. / 255.0 for [0, 1], or ImageNet mean/std). A wrong scale silently produces bad output.

  3. Check input dimensions.

    Verify the tensor shape fed to the model matches what was specified at DX-COM compile time. Shape mismatches may not raise an error but produce meaningless output.

  4. Benchmark against CPU inference.

    Run the same model as ONNX on the CPU to establish a ground truth. If CPU results are also wrong, the bug is in your pre-processing code, not the NPU.
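
For the CPU comparison in step 4, a numeric comparison of the raw output tensors is more reliable than eyeballing detections. The following is a minimal sketch using only the standard library; it assumes you have already flattened the CPU (ONNX) output and the NPU (.dxnn) output for the same preprocessed input into two equal-length sequences of floats. The function name is illustrative, and no pass/fail threshold is implied.

```python
import math


def compare_outputs(reference, candidate):
    """Return (max absolute difference, cosine similarity) of two float sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs differ in length; check the tensor shape first")
    max_diff = max((abs(a - b) for a, b in zip(reference, candidate)), default=0.0)
    dot = sum(a * b for a, b in zip(reference, candidate))
    norm = math.sqrt(sum(a * a for a in reference)) * math.sqrt(sum(b * b for b in candidate))
    cosine = dot / norm if norm else float("nan")
    return max_diff, cosine
```

A cosine similarity close to 1 with a small maximum difference suggests the NPU path agrees with the CPU path, so remaining errors are in pre- or post-processing. A low similarity for identical input points back at channel order, normalisation, or shape.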

NPU

Model runs correctly but accuracy is significantly worse than the FP32 baseline

Symptom

The compiled .dxnn model loads, runs at the expected frame rate on the NPU, and produces sensible-looking output (correct class labels, plausible bounding boxes). Measured accuracy against your evaluation set is materially lower than the same model running at FP32 on a workstation. A typical signature: detection mAP drops by 10–20+ percentage points, or top-1 classification accuracy drops by more than a few percent.

Cause

The expected envelope is ~2 % loss. More than that points to a conversion problem.

DEEPX silicon runs INT8 only. After DX-COM quantization, expect approximately 2 % accuracy reduction versus the original FP32 model. Loss substantially larger than that is not a quantization limit; it is almost always one of: an ONNX export that did not preserve eval-mode behavior, an opset mismatch, unsupported operators silently falling back, weak calibration, or a preprocessing pipeline that quietly differs between training and inference.

Fix

  1. Quantify the gap against an FP32 baseline before changing anything.

    Run the original ONNX model on the CPU (via onnxruntime) on the same evaluation set. Compare the metric you actually care about (mAP, top-1, IoU) against the .dxnn result on the NPU. If the gap is within ~2 %, this is the published envelope and no fix is needed. If it is materially larger, continue to step 2.

  2. Re-export ONNX from a clean eval-mode model.

    The most common defect: the model was exported while still in training mode (dropout active, batch-norm running stats wrong). Switch the model to eval() mode and re-export with opset 11, which is the widely-compatible default for DX-COM.

    import torch

    model.eval()  # switch dropout and batch-norm to inference behaviour before export
    torch.onnx.export(model, dummy_input, "my_model.onnx", opset_version=11)

  3. Check the ONNX graph against the DEEPX supported operator list.

    An unsupported operator can either error at compile time or fall back partially to CPU at runtime, which silently distorts numerical behavior. Inspect the exported ONNX with Netron and compare every op against the DEEPX-supported list at github.com/DEEPX-AI/dx-compiler · Building_Models.md. Replace or remove unsupported ops in your model definition before re-exporting.

  4. Verify the preprocessing pipeline is byte-identical between training and inference.

    Quantization calibrates against the activation ranges your data produces. If training used one normalisation (e.g. ImageNet mean/std) and inference uses another (e.g. / 255.0), or if channel order swaps between RGB and BGR somewhere in the pipeline, INT8 calibration is fitted to one distribution while inference sees a different one. Re-check both pipelines side by side. See Inference results are wrong, inaccurate, or nonsensical for the full preprocessing checklist.

  5. Recompile with a representative calibration set.

    DX-COM quantizes automatically and does not require manual configuration, but the calibration data it sees determines how INT8 ranges are fitted. If the calibration data is unrepresentative of your deployment distribution (e.g. all daytime images for a model that runs day and night), accuracy can degrade well beyond the 2 % envelope on the under-represented slice. Recompile the model with calibration data that matches the deployment distribution. [NEED FROM SIXFAB: confirm the DX-COM flag(s) for supplying a calibration dataset, recommended sample size, and whether the default behavior is to calibrate from the model's training data or from a separate set.]

  6. If accuracy is still off, the model architecture may be quantization-sensitive.

    Some architectures (depthwise-separable convolutions, certain attention blocks, models with extreme activation ranges) lose more than 2 % under any post-training INT8 scheme. For these cases the workaround is at training time: apply quantization-aware training (QAT) on the workstation, export the QAT model to ONNX, then run it through DX-COM. [NEED FROM SIXFAB: explicit guidance on QAT support in DX-COM and any architecture-specific accuracy notes.]

Always validate on your own evaluation set before production rollout

The published ~2 % accuracy figure is a population-level expectation across typical vision workloads. Per-class and per-scenario accuracy can move differently. For safety-critical use cases, treat the .dxnn output as the source of truth and validate against the application-specific evaluation set, not against the FP32 baseline.
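
The gap measurement in step 1 can be made explicit with a few lines of code. This sketch covers top-1 classification only and assumes you have already collected predicted class IDs from the FP32 ONNX run and from the .dxnn run on the same evaluation set. Reading the ~2 % envelope as two percentage points is an assumption made here for illustration; the function names are hypothetical.

```python
def top1_accuracy(predictions, labels):
    """Top-1 accuracy in percent for equal-length lists of class IDs."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need equal-length, non-empty predictions and labels")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def accuracy_gap(fp32_preds, int8_preds, labels, envelope_pp=2.0):
    """Return (fp32 %, int8 %, gap in percentage points, within_envelope)."""
    fp32 = top1_accuracy(fp32_preds, labels)
    int8 = top1_accuracy(int8_preds, labels)
    gap = fp32 - int8
    # envelope_pp is an assumed reading of the ~2 % figure as percentage points.
    return fp32, int8, gap, gap <= envelope_pp
```

Running the same calculation per class or per scenario (for example, day versus night images) shows whether the loss is uniform or concentrated in an under-represented slice, which is the signature of weak calibration data.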

Runtime

Driver stopped working after a Raspberry Pi OS update

Symptom

The AI HAT+ was working before a routine sudo apt upgrade or kernel update. After reboot, dxrt-cli -s reports no device found, lsmod | grep -i dx returns empty, and lspci may or may not still list the DEEPX entry.

Cause

Why this happens

An OS update has installed a new Raspberry Pi kernel binary. The DEEPX kernel module shipped with dxrt-runtime was compiled against the previous kernel and no longer matches the new kernel's ABI, so it refuses to load. Reinstalling the runtime triggers a fresh compile against the now-running kernel and restores normal operation.

Fix

Run the following sequence in order. The Question Bank guidance is that a clean reinstall of dxrt-runtime is sufficient to recover; the steps below add the verification checks needed to confirm the rebuild succeeded.

  1. Confirm the kernel version actually changed.

    This rules out unrelated regressions. If the kernel version is unchanged, the OS update did not invalidate the module. Re-run the Quick diagnostics above first.

    uname -r

  2. Confirm the DEEPX kernel module is no longer loaded.
    lsmod | grep -i dx

    Empty output is consistent with the OS-update cause. Continue to the reinstall.

  3. Refresh the package index, then reinstall dxrt-runtime.

    The --reinstall flag forces APT to recompile the kernel module against the current running kernel even if the package version on disk has not changed.

    bash
    # 1. Refresh package index
    sudo apt update
    
    # 2. Reinstall dxrt-runtime (recompiles the DEEPX kernel module)
    sudo apt install --reinstall dxrt-runtime

    Compilation against the new kernel is the longest part of the reinstall. Lots of compiler output is normal.

  4. Reboot.

    A reboot ensures the freshly compiled module is loaded cleanly into the running kernel, rather than relying on the install-time hooks.

    sudo reboot

  5. Verify the module loaded and the NPU is reachable.
    bash
    # Module loaded?
    lsmod | grep -i dx
    
    # Runtime + NPU healthy?
    dxrt-cli -s
    Expected output:
    DXRT v3.2.0
     * Device 0: M1, Accelerator type
     * RT Driver version  : v2.1.0
     * FW version         : v2.5.0
     * Memory : LPDDR5x 6000 Mbps, 3.92 GiB
     * PCIe   : Gen3 X1 [01:00:00]
    NPU 0: voltage 750 mV, clock 1000 MHz, temperature 46°C
    NPU 1: voltage 750 mV, clock 1000 MHz, temperature 46°C
    NPU 2: voltage 750 mV, clock 1000 MHz, temperature 46°C

    If the reinstall fails to compile

    If apt install --reinstall reports build errors against the new kernel, the safest path is a fresh setup against the current kernel. Follow the Quickstart install steps from the start. [NEED FROM SIXFAB: explicit kernel-headers / DKMS recovery sequence. Current Question Bank guidance only covers the apt install --reinstall dxrt-runtime path.]
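
One way to see the kernel mismatch directly is to check whether a module file exists under the directory for the running kernel. The sketch below rests on assumptions: that the DEEPX module is installed somewhere under `/lib/modules/<kernel release>/` (the standard location for kernel modules) and that its file name contains `dx`, mirroring the `grep -i dx` filter used on this page. Neither the module file name nor the use of DKMS is confirmed here.

```python
import platform
from pathlib import Path


def find_module_files(kernel_release, modules_root="/lib/modules", pattern="*dx*.ko*"):
    """List module files under <modules_root>/<kernel_release> matching `pattern`."""
    root = Path(modules_root) / kernel_release
    if not root.is_dir():
        return []
    return sorted(str(path) for path in root.rglob(pattern))


def running_kernel():
    # Same value that `uname -r` prints.
    return platform.release()
```

If `find_module_files(running_kernel())` is empty while an older release directory under `/lib/modules` still contains a match, the module was built for the previous kernel only, which is consistent with the cause described above. After a successful reinstall the list for the running kernel should no longer be empty.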

Still need help?

If your symptom is not covered above, contact Sixfab support. Including the output of the five Quick diagnostic commands in your report turns a multi-day debugging cycle into a single round-trip.

Attach the output of dxrt-cli -s, lspci, lsmod | grep -i dx, dmesg | grep -i dx, and journalctl -xeu dxrt.service. Those five outputs are usually enough to pinpoint the issue.