Pitfalls Research

Domain: AI-powered homelab hardware inventory (Go + USB serial + local AI + NetBox)
Researched: 2026-04-09
Confidence: MEDIUM-HIGH (domain-specific issues verified via community sources; some areas LOW confidence where official docs are thin)


Critical Pitfalls

Pitfall 1: USB Serial Port Path Churn on Device Replug

What goes wrong: On macOS, /dev/cu.usbmodem* and /dev/tty.usbmodem* paths are assigned dynamically at plug time. If the Mac Mini is rebooted, USB ports are replugged in a different sequence, or a hub is involved, the same physical device gets a different /dev/ path. Hard-coding device paths (or deriving them once at startup) means the wrong device gets written to — label printer commands going to the power meter, or goroutines blocking on a disconnected port.

Why it happens: Developers test with one device plugged in, so the path stays stable throughout development. The problem only surfaces once three USB devices are present and get unplugged and replugged during use.

How to avoid: Enumerate devices by USB VID/PID + serial number, not by path. Use github.com/google/gousb for enumeration or shell out to ioreg -p IOUSB on macOS to resolve stable identifiers to current paths at each open. Re-resolve the path on every reconnect, not once at boot.
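
A minimal sketch of the re-resolve step, assuming the enumeration result has already been collected into a slice. PortInfo, DeviceID, and ResolvePath are hypothetical names, and the VID/PID/serial values are placeholders rather than the real identifiers of any HWLab device. The enumerator subpackage of go.bug.st/serial can supply VID, PID, and serial number per port, but any source that yields those fields would do.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// PortInfo is a hypothetical snapshot of one enumerated serial port.
type PortInfo struct {
	Path   string // current device path, valid only until the next replug
	VID    string // USB vendor ID as hex text
	PID    string // USB product ID as hex text
	Serial string // USB serial number string
}

// DeviceID is the stable identity kept in config instead of a path.
type DeviceID struct{ VID, PID, Serial string }

// ErrNotConnected is returned when no enumerated port matches the identity.
var ErrNotConnected = errors.New("device not connected")

// ResolvePath looks up the current path for id in a fresh enumeration.
// Call it on every open and every reconnect, never once at boot.
func ResolvePath(ports []PortInfo, id DeviceID) (string, error) {
	for _, p := range ports {
		if strings.EqualFold(p.VID, id.VID) &&
			strings.EqualFold(p.PID, id.PID) &&
			p.Serial == id.Serial {
			return p.Path, nil
		}
	}
	return "", fmt.Errorf("%s:%s serial %q: %w", id.VID, id.PID, id.Serial, ErrNotConnected)
}

func main() {
	// Placeholder enumeration result; real values come from the OS.
	ports := []PortInfo{
		{Path: "/dev/cu.usbmodem1101", VID: "0483", PID: "5740", Serial: "A1"},
		{Path: "/dev/cu.usbmodem1201", VID: "2e3c", PID: "5558", Serial: "B2"},
	}
	path, err := ResolvePath(ports, DeviceID{VID: "2E3C", PID: "5558", Serial: "B2"})
	fmt.Println(path, err)
}
```

Keeping the lookup as a pure function over an enumeration snapshot makes the replug scenario testable without hardware: the test only has to shuffle the paths in the slice.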

Warning signs:

  • Tests pass with one device, fail silently or misbehave when all three are connected
  • Label printing works but data lands in wrong device log
  • Goroutine appears to be reading but produces zero bytes (hung on wrong port)

Phase to address: USB device layer — the very first phase that integrates any serial peripheral. Establish the device enumeration abstraction before writing any device-specific protocol code.


Pitfall 2: Goroutine Leak on USB Disconnect

What goes wrong: A goroutine blocked on serial.Read() does not unblock when the port is closed from another goroutine or when the device is physically unplugged. The goroutine leaks indefinitely. Over a session where devices are plugged/unplugged several times, leaked goroutines accumulate. On macOS, the port's file descriptor eventually becomes invalid but the goroutine may block in a syscall until process exit.

Why it happens: The standard Go serial libraries (go.bug.st/serial, tarm/serial) block in Read() using OS syscalls. Closing the port from a separate goroutine does not reliably interrupt the blocked read on all platforms. Developers assume port.Close() unblocks all readers.

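How to avoid: Use go.bug.st/serial, which has explicit support for unblocking reads via port.Close() (tracked in their issue #13), but do not rely on that alone. Give every read loop a context.Context that is cancelled before port.Close(), and set a read timeout on the port so that Read() returns periodically and the loop can check the context; a goroutine parked inside a blocking Read() cannot observe a done channel until the call returns. Hand data to consumers through a select that also watches the done channel, so the goroutine can exit even when nobody is receiving. Verify with a goroutine-count assertion: the -race flag catches data races, not leaks.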
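
The cancellation pattern can be sketched without real hardware. This example assumes a port whose Read returns (0, nil) when its read timeout expires with no data, which is modelled here by the hypothetical timeoutReader interface and the idlePort fake; with go.bug.st/serial the timeout would be configured on the port itself. All names are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// timeoutReader is a hypothetical stand-in for a serial port whose Read
// returns (0, nil) when the read timeout expires with no data.
type timeoutReader interface {
	Read(p []byte) (int, error)
}

// readLoop forwards received chunks to out until ctx is cancelled or Read
// fails. Because Read returns at least once per timeout, cancellation is
// noticed within one timeout period even if the device is silent.
func readLoop(ctx context.Context, port timeoutReader, out chan<- []byte) error {
	buf := make([]byte, 256)
	for {
		if err := ctx.Err(); err != nil {
			return err
		}
		n, err := port.Read(buf)
		if err != nil {
			return err // port closed or device unplugged
		}
		if n == 0 {
			continue // timeout with no data: loop back and re-check ctx
		}
		chunk := append([]byte(nil), buf[:n]...) // copy before reusing buf
		select {
		case out <- chunk:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// idlePort simulates a connected device that never sends anything.
type idlePort struct{ timeout time.Duration }

func (p idlePort) Read([]byte) (int, error) {
	time.Sleep(p.timeout)
	return 0, nil
}

// cancelIdleReader starts a reader on a silent port, cancels it, and
// returns the error the reader exited with.
func cancelIdleReader() error {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error, 1)
	go func() { done <- readLoop(ctx, idlePort{10 * time.Millisecond}, make(chan []byte)) }()
	cancel() // cancel first; the real port would be closed after this
	return <-done
}

func main() {
	fmt.Println("reader exited:", cancelIdleReader())
}
```

A reconnect harness can call something like cancelIdleReader in a loop and compare runtime.NumGoroutine() before and after.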

Warning signs:

  • runtime.NumGoroutine() grows after each replug cycle
  • go tool pprof goroutine profile shows multiple read goroutines in syscall state
  • Memory creeps up over a long session

Phase to address: USB device layer. Write a reconnect harness test (plug/unplug 10 times, assert goroutine count is stable) before any feature work on top of the layer.


Pitfall 3: NetBox as Sole Data Store — No Offline or Degraded Mode

What goes wrong: HWLab stores zero inventory data locally. When NetBox LXC 130 is unreachable (Proxmox maintenance, network hiccup, NetBox upgrade), the entire app becomes non-functional — not just degraded. Photo intake, label printing, cable testing results — all blocked on a NetBox write that can never complete. No queue, no cache, no fallback.

Why it happens: The decision to use NetBox as the sole source of truth is architecturally clean and correct for the goal (no data duplication). But "no local DB" is misread as "no local state at all." The difference between inventory data (belongs in NetBox) and operation queuing (belongs locally) gets collapsed.

How to avoid: Maintain a small local SQLite write-ahead queue for pending NetBox operations. Items are written to the queue first (synchronous, fast), then flushed to NetBox asynchronously with retry. The UI reflects queue state, not NetBox state, during flush. This is not a data store — it is a transactional buffer. The advisor chat history and config already live in SQLite per the design; add a pending_operations table to the same DB.
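
The queue semantics can be sketched independently of the storage engine. The example below uses an in-memory slice as a stand-in for the pending_operations table; PendingOp, Queue, and the API paths are hypothetical, and a real implementation would persist each row in SQLite before returning from Enqueue.

```go
package main

import (
	"errors"
	"fmt"
)

// PendingOp mirrors one row of a hypothetical pending_operations table.
type PendingOp struct {
	ID       int
	Method   string // e.g. "POST" or "PATCH"
	Path     string // NetBox API path (illustrative)
	Body     string // JSON payload
	Attempts int    // failed send attempts so far
}

// Queue is an in-memory stand-in for the SQLite-backed buffer.
type Queue struct {
	nextID int
	ops    []PendingOp
}

// errDown simulates an unreachable NetBox in the demo below.
var errDown = errors.New("netbox unreachable")

// Enqueue records the operation locally and returns immediately.
func (q *Queue) Enqueue(method, path, body string) int {
	q.nextID++
	q.ops = append(q.ops, PendingOp{ID: q.nextID, Method: method, Path: path, Body: body})
	return q.nextID
}

// Flush sends operations in order and stops at the first failure, so a
// PATCH never overtakes the POST it depends on. Unsent operations stay
// queued for the next retry.
func (q *Queue) Flush(send func(PendingOp) error) (sent int, err error) {
	for len(q.ops) > 0 {
		if err = send(q.ops[0]); err != nil {
			q.ops[0].Attempts++
			return sent, err
		}
		q.ops = q.ops[1:]
		sent++
	}
	return sent, nil
}

// Len reports how many operations are still waiting.
func (q *Queue) Len() int { return len(q.ops) }

func main() {
	q := &Queue{}
	q.Enqueue("POST", "/api/dcim/devices/", `{"name":"HW-00001"}`)
	q.Enqueue("PATCH", "/api/dcim/devices/1/", `{"status":"active"}`)

	n, err := q.Flush(func(PendingOp) error { return errDown })
	fmt.Println(n, err, q.Len()) // NetBox down: nothing sent, both kept

	n, err = q.Flush(func(PendingOp) error { return nil })
	fmt.Println(n, err, q.Len()) // NetBox back: queue drains
}
```

Stopping at the first failure is a deliberate choice in this sketch: it trades throughput for ordering, which matters when later operations reference the ID created by an earlier one.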

Warning signs:

  • Photo intake blocks waiting for NetBox HTTP response
  • Error: "NetBox unavailable" with no recovery path shown in UI
  • Label is printed before NetBox record is confirmed — ID mismatch risk

Phase to address: NetBox integration phase. Define the queue schema and flush logic before any intake workflow is built on top.


Pitfall 4: AI Confidently Misidentifies Hardware — No Quality Gate Enforcement

What goes wrong: Gemma 4 returns a plausible but wrong product identification — e.g., a PCIe riser card identified as a "USB hub," a 24-port patch panel identified as "network switch." The record is created in NetBox with wrong type, wrong manufacturer, wrong custom fields. The quality gate state machine (draft → indexed → needs_research → researched → complete) exists in the design but is never enforced in code: items advance automatically rather than requiring explicit confirmation for uncertain classifications.

Why it happens: Multimodal LLMs have high confidence scores even on wrong answers for visually ambiguous hardware. The three-tier pipeline is designed to handle this but the escalation triggers are undefined — there is no concrete threshold for "local model is uncertain, escalate to research agent."

How to avoid: The local indexer (Gemma 4) must return a structured confidence score alongside classification. Define hard thresholds: below 0.7 confidence, item is pinned at needs_research and flagged for manual review or automatic SearXNG escalation. Never auto-advance past indexed without either (a) a confidence score above threshold or (b) explicit operator confirmation. Store the raw AI response and confidence in the NetBox record's custom fields so the decision is auditable.
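
One way to make the threshold rule concrete, as a sketch only. It uses the 0.7 threshold and the state names from the text; IndexResult, Triage, and MayAdvance are hypothetical names, and treating out-of-range scores as untrusted is an added assumption.

```go
package main

import "fmt"

// State is a quality gate state; only the two used here are declared.
type State string

const (
	Indexed       State = "indexed"
	NeedsResearch State = "needs_research"
)

// MinConfidence is the 0.7 threshold from the text; in practice it would
// be loaded from config rather than compiled in.
const MinConfidence = 0.7

// IndexResult is a hypothetical shape for the local indexer's reply.
type IndexResult struct {
	DeviceType string
	Confidence float64 // expected range 0..1
	RawReply   string  // kept verbatim for the audit trail
}

// trusted rejects low scores and also NaN or out-of-range values, which
// usually mean the model output was not parsed correctly.
func trusted(c float64) bool { return c >= MinConfidence && c <= 1 }

// Triage places a freshly indexed item: confident results wait at
// indexed, uncertain ones are pinned at needs_research.
func Triage(r IndexResult) State {
	if trusted(r.Confidence) {
		return Indexed
	}
	return NeedsResearch
}

// MayAdvance reports whether an item may move past indexed: either the
// score clears the threshold or the operator confirmed it explicitly.
func MayAdvance(r IndexResult, operatorConfirmed bool) bool {
	return operatorConfirmed || trusted(r.Confidence)
}

func main() {
	riser := IndexResult{DeviceType: "USB hub", Confidence: 0.55}
	fmt.Println(Triage(riser), MayAdvance(riser, false), MayAdvance(riser, true))
}
```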

Warning signs:

  • All items advance to complete status immediately after intake
  • NetBox records show wrong device_type or manufacturer for known items you can visually verify
  • SearXNG and OpenRouter tiers are never triggered in practice

Phase to address: AI pipeline phase. Define confidence thresholds in config before wiring up the intake flow. Build the quality gate state machine before building the happy path.


Pitfall 5: PRT Qutie Protocol Unknown — Blocking Hardware Dependency

What goes wrong: The PRT Qutie uses Bluetooth-to-app communication as its primary documented interface. USB raw protocol documentation does not exist publicly. If reverse engineering reveals the device only accepts commands via its proprietary Bluetooth stack (not raw USB serial), the entire label printing architecture needs to change. This blocks any phase that delivers end-to-end intake (photo → record → label).

Why it happens: Hardware ordered before protocol feasibility is confirmed. The PROJECT.md correctly notes "protocols need reverse-engineering once hardware arrives," but this risk is not explicitly sized — it could be one day of work or two weeks.

How to avoid: On hardware arrival (2026-04-13), the first action is protocol characterization, not feature development. Capture raw USB traffic with Wireshark (via USBPcap on Windows or usbmon on Linux; capture on macOS is more restricted and may need extra setup). Test whether the device enumerates as a CDC-ACM or HID device. If only Bluetooth is functional, pivot to the macOS CoreBluetooth framework via a cgo shim or a small helper process. Have a fallback plan ready: a label printer with a publicly documented command protocol (the Brother QL-820NWBc is well documented).

Warning signs:

  • lsusb (or macOS system_profiler SPUSBDataType) shows device as HID-only, not CDC-ACM
  • No /dev/cu.* device appears when connected
  • Manufacturer app communicates exclusively via Bluetooth, USB only charges

Phase to address: Hardware characterization spike — must complete before committing to any label printing architecture. Do this in the first sprint after hardware arrival.


Pitfall 6: 16GB Unified Memory — Gemma 4 Leaves No Headroom for the Rest of the Stack

What goes wrong: On the Mac Mini M4 with 16GB unified memory, the Gemma 4 E4B model (4-bit quantized) needs approximately 5-8GB for model weights plus KV cache. oMLX, the Go backend, the React dev server, and macOS itself all draw on the same pool (NetBox runs on a separate LXC, so it costs API latency rather than local memory). With everything running simultaneously, memory pressure triggers macOS memory compression and swap, and inference slows catastrophically. Worse, at long context windows (intake photo + product research + NetBox context) the KV cache grows and throughput drops beyond 8K tokens.

Why it happens: Memory estimates are done in isolation: "Gemma 4B fits in 16GB" — true in isolation, false in production with everything else running.

How to avoid: Run a memory profiling session before any feature development: load oMLX with Gemma 4 E4B, run the Go backend, open a browser tab, and watch Activity Monitor's memory pressure indicator and swap usage. If memory pressure is yellow/red, either (a) use 26B A4B with TurboQuant only for the research agent tier (not the fast indexer tier), (b) set oMLX's max concurrent requests to 1, or (c) shut down other Mac Mini workloads during intake sessions. Document the working memory budget and enforce it in oMLX config from day one.
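
The vm_stat check can be automated so the memory budget test becomes a repeatable gate. This is a sketch that parses vm_stat text and compares two snapshots; the sample strings are made-up numbers in the tool's output layout, and parseVMStat and swapActivity are hypothetical helpers. In real use the input would be the output of running vm_stat before and after an inference run.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVMStat turns vm_stat text into counters keyed by label.
// Lines that do not end in a number (such as the header) are skipped.
func parseVMStat(out string) map[string]uint64 {
	stats := make(map[string]uint64)
	for _, line := range strings.Split(out, "\n") {
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		val = strings.TrimSuffix(strings.TrimSpace(val), ".")
		n, err := strconv.ParseUint(val, 10, 64)
		if err != nil {
			continue
		}
		stats[strings.Trim(strings.TrimSpace(key), `"`)] = n
	}
	return stats
}

// swapActivity reports paging that happened between two snapshots.
// The counters only grow, so after minus before is the activity.
func swapActivity(before, after map[string]uint64) (pageouts, swapins uint64) {
	return after["Pageouts"] - before["Pageouts"], after["Swapins"] - before["Swapins"]
}

// Made-up sample snapshots in vm_stat's layout.
const sampleBefore = `Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                                3000.
Pageouts:                                   120.
Swapins:                                     40.
`

const sampleAfter = `Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                                 900.
Pageouts:                                   920.
Swapins:                                     45.
`

func main() {
	pageouts, swapins := swapActivity(parseVMStat(sampleBefore), parseVMStat(sampleAfter))
	fmt.Printf("pageouts +%d, swapins +%d during inference\n", pageouts, swapins)
}
```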

Warning signs:

  • macOS memory pressure bar is yellow or red during normal operation
  • vm_stat shows high pageouts or swapins during inference
  • Inference latency spikes from ~2s to 20s+ without apparent cause
  • oMLX logs "out of memory" or silently returns truncated completions

Phase to address: Infrastructure setup phase (before any AI pipeline work). Run the memory budget test as a gating condition before committing to model selection.


Pitfall 7: NetBox Custom Fields — Write Format Differs from Read Format

What goes wrong: When reading a NetBox object via the REST API, object-type custom fields come back inside custom_fields as nested objects with id, url, display, and name. When writing (PATCH/PUT), custom_fields must be a flat dictionary of field name to value: scalar fields take the plain value, single-object fields take the integer ID, and multi-object fields take an array of integer IDs. Go structs generated from the OpenAPI spec (or hand-coded) that reuse the same type for read and write will silently drop custom field updates: the PATCH succeeds with HTTP 200 but the field is not updated.

Why it happens: NetBox's REST API has asymmetric read/write representations for custom fields. The official go-netbox client reflects this asymmetry but it is non-obvious. Community discussions confirm this trips up almost everyone working with custom fields programmatically.

How to avoid: Write integration tests that (1) PATCH a custom field on a real NetBox object, (2) immediately GET the object back, and (3) assert the field value matches what was sent. Never assume a 200 response means the field was written. Create separate Go structs for the read and write representations of custom fields.
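
A sketch of the separate read and write types, using only encoding/json. The field name hwlab_status appears in the warning signs below; hwlab_location is a hypothetical object-type field added for illustration, and the struct names are not taken from any client library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObjectRef is the nested form returned on read for object-type fields.
type ObjectRef struct {
	ID      int    `json:"id"`
	URL     string `json:"url"`
	Display string `json:"display"`
	Name    string `json:"name"`
}

// CustomFieldsRead matches what a GET returns.
type CustomFieldsRead struct {
	Status   *string    `json:"hwlab_status"`
	Location *ObjectRef `json:"hwlab_location"` // hypothetical object field
}

// CustomFieldsWrite matches what a PATCH should send: plain values for
// scalars, a bare integer ID for the object reference.
type CustomFieldsWrite struct {
	Status   *string `json:"hwlab_status,omitempty"`
	Location *int    `json:"hwlab_location,omitempty"`
}

// DevicePatch is the PATCH body wrapper.
type DevicePatch struct {
	CustomFields CustomFieldsWrite `json:"custom_fields"`
}

// ToWrite converts the read representation into the write representation.
func (r CustomFieldsRead) ToWrite() CustomFieldsWrite {
	w := CustomFieldsWrite{Status: r.Status}
	if r.Location != nil {
		id := r.Location.ID
		w.Location = &id
	}
	return w
}

// patchJSON builds a PATCH body from a GET response body.
func patchJSON(readBody string) (string, error) {
	var obj struct {
		CustomFields CustomFieldsRead `json:"custom_fields"`
	}
	if err := json.Unmarshal([]byte(readBody), &obj); err != nil {
		return "", err
	}
	out, err := json.Marshal(DevicePatch{CustomFields: obj.CustomFields.ToWrite()})
	return string(out), err
}

func main() {
	got := `{"custom_fields":{"hwlab_status":"indexed","hwlab_location":{"id":42,"url":"u","display":"Shelf A","name":"Shelf A"}}}`
	body, err := patchJSON(got)
	fmt.Println(body, err)
}
```

The round-trip test described above would PATCH the generated body, GET the object again, and compare the decoded CustomFieldsRead against what was sent.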

Warning signs:

  • PATCH returns 200 but NetBox UI shows field still empty
  • Custom fields look correct in test with scalar types but break with object-reference fields
  • NetBox API returns {"custom_fields": {"hwlab_status": null}} after a write

Phase to address: NetBox integration phase. Write the custom field round-trip test before building any intake workflow that depends on custom fields.


Technical Debt Patterns

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|---|---|---|---|
| Hard-code /dev/cu.usbmodem* paths | Works in 5 minutes | Wrong device targeted after any replug; non-recoverable in prod | Never; use VID/PID enumeration from the start |
| Skip write queue, call NetBox synchronously in intake handler | Simpler code | Intake blocks on NetBox latency (50-200ms per call); app dead during NetBox downtime | Never for the main intake path |
| Auto-advance quality gate without confidence check | All items reach complete fast | NetBox fills with wrong data; impossible to clean up at scale | Never; the quality gate is the whole product value |
| Share one serial port handle across goroutines with a mutex | Avoids per-device abstraction | Deadlock if one goroutine is in a long read while another needs to write | Only acceptable for single-device prototype, never for multi-device production |
| Use time.Sleep polling loop to check for new USB devices | Simple device detection | Wastes CPU, misses hot-plug events, introduces latency | Never on macOS; use IOKit notification or enumerate on each operation |
| One monolithic AI prompt for all hardware types | Simpler prompt engineering | Low accuracy for visually ambiguous items; no structured output to parse | Only during initial prompt development, never in production |

Integration Gotchas

| Integration | Common Mistake | Correct Approach |
|---|---|---|
| NetBox REST API | Using PUT when you mean PATCH; PUT requires ALL mandatory fields | Always use PATCH for partial updates; build a PatchNetBoxObject(id, fields) helper |
| NetBox custom fields | Passing write payload as read format (nested objects instead of ID arrays) | Maintain separate Go types for read (CustomFieldValue) and write (CustomFieldPatch) |
| oMLX / mlx-lm inference | Treating inference as a fast synchronous call (fire and forget with 5s timeout) | Use SSE streaming for long inference; set generous timeouts (60-120s); handle partial stream failures |
| FNIRSI FNB58 | Assuming energy/capacity values come directly from device | Device sends raw power+current samples at 10ms intervals; integrate on the host. Use baryluk/fnirsi-usb-power-data-logger as reference implementation |
| SearXNG | Sending raw AI-generated product names as queries | Sanitize queries; extract make/model tokens from AI output before querying; handle JSON ?format=json parse errors gracefully |
| OpenRouter | No per-request cost cap | Set max_tokens hard limit on every OpenRouter call; log tokens consumed per pipeline run; set account-level spend limit in OpenRouter dashboard |
| NetBox NetBox-inventory plugin | Assuming plugin custom fields are available immediately after install | Custom fields must be created and assigned to object types via API or UI after plugin install; verify with a test GET before intake flow depends on them |
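
For the FNB58 row, host-side integration is a running sum over fixed-interval samples. The sketch below assumes each sample carries voltage and current, which is an assumption about the decoded data rather than a statement of the device's wire format; Sample, Integrate, and steadyLoad are hypothetical names.

```go
package main

import "fmt"

// Sample is one decoded reading; the fields are assumed, not taken from
// the device protocol.
type Sample struct{ Volts, Amps float64 }

// Integrate accumulates energy (Wh) and charge (Ah) from samples spaced
// dtSeconds apart, using a simple rectangle rule.
func Integrate(samples []Sample, dtSeconds float64) (wattHours, ampHours float64) {
	for _, s := range samples {
		wattHours += s.Volts * s.Amps * dtSeconds / 3600
		ampHours += s.Amps * dtSeconds / 3600
	}
	return wattHours, ampHours
}

// steadyLoad builds n identical samples for the demo.
func steadyLoad(n int, volts, amps float64) []Sample {
	out := make([]Sample, n)
	for i := range out {
		out[i] = Sample{Volts: volts, Amps: amps}
	}
	return out
}

func main() {
	// 100 samples at 10 ms spacing: one second of a steady 5 V, 2 A load.
	wh, ah := Integrate(steadyLoad(100, 5, 2), 0.010)
	fmt.Printf("%.6f Wh, %.6f Ah\n", wh, ah)
}
```

If samples can be dropped on the USB link, integrating over per-sample timestamps instead of a fixed interval avoids undercounting.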

Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
|---|---|---|---|
| Synchronous NetBox API calls on the intake hot path | Intake UI freezes for 100-500ms per item; label doesn't print until all NetBox writes succeed | Write-ahead queue in SQLite; async flush to NetBox | Day 1 with real network latency to LXC 130 |
| Unbounded KV cache during long research agent prompts | Inference latency spikes from 2s to 20s+; oMLX logs memory warnings | Set max_tokens on research agent tier; keep prompts focused; use --kv-bits 4 in oMLX config | Context window >8K tokens on M4 16GB |
| Polling USB device state on a timer | CPU spike every N ms; goroutine accumulation if timer fires faster than reads complete | Event-driven reconnect with IOKit notifications (or enumerate once at request time) | With 3+ USB devices continuously polled |
| Loading full NetBox inventory for dashboard on every page load | Dashboard takes 3-5s to load; NetBox gets hammered with large paginated requests | Cache dashboard data locally (SQLite) with TTL; paginate lazily; never load all records into memory at once | Once NetBox has >200 items |
| React re-rendering entire device status panel on every SSE event | UI stutters during active cable testing with rapid updates | Use useMemo/useCallback, key by device ID, throttle SSE event processing to 10fps max | When SSE fires >5 events/second during live power testing |

Security Mistakes

| Mistake | Risk | Prevention |
|---|---|---|
| Storing NetBox API token in plain Go config file | Token leaked via git, readable by any process | Load from environment variable or macOS Keychain; never commit token; add config.json to .gitignore |
| Passing user-supplied image data directly to Gemma without size/type validation | OOM on large images; potential prompt injection via steganographic payloads | Validate MIME type and max size (e.g., 10MB) before passing to inference; resize/normalize before encoding to base64 |
| No authentication on HWLab Go backend | Any device on the homelab network can add/modify NetBox records via HWLab | Even for solo use, add a simple bearer token or basic auth to the Go API; the backend has write access to NetBox |
| OpenRouter API key in frontend bundle | Key exposed to any browser that loads the app | OpenRouter calls must go through Go backend only; never expose key to frontend |
| SearXNG queries logged with full hardware descriptions | Sensitive inventory information in SearXNG logs | SearXNG is self-hosted so this is lower risk, but keep queries minimal: send model numbers, not full descriptions |
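
The image validation row can be sketched with the standard library's content sniffing. The 10 MiB cap approximates the 10MB example in the table, and the accepted type list is an assumption; ValidateImage and the error names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// MaxImageBytes approximates the 10MB example cap from the table.
const MaxImageBytes = 10 << 20

var (
	ErrTooLarge = errors.New("image exceeds size limit")
	ErrBadType  = errors.New("unsupported image type")
)

// ValidateImage checks size first, then sniffs the real content type
// from the leading bytes instead of trusting a client-supplied header.
func ValidateImage(data []byte) (string, error) {
	if len(data) > MaxImageBytes {
		return "", ErrTooLarge
	}
	mime := http.DetectContentType(data)
	switch mime {
	case "image/jpeg", "image/png", "image/webp": // assumed allow-list
		return mime, nil
	}
	return "", fmt.Errorf("%w: %s", ErrBadType, mime)
}

// pngHeader is the 8-byte PNG signature, enough for sniffing in the demo.
var pngHeader = []byte("\x89PNG\r\n\x1a\n")

func main() {
	fmt.Println(ValidateImage(pngHeader))
	fmt.Println(ValidateImage([]byte("#!/bin/sh\necho not an image\n")))
}
```

Sniffing only proves the file starts like an image; decoding and re-encoding during the resize step is what actually normalizes the payload before it reaches inference.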

UX Pitfalls

| Pitfall | User Impact | Better Approach |
|---|---|---|
| No progress feedback during AI intake (photo upload → result) | Operator thinks the app is frozen during 5-20s inference; submits photo again | Show SSE-streamed inference progress: "Indexing... Researching... Creating record..." with each tier's state |
| Quality gate status shown as enum code (needs_research) not human label | Operator confused about what action is needed | Display human labels: "Needs research", "Ready to print", "Complete" with action buttons per state |
| Label prints before NetBox record is confirmed | QR code points to a record that may not exist if NetBox write fails | Print label only after NetBox write is confirmed (or queue is flushed); show "printing..." state |
| Cable test results shown as raw hex/bytes | Operator can't interpret pass/fail | Parse protocol data into human-readable result: "All 8 conductors: PASS", "Pin 4 open: FAIL" |
| No way to correct a wrong AI classification without going into NetBox UI | AI errors require leaving HWLab | Provide inline edit for manufacturer/model/type on the intake confirmation screen before committing to NetBox |

"Looks Done But Isn't" Checklist

  • USB device layer: Can handle all three devices connected simultaneously AND hot-unplug/replug of any one without affecting the others — verify with integration test
  • Label printing: QR code URL resolves to actual NetBox record (not 404) — verify that HW-XXXXX ID is written to NetBox before label is generated
  • AI intake pipeline: Confidence threshold enforcement is active — verify that a deliberately ambiguous photo does NOT auto-advance past indexed status
  • NetBox custom fields: Round-trip write+read test passes for all HWLab-specific custom fields — verify with a dedicated test script before intake is considered working
  • Three-tier escalation: Tier 2 (SearXNG) and Tier 3 (OpenRouter) are actually triggered in practice — verify by running an item that is genuinely ambiguous and watching pipeline logs
  • Memory budget: Mac Mini M4 stays out of memory pressure (green in Activity Monitor) with oMLX loaded and backend running — verify before declaring inference "working"
  • NetBox downtime handling: If NetBox LXC is shut down mid-intake, the operation queues and resumes cleanly — verify with a chaos test

Recovery Strategies

| Pitfall | Recovery Cost | Recovery Steps |
|---|---|---|
| Hard-coded USB paths in production | MEDIUM | Refactor device manager to VID/PID enumeration; update all open() calls; test full replug cycle |
| NetBox filled with wrong AI classifications | HIGH | Write a NetBox API script to bulk-set affected items back to draft status; re-run intake on each item; no shortcut |
| Goroutine leak accumulation over long session | LOW | Restart Go backend; investigate with pprof; add goroutine count metric to telemetry endpoint |
| PRT Qutie only works via Bluetooth | MEDIUM | Pivot to CoreBluetooth CGo shim or external helper process; estimated 3-5 days additional work |
| 16GB memory exhausted during intake session | LOW | Restart oMLX; switch to smaller model variant; add memory monitoring to runbook |
| OpenRouter spend spike from runaway escalation | LOW-MEDIUM | Set account spend limit in OpenRouter dashboard; add per-run token counter with hard cutoff in Go pipeline code |
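
The per-run token counter with a hard cutoff could look like the following sketch. TokenBudget and its methods are hypothetical; the idea is that the pipeline reserves the max_tokens value it is about to request before each OpenRouter call and aborts escalation once the budget is spent.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrBudgetExceeded signals that the pipeline run must stop escalating.
var ErrBudgetExceeded = errors.New("token budget exceeded for this pipeline run")

// TokenBudget is a per-run counter; it is safe for concurrent use so
// parallel tier calls cannot overspend together.
type TokenBudget struct {
	mu    sync.Mutex
	limit int
	used  int
}

// NewTokenBudget creates a budget with a hard limit for one run.
func NewTokenBudget(limit int) *TokenBudget { return &TokenBudget{limit: limit} }

// Reserve claims n tokens before a call is made. It refuses, without
// changing the counter, when the claim would exceed the limit.
func (b *TokenBudget) Reserve(n int) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if n < 0 || b.used+n > b.limit {
		return ErrBudgetExceeded
	}
	b.used += n
	return nil
}

// Used reports tokens reserved so far, for per-run cost logging.
func (b *TokenBudget) Used() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.used
}

func main() {
	budget := NewTokenBudget(4000)
	for call := 1; call <= 3; call++ {
		err := budget.Reserve(1500) // 1500 stands in for the call's max_tokens
		fmt.Println("call", call, "err:", err, "used:", budget.Used())
	}
}
```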

Pitfall-to-Phase Mapping

| Pitfall | Prevention Phase | Verification |
|---|---|---|
| USB path churn | USB device layer (first hardware phase) | Integration test: replug all 3 devices in random order, verify correct device responds |
| Goroutine leak on disconnect | USB device layer | Goroutine count stable after 10 replug cycles (test with pprof) |
| NetBox downtime / no offline mode | NetBox integration phase | Chaos test: kill NetBox mid-intake, verify queue persists and resumes |
| AI misidentification / quality gate bypass | AI pipeline phase | Ambiguous photo stays at needs_research; confident photo reaches indexed pending confirmation |
| PRT Qutie protocol unknown | Hardware characterization spike (day of hardware arrival) | USB traffic captured and protocol characterized before architecture is committed |
| 16GB memory exhaustion | Infrastructure setup phase | Memory pressure remains green during full stack with oMLX loaded |
| NetBox custom field write/read asymmetry | NetBox integration phase | Round-trip test: PATCH field, GET object, assert value matches |
| Three-tier escalation never triggers | AI pipeline phase | Log shows tier promotions happening on genuinely ambiguous items |
| Runaway OpenRouter spend | AI pipeline phase | max_tokens set on every OpenRouter call; spend limit set in dashboard; per-run cost logged |

Sources


Pitfalls research for: HWLab — Go + USB serial + local AI inference + NetBox Researched: 2026-04-09