Mikkel Georgsen 5d7c5e90a2 docs: complete project research

2026-04-09 23:35:26 +00:00

24 KiB


Pitfalls Research

Domain: AI-powered homelab hardware inventory (Go + USB serial + local AI + NetBox)

Researched: 2026-04-09

Confidence: MEDIUM-HIGH (domain-specific issues verified via community sources; some areas LOW confidence where official docs are thin)

Critical Pitfalls

Pitfall 1: USB Serial Port Path Churn on Device Replug

What goes wrong: On macOS, /dev/cu.usbmodem* and /dev/tty.usbmodem* paths are assigned dynamically at plug time. If the Mac Mini is rebooted, USB ports are replugged in a different sequence, or a hub is involved, the same physical device gets a different /dev/ path. Hard-coding device paths (or deriving them once at startup) means the wrong device gets written to — label printer commands going to the power meter, or goroutines blocking on a disconnected port.

Why it happens: Developers test with one device plugged in, so the path stays stable during development. The problem surfaces only when three USB devices are present and are unplugged/replugged during use.

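How to avoid: Identify devices by USB VID/PID + serial number, not by path. Use github.com/google/gousb for enumeration or shell out to ioreg -p IOUSB on macOS, then map the stable identifier to the current /dev/cu.* path at each open. USB-level enumeration alone yields VID/PID and serial number but not the tty path, so the mapping step is required; the go.bug.st/serial/enumerator package reports port name, VID, PID, and serial number together and may be a simpler fit. Re-resolve the path on every reconnect, not once at boot.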
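
The resolution step described above can be sketched as a pure function over whatever the enumerator returns. This is a minimal sketch, not the project's implementation: `PortInfo`, `DeviceID`, and `ResolvePath` are hypothetical names, and the VID/PID values are placeholders rather than the real identifiers of any HWLab device.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// PortInfo is a hypothetical descriptor for one enumerated serial port.
// It would be filled from a fresh enumeration call at open time.
type PortInfo struct {
	Path   string // e.g. /dev/cu.usbmodem1101; changes across replugs
	VID    string // USB vendor ID, hex
	PID    string // USB product ID, hex
	Serial string // USB serial number, if the device reports one
}

// DeviceID is the stable identity stored in config instead of a path.
type DeviceID struct {
	VID, PID, Serial string
}

var (
	ErrNotFound  = errors.New("device not present")
	ErrAmbiguous = errors.New("multiple devices match; serial number required")
)

// ResolvePath returns the current path for id among the enumerated ports.
// An empty id.Serial matches any serial, but only if exactly one port matches.
func ResolvePath(ports []PortInfo, id DeviceID) (string, error) {
	var matches []string
	for _, p := range ports {
		if !strings.EqualFold(p.VID, id.VID) || !strings.EqualFold(p.PID, id.PID) {
			continue
		}
		if id.Serial != "" && p.Serial != id.Serial {
			continue
		}
		matches = append(matches, p.Path)
	}
	switch len(matches) {
	case 0:
		return "", ErrNotFound
	case 1:
		return matches[0], nil
	default:
		return "", ErrAmbiguous
	}
}

func main() {
	// Placeholder identifiers for illustration only.
	ports := []PortInfo{
		{Path: "/dev/cu.usbmodem1101", VID: "1a86", PID: "55d4", Serial: "A1"},
		{Path: "/dev/cu.usbmodem1201", VID: "2e3c", PID: "5558", Serial: "B2"},
	}
	path, err := ResolvePath(ports, DeviceID{VID: "2E3C", PID: "5558"})
	fmt.Println(path, err)
}
```

Returning an explicit ambiguity error rather than the first match is a deliberate choice here: two identical devices without serial numbers are exactly the case where writing to the wrong port is most likely.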

Warning signs:

Tests pass with one device, fail silently or misbehave when all three are connected
Label printing works but data lands in wrong device log
Goroutine appears to be reading but produces zero bytes (hung on wrong port)

Phase to address: USB device layer — the very first phase that integrates any serial peripheral. Establish the device enumeration abstraction before writing any device-specific protocol code.

Pitfall 2: Goroutine Leak on USB Disconnect

What goes wrong: A goroutine blocked on serial.Read() does not unblock when the port is closed from another goroutine or when the device is physically unplugged. The goroutine leaks indefinitely. Over a session where devices are plugged/unplugged several times, leaked goroutines accumulate. On macOS, the port's file descriptor eventually becomes invalid but the goroutine may block in a syscall until process exit.

Why it happens: The commonly used Go serial libraries (go.bug.st/serial, tarm/serial) block in Read() using OS syscalls. Closing the port from a separate goroutine does not reliably interrupt the blocked read on all platforms. Developers assume port.Close() unblocks all readers.

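How to avoid: Use go.bug.st/serial, which has explicit support for unblocking reads via port.Close() (tracked in their issue #13), but do not rely on that alone. Give every read loop a context.Context that is cancelled before port.Close(), and set a read timeout so Read() returns periodically and the loop can observe cancellation; a goroutine parked in a blocking syscall cannot service a select. Use a select on a done channel when handing data to consumers so the read goroutine can exit even if the port close does not interrupt the syscall. Run tests with the -race flag to catch data races on the shared port handle, and check goroutine counts separately, since the race detector does not report leaks.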

Warning signs:

runtime.NumGoroutine() grows after each replug cycle
go tool pprof goroutine profile shows multiple read goroutines in syscall state
Memory creeps up over a long session

Phase to address: USB device layer. Write a reconnect harness test (plug/unplug 10 times, assert goroutine count is stable) before any feature work on top of the layer.

Pitfall 3: NetBox as Sole Data Store — No Offline or Degraded Mode

What goes wrong: HWLab stores zero inventory data locally. When NetBox LXC 130 is unreachable (Proxmox maintenance, network hiccup, NetBox upgrade), the entire app becomes non-functional — not just degraded. Photo intake, label printing, cable testing results — all blocked on a NetBox write that can never complete. No queue, no cache, no fallback.

Why it happens: The decision to use NetBox as the sole source of truth is architecturally clean and correct for the goal (no data duplication). But "no local DB" is misread as "no local state at all." The difference between inventory data (belongs in NetBox) and operation queuing (belongs locally) gets collapsed.

How to avoid: Maintain a small local SQLite write-ahead queue for pending NetBox operations. Items are written to the queue first (synchronous, fast), then flushed to NetBox asynchronously with retry. The UI reflects queue state, not NetBox state, during flush. This is not a data store — it is a transactional buffer. The advisor chat history and config already live in SQLite per the design; add a pending_operations table to the same DB.

Warning signs:

Photo intake blocks waiting for NetBox HTTP response
Error: "NetBox unavailable" with no recovery path shown in UI
Label is printed before NetBox record is confirmed — ID mismatch risk

Phase to address: NetBox integration phase. Define the queue schema and flush logic before any intake workflow is built on top.

Pitfall 4: AI Confidently Misidentifies Hardware — No Quality Gate Enforcement

What goes wrong: Gemma 4 returns a plausible but wrong product identification — e.g., a PCIe riser card identified as a "USB hub," a 24-port patch panel identified as "network switch." The record is created in NetBox with wrong type, wrong manufacturer, wrong custom fields. The quality gate state machine (draft → indexed → needs_research → researched → complete) exists in the design but is never enforced in code: items advance automatically rather than requiring explicit confirmation for uncertain classifications.

Why it happens: Multimodal LLMs have high confidence scores even on wrong answers for visually ambiguous hardware. The three-tier pipeline is designed to handle this but the escalation triggers are undefined — there is no concrete threshold for "local model is uncertain, escalate to research agent."

How to avoid: The local indexer (Gemma 4) must return a structured confidence score alongside classification. Define hard thresholds: below 0.7 confidence, item is pinned at needs_research and flagged for manual review or automatic SearXNG escalation. Never auto-advance past indexed without either (a) a confidence score above threshold or (b) explicit operator confirmation. Store the raw AI response and confidence in the NetBox record's custom fields so the decision is auditable.
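
A minimal sketch of the threshold rule follows. The status names and the 0.7 threshold come from the text above; the `Classification` struct and both function names are hypothetical, and the sketch treats a score exactly at the threshold as passing.

```go
package main

import "fmt"

type Status string

const (
	Draft         Status = "draft"
	Indexed       Status = "indexed"
	NeedsResearch Status = "needs_research"
	Researched    Status = "researched"
	Complete      Status = "complete"
)

// MinConfidence mirrors the 0.7 threshold; it would normally be loaded
// from config rather than compiled in.
const MinConfidence = 0.7

// Classification is a hypothetical shape for the indexer's structured output.
type Classification struct {
	DeviceType   string
	Manufacturer string
	Confidence   float64 // 0.0 to 1.0 as reported by the model
}

// AfterIndexing decides where an item lands once the local indexer has run.
func AfterIndexing(c Classification) Status {
	if c.Confidence < MinConfidence {
		return NeedsResearch
	}
	return Indexed
}

// CanAdvancePastIndexed allows progress only with a passing score or an
// explicit operator confirmation.
func CanAdvancePastIndexed(c Classification, operatorConfirmed bool) bool {
	return operatorConfirmed || c.Confidence >= MinConfidence
}

func main() {
	riser := Classification{DeviceType: "USB hub", Confidence: 0.55}
	fmt.Println(AfterIndexing(riser), CanAdvancePastIndexed(riser, false))
}
```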

Warning signs:

All items advance to complete status immediately after intake
NetBox records show wrong device_type or manufacturer for known items you can visually verify
SearXNG and OpenRouter tiers are never triggered in practice

Phase to address: AI pipeline phase. Define confidence thresholds in config before wiring up the intake flow. Build the quality gate state machine before building the happy path.

Pitfall 5: PRT Qutie Protocol Unknown — Blocking Hardware Dependency

What goes wrong: The PRT Qutie uses Bluetooth-to-app communication as its primary documented interface. USB raw protocol documentation does not exist publicly. If reverse engineering reveals the device only accepts commands via its proprietary Bluetooth stack (not raw USB serial), the entire label printing architecture needs to change. This blocks any phase that delivers end-to-end intake (photo → record → label).

Why it happens: The hardware was ordered before protocol feasibility was confirmed. The PROJECT.md correctly notes "protocols need reverse-engineering once hardware arrives," but this risk is not explicitly sized: it could be one day of work or two weeks.

How to avoid: On hardware arrival (2026-04-13), the first action is protocol characterization, not feature development. Capture raw USB traffic with Wireshark (USBPcap on Windows, usbmon on Linux; on macOS, Wireshark can capture from the XHC USB controller interface, though this may require relaxing System Integrity Protection). Test whether the device enumerates as a CDC-ACM or HID device. If only Bluetooth is functional, pivot to the macOS CoreBluetooth framework via a CGo shim or a small helper process. Have a fallback plan ready: a USB label printer with a published command reference (the Brother QL-820NWBc is well documented, although Brother QL printers use Brother's raster command set rather than ZPL).

Warning signs:

lsusb (or macOS system_profiler SPUSBDataType) shows device as HID-only, not CDC-ACM
No /dev/cu.* device appears when connected
Manufacturer app communicates exclusively via Bluetooth, USB only charges

Phase to address: Hardware characterization spike — must complete before committing to any label printing architecture. Do this in the first sprint after hardware arrival.

Pitfall 6: 16GB Unified Memory — Gemma 4 Leaves No Headroom for the Rest of the Stack

What goes wrong: On the Mac Mini M4 with 16GB unified memory, the Gemma 4 E4B model (4-bit quantized) needs approximately 5-8GB for model weights plus KV cache. Add oMLX, the Go backend, the React dev server, NetBox API traffic (NetBox itself runs on a separate LXC), Proxmox overhead, and macOS itself, and with everything running simultaneously memory pressure triggers macOS memory compression and swap. Inference slows catastrophically. Worse: at long context windows (intake photo + product research + NetBox context), the KV cache grows and throughput drops beyond 8K tokens.

Why it happens: Memory estimates are done in isolation. "Gemma 4 E4B fits in 16GB" is true in isolation and false in production with everything else running.

How to avoid: Run a memory profiling session before any feature development: load oMLX with Gemma 4 E4B, run the Go backend, open a browser tab, and watch Activity Monitor's memory pressure indicator and swap usage. If memory pressure is yellow/red, either (a) use 26B A4B with TurboQuant only for the research agent tier (not the fast indexer tier), (b) set oMLX's max concurrent requests to 1, or (c) shut down other Mac Mini workloads during intake sessions. Document the working memory budget and enforce it in oMLX config from day one.

Warning signs:

macOS memory pressure bar is yellow or red during normal operation
vm_stat shows high pageouts or swapins during inference
Inference latency spikes from ~2s to 20s+ without apparent cause
oMLX logs "out of memory" or silently returns truncated completions

Phase to address: Infrastructure setup phase (before any AI pipeline work). Run the memory budget test as a gating condition before committing to model selection.

Pitfall 7: NetBox Custom Fields — Write Format Differs from Read Format

What goes wrong: When reading a NetBox object via the REST API, custom_fields is a dictionary keyed by field name, and object-reference fields come back as nested objects with id, url, display, name. When writing (PATCH/PUT), the same dictionary must carry only the integer ID (or an array of integer IDs for multi-object fields) for object-type fields, and plain values for scalar fields. Go structs generated from the OpenAPI spec (or hand-coded) that reuse the same type for read and write will silently drop custom field updates: the PATCH succeeds with HTTP 200 but the field is not updated.

Why it happens: NetBox's REST API has asymmetric read/write representations for custom fields. The official go-netbox client reflects this asymmetry but it is non-obvious. Community discussions confirm this trips up almost everyone working with custom fields programmatically.

How to avoid: Write integration tests that (1) PATCH a custom field on a real NetBox object, (2) immediately GET the object back, and (3) assert the field value matches what was sent. Never assume a 200 response means the field was written. Create separate Go structs for the read and write representations of custom fields.

Warning signs:

PATCH returns 200 but NetBox UI shows field still empty
Custom fields look correct in test with scalar types but break with object-reference fields
NetBox API returns {"custom_fields": {"hwlab_status": null}} after a write

Phase to address: NetBox integration phase. Write the custom field round-trip test before building any intake workflow that depends on custom fields.

Technical Debt Patterns

Shortcut	Immediate Benefit	Long-term Cost	When Acceptable
Hard-code `/dev/cu.usbmodem*` paths	Works in 5 minutes	Wrong device targeted after any replug; non-recoverable in prod	Never — use VID/PID enumeration from the start
Skip write queue, call NetBox synchronously in intake handler	Simpler code	Intake blocks on NetBox latency (50-200ms per call); app dead during NetBox downtime	Never for the main intake path
Auto-advance quality gate without confidence check	All items reach `complete` fast	NetBox fills with wrong data; impossible to clean up at scale	Never — the quality gate is the whole product value
Share one serial port handle across goroutines with a mutex	Avoids per-device abstraction	Deadlock if one goroutine is in a long read while another needs to write	Only acceptable for single-device prototype, never for multi-device production
Use `time.Sleep` polling loop to check for new USB devices	Simple device detection	Wastes CPU, misses hot-plug events, introduces latency	Never on macOS — use `IOKit` notification or enumerate on each operation
One monolithic AI prompt for all hardware types	Simpler prompt engineering	Low accuracy for visually ambiguous items; no structured output to parse	Only during initial prompt development, never in production

Integration Gotchas

Integration	Common Mistake	Correct Approach
NetBox REST API	Using `PUT` when you mean `PATCH` — PUT requires ALL mandatory fields	Always use `PATCH` for partial updates; build a `PatchNetBoxObject(id, fields)` helper
NetBox custom fields	Passing write payload as read format (nested objects instead of ID arrays)	Maintain separate Go types for read (`CustomFieldValue`) and write (`CustomFieldPatch`)
oMLX / mlx-lm inference	Treating inference as a fast synchronous call with a short (5s) timeout	Use SSE streaming for long inference; set generous timeouts (60-120s); handle partial stream failures
FNIRSI FNB58	Assuming energy/capacity values come directly from device	Device sends raw voltage and current samples at 10ms intervals; integrate on the host to obtain energy and capacity. Use `baryluk/fnirsi-usb-power-data-logger` as reference implementation
SearXNG	Sending raw AI-generated product names as queries	Sanitize queries; extract make/model tokens from AI output before querying; request `?format=json` and handle JSON parse errors gracefully
OpenRouter	No per-request cost cap	Set `max_tokens` hard limit on every OpenRouter call; log tokens consumed per pipeline run; set account-level spend limit in OpenRouter dashboard
NetBox inventory plugin (netbox-inventory)	Assuming plugin custom fields are available immediately after install	Custom fields must be created and assigned to object types via API or UI after plugin install; verify with a test GET before intake flow depends on them

Performance Traps

Trap	Symptoms	Prevention	When It Breaks
Synchronous NetBox API calls on the intake hot path	Intake UI freezes for 100-500ms per item; label doesn't print until all NetBox writes succeed	Write-ahead queue in SQLite; async flush to NetBox	Day 1 with real network latency to LXC 130
Unbounded KV cache during long research agent prompts	Inference latency spikes from 2s to 20s+; oMLX logs memory warnings	Set `max_tokens` on research agent tier; keep prompts focused; use `--kv-bits 4` in oMLX config	Context window >8K tokens on M4 16GB
Polling USB device state on a timer	CPU spike every N ms; goroutine accumulation if timer fires faster than reads complete	Event-driven reconnect with IOKit notifications (or enumerate once at request time)	With 3+ USB devices continuously polled
Loading full NetBox inventory for dashboard on every page load	Dashboard takes 3-5s to load; NetBox gets hammered with large paginated requests	Cache dashboard data locally (SQLite) with TTL; paginate lazily; never load all records into memory at once	Once NetBox has >200 items
React re-rendering entire device status panel on every SSE event	UI stutters during active cable testing with rapid updates	Use `useMemo`/`useCallback`, key by device ID, throttle SSE event processing to 10fps max	When SSE fires >5 events/second during live power testing

Security Mistakes

Mistake	Risk	Prevention
Storing NetBox API token in plain Go config file	Token leaked via git, readable by any process	Load from environment variable or macOS Keychain; never commit token; add `config.json` to `.gitignore`
Passing user-supplied image data directly to Gemma without size/type validation	OOM on large images; potential prompt injection via text embedded in the image	Validate MIME type and max size (e.g., 10MB) before passing to inference; resize/normalize before encoding to base64
No authentication on HWLab Go backend	Any device on the homelab network can add/modify NetBox records via HWLab	Even for solo use, add a simple bearer token or basic auth to the Go API; the backend has write access to NetBox
OpenRouter API key in frontend bundle	Key exposed to any browser that loads the app	OpenRouter calls must go through Go backend only; never expose key to frontend
SearXNG queries logged with full hardware descriptions	Sensitive inventory information in SearXNG logs	SearXNG is self-hosted so this is lower risk, but keep queries minimal — send model numbers, not full descriptions

UX Pitfalls

Pitfall	User Impact	Better Approach
No progress feedback during AI intake (photo upload → result)	Operator thinks the app is frozen during 5-20s inference; submits photo again	Show SSE-streamed inference progress: "Indexing... Researching... Creating record..." with each tier's state
Quality gate status shown as enum code (`needs_research`) not human label	Operator confused about what action is needed	Display human labels: "Needs research", "Ready to print", "Complete" with action buttons per state
Label prints before NetBox record is confirmed	QR code points to a record that may not exist if NetBox write fails	Print label only after NetBox write is confirmed (or queue is flushed); show "printing..." state
Cable test results shown as raw hex/bytes	Operator can't interpret pass/fail	Parse protocol data into human-readable result: "All 8 conductors — PASS", "Pin 4 open — FAIL"
No way to correct a wrong AI classification without going into NetBox UI	AI errors require leaving HWLab	Provide inline edit for manufacturer/model/type on the intake confirmation screen before committing to NetBox

"Looks Done But Isn't" Checklist

USB device layer: Can handle all three devices connected simultaneously AND hot-unplug/replug of any one without affecting the others — verify with integration test
Label printing: QR code URL resolves to actual NetBox record (not 404) — verify that HW-XXXXX ID is written to NetBox before label is generated
AI intake pipeline: Confidence threshold enforcement is active — verify that a deliberately ambiguous photo does NOT auto-advance past indexed status
NetBox custom fields: Round-trip write+read test passes for all HWLab-specific custom fields — verify with a dedicated test script before intake is considered working
Three-tier escalation: Tier 2 (SearXNG) and Tier 3 (OpenRouter) are actually triggered in practice — verify by running an item that is genuinely ambiguous and watching pipeline logs
Memory budget: Mac Mini M4 stays out of memory pressure (green in Activity Monitor) with oMLX loaded and backend running — verify before declaring inference "working"
NetBox downtime handling: If NetBox LXC is shut down mid-intake, the operation queues and resumes cleanly — verify with a chaos test

Recovery Strategies

Pitfall	Recovery Cost	Recovery Steps
Hard-coded USB paths in production	MEDIUM	Refactor device manager to VID/PID enumeration; update all open() calls; test full replug cycle
NetBox filled with wrong AI classifications	HIGH	Write a NetBox API script to bulk-set affected items back to `draft` status; re-run intake on each item; no shortcut
Goroutine leak accumulation over long session	LOW	Restart Go backend; investigate with `pprof`; add goroutine count metric to telemetry endpoint
PRT Qutie only works via Bluetooth	MEDIUM	Pivot to CoreBluetooth CGo shim or external helper process; estimated 3-5 days additional work
16GB memory exhausted during intake session	LOW	Restart oMLX; switch to smaller model variant; add memory monitoring to runbook
OpenRouter spend spike from runaway escalation	LOW-MEDIUM	Set account spend limit in OpenRouter dashboard; add per-run token counter with hard cutoff in Go pipeline code
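
The per-run token counter with a hard cutoff mentioned in the last row might look like this sketch. `RunBudget`, `Reserve`, and `Charge` are hypothetical names; the idea is that the pipeline asks for a cap before each remote call, passes that cap as `max_tokens`, and records what the call actually consumed.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBudgetExceeded = errors.New("token budget exhausted for this pipeline run")

// RunBudget tracks tokens consumed across all remote calls in one run.
type RunBudget struct {
	Limit int
	used  int
}

// Reserve returns the max_tokens value to use for the next call, capped at
// what is left in the budget, or ErrBudgetExceeded if nothing is left.
func (b *RunBudget) Reserve(want int) (int, error) {
	remaining := b.Limit - b.used
	if remaining <= 0 {
		return 0, ErrBudgetExceeded
	}
	if want > remaining {
		want = remaining
	}
	return want, nil
}

// Charge records the token usage reported by a completed call.
func (b *RunBudget) Charge(tokens int) { b.used += tokens }

func main() {
	budget := &RunBudget{Limit: 1000}
	for call := 1; ; call++ {
		maxTokens, err := budget.Reserve(600)
		if err != nil {
			fmt.Println("call", call, "blocked:", err)
			return
		}
		fmt.Println("call", call, "max_tokens =", maxTokens)
		budget.Charge(maxTokens) // assume the call used its full allowance
	}
}
```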

Pitfall-to-Phase Mapping

Pitfall	Prevention Phase	Verification
USB path churn	USB device layer (first hardware phase)	Integration test: replug all 3 devices in random order, verify correct device responds
Goroutine leak on disconnect	USB device layer	Goroutine count stable after 10 replug cycles (test with `pprof`)
NetBox downtime / no offline mode	NetBox integration phase	Chaos test: kill NetBox mid-intake, verify queue persists and resumes
AI misidentification / quality gate bypass	AI pipeline phase	Ambiguous photo stays at `needs_research`; confident photo reaches `indexed` pending confirmation
PRT Qutie protocol unknown	Hardware characterization spike (day of hardware arrival)	USB traffic captured and protocol characterized before architecture is committed
16GB memory exhaustion	Infrastructure setup phase	Memory pressure remains green during full stack with oMLX loaded
NetBox custom field write/read asymmetry	NetBox integration phase	Round-trip test: PATCH field, GET object, assert value matches
Three-tier escalation never triggers	AI pipeline phase	Log shows tier promotions happening on genuinely ambiguous items
Runaway OpenRouter spend	AI pipeline phase	`max_tokens` set on every OpenRouter call; spend limit set in dashboard; per-run cost logged

Sources

Pitfalls research for: HWLab — Go + USB serial + local AI inference + NetBox Researched: 2026-04-09
