
# Format Conversion Ecosystem
**Project:** Nexus v1.7 — supplemental research for two-tier format conversion system
**Researched:** 2026-04-04
**Scope:** Direct conversion tools, format registry pattern, AI-bridged conversion boundary, UI patterns, security and performance pitfalls
**Confidence:** HIGH for tool choices, MEDIUM for version numbers (npm registry cross-checked)
---
## Context: What Is Already Available
These are confirmed installed and must not be re-added:
| Package | Location | Version | Relevant To Conversion |
|---------|----------|---------|----------------------|
| `sharp ^0.34.5` | `server/` | 0.34.5 | Raster image conversion (resize, format, SVG→PNG) |
| `ffmpeg-static ^5.3.0` | `server/` | 5.3.0 | Audio/video conversion binary |
| `mermaid ^11.12.0` | `ui/` | 11.12.0 | Client-side Mermaid rendering |
| `playwright-chromium ^1.50.0` | `server/` | 1.50.0 | HTML→PDF already decided in STACK.md |
---
## Tier 1: Direct Conversion Tools
### Image Formats
**Primary: `sharp ^0.34.5` (already installed)**
Sharp handles the majority of image format pairs without any additional dependency:
| Source | Target | Method |
|--------|--------|--------|
| JPEG/PNG/WebP/AVIF/TIFF/GIF | Any raster | `sharp(input).toFormat('webp').toBuffer()` |
| SVG | PNG/JPEG/WebP | `sharp(svgBuffer).png().toBuffer()` — libvips handles SVG via librsvg |
| PNG/JPEG | WebP/AVIF | `sharp(input).webp({ quality: 80 }).toBuffer()` |
**Important SVG caveat:** sharp's SVG→PNG conversion uses librsvg. It works well for most SVGs but does NOT support all CSS features. For agent-generated SVGs with embedded fonts (produced by `satori`), use `@resvg/resvg-js` as specified in STACK.md. For user-uploaded SVGs without special fonts, `sharp` is sufficient.
**No ImageMagick needed.** ImageMagick via CLI or WASM adds complexity:
- `imagemagick` npm is an unmaintained CLI wrapper (last release 2020)
- WASM ImageMagick (`@imagemagick/magick-wasm`) works but runs at ~0.3× native speed
- `sharp` via libvips is typically 4-5× faster than ImageMagick for the supported format pairs
- For the format pairs Nexus needs (JPEG, PNG, WebP, AVIF, TIFF, SVG→raster), sharp covers everything
**No Inkscape needed** for Nexus's scope (vector conversion beyond SVG→PNG is out of scope for v1.7).
---
### Audio / Video Formats
**Wrapper: `fluent-ffmpeg ^2.1.3`**
**Important maintenance note:** The `fluent-ffmpeg` repository was archived on May 22, 2025. The package is no longer receiving new features. However:
- It remains published on npm and functional with Node.js >=18
- The `ffmpeg-static ^5.3.0` binary it wraps is still actively maintained
- `@types/fluent-ffmpeg ^2.1.28` provides TypeScript types (last updated October 2025)
- For Nexus's use case (spawn ffmpeg with known args), the archived state is low risk
- Alternative: write a thin `child_process.spawn` wrapper directly — this is ~30 lines and removes the archived dependency entirely
**Recommendation for Nexus:** Implement a minimal `ffmpegConvert(inputPath, outputPath, extraArgs)` wrapper using `child_process.spawn` instead of taking on the archived `fluent-ffmpeg`. The full fluent API is unnecessary — format conversion is a single `ffmpeg -i input.mp4 output.webm` call.
```typescript
// server/src/services/converters/ffmpeg-converter.ts
import { spawn } from "child_process";
import ffmpegPath from "ffmpeg-static";

export function ffmpegConvert(
  inputPath: string,
  outputPath: string,
  extraArgs: string[] = []
): Promise<void> {
  return new Promise((resolve, reject) => {
    // ffmpeg-static resolves to null on platforms without a bundled binary
    if (!ffmpegPath) {
      reject(new Error("ffmpeg-static did not provide a binary for this platform"));
      return;
    }
    const proc = spawn(ffmpegPath, [
      "-i", inputPath,
      ...extraArgs,
      "-y", // overwrite output
      outputPath,
    ]);
    // Without an "error" listener, a failed spawn (e.g. ENOENT) would throw
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`ffmpeg exited ${code}`))
    );
    proc.stderr.on("data", () => {}); // consume stderr to prevent backpressure
  });
}
```
**Format coverage via ffmpeg-static:**
| Source | Target | Extra args |
|--------|--------|-----------|
| MP4/MKV/AVI/MOV | WebM | `["-c:v", "libvpx-vp9", "-c:a", "libopus"]` |
| MP4/WebM | MP3 | `["-vn", "-c:a", "libmp3lame"]` (audio extract; for AAC output use `"-c:a", "aac"` instead) |
| MP3/WAV/FLAC/OGG | MP3 | `["-c:a", "libmp3lame", "-b:a", "192k"]` |
| MP3/WAV/OGG | WAV | `["-c:a", "pcm_s16le"]` |
| Image sequence | MP4 | `["-r", "30", "-c:v", "libx264"]` |
| MP4 | GIF | `["-vf", "fps=10,scale=640:-1:flags=lanczos"]` |
| Any video | Audio-only MP3 | `["-vn"]` |
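The table above can be held as data so that callers never assemble ffmpeg flags by hand. The following is a minimal sketch under that assumption; `FFMPEG_PRESETS` and `ffmpegArgsFor` are hypothetical names, and only a few rows of the table are copied in.

```typescript
// Hypothetical preset table: keys are "source/target" extensions, values are
// the extra ffmpeg arguments taken from the coverage table.
const FFMPEG_PRESETS: Readonly<Record<string, readonly string[]>> = {
  "mp4/webm": ["-c:v", "libvpx-vp9", "-c:a", "libopus"],
  "mp4/mp3": ["-vn", "-c:a", "libmp3lame"],
  "wav/mp3": ["-c:a", "libmp3lame", "-b:a", "192k"],
  "mp3/wav": ["-c:a", "pcm_s16le"],
  "mp4/gif": ["-vf", "fps=10,scale=640:-1:flags=lanczos"],
};

// Returns a fresh copy of the preset args, or undefined when the pair is not listed.
export function ffmpegArgsFor(
  sourceExt: string,
  targetExt: string
): string[] | undefined {
  const key = `${sourceExt.toLowerCase()}/${targetExt.toLowerCase()}`;
  const preset = FFMPEG_PRESETS[key];
  return preset ? [...preset] : undefined;
}
```

A handler could then call `ffmpegConvert(input, output, ffmpegArgsFor("mp4", "webm") ?? [])`, keeping every flag server-defined.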
---
### Documents
#### DOCX to HTML: `mammoth ^1.12.0`
Mammoth converts `.docx` → HTML with semantic preservation. It does NOT create DOCX.
```bash
pnpm --filter @paperclipai/server add mammoth
```
```typescript
import mammoth from "mammoth";
const { value: html } = await mammoth.convertToHtml({ path: docxPath });
// html is a clean HTML string; images are embedded as base64 data URIs by default
```
**Why mammoth over pandoc for DOCX→HTML:** Mammoth preserves heading hierarchy, tables, lists, and images correctly. Its output is cleaner HTML than pandoc's for Word documents. Single-purpose library, no system binary required.
**TypeScript types:** Included in the package since v1.7 (`@types/mammoth` not needed).
**Confidence: HIGH** — official npm, v1.12.0, actively maintained (last publish 20 days ago).
---
#### Markdown → DOCX / PDF / HTML: Pandoc (system binary + thin wrapper)
**Integration approach:** Pandoc is a Haskell binary — there is no pure-Node.js implementation. Existing Node.js wrappers are thin `child_process` shims:
- `node-pandoc ^0.2.7` — 13K weekly downloads, most popular, but last updated 2021
- `pandoc-ts` — TypeScript wrapper, smaller community
- **Recommended: write a 20-line `child_process.spawn` wrapper** — same as ffmpeg approach
```typescript
// server/src/services/converters/pandoc-converter.ts
import { spawn } from "child_process";

export function pandocConvert(
  inputPath: string,
  outputPath: string,
  from: string,
  to: string
): Promise<void> {
  return new Promise((resolve, reject) => {
    const proc = spawn("pandoc", [
      inputPath,
      "-f", from,
      "-t", to,
      "-o", outputPath,
    ]);
    // Fires when pandoc is not installed (ENOENT); without it the promise never settles
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`pandoc exited ${code}`))
    );
    proc.stderr.on("data", () => {}); // consume stderr to prevent backpressure
  });
}
```
**Supported format pairs via pandoc:**
| Source | Target |
|--------|--------|
| Markdown | DOCX, HTML, RST, LaTeX, EPUB |
| RST | Markdown, HTML, DOCX |
| HTML | Markdown, DOCX |
| LaTeX | HTML, Markdown |
| DOCX | Markdown (lossier than mammoth for HTML) |
**System dependency:** pandoc must be installed on the Mac Mini. Install via `brew install pandoc`. Check at server startup with `which pandoc`; if absent, degrade gracefully.
**Confidence: HIGH** — pandoc is the de-facto standard; brew install is 1 command; child_process wrapper is trivial.
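The startup check can also be done without shelling out to `which`. This is a sketch, not the project's implementation: `hasBinary` and `systemDeps` are hypothetical names, and it assumes the binary accepts `--version` and exits 0 (pandoc does).

```typescript
import { spawnSync } from "node:child_process";

// Returns true when `binary --version` can be spawned and exits with code 0.
// A missing binary surfaces as result.error (ENOENT) with a null status.
export function hasBinary(binary: string): boolean {
  const result = spawnSync(binary, ["--version"], {
    stdio: "ignore",
    timeout: 5000,
  });
  return result.error === undefined && result.status === 0;
}

// Hypothetical startup usage: compute once, consult when registering converters.
export const systemDeps = {
  pandoc: hasBinary("pandoc"),
};
```

Running the check once at startup keeps the per-request path free of process spawns.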
---
#### DOCX / ODT / PPTX → PDF: LibreOffice headless
**Package: `libreoffice-convert ^1.8.1`**
```bash
pnpm --filter @paperclipai/server add libreoffice-convert
# System dependency: brew install --cask libreoffice
```
```typescript
import { convertAsync } from "libreoffice-convert";
import fs from "fs/promises";
const inputBuffer = await fs.readFile(docxPath);
const pdfBuffer = await convertAsync(inputBuffer, ".pdf", undefined);
```
**Format pairs:**
| Source | Target |
|--------|--------|
| DOCX / DOC | PDF, ODT, HTML |
| PPTX / PPT | PDF, ODP |
| XLSX / XLS | PDF, ODS, CSV |
| ODT / ODP / ODS | PDF, DOCX, PPTX |
**Performance warning:** LibreOffice has a slow cold start: roughly 3-5 seconds on the first call, about 500ms on warm subsequent calls. Serialize LibreOffice jobs and run at most one conversion at a time.
**System dependency:** LibreOffice must be installed at `/Applications/LibreOffice.app` on macOS. Check at server startup; degrade gracefully if absent.
**Confidence: MEDIUM** — package v1.8.1 confirmed. macOS arm64 LibreOffice runs natively on M4 (confirmed via LibreOffice download page). Single source for LibreOffice npm package maintenance status.
---
#### HTML → PDF: playwright-chromium (already decided in STACK.md)
Use `playwright-chromium ^1.50.0` for HTML→PDF. Already researched; do not re-add.
---
### Data Formats
#### Spreadsheets: `xlsx` (SheetJS Community Edition)
**Package: `xlsx ^0.20.x`** (SheetJS Community Edition)
```bash
pnpm --filter @paperclipai/server add xlsx
```
**Format coverage:**
| Source | Target | Method |
|--------|--------|--------|
| XLSX / XLS / ODS | CSV | `XLSX.utils.sheet_to_csv(ws)` |
| XLSX / XLS / ODS | JSON | `XLSX.utils.sheet_to_json(ws)` |
| CSV / JSON | XLSX | `XLSX.utils.json_to_sheet(data)` → `XLSX.writeFile(wb, path)` |
SheetJS has ~7.8M weekly downloads and handles all Excel formats including legacy `.xls`. It is the default choice with no alternatives needed for basic spreadsheet conversion.
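**Licensing note:** SheetJS Community Edition is free (Apache 2.0 for historical versions; check current license at install time). SheetJS Pro adds streaming for very large files, which is not needed at single-user scale.
**Distribution note:** the `xlsx` package on the npm registry stopped at 0.18.5; 0.20.x releases are published as tarballs on the SheetJS CDN, so a plain `pnpm add xlsx` is not expected to resolve to 0.20.x. Confirm the install source when adding the dependency.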
**Confidence: MEDIUM-HIGH** — widely used, 7.8M weekly downloads confirmed. Version 0.20.x confirmed. License nuance warrants a check at install time.
---
#### CSV Parsing: `csv-parse ^6.2.1` (part of the `csv` ecosystem)
```bash
pnpm --filter @paperclipai/server add csv-parse
```
The `csv-parse ^6.2.1` package (latest as of April 2026) implements `stream.Transform` and supports both streaming and synchronous/callback modes. It includes TypeScript types.
**When to use:** Parsing user-uploaded CSV files before transformation (CSV→JSON, CSV→XLSX). For generating CSV output from JSON/objects, use SheetJS or `csv-stringify` (same ecosystem as `csv-parse`).
**Confidence: HIGH** — v6.2.1 confirmed from npm search. Maintained (last publish 4 days ago).
---
#### JSON ↔ CSV: Use `csv-parse` + `csv-stringify` (not `json2csv`)
The `json2csv` package is in maintenance mode at v6.0.0-alpha (3 years old). The `json-2-csv` package (v5.5.10, different package) is active but adds a dependency for something `csv-stringify` already handles.
**Recommendation:** Use `csv-stringify ^6.x` (same ecosystem as `csv-parse`, same maintainer) for JSON→CSV. This avoids pulling in a separate package.
```bash
pnpm --filter @paperclipai/server add csv-stringify
```
---
### Code Formats
#### Code Formatting (JS/TS/CSS/HTML): `prettier ^3.x`
```bash
# Prettier is called programmatically by the server at runtime,
# so it is a regular dependency (not --save-dev)
pnpm --filter @paperclipai/server add prettier
```
Prettier exposes a programmatic API:
```typescript
import { format } from "prettier";
const formatted = await format(sourceCode, {
  parser: "typescript", // or "babel", "css", "html", "markdown", etc.
  semi: true,
  singleQuote: true,
});
```
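**Use case:** Agent generates code → prettier formats it before saving. This also covers pure reformatting requests such as "reformat this JSON". Prettier only changes layout and never rewrites syntax, so a request like "convert CommonJS to ESM" is outside what it can do.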
**Confidence: HIGH** — Prettier API is well-documented at prettier.io/docs/api.
---
#### TypeScript Type Generation (JSON Schema → TypeScript): `json-schema-to-typescript ^15.x`
When an agent needs to turn a JSON Schema into TypeScript type definitions, this library does it deterministically, with no LLM call required:
```bash
pnpm --filter @paperclipai/server add json-schema-to-typescript
```
```typescript
import { compile } from "json-schema-to-typescript";
const ts = await compile(jsonSchema, "MyType");
```
**Confidence: MEDIUM** — widely used library; version verified as 15.x on npm (as of late 2025). Use for JSON Schema→TypeScript specifically; TypeScript→TypeScript reformatting uses the compiler API or prettier.
---
## Format Coverage Matrix
| Source → | PNG | JPEG | WebP | AVIF | SVG | PDF | DOCX | XLSX | CSV | JSON | MP4 | MP3 | WebM |
|----------|-----|------|------|------|-----|-----|------|------|-----|------|-----|-----|------|
| PNG | — | sharp | sharp | sharp | — | playwright | — | — | — | — | — | — | — |
| JPEG | sharp | — | sharp | sharp | — | playwright | — | — | — | — | — | — | — |
| WebP | sharp | sharp | — | sharp | — | playwright | — | — | — | — | — | — | — |
| SVG | sharp/@resvg | sharp/@resvg | sharp/@resvg | — | — | playwright | — | — | — | — | — | — | — |
| DOCX | — | — | — | — | — | LibreOffice | — | — | — | — | — | — | — |
| PPTX | — | — | — | — | — | LibreOffice | — | — | — | — | — | — | — |
| XLSX/XLS | — | — | — | — | — | LibreOffice | — | — | SheetJS | SheetJS | — | — | — |
| HTML | — | — | — | — | — | playwright | pandoc† | — | — | — | — | — | — |
| Markdown | — | — | — | — | — | pandoc | pandoc | — | — | — | — | — | — |
| CSV | — | — | — | — | — | AI-bridged | — | SheetJS | — | csv-parse | — | — | — |
| JSON | — | — | — | — | — | AI-bridged | — | SheetJS | csv-stringify | — | — | — | — |
| MP4/MKV | — | — | — | — | — | — | — | — | — | — | — | ffmpeg | ffmpeg |
| MP3/WAV | — | — | — | — | — | — | — | — | — | — | — | — | — |
| WAV/OGG | — | — | — | — | — | — | — | — | — | — | — | ffmpeg | — |
† HTML→DOCX requires pandoc (mammoth is one-way: DOCX→HTML only)
**AI-bridged**: Format pairs without a deterministic tool path. See Tier 2 below.
---
## Format Registry Pattern
### Dispatch Table Design
The registry is a map of `"source/target"` → handler function. This is simpler than a class hierarchy and matches the existing Nexus factory function pattern.
```typescript
// server/src/services/converters/registry.ts
export type ConversionHandler = (
  inputPath: string,
  outputPath: string,
  opts?: Record<string, unknown>
) => Promise<void>;

export type ConversionCapability = "direct" | "ai-bridged" | "unavailable";

export interface ConversionRoute {
  capability: ConversionCapability;
  handler?: ConversionHandler; // present when capability = "direct"
  aiHint?: string; // present when capability = "ai-bridged"
  requiresSystemDep?: string; // e.g. "pandoc", "libreoffice"
}

// Key format: "source/target" extensions, e.g. "png/webp"; always lowercase, no leading dot
const registry = new Map<string, ConversionRoute>();

export function registerConverter(
  sourceExt: string,
  targetExt: string,
  route: ConversionRoute
): void {
  registry.set(`${sourceExt}/${targetExt}`, route);
}

export function getConverter(sourceExt: string, targetExt: string): ConversionRoute {
  return registry.get(`${sourceExt}/${targetExt}`) ?? { capability: "unavailable" };
}

export function listSupportedTargets(sourceExt: string): string[] {
  const results: string[] = [];
  for (const [key, route] of registry.entries()) {
    if (key.startsWith(`${sourceExt}/`) && route.capability !== "unavailable") {
      results.push(key.split("/")[1]);
    }
  }
  return results;
}
```
**Registration (in server startup):**
```typescript
// server/src/services/converters/index.ts
import { registerConverter } from "./registry";
import { sharpConvert } from "./sharp-converter";
import { ffmpegConvert } from "./ffmpeg-converter";
import { pandocConvert } from "./pandoc-converter";
// The three imports below assume one module per converter; the paths are placeholders
import { mammothConvert } from "./mammoth-converter";
import { libreofficeConvert } from "./libreoffice-converter";
import { sheetjsConvert } from "./sheetjs-converter";

// Image
registerConverter("png", "webp", { capability: "direct", handler: (i, o) => sharpConvert(i, o, "webp") });
registerConverter("jpg", "webp", { capability: "direct", handler: (i, o) => sharpConvert(i, o, "webp") });
registerConverter("svg", "png", { capability: "direct", handler: (i, o) => sharpConvert(i, o, "png") });

// Documents
registerConverter("docx", "html", { capability: "direct", handler: mammothConvert, requiresSystemDep: undefined });
registerConverter("docx", "pdf", { capability: "direct", handler: libreofficeConvert, requiresSystemDep: "libreoffice" });
registerConverter("md", "docx", { capability: "direct", handler: (i, o) => pandocConvert(i, o, "markdown", "docx"), requiresSystemDep: "pandoc" });

// Data
registerConverter("xlsx", "csv", { capability: "direct", handler: sheetjsConvert });
registerConverter("csv", "xlsx", { capability: "direct", handler: sheetjsConvert });
registerConverter("csv", "pdf", { capability: "ai-bridged", aiHint: "format as a formatted table report PDF" });
registerConverter("json", "pdf", { capability: "ai-bridged", aiHint: "render as a structured document report" });

// Audio/Video
registerConverter("mp4", "webm", { capability: "direct", handler: (i, o) => ffmpegConvert(i, o, ["-c:v", "libvpx-vp9"]) });
registerConverter("mp3", "wav", { capability: "direct", handler: (i, o) => ffmpegConvert(i, o, ["-c:a", "pcm_s16le"]) });
```
**Extensibility:** Adding a new format pair is one `registerConverter()` call; handlers and the registry are decoupled. System dependency checks should run at registration time, not at request time: when a required system dep is absent, the converter is registered as `capability: "unavailable"`. The registration snippet above leaves that check out for brevity.
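A minimal sketch of how a dependency check could gate registration follows. It uses a trimmed copy of the route type (no handler field), and `registerWithDepCheck` and the injected `isAvailable` predicate are hypothetical names rather than part of the registry shown earlier.

```typescript
type ConversionCapability = "direct" | "ai-bridged" | "unavailable";

interface ConversionRoute {
  capability: ConversionCapability;
  requiresSystemDep?: string;
}

const registry = new Map<string, ConversionRoute>();

// Hypothetical variant of registerConverter(): `isAvailable` is an injected
// predicate, for example one backed by a startup check for each binary.
export function registerWithDepCheck(
  sourceExt: string,
  targetExt: string,
  route: ConversionRoute,
  isAvailable: (dep: string) => boolean
): ConversionRoute {
  const dep = route.requiresSystemDep;
  // Downgrade the route when its system dependency is missing.
  const effective: ConversionRoute =
    dep !== undefined && !isAvailable(dep)
      ? { capability: "unavailable", requiresSystemDep: dep }
      : route;
  registry.set(`${sourceExt}/${targetExt}`, effective);
  return effective;
}
```

Injecting the predicate keeps the registry testable without pandoc or LibreOffice installed.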
---
## Tier 2: AI-Bridged Conversion
### When to Use AI (vs Direct Tool)
| Criterion | Direct Tool | AI-Bridged |
|-----------|-------------|------------|
| Output is byte-for-byte deterministic | Yes | No |
| Format pair has an established tool | Yes | No |
| Conversion is purely structural (no semantic change) | Yes | — |
| Conversion requires understanding content meaning | — | Yes |
| Source format is machine-readable but lacks a direct path | — | Yes |
| Example pairs | PNG→WebP, DOCX→PDF, MP4→WebM | CSV→PDF report, JSON→DOCX narrative, schema→TypeScript |
**Decision rule:**
> Use direct tool when: `getConverter(src, tgt).capability === "direct"`.
> Use AI when: capability is `"ai-bridged"` AND the source is a text/data format that an LLM can read as context.
> Return `"unavailable"` error when: capability is `"unavailable"` (binary formats with no path, e.g. PDF→XLSX).
**AI-bridged is NOT a fallback for when a tool is missing.** If LibreOffice is not installed, DOCX→PDF is `"unavailable"`, not `"ai-bridged"`. AI-bridged is only for semantically complex conversions where no deterministic tool exists.
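The decision rule can be written as a small pure function. This is an illustrative sketch: `decideConversion`, the `Decision` labels, and the `sourceIsText` flag are hypothetical and stand in for however the server determines that a source is LLM-readable.

```typescript
type ConversionCapability = "direct" | "ai-bridged" | "unavailable";
type Decision = "run-direct-tool" | "run-ai-bridge" | "reject-unavailable";

// `sourceIsText` is an assumed flag: true when the source can be read as text
// and handed to an LLM as context (CSV, JSON, Markdown, ...).
export function decideConversion(
  capability: ConversionCapability,
  sourceIsText: boolean
): Decision {
  if (capability === "direct") return "run-direct-tool";
  if (capability === "ai-bridged" && sourceIsText) return "run-ai-bridge";
  // Covers "unavailable" and ai-bridged routes whose source is binary.
  return "reject-unavailable";
}
```

Note that a missing system dependency never reaches the AI branch, because such routes carry `"unavailable"` rather than `"ai-bridged"`.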
---
### AI-Bridged Prompt Structure
The prompt must be deterministic in its output format requirements. Vague prompts produce vague output.
```typescript
// server/src/services/converters/ai-bridged-converter.ts
// AgentAdapter (the existing adapter interface) and writeConversionResult
// are assumed to be defined elsewhere in the server.
export async function aiBridgedConvert(
  sourceExt: string,
  targetExt: string,
  sourceContent: string, // text content of the source file
  outputPath: string,
  aiHint: string, // from registry route
  agentAdapter: AgentAdapter // existing adapter interface
): Promise<void> {
  const prompt = buildConversionPrompt(sourceExt, targetExt, sourceContent, aiHint);
  const result = await agentAdapter.complete(prompt);
  await writeConversionResult(result, targetExt, outputPath);
}

function buildConversionPrompt(
  sourceExt: string,
  targetExt: string,
  sourceContent: string,
  aiHint: string
): string {
  const outputSpec = OUTPUT_SPEC[targetExt] ?? "the target format";
  // The template body stays flush left so no indentation leaks into the prompt
  return `Convert the following ${sourceExt.toUpperCase()} content to ${targetExt.toUpperCase()}.
Instruction: ${aiHint}
Requirements:
- Output ONLY the converted content, no explanation, no preamble
- Output format: ${outputSpec}
- Preserve all data values exactly; do not summarize or truncate
Source content:
\`\`\`
${sourceContent}
\`\`\``;
}

const OUTPUT_SPEC: Record<string, string> = {
  html: "Valid HTML5. No <!DOCTYPE>, no <html>/<body> wrapper. Only the content fragment.",
  md: "GitHub Flavored Markdown.",
  ts: "Valid TypeScript. No imports unless required. Export all types.",
  pdf: "HTML that will be rendered to PDF. Use inline <style> for layout.",
  json: "Valid JSON object or array. No trailing commas. No comments.",
};
```
**Reliability rules for AI-bridged conversion:**
1. Always specify exact output format in the prompt — "output ONLY the converted content" prevents LLM preamble
2. Pass `temperature: 0` to the adapter. Conversions are not creative tasks, and a zero temperature reduces run-to-run variance (it does not guarantee byte-identical output)
3. Validate the output: HTML is parsed with a DOM parser, JSON is `JSON.parse()`, TypeScript is compiled with `tsc --noEmit`
4. Keep source content under 50K characters — larger inputs require chunking or direct tool upgrade
5. Never use AI-bridged for binary formats (images, audio, video) — LLMs cannot output binary faithfully
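Rule 3 can be sketched for the JSON case as follows. The helper names `stripCodeFence` and `validateJsonOutput` are hypothetical; the sketch only unwraps a reply that is entirely enclosed in one Markdown code fence and otherwise rejects anything `JSON.parse()` cannot read, rather than trying to repair it.

```typescript
const FENCE = "`".repeat(3);

// Removes one surrounding Markdown code fence if the whole reply is wrapped
// in it; otherwise returns the trimmed reply unchanged.
export function stripCodeFence(reply: string): string {
  const trimmed = reply.trim();
  if (!trimmed.startsWith(FENCE) || !trimmed.endsWith(FENCE)) return trimmed;
  const firstNewline = trimmed.indexOf("\n");
  if (firstNewline === -1) return trimmed;
  return trimmed.slice(firstNewline + 1, trimmed.length - FENCE.length).trim();
}

export type JsonCheck =
  | { ok: true; value: unknown }
  | { ok: false; error: string };

// Parses the (possibly fenced) reply and reports failure instead of throwing.
export function validateJsonOutput(reply: string): JsonCheck {
  try {
    return { ok: true, value: JSON.parse(stripCodeFence(reply)) };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

A failed check is a reason to fail the job (or retry once) rather than to write unvalidated output to storage.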
---
## UI Patterns
### Deep-Linkable Route Structure
```
/convert → landing page, format picker
/convert/:from → source chosen, target picker
/convert/:from/:to → conversion page, upload + convert CTA
/convert/:from/:to/:jobId → result page, download + share
```
This mirrors the CloudConvert URL pattern. Each route is bookmarkable and shareable. React Router `<Link>` handles navigation without page reload.
**Implementation with React Router v6:**
```typescript
// ui/src/pages/convert/index.tsx
<Route path="/convert" element={<ConvertLanding />} />
<Route path="/convert/:from" element={<ConvertSourcePage />} />
<Route path="/convert/:from/:to" element={<ConvertPage />} />
```
The `:from` and `:to` params are lowercase file extension strings (e.g. `png`, `pdf`, `docx`). The UI queries `GET /api/convert/formats/:from` to show available target formats dynamically.
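Because `:from` and `:to` arrive from the URL, it helps to normalize them before they are used as registry keys. The sketch below assumes extensions are short alphanumeric strings; `normalizeExt` and the 10-character cap are illustrative choices, not existing Nexus code.

```typescript
// Accepts "PNG", ".png" or " png " and returns "png"; returns undefined for
// anything that is not a short alphanumeric extension.
export function normalizeExt(raw: string): string | undefined {
  const ext = raw.trim().toLowerCase().replace(/^\./, "");
  return /^[a-z0-9]{1,10}$/.test(ext) ? ext : undefined;
}
```

A route handler would return 404 when either param normalizes to `undefined`, before any registry lookup happens.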
---
### Conversion API Endpoints
```
GET /api/convert/formats → all supported format pairs
GET /api/convert/formats/:from → targets available for a given source
POST /api/convert/jobs → create job, returns jobId (202 Accepted)
GET /api/convert/jobs/:jobId → job status + result download URL
```
This mirrors the existing `content-jobs` pattern from ARCHITECTURE.md. The conversion system is a second consumer of `contentJobService` — not a separate job system.
```typescript
// Register conversion as content job types
registerConverter("png", "webp", {
  capability: "direct",
  // Wrap sharpConvert so the target format is fixed by the route, not by opts
  handler: (i, o) => sharpConvert(i, o, "webp"),
});
// POST /api/convert/jobs creates a content_jobs row with type: "convert:png/webp"
```
---
### UI Component Pattern
Key components for the conversion UI:
```
ConvertLanding — grid of format categories (Images, Documents, Data, Video)
FormatPicker — searchable list of source/target formats
ConvertDropzone — drag-drop + paste + file picker, file size display
ConvertProgress — SSE-driven progress bar (reuses existing SSE subscriber hook)
ConvertResult — download button, preview (image inline, PDF iframe), copy link
FormatBadge — small pill showing format ext with category color
```
**Drag-drop:** Use the existing file upload pattern from Nexus v1.3 file system. Do not add a new drag-drop library — the existing upload component already handles it.
**Progress:** Use the existing SSE live event subscriber. The conversion job emits `content.job.started` and `content.job.done` events — the same events as other content jobs. The `ConvertProgress` component subscribes to these. Zero new infrastructure needed.
**Format categories for the landing grid:**
| Category | Formats | Icon |
|----------|---------|------|
| Images | PNG, JPEG, WebP, AVIF, SVG, GIF | camera |
| Documents | PDF, DOCX, HTML, Markdown, ODT | file-text |
| Data | CSV, XLSX, JSON, TSV | table |
| Video | MP4, WebM, MKV, MOV, GIF | video |
| Audio | MP3, WAV, OGG, FLAC | music |
| Code | JS, TS, CSS, HTML | code |
---
## Security Pitfalls
### Critical: Path Traversal via Filename
**What goes wrong:** A user uploads a file named `../../etc/passwd.csv` or `../config.json`. If the server uses the original filename to construct the output path, it writes outside the intended temp directory.
**How it happens in conversion specifically:** Conversion tools (pandoc, ffmpeg, LibreOffice) write output to a path the server constructs. If the output path includes any user-supplied component, traversal is trivial.
**CVE context:** CVE-2026-21440 (AdonisJS, CVSS 9.2) is exactly this pattern — `MultipartFile.move()` without filename sanitization allows arbitrary file write.
**Prevention:**
```typescript
import path from "path";
import crypto from "crypto";

// NEVER use the original filename for output paths
function safeOutputPath(tempDir: string, targetExt: string): string {
  const id = crypto.randomUUID();
  // id contains no slashes, so path.join cannot leave tempDir as long as
  // targetExt comes from the registry (never from the request)
  return path.join(tempDir, `${id}.${targetExt}`);
}

// NEVER pass user-supplied paths to child_process.spawn
// Always resolve through the StorageService, which controls the output namespace
```
**Absolute rule:** The only user-supplied value that enters the conversion pipeline is the **file content (Buffer)**. File names, paths, and extensions are derived server-side from MIME type detection and the registry.
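As defense in depth, a path can also be verified to stay inside the job's temp directory before it is handed to a converter. This is a sketch with a hypothetical helper name, `isInsideDir`, using only `node:path`.

```typescript
import * as path from "node:path";

// True when `candidate` resolves to a location strictly inside `baseDir`.
export function isInsideDir(baseDir: string, candidate: string): boolean {
  const base = path.resolve(baseDir);
  const rel = path.relative(base, path.resolve(base, candidate));
  const escapes =
    rel === ".." || rel.startsWith(`..${path.sep}`) || path.isAbsolute(rel);
  // An empty relative path means candidate is baseDir itself, which is not a file inside it.
  return rel !== "" && !escapes;
}
```

The check complements, and does not replace, generating output names server-side.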
---
### Critical: MIME Type vs Extension Spoofing
**What goes wrong:** A user uploads a file with extension `.csv` but the actual content is an executable or a PHP file. The server passes it to the pandoc handler expecting text input.
**Prevention:**
1. Validate MIME type with `file-type ^19.x` (reads magic bytes, not extension)
2. Reject files whose detected MIME type does not match the declared source format
3. Set `Content-Disposition: attachment` on all download responses — never `inline` for non-image, non-PDF files
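```typescript
import { fileTypeFromBuffer } from "file-type";

const detected = await fileTypeFromBuffer(inputBuffer);
// file-type recognizes binary signatures only, so text formats (CSV, JSON,
// Markdown) come back as undefined. EXPECTED_MIME must therefore have no entry
// for text formats: any detected binary type then counts as a mismatch.
const expected = EXPECTED_MIME[sourceExt];
if (detected?.mime !== expected) {
  throw new Error(
    `MIME mismatch: declared ${sourceExt} but detected ${detected?.mime ?? "no binary signature"}`
  );
}
```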
**New package: `file-type ^19.x`**
```bash
pnpm --filter @paperclipai/server add file-type
```
Confidence: HIGH — `file-type` is the standard Node.js magic-bytes library, widely used. v19.x is pure ESM; confirm that server's module resolution handles ESM imports.
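To make "reads magic bytes" concrete, here is a hand-rolled illustration of signature sniffing for three formats. It is not `file-type`'s API and covers far fewer formats; `sniffMagic` and `SIGNATURES` are hypothetical names used only for this sketch.

```typescript
type Sniffed = "image/png" | "application/pdf" | "application/zip";

const SIGNATURES: ReadonlyArray<{ mime: Sniffed; bytes: number[] }> = [
  { mime: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
  // "PK\x03\x04": DOCX and XLSX files are ZIP containers and start this way too
  { mime: "application/zip", bytes: [0x50, 0x4b, 0x03, 0x04] },
];

// Compares the leading bytes of `data` against each known signature.
export function sniffMagic(data: Uint8Array): Sniffed | undefined {
  for (const sig of SIGNATURES) {
    const longEnough = data.length >= sig.bytes.length;
    if (longEnough && sig.bytes.every((b, i) => data[i] === b)) return sig.mime;
  }
  return undefined; // plain-text formats such as CSV carry no signature
}
```

The `undefined` result for text input is the reason a CSV upload cannot be positively identified by magic bytes, only ruled out as a known binary type.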
---
### Moderate: Unbounded Resource Consumption (DoS via Large Files)
**What goes wrong:** A user uploads a 4GB video file. LibreOffice or ffmpeg consumes all available RAM before the job starts. The server crashes.
**Prevention:**
```typescript
// Enforce these caps in the multipart upload middleware; Express's built-in
// body parsers cover JSON and urlencoded bodies, not multipart uploads
const MAX_FILE_SIZES: Record<string, number> = {
  image: 50 * 1024 * 1024, // 50 MB
  document: 100 * 1024 * 1024, // 100 MB
  video: 500 * 1024 * 1024, // 500 MB
  audio: 200 * 1024 * 1024, // 200 MB
  data: 20 * 1024 * 1024, // 20 MB (CSV/XLSX)
};
```
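Also bound the child processes spawned for conversion (ffmpeg, pandoc). `child_process.spawn` has no option for setting `RLIMIT_AS` or other rlimits, so a memory cap has to come from the launching environment (for example a shell wrapper that runs `ulimit` before `exec`), and how strictly macOS enforces it should be verified on the target machine. Conservative default: 2GB per subprocess. A wall-clock timeout that kills the child is a bound Node can always enforce.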
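One bound that Node can enforce on any child process is wall-clock time. The sketch below uses a plain timer plus `kill()`; `runWithTimeout` is a hypothetical helper, and the choice of `SIGKILL` is an assumption that a hung converter should not get a chance to ignore the signal.

```typescript
import { spawn } from "node:child_process";

// Runs a command and rejects if it exits non-zero or exceeds `timeoutMs`.
export function runWithTimeout(
  command: string,
  args: string[],
  timeoutMs: number
): Promise<void> {
  return new Promise((resolve, reject) => {
    const proc = spawn(command, args, { stdio: "ignore" });
    let timedOut = false;
    const timer = setTimeout(() => {
      timedOut = true;
      proc.kill("SIGKILL");
    }, timeoutMs);
    proc.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    proc.on("close", (code) => {
      clearTimeout(timer);
      if (timedOut) reject(new Error(`${command} timed out after ${timeoutMs} ms`));
      else if (code === 0) resolve();
      else reject(new Error(`${command} exited with code ${code}`));
    });
  });
}
```

Per-category timeouts (short for images, longer for video) could sit next to `MAX_FILE_SIZES`.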
---
### Moderate: Temp File Accumulation (Disk Exhaustion)
**What goes wrong:** Conversion jobs fail midway. Temp files in `/tmp` are never cleaned up. The disk fills over days/weeks.
**CVE context:** CVE-2026-3304 (Multer < 2.1.0) failed requests leave temp files on disk. Multer 2.1.0 fixes this but only for Multer's own temp files. Conversion intermediates are your responsibility.
**Prevention:**
```typescript
// Always use try/finally to clean up temp files
import fs from "fs/promises";
import os from "os";
import path from "path";

async function withTempDir<T>(fn: (dir: string) => Promise<T>): Promise<T> {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), "nexus-convert-"));
  try {
    return await fn(dir);
  } finally {
    await fs.rm(dir, { recursive: true, force: true }).catch(() => {});
  }
}
```
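Additionally: sweep for stale `nexus-convert-*` directories older than 1 hour at server startup and on a timer. A process exit handler alone is not enough, because it cannot run asynchronous cleanup and never runs after a crash or `SIGKILL`.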
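A sweep for leftover directories might look like the sketch below. `sweepStaleTempDirs` and `startTempSweeper` are hypothetical names, the 15-minute interval is an arbitrary choice, and directory age is judged by mtime, which is an assumption about how staleness should be measured.

```typescript
import * as fs from "node:fs/promises";
import * as os from "node:os";
import * as path from "node:path";

// Deletes directories under `root` whose name starts with `prefix` and whose
// mtime is older than `maxAgeMs`. Returns the names that were removed.
export async function sweepStaleTempDirs(
  root: string,
  prefix: string,
  maxAgeMs: number
): Promise<string[]> {
  const removed: string[] = [];
  const cutoff = Date.now() - maxAgeMs;
  for (const entry of await fs.readdir(root, { withFileTypes: true })) {
    if (!entry.isDirectory() || !entry.name.startsWith(prefix)) continue;
    const full = path.join(root, entry.name);
    const { mtimeMs } = await fs.stat(full);
    if (mtimeMs < cutoff) {
      await fs.rm(full, { recursive: true, force: true });
      removed.push(entry.name);
    }
  }
  return removed;
}

// Hypothetical wiring: sweep once at startup, then every 15 minutes.
export function startTempSweeper(root: string = os.tmpdir()): NodeJS.Timeout {
  const run = () =>
    void sweepStaleTempDirs(root, "nexus-convert-", 60 * 60 * 1000).catch(() => {});
  run();
  return setInterval(run, 15 * 60 * 1000).unref(); // unref: never keeps the process alive
}
```

The one-hour threshold should stay well above the longest expected conversion so that a running job's directory is never swept.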
---
### Minor: Arbitrary Code Execution via Conversion Tool Arguments
**What goes wrong:** Conversion parameters from the API request body are passed directly as CLI arguments to pandoc/ffmpeg. An attacker sends `{"extraArgs": ["--lua-filter=/etc/passwd"]}`.
**Prevention:** Never expose raw CLI arg arrays to the API. The registry pre-defines all argument templates. The only variable substitution is for safe values (bitrate as a number, output format as an enum from the registry).
```typescript
// BAD: never do this
const args = userRequest.extraArgs as string[];
spawn("pandoc", [input, ...args, "-o", output]);
// GOOD: use pre-defined templates from the registry
const route = getConverter(from, to);
route.handler(inputPath, outputPath, {}); // handler owns its own args
```
---
## Performance Pitfalls
### Anti-Pattern: Spawning One Process Per Request
**What goes wrong:** Each conversion request immediately spawns a new ffmpeg/LibreOffice/pandoc subprocess. Under concurrent load, 20 requests spawn 20 ffmpeg processes simultaneously, exhausting CPU and memory.
**Mitigation:** Queue conversion jobs through `contentJobService` (already the pattern). The job queue naturally serializes heavy jobs. For LibreOffice specifically: enforce single-concurrency in the LibreOffice adapter with a simple in-memory semaphore:
```typescript
let libreofficeRunning = false;
const libreofficeQueue: Array<() => void> = [];

async function libreofficeConvertSerialized(buf: Buffer, ext: string): Promise<Buffer> {
  // Wait until no other conversion is in flight
  while (libreofficeRunning) {
    await new Promise<void>(resolve => libreofficeQueue.push(resolve));
  }
  libreofficeRunning = true;
  try {
    return await convertAsync(buf, ext, undefined);
  } finally {
    libreofficeRunning = false;
    libreofficeQueue.shift()?.(); // wake the next waiter, if any
  }
}
```
For ffmpeg: allow up to 3 concurrent processes (M4 has 10 cores; each ffmpeg uses ~2-3 threads for simple conversions).
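The same idea generalizes from one slot to N. The following sketch shows a small counting limiter that could cap ffmpeg at three concurrent runs; `createLimiter` is a hypothetical helper, not an existing Nexus utility.

```typescript
// Returns a function that runs async tasks with at most `max` in flight.
export function createLimiter(max: number) {
  let active = 0;
  const waiting: Array<() => void> = [];

  return async function run<T>(task: () => Promise<T>): Promise<T> {
    if (active >= max) {
      // All slots busy: park until a finishing task hands its slot over.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) next(); // pass the slot directly to the next waiter
      else active--; // nobody waiting: free the slot
    }
  };
}
```

Usage would be along the lines of `const ffmpegLimit = createLimiter(3)` followed by `ffmpegLimit(() => ffmpegConvert(input, output, args))`. Handing the slot directly to the next waiter avoids a window in which a new caller could jump the queue.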
---
### Anti-Pattern: Loading Entire File into Memory Before Converting
**What goes wrong:** A 500MB MP4 file is read into a `Buffer` before being passed to ffmpeg. Node.js allocates 500MB of heap. V8 GC stalls. Server becomes unresponsive.
**Mitigation:** For files larger than 10MB, write the upload directly to a temp file path and pass the path to the converter. The `StorageService.getStream(objectKey)` method should be used to pipe large files to disk before conversion.
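Streaming an upload to disk can be sketched with Node's stream pipeline. `streamToFile` is a hypothetical helper, and the source stream stands in for whatever `StorageService.getStream(objectKey)` returns, which this document does not specify further.

```typescript
import * as fs from "node:fs";
import { Readable, Transform } from "node:stream";
import { pipeline } from "node:stream/promises";

// Writes a readable stream to `destPath` chunk by chunk, enforcing `maxBytes`
// without holding the whole file in memory. Returns the bytes written.
export async function streamToFile(
  source: Readable,
  destPath: string,
  maxBytes: number
): Promise<number> {
  let written = 0;
  const sizeGuard = new Transform({
    transform(chunk: Buffer, _encoding, callback) {
      written += chunk.length;
      if (written > maxBytes) callback(new Error(`upload exceeds ${maxBytes} bytes`));
      else callback(null, chunk);
    },
  });
  // pipeline() destroys every stream and rejects if any stage fails.
  await pipeline(source, sizeGuard, fs.createWriteStream(destPath));
  return written;
}
```

A rejected upload can leave a partial file behind, so `destPath` should live inside a per-job temp directory that is removed afterwards.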
---
### Anti-Pattern: Returning Converted File as HTTP Response Body
**What goes wrong:** `POST /api/convert/jobs` holds the connection open for 30+ seconds then streams back a 200MB video file as the HTTP response. Upstream proxy (nginx) times out at 30s.
**Mitigation:** This is the same pattern solved by `contentJobService`. Always: create a job, return 202+jobId, render async, store result in StorageService, emit SSE done event, client fetches download URL. The download URL uses the existing signed URL / direct serve pattern from `assetService`.
---
## What NOT to Build
| Avoid | Why | Use Instead |
|-------|-----|-------------|
| ImageMagick CLI wrapper | `sharp` covers all needed raster formats at 4-5× speed | `sharp ^0.34.5` (already installed) |
| `@imagemagick/magick-wasm` | ~0.3× native speed; complex install | `sharp` (libvips-backed) |
| `fluent-ffmpeg ^2.1.3` | Archived May 2025; full fluent API unnecessary | ~30-line `child_process.spawn` wrapper |
| `node-pandoc ^0.2.7` | Last updated 2021; adds dependency for a 20-line child_process call | Thin wrapper using `child_process.spawn` |
| `json2csv` (original) | v6 alpha, 3 years stale | `csv-stringify` (same ecosystem as `csv-parse`) |
| `pdf-lib` | Pure-JS PDF assembly; no HTML rendering; wrong for HTML→PDF | `playwright-chromium` (already decided) |
| `jsPDF` | CVE-2025-68428 path traversal in versions <4.0.0; even fixed versions are JS-first PDF generation with weak CSS support | `playwright-chromium` |
| Arbitrary `extraArgs` in API | Arbitrary CLI args = code execution via crafted filenames/flags | Pre-defined handler templates in registry |
| Shared temp files between jobs | Race conditions, cleanup failures | `withTempDir()` scoped to each job |
| AI for binary→binary conversion | LLMs cannot produce binary output faithfully | Always use direct tools for binary format pairs |
| Polling loop for job status | Creates unnecessary load; SSE already available | Subscribe to existing `content.job.done` SSE event |
---
## Installation Summary
**New packages to add (v1.7 format conversion):**
```bash
# Document conversion
pnpm --filter @paperclipai/server add mammoth # ^1.12.0 — DOCX→HTML (no system dep)
pnpm --filter @paperclipai/server add libreoffice-convert # ^1.8.1 — Office→PDF (requires LibreOffice)
pnpm --filter @paperclipai/server add xlsx # ^0.20.x — SheetJS spreadsheet R/W
# CSV/data
pnpm --filter @paperclipai/server add csv-parse # ^6.2.1 — CSV streaming parser
pnpm --filter @paperclipai/server add csv-stringify # ^6.x — JSON→CSV generator
# Code formatting
pnpm --filter @paperclipai/server add prettier # ^3.x — code formatting (JS/TS/CSS/HTML/MD)
pnpm --filter @paperclipai/server add json-schema-to-typescript # ^15.x — schema→TS types
# Security: MIME type validation
pnpm --filter @paperclipai/server add file-type # ^19.x — magic byte MIME detection (ESM)
# System dependencies (install once on Mac Mini)
brew install pandoc # markdown↔docx/html/rst conversions
brew install --cask libreoffice # office→pdf; optional, degrade gracefully if absent
```
**No new packages needed for:**
- Image conversion: `sharp` already installed
- Audio/video: `ffmpeg-static` already installed (write thin wrapper)
- SVG→PNG: `sharp` (basic) or `@resvg/resvg-js` (already in STACK.md for satori SVGs)
- HTML→PDF: `playwright-chromium` already in STACK.md
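
The "thin wrapper" mentioned for audio/video might look something like the sketch below. It is an illustration under assumptions, not the project's implementation: `runBinary`, `RunResult`, and the 60-second default timeout are hypothetical names and values. The wrapper takes the binary path as a parameter, so the path exported by `ffmpeg-static` could be passed in by the caller.

```typescript
import { spawn } from "node:child_process";

interface RunResult {
  code: number;
  stderr: string;
}

// Hypothetical thin wrapper: runs a binary with an argument array (no shell),
// collects stderr for diagnostics, and rejects on a non-zero exit or timeout.
function runBinary(
  binary: string,
  args: readonly string[],
  timeoutMs = 60_000,
): Promise<RunResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(binary, [...args], {
      stdio: ["ignore", "ignore", "pipe"],
      timeout: timeoutMs, // Node kills the child when this elapses
    });

    let stderr = "";
    child.stderr?.on("data", (chunk: Buffer) => {
      stderr += chunk.toString();
    });

    // Fires when the binary cannot be started at all (for example, not found).
    child.on("error", reject);

    child.on("close", (code) => {
      if (code === 0) {
        resolve({ code, stderr });
      } else {
        // code is null when the process was killed by a signal (e.g. timeout).
        reject(new Error(`${binary} exited with code ${code}: ${stderr.slice(-500)}`));
      }
    });
  });
}

// Assumed usage with ffmpeg, where ffmpegPath comes from `ffmpeg-static`:
//   await runBinary(ffmpegPath, ["-y", "-i", inputPath, outputPath]);
```

Passing arguments as an array without a shell means file paths are never interpreted by a shell, which is the main reason a short wrapper like this can stand in for a full fluent API.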
---
## Phase-Specific Warnings
| Phase Topic | Likely Pitfall | Mitigation |
|-------------|---------------|------------|
| Format registry setup | Registering converters before system dep checks leaves silent `"direct"` entries that fail at runtime | Check `which pandoc`, `which soffice` at startup; register as `"unavailable"` if absent |
| LibreOffice integration | LibreOffice spawns a UNO bridge on first call; second call within the same socket fails | Serialize all LibreOffice calls with the semaphore pattern above; never concurrent |
| File upload security | User-controlled filenames in temp paths | Use `crypto.randomUUID()` for all output paths; never use original filename |
| `file-type ^19.x` | Pure ESM package in a CJS/ESM mixed server | Use dynamic `await import("file-type")` or configure server's tsconfig for ESM interop |
| Large video files | Buffering the entire file into memory | Pipe uploads directly to disk via streaming; pass file path (not buffer) to ffmpeg |
| AI-bridged output validation | LLM returns text with preamble before the converted content | Enforce `OUTPUT_SPEC` in prompt; strip leading non-content lines; validate with format parser |
| `/convert/:from/:to` route | Collides with existing routes if Express route order is wrong | Mount conversion routes before wildcard routes; use `/api/convert/` prefix throughout |
| `xlsx` (SheetJS) license | Community Edition license changed in recent versions | Check npm package license field at install time; log at startup if non-OSI |
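
Two of the mitigations above, the startup dependency check and UUID-based output paths, could be sketched as follows. All names here (`resolveStatuses`, `whichProbe`, `systemDeps`, `outputPathFor`, and the converter ids) are hypothetical, and the probe assumes a macOS/Linux host where `which` is available. Treat it as one possible shape rather than the registry's real API.

```typescript
import { spawnSync } from "node:child_process";
import { randomUUID } from "node:crypto";
import { join } from "node:path";

type ConverterStatus = "direct" | "unavailable";
type Probe = (binary: string) => boolean;

// Default probe: `which <binary>` exits 0 when the binary is on PATH.
const whichProbe: Probe = (binary) =>
  spawnSync("which", [binary], { stdio: "ignore" }).status === 0;

// Hypothetical mapping from converter id to the system binary it requires.
const systemDeps: Record<string, string> = {
  "md->docx": "pandoc",
  "docx->pdf": "soffice",
};

// Decide each converter's status once, at startup, before registration.
function resolveStatuses(
  deps: Record<string, string>,
  probe: Probe = whichProbe,
): Map<string, ConverterStatus> {
  const statuses = new Map<string, ConverterStatus>();
  for (const [id, binary] of Object.entries(deps)) {
    statuses.set(id, probe(binary) ? "direct" : "unavailable");
  }
  return statuses;
}

// Output names come from a UUID plus a vetted extension, never from the upload.
function outputPathFor(dir: string, extension: string): string {
  if (!/^[a-z0-9]+$/.test(extension)) {
    throw new Error(`Unsupported extension: ${extension}`);
  }
  return join(dir, `${randomUUID()}.${extension}`);
}
```

Making the probe injectable keeps the startup check testable on machines without pandoc or LibreOffice, for example `resolveStatuses(systemDeps, (b) => b === "pandoc")`.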
---
## Sources
- [fluent-ffmpeg GitHub (archived May 2025)](https://github.com/fluent-ffmpeg/node-fluent-ffmpeg/issues/1324): archival notice
- [fluent-ffmpeg npm](https://www.npmjs.com/package/fluent-ffmpeg): v2.1.3 confirmed
- [sharp official docs](https://sharp.pixelplumbing.com/): SVG support via librsvg confirmed
- [sharp GitHub](https://github.com/lovell/sharp): v0.34.5 confirmed; 4-5× faster than ImageMagick
- [mammoth npm](https://www.npmjs.com/package/mammoth): v1.12.0, last published 20 days ago
- [mammoth GitHub](https://github.com/mwilliamson/mammoth.js/): one-way DOCX→HTML converter
- [pandoc official](https://pandoc.org/): universal markup converter, Haskell binary
- [node-pandoc npm](https://www.npmjs.com/package/node-pandoc): thin child_process wrapper, v0.2.7, last updated 2021
- [libreoffice-convert npm](https://www.npmjs.com/package/libreoffice-convert): v1.8.1, last published ~February 2026
- [SheetJS Community Edition docs](https://docs.sheetjs.com/): spreadsheet format coverage
- [SheetJS vs ExcelJS comparison 2026](https://www.pkgpulse.com/blog/sheetjs-vs-exceljs-vs-node-xlsx-excel-files-node-2026): SheetJS 7.8M weekly downloads (MEDIUM confidence, single source)
- [csv-parse npm](https://www.npmjs.com/package/csv-parse): v6.2.1, last published 4 days ago
- [json-2-csv npm](https://www.npmjs.com/package/json-2-csv): v5.5.10, maintained alternative to archived json2csv
- [Prettier API docs](https://prettier.io/docs/api): programmatic `format()` function confirmed
- [json-schema-to-typescript GitHub](https://github.com/bcherny/json-schema-to-typescript): JSON Schema→TypeScript types
- [ConvertX GitHub (C4illin/ConvertX)](https://github.com/C4illin/ConvertX): reference architecture, tool-per-category dispatch with 1000+ format pairs
- [Worker Threads in Node.js 2026 (DEV Community)](https://dev.to/young_gao/worker-threads-in-nodejs-when-and-how-to-use-them-2jdm): pooling recommendation for CPU-bound tasks
- [CVE-2025-68428: jsPDF path traversal](https://www.endorlabs.com/learn/cve-2025-68428-critical-path-traversal-in-jspdf): avoid jsPDF for user-input file paths
- [CVE-2026-21440: AdonisJS bodyparser path traversal (CVSS 9.2)](https://thehackernews.com/2026/01/critical-adonisjs-bodyparser-flaw-cvss.html): file upload filename sanitization
- [CVE-2026-3304: Multer temp file cleanup DoS](https://cvereports.com/reports/CVE-2026-3304): Multer <2.1.0 does not clean temp files on async filter error
- [Node.js path traversal prevention](https://www.nodejs-security.com/blog/secure-coding-practices-nodejs-path-traversal-vulnerabilities): path.normalize alone is insufficient
- [Hybrid AI deterministic/LLM boundary (New Math Data)](https://newmathdata.com/blog/hybrid-ai-deterministic-code-llm-reasoning-systems/): deterministic tools for deterministic tasks, LLM for semantic tasks
---
*Format conversion research for: Nexus v1.7 Content Generation*
*Researched: 2026-04-04*
*Scope: Supplemental — format conversion ecosystem only. Does not supersede STACK.md or ARCHITECTURE.md.*