Format Conversion Ecosystem

Project: Nexus v1.7 (supplemental research for two-tier format conversion system)
Researched: 2026-04-04
Scope: Direct conversion tools, format registry pattern, AI-bridged conversion boundary, UI patterns, security and performance pitfalls
Confidence: HIGH for tool choices, MEDIUM for version numbers (npm registry cross-checked)

Context: What Is Already Available

These are confirmed installed and must not be re-added:

Package	Location	Version	Relevant To Conversion
`sharp ^0.34.5`	`server/`	0.34.5	Raster image conversion (resize, format, SVG→PNG)
`ffmpeg-static ^5.3.0`	`server/`	5.3.0	Audio/video conversion binary
`mermaid ^11.12.0`	`ui/`	11.12.0	Client-side Mermaid rendering
`playwright-chromium ^1.50.0`	`server/`	1.50.0	HTML→PDF already decided in STACK.md

Tier 1: Direct Conversion Tools

Image Formats

Primary: sharp ^0.34.5 (already installed)

Sharp handles the majority of image format pairs without any additional dependency:

Source	Target	Method
JPEG/PNG/WebP/AVIF/TIFF/GIF	Any raster	`sharp(input).toFormat('webp').toBuffer()`
SVG	PNG/JPEG/WebP	`sharp(svgBuffer).png().toBuffer()` — libvips handles SVG via librsvg
PNG/JPEG	WebP/AVIF	`sharp(input).webp({ quality: 80 }).toBuffer()`

Important SVG caveat: sharp's SVG→PNG conversion uses librsvg. It works well for most SVGs but does NOT support all CSS features. For agent-generated SVGs with embedded fonts (produced by satori), use @resvg/resvg-js as specified in STACK.md. For user-uploaded SVGs without special fonts, sharp is sufficient.

No ImageMagick needed. ImageMagick via CLI or WASM adds complexity:

imagemagick npm is an unmaintained CLI wrapper (last release 2020)
WASM ImageMagick (@imagemagick/magick-wasm) works but runs at ~0.3× native speed
sharp via libvips is typically 4–5× faster than ImageMagick (the figure sharp publishes for resize workloads); treat it as a guide rather than a guarantee for every format pair
For the format pairs Nexus needs (JPEG, PNG, WebP, AVIF, TIFF, SVG→raster), sharp covers everything

No Inkscape needed for Nexus's scope (vector conversion beyond SVG→PNG is out of scope for v1.7).

Audio / Video Formats

Wrapper: fluent-ffmpeg ^2.1.3

Important maintenance note: The fluent-ffmpeg repository was archived on May 22, 2025. The package is no longer receiving new features. However:

It remains published on npm and functional with Node.js >=18
The ffmpeg-static ^5.3.0 binary it wraps is still actively maintained
@types/fluent-ffmpeg ^2.1.28 provides TypeScript types (last updated October 2025)
For Nexus's use case (spawn ffmpeg with known args), the archived state is low risk
Alternative: write a thin child_process.spawn wrapper directly — this is ~30 lines and removes the archived dependency entirely

Recommendation for Nexus: Implement a minimal ffmpegConvert(inputPath, outputPath, extraArgs) wrapper using child_process.spawn instead of taking on the archived fluent-ffmpeg. The full fluent API is unnecessary — format conversion is a single ffmpeg -i input.mp4 output.webm call.

// server/src/services/converters/ffmpeg-converter.ts
import { spawn } from "child_process";
import ffmpegPath from "ffmpeg-static";

export function ffmpegConvert(
  inputPath: string,
  outputPath: string,
  extraArgs: string[] = []
): Promise<void> {
  return new Promise((resolve, reject) => {
    // ffmpeg-static exports null on platforms without a bundled binary
    if (!ffmpegPath) {
      reject(new Error("ffmpeg-static did not provide a binary path"));
      return;
    }
    const proc = spawn(ffmpegPath, [
      "-i", inputPath,
      ...extraArgs,
      "-y",  // overwrite output
      outputPath,
    ]);
    // Spawn failures (e.g. missing binary) arrive as "error", not "close"
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`ffmpeg exited ${code}`))
    );
    proc.stderr.on("data", () => {}); // drain stderr so a full pipe cannot stall ffmpeg
  });
}

Format coverage via ffmpeg-static:

Source	Target	Extra args
MP4/MKV/AVI/MOV	WebM	`["-c:v", "libvpx-vp9", "-c:a", "libopus"]`
MP4/WebM	MP3	`["-vn", "-c:a", "libmp3lame"]` (audio extract; for AAC output use `"-c:a", "aac"` instead)
MP3/WAV/FLAC/OGG	MP3	`["-c:a", "libmp3lame", "-b:a", "192k"]`
MP3/WAV/OGG	WAV	`["-c:a", "pcm_s16le"]`
Image sequence	MP4	`["-r", "30", "-c:v", "libx264"]`
MP4	GIF	`["-vf", "fps=10,scale=640:-1:flags=lanczos"]`
Any video	Audio-only MP3	`["-vn"]`
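
The table above can be kept in one place as a lookup, so that no caller ever assembles ffmpeg arguments by hand. A minimal sketch, assuming a hypothetical `FFMPEG_EXTRA_ARGS` map and `buildFfmpegArgs` helper (neither exists in the codebase); the argument arrays are copied from the table and the ordering matches `ffmpegConvert`.

```typescript
// Hypothetical lookup table: "source/target" -> extra ffmpeg args.
// The argument arrays are copied from the table above.
const FFMPEG_EXTRA_ARGS = new Map<string, readonly string[]>([
  ["mp4/webm", ["-c:v", "libvpx-vp9", "-c:a", "libopus"]],
  ["wav/mp3", ["-c:a", "libmp3lame", "-b:a", "192k"]],
  ["mp3/wav", ["-c:a", "pcm_s16le"]],
  ["mp4/gif", ["-vf", "fps=10,scale=640:-1:flags=lanczos"]],
]);

// Returns the full ffmpeg argument list (input, extra args, overwrite flag,
// output), or null when the pair has no entry in the table.
export function buildFfmpegArgs(
  sourceExt: string,
  targetExt: string,
  inputPath: string,
  outputPath: string
): string[] | null {
  const key = `${sourceExt.toLowerCase()}/${targetExt.toLowerCase()}`;
  const extra = FFMPEG_EXTRA_ARGS.get(key);
  if (extra === undefined) return null;
  return ["-i", inputPath, ...extra, "-y", outputPath];
}
```

Returning null for unknown pairs lets the caller report the conversion as unavailable instead of running ffmpeg with default codecs.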

Documents

DOCX to HTML: `mammoth ^1.12.0`

Mammoth converts .docx → HTML with semantic preservation. It does NOT create DOCX.

pnpm --filter @paperclipai/server add mammoth

import mammoth from "mammoth";

const { value: html } = await mammoth.convertToHtml({ path: docxPath });
// html is a clean HTML string; images are embedded as base64 data URIs by default

Why mammoth over pandoc for DOCX→HTML: Mammoth preserves heading hierarchy, tables, lists, and images correctly. Its output is cleaner HTML than pandoc's for Word documents. Single-purpose library, no system binary required.

TypeScript types: Included in the package since v1.7 (@types/mammoth not needed).

Confidence: HIGH — official npm, v1.12.0, actively maintained (last publish 20 days ago).

Markdown → DOCX / PDF / HTML: Pandoc (system binary + thin wrapper)

Integration approach: Pandoc is a Haskell binary — there is no pure-Node.js implementation. Existing Node.js wrappers are thin child_process shims:

node-pandoc ^0.2.7 — 13K weekly downloads, most popular, but last updated 2021
pandoc-ts — TypeScript wrapper, smaller community
Recommended: write a 20-line child_process.spawn wrapper — same as ffmpeg approach

// server/src/services/converters/pandoc-converter.ts
import { spawn } from "child_process";

export function pandocConvert(
  inputPath: string,
  outputPath: string,
  from: string,
  to: string
): Promise<void> {
  return new Promise((resolve, reject) => {
    const proc = spawn("pandoc", [
      inputPath,
      "-f", from,
      "-t", to,
      "-o", outputPath,
    ]);
    // Spawn failures (e.g. pandoc not installed) arrive as "error", not "close"
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`pandoc exited ${code}`))
    );
    proc.stderr.on("data", () => {}); // drain stderr so a full pipe cannot stall pandoc
  });
}

Supported format pairs via pandoc:

Source	Target
Markdown	DOCX, HTML, RST, LaTeX, EPUB
RST	Markdown, HTML, DOCX
HTML	Markdown, DOCX
LaTeX	HTML, Markdown
DOCX	Markdown (lossier than mammoth for HTML)

System dependency: pandoc must be installed on the Mac Mini. Install via brew install pandoc. Check at server startup with which pandoc; if absent, degrade gracefully.
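
The startup check described above can be done without shelling out to `which`. A minimal sketch, assuming a hypothetical `findOnPath` helper that scans the PATH directories itself; the name and signature are illustrative only.

```typescript
import { accessSync, constants } from "node:fs";
import { delimiter, join } from "node:path";

// Returns the absolute path of `binary` if an executable entry with that
// name exists in one of the PATH directories, otherwise null.
export function findOnPath(
  binary: string,
  pathVar: string = process.env.PATH ?? ""
): string | null {
  for (const dir of pathVar.split(delimiter)) {
    if (dir === "") continue;
    const candidate = join(dir, binary);
    try {
      accessSync(candidate, constants.X_OK); // throws if missing or not executable
      return candidate;
    } catch {
      // Not in this directory; keep looking.
    }
  }
  return null;
}
```

At startup, `findOnPath("pandoc") === null` would be the signal to mark pandoc-backed routes as unavailable rather than letting them fail at request time.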

Confidence: HIGH — pandoc is the de-facto standard; brew install is 1 command; child_process wrapper is trivial.

DOCX / ODT / PPTX → PDF: LibreOffice headless

Package: libreoffice-convert ^1.8.1

pnpm --filter @paperclipai/server add libreoffice-convert
# System dependency: brew install --cask libreoffice

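import { promisify } from "util";
import libre from "libreoffice-convert";
import fs from "fs/promises";

// The package exposes a callback-style convert(); its README promisifies it.
// Verify against the installed version in case a promise API is exported directly.
const convertAsync = promisify(libre.convert);

const inputBuffer = await fs.readFile(docxPath);
const pdfBuffer = await convertAsync(inputBuffer, ".pdf", undefined);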

Format pairs:

Source	Target
DOCX / DOC	PDF, ODT, HTML
PPTX / PPT	PDF, ODP
XLSX / XLS	PDF, ODS, CSV
ODT / ODP / ODS	PDF, DOCX, PPTX

Performance warning: LibreOffice starts a heavyweight office runtime (soffice) on first call. Cold start: ~3-5 seconds. Warm subsequent calls: ~500ms. Serialize LibreOffice jobs: run at most one render at a time, never concurrently.

System dependency: LibreOffice must be installed at /Applications/LibreOffice.app on macOS. Check at server startup; degrade gracefully if absent.

Confidence: MEDIUM — package v1.8.1 confirmed. macOS arm64 LibreOffice runs natively on M4 (confirmed via LibreOffice download page). Single source for LibreOffice npm package maintenance status.

HTML → PDF: playwright-chromium (already decided in STACK.md)

Use playwright-chromium ^1.50.0 for HTML→PDF. Already researched; do not re-add.

Data Formats

Spreadsheets: `xlsx` (SheetJS Community Edition)

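Package: xlsx ^0.20.x (SheetJS Community Edition). Note that the xlsx package on the npm registry stopped at 0.18.5; SheetJS publishes 0.20.x as a tarball on its own CDN, so confirm which version the install command below actually resolves.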

pnpm --filter @paperclipai/server add xlsx

Format coverage:

Source	Target	Method
XLSX / XLS / ODS	CSV	`XLSX.utils.sheet_to_csv(ws)`
XLSX / XLS / ODS	JSON	`XLSX.utils.sheet_to_json(ws)`
CSV / JSON	XLSX	`XLSX.utils.json_to_sheet(data)` → `XLSX.writeFile(wb, path)`

SheetJS has ~7.8M weekly downloads and handles all Excel formats including legacy .xls. It is the default choice with no alternatives needed for basic spreadsheet conversion.

Licensing note: SheetJS Community Edition is free (Apache 2.0 for historical versions; check current license at install time). SheetJS Pro adds streaming for very large files — not needed at single-user scale.

Confidence: MEDIUM-HIGH — widely used, 7.8M weekly downloads confirmed. Version 0.20.x confirmed. License nuance warrants a check at install time.

CSV Parsing: `csv-parse ^6.2.1` (part of the `csv` ecosystem)

pnpm --filter @paperclipai/server add csv-parse

The csv-parse ^6.2.1 package (latest as of April 2026) implements stream.Transform and supports both streaming and synchronous/callback modes. It includes TypeScript types.

When to use: Parsing user-uploaded CSV files before transformation (CSV→JSON, CSV→XLSX). For generating CSV output from JSON/objects, use SheetJS or csv-stringify (same ecosystem as csv-parse).

Confidence: HIGH — v6.2.1 confirmed from npm search. Maintained (last publish 4 days ago).

JSON ↔ CSV: Use `csv-parse` + `csv-stringify` (not `json2csv`)

The json2csv package is in maintenance mode at v6.0.0-alpha (3 years old). The json-2-csv package (v5.5.10, different package) is active but adds a dependency for something csv-stringify already handles.

Recommendation: Use csv-stringify ^6.x (same ecosystem as csv-parse, same maintainer) for JSON→CSV. This avoids pulling in a separate package.

pnpm --filter @paperclipai/server add csv-stringify

Code Formats

Code Formatting (JS/TS/CSS/HTML): `prettier ^3.x`

# prettier is called at runtime by the server, so install it as a regular dependency (not --save-dev)
pnpm --filter @paperclipai/server add prettier

Prettier exposes a programmatic API:

import { format } from "prettier";

const formatted = await format(sourceCode, {
  parser: "typescript",  // or "babel", "css", "html", "markdown", etc.
  semi: true,
  singleQuote: true,
});

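Use case: Agent generates code → prettier formats it before saving. It also covers format-only conversions such as "reformat this JSON". Prettier changes layout only, never semantics, so a request like "convert CommonJS to ESM" is not a Prettier job; it needs a codemod or the AI-bridged path.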

Confidence: HIGH — Prettier API is well-documented at prettier.io/docs/api.

TypeScript Type Generation (JSON Schema → TypeScript): `json-schema-to-typescript ^15.x`

JSON Schema → TypeScript is a case that might otherwise be routed to an agent, but this library generates the type definitions deterministically:

pnpm --filter @paperclipai/server add json-schema-to-typescript

import { compile } from "json-schema-to-typescript";
const ts = await compile(jsonSchema, "MyType");

Confidence: MEDIUM — widely used library; version verified as 15.x on npm (as of late 2025). Use for JSON Schema→TypeScript specifically; TypeScript→TypeScript reformatting uses the compiler API or prettier.

Format Coverage Matrix

Source ↓ / Target →	PNG	JPEG	WebP	AVIF	SVG	PDF	DOCX	XLSX	CSV	JSON	MP4	MP3	WebM
PNG	—	sharp	sharp	sharp	—	playwright	—	—	—	—	—	—	—
JPEG	sharp	—	sharp	sharp	—	playwright	—	—	—	—	—	—	—
WebP	sharp	sharp	—	sharp	—	playwright	—	—	—	—	—	—	—
SVG	sharp/@resvg	sharp/@resvg	sharp/@resvg	—	—	playwright	—	—	—	—	—	—	—
DOCX	—	—	—	—	—	LibreOffice	—	—	—	—	—	—	—
PPTX	—	—	—	—	—	LibreOffice	—	—	—	—	—	—	—
XLSX/XLS	—	—	—	—	—	LibreOffice	—	—	SheetJS	SheetJS	—	—	—
HTML	—	—	—	—	—	playwright	pandoc†	—	—	—	—	—	—
Markdown	—	—	—	—	—	pandoc	pandoc	—	—	—	—	—	—
CSV	—	—	—	—	—	AI-bridged	—	SheetJS	—	csv-parse	—	—	—
JSON	—	—	—	—	—	AI-bridged	—	SheetJS	csv-stringify	—	—	—	—
MP4/MKV	—	—	—	—	—	—	—	—	—	—	—	ffmpeg	ffmpeg
MP3/WAV	—	—	—	—	—	—	—	—	—	—	—	—	—
WAV/OGG	—	—	—	—	—	—	—	—	—	—	—	ffmpeg	—

† HTML→DOCX requires pandoc (mammoth is one-way: DOCX→HTML only)

AI-bridged: Format pairs without a deterministic tool path. See Tier 2 below.

Format Registry Pattern

Dispatch Table Design

The registry is a map of "source/target" → handler function. This is simpler than a class hierarchy and matches the existing Nexus factory function pattern.

// server/src/services/converters/registry.ts

export type ConversionHandler = (
  inputPath: string,
  outputPath: string,
  opts?: Record<string, unknown>
) => Promise<void>;

export type ConversionCapability = "direct" | "ai-bridged" | "unavailable";

export interface ConversionRoute {
  capability: ConversionCapability;
  handler?: ConversionHandler;     // present when capability = "direct"
  aiHint?: string;                 // present when capability = "ai-bridged"
  requiresSystemDep?: string;      // e.g. "pandoc", "libreoffice"
}

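// Key format: "source/target" using bare extensions, e.g. "png/webp" (lowercase, no leading dot)
const registry = new Map<string, ConversionRoute>();

// Normalize so ".PNG" and "png" resolve to the same key
function routeKey(sourceExt: string, targetExt: string): string {
  const norm = (ext: string) => ext.toLowerCase().replace(/^\./, "");
  return `${norm(sourceExt)}/${norm(targetExt)}`;
}

export function registerConverter(
  sourceExt: string,
  targetExt: string,
  route: ConversionRoute
): void {
  registry.set(routeKey(sourceExt, targetExt), route);
}

export function getConverter(sourceExt: string, targetExt: string): ConversionRoute {
  return registry.get(routeKey(sourceExt, targetExt)) ?? { capability: "unavailable" };
}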

export function listSupportedTargets(sourceExt: string): string[] {
  const results: string[] = [];
  for (const [key, route] of registry.entries()) {
    if (key.startsWith(`${sourceExt}/`) && route.capability !== "unavailable") {
      results.push(key.split("/")[1]);
    }
  }
  return results;
}

Registration (in server startup):

// server/src/services/converters/index.ts
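import { registerConverter } from "./registry";
import { sharpConvert } from "./sharp-converter";
import { ffmpegConvert } from "./ffmpeg-converter";
import { pandocConvert } from "./pandoc-converter";
// Assumed module paths for the remaining handlers, following the naming above
import { mammothConvert } from "./mammoth-converter";
import { libreofficeConvert } from "./libreoffice-converter";
import { sheetjsConvert } from "./sheetjs-converter";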

// Image
registerConverter("png", "webp", { capability: "direct", handler: (i, o) => sharpConvert(i, o, "webp") });
registerConverter("jpg", "webp", { capability: "direct", handler: (i, o) => sharpConvert(i, o, "webp") });
registerConverter("svg", "png",  { capability: "direct", handler: (i, o) => sharpConvert(i, o, "png") });

// Documents
registerConverter("docx", "html",   { capability: "direct", handler: mammothConvert, requiresSystemDep: undefined });
registerConverter("docx", "pdf",    { capability: "direct", handler: libreofficeConvert, requiresSystemDep: "libreoffice" });
registerConverter("md",   "docx",   { capability: "direct", handler: (i, o) => pandocConvert(i, o, "markdown", "docx"), requiresSystemDep: "pandoc" });

// Data
registerConverter("xlsx", "csv",    { capability: "direct", handler: sheetjsConvert });
registerConverter("csv",  "xlsx",   { capability: "direct", handler: sheetjsConvert });
registerConverter("csv",  "pdf",    { capability: "ai-bridged", aiHint: "format as a formatted table report PDF" });
registerConverter("json", "pdf",    { capability: "ai-bridged", aiHint: "render as a structured document report" });

// Audio/Video
registerConverter("mp4",  "webm",   { capability: "direct", handler: (i, o) => ffmpegConvert(i, o, ["-c:v", "libvpx-vp9", "-c:a", "libopus"]) });
registerConverter("mp3",  "wav",    { capability: "direct", handler: (i, o) => ffmpegConvert(i, o, ["-c:a", "pcm_s16le"]) });

Extensibility: Adding a new format pair is one registerConverter() call. The route handler and the registry are decoupled. System dependency checks should run at registration time during startup, not at request time: when a system dep is absent, its converters are registered as capability: "unavailable". The registration calls above omit that check for brevity.
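
Registering a route as "unavailable" when its system dependency is missing can be sketched as a thin wrapper around registration. This is an illustration under assumptions: `registerWithDepCheck` and the `isInstalled` probe are hypothetical names, and the block uses its own small map and a trimmed `Route` type rather than the real registry so that it stays self-contained.

```typescript
type Capability = "direct" | "ai-bridged" | "unavailable";

interface Route {
  capability: Capability;
  requiresSystemDep?: string; // e.g. "pandoc", "libreoffice"
}

// Stand-in for the real registry map, keyed by "source/target".
const routes = new Map<string, Route>();

// Registers a route, downgrading it to "unavailable" when the probe
// reports that its system dependency is not installed.
export function registerWithDepCheck(
  sourceExt: string,
  targetExt: string,
  route: Route,
  isInstalled: (dep: string) => boolean
): Route {
  const dep = route.requiresSystemDep;
  const missing = dep !== undefined && !isInstalled(dep);
  const effective: Route = missing
    ? { capability: "unavailable", requiresSystemDep: dep }
    : route;
  routes.set(`${sourceExt}/${targetExt}`, effective);
  return effective;
}
```

Passing the probe in as a parameter keeps the startup check testable without pandoc or LibreOffice present on the machine.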

Tier 2: AI-Bridged Conversion

When to Use AI (vs Direct Tool)

Criterion	Direct Tool	AI-Bridged
Output is byte-for-byte deterministic	Yes	No
Format pair has an established tool	Yes	No
Conversion is purely structural (no semantic change)	Yes	—
Conversion requires understanding content meaning	—	Yes
Source format is machine-readable but lacks a direct path	—	Yes
Example pairs	PNG→WebP, DOCX→PDF, MP4→WebM	CSV→PDF report, JSON→DOCX narrative, schema→TypeScript

Decision rule:

Use direct tool when: getConverter(src, tgt).capability === "direct".
Use AI when: capability is "ai-bridged" AND the source is a text/data format that an LLM can read as context.
Return "unavailable" error when: capability is "unavailable" (binary formats with no path, e.g. PDF→XLSX).

AI-bridged is NOT a fallback for when a tool is missing. If LibreOffice is not installed, DOCX→PDF is "unavailable", not "ai-bridged". AI-bridged is only for semantically complex conversions where no deterministic tool exists.
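
The decision rule above is small enough to express as a pure function. A minimal sketch, assuming a hypothetical `decideConversion` helper and `Decision` type; the capability values are the ones defined in the registry.

```typescript
type ConversionCapability = "direct" | "ai-bridged" | "unavailable";
type Decision = "run-direct-tool" | "run-ai-bridge" | "reject-unavailable";

// Maps a registry capability to an action. A missing tool never falls back
// to AI: anything that is not "direct" or a readable "ai-bridged" pair is rejected.
export function decideConversion(
  capability: ConversionCapability,
  sourceIsTextOrData: boolean
): Decision {
  if (capability === "direct") return "run-direct-tool";
  if (capability === "ai-bridged" && sourceIsTextOrData) return "run-ai-bridge";
  return "reject-unavailable";
}
```

With this shape, DOCX→PDF on a machine without LibreOffice arrives as "unavailable" and is rejected, which is the behavior the paragraph above requires.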

AI-Bridged Prompt Structure

The prompt must be deterministic in its output format requirements. Vague prompts produce vague output.

// server/src/services/converters/ai-bridged-converter.ts

export async function aiBridgedConvert(
  sourceExt: string,
  targetExt: string,
  sourceContent: string,       // text content of the source file
  outputPath: string,
  aiHint: string,              // from registry route
  agentAdapter: AgentAdapter   // existing adapter interface
): Promise<void> {
  const prompt = buildConversionPrompt(sourceExt, targetExt, sourceContent, aiHint);
  const result = await agentAdapter.complete(prompt);
  await writeConversionResult(result, targetExt, outputPath);
}

function buildConversionPrompt(
  sourceExt: string,
  targetExt: string,
  sourceContent: string,
  aiHint: string
): string {
  const outputSpec = OUTPUT_SPEC[targetExt] ?? "the target format";
  return `Convert the following ${sourceExt.toUpperCase()} content to ${targetExt.toUpperCase()}.

Instruction: ${aiHint}

Requirements:
- Output ONLY the converted content, no explanation, no preamble
- Output format: ${outputSpec}
- Preserve all data values exactly; do not summarize or truncate

Source content:
\`\`\`
${sourceContent}
\`\`\``;
}

const OUTPUT_SPEC: Record<string, string> = {
  html:  "Valid HTML5. No <!DOCTYPE>, no <html>/<body> wrapper. Only the content fragment.",
  md:    "GitHub Flavored Markdown.",
  ts:    "Valid TypeScript. No imports unless required. Export all types.",
  pdf:   "HTML that will be rendered to PDF. Use inline <style> for layout.",
  json:  "Valid JSON object or array. No trailing commas. No comments.",
};

Reliability rules for AI-bridged conversion:

Always specify exact output format in the prompt — "output ONLY the converted content" prevents LLM preamble
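Pass temperature: 0 to the adapter; conversions are deterministic tasks, not creative ones (this reduces run-to-run variation but does not guarantee identical output, so the validation step below still applies)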
Validate the output: HTML is parsed with a DOM parser, JSON is JSON.parse(), TypeScript is compiled with tsc --noEmit
Keep source content under 50K characters — larger inputs require chunking or direct tool upgrade
Never use AI-bridged for binary formats (images, audio, video) — LLMs cannot output binary faithfully
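
Two of the rules above, the input size ceiling and output validation, can be sketched for the JSON target as follows. The helper names (`assertSourceWithinLimit`, `stripCodeFence`, `parseJsonOutput`) are hypothetical, and stripping a wrapping Markdown code fence is an assumption about how a model might decorate its answer, not something the adapter is known to do.

```typescript
const MAX_SOURCE_CHARS = 50000; // ceiling from the reliability rules above
const FENCE = "`".repeat(3);    // Markdown code fence marker
// Matches output that is entirely wrapped in one code fence and captures the body.
const WRAPPED = new RegExp("^" + FENCE + "[^\\n]*\\n([\\s\\S]*?)\\n?" + FENCE + "$");

// Rejects sources that are too large to pass to the model in one prompt.
export function assertSourceWithinLimit(sourceContent: string): void {
  if (sourceContent.length > MAX_SOURCE_CHARS) {
    throw new Error(`Source has ${sourceContent.length} chars; limit is ${MAX_SOURCE_CHARS}`);
  }
}

// Removes a single wrapping code fence if present; otherwise returns the trimmed text.
export function stripCodeFence(raw: string): string {
  const trimmed = raw.trim();
  const match = WRAPPED.exec(trimmed);
  return match?.[1] ?? trimmed;
}

// Validates JSON output by parsing it; any preamble or trailing prose fails here.
export function parseJsonOutput(raw: string): unknown {
  const body = stripCodeFence(raw);
  try {
    return JSON.parse(body);
  } catch {
    throw new Error("AI-bridged output is not valid JSON");
  }
}
```

Failing hard on unparseable output keeps a malformed result from being stored as a finished conversion; the job can be retried or marked failed instead.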

UI Patterns

Deep-Linkable Route Structure

/convert                    → landing page, format picker
/convert/:from              → source chosen, target picker
/convert/:from/:to          → conversion page, upload + convert CTA
/convert/:from/:to/:jobId   → result page, download + share

This mirrors the CloudConvert URL pattern. Each route is bookmarkable and shareable. React Router <Link> handles navigation without page reload.

Implementation with React Router v6:

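// ui/src/pages/convert/index.tsx
<Routes>
  <Route path="/convert" element={<ConvertLanding />} />
  <Route path="/convert/:from" element={<ConvertSourcePage />} />
  <Route path="/convert/:from/:to" element={<ConvertPage />} />
  {/* Result route from the structure above; ConvertResult is the component listed under UI Component Pattern */}
  <Route path="/convert/:from/:to/:jobId" element={<ConvertResult />} />
</Routes>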

The :from and :to params are lowercase file extension strings (e.g. png, pdf, docx). The UI queries GET /api/convert/formats/:from to show available target formats dynamically.

Conversion API Endpoints

GET  /api/convert/formats           → all supported format pairs
GET  /api/convert/formats/:from     → targets available for a given source
POST /api/convert/jobs              → create job, returns jobId (202 Accepted)
GET  /api/convert/jobs/:jobId       → job status + result download URL

This mirrors the existing content-jobs pattern from ARCHITECTURE.md. The conversion system is a second consumer of contentJobService — not a separate job system.

// Register conversion as content job types
registerConverter("png", "webp", {
  capability: "direct",
  handler: (i, o) => sharpConvert(i, o, "webp"), // same wrapper shape as the startup registration
});
// POST /api/convert/jobs creates a content_jobs row with type: "convert:png/webp"

UI Component Pattern

Key components for the conversion UI:

ConvertLanding           — grid of format categories (Images, Documents, Data, Video)
FormatPicker             — searchable list of source/target formats
ConvertDropzone          — drag-drop + paste + file picker, file size display
ConvertProgress          — SSE-driven progress bar (reuses existing SSE subscriber hook)
ConvertResult            — download button, preview (image inline, PDF iframe), copy link
FormatBadge              — small pill showing format ext with category color

Drag-drop: Use the existing file upload pattern from Nexus v1.3 file system. Do not add a new drag-drop library — the existing upload component already handles it.

Progress: Use the existing SSE live event subscriber. The conversion job emits content.job.started and content.job.done events — the same events as other content jobs. The ConvertProgress component subscribes to these. Zero new infrastructure needed.

Format categories for the landing grid:

Category	Formats	Icon
Images	PNG, JPEG, WebP, AVIF, SVG, GIF	camera
Documents	PDF, DOCX, HTML, Markdown, ODT	file-text
Data	CSV, XLSX, JSON, TSV	table
Video	MP4, WebM, MKV, MOV, GIF	video
Audio	MP3, WAV, OGG, FLAC	music
Code	JS, TS, CSS, HTML	code

Security Pitfalls

Critical: Path Traversal via Filename

What goes wrong: A user uploads a file named ../../etc/passwd.csv or ../config.json. If the server uses the original filename to construct the output path, it writes outside the intended temp directory.

How it happens in conversion specifically: Conversion tools (pandoc, ffmpeg, LibreOffice) write output to a path the server constructs. If the output path includes any user-supplied component, traversal is trivial.

CVE context: CVE-2026-21440 (AdonisJS, CVSS 9.2) is exactly this pattern — MultipartFile.move() without filename sanitization allows arbitrary file write.

Prevention:

import path from "path";
import crypto from "crypto";

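// NEVER use the original filename for output paths
function safeOutputPath(tempDir: string, targetExt: string): string {
  // targetExt must come from the registry, never from the request; reject anything unexpected
  if (!/^[a-z0-9]+$/.test(targetExt)) {
    throw new Error(`Invalid target extension: ${targetExt}`);
  }
  const id = crypto.randomUUID();
  return path.join(tempDir, `${id}.${targetExt}`);
  // Stays inside tempDir because neither id nor targetExt can contain a path separator
}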

// NEVER pass user-supplied paths to child_process.spawn
// Always resolve through the StorageService, which controls the output namespace

Absolute rule: The only user-supplied value that enters the conversion pipeline is the file content (Buffer). File names, paths, and extensions are derived server-side from MIME type detection and the registry.

Critical: MIME Type vs Extension Spoofing

What goes wrong: A user uploads a file with extension .csv but the actual content is an executable or a PHP file. The server passes it to the pandoc handler expecting text input.

Prevention:

Validate MIME type with file-type ^19.x (reads magic bytes, not extension)
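Reject files whose detected MIME type does not match the declared source format; file-type only recognizes binary formats and returns undefined for text formats such as CSV, JSON, and Markdown, so those need a separate text check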
Set Content-Disposition: attachment on all download responses — never inline for non-image, non-PDF files

import { fileTypeFromBuffer } from "file-type";

const detected = await fileTypeFromBuffer(inputBuffer);
if (detected?.mime !== EXPECTED_MIME[sourceExt]) {
  throw new Error(`MIME mismatch: declared ${sourceExt} but detected ${detected?.mime}`);
}
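
Because magic-byte detection yields no result for text formats, the comparison needs two branches. A minimal sketch, assuming a hypothetical `checkDeclaredFormat` helper that receives the detection result (the `mime` value from file-type, or undefined) as a parameter so the logic can be tested on its own; the `EXPECTED_MIME` entries shown are an illustrative subset.

```typescript
// Illustrative subset: expected magic-byte MIME per source extension.
// Text formats map to null because magic-byte sniffing cannot identify them.
const EXPECTED_MIME = new Map<string, string | null>([
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["pdf", "application/pdf"],
  ["csv", null],
  ["json", null],
  ["md", null],
]);

// Throws when the detected type contradicts the declared source format.
export function checkDeclaredFormat(
  sourceExt: string,
  detectedMime: string | undefined
): void {
  const expected = EXPECTED_MIME.get(sourceExt);
  if (expected === undefined) {
    throw new Error(`Unsupported source format: ${sourceExt}`);
  }
  if (expected === null) {
    // Text format: any positive binary detection means the file is not text.
    if (detectedMime !== undefined) {
      throw new Error(`MIME mismatch: declared ${sourceExt} but detected ${detectedMime}`);
    }
    return;
  }
  if (detectedMime !== expected) {
    throw new Error(`MIME mismatch: declared ${sourceExt} but detected ${detectedMime ?? "nothing"}`);
  }
}
```

A text file that passes this check has only been shown not to be a known binary type; it should still be parsed by the CSV or JSON reader before conversion.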

New package: file-type ^19.x

pnpm --filter @paperclipai/server add file-type

Confidence: HIGH — file-type is the standard Node.js magic-bytes library, widely used. v19.x is pure ESM; confirm that server's module resolution handles ESM imports.

Moderate: Unbounded Resource Consumption (DoS via Large Files)

What goes wrong: A user uploads a 4GB video file. ffmpeg (or LibreOffice, for an oversized document) consumes all available RAM while processing it, and the server crashes.

Prevention:

// Enforce these limits in the multipart upload parser (e.g. Multer's limits.fileSize); Express's built-in body parsers do not handle multipart uploads
const MAX_FILE_SIZES: Record<string, number> = {
  image:    50  * 1024 * 1024,   // 50 MB
  document: 100 * 1024 * 1024,   // 100 MB
  video:    500 * 1024 * 1024,   // 500 MB
  audio:    200 * 1024 * 1024,   // 200 MB
  data:     20  * 1024 * 1024,   // 20 MB (CSV/XLSX)
};

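Also: cap the resources of child processes spawned for conversion (ffmpeg, pandoc). Node's child_process.spawn has no rlimit option, so a memory ceiling such as RLIMIT_AS has to be applied outside Node, for example by launching the tool through a shell that runs ulimit first; verify on the Mac Mini that the limit is actually enforced. spawn's timeout option can additionally bound runaway jobs. Conservative default: 2GB per subprocess.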

Moderate: Temp File Accumulation (Disk Exhaustion)

What goes wrong: Conversion jobs fail midway. Temp files in /tmp are never cleaned up. The disk fills over days/weeks.

CVE context: CVE-2026-3304 (Multer < 2.1.0) — failed requests leave temp files on disk. Multer ≥ 2.1.0 fixes this but only for Multer's own temp files. Conversion intermediates are your responsibility.

Prevention:

// Always use try/finally to clean up temp files
import fs from "fs/promises";
import os from "os";
import path from "path";

async function withTempDir<T>(fn: (dir: string) => Promise<T>): Promise<T> {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), "nexus-convert-"));
  try {
    return await fn(dir);
  } finally {
    await fs.rm(dir, { recursive: true, force: true }).catch(() => {});
  }
}

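Additionally: sweep for stale nexus-convert-* directories older than 1 hour at server startup, since a crash or kill skips the finally block. A process exit handler is a poor place for this because "exit" listeners can only run synchronous code.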
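
The sweep for stale nexus-convert-* directories can be sketched as below. `sweepStaleTempDirs` is a hypothetical helper; the `root` and `now` parameters exist only to make it testable and default to the system temp directory and the current time.

```typescript
import { readdir, rm, stat } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

const STALE_AFTER_MS = 60 * 60 * 1000; // 1 hour, as stated above

// Deletes nexus-convert-* directories under `root` whose mtime is more than
// one hour before `now`. Returns the paths that were removed.
export async function sweepStaleTempDirs(
  root: string = tmpdir(),
  now: number = Date.now()
): Promise<string[]> {
  const removed: string[] = [];
  for (const name of await readdir(root)) {
    if (!name.startsWith("nexus-convert-")) continue;
    const full = join(root, name);
    try {
      const info = await stat(full);
      if (info.isDirectory() && now - info.mtimeMs > STALE_AFTER_MS) {
        await rm(full, { recursive: true, force: true });
        removed.push(full);
      }
    } catch {
      // Entry disappeared between readdir and stat; nothing to clean up.
    }
  }
  return removed;
}
```

The age threshold matters: a directory younger than an hour may belong to a job that is still running, so the sweep leaves it alone.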

Minor: Arbitrary Code Execution via Conversion Tool Arguments

What goes wrong: Conversion parameters from the API request body are passed directly as CLI arguments to pandoc/ffmpeg. An attacker sends {"extraArgs": ["--lua-filter=/etc/passwd"]}.

Prevention: Never expose raw CLI arg arrays to the API. The registry pre-defines all argument templates. The only variable substitution is for safe values (bitrate as a number, output format as an enum from the registry).

// BAD: never do this
const args = userRequest.extraArgs as string[];
spawn("pandoc", [input, ...args, "-o", output]);

// GOOD: use pre-defined templates from the registry
const route = getConverter(from, to);
if (route.capability !== "direct" || !route.handler) {
  throw new Error(`No direct converter for ${from}/${to}`);
}
await route.handler(inputPath, outputPath, {}); // handler owns its own args

Performance Pitfalls

Anti-Pattern: Spawning One Process Per Request

What goes wrong: Each conversion request immediately spawns a new ffmpeg/LibreOffice/pandoc subprocess. Under concurrent load, 20 requests spawn 20 ffmpeg processes simultaneously, exhausting CPU and memory.

Mitigation: Queue conversion jobs through contentJobService (already the pattern). The job queue naturally serializes heavy jobs. For LibreOffice specifically: enforce single-concurrency in the LibreOffice adapter with a simple in-memory semaphore:

let libreofficeRunning = false;
const libreofficeQueue: Array<() => void> = [];

async function libreofficeConvertSerialized(buf: Buffer, ext: string): Promise<Buffer> {
  while (libreofficeRunning) {
    await new Promise<void>(resolve => libreofficeQueue.push(resolve));
  }
  libreofficeRunning = true;
  try {
    return await convertAsync(buf, ext, undefined);
  } finally {
    libreofficeRunning = false;
    libreofficeQueue.shift()?.();
  }
}

For ffmpeg: allow up to 3 concurrent processes (M4 has 10 cores; each ffmpeg uses ~2-3 threads for simple conversions).
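
A cap of three concurrent ffmpeg processes calls for a counting limiter rather than the boolean flag used for LibreOffice. A minimal sketch, assuming a hypothetical `createLimiter` factory; a finished task hands its slot directly to the next waiter, so the active count never exceeds the limit.

```typescript
// Returns a function that runs tasks with at most `maxConcurrent` in flight.
export function createLimiter(maxConcurrent: number) {
  let active = 0;                       // slots currently occupied
  const waiting: Array<() => void> = []; // resolvers for queued callers

  return async function run<T>(task: () => Promise<T>): Promise<T> {
    if (active >= maxConcurrent) {
      // Wait until a finishing task hands over its slot.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) {
        next(); // slot passes to the next waiter; active stays the same
      } else {
        active--;
      }
    }
  };
}

// Limit taken from the paragraph above: up to 3 concurrent ffmpeg processes.
const runFfmpegJob = createLimiter(3);
```

A call such as `runFfmpegJob(() => ffmpegConvert(input, output, args))` then queues automatically once three conversions are running.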

Anti-Pattern: Loading Entire File into Memory Before Converting

What goes wrong: A 500MB MP4 file is read into a Buffer before being passed to ffmpeg. Node.js allocates 500MB of heap. V8 GC stalls. Server becomes unresponsive.

Mitigation: For files larger than 10MB, write the upload directly to a temp file path and pass the path to the converter. The StorageService.getStream(objectKey) method should be used to pipe large files to disk before conversion.
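
Spooling a large upload to disk without buffering it can be sketched with Node's stream pipeline. `spoolToFile` is a hypothetical helper, and the sketch assumes that `StorageService.getStream(objectKey)` returns a Node.js `Readable`, which this document does not confirm.

```typescript
import { createWriteStream } from "node:fs";
import type { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

// Streams `source` to `destPath` chunk by chunk. Memory use stays bounded by
// the stream's internal buffer regardless of file size, and pipeline()
// destroys both streams if either side fails.
export async function spoolToFile(source: Readable, destPath: string): Promise<void> {
  await pipeline(source, createWriteStream(destPath));
}
```

The converter then receives `destPath` (ideally a path inside the job's `withTempDir` directory) instead of a Buffer.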

Anti-Pattern: Returning Converted File as HTTP Response Body

What goes wrong: POST /api/convert/jobs holds the connection open for 30+ seconds then streams back a 200MB video file as the HTTP response. Upstream proxy (nginx) times out at 30s.

Mitigation: This is the same pattern solved by contentJobService. Always: create a job, return 202+jobId, render async, store result in StorageService, emit SSE done event, client fetches download URL. The download URL uses the existing signed URL / direct serve pattern from assetService.

What NOT to Build

Avoid	Why	Use Instead
ImageMagick CLI wrapper	`sharp` covers all needed raster formats at 4-5× speed	`sharp ^0.34.5` (already installed)
`@imagemagick/magick-wasm`	~0.3× native speed; complex install	`sharp` (libvips-backed)
`fluent-ffmpeg ^2.1.3`	Archived May 2025; full fluent API unnecessary	~30-line `child_process.spawn` wrapper
`node-pandoc ^0.2.7`	Last updated 2021; adds dependency for a ~20-line child_process wrapper	Thin wrapper using `child_process.spawn`
`json2csv` (original)	v6 alpha, 3 years stale	`csv-stringify` (same ecosystem as `csv-parse`)
`pdf-lib`	Pure-JS PDF assembly; no HTML rendering; wrong for HTML→PDF	`playwright-chromium` (already decided)
`jsPDF`	CVE-2025-68428 path traversal in versions <4.0.0; even fixed versions are JS-first PDF generation with weak CSS support	`playwright-chromium`
Arbitrary `extraArgs` in API	Arbitrary CLI args = code execution via crafted filenames/flags	Pre-defined handler templates in registry
Shared temp files between jobs	Race conditions, cleanup failures	`withTempDir()` scoped to each job
AI for binary→binary conversion	LLMs cannot produce binary output faithfully	Always use direct tools for binary format pairs
Polling loop for job status	Creates unnecessary load; SSE already available	Subscribe to existing `content.job.done` SSE event

Installation Summary

New packages to add (v1.7 format conversion):

# Document conversion
pnpm --filter @paperclipai/server add mammoth          # ^1.12.0 — DOCX→HTML (no system dep)
pnpm --filter @paperclipai/server add libreoffice-convert  # ^1.8.1 — Office→PDF (requires LibreOffice)
pnpm --filter @paperclipai/server add xlsx             # ^0.20.x — SheetJS spreadsheet R/W

# CSV/data
pnpm --filter @paperclipai/server add csv-parse        # ^6.2.1 — CSV streaming parser
pnpm --filter @paperclipai/server add csv-stringify    # ^6.x — JSON→CSV generator

# Code formatting
pnpm --filter @paperclipai/server add prettier         # ^3.x — code formatting (JS/TS/CSS/HTML/MD)
pnpm --filter @paperclipai/server add json-schema-to-typescript  # ^15.x — schema→TS types

# Security: MIME type validation
pnpm --filter @paperclipai/server add file-type        # ^19.x — magic byte MIME detection (ESM)

# System dependencies (install once on Mac Mini)
brew install pandoc         # markdown↔docx/html/rst conversions
brew install --cask libreoffice  # office→pdf; optional, degrade gracefully if absent

No new packages needed for:

Image conversion → sharp already installed
Audio/video → ffmpeg-static already installed (write thin wrapper)
SVG→PNG → sharp (basic) or @resvg/resvg-js (already in STACK.md for satori SVGs)
HTML→PDF → playwright-chromium already in STACK.md

Phase-Specific Warnings

Phase Topic	Likely Pitfall	Mitigation
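Format registry setup	Registering converters before system dep checks → silent `"direct"` entries that fail at runtime	Check `which pandoc` and the LibreOffice install path (`/Applications/LibreOffice.app`; the cask does not necessarily put `soffice` on PATH) at startup; register as `"unavailable"` if absent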
LibreOffice integration	LibreOffice spawns a UNO bridge on first call; second call within the same socket fails	Serialize all LibreOffice calls with the semaphore pattern above; never concurrent
File upload security	User-controlled filenames in temp paths	Use `crypto.randomUUID()` for all output paths; never use original filename
`file-type ^19.x`	Pure ESM package in a CJS/ESM mixed server	Use dynamic `await import("file-type")` or configure server's tsconfig for ESM interop
Large video files	Buffer entire file into memory	Pipe uploads directly to disk via streaming; pass file path (not buffer) to ffmpeg
AI-bridged output validation	LLM returns text with preamble before the converted content	Enforce `OUTPUT_SPEC` in prompt; strip leading non-content lines; validate with format parser
`/convert/:from/:to` route	Collides with existing routes if Express route order is wrong	Mount conversion routes before wildcard routes; use `/api/convert/` prefix throughout
`xlsx` (SheetJS) license	Community Edition license changed in recent versions	Check npm package license field at install time; log at startup if non-OSI
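
The first warning in the table (probe system dependencies before registering converters) can be sketched as follows. This is a minimal illustration, not the project's registry: `ConverterEntry`, `isBinaryAvailable`, and `registerConverters` are hypothetical names, and the probe relies on `which`, so it assumes macOS or Linux.

```typescript
import { spawnSync } from "node:child_process";

// Status values follow the table: "direct" when the tool is usable,
// "unavailable" when its system binary is missing.
type ConverterStatus = "direct" | "unavailable";

interface ConverterSpec {
  from: string;
  to: string;
  binary: string; // system binary the converter shells out to, e.g. "pandoc"
}

interface ConverterEntry extends ConverterSpec {
  status: ConverterStatus;
}

// Returns true if `binary` resolves on PATH (uses `which`; POSIX systems only).
function isBinaryAvailable(binary: string): boolean {
  const result = spawnSync("which", [binary], { stdio: "ignore" });
  return result.status === 0;
}

// Probes each distinct binary once, then tags every converter with its status.
// The probe is injectable so the logic can be exercised without the real tools.
function registerConverters(
  wanted: ConverterSpec[],
  probe: (binary: string) => boolean = isBinaryAvailable,
): ConverterEntry[] {
  const available = new Map<string, boolean>();
  return wanted.map((spec): ConverterEntry => {
    if (!available.has(spec.binary)) {
      available.set(spec.binary, probe(spec.binary));
    }
    return { ...spec, status: available.get(spec.binary) ? "direct" : "unavailable" };
  });
}
```

Running this once at startup, for instance with entries for `pandoc` and `soffice`, means a missing binary surfaces as an `"unavailable"` entry instead of a runtime failure on the first conversion request.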

Sources

fluent-ffmpeg GitHub (archived May 2025) — archival notice
fluent-ffmpeg npm — v2.1.3 confirmed
sharp official docs — SVG support via librsvg confirmed
sharp GitHub — v0.34.5 confirmed; 4-5× faster than ImageMagick
mammoth npm — v1.12.0, last published 20 days ago
mammoth GitHub — one-way DOCX→HTML converter
pandoc official — universal markup converter, Haskell binary
node-pandoc npm — thin child_process wrapper, v0.2.7, last updated 2021
libreoffice-convert npm — v1.8.1, last published ~February 2026
SheetJS Community Edition docs — spreadsheet format coverage
SheetJS vs ExcelJS comparison 2026 — SheetJS 7.8M weekly downloads (MEDIUM confidence — single source)
csv-parse npm — v6.2.1, last published 4 days ago
json-2-csv npm — v5.5.10, maintained alternative to archived json2csv
Prettier API docs — programmatic format() function confirmed
json-schema-to-typescript GitHub — JSON Schema→TypeScript types
ConvertX GitHub (C4illin/ConvertX) — reference architecture: tool-per-category dispatch with 1000+ format pairs
Worker Threads in Node.js 2026 (DEV Community) — pooling recommendation for CPU-bound tasks
CVE-2025-68428: jsPDF path traversal — avoid jsPDF for user-input file paths
CVE-2026-21440: AdonisJS bodyparser path traversal (CVSS 9.2) — file upload filename sanitization
CVE-2026-3304: Multer temp file cleanup DoS — Multer <2.1.0 does not clean temp files on async filter error
Node.js path traversal prevention — path.normalize alone is insufficient
Hybrid AI deterministic/LLM boundary (New Math Data) — deterministic tools for deterministic tasks, LLM for semantic tasks

Format conversion research for: Nexus v1.7 Content Generation Researched: 2026-04-04 Scope: Supplemental — format conversion ecosystem only. Does not supersede STACK.md or ARCHITECTURE.md.
