# Format Conversion Ecosystem

**Project:** Nexus v1.7 — supplemental research for two-tier format conversion system
**Researched:** 2026-04-04
**Scope:** Direct conversion tools, format registry pattern, AI-bridged conversion boundary, UI patterns, security and performance pitfalls
**Confidence:** HIGH for tool choices, MEDIUM for version numbers (npm registry cross-checked)

---

## Context: What Is Already Available

These are confirmed installed and must not be re-added:

| Package | Location | Version | Relevant To Conversion |
|---------|----------|---------|----------------------|
| `sharp ^0.34.5` | `server/` | 0.34.5 | Raster image conversion (resize, format, SVG→PNG) |
| `ffmpeg-static ^5.3.0` | `server/` | 5.3.0 | Audio/video conversion binary |
| `mermaid ^11.12.0` | `ui/` | 11.12.0 | Client-side Mermaid rendering |
| `playwright-chromium ^1.50.0` | `server/` | 1.50.0 | HTML→PDF already decided in STACK.md |

---

## Tier 1: Direct Conversion Tools

### Image Formats

**Primary: `sharp ^0.34.5` (already installed)**

Sharp handles the majority of image format pairs without any additional dependency:

| Source | Target | Method |
|--------|--------|--------|
| JPEG/PNG/WebP/AVIF/TIFF/GIF | Any raster | `sharp(input).toFormat('webp').toBuffer()` |
| SVG | PNG/JPEG/WebP | `sharp(svgBuffer).png().toBuffer()` — libvips handles SVG via librsvg |
| PNG/JPEG | WebP/AVIF | `sharp(input).webp({ quality: 80 }).toBuffer()` |

**Important SVG caveat:** sharp's SVG→PNG conversion uses librsvg. It works well for most SVGs but does NOT support all CSS features. For agent-generated SVGs with embedded fonts (produced by `satori`), use `@resvg/resvg-js` as specified in STACK.md. For user-uploaded SVGs without special fonts, `sharp` is sufficient.

**No ImageMagick needed.** ImageMagick via CLI or WASM adds complexity:

- `imagemagick` npm is an unmaintained CLI wrapper (last release 2020)
- WASM ImageMagick (`@imagemagick/magick-wasm`) works but runs at ~0.3× native speed
- `sharp` via libvips is 4–5× faster than ImageMagick for every supported format pair
- For the format pairs Nexus needs (JPEG, PNG, WebP, AVIF, TIFF, SVG→raster), sharp covers everything

**No Inkscape needed** for Nexus's scope (vector conversion beyond SVG→PNG is out of scope for v1.7).

---

### Audio / Video Formats

**Wrapper: `fluent-ffmpeg ^2.1.3`**

**Important maintenance note:** The `fluent-ffmpeg` repository was archived on May 22, 2025. The package is no longer receiving new features. However:

- It remains published on npm and functional with Node.js >=18
- The `ffmpeg-static ^5.3.0` binary it wraps is still actively maintained
- `@types/fluent-ffmpeg ^2.1.28` provides TypeScript types (last updated October 2025)
- For Nexus's use case (spawn ffmpeg with known args), the archived state is low risk
- Alternative: write a thin `child_process.spawn` wrapper directly — this is ~30 lines and removes the archived dependency entirely

**Recommendation for Nexus:** Implement a minimal `ffmpegConvert(inputPath, outputPath, extraArgs)` wrapper using `child_process.spawn` instead of taking on the archived `fluent-ffmpeg`. The full fluent API is unnecessary — format conversion is a single `ffmpeg -i input.mp4 output.webm` call.

```typescript
// server/src/services/converters/ffmpeg-converter.ts
import { spawn } from "child_process";
import ffmpegPath from "ffmpeg-static";

export function ffmpegConvert(
  inputPath: string,
  outputPath: string,
  extraArgs: string[] = []
): Promise<void> {
  return new Promise((resolve, reject) => {
    // ffmpeg-static resolves to null on platforms without a bundled binary
    if (!ffmpegPath) {
      reject(new Error("ffmpeg-static did not provide a binary path"));
      return;
    }
    const proc = spawn(ffmpegPath, [
      "-i", inputPath,
      ...extraArgs,
      "-y", // overwrite output
      outputPath,
    ]);
    // Without an "error" listener, a failed spawn (e.g. ENOENT) never settles the promise
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`ffmpeg exited ${code}`))
    );
    proc.stderr.on("data", () => {}); // consume stderr to prevent backpressure
  });
}
```

**Format coverage via ffmpeg-static:**

| Source | Target | Extra args |
|--------|--------|-----------|
| MP4/MKV/AVI/MOV | WebM | `["-c:v", "libvpx-vp9", "-c:a", "libopus"]` |
| MP4/WebM | MP3 | `["-vn", "-c:a", "libmp3lame"]` (audio extract; for AAC output use `"-c:a", "aac"` instead) |
| MP3/WAV/FLAC/OGG | MP3 | `["-c:a", "libmp3lame", "-b:a", "192k"]` |
| MP3/WAV/OGG | WAV | `["-c:a", "pcm_s16le"]` |
| Image sequence | MP4 | `["-r", "30", "-c:v", "libx264"]` |
| MP4 | GIF | `["-vf", "fps=10,scale=640:-1:flags=lanczos"]` |
| Any video | Audio-only MP3 | `["-vn"]` |

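
The argument table could be encoded as a lookup so that handlers never assemble ffmpeg flags ad hoc. The following is a minimal sketch, not part of the existing codebase: `FFMPEG_ARGS` and `ffmpegArgsFor` are hypothetical names, and only a few rows of the table are included. Returning a copy keeps the shared templates immutable, which matters once several jobs read them concurrently.

```typescript
// Hypothetical lookup of ffmpeg argument templates, keyed by "source/target".
// FFMPEG_ARGS and ffmpegArgsFor are illustrative names, not existing Nexus code.
const FFMPEG_ARGS: Record<string, readonly string[]> = {
  "mp4/webm": ["-c:v", "libvpx-vp9", "-c:a", "libopus"],
  "mp4/mp3": ["-vn", "-c:a", "libmp3lame"],
  "wav/mp3": ["-c:a", "libmp3lame", "-b:a", "192k"],
  "mp3/wav": ["-c:a", "pcm_s16le"],
  "mp4/gif": ["-vf", "fps=10,scale=640:-1:flags=lanczos"],
};

function ffmpegArgsFor(sourceExt: string, targetExt: string): string[] | undefined {
  const key = `${sourceExt.toLowerCase()}/${targetExt.toLowerCase()}`;
  const args = FFMPEG_ARGS[key];
  // Copy so callers cannot mutate the shared template
  return args ? [...args] : undefined;
}
```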

---

### Documents

#### DOCX to HTML: `mammoth ^1.12.0`

Mammoth converts `.docx` → HTML with semantic preservation. It does NOT create DOCX.

```bash
pnpm --filter @paperclipai/server add mammoth
```

```typescript
import mammoth from "mammoth";

const { value: html } = await mammoth.convertToHtml({ path: docxPath });
// html is a clean HTML string; images are embedded as base64 data URIs by default
```

**Why mammoth over pandoc for DOCX→HTML:** Mammoth preserves heading hierarchy, tables, lists, and images correctly. Its output is cleaner HTML than pandoc's for Word documents. Single-purpose library, no system binary required.

**TypeScript types:** Included in the package since v1.7 (`@types/mammoth` not needed).

**Confidence: HIGH** — official npm, v1.12.0, actively maintained (last publish 20 days ago).

---

#### Markdown → DOCX / PDF / HTML: Pandoc (system binary + thin wrapper)

**Integration approach:** Pandoc is a Haskell binary — there is no pure-Node.js implementation. Existing Node.js wrappers are thin `child_process` shims:

- `node-pandoc ^0.2.7` — 13K weekly downloads, most popular, but last updated 2021
- `pandoc-ts` — TypeScript wrapper, smaller community
- **Recommended: write a 20-line `child_process.spawn` wrapper** — same as ffmpeg approach

```typescript
// server/src/services/converters/pandoc-converter.ts
import { spawn } from "child_process";

export function pandocConvert(
  inputPath: string,
  outputPath: string,
  from: string,
  to: string
): Promise<void> {
  return new Promise((resolve, reject) => {
    const proc = spawn("pandoc", [
      inputPath,
      "-f", from,
      "-t", to,
      "-o", outputPath,
    ]);
    // Rejects when pandoc is not installed (spawn emits "error" with ENOENT)
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`pandoc exited ${code}`))
    );
  });
}
```

**Supported format pairs via pandoc:**

| Source | Target |
|--------|--------|
| Markdown | DOCX, HTML, RST, LaTeX, EPUB |
| RST | Markdown, HTML, DOCX |
| HTML | Markdown, DOCX |
| LaTeX | HTML, Markdown |
| DOCX | Markdown (lossier than mammoth for HTML) |

**System dependency:** pandoc must be installed on the Mac Mini. Install via `brew install pandoc`. Check at server startup with `which pandoc`; if absent, degrade gracefully. PDF output additionally needs a PDF engine (for example a LaTeX distribution) and is selected through the `.pdf` output extension, not through `-t pdf`.

**Confidence: HIGH** — pandoc is the de-facto standard; brew install is 1 command; child_process wrapper is trivial.

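
The startup check for a system binary can be done without a shell. The sketch below is one possible shape, under stated assumptions: `isBinaryAvailable` and `systemDeps` are hypothetical names, and the probe runs `<command> --version` rather than `which`, which works for pandoc and ffmpeg but would need adjusting for tools that lack that flag. Running the probe once at startup keeps per-request latency unaffected.

```typescript
import { spawnSync } from "child_process";

// Hypothetical startup probe: true when `command --version` can be spawned and exits 0.
function isBinaryAvailable(command: string): boolean {
  const result = spawnSync(command, ["--version"], { stdio: "ignore" });
  // result.error is set when the executable cannot be found (ENOENT)
  return result.error === undefined && result.status === 0;
}

// Evaluated once at startup; converters consult this instead of probing per request
const systemDeps = {
  pandoc: isBinaryAvailable("pandoc"),
};
```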

---

#### DOCX / ODT / PPTX → PDF: LibreOffice headless

**Package: `libreoffice-convert ^1.8.1`**

```bash
pnpm --filter @paperclipai/server add libreoffice-convert
# System dependency: brew install --cask libreoffice
```

```typescript
import { promisify } from "util";
import fs from "fs/promises";
import libre from "libreoffice-convert";

// The package exposes a callback-style convert(); promisify it for async/await
const convertAsync = promisify(libre.convert);

const inputBuffer = await fs.readFile(docxPath);
const pdfBuffer = await convertAsync(inputBuffer, ".pdf", undefined);
```

**Format pairs:**

| Source | Target |
|--------|--------|
| DOCX / DOC | PDF, ODT, HTML |
| PPTX / PPT | PDF, ODP |
| XLSX / XLS | PDF, ODS, CSV |
| ODT / ODP / ODS | PDF, DOCX, PPTX |

**Performance warning:** LibreOffice is a heavyweight process to start. Cold start: ~3-5 seconds. Warm subsequent calls: ~500ms. Serialize LibreOffice jobs (no concurrent renders): run at most one at a time.

**System dependency:** LibreOffice must be installed at `/Applications/LibreOffice.app` on macOS. Check at server startup; degrade gracefully if absent.

**Confidence: MEDIUM** — package v1.8.1 confirmed. macOS arm64 LibreOffice runs natively on M4 (confirmed via LibreOffice download page). Single source for LibreOffice npm package maintenance status.

---

#### HTML → PDF: playwright-chromium (already decided in STACK.md)

Use `playwright-chromium ^1.50.0` for HTML→PDF. Already researched; do not re-add.

---

### Data Formats

#### Spreadsheets: `xlsx` (SheetJS Community Edition)

**Package: `xlsx ^0.20.x`** (SheetJS Community Edition)

```bash
pnpm --filter @paperclipai/server add xlsx
```

**Install source note:** The `xlsx` package on the public npm registry stopped at 0.18.5; SheetJS distributes 0.20.x releases as tarballs from its own CDN. Confirm that the install resolves to a 0.20.x build before relying on the command above.

**Format coverage:**

| Source | Target | Method |
|--------|--------|--------|
| XLSX / XLS / ODS | CSV | `XLSX.utils.sheet_to_csv(ws)` |
| XLSX / XLS / ODS | JSON | `XLSX.utils.sheet_to_json(ws)` |
| CSV / JSON | XLSX | `XLSX.utils.json_to_sheet(data)` → `XLSX.writeFile(wb, path)` |

SheetJS has ~7.8M weekly downloads and handles all Excel formats including legacy `.xls`. It is the default choice with no alternatives needed for basic spreadsheet conversion.

**Licensing note:** SheetJS Community Edition is free (Apache 2.0 for historical versions; check current license at install time). SheetJS Pro adds streaming for very large files — not needed at single-user scale.

**Confidence: MEDIUM-HIGH** — widely used, 7.8M weekly downloads confirmed. Version 0.20.x confirmed. License nuance warrants a check at install time.

---

#### CSV Parsing: `csv-parse ^6.2.1` (part of the `csv` ecosystem)

```bash
pnpm --filter @paperclipai/server add csv-parse
```

The `csv-parse ^6.2.1` package (latest as of April 2026) implements `stream.Transform` and supports both streaming and synchronous/callback modes. It includes TypeScript types.

**When to use:** Parsing user-uploaded CSV files before transformation (CSV→JSON, CSV→XLSX). For generating CSV output from JSON/objects, use SheetJS or `csv-stringify` (same ecosystem as `csv-parse`).

**Confidence: HIGH** — v6.2.1 confirmed from npm search. Maintained (last publish 4 days ago).

---

#### JSON ↔ CSV: Use `csv-parse` + `csv-stringify` (not `json2csv`)

The `json2csv` package is in maintenance mode at v6.0.0-alpha (3 years old). The `json-2-csv` package (v5.5.10, different package) is active but adds a dependency for something `csv-stringify` already handles.

**Recommendation:** Use `csv-stringify ^6.x` (same ecosystem as `csv-parse`, same maintainer) for JSON→CSV. This avoids pulling in a separate package.

```bash
pnpm --filter @paperclipai/server add csv-stringify
```

---

### Code Formats

#### Code Formatting (JS/TS/CSS/HTML): `prettier ^3.x`

```bash
# Regular dependency (not --save-dev): the server calls Prettier programmatically at runtime
pnpm --filter @paperclipai/server add prettier
```

Prettier exposes a programmatic API:

```typescript
import { format } from "prettier";

const formatted = await format(sourceCode, {
  parser: "typescript", // or "babel", "css", "html", "markdown", etc.
  semi: true,
  singleQuote: true,
});
```

**Use case:** Agent generates code → prettier formats it before saving. Also covers layout-only requests such as "reformat this JSON". Prettier changes formatting only; syntax rewrites such as CommonJS → ESM are outside its scope and belong to the AI-bridged tier.

**Confidence: HIGH** — Prettier API is well-documented at prettier.io/docs/api.

---

#### TypeScript Type Generation (JSON Schema → TypeScript): `json-schema-to-typescript ^15.x`

For the AI-bridged case where an agent converts a JSON schema into TypeScript type definitions, this library handles it deterministically:

```bash
pnpm --filter @paperclipai/server add json-schema-to-typescript
```

```typescript
import { compile } from "json-schema-to-typescript";
const ts = await compile(jsonSchema, "MyType");
```

**Confidence: MEDIUM** — widely used library; version verified as 15.x on npm (as of late 2025). Use for JSON Schema→TypeScript specifically; TypeScript→TypeScript reformatting uses the compiler API or prettier.

---

## Format Coverage Matrix

| Source → | PNG | JPEG | WebP | AVIF | SVG | PDF | DOCX | XLSX | CSV | JSON | MP4 | MP3 | WebM |
|----------|-----|------|------|------|-----|-----|------|------|-----|------|-----|-----|------|
| PNG | — | sharp | sharp | sharp | — | playwright | — | — | — | — | — | — | — |
| JPEG | sharp | — | sharp | sharp | — | playwright | — | — | — | — | — | — | — |
| WebP | sharp | sharp | — | sharp | — | playwright | — | — | — | — | — | — | — |
| SVG | sharp/@resvg | sharp/@resvg | sharp/@resvg | — | — | playwright | — | — | — | — | — | — | — |
| DOCX | — | — | — | — | — | LibreOffice | — | — | — | — | — | — | — |
| PPTX | — | — | — | — | — | LibreOffice | — | — | — | — | — | — | — |
| XLSX/XLS | — | — | — | — | — | LibreOffice | — | — | SheetJS | SheetJS | — | — | — |
| HTML | — | — | — | — | — | playwright | pandoc† | — | — | — | — | — | — |
| Markdown | — | — | — | — | — | pandoc | pandoc | — | — | — | — | — | — |
| CSV | — | — | — | — | — | AI-bridged | — | SheetJS | — | csv-parse | — | — | — |
| JSON | — | — | — | — | — | AI-bridged | — | SheetJS | csv-stringify | — | — | — | — |
| MP4/MKV | — | — | — | — | — | — | — | — | — | — | — | ffmpeg | ffmpeg |
| MP3/WAV | — | — | — | — | — | — | — | — | — | — | — | — | — |
| WAV/OGG | — | — | — | — | — | — | — | — | — | — | — | ffmpeg | — |

† HTML→DOCX uses pandoc, not mammoth (mammoth is one-way: DOCX→HTML only)

**AI-bridged**: Format pairs without a deterministic tool path. See Tier 2 below.

---

## Tier 2: Format Registry Pattern

### Dispatch Table Design

The registry is a map of `"source/target"` → handler function. This is simpler than a class hierarchy and matches the existing Nexus factory function pattern.

```typescript
// server/src/services/converters/registry.ts

export type ConversionHandler = (
  inputPath: string,
  outputPath: string,
  opts?: Record<string, unknown>
) => Promise<void>;

export type ConversionCapability = "direct" | "ai-bridged" | "unavailable";

export interface ConversionRoute {
  capability: ConversionCapability;
  handler?: ConversionHandler; // present when capability = "direct"
  aiHint?: string; // present when capability = "ai-bridged"
  requiresSystemDep?: string; // e.g. "pandoc", "libreoffice"
}

// Key format: "sourceExt/targetExt", e.g. "png/webp" — always lowercase, no leading dot
const registry = new Map<string, ConversionRoute>();

export function registerConverter(
  sourceExt: string,
  targetExt: string,
  route: ConversionRoute
): void {
  registry.set(`${sourceExt}/${targetExt}`, route);
}

export function getConverter(sourceExt: string, targetExt: string): ConversionRoute {
  return registry.get(`${sourceExt}/${targetExt}`) ?? { capability: "unavailable" };
}

export function listSupportedTargets(sourceExt: string): string[] {
  const results: string[] = [];
  for (const [key, route] of registry.entries()) {
    if (key.startsWith(`${sourceExt}/`) && route.capability !== "unavailable") {
      results.push(key.split("/")[1]);
    }
  }
  return results;
}
```

**Registration (in server startup):**

```typescript
// server/src/services/converters/index.ts
import { registerConverter } from "./registry";
import { sharpConvert } from "./sharp-converter";
import { ffmpegConvert } from "./ffmpeg-converter";
import { pandocConvert } from "./pandoc-converter";
// Assumed sibling modules for the remaining handlers; file names are illustrative
import { mammothConvert } from "./mammoth-converter";
import { libreofficeConvert } from "./libreoffice-converter";
import { sheetjsConvert } from "./sheetjs-converter";

// Image
registerConverter("png", "webp", { capability: "direct", handler: (i, o) => sharpConvert(i, o, "webp") });
registerConverter("jpg", "webp", { capability: "direct", handler: (i, o) => sharpConvert(i, o, "webp") });
registerConverter("svg", "png", { capability: "direct", handler: (i, o) => sharpConvert(i, o, "png") });

// Documents
registerConverter("docx", "html", { capability: "direct", handler: mammothConvert });
registerConverter("docx", "pdf", { capability: "direct", handler: libreofficeConvert, requiresSystemDep: "libreoffice" });
registerConverter("md", "docx", { capability: "direct", handler: (i, o) => pandocConvert(i, o, "markdown", "docx"), requiresSystemDep: "pandoc" });

// Data
registerConverter("xlsx", "csv", { capability: "direct", handler: sheetjsConvert });
registerConverter("csv", "xlsx", { capability: "direct", handler: sheetjsConvert });
registerConverter("csv", "pdf", { capability: "ai-bridged", aiHint: "format as a formatted table report PDF" });
registerConverter("json", "pdf", { capability: "ai-bridged", aiHint: "render as a structured document report" });

// Audio/Video
registerConverter("mp4", "webm", { capability: "direct", handler: (i, o) => ffmpegConvert(i, o, ["-c:v", "libvpx-vp9"]) });
registerConverter("mp3", "wav", { capability: "direct", handler: (i, o) => ffmpegConvert(i, o, ["-c:a", "pcm_s16le"]) });
```

**Extensibility:** Adding a new format pair is one `registerConverter()` call. The route handler and the registry are decoupled. System dependency checks are meant to run at registration time (server startup), not at request time: when a system dep is absent, the converter is registered as `capability: "unavailable"`.

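
The registration snippet passes `requiresSystemDep` but does not show where the availability check happens. One way to do it, sketched here under assumptions, is a thin wrapper around registration. The names `registerWithDepCheck`, `Route`, and `routes` are hypothetical and stand in for the real registry types; the availability probe is injected so that it can be a PATH check in production and a stub in tests.

```typescript
type Capability = "direct" | "ai-bridged" | "unavailable";

interface Route {
  capability: Capability;
  handler?: (inputPath: string, outputPath: string) => Promise<void>;
  requiresSystemDep?: string;
}

// Stand-in for the registry map, keyed by "sourceExt/targetExt"
const routes = new Map<string, Route>();

// Hypothetical wrapper: downgrades a route to "unavailable" when its system dep is missing.
function registerWithDepCheck(
  sourceExt: string,
  targetExt: string,
  route: Route,
  isDepAvailable: (dep: string) => boolean
): void {
  const missing =
    route.requiresSystemDep !== undefined && !isDepAvailable(route.requiresSystemDep);
  routes.set(
    `${sourceExt}/${targetExt}`,
    missing
      ? { capability: "unavailable", requiresSystemDep: route.requiresSystemDep }
      : route
  );
}
```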

---

## Tier 2: AI-Bridged Conversion

### When to Use AI (vs Direct Tool)

| Criterion | Direct Tool | AI-Bridged |
|-----------|-------------|------------|
| Output is byte-for-byte deterministic | Yes | No |
| Format pair has an established tool | Yes | No |
| Conversion is purely structural (no semantic change) | Yes | — |
| Conversion requires understanding content meaning | — | Yes |
| Source format is machine-readable but lacks a direct path | — | Yes |
| Example pairs | PNG→WebP, DOCX→PDF, MP4→WebM | CSV→PDF report, JSON→DOCX narrative, schema→TypeScript |

**Decision rule:**

> Use direct tool when: `getConverter(src, tgt).capability === "direct"`.
> Use AI when: capability is `"ai-bridged"` AND the source is a text/data format that an LLM can read as context.
> Return `"unavailable"` error when: capability is `"unavailable"` (binary formats with no path, e.g. PDF→XLSX).

**AI-bridged is NOT a fallback for when a tool is missing.** If LibreOffice is not installed, DOCX→PDF is `"unavailable"`, not `"ai-bridged"`. AI-bridged is only for semantically complex conversions where no deterministic tool exists.

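
The decision rule is small enough to express as a pure function. The sketch below is illustrative only: `planConversion`, `Plan`, and the `sourceIsText` flag are assumptions, with the flag expected to be derived by the caller from the source format (for example `csv`, `json`, `md`). Note that an `"ai-bridged"` route with a binary source is rejected instead of silently falling through to the model.

```typescript
type Capability = "direct" | "ai-bridged" | "unavailable";
type Plan = "run-direct-tool" | "run-ai-bridge" | "reject";

// Hypothetical helper encoding the decision rule.
function planConversion(capability: Capability, sourceIsText: boolean): Plan {
  if (capability === "direct") return "run-direct-tool";
  // AI-bridged applies only when an LLM can read the source as context
  if (capability === "ai-bridged" && sourceIsText) return "run-ai-bridge";
  // Covers "unavailable" and ai-bridged routes with binary sources
  return "reject";
}
```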

---

### AI-Bridged Prompt Structure

The prompt must be deterministic in its output format requirements. Vague prompts produce vague output.

```typescript
// server/src/services/converters/ai-bridged-converter.ts
// AgentAdapter is the existing adapter interface; writeConversionResult is assumed
// to be defined elsewhere in this module.

export async function aiBridgedConvert(
  sourceExt: string,
  targetExt: string,
  sourceContent: string, // text content of the source file
  outputPath: string,
  aiHint: string, // from registry route
  agentAdapter: AgentAdapter // existing adapter interface
): Promise<void> {
  const prompt = buildConversionPrompt(sourceExt, targetExt, sourceContent, aiHint);
  const result = await agentAdapter.complete(prompt);
  await writeConversionResult(result, targetExt, outputPath);
}

function buildConversionPrompt(
  sourceExt: string,
  targetExt: string,
  sourceContent: string,
  aiHint: string
): string {
  const outputSpec = OUTPUT_SPEC[targetExt] ?? "the target format";
  return `Convert the following ${sourceExt.toUpperCase()} content to ${targetExt.toUpperCase()}.

Instruction: ${aiHint}

Requirements:
- Output ONLY the converted content, no explanation, no preamble
- Output format: ${outputSpec}
- Preserve all data values exactly; do not summarize or truncate

Source content:
\`\`\`
${sourceContent}
\`\`\``;
}

const OUTPUT_SPEC: Record<string, string> = {
  html: "Valid HTML5. No <!DOCTYPE>, no <html>/<body> wrapper. Only the content fragment.",
  md: "GitHub Flavored Markdown.",
  ts: "Valid TypeScript. No imports unless required. Export all types.",
  pdf: "HTML that will be rendered to PDF. Use inline <style> for layout.",
  json: "Valid JSON object or array. No trailing commas. No comments.",
};
```

**Reliability rules for AI-bridged conversion:**

1. Always specify exact output format in the prompt — "output ONLY the converted content" prevents LLM preamble
2. Pass `temperature: 0` to the adapter — conversions are deterministic tasks, not creative ones
3. Validate the output: HTML is parsed with a DOM parser, JSON is `JSON.parse()`, TypeScript is compiled with `tsc --noEmit`
4. Keep source content under 50K characters — larger inputs require chunking or direct tool upgrade
5. Never use AI-bridged for binary formats (images, audio, video) — LLMs cannot output binary faithfully

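
Rules 3 and 4 lend themselves to small guard functions around the model call. The following sketch covers only the size limit and the JSON check; every name in it (`assertSourceSize`, `stripCodeFence`, `validateAiOutput`) is hypothetical, and the fence-stripping step is an added assumption based on the common habit of models wrapping output in a Markdown code fence despite instructions. HTML and TypeScript validation need a parser or compiler and are left out.

```typescript
const MAX_SOURCE_CHARS = 50_000;
const FENCE = "`".repeat(3);
const FENCE_PATTERN = new RegExp(`^${FENCE}[\\w-]*\\r?\\n([\\s\\S]*?)\\r?\\n${FENCE}$`);

// Rule 4: refuse oversized inputs before they reach the model
function assertSourceSize(sourceContent: string): void {
  if (sourceContent.length > MAX_SOURCE_CHARS) {
    throw new Error(`source has ${sourceContent.length} chars; limit is ${MAX_SOURCE_CHARS}`);
  }
}

// Removes one wrapping Markdown code fence if the whole output is enclosed in it
function stripCodeFence(output: string): string {
  const trimmed = output.trim();
  const match = FENCE_PATTERN.exec(trimmed);
  return match?.[1] ?? trimmed;
}

// Rule 3 (JSON only): returns the cleaned output or throws on invalid JSON
function validateAiOutput(targetExt: string, rawOutput: string): string {
  const cleaned = stripCodeFence(rawOutput);
  if (targetExt === "json") {
    JSON.parse(cleaned); // throws SyntaxError on invalid JSON
  }
  return cleaned;
}
```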

---

## UI Patterns

### Deep-Linkable Route Structure

```
/convert                     → landing page, format picker
/convert/:from               → source chosen, target picker
/convert/:from/:to           → conversion page, upload + convert CTA
/convert/:from/:to/:jobId    → result page, download + share
```

This mirrors the CloudConvert URL pattern. Each route is bookmarkable and shareable. React Router `<Link>` handles navigation without page reload.

**Implementation with React Router v6:**

```typescript
// ui/src/pages/convert/index.tsx
<Route path="/convert" element={<ConvertLanding />} />
<Route path="/convert/:from" element={<ConvertSourcePage />} />
<Route path="/convert/:from/:to" element={<ConvertPage />} />
```

The `:from` and `:to` params are lowercase file extension strings (e.g. `png`, `pdf`, `docx`). The UI queries `GET /api/convert/formats/:from` to show available target formats dynamically.

---

### Conversion API Endpoints

```
GET  /api/convert/formats           → all supported format pairs
GET  /api/convert/formats/:from     → targets available for a given source
POST /api/convert/jobs              → create job, returns jobId (202 Accepted)
GET  /api/convert/jobs/:jobId       → job status + result download URL
```

This mirrors the existing `content-jobs` pattern from ARCHITECTURE.md. The conversion system is a second consumer of `contentJobService` — not a separate job system.

```typescript
// Register conversion as content job types
registerConverter("png", "webp", {
  capability: "direct",
  // Wrap sharpConvert so the handler matches the (inputPath, outputPath) signature
  handler: (i, o) => sharpConvert(i, o, "webp"),
});
// POST /api/convert/jobs creates a content_jobs row with type: "convert:png/webp"
```
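
The `convert:<from>/<to>` job type has to be built when a job is created and parsed again when a worker picks it up. A small pair of helpers, sketched here with hypothetical names (`toJobType`, `parseJobType`), keeps both directions consistent and rejects anything that is not a plain lowercase extension, so route params never flow into the job type unchecked.

```typescript
// Extensions are lowercase alphanumeric, matching the registry key convention
const EXT_PATTERN = /^[a-z0-9]+$/;
const JOB_PREFIX = "convert:";

// Hypothetical helper: builds the content job type for a format pair
function toJobType(from: string, to: string): string {
  if (!EXT_PATTERN.test(from) || !EXT_PATTERN.test(to)) {
    throw new Error("format extensions must be lowercase alphanumeric");
  }
  return `${JOB_PREFIX}${from}/${to}`;
}

// Hypothetical helper: returns null for anything that is not a conversion job type
function parseJobType(jobType: string): { from: string; to: string } | null {
  if (!jobType.startsWith(JOB_PREFIX)) return null;
  const parts = jobType.slice(JOB_PREFIX.length).split("/");
  if (parts.length !== 2) return null;
  const [from, to] = parts as [string, string];
  if (!EXT_PATTERN.test(from) || !EXT_PATTERN.test(to)) return null;
  return { from, to };
}
```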

---

### UI Component Pattern

Key components for the conversion UI:

```
ConvertLanding   — grid of format categories (Images, Documents, Data, Video)
FormatPicker     — searchable list of source/target formats
ConvertDropzone  — drag-drop + paste + file picker, file size display
ConvertProgress  — SSE-driven progress bar (reuses existing SSE subscriber hook)
ConvertResult    — download button, preview (image inline, PDF iframe), copy link
FormatBadge      — small pill showing format ext with category color
```

**Drag-drop:** Use the existing file upload pattern from Nexus v1.3 file system. Do not add a new drag-drop library — the existing upload component already handles it.

**Progress:** Use the existing SSE live event subscriber. The conversion job emits `content.job.started` and `content.job.done` events — the same events as other content jobs. The `ConvertProgress` component subscribes to these. Zero new infrastructure needed.

**Format categories for the landing grid:**

| Category | Formats | Icon |
|----------|---------|------|
| Images | PNG, JPEG, WebP, AVIF, SVG, GIF | camera |
| Documents | PDF, DOCX, HTML, Markdown, ODT | file-text |
| Data | CSV, XLSX, JSON, TSV | table |
| Video | MP4, WebM, MKV, MOV, GIF | video |
| Audio | MP3, WAV, OGG, FLAC | music |
| Code | JS, TS, CSS, HTML | code |

---

## Security Pitfalls

### Critical: Path Traversal via Filename

**What goes wrong:** A user uploads a file named `../../etc/passwd.csv` or `../config.json`. If the server uses the original filename to construct the output path, it writes outside the intended temp directory.

**How it happens in conversion specifically:** Conversion tools (pandoc, ffmpeg, LibreOffice) write output to a path the server constructs. If the output path includes any user-supplied component, traversal is trivial.

**CVE context:** CVE-2026-21440 (AdonisJS, CVSS 9.2) is exactly this pattern — `MultipartFile.move()` without filename sanitization allows arbitrary file write.

**Prevention:**

```typescript
import path from "path";
import crypto from "crypto";

// NEVER use the original filename for output paths
function safeOutputPath(tempDir: string, targetExt: string): string {
  const id = crypto.randomUUID();
  // The result stays inside tempDir: id is a UUID (no slashes) and
  // targetExt must come from the registry, never from the request
  return path.join(tempDir, `${id}.${targetExt}`);
}

// NEVER pass user-supplied paths to child_process.spawn
// Always resolve through the StorageService, which controls the output namespace
```

**Absolute rule:** The only user-supplied value that enters the conversion pipeline is the **file content (Buffer)**. File names, paths, and extensions are derived server-side from MIME type detection and the registry.

---

### Critical: MIME Type vs Extension Spoofing

**What goes wrong:** A user uploads a file with extension `.csv` but the actual content is an executable or a PHP file. The server passes it to the pandoc handler expecting text input.

**Prevention:**

1. Validate MIME type with `file-type ^19.x` (reads magic bytes, not extension)
2. Reject files whose detected MIME type does not match the declared source format
3. Set `Content-Disposition: attachment` on all download responses — never `inline` for non-image, non-PDF files

```typescript
import { fileTypeFromBuffer } from "file-type";

const detected = await fileTypeFromBuffer(inputBuffer);
if (detected?.mime !== EXPECTED_MIME[sourceExt]) {
  throw new Error(`MIME mismatch: declared ${sourceExt} but detected ${detected?.mime}`);
}
```

`file-type` recognizes binary formats only. For text-based formats (CSV, JSON, Markdown, SVG) `fileTypeFromBuffer` resolves to `undefined`, so those source formats need a separate rule, such as requiring that nothing is detected and that the buffer decodes as UTF-8 text.

**New package: `file-type ^19.x`**

```bash
pnpm --filter @paperclipai/server add file-type
```

Confidence: HIGH — `file-type` is the standard Node.js magic-bytes library, widely used. v19.x is pure ESM; confirm that server's module resolution handles ESM imports.

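
To make the "magic bytes, not extension" idea concrete, the sketch below checks three well-known signatures using only built-in types. It is an illustration of the technique and not the `file-type` API; `SIGNATURES` and `sniffMime` are hypothetical names, and a real deployment would rely on the library's much larger signature set. It also shows why text formats need their own rule: a CSV buffer matches no signature at all.

```typescript
// Leading-byte signatures for three formats (illustrative subset)
const SIGNATURES: ReadonlyArray<{ mime: string; bytes: number[] }> = [
  { mime: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { mime: "application/zip", bytes: [0x50, 0x4b, 0x03, 0x04] }, // DOCX/XLSX are ZIP containers
];

// Accepts any Uint8Array, which includes Node's Buffer
function sniffMime(buf: Uint8Array): string | undefined {
  for (const { mime, bytes } of SIGNATURES) {
    if (buf.length >= bytes.length && bytes.every((b, i) => buf[i] === b)) {
      return mime;
    }
  }
  return undefined; // text formats such as CSV carry no signature
}
```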

---

### Moderate: Unbound Resource Consumption (DoS via Large Files)

**What goes wrong:** A user uploads a 4GB video file. LibreOffice or ffmpeg consumes all available RAM before the job starts. The server crashes.

**Prevention:**

```typescript
// Per-category upload limits; enforce them in the multipart upload middleware
const MAX_FILE_SIZES: Record<string, number> = {
  image: 50 * 1024 * 1024, // 50 MB
  document: 100 * 1024 * 1024, // 100 MB
  video: 500 * 1024 * 1024, // 500 MB
  audio: 200 * 1024 * 1024, // 200 MB
  data: 20 * 1024 * 1024, // 20 MB (CSV/XLSX)
};
```

Also: bound the resources of child processes spawned for conversion (ffmpeg, pandoc). Node's `child_process.spawn` has no option for setting rlimits such as `RLIMIT_AS`, so a memory cap has to be applied outside Node, for example by launching the tool through a shell that calls `ulimit` first; verify on macOS that the limit is actually enforced. The `timeout` option of `spawn` can additionally bound runtime. Conservative default: 2GB per subprocess.

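
A size table only helps if something checks it before bytes reach a converter. The guard below is a minimal sketch with assumed names (`FormatCategory`, `LIMITS`, `assertWithinLimit`); it expects the upload layer to report the size up front, and it treats a non-finite or negative size as a violation so that a malformed length cannot bypass the check.

```typescript
type FormatCategory = "image" | "document" | "video" | "audio" | "data";

const MB = 1024 * 1024;

// Same limits as the table in this section, typed by category
const LIMITS: Record<FormatCategory, number> = {
  image: 50 * MB,
  document: 100 * MB,
  video: 500 * MB,
  audio: 200 * MB,
  data: 20 * MB,
};

// Hypothetical guard: throws before any conversion work is scheduled
function assertWithinLimit(category: FormatCategory, sizeBytes: number): void {
  const limit = LIMITS[category];
  if (!Number.isFinite(sizeBytes) || sizeBytes < 0 || sizeBytes > limit) {
    throw new RangeError(
      `${category} upload of ${sizeBytes} bytes is outside the allowed range (max ${limit})`
    );
  }
}
```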

---

### Moderate: Temp File Accumulation (Disk Exhaustion)

**What goes wrong:** Conversion jobs fail midway. Temp files in `/tmp` are never cleaned up. The disk fills over days/weeks.

**CVE context:** CVE-2026-3304 (Multer < 2.1.0) — failed requests leave temp files on disk. Multer ≥ 2.1.0 fixes this but only for Multer's own temp files. Conversion intermediates are your responsibility.

**Prevention:**

```typescript
// Always use try/finally to clean up temp files
import fs from "fs/promises";
import os from "os";
import path from "path";

async function withTempDir<T>(fn: (dir: string) => Promise<T>): Promise<T> {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), "nexus-convert-"));
  try {
    return await fn(dir);
  } finally {
    await fs.rm(dir, { recursive: true, force: true }).catch(() => {});
  }
}
```

Additionally: sweep for stale `nexus-convert-*` directories older than 1 hour at server startup and on a periodic timer. A process exit handler is not a reliable place for this, because it does not run after a crash or `SIGKILL`, which is exactly when temp directories get orphaned.

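
A sweep for stale `nexus-convert-*` directories can be written with synchronous `fs` calls, since it runs rarely and touches few entries. The sketch below is an assumption about how such a sweep might look, with a hypothetical `sweepStaleTempDirs` helper; it uses directory mtime as the age signal and takes the root, age threshold, and clock as parameters so the behavior can be tested without waiting an hour.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

const ONE_HOUR_MS = 60 * 60 * 1000;

// Hypothetical sweep: removes "nexus-convert-*" directories under `root` whose
// mtime is older than `maxAgeMs`. Returns the names of the removed directories.
function sweepStaleTempDirs(
  root: string = os.tmpdir(),
  maxAgeMs: number = ONE_HOUR_MS,
  now: number = Date.now()
): string[] {
  const removed: string[] = [];
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    if (!entry.isDirectory() || !entry.name.startsWith("nexus-convert-")) continue;
    const full = path.join(root, entry.name);
    if (now - fs.statSync(full).mtimeMs > maxAgeMs) {
      fs.rmSync(full, { recursive: true, force: true });
      removed.push(entry.name);
    }
  }
  return removed;
}
```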

---

### Minor: Arbitrary Code Execution via Conversion Tool Arguments

**What goes wrong:** Conversion parameters from the API request body are passed directly as CLI arguments to pandoc/ffmpeg. An attacker sends `{"extraArgs": ["--lua-filter=/etc/passwd"]}`.

**Prevention:** Never expose raw CLI arg arrays to the API. The registry pre-defines all argument templates. The only variable substitution is for safe values (bitrate as a number, output format as an enum from the registry).

```typescript
// BAD: never do this
const args = userRequest.extraArgs as string[];
spawn("pandoc", [input, ...args, "-o", output]);

// GOOD: use pre-defined templates from the registry
const route = getConverter(from, to);
await route.handler?.(inputPath, outputPath, {}); // handler owns its own args
```
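
The "safe values" substitution can be illustrated with a bitrate option. In this sketch the API accepts a typed number, validates it against an allowlist, and only then renders it into an argument string. The names `AudioOptions`, `ALLOWED_BITRATES_KBPS`, and `buildMp3Args` are hypothetical, and the allowlist values are an assumption chosen for illustration.

```typescript
// Hypothetical option schema: typed values only, never raw CLI strings
interface AudioOptions {
  bitrateKbps?: number;
}

const ALLOWED_BITRATES_KBPS: readonly number[] = [96, 128, 192, 256, 320];

// Builds ffmpeg arguments from validated values; anything off the allowlist is rejected
function buildMp3Args(opts: AudioOptions): string[] {
  const bitrate = opts.bitrateKbps ?? 192;
  if (!ALLOWED_BITRATES_KBPS.includes(bitrate)) {
    throw new RangeError(`unsupported bitrate: ${String(bitrate)}`);
  }
  return ["-c:a", "libmp3lame", "-b:a", `${bitrate}k`];
}
```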

---

## Performance Pitfalls

### Anti-Pattern: Spawning One Process Per Request

**What goes wrong:** Each conversion request immediately spawns a new ffmpeg/LibreOffice/pandoc subprocess. Under concurrent load, 20 requests spawn 20 ffmpeg processes simultaneously, exhausting CPU and memory.

**Mitigation:** Queue conversion jobs through `contentJobService` (already the pattern). The job queue naturally serializes heavy jobs. For LibreOffice specifically: enforce single-concurrency in the LibreOffice adapter with a simple in-memory semaphore:

```typescript
// `convertAsync` is assumed to be the promisified `convert` from `libreoffice-convert`.
let libreofficeRunning = false;
const libreofficeQueue: Array<() => void> = [];

async function libreofficeConvertSerialized(buf: Buffer, ext: string): Promise<Buffer> {
  // Wait until no other conversion is in flight; re-check after every wake-up.
  while (libreofficeRunning) {
    await new Promise<void>(resolve => libreofficeQueue.push(resolve));
  }
  libreofficeRunning = true;
  try {
    return await convertAsync(buf, ext, undefined);
  } finally {
    libreofficeRunning = false;
    libreofficeQueue.shift()?.(); // wake the next waiter, if any
  }
}
```

For ffmpeg: allow up to 3 concurrent processes (M4 has 10 cores; each ffmpeg uses ~2-3 threads for simple conversions).
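
One way to express the ffmpeg limit is a small counting semaphore. This is a sketch under assumptions: the `Semaphore` class and the `ffmpegSlots` instance are hypothetical names, and the ffmpeg wrapper itself is left out:

```typescript
// Hypothetical counting semaphore: at most `limit` tasks run at the same time.
class Semaphore {
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(limit: number) {
    this.available = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.available > 0) {
      this.available--; // take a free slot immediately
    } else {
      // No free slot: wait until a finishing task hands its slot over.
      await new Promise<void>(resolve => this.waiters.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) {
        next(); // pass the slot directly to the next waiter
      } else {
        this.available++; // nobody waiting, so return the slot to the pool
      }
    }
  }
}

// Up to 3 concurrent ffmpeg processes, as suggested above.
const ffmpegSlots = new Semaphore(3);
```

A caller would wrap each conversion, for example `ffmpegSlots.run(() => runFfmpeg(args))`, where `runFfmpeg` stands for whatever thin spawn wrapper the server uses. Handing the slot straight to the next waiter keeps the order first-in, first-out.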

---

### Anti-Pattern: Loading Entire File into Memory Before Converting

**What goes wrong:** A 500MB MP4 file is read into a `Buffer` before being passed to ffmpeg. Node.js holds all 500MB in process memory (Buffer contents are allocated outside the V8 heap, but they still count toward the process's resident memory). Under that memory pressure the server stalls or becomes unresponsive.

**Mitigation:** For files larger than 10MB, write the upload directly to a temp file path and pass the path to the converter. Use the `StorageService.getStream(objectKey)` method to pipe large stored files to disk before conversion.
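
The streaming step can be sketched as follows, assuming `StorageService.getStream(objectKey)` returns a Node.js readable stream (its exact signature is not shown in this document). The helper name `spoolToDisk` is hypothetical, and `dir` is expected to be a per-job temp directory:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import { randomUUID } from "node:crypto";
import { pipeline } from "node:stream/promises";

// Hypothetical helper: copies a readable stream into a randomly named file
// inside `dir` and returns the file path. Data flows chunk by chunk, so the
// whole file is never held in memory at once.
async function spoolToDisk(source: NodeJS.ReadableStream, dir: string): Promise<string> {
  const dest = path.join(dir, `${randomUUID()}.bin`);
  // pipeline() applies backpressure and destroys both streams on error.
  await pipeline(source, fs.createWriteStream(dest));
  return dest;
}
```

The returned path is what gets passed to the converter, for example as the `-i` argument of ffmpeg.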

---

### Anti-Pattern: Returning Converted File as HTTP Response Body

**What goes wrong:** `POST /api/convert/jobs` holds the connection open for 30+ seconds, then streams back a 200MB video file as the HTTP response. The upstream proxy (nginx) times out at 30s.

**Mitigation:** This is the same pattern solved by `contentJobService`. Always: create a job, return `202` with a `jobId`, convert asynchronously, store the result in `StorageService`, emit the SSE done event, and let the client fetch the download URL. The download URL uses the existing signed URL / direct serve pattern from `assetService`.
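
The flow above can be illustrated with a framework-free sketch. Everything here is a stand-in: the in-memory `jobs` map and the `events` emitter replace `contentJobService` and the SSE channel, whose real APIs are not shown in this document, and `submitConversion` is a hypothetical name:

```typescript
import { EventEmitter } from "node:events";
import { randomUUID } from "node:crypto";

type JobStatus = "queued" | "running" | "done" | "failed";

interface ConversionJob {
  id: string;
  status: JobStatus;
  resultKey?: string; // storage key of the converted file
  error?: string;
}

const jobs = new Map<string, ConversionJob>(); // stand-in for the job store
const events = new EventEmitter(); // stand-in for the SSE channel

// Registers a job and returns immediately; the conversion itself runs later.
// `convert` is expected to store the output and resolve with its storage key.
function submitConversion(convert: () => Promise<string>): { status: 202; body: { jobId: string } } {
  const job: ConversionJob = { id: randomUUID(), status: "queued" };
  jobs.set(job.id, job);
  setImmediate(async () => {
    job.status = "running";
    try {
      job.resultKey = await convert();
      job.status = "done";
    } catch (err) {
      job.status = "failed";
      job.error = err instanceof Error ? err.message : String(err);
    }
    // Assumption: failures are reported on the same event, with status "failed".
    events.emit("content.job.done", job);
  });
  return { status: 202, body: { jobId: job.id } };
}
```

The HTTP handler only serializes the returned object, so the request finishes in milliseconds regardless of how long the conversion takes.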

---

## What NOT to Build

| Avoid | Why | Use Instead |
|-------|-----|-------------|
| ImageMagick CLI wrapper | `sharp` covers all needed raster formats at 4-5× speed | `sharp ^0.34.5` (already installed) |
| `@imagemagick/magick-wasm` | ~0.3× native speed; complex install | `sharp` (libvips-backed) |
| `fluent-ffmpeg ^2.1.3` | Archived May 2025; full fluent API unnecessary | 20-line `child_process.spawn` wrapper |
| `node-pandoc ^0.2.7` | Last updated 2021; adds dependency for a 5-line child_process call | Thin wrapper using `child_process.spawn` |
| `json2csv` (original) | v6 alpha, 3 years stale | `csv-stringify` (same ecosystem as `csv-parse`) |
| `pdf-lib` | Pure-JS PDF assembly; no HTML rendering; wrong for HTML→PDF | `playwright-chromium` (already decided) |
| `jsPDF` | CVE-2025-68428 path traversal in versions <4.0.0; even fixed versions are JS-first PDF generation with weak CSS support | `playwright-chromium` |
| Arbitrary `extraArgs` in API | Arbitrary CLI args = code execution via crafted filenames/flags | Pre-defined handler templates in registry |
| Shared temp files between jobs | Race conditions, cleanup failures | `withTempDir()` scoped to each job |
| AI for binary→binary conversion | LLMs cannot produce binary output faithfully | Always use direct tools for binary format pairs |
| Polling loop for job status | Creates unnecessary load; SSE already available | Subscribe to existing `content.job.done` SSE event |

---

## Installation Summary

**New packages to add (v1.7 format conversion):**

```bash
# Document conversion
pnpm --filter @paperclipai/server add mammoth                    # ^1.12.0 — DOCX→HTML (no system dep)
pnpm --filter @paperclipai/server add libreoffice-convert        # ^1.8.1 — Office→PDF (requires LibreOffice)
pnpm --filter @paperclipai/server add xlsx                       # ^0.20.x — SheetJS spreadsheet R/W (confirm the registry serves 0.20.x; SheetJS has shipped newer releases from its own CDN)

# CSV/data
pnpm --filter @paperclipai/server add csv-parse                  # ^6.2.1 — CSV streaming parser
pnpm --filter @paperclipai/server add csv-stringify              # ^6.x — JSON→CSV generator

# Code formatting
pnpm --filter @paperclipai/server add prettier                   # ^3.x — code formatting (JS/TS/CSS/HTML/MD)
pnpm --filter @paperclipai/server add json-schema-to-typescript  # ^15.x — schema→TS types

# Security: MIME type validation
pnpm --filter @paperclipai/server add file-type                  # ^19.x — magic byte MIME detection (ESM)

# System dependencies (install once on Mac Mini)
brew install pandoc               # markdown↔docx/html/rst conversions
brew install --cask libreoffice   # office→pdf; optional, degrade gracefully if absent
```

**No new packages needed for:**
- Image conversion → `sharp` already installed
- Audio/video → `ffmpeg-static` already installed (write thin wrapper)
- SVG→PNG → `sharp` (basic) or `@resvg/resvg-js` (already in STACK.md for satori SVGs)
- HTML→PDF → `playwright-chromium` already in STACK.md

---

## Phase-Specific Warnings

| Phase Topic | Likely Pitfall | Mitigation |
|-------------|---------------|------------|
| Format registry setup | Registering converters before system dep checks → silent `"direct"` entries that fail at runtime | Check `which pandoc`, `which soffice` at startup; register as `"unavailable"` if absent |
| LibreOffice integration | LibreOffice spawns a UNO bridge on first call; second call within the same socket fails | Serialize all LibreOffice calls with the semaphore pattern above; never concurrent |
| File upload security | User-controlled filenames in temp paths | Use `crypto.randomUUID()` for all output paths; never use original filename |
| `file-type ^19.x` | Pure ESM package in a CJS/ESM mixed server | Use dynamic `await import("file-type")` or configure server's tsconfig for ESM interop |
| Large video files | Buffer entire file into memory | Pipe uploads directly to disk via streaming; pass file path (not buffer) to ffmpeg |
| AI-bridged output validation | LLM returns text with preamble before the converted content | Enforce `OUTPUT_SPEC` in prompt; strip leading non-content lines; validate with format parser |
| `/convert/:from/:to` route | Collides with existing routes if Express route order is wrong | Mount conversion routes before wildcard routes; use `/api/convert/` prefix throughout |
| `xlsx` (SheetJS) license | Community Edition license changed in recent versions | Check npm package license field at install time; log at startup if non-OSI |
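
For the format registry row, the startup check could look roughly like the sketch below. Instead of shelling out to `which`, it scans the `PATH` entries directly, which is a swapped-in equivalent and not necessarily how the server does it. The names `isOnPath` and `probeSystemDeps` are hypothetical, and the sketch targets POSIX systems such as the Mac Mini (no Windows extension handling):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type Availability = "direct" | "unavailable";

// True if an executable file named `binary` exists in any PATH directory.
function isOnPath(binary: string, pathEnv: string = process.env.PATH ?? ""): boolean {
  return pathEnv
    .split(path.delimiter)
    .filter(Boolean)
    .some(dir => {
      try {
        fs.accessSync(path.join(dir, binary), fs.constants.X_OK);
        return true;
      } catch {
        return false;
      }
    });
}

// Maps each system binary to the registry status it should be registered with.
function probeSystemDeps(binaries: readonly string[], pathEnv?: string): Record<string, Availability> {
  const result: Record<string, Availability> = {};
  for (const bin of binaries) {
    result[bin] = isOnPath(bin, pathEnv) ? "direct" : "unavailable";
  }
  return result;
}
```

At startup this might be called as `probeSystemDeps(["pandoc", "soffice"])`, with the result deciding whether each dependent route is registered as `"direct"` or `"unavailable"`.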

---

## Sources

- [fluent-ffmpeg GitHub (archived May 2025)](https://github.com/fluent-ffmpeg/node-fluent-ffmpeg/issues/1324) — archival notice
- [fluent-ffmpeg npm](https://www.npmjs.com/package/fluent-ffmpeg) — v2.1.3 confirmed
- [sharp official docs](https://sharp.pixelplumbing.com/) — SVG support via librsvg confirmed
- [sharp GitHub](https://github.com/lovell/sharp) — v0.34.5 confirmed; 4-5× faster than ImageMagick
- [mammoth npm](https://www.npmjs.com/package/mammoth) — v1.12.0, last published 20 days ago
- [mammoth GitHub](https://github.com/mwilliamson/mammoth.js/) — one-way DOCX→HTML converter
- [pandoc official](https://pandoc.org/) — universal markup converter, Haskell binary
- [node-pandoc npm](https://www.npmjs.com/package/node-pandoc) — thin child_process wrapper, v0.2.7, last updated 2021
- [libreoffice-convert npm](https://www.npmjs.com/package/libreoffice-convert) — v1.8.1, last published ~February 2026
- [SheetJS Community Edition docs](https://docs.sheetjs.com/) — spreadsheet format coverage
- [SheetJS vs ExcelJS comparison 2026](https://www.pkgpulse.com/blog/sheetjs-vs-exceljs-vs-node-xlsx-excel-files-node-2026) — SheetJS 7.8M weekly downloads (MEDIUM confidence — single source)
- [csv-parse npm](https://www.npmjs.com/package/csv-parse) — v6.2.1, last published 4 days ago
- [json-2-csv npm](https://www.npmjs.com/package/json-2-csv) — v5.5.10, maintained alternative to archived json2csv
- [Prettier API docs](https://prettier.io/docs/api) — programmatic `format()` function confirmed
- [json-schema-to-typescript GitHub](https://github.com/bcherny/json-schema-to-typescript) — JSON Schema→TypeScript types
- [ConvertX GitHub (C4illin/ConvertX)](https://github.com/C4illin/ConvertX) — reference architecture: tool-per-category dispatch with 1000+ format pairs
- [Worker Threads in Node.js 2026 (DEV Community)](https://dev.to/young_gao/worker-threads-in-nodejs-when-and-how-to-use-them-2jdm) — pooling recommendation for CPU-bound tasks
- [CVE-2025-68428: jsPDF path traversal](https://www.endorlabs.com/learn/cve-2025-68428-critical-path-traversal-in-jspdf) — avoid jsPDF for user-input file paths
- [CVE-2026-21440: AdonisJS bodyparser path traversal (CVSS 9.2)](https://thehackernews.com/2026/01/critical-adonisjs-bodyparser-flaw-cvss.html) — file upload filename sanitization
- [CVE-2026-3304: Multer temp file cleanup DoS](https://cvereports.com/reports/CVE-2026-3304) — Multer <2.1.0 does not clean temp files on async filter error
- [Node.js path traversal prevention](https://www.nodejs-security.com/blog/secure-coding-practices-nodejs-path-traversal-vulnerabilities) — path.normalize alone is insufficient
- [Hybrid AI deterministic/LLM boundary (New Math Data)](https://newmathdata.com/blog/hybrid-ai-deterministic-code-llm-reasoning-systems/) — deterministic tools for deterministic tasks, LLM for semantic tasks

---

*Format conversion research for: Nexus v1.7 Content Generation*
*Researched: 2026-04-04*
*Scope: Supplemental — format conversion ecosystem only. Does not supersede STACK.md or ARCHITECTURE.md.*