ADR 005: File Ingestion Parsers

Context

RAGler originally supported only web and manual text ingestion. Users need to upload local documents (PDF, DOCX, TXT, Markdown, CSV) directly.

Decision

Use pdf-parse for PDF and mammoth for DOCX extraction. Plain text formats (TXT, MD, CSV) are read as UTF-8. Each parser implements a shared FileParser interface and is selected by a resolver based on file extension.
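A minimal sketch of how the shared interface and the extension-based resolver could fit together. Only the name FileParser comes from this ADR; the method shapes and the names PlainTextParser, LibraryParser, and ParserResolver are assumptions made for illustration. To keep the sketch self-contained, the PDF and DOCX libraries are not imported here: LibraryParser takes any "bytes in, text out" function, which is where calls into pdf-parse or mammoth would be wrapped.

```typescript
// Shared contract for all parsers. The name is from the ADR; the shape is assumed.
interface FileParser {
  readonly extensions: readonly string[];
  parse(data: Uint8Array): Promise<string>;
}

// Plain text formats (TXT, MD, CSV) are decoded as UTF-8.
class PlainTextParser implements FileParser {
  readonly extensions = ["txt", "md", "csv"];

  async parse(data: Uint8Array): Promise<string> {
    return new TextDecoder("utf-8").decode(data);
  }
}

// Hypothetical adapter around a third-party extraction function.
// A pdf-parse or mammoth call would be supplied as `extract`.
class LibraryParser implements FileParser {
  readonly extensions: readonly string[];
  private readonly extract: (data: Uint8Array) => Promise<string>;

  constructor(
    extensions: readonly string[],
    extract: (data: Uint8Array) => Promise<string>,
  ) {
    this.extensions = extensions;
    this.extract = extract;
  }

  parse(data: Uint8Array): Promise<string> {
    return this.extract(data);
  }
}

// Picks a parser from the file extension (case-insensitive).
class ParserResolver {
  private readonly byExtension = new Map<string, FileParser>();

  constructor(parsers: readonly FileParser[]) {
    for (const parser of parsers) {
      for (const ext of parser.extensions) {
        this.byExtension.set(ext.toLowerCase(), parser);
      }
    }
  }

  resolve(filename: string): FileParser {
    const dot = filename.lastIndexOf(".");
    const ext = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : "";
    const parser = this.byExtension.get(ext);
    if (!parser) {
      throw new Error(`Unsupported file type: ${filename}`);
    }
    return parser;
  }
}
```

With this shape, the upload handler would only call `resolver.resolve(filename).parse(bytes)` and never needs to know which library handled the file.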

Consequences

Lean dependencies (~50KB total) compared to heavyweight alternatives like Docling.
Easy to extend by adding new parsers implementing FileParser.
Limited to text extraction; no OCR or complex layout analysis, so scanned (image-only) PDFs yield little or no text.
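To illustrate the extensibility point, here is a sketch of what adding one more format might involve, assuming FileParser exposes a list of extensions and an async parse method. JsonParser, register, and the registry map are hypothetical names, and JSON is not a format this ADR commits to supporting.

```typescript
// Assumed shape of the shared interface (only the name is from the ADR).
interface FileParser {
  readonly extensions: readonly string[];
  parse(data: Uint8Array): Promise<string>;
}

// Hypothetical new parser: decodes UTF-8, validates the JSON,
// and returns it pretty-printed with two-space indentation.
class JsonParser implements FileParser {
  readonly extensions = ["json"];

  async parse(data: Uint8Array): Promise<string> {
    const text = new TextDecoder("utf-8").decode(data);
    return JSON.stringify(JSON.parse(text), null, 2);
  }
}

// Registers a parser under each of its extensions (lowercased).
function register(registry: Map<string, FileParser>, parser: FileParser): void {
  for (const ext of parser.extensions) {
    registry.set(ext.toLowerCase(), parser);
  }
}

const registry = new Map<string, FileParser>();
register(registry, new JsonParser());
```

Under these assumptions, a new format costs one class and one registration call; existing parsers and the resolver logic stay untouched.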

Alternatives considered

Docling/Unstructured.io — too heavy for initial scope.
Tika — requires JVM, complicates deployment.
