Skip to main content

ADR 005: File Ingestion Parsers

Context

RAGler originally supported only web and manual text ingestion. Users need to upload local documents (PDF, DOCX, TXT, Markdown, CSV) directly.

Decision

Use pdf-parse for PDF and mammoth for DOCX extraction. Plain text formats (TXT, MD, CSV) are read as UTF-8. Each parser implements a shared FileParser interface and is selected by a resolver based on file extension.

Consequences

  • Lean dependencies (~50KB total) compared to heavyweight alternatives like Docling.
  • Easy to extend by adding new parsers implementing FileParser.
  • Limited to text extraction; no OCR or complex layout analysis.

Alternatives considered

  • Docling/Unstructured.io — too heavy for initial scope.
  • Tika — requires JVM, complicates deployment.