Supported Formats

RAG Engine ingests the document and image formats listed below. Text is extracted automatically and indexed for semantic search.

Document Formats

Format      Extensions      Parser    Notes
PDF         .pdf            Built-in  Page-by-page text extraction. Scanned PDFs not supported (use image upload with OCR instead).
DOCX        .docx           Built-in  Paragraph extraction. Preserves headings and list structure. Does not extract images from DOCX.
HTML        .html, .htm     Built-in  Strips HTML tags, extracts text content. Scripts and styles are removed.
Markdown    .md, .markdown  Built-in  Renders to text. Code blocks, links, and formatting are preserved as plain text.
CSV         .csv            Built-in  Row-by-row parsing with column headers. Each row becomes searchable text.
JSON        .json           Built-in  Key-value flattening. Nested objects are serialized into readable text.
Plain Text  .txt            Built-in  Direct text ingestion. No conversion needed.
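
To make the CSV and JSON notes above more concrete, here is a minimal Python sketch of what "row-by-row parsing with column headers" and "key-value flattening" could look like. It is an illustration only, not RAG Engine's actual parser: the function names, the dotted-path notation, and the "header: value" output layout are all assumptions.

```python
import csv
import io
import json


def flatten_json(value, prefix=""):
    """Yield (path, scalar) pairs for every leaf of a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten_json(child, path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten_json(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def json_to_text(raw):
    """Serialize a JSON document into one 'path: value' line per leaf.

    The 'path: value' layout is a hypothetical choice for this sketch.
    """
    pairs = flatten_json(json.loads(raw))
    return "\n".join(f"{path}: {val}" for path, val in pairs)


def csv_to_rows(raw):
    """Turn each CSV data row into a 'header: value' string.

    The first line of the input is treated as the column headers.
    """
    reader = csv.DictReader(io.StringIO(raw))
    return ["; ".join(f"{col}: {val}" for col, val in row.items()) for row in reader]


if __name__ == "__main__":
    print(json_to_text('{"user": {"name": "Ada", "tags": ["a", "b"]}}'))
    print(csv_to_rows("name,role\nAda,engineer\n"))
```

Keeping the key path or column header next to each value means a query such as "Ada role" can match the row text directly, which is the likely motivation for this kind of flattening.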

Image Formats (OCR)

Images are processed using OCR (optical character recognition). Text detected in the image is extracted and indexed like any other document.

Format  Extensions   Notes
PNG     .png         Lossless images, screenshots, diagrams with text
JPEG    .jpg, .jpeg  Photos of documents, scanned pages, whiteboard captures
TIFF    .tiff, .tif  High-quality scans, multi-page support
BMP     .bmp         Bitmap images

Tip: OCR works best on clear, high-contrast images with readable text. Handwritten text, heavily stylized fonts, or very low-resolution images may produce incomplete or inaccurate results.

Limits

Limit                          Value
Maximum file size (upload)     100 MB
Maximum text content (ingest)  5 MB
Top-k search results           1 – 100
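
A client can check these limits before sending a request. The Python sketch below shows one way to do that. It makes several assumptions that this page does not confirm: that "MB" means 1024 * 1024 bytes, that text size is measured in UTF-8 bytes, and that both ends of the top-k range are inclusive. The function and constant names are hypothetical.

```python
# Assumption: 1 MB = 1024 * 1024 bytes. The documentation does not say
# whether decimal or binary megabytes are meant.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024
MAX_TEXT_BYTES = 5 * 1024 * 1024
MIN_TOP_K = 1
MAX_TOP_K = 100


def find_limit_violations(file_size_bytes=None, text=None, top_k=None):
    """Return a list of human-readable problems; an empty list means OK.

    Each argument is optional, so the same helper can be used for
    uploads, text ingestion, and search requests.
    """
    problems = []
    if file_size_bytes is not None and file_size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {file_size_bytes} bytes, limit is {MAX_UPLOAD_BYTES}")
    if text is not None:
        # Assumption: the limit applies to the UTF-8 encoded size.
        text_bytes = len(text.encode("utf-8"))
        if text_bytes > MAX_TEXT_BYTES:
            problems.append(f"text is {text_bytes} bytes, limit is {MAX_TEXT_BYTES}")
    if top_k is not None and not (MIN_TOP_K <= top_k <= MAX_TOP_K):
        problems.append(f"top_k is {top_k}, allowed range is {MIN_TOP_K} to {MAX_TOP_K}")
    return problems


if __name__ == "__main__":
    print(find_limit_violations(file_size_bytes=2_000_000, text="hello", top_k=10))
    print(find_limit_violations(top_k=500))
```

Returning a list rather than raising on the first problem lets a caller report every violation at once.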

Automatic Format Detection

RAG Engine detects the file format from the filename extension. You do not need to specify the MIME type. If a file has an unrecognized extension, it is treated as plain text.
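
The detection rule described above can be sketched in a few lines of Python. This is an illustration rather than RAG Engine's implementation: the format labels and the `detect_format` name are hypothetical, and matching extensions case-insensitively is an assumption this page does not state.

```python
from pathlib import Path

# Extensions taken from the tables on this page. The labels on the
# right-hand side are hypothetical names chosen for this sketch.
EXTENSION_TO_FORMAT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".markdown": "markdown",
    ".csv": "csv",
    ".json": "json",
    ".txt": "text",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".tiff": "image",
    ".tif": "image",
    ".bmp": "image",
}


def detect_format(filename):
    """Map a filename to a format label using only its final extension.

    Unrecognized or missing extensions fall back to plain text.
    Assumption: matching ignores case, so 'REPORT.PDF' counts as PDF.
    """
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TO_FORMAT.get(suffix, "text")


if __name__ == "__main__":
    for name in ["report.pdf", "scan.TIF", "notes.log", "README"]:
        print(name, "->", detect_format(name))
```

One consequence of the plain-text fallback is that a binary file with an unknown extension would be read as text, so it is worth renaming files to a supported extension before upload.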