Supported Formats

RAG Engine ingests the document and image formats listed below. Text is extracted automatically and indexed for semantic search.

Document Formats

Format	Extensions	Parser	Notes
PDF	`.pdf`	Built-in	Page-by-page text extraction. Scanned PDFs are not supported (use image upload with OCR instead).
DOCX	`.docx`	Built-in	Paragraph extraction. Preserves headings and list structure. Images embedded in the document are not extracted.
HTML	`.html` `.htm`	Built-in	Strips HTML tags, extracts text content. Scripts and styles are removed.
Markdown	`.md` `.markdown`	Built-in	Renders to text. Code blocks, links, and formatting are preserved as plain text.
CSV	`.csv`	Built-in	Row-by-row parsing with column headers. Each row becomes searchable text.
JSON	`.json`	Built-in	Key-value flattening. Nested objects are serialized into readable text.
Plain Text	`.txt`	Built-in	Direct text ingestion. No conversion needed.
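
To make the CSV and JSON notes in the table concrete, here is a minimal sketch of how a row or a nested object could be turned into searchable text. The function names (`csv_rows_to_text`, `flatten_json`) and the exact output layout (`key: value` pairs, dotted paths, `[index]` for list items) are assumptions for illustration only; the text that RAG Engine's built-in parsers actually produce may look different.

```python
import csv
import io


def csv_rows_to_text(csv_text: str) -> list[str]:
    """Turn each CSV row into one line of 'header: value' pairs.

    Illustrative only: the output layout is an assumption, not the
    documented behavior of the built-in CSV parser.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [", ".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


def flatten_json(value, prefix: str = "") -> list[str]:
    """Flatten nested JSON data into 'path: value' lines.

    Illustrative only: dotted paths and [index] markers are assumptions.
    """
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            lines.extend(flatten_json(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            lines.extend(flatten_json(child, f"{prefix}[{index}]"))
    else:
        # Scalar leaf: emit the accumulated path and the value.
        lines.append(f"{prefix}: {value}")
    return lines
```

With this sketch, a CSV row `Ada,engineer` under the headers `name,role` becomes the line `name: Ada, role: engineer`, so a search for "engineer" can match the row together with its column context.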

Image Formats (OCR)

Images are processed using OCR (optical character recognition). Text detected in the image is extracted and indexed like any other document.

Format	Extensions	Notes
PNG	`.png`	Lossless images, screenshots, diagrams with text
JPEG	`.jpg` `.jpeg`	Photos of documents, scanned pages, whiteboard captures
TIFF	`.tiff` `.tif`	High-quality scans, multi-page support
BMP	`.bmp`	Bitmap images

Tip: OCR works best on clear, high-contrast images with readable text. Handwritten text, heavily stylized fonts, or very low-resolution images may produce incomplete or inaccurate results.

Limits

Limit	Value
Maximum file size (upload)	100 MB
Maximum text content (ingest)	5 MB
Top-k search results	1 – 100
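
A client can check these limits before sending a request. The sketch below is one way to do that, under two assumptions that this page does not confirm: that 1 MB means 1024 * 1024 bytes, and that text content is measured as UTF-8 encoded bytes. The constant and function names are hypothetical and not part of any RAG Engine client library.

```python
# Assumption: 1 MB = 1024 * 1024 bytes. The service may count in decimal MB.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024
MAX_TEXT_BYTES = 5 * 1024 * 1024
MIN_TOP_K, MAX_TOP_K = 1, 100


def upload_size_ok(size_bytes: int) -> bool:
    """Return True if a file of this size fits the upload limit."""
    return 0 <= size_bytes <= MAX_UPLOAD_BYTES


def text_size_ok(text: str) -> bool:
    """Return True if the text fits the ingest limit.

    Assumption: size is measured on the UTF-8 encoding of the text.
    """
    return len(text.encode("utf-8")) <= MAX_TEXT_BYTES


def clamp_top_k(k: int) -> int:
    """Force a requested top-k value into the allowed 1 to 100 range."""
    return max(MIN_TOP_K, min(MAX_TOP_K, k))
```

Clamping is a design choice for this sketch; a stricter client could raise an error for out-of-range values instead of silently adjusting them.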

Automatic Format Detection

RAG Engine detects the file format from the filename extension. You do not need to specify the MIME type. If a file has an unrecognized extension, it is treated as plain text.
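
The detection rule described above can be sketched as a lookup on the filename extension with a plain-text fallback. The mapping is built from the tables on this page; the format labels, the `detect_format` name, and the case-insensitive comparison are assumptions made for illustration, not the engine's actual implementation.

```python
from pathlib import Path

# Hypothetical labels; extensions are taken from the tables on this page.
EXTENSION_TO_FORMAT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".markdown": "markdown",
    ".csv": "csv",
    ".json": "json",
    ".txt": "text",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".tiff": "image",
    ".tif": "image",
    ".bmp": "image",
}


def detect_format(filename: str) -> str:
    """Return a format label for the file, falling back to plain text.

    Assumption: extensions are matched case-insensitively.
    """
    suffix = Path(filename).suffix.lower()
    # Unrecognized or missing extensions are treated as plain text.
    return EXTENSION_TO_FORMAT.get(suffix, "text")
```

Because only the final extension is examined, a name such as `archive.tar.gz` falls through to plain text in this sketch, which matches the fallback rule stated above.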
