Supported Formats
RAG Engine can ingest a wide range of document formats. Text is extracted automatically and indexed for semantic search.
Document Formats
| Format | Extensions | Parser | Notes |
|---|---|---|---|
| PDF | .pdf | Built-in | Page-by-page text extraction. Scanned PDFs not supported (use image upload with OCR instead). |
| DOCX | .docx | Built-in | Paragraph extraction. Preserves headings and list structure. Embedded images are not extracted. |
| HTML | .html .htm | Built-in | Strips HTML tags, extracts text content. Scripts and styles are removed. |
| Markdown | .md .markdown | Built-in | Renders to text. Code blocks, links, and formatting are preserved as plain text. |
| CSV | .csv | Built-in | Row-by-row parsing with column headers. Each row becomes searchable text. |
| JSON | .json | Built-in | Key-value flattening. Nested objects are serialized into readable text. |
| Plain Text | .txt | Built-in | Direct text ingestion. No conversion needed. |
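To make the CSV and JSON rows above more concrete, here is a minimal sketch of what "row-by-row parsing with column headers" and "key-value flattening" could look like. It is not RAG Engine's actual parser: the function names `csv_rows_to_text` and `flatten_json`, the `header: value` output style, and the dotted-path notation for nested keys are all assumptions made for illustration.

```python
import csv
import io
import json


def csv_rows_to_text(csv_text: str) -> list[str]:
    """Turn each CSV row into one line of 'header: value' pairs (hypothetical format)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [", ".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


def flatten_json(value, prefix: str = "") -> list[str]:
    """Flatten nested JSON into 'path: value' lines (hypothetical format)."""
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else str(key)
            lines.extend(flatten_json(child, child_prefix))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            lines.extend(flatten_json(child, f"{prefix}[{index}]"))
    else:
        # Leaf value: emit it with its full path, or bare if there is no path.
        lines.append(f"{prefix}: {value}" if prefix else str(value))
    return lines


if __name__ == "__main__":
    print(csv_rows_to_text("name,role\nAda,engineer\n"))
    print(flatten_json(json.loads('{"user": {"name": "Ada", "tags": ["a", "b"]}}')))
```

Running a file through a sketch like this before upload can give a rough idea of the text that would become searchable, although the engine's real output may be formatted differently.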
Image Formats (OCR)
Images are processed using OCR (optical character recognition). Text detected in the image is extracted and indexed like any other document.
| Format | Extensions | Notes |
|---|---|---|
| PNG | .png | Lossless images, screenshots, diagrams with text |
| JPEG | .jpg .jpeg | Photos of documents, scanned pages, whiteboard captures |
| TIFF | .tiff .tif | High-quality scans, multi-page support |
| BMP | .bmp | Bitmap images |
Tip: OCR works best on clear, high-contrast images with readable text. Handwritten text, heavily stylized fonts, or very low-resolution images may produce incomplete or inaccurate results.
Limits
| Limit | Value |
|---|---|
| Maximum file size (upload) | 100 MB |
| Maximum text content (ingest) | 5 MB |
| Top-k search results | 1 – 100 |
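A client can check these limits before sending a request. The sketch below is one way to do that, under stated assumptions: the helper `check_limits` is hypothetical, 1 MB is taken to mean 1024 × 1024 bytes, and the text limit is assumed to apply to the UTF-8 encoded size. The documentation above does not specify either point, so adjust the constants if the service counts differently.

```python
# Assumption: 1 MB = 1024 * 1024 bytes; the table above does not say which unit is used.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024
MAX_TEXT_BYTES = 5 * 1024 * 1024
TOP_K_MIN, TOP_K_MAX = 1, 100


def check_limits(file_size_bytes: int, text: str, top_k: int) -> list[str]:
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    if file_size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 100 MB upload limit")
    # Assumption: the text limit is measured on the UTF-8 encoded size.
    if len(text.encode("utf-8")) > MAX_TEXT_BYTES:
        problems.append("text content exceeds the 5 MB ingest limit")
    if not TOP_K_MIN <= top_k <= TOP_K_MAX:
        problems.append("top-k must be between 1 and 100")
    return problems
```

An empty list means the request is within all three limits as interpreted here.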
Automatic Format Detection
RAG Engine detects the file format from the filename extension. You do not need to specify the MIME type. If a file has an unrecognized extension, it is treated as plain text.
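The extension-based detection described above can be sketched as a lookup table with a plain-text fallback. This is an illustration, not the engine's implementation: the function `detect_format`, the format labels, and the case-insensitive matching of extensions are assumptions. The extensions themselves come from the tables on this page.

```python
from pathlib import Path

# Extensions listed in the tables on this page; the label strings are hypothetical.
EXTENSION_FORMATS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".markdown": "markdown",
    ".csv": "csv",
    ".json": "json",
    ".txt": "text",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".tiff": "image",
    ".tif": "image",
    ".bmp": "image",
}


def detect_format(filename: str) -> str:
    """Map a filename to a format label using only its final extension."""
    # Assumption: matching is case-insensitive, so "REPORT.PDF" is treated as PDF.
    extension = Path(filename).suffix.lower()
    # Unrecognized or missing extensions fall back to plain text.
    return EXTENSION_FORMATS.get(extension, "text")
```

Note that only the last suffix is inspected in this sketch, so a name such as `archive.tar.gz` would fall through to plain text.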