What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that lets AI models answer questions using your own documents and data, instead of relying solely on their training knowledge.
The Problem
Large language models are trained on public data up to a cutoff date. They can't answer questions about:
- Your internal documents, policies, and procedures
- Recent information published after their training cutoff
- Private data like customer records, contracts, or research
- Domain-specific knowledge unique to your organization
You could paste documents into every prompt, but that approach is slow, expensive, and limited by the model's context window. RAG solves this by retrieving only the relevant parts of your documents at query time.
How RAG Works
RAG has two phases: building a knowledge base (ingestion) and answering questions (retrieval).
1 Ingestion — Building the Knowledge Base
Parse
Documents are parsed into plain text. PDFs, Word docs, HTML, images (via OCR) — each format has a dedicated parser.
Chunk
Text is split into small, overlapping chunks (typically 512 tokens). This ensures relevant passages can be retrieved precisely.
Embed
Each chunk is converted to a vector (a list of numbers) using an embedding model. Semantically similar text produces similar vectors.
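The ingestion steps above can be sketched end to end. This is a toy illustration, not AppXen's actual pipeline: a regex stands in for a real document parser, chunks are counted in words rather than tokens, and a hashed bag-of-words function stands in for a neural embedding model.

```python
import hashlib
import math
import re

def parse(html):
    """Toy parser: strip HTML tags to recover plain text."""
    return re.sub(r"<[^>]+>", " ", html)

def chunk(text, size=20, overlap=5):
    """Split text into overlapping word-based chunks (real pipelines count tokens)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text, dims=64):
    """Toy embedding: hash each word into one of `dims` buckets, then normalize."""
    vec = [0.0] * dims
    for w in text.lower().split():
        vec[int(hashlib.md5(w.encode()).hexdigest(), 16) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Parse -> chunk -> embed -> index: the knowledge base is a list of
# (chunk, vector) pairs; a real system would store these in a vector database.
html = "<h1>Password policy</h1><p>" + "Reset passwords from account settings. " * 10 + "</p>"
text = parse(html)
knowledge_base = [(c, embed(c)) for c in chunk(text)]
```

Each stage maps directly to a step in the pipeline; swapping the toy functions for a real parser and embedding model leaves the structure unchanged.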
2 Retrieval — Answering Questions
Embed Query
The user's question is converted to a vector using the same embedding model.
Search
The query vector is compared against all stored chunk vectors. The most similar chunks are returned, ranked by relevance.
Generate
Retrieved chunks are passed to an AI model as context, which generates an answer grounded in your actual data.
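The retrieval steps above can be sketched the same way. As a stand-in for a real embedding model, this toy version counts words over a fixed vocabulary; a list comprehension stands in for the vector database, and the final LLM call is omitted, showing only the grounded prompt it would receive.

```python
import math

def build_vocab(texts):
    """Fixed vocabulary shared by indexing and querying."""
    return sorted({w.strip(".,?!").lower() for t in texts for w in t.split()})

def embed(text, vocab):
    """Map text to a unit vector of word counts over the vocabulary."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    vec = [float(words.count(v)) for v in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # inputs are unit vectors

chunks = [
    "Password resets are handled from the account settings page.",
    "The cafeteria serves lunch between 11:30 and 13:30.",
    "VPN access requires a ticket to the IT service desk.",
]
vocab = build_vocab(chunks)
index = [(c, embed(c, vocab)) for c in chunks]

# Embed the query with the same model, rank chunks by similarity,
# then build a prompt grounded in the top results.
query = "How do I reset my password in account settings?"
qvec = embed(query, vocab)
ranked = sorted(index, key=lambda item: cosine(qvec, item[1]), reverse=True)
context = "\n".join(c for c, _ in ranked[:2])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The key property is that the query goes through the same embedding function as the chunks did at ingestion time; otherwise the vectors would not be comparable.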
Key Concepts
Embeddings
An embedding is a vector (array of numbers) that captures the meaning of text. Texts with similar meanings produce vectors that are close together in vector space.
"How do I reset my password?" → [0.12, -0.45, 0.78, ...]
"Password recovery steps" → [0.11, -0.43, 0.76, ...] ← similar!
"What's for lunch today?" → [0.89, 0.23, -0.15, ...] ← different
Vector Database
A specialized database optimized for storing and searching vectors. Instead of exact keyword matching, it finds the most semantically similar results. AppXen uses pgvector — the PostgreSQL vector extension — for fast approximate nearest-neighbor search.
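"Similar" here usually means cosine similarity, the comparison a vector search typically performs under the hood. Using just the first three numbers of the illustrative vectors above (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 or below for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

reset    = [0.12, -0.45, 0.78]   # "How do I reset my password?"
recovery = [0.11, -0.43, 0.76]   # "Password recovery steps"
lunch    = [0.89,  0.23, -0.15]  # "What's for lunch today?"

print(round(cosine(reset, recovery), 2))  # close to 1.0
print(round(cosine(reset, lunch), 2))     # low, here negative
```

A vector database such as pgvector computes this kind of comparison across millions of stored vectors using approximate nearest-neighbor indexes rather than a full scan.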
Chunking
Documents are split into small pieces (chunks) because embedding models work best on shorter text, and smaller chunks mean more precise retrieval. Chunks overlap slightly so context isn't lost at boundaries. AppXen defaults to 512 tokens per chunk with 50 tokens of overlap.
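A minimal sketch of overlapping chunking with AppXen's default numbers. For readability it counts words rather than model tokens, so the sizes are approximate:

```python
def chunk(text, size=512, overlap=50):
    """Split text into chunks of `size` words, each sharing `overlap`
    words with the previous chunk so boundary context isn't lost."""
    words = text.split()
    step = size - overlap  # advance 462 words per chunk at the defaults
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

# A 1200-word document yields chunks starting at words 0, 462, and 924.
doc = " ".join(f"w{i}" for i in range(1200))
parts = chunk(doc)
```

The 50-word overlap means a sentence straddling a chunk boundary still appears whole in at least one chunk.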
RAG Engine vs Storage
AppXen offers two ways to make files available to AI agents. Choose based on how agents need to use the content.
RAG Engine
Documents are parsed, chunked, embedded, and indexed. Agents search across content semantically.
- Best for large document collections
- Agents find relevant passages automatically
- Works across hundreds of documents at once
- Returns ranked, scored results
Storage
Files are stored as-is. Agents can list and read files directly via MCP tools.
- Best for reference files agents read verbatim
- Images, videos, and binary files
- When you need the exact original content
- Simpler — no processing pipeline
Where AppXen RAG Engine Fits
AppXen RAG Engine is a managed RAG pipeline that integrates directly with MCP Gateway. It handles the entire process:
- Multi-format ingestion — Upload PDFs, Word docs, HTML, Markdown, CSV, JSON, plain text, and images (with OCR)
- Automatic pipeline — Parse, chunk, embed, and index (pgvector) with no configuration
- MCP integration — Exposed as a built-in MCP server so any connected AI agent can search your knowledge base
- Dashboard — Upload documents, manage sources, run test queries, and monitor stats from the console