# Building AI Agents with RAG: From Theory to Production
When we started building the AI Agent Assistant at Skyware IT, the challenge was clear: **create an intelligent system that understands organisational context and executes complex tasks reliably**.
The solution? Combine **Retrieval-Augmented Generation (RAG)** with **function calling**. This post walks through what we learned.
## The Architecture
At a high level, our agent:
1. Receives user queries
2. Searches our knowledge base (10,000+ documents) using vector embeddings
3. Retrieves relevant context with 92% precision
4. Passes context + query to LLM
5. LLM generates response OR decides to call a function
6. Functions execute (database queries, API calls, etc.)
7. Results feed back to the LLM
8. Final response sent to user
```
User query → Embedding → Vector search (Pinecone)
  → Retrieve relevant chunks → LLM + tools
  → Execute function → Generate final response
```
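The eight steps above can be condensed into a single loop. This is a minimal sketch rather than our production code: every dependency (`embed`, `search`, `llm`, `runTool`) is a hypothetical injected interface.

```javascript
// Minimal agent loop; all helpers are hypothetical injected interfaces.
async function handleQuery(query, { embed, search, llm, runTool }) {
  const vector = await embed(query);            // step 2: embed the query
  const context = await search(vector);         // steps 2-3: vector search
  const messages = [
    { role: "system", content: `Context:\n${context.join("\n")}` },
    { role: "user", content: query },
  ];
  let reply = await llm(messages);              // steps 4-5: generate or call a tool
  while (reply.toolCall) {
    const result = await runTool(reply.toolCall); // step 6: execute the function
    messages.push({ role: "tool", content: JSON.stringify(result) }); // step 7
    reply = await llm(messages);
  }
  return reply.content;                         // step 8: final response
}
```

The loop keeps calling tools until the LLM produces a plain-text answer, which is the shape most tool-using agents share.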
## Step 1: Vector Database Setup
I used **Pinecone** with OpenAI's `text-embedding-3-small` model.
**Chunking strategy:** this is where most of the tuning effort went. For our property management docs, getting the chunk boundaries right is what produced **92%+ retrieval precision**.
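The right chunking parameters depend on your corpus; a common starting point is overlapping fixed-size windows so that context straddling a boundary appears in two chunks. A minimal sketch (the sizes here are illustrative, not our production values):

```javascript
// Split a document into overlapping character windows before embedding.
// chunkSize/overlap are illustrative defaults, not tuned production values.
function chunkDocument(text, chunkSize = 800, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // step forward, keeping `overlap` chars of shared context
  }
  return chunks;
}
```

Each chunk is then embedded and upserted into the vector index; the overlap is what keeps a sentence that spans a boundary retrievable.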
## Step 2: Function Calling
Define tools that the LLM can invoke:
```json
{
  "name": "get_property_occupancy",
  "description": "Get current occupancy for a property",
  "parameters": {
    "type": "object",
    "properties": {
      "property_id": { "type": "string" }
    },
    "required": ["property_id"]
  }
}
```
When the LLM identifies a need to get occupancy data, it calls this function with the property ID. We execute it server-side and feed the result back.
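That round trip can be sketched as follows. This is a hedged outline, not our exact production code: the client is injected (anything exposing an OpenAI-style `chat.completions.create` works), and `executeTool` is a hypothetical dispatcher from tool name and parsed arguments to the server-side call.

```javascript
// One tool-calling round trip: ask the model, execute any requested tools
// server-side, feed results back, and return the final answer.
// `client` and `executeTool` are injected, illustrative interfaces.
async function answerWithTools(client, tools, executeTool, query) {
  const messages = [{ role: "user", content: query }];
  const first = await client.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages,
    tools,
  });
  const msg = first.choices[0].message;
  if (!msg.tool_calls) return msg.content; // model answered directly

  messages.push(msg); // keep the assistant's tool-call turn in the history
  for (const call of msg.tool_calls) {
    const result = await executeTool(
      call.function.name,
      JSON.parse(call.function.arguments)
    );
    messages.push({
      role: "tool",
      tool_call_id: call.id,
      content: JSON.stringify(result),
    });
  }
  const second = await client.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages,
  });
  return second.choices[0].message.content;
}
```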
## Step 3: Handling Real-Time Context
One challenge: **keeping the agent aware of recent actions without redundant API calls**.
Solution: **Conversation memory buffer** (last 10 turns) + separate **tool call history**.
Result: **40% reduction in redundant API calls**.
```javascript
// Check the last 10 turns for a matching tool call before hitting the API again
const recentToolCalls = conversationHistory
  .slice(-10)
  .filter(msg => msg.role === 'tool')
  .map(msg => msg.name);

if (recentToolCalls.includes('get_property_occupancy')) {
  return previousResult; // serve the cached result instead of re-calling
}
```
## Step 4: Production Considerations

### Latency

We cached frequent queries in **Redis** with a 5-minute TTL.
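An in-memory stand-in illustrates the idea without a Redis dependency (production used Redis; this class and its `ttlMs` default are illustrative only):

```javascript
// In-memory TTL cache standing in for the Redis layer; entries expire
// after ttlMs milliseconds (5 minutes mirrors the production TTL).
class TtlCache {
  constructor(ttlMs = 5 * 60 * 1000) {
    this.ttlMs = ttlMs;
    this.store = new Map();
  }
  get(key) {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // lazily evict expired entries on read
      return undefined;
    }
    return entry.value;
  }
  set(key, value) {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

With Redis itself, the equivalent is a `SET` with an expiry of 300 seconds keyed on a hash of the query.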
### Cost

We used **model routing**: a cheaper model for simple queries, a stronger one only when the task demands it.
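The router can be a simple heuristic over query features. A sketch under assumptions — the signals, thresholds, and model names here are illustrative, not our exact production rules:

```javascript
// Route cheap, tool-free queries over little context to the smaller model;
// escalate everything else. Thresholds are illustrative assumptions.
function pickModel({ needsTools, retrievedChunks, queryLength }) {
  if (!needsTools && retrievedChunks <= 3 && queryLength < 200) {
    return "gpt-3.5-turbo";
  }
  return "gpt-4";
}
```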
### Safety

We added **human-in-the-loop** approval for sensitive operations.
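The gate itself is a small check before execution: sensitive tool calls are held for a human reviewer instead of running immediately. A minimal sketch — the tool names in the sensitive set are hypothetical examples:

```javascript
// Hold sensitive tool calls for human approval; everything else runs.
// The tool names in this set are hypothetical examples.
const SENSITIVE_TOOLS = new Set(["delete_tenant_record", "issue_refund"]);

function gateToolCall(call) {
  if (SENSITIVE_TOOLS.has(call.name)) {
    return { status: "pending_approval", call }; // surfaced to a reviewer queue
  }
  return { status: "auto_approved", call };
}
```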
## Real-World Results

## What We Learned
1. **Retrieval quality is everything.** Spend time on chunking and embedding strategy before anything else.
2. **Context windows matter.** With proper memory management, GPT-3.5-turbo handles complex multi-step workflows.
3. **Tool calls add latency.** Batch where possible. Cache aggressively.
4. **Users want transparency.** Show them what was retrieved and why.
**Questions or want to discuss AI architecture?** [Reach out](mailto:contact@imamfaheem.com).