# Building AI Agents with RAG: From Theory to Production
When we started building the AI Agent Assistant at Skyware IT, the challenge was clear: **create an intelligent system that understands organisational context and executes complex tasks reliably**.
The solution? Combine **Retrieval-Augmented Generation (RAG)** with **function calling**. This post walks through what we learned.
## The Architecture
At a high level, our agent:
1. Receives user queries
2. Searches our knowledge base (10,000+ documents) using vector embeddings
3. Retrieves relevant context with 92% precision
4. Passes context + query to LLM
5. LLM generates response OR decides to call a function
6. Functions execute (database queries, API calls, etc.)
7. Results feed back to the LLM
8. Final response sent to user
```
User query → Embedding → Vector search (Pinecone)
  → Retrieve relevant chunks → LLM + tools
  → Execute function → Generate final response
```
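The eight steps above can be condensed into a single loop. This is a minimal sketch rather than our production code: every dependency (`embed`, `search`, `llm`, `runTool`) is a hypothetical injected interface.

```javascript
// Minimal agent loop; all helpers are hypothetical injected interfaces.
async function handleQuery(query, { embed, search, llm, runTool }) {
  const vector = await embed(query);            // step 2: embed the query
  const context = await search(vector);         // steps 2-3: vector search
  const messages = [
    { role: "system", content: `Context:\n${context.join("\n")}` },
    { role: "user", content: query },
  ];
  let reply = await llm(messages);              // steps 4-5: generate or call a tool
  while (reply.toolCall) {
    const result = await runTool(reply.toolCall); // step 6: execute the function
    messages.push({ role: "tool", content: JSON.stringify(result) }); // step 7
    reply = await llm(messages);
  }
  return reply.content;                         // step 8: final response
}
```

The loop keeps calling tools until the LLM produces a plain-text answer, which is the shape most tool-using agents share.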
## Step 1: Vector Database Setup
I used **Pinecone** with OpenAI's `text-embedding-3-small` model.
**Chunking strategy:** this is where most of the tuning effort went. For our property management docs, getting the chunk boundaries right is what produced **92%+ retrieval precision**.
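The right chunking parameters depend on your corpus; a common starting point is overlapping fixed-size windows so that context straddling a boundary appears in two chunks. A minimal sketch (the sizes here are illustrative, not our production values):

```javascript
// Split a document into overlapping character windows before embedding.
// chunkSize/overlap are illustrative defaults, not tuned production values.
function chunkDocument(text, chunkSize = 800, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // step forward, keeping `overlap` chars of shared context
  }
  return chunks;
}
```

Each chunk is then embedded and upserted into the vector index; the overlap is what keeps a sentence that spans a boundary retrievable.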
## Step 2: Function Calling
Define tools that the LLM can invoke:
```json
{
  "name": "get_property_occupancy",
  "description": "Get current occupancy for a property",
  "parameters": {
    "type": "object",
    "properties": {
      "property_id": { "type": "string" }
    },
    "required": ["property_id"]
  }
}
```
When the LLM identifies a need to get occupancy data, it calls this function with the property ID. We execute it server-side and feed the result back.
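That round trip can be sketched as follows. This is a hedged outline, not our exact production code: the client is injected (anything exposing an OpenAI-style `chat.completions.create` works), and `executeTool` is a hypothetical dispatcher from tool name and parsed arguments to the server-side call.

```javascript
// One tool-calling round trip: ask the model, execute any requested tools
// server-side, feed results back, and return the final answer.
// `client` and `executeTool` are injected, illustrative interfaces.
async function answerWithTools(client, tools, executeTool, query) {
  const messages = [{ role: "user", content: query }];
  const first = await client.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages,
    tools,
  });
  const msg = first.choices[0].message;
  if (!msg.tool_calls) return msg.content; // model answered directly

  messages.push(msg); // keep the assistant's tool-call turn in the history
  for (const call of msg.tool_calls) {
    const result = await executeTool(
      call.function.name,
      JSON.parse(call.function.arguments)
    );
    messages.push({
      role: "tool",
      tool_call_id: call.id,
      content: JSON.stringify(result),
    });
  }
  const second = await client.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages,
  });
  return second.choices[0].message.content;
}
```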
## Step 3: Handling Real-Time Context
One challenge: **keeping the agent aware of recent actions without redundant API calls**.
Solution: **Conversation memory buffer** (last 10 turns) + separate **tool call history**.
Result: **40% reduction in redundant API calls**.
```javascript
// Check the last 10 turns for a matching tool call before hitting the API again
const recentToolCalls = conversationHistory
  .slice(-10)
  .filter(msg => msg.role === 'tool')
  .map(msg => msg.name);

if (recentToolCalls.includes('get_property_occupancy')) {
  return previousResult; // serve the cached result instead of re-calling
}
```
## Step 4: Production Considerations

### Latency

We cached frequent queries in **Redis** with a 5-minute TTL.
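An in-memory stand-in illustrates the idea without a Redis dependency (production used Redis; this class and its `ttlMs` default are illustrative only):

```javascript
// In-memory TTL cache standing in for the Redis layer; entries expire
// after ttlMs milliseconds (5 minutes mirrors the production TTL).
class TtlCache {
  constructor(ttlMs = 5 * 60 * 1000) {
    this.ttlMs = ttlMs;
    this.store = new Map();
  }
  get(key) {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // lazily evict expired entries on read
      return undefined;
    }
    return entry.value;
  }
  set(key, value) {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

With Redis itself, the equivalent is a `SET` with an expiry of 300 seconds keyed on a hash of the query.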
### Cost

We used **model routing**: a cheaper model for simple queries, a stronger one only when the task demands it.
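The router can be a simple heuristic over query features. A sketch under assumptions — the signals, thresholds, and model names here are illustrative, not our exact production rules:

```javascript
// Route cheap, tool-free queries over little context to the smaller model;
// escalate everything else. Thresholds are illustrative assumptions.
function pickModel({ needsTools, retrievedChunks, queryLength }) {
  if (!needsTools && retrievedChunks <= 3 && queryLength < 200) {
    return "gpt-3.5-turbo";
  }
  return "gpt-4";
}
```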
### Safety

We added **human-in-the-loop** approval for sensitive operations.
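The gate itself is a small check before execution: sensitive tool calls are held for a human reviewer instead of running immediately. A minimal sketch — the tool names in the sensitive set are hypothetical examples:

```javascript
// Hold sensitive tool calls for human approval; everything else runs.
// The tool names in this set are hypothetical examples.
const SENSITIVE_TOOLS = new Set(["delete_tenant_record", "issue_refund"]);

function gateToolCall(call) {
  if (SENSITIVE_TOOLS.has(call.name)) {
    return { status: "pending_approval", call }; // surfaced to a reviewer queue
  }
  return { status: "auto_approved", call };
}
```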
## Real-World Results

## What We Learned
1. **Retrieval quality is everything.** Spend time on chunking and embedding strategy before anything else.
2. **Context windows matter.** With proper memory management, GPT-3.5-turbo handles complex multi-step workflows.
3. **Tool calls add latency.** Batch where possible. Cache aggressively.
4. **Users want transparency.** Show them what was retrieved and why.
**Questions or want to discuss AI architecture?** [Reach out](mailto:contact@imamfaheem.com).