
# Building AI Agents with RAG: From Theory to Production

How I built an AI agent that handles complex organisational workflows using retrieval-augmented generation and function calling. Architecture decisions, production trade-offs, and what actually worked.

Imam Faheem

Senior Software Engineer

2024-12-15


When we started building the AI Agent Assistant at Skyware IT, the challenge was clear: **create an intelligent system that understands organisational context and executes complex tasks reliably**.

The solution? Combine **Retrieval-Augmented Generation (RAG)** with **function calling**. This post walks through what we learned.

## The Architecture

At a high level, our agent:

1. Receives user queries

2. Searches our knowledge base (10,000+ documents) using vector embeddings

3. Retrieves relevant context with 92% precision

4. Passes context + query to LLM

5. LLM generates response OR decides to call a function

6. Functions execute (database queries, API calls, etc.)

7. Results are fed back to the LLM

8. Final response sent to user

```
User query → Embedding → Vector search (Pinecone)
  → Retrieve relevant chunks → LLM + tools
  → Execute function → Generate final response
```
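The loop above can be sketched end to end. This is a minimal illustration, not the production code: `embed`, `vectorSearch`, and `llm` are stubs standing in for the real OpenAI and Pinecone clients, and the tool handler is hard-coded.

```typescript
// Hypothetical sketch of the agent loop (steps 1-8 above), with stubs
// in place of the real embedding model, vector DB, and LLM.
type ToolCall = { name: string; args: Record<string, string> };
type LlmReply = { text?: string; toolCall?: ToolCall };

const tools: Record<string, (args: Record<string, string>) => string> = {
  get_property_occupancy: (args) => `Occupancy for ${args.property_id}: 87%`,
};

function embed(query: string): number[] {
  return [query.length]; // stub: real code calls text-embedding-3-small
}

function vectorSearch(_vec: number[]): string[] {
  return ["chunk about occupancy reporting"]; // stub: real code queries Pinecone
}

function llm(_context: string[], query: string): LlmReply {
  // stub: decide to call a tool when the query mentions occupancy
  if (query.toLowerCase().includes("occupancy")) {
    return { toolCall: { name: "get_property_occupancy", args: { property_id: "P-42" } } };
  }
  return { text: "General answer" };
}

function runAgent(query: string): string {
  const context = vectorSearch(embed(query));       // steps 1-3: embed + retrieve
  const reply = llm(context, query);                // steps 4-5: LLM decides
  if (reply.toolCall) {
    const result = tools[reply.toolCall.name](reply.toolCall.args); // step 6: execute
    const final = llm(context, `${query}\nTool result: ${result}`); // step 7: feed back
    return final.text ?? result;                    // step 8: respond
  }
  return reply.text ?? "";
}
```

The real version replaces each stub with an API call, but the control flow — retrieve, decide, execute, feed back — is the same.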

## Step 1: Vector Database Setup

I used **Pinecone** with OpenAI's `text-embedding-3-small` model.

**Chunking strategy:**

- Chunk size: 512 tokens
- Overlap: 128 tokens
- Filters by document type and date

For our property management docs, this gave **92%+ retrieval precision**.
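The sliding-window chunking above (512-token chunks, 128-token overlap, i.e. a stride of 384) can be sketched as follows. Tokens are plain strings here for illustration; real code would use a tokenizer matching `text-embedding-3-small`, such as tiktoken.

```typescript
// Sliding-window chunker: fixed-size chunks with overlap so that context
// spanning a chunk boundary still appears intact in at least one chunk.
function chunkTokens(tokens: string[], size = 512, overlap = 128): string[][] {
  const stride = size - overlap; // 384 tokens between chunk starts
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += stride) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Each chunk is then embedded and upserted with metadata (document type, date) so queries can filter before the similarity search.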

## Step 2: Function Calling

Define tools that the LLM can invoke:

```json
{
  "name": "get_property_occupancy",
  "description": "Get current occupancy for a property",
  "parameters": {
    "type": "object",
    "properties": {
      "property_id": { "type": "string" }
    },
    "required": ["property_id"]
  }
}
```

When the LLM identifies a need to get occupancy data, it calls this function with the property ID. We execute it server-side and feed the result back.
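The server-side dispatch can be sketched like this; the registry shape and validation logic are illustrative assumptions, not the exact production code:

```typescript
// Hypothetical dispatcher: the LLM returns a tool name plus JSON args;
// we look the tool up, check required parameters against the schema,
// run the handler, and return the result to feed back to the LLM.
type ToolHandler = (args: Record<string, string>) => string;

const registry: Record<string, { required: string[]; run: ToolHandler }> = {
  get_property_occupancy: {
    required: ["property_id"],
    run: (args) => `occupancy(${args.property_id}) = 0.87`, // stub for a DB query
  },
};

function executeToolCall(name: string, args: Record<string, string>): string {
  const tool = registry[name];
  if (!tool) throw new Error(`Unknown tool: ${name}`);
  for (const p of tool.required) {
    if (!(p in args)) throw new Error(`Missing required parameter: ${p}`);
  }
  return tool.run(args);
}
```

Validating arguments before execution matters because the model occasionally emits malformed or incomplete calls; failing fast gives you a clean error to return to it.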

## Step 3: Handling Real-Time Context

One challenge: **keeping the agent aware of recent actions without redundant API calls**.

Solution: a **conversation memory buffer** (last 10 turns) plus a separate **tool call history**.

Result: **40% reduction in redundant API calls**.

```typescript
// Scan the last 10 turns for a recent call to the same tool
const recentToolCalls = conversationHistory
  .slice(-10)
  .filter(msg => msg.role === 'tool')
  .map(msg => msg.name);

if (recentToolCalls.includes('get_property_occupancy')) {
  return previousResult; // reuse the cached result instead of re-calling
}
```

## Step 4: Production Considerations

### Latency

Cached frequent queries in **Redis** (TTL 5 min):

- P95 latency: **2.1s → 0.8s**
### Cost

Used **model routing**:

- GPT-3.5-turbo for simple queries
- GPT-4 only for complex reasoning

**Average cost per interaction: $0.02**
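A router can be as simple as a heuristic over the request. The complexity signals below (multi-step wording, tool use, long input) are illustrative assumptions, not the exact rules we ran in production:

```typescript
// Hypothetical model router: cheap heuristics pick the model per request,
// defaulting to the inexpensive model and escalating only when needed.
function routeModel(query: string, needsTools: boolean): "gpt-3.5-turbo" | "gpt-4" {
  const multiStep = /\b(then|after that|compare|analy[sz]e)\b/i.test(query);
  if (needsTools || multiStep || query.length > 500) return "gpt-4";
  return "gpt-3.5-turbo"; // simple lookups stay on the cheaper model
}
```

Because most traffic is simple lookups, even a crude router shifts the bulk of calls onto the cheaper model, which is what drives the average cost down.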
### Safety

Added **human-in-the-loop** for sensitive operations:

- Destructive actions (delete, cancel booking) require approval
- All tool calls logged for compliance
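The approval gate can be sketched as a wrapper around tool execution; the tool names in the destructive set are hypothetical examples:

```typescript
// Sketch of the human-in-the-loop gate: destructive tools are held for
// approval instead of executing immediately, and every call is logged.
const DESTRUCTIVE = new Set(["delete_document", "cancel_booking"]);
const auditLog: string[] = [];

function gateToolCall(name: string, execute: () => string): string {
  auditLog.push(`${new Date().toISOString()} tool=${name}`); // compliance log
  if (DESTRUCTIVE.has(name)) {
    return `Pending approval: ${name}`; // surfaced to a human reviewer
  }
  return execute();
}
```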
## Real-World Results

- **Document Q&A**: 85% first-response accuracy
- **Scheduling**: Saved 15+ hours/week for the ops team
- **Support Tickets**: Resolves 80% without human involvement
## What We Learned

1. **Retrieval quality is everything.** Spend time on chunking and embedding strategy before anything else.
2. **Context windows matter.** With proper memory management, GPT-3.5-turbo handles complex multi-step workflows.
3. **Tool calls add latency.** Batch where possible. Cache aggressively.
4. **Users want transparency.** Show them what was retrieved and why.

**Questions or want to discuss AI architecture?** [Reach out](mailto:contact@imamfaheem.com).

Tags: AI, RAG, LLM, Function Calling, Vector DB, Production
