Routing Pipeline
Every request passes through session lookup and intent classification before being routed to a domain-specific RAG context and LLM call.
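The flow above can be sketched end to end. Every function body here is a stub and every name is illustrative, not the production API:

```python
# Hypothetical shape of the routing pipeline: session lookup, intent
# classification, then dispatch to a domain-scoped RAG context + LLM call.

def lookup_session(session_id: str) -> dict:
    """Fetch (or create) conversation state keyed by session id."""
    return {"session_id": session_id, "history": []}

def classify_intent(message: str) -> tuple[str, float]:
    """Return (domain, confidence) from the zero-shot classifier."""
    return ("account_orders", 0.91)

def retrieve_context(domain: str, message: str) -> list[str]:
    """Domain-scoped retrieval from the matching Qdrant namespace."""
    return [f"[{domain}] relevant chunk for: {message}"]

def handle_request(session_id: str, message: str) -> dict:
    session = lookup_session(session_id)
    domain, confidence = classify_intent(message)
    context = retrieve_context(domain, message)
    # The assembled bundle is what the downstream LLM call receives.
    return {"session": session, "domain": domain,
            "confidence": confidence, "context": context}
```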
Domain Coverage
Three specialized knowledge domains — each with its own Qdrant namespace, retrieval tuning, and escalation policy.
Account & Order Management
Sample intents: billing query · order status · return request · account issue
- Billing disputes and payment adjustments
- Account management and password resets
- Order tracking and delivery updates
- Returns, refunds, and warranty claims
Discovery & Recommendations
Sample intents: product search · pricing inquiry · compare plans · upgrade path
- Product discovery and feature comparisons
- Pricing tiers, discounts, and promotions
- Personalised upsell and cross-sell recommendations
- Subscription plan changes and upgrade flows
Troubleshooting & Docs
Sample intents: setup guide · error code · how-to · integration help
- Step-by-step troubleshooting for common errors
- How-to guides sourced from product documentation
- API and integration setup walkthroughs
- Feature configuration and advanced settings
Core Components
Six building blocks that make multi-domain routing, graceful fallback, and human handoff possible in production.
Intent Classifier
Zero-shot classification assigns confidence scores to each domain. A fine-tuned domain adapter sharpens accuracy on product-specific vocabulary. Low-confidence inputs trigger a clarification turn before routing.
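One way to implement the low-confidence gate is to require both an absolute confidence floor and a clear margin over the runner-up domain. The margin heuristic and threshold values below are assumptions, not the documented behaviour:

```python
# Sketch of the clarification gate: route only when the top domain is both
# confident and clearly ahead of the second-best candidate.

def decide_route(scores: dict[str, float],
                 min_conf: float = 0.6,
                 min_margin: float = 0.15) -> str:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_score), (_, runner_up) = ranked[0], ranked[1]
    if top_score < min_conf or (top_score - runner_up) < min_margin:
        return "clarify"  # ambiguous -> ask a clarifying question first
    return top
```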
RAG Pipeline
A Qdrant vector store with per-domain namespaces keeps retrieval scoped so results never bleed across domains. Relevance gating drops chunks that score below a threshold, and the top-k survivors are assembled into context before the LLM call.
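The gating step reduces to a filter-then-rank pass. The scores here stand in for Qdrant's similarity scores, and the threshold value is an assumption:

```python
# Illustrative relevance gating: drop chunks below the threshold, then keep
# the top-k of what survives, highest-scoring first.

def gate_and_rank(hits: list[tuple[str, float]],
                  threshold: float = 0.75,
                  top_k: int = 3) -> list[str]:
    relevant = [(chunk, score) for chunk, score in hits if score >= threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in relevant[:top_k]]
```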
Domain Adapters
LoRA fine-tuned adapters per domain are registered in a central adapter registry and hot-swapped at runtime. No service restart required when updating domain knowledge.
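Because the registry maps each domain to its current adapter, swapping an adapter is just an atomic registry write, which is why no restart is needed. This minimal sketch (class and field names are illustrative) shows the idea:

```python
import threading

# Hot-swappable adapter registry: updating a domain's adapter is a locked
# dictionary write, visible to the next request immediately.

class AdapterRegistry:
    def __init__(self):
        self._adapters: dict[str, str] = {}  # domain -> adapter version/path
        self._lock = threading.Lock()

    def register(self, domain: str, adapter_ref: str) -> None:
        with self._lock:  # swap atomically at runtime
            self._adapters[domain] = adapter_ref

    def resolve(self, domain: str) -> str:
        with self._lock:
            return self._adapters[domain]
```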
Human Escalation
Confidence threshold rules and sentiment analysis determine when to hand off. Full session context — including conversation history and retrieved chunks — transfers to the live agent in a single payload.
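The decision rule and the single-payload transfer can be sketched as follows. The sentiment scale and threshold values are illustrative assumptions:

```python
# Hand off on low model confidence or strongly negative user sentiment
# (sentiment assumed in [-1, 1]); then bundle the full session context.

def should_escalate(confidence: float, sentiment: float,
                    min_conf: float = 0.5,
                    min_sentiment: float = -0.4) -> bool:
    return confidence < min_conf or sentiment < min_sentiment

def build_handoff_payload(session: dict, chunks: list[str]) -> dict:
    # Everything the live agent needs arrives in one payload.
    return {"session_id": session["session_id"],
            "history": session["history"],
            "retrieved_chunks": chunks}
```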
Fallback Chain
Azure OpenAI is the primary inference endpoint. When Azure is unavailable or latency exceeds the configured threshold, requests fail over automatically to a local Mistral instance — zero data egress during fallback.
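The failover logic amounts to a deadline-bounded call with a local retry path. Both callables below are stand-ins for the real Azure and Ollama clients; the interface is an assumption:

```python
# Call the primary endpoint with a deadline; on timeout or connection
# failure, fall back to the local model. Because the local Mistral instance
# exposes the same interface, the caller is unchanged.

def generate_with_fallback(prompt: str, primary, fallback,
                           timeout_s: float = 5.0) -> tuple[str, str]:
    try:
        return "azure", primary(prompt, timeout=timeout_s)
    except (TimeoutError, ConnectionError):
        return "local", fallback(prompt)
```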
Frontend & UX
React chat widget streams tokens via SSE for a responsive feel. Typing indicators and optimistic UI keep perceived latency low. Conversation history is persisted in Redis with TTL-based expiry.
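Server-side, token streaming boils down to wrapping each token in an SSE `data:` frame. In production this generator would back a FastAPI StreamingResponse; the `[DONE]` sentinel is an assumption borrowed from common streaming APIs, not a documented detail:

```python
# Wrap each generated token in a Server-Sent Events frame so the React
# widget can render token by token.

def sse_frames(tokens):
    for token in tokens:
        yield f"data: {token}\n\n"  # one SSE event per token
    yield "data: [DONE]\n\n"        # sentinel so the client can close

frames = list(sse_frames(["Hel", "lo"]))
```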
Conversation State Machine
Four LangGraph nodes process every turn, from session hydration through to streamed response or escalation decision.
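A framework-agnostic sketch of that four-node loop is shown below. In production these would be LangGraph nodes passing shared graph state; the node names and stubbed bodies are assumptions based on the description above:

```python
# Each node takes the shared turn state, mutates it, and passes it on.

def hydrate(state):    # 1. load session + conversation history
    state["history"] = state.get("history", [])
    return state

def classify(state):   # 2. intent + confidence (stubbed)
    state["domain"], state["confidence"] = "troubleshooting", 0.9
    return state

def respond(state):    # 3. domain-scoped RAG + LLM call (stubbed)
    state["reply"] = f"[{state['domain']}] streamed answer"
    return state

def finalize(state):   # 4. stream the response or flag for escalation
    state["escalate"] = state["confidence"] < 0.5
    return state

def run_turn(state):
    for node in (hydrate, classify, respond, finalize):
        state = node(state)
    return state
```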
Deployment & Fallback
Cloud-primary with an on-premises fallback that shares the same REST interface — switching is transparent to the LangGraph agent loop.
Cloud Path — Azure OpenAI
- Primary inference endpoint for all production traffic
- GPT-4o for complex multi-turn conversations
- GPT-4o-mini for intent classification (lower latency)
- Azure content filtering as first safety layer
- Managed scaling and SLA guarantees
Local Fallback — Mistral
- Triggered when Azure is unavailable or latency > threshold
- Mistral 7B Instruct via Ollama — same REST interface
- Fully on-prem: zero data egress during fallback
- Automatic retry on Azure recovery; no session disruption
Key Tools & Libraries
The primary frameworks and services that power the assistant in production.
Azure OpenAI
Enterprise-grade GPT-4o access with SLA, content filtering, and VNet integration for secure on-prem connectivity.
LangGraph
Graph-based agent framework. Models conversation as a stateful directed graph — each node maps to a step in the conversation state machine.
Mistral AI
Open-weight models (7B–70B) for on-prem fallback. Apache 2.0 licence enables full local deployment with no data egress.
Qdrant
High-performance vector store. Namespace support enables per-domain RAG isolation — support, sales, and technical queries never cross-contaminate.
React
Frontend chat widget with SSE streaming support and component-level session state. Token-by-token rendering reduces perceived latency.
FastAPI
Async Python API layer with native SSE streaming, WebSocket support, and auto-generated OpenAPI schema for client SDK generation.
Redis
Session and conversation history store. TTL-based expiry enforces GDPR-compliant data retention. Pub/sub used for live agent escalation signalling.
LlamaIndex
RAG orchestration — document ingestion, chunk indexing, and retrieval pipeline abstraction. Handles chunking strategy and context assembly before the LLM call.