System Design

Multi-Domain Customer Assistant

Production chatbot with dynamic intent routing, domain-specific RAG, and graceful fallback across cloud and local models — no dead ends, no dropped context.

Azure OpenAI · LangGraph · Mistral · Qdrant · React · FastAPI · Redis · Python

Routing Pipeline

Every request passes through session lookup and intent classification before being routed to a domain-specific RAG context and LLM call.

User Input → Session Lookup → Intent Classifier → Router → (Support | Sales | Technical) Domain → RAG Retrieval → LLM Call → Safety Filter → Response (or Escalate)
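
Under assumed names and thresholds, the router's core decision can be sketched in a few lines of Python (the production system implements this as LangGraph nodes; everything here is illustrative):

```python
# Minimal sketch of the routing step: pick the highest-confidence domain,
# or fall back to a clarification turn. Names and threshold are assumptions.

CONFIDENCE_FLOOR = 0.6  # assumed gating threshold; tuned per deployment

def route(intent_scores: dict[str, float]) -> str:
    domain, score = max(intent_scores.items(), key=lambda kv: kv[1])
    if score < CONFIDENCE_FLOOR:
        return "clarify"   # low confidence: ask a clarifying question first
    return domain          # next step: domain-scoped RAG retrieval
```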

Domain Coverage

Three specialized knowledge domains — each with its own Qdrant namespace, retrieval tuning, and escalation policy.

Customer Support

Account & Order Management

Example intents: "billing query" · "order status" · "return request" · "account issue"
  • Billing disputes and payment adjustments
  • Account management and password resets
  • Order tracking and delivery updates
  • Returns, refunds, and warranty claims
Escalates to a live agent when sentiment is negative and resolution requires account write access.
Product & Sales

Discovery & Recommendations

Example intents: "product search" · "pricing inquiry" · "compare plans" · "upgrade path"
  • Product discovery and feature comparisons
  • Pricing tiers, discounts, and promotions
  • Personalised upsell and cross-sell recommendations
  • Subscription plan changes and upgrade flows
Escalates to the sales team when purchase intent is high and the deal value exceeds a configurable threshold.
Technical / FAQ

Troubleshooting & Docs

Example intents: "setup guide" · "error code" · "how-to" · "integration help"
  • Step-by-step troubleshooting for common errors
  • How-to guides sourced from product documentation
  • API and integration setup walkthroughs
  • Feature configuration and advanced settings
Escalates to technical support when confidence falls below the threshold after two clarification turns.
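
The three escalation policies above can be collapsed into one predicate; the field names and numeric thresholds here are illustrative assumptions, not the production values:

```python
def should_escalate(domain: str, *, sentiment: float = 0.0, needs_write: bool = False,
                    purchase_intent: float = 0.0, deal_value: float = 0.0,
                    confidence: float = 1.0, clarification_turns: int = 0,
                    deal_threshold: float = 5_000.0,
                    confidence_floor: float = 0.5) -> bool:
    """Mirror the three per-domain escalation rules described above."""
    if domain == "support":    # negative sentiment AND account write access needed
        return sentiment < 0 and needs_write
    if domain == "sales":      # high purchase intent AND deal above threshold
        return purchase_intent >= 0.8 and deal_value > deal_threshold
    if domain == "technical":  # low confidence after two clarification turns
        return confidence < confidence_floor and clarification_turns >= 2
    return False
```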

Core Components

Six building blocks that make multi-domain routing, graceful fallback, and human handoff possible in production.

Intent Classifier

Zero-shot classification assigns a confidence score to each domain. A fine-tuned domain adapter sharpens accuracy on product-specific vocabulary. Low-confidence inputs trigger a clarification turn before routing.

LangGraph · zero-shot classification · fine-tuned domain adapter

RAG Pipeline

A Qdrant vector store with per-domain namespaces keeps retrieval scoped, so context never bleeds across domains. Relevance gating drops chunks that score below the threshold, and the top-k survivors are assembled into context before the LLM call.

Qdrant · LlamaIndex · relevance gating · context assembly
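
Relevance gating and top-k assembly reduce to a short filter-sort-join; the threshold and k below are assumed defaults, and `hits` stands in for the vector search results:

```python
def assemble_context(hits: list[tuple[str, float]],
                     min_score: float = 0.7, top_k: int = 4) -> str:
    """Drop chunks below the relevance threshold, keep the top-k by score,
    and join them into one context block for the LLM prompt.

    `hits` are (chunk_text, relevance_score) pairs; values are illustrative.
    """
    kept = sorted((h for h in hits if h[1] >= min_score),
                  key=lambda h: h[1], reverse=True)
    return "\n\n".join(text for text, _ in kept[:top_k])
```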

Domain Adapters

LoRA fine-tuned adapters per domain are registered in a central adapter registry and hot-swapped at runtime. No service restart required when updating domain knowledge.

LoRA fine-tuning · hot-swap at runtime · adapter registry
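
A minimal sketch of such a registry, with strings standing in for loaded LoRA adapter handles (the real registry would hold adapter objects and is an assumption of this sketch):

```python
import threading

class AdapterRegistry:
    """Central registry mapping domain -> adapter handle. register() replaces
    the entry atomically, so updated weights go live without a restart."""

    def __init__(self) -> None:
        self._adapters: dict[str, str] = {}
        self._lock = threading.Lock()

    def register(self, domain: str, adapter: str) -> None:
        with self._lock:               # hot-swap under lock
            self._adapters[domain] = adapter

    def resolve(self, domain: str) -> str:
        with self._lock:
            return self._adapters[domain]
```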

Human Escalation

Confidence threshold rules and sentiment analysis determine when to hand off. Full session context — including conversation history and retrieved chunks — transfers to the live agent in a single payload.

confidence threshold rules · live agent handoff · full session context transfer
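
The single-payload transfer might look like this; the field names are assumptions, not the production schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class HandoffPayload:
    """Everything the live agent receives in one payload (sketch)."""
    session_id: str
    domain: str
    history: list = field(default_factory=list)           # full conversation turns
    retrieved_chunks: list = field(default_factory=list)  # RAG context at handoff
    reason: str = "low_confidence"                        # why we escalated
```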

Fallback Chain

Azure OpenAI is the primary inference endpoint. When Azure is unavailable or latency exceeds the configured threshold, requests fail over automatically to a local Mistral instance — zero data egress during fallback.

Azure OpenAI primary · local Mistral fallback · retry logic · graceful degradation

Frontend & UX

React chat widget streams tokens via SSE for a responsive feel. Typing indicators and optimistic UI keep perceived latency low. Conversation history is persisted in Redis with TTL-based expiry.

React chat widget · SSE streaming · typing indicators · Redis session persistence
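
The SSE wire format the widget consumes can be sketched as a generator; the `[DONE]` sentinel is an assumed end-of-stream marker, not a documented protocol detail:

```python
def sse_frames(tokens):
    """Yield model tokens framed as Server-Sent Events (text/event-stream).

    Sketch of the framing the React widget parses token by token; the real
    endpoint streams frames like these from FastAPI."""
    for tok in tokens:
        yield f"data: {tok}\n\n"   # one SSE frame per token
    yield "data: [DONE]\n\n"       # assumed end-of-stream sentinel
```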

Conversation State Machine

Four LangGraph nodes process every turn — from session hydration through to a streamed response or an escalation decision.

1. Receive: Session lookup from Redis, conversation history assembly, and user input tokenisation. The context window budget is calculated before any LLM call.
2. Classify: Intent detection via the fine-tuned classifier. Confidence thresholds gate routing; low-confidence queries trigger a clarification turn first.
3. Retrieve: Qdrant vector search scoped to the assigned domain's namespace. Relevance gating drops chunks below the threshold; top-k context is assembled for the LLM.
4. Respond: LLM call (Azure OpenAI, or Mistral on fallback). Safety filter on the output, escalation decision based on sentiment and confidence, response streamed to the frontend.
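
The four nodes above can be sketched as a plain-Python pipeline, with stub logic standing in for Redis, the fine-tuned classifier, Qdrant, and the LLM (every body here is an illustrative placeholder, not the real node):

```python
# Four-node turn loop, stubbed. Production wires these as LangGraph nodes.

def receive(state):
    state.setdefault("history", [])
    # rough ~4-chars-per-token budget estimate before any LLM call
    state["budget"] = 8_000 - sum(len(m) // 4 for m in state["history"])
    return state

def classify(state):
    # keyword match standing in for the fine-tuned intent classifier
    state["domain"] = "support" if "refund" in state["input"].lower() else "technical"
    return state

def retrieve(state):
    state["context"] = f"[top-k chunks from the {state['domain']} namespace]"
    return state

def respond(state):
    state["reply"] = f"drafted answer using {state['context']}"
    return state

def handle_turn(state):
    for node in (receive, classify, retrieve, respond):
        state = node(state)
    return state
```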

Deployment & Fallback

Cloud-primary with an on-premises fallback that shares the same REST interface — switching is transparent to the LangGraph agent loop.

Cloud Path — Azure OpenAI

  • Primary inference endpoint for all production traffic
  • GPT-4o for complex multi-turn conversations
  • GPT-4o-mini for intent classification (lower latency)
  • Azure content filtering as first safety layer
  • Managed scaling and SLA guarantees

Local Fallback — Mistral

  • Triggered when Azure is unavailable or latency > threshold
  • Mistral 7B Instruct via Ollama — same REST interface
  • Fully on-prem: zero data egress during fallback
  • Automatic retry on Azure recovery; no session disruption
Request → Azure OK? → Yes: Azure OpenAI | No: Mistral Local
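
The failover decision reduces to a try-then-fallback wrapper; the latency budget and the injected client functions below are illustrative stand-ins for the two same-interface REST endpoints:

```python
import time

LATENCY_BUDGET_S = 5.0   # assumed failover threshold; configurable in production

def call_with_fallback(prompt, azure_call, mistral_call):
    """Try Azure first; fail over to local Mistral on error or a slow reply.

    `azure_call` / `mistral_call` are hypothetical client functions that
    share one interface, so the caller never sees which path served it."""
    start = time.monotonic()
    try:
        reply = azure_call(prompt)
        if time.monotonic() - start <= LATENCY_BUDGET_S:
            return reply, "azure"
    except Exception:
        pass                                   # Azure unavailable
    return mistral_call(prompt), "mistral"     # on-prem path: zero data egress
```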

Key Tools & Libraries

The primary frameworks and services that power the assistant in production.

Azure OpenAI

Enterprise-grade GPT-4o access with SLA, content filtering, and VNet integration for secure on-prem connectivity.

LangGraph

Graph-based agent framework. Models conversation as a stateful directed graph — each node maps to a step in the conversation state machine.

Mistral AI

Open-weight models (7B–70B) for on-prem fallback. Apache 2.0 licence enables full local deployment with no data egress.

Qdrant

High-performance vector store. Namespace support enables per-domain RAG isolation — support, sales, and technical queries never cross-contaminate.

React

Frontend chat widget with SSE streaming support and component-level session state. Token-by-token rendering reduces perceived latency.

FastAPI

Async Python API layer with native SSE streaming, WebSocket support, and auto-generated OpenAPI schema for client SDK generation.

Redis

Session and conversation history store. TTL-based expiry enforces GDPR-compliant data retention. Pub/sub used for live agent escalation signalling.

LlamaIndex

RAG orchestration — document ingestion, chunk indexing, and retrieval pipeline abstraction. Handles chunking strategy and context assembly before the LLM call.
