Routing Pipeline
Every request passes through session lookup and intent classification before being routed to a domain-specific RAG context and LLM call.
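The flow above can be sketched end to end. Every function body here is a stub and every name is illustrative, not the production API:

```python
# Hypothetical shape of the routing pipeline: session lookup, intent
# classification, then dispatch to a domain-scoped RAG context + LLM call.

def lookup_session(session_id: str) -> dict:
    """Fetch (or create) conversation state keyed by session id."""
    return {"session_id": session_id, "history": []}

def classify_intent(message: str) -> tuple[str, float]:
    """Return (domain, confidence) from the zero-shot classifier."""
    return ("account_orders", 0.91)

def retrieve_context(domain: str, message: str) -> list[str]:
    """Domain-scoped retrieval from the matching Qdrant namespace."""
    return [f"[{domain}] relevant chunk for: {message}"]

def handle_request(session_id: str, message: str) -> dict:
    session = lookup_session(session_id)
    domain, confidence = classify_intent(message)
    context = retrieve_context(domain, message)
    # The assembled bundle is what the downstream LLM call receives.
    return {"session": session, "domain": domain,
            "confidence": confidence, "context": context}
```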
Domain Coverage
Three specialized knowledge domains — each with its own Qdrant namespace, retrieval tuning, and escalation policy.
Account & Order Management
Sample intents: billing query · order status · return request · account issue
- Billing disputes and payment adjustments
- Account management and password resets
- Order tracking and delivery updates
- Returns, refunds, and warranty claims
Discovery & Recommendations
Sample intents: product search · pricing inquiry · compare plans · upgrade path
- Product discovery and feature comparisons
- Pricing tiers, discounts, and promotions
- Personalised upsell and cross-sell recommendations
- Subscription plan changes and upgrade flows
Troubleshooting & Docs
Sample intents: setup guide · error code · how-to · integration help
- Step-by-step troubleshooting for common errors
- How-to guides sourced from product documentation
- API and integration setup walkthroughs
- Feature configuration and advanced settings
Core Components
Six building blocks that make multi-domain routing, graceful fallback, and human handoff possible in production.
Intent Classifier
Zero-shot classification assigns confidence scores to each domain. A fine-tuned domain adapter sharpens accuracy on product-specific vocabulary. Low-confidence inputs trigger a clarification turn before routing.
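One way to implement the low-confidence gate is to require both an absolute confidence floor and a clear margin over the runner-up domain. The margin heuristic and threshold values below are assumptions, not the documented behaviour:

```python
# Sketch of the clarification gate: route only when the top domain is both
# confident and clearly ahead of the second-best candidate.

def decide_route(scores: dict[str, float],
                 min_conf: float = 0.6,
                 min_margin: float = 0.15) -> str:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_score), (_, runner_up) = ranked[0], ranked[1]
    if top_score < min_conf or (top_score - runner_up) < min_margin:
        return "clarify"  # ambiguous -> ask a clarifying question first
    return top
```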
RAG Pipeline
A Qdrant vector store with per-domain namespaces keeps retrieval scoped so results never bleed across domains. Relevance gating drops chunks that score below a threshold, and the top-k survivors are assembled into context before the LLM call.
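The gating step reduces to a filter-then-rank pass. The scores here stand in for Qdrant's similarity scores, and the threshold value is an assumption:

```python
# Illustrative relevance gating: drop chunks below the threshold, then keep
# the top-k of what survives, highest-scoring first.

def gate_and_rank(hits: list[tuple[str, float]],
                  threshold: float = 0.75,
                  top_k: int = 3) -> list[str]:
    relevant = [(chunk, score) for chunk, score in hits if score >= threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in relevant[:top_k]]
```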
Domain Adapters
LoRA fine-tuned adapters per domain are registered in a central adapter registry and hot-swapped at runtime. No service restart required when updating domain knowledge.
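Because the registry maps each domain to its current adapter, swapping an adapter is just an atomic registry write, which is why no restart is needed. This minimal sketch (class and field names are illustrative) shows the idea:

```python
import threading

# Hot-swappable adapter registry: updating a domain's adapter is a locked
# dictionary write, visible to the next request immediately.

class AdapterRegistry:
    def __init__(self):
        self._adapters: dict[str, str] = {}  # domain -> adapter version/path
        self._lock = threading.Lock()

    def register(self, domain: str, adapter_ref: str) -> None:
        with self._lock:  # swap atomically at runtime
            self._adapters[domain] = adapter_ref

    def resolve(self, domain: str) -> str:
        with self._lock:
            return self._adapters[domain]
```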
Human Escalation
Confidence threshold rules and sentiment analysis determine when to hand off. Full session context — including conversation history and retrieved chunks — transfers to the live agent in a single payload.
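The decision rule and the single-payload transfer can be sketched as follows. The sentiment scale and threshold values are illustrative assumptions:

```python
# Hand off on low model confidence or strongly negative user sentiment
# (sentiment assumed in [-1, 1]); then bundle the full session context.

def should_escalate(confidence: float, sentiment: float,
                    min_conf: float = 0.5,
                    min_sentiment: float = -0.4) -> bool:
    return confidence < min_conf or sentiment < min_sentiment

def build_handoff_payload(session: dict, chunks: list[str]) -> dict:
    # Everything the live agent needs arrives in one payload.
    return {"session_id": session["session_id"],
            "history": session["history"],
            "retrieved_chunks": chunks}
```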
Fallback Chain
Azure OpenAI is the primary inference endpoint. When Azure is unavailable or latency exceeds the configured threshold, requests fail over automatically to a local Mistral instance — zero data egress during fallback.
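The failover logic amounts to a deadline-bounded call with a local retry path. Both callables below are stand-ins for the real Azure and Ollama clients; the interface is an assumption:

```python
# Call the primary endpoint with a deadline; on timeout or connection
# failure, fall back to the local model. Because the local Mistral instance
# exposes the same interface, the caller is unchanged.

def generate_with_fallback(prompt: str, primary, fallback,
                           timeout_s: float = 5.0) -> tuple[str, str]:
    try:
        return "azure", primary(prompt, timeout=timeout_s)
    except (TimeoutError, ConnectionError):
        return "local", fallback(prompt)
```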
Frontend & UX
React chat widget streams tokens via SSE for a responsive feel. Typing indicators and optimistic UI keep perceived latency low. Conversation history is persisted in Redis with TTL-based expiry.
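Server-side, token streaming boils down to wrapping each token in an SSE `data:` frame. In production this generator would back a FastAPI StreamingResponse; the `[DONE]` sentinel is an assumption borrowed from common streaming APIs, not a documented detail:

```python
# Wrap each generated token in a Server-Sent Events frame so the React
# widget can render token by token.

def sse_frames(tokens):
    for token in tokens:
        yield f"data: {token}\n\n"  # one SSE event per token
    yield "data: [DONE]\n\n"        # sentinel so the client can close

frames = list(sse_frames(["Hel", "lo"]))
```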
Conversation State Machine
Four LangGraph nodes process every turn, from session hydration through to streamed response or escalation decision.
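A framework-agnostic sketch of that four-node loop is shown below. In production these would be LangGraph nodes passing shared graph state; the node names and stubbed bodies are assumptions based on the description above:

```python
# Each node takes the shared turn state, mutates it, and passes it on.

def hydrate(state):    # 1. load session + conversation history
    state["history"] = state.get("history", [])
    return state

def classify(state):   # 2. intent + confidence (stubbed)
    state["domain"], state["confidence"] = "troubleshooting", 0.9
    return state

def respond(state):    # 3. domain-scoped RAG + LLM call (stubbed)
    state["reply"] = f"[{state['domain']}] streamed answer"
    return state

def finalize(state):   # 4. stream the response or flag for escalation
    state["escalate"] = state["confidence"] < 0.5
    return state

def run_turn(state):
    for node in (hydrate, classify, respond, finalize):
        state = node(state)
    return state
```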
Deployment & Fallback
Cloud-primary with an on-premises fallback that shares the same REST interface — switching is transparent to the LangGraph agent loop.
Cloud Path — Azure OpenAI
- Primary inference endpoint for all production traffic
- GPT-4o for complex multi-turn conversations
- GPT-4o-mini for intent classification (lower latency)
- Azure content filtering as first safety layer
- Managed scaling and SLA guarantees
Local Fallback — Mistral
- Triggered when Azure is unavailable or latency > threshold
- Mistral 7B Instruct via Ollama — same REST interface
- Fully on-prem: zero data egress during fallback
- Automatic retry on Azure recovery; no session disruption
Key Tools & Libraries
The primary frameworks and services that power the assistant in production.
Azure OpenAI
Enterprise-grade GPT-4o access with SLA, content filtering, and VNet integration for secure on-prem connectivity.
LangGraph
Graph-based agent framework. Models conversation as a stateful directed graph — each node maps to a step in the conversation state machine.
Mistral AI
Open-weight models (7B–70B) for on-prem fallback. Apache 2.0 licence enables full local deployment with no data egress.
Qdrant
High-performance vector store. Namespace support enables per-domain RAG isolation — support, sales, and technical queries never cross-contaminate.
React
Frontend chat widget with SSE streaming support and component-level session state. Token-by-token rendering reduces perceived latency.
FastAPI
Async Python API layer with native SSE streaming, WebSocket support, and auto-generated OpenAPI schema for client SDK generation.
Redis
Session and conversation history store. TTL-based expiry enforces GDPR-compliant data retention. Pub/sub used for live agent escalation signalling.
LlamaIndex
RAG orchestration — document ingestion, chunk indexing, and retrieval pipeline abstraction. Handles chunking strategy and context assembly before the LLM call.