Platform Layer Stack
Four interdependent layers from bare-metal infrastructure up to end-user applications — all running on-premise.
What to Obey
Non-negotiable constraints that every on-premise LLM deployment must satisfy before going to production.
Air-Gap by Design
- No outbound model API calls — all inference stays on-prem
- Firewall egress rules block LLM SaaS endpoints by default
- DMZ placement for any externally accessible inference endpoints
- Internal DNS only — no public resolution for model endpoints
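The air-gap rules above can be expressed as a deny-by-default egress policy. A minimal sketch, assuming an internal allowlist; all host names here are illustrative placeholders, not the document's actual topology:

```python
from urllib.parse import urlparse

# Deny-by-default egress: only internal hosts are reachable.
# Allowlist entries are illustrative placeholders.
INTERNAL_ALLOWLIST = {"inference.corp.internal", "vectors.corp.internal"}

# Well-known LLM SaaS endpoints that must never be reachable from inside.
BLOCKED_SAAS = {"api.openai.com", "api.anthropic.com"}

def egress_allowed(url: str) -> bool:
    """Return True only if the target host is on the internal allowlist."""
    host = urlparse(url).hostname or ""
    if host in BLOCKED_SAAS:
        return False
    return host in INTERNAL_ALLOWLIST
```

In practice this check lives in the firewall and DNS layers, not application code; the sketch only shows the deny-by-default logic.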
Data Residency & Audit
- GDPR data residency — all embeddings and logs stay in the defined region
- SOC 2 audit logging — every prompt, model, user, and latency recorded
- Prompt injection defence — input sanitisation and output validation layers
- Data classification enforced — PII routed to restricted-access models only
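The audit requirement (every prompt, model, user, and latency recorded) maps naturally onto a structured log record. A sketch with illustrative field names; hashing the prompt rather than storing it raw is one assumed way to limit PII in logs:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class AuditRecord:
    """One SOC 2 style audit entry per inference call; field names are illustrative."""
    user: str
    model: str
    prompt_sha256: str   # hash rather than raw text, to limit PII in the log
    latency_ms: float
    timestamp: float
    data_class: str      # e.g. "public", "internal", "pii-restricted"

def to_log_line(rec: AuditRecord) -> str:
    """Serialise as one JSON line for an append-only audit sink."""
    return json.dumps(asdict(rec), sort_keys=True)
```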
GPU Sizing & HA
- VRAM budget per model: 7B≈6GB, 13B≈12GB, 70B≈48GB (4-bit quant)
- HA setup — at least two inference nodes with load balancing
- Resource quotas per team/role — prevent GPU monopolisation
- Cold-start SLA — model load time factored into p99 latency budget
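The VRAM figures above can be approximated with a simple rule of thumb: quantised weight size plus a multiplier for KV cache, activations, and runtime overhead. The 1.5x overhead factor below is an assumption, which is why it lands slightly under the budgets quoted above; real budgets depend on context length and batch size:

```python
def vram_estimate_gb(params_billion: float, bits: int = 4,
                     overhead: float = 1.5) -> float:
    """Rough VRAM budget for a quantised model.

    weights: 1B params at 8 bits is ~1 GB, so scale by bits/8.
    overhead: assumed 1.5x multiplier for KV cache, activations, runtime.
    """
    weights_gb = params_billion * bits / 8
    return weights_gb * overhead
```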
Core Platform Components
Six capability areas that together form a production-grade on-premise LLM platform.
Model Serving
Self-hosted inference runtime with GPU acceleration, GGUF/GPTQ quantisation support, and a REST API compatible with OpenAI clients. Hot-swap models without downtime.
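Because the runtime exposes an OpenAI-compatible REST API, existing clients only need their base URL repointed at the on-prem node. A minimal request-building sketch; the host name is a placeholder:

```python
import json

def chat_request(model: str, user_msg: str,
                 base_url: str = "http://llm.corp.internal:11434") -> tuple[str, bytes]:
    """Build an OpenAI-compatible chat completion request for a
    self-hosted runtime. base_url is an illustrative internal endpoint."""
    url = f"{base_url}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }).encode()
    return url, body
```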
API Gateway
Centralised request routing across multiple models with rate limiting, JWT validation, and usage metering per team. Routes traffic based on model capability requirements.
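The gateway's two core behaviours, capability-based routing and per-team rate limiting, can be sketched as follows. Model names and the fixed-window limiter are illustrative assumptions, not production-grade implementations:

```python
# Capability-based routing table; model names are illustrative.
ROUTES = {
    "chat": "llama3-8b",
    "code": "codellama-13b",
    "reasoning": "llama3-70b",
}

class RateLimiter:
    """Fixed-window request limiter per team; a sketch, not production-grade."""
    def __init__(self, limit_per_window: int):
        self.limit = limit_per_window
        self.counts: dict[str, int] = {}

    def allow(self, team: str) -> bool:
        n = self.counts.get(team, 0)
        if n >= self.limit:
            return False
        self.counts[team] = n + 1
        return True

def route(capability: str) -> str:
    """Unknown capabilities fall back to the general chat model."""
    return ROUTES.get(capability, ROUTES["chat"])
```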
Auth & RBAC
Single sign-on via OIDC with LDAP/AD bridge. Role-scoped model access — e.g., only approved roles can query uncensored or high-capability models. Token-based API auth for service accounts.
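Role-scoped model access reduces to a deny-by-default ACL check after the IdP has resolved the user's role. A sketch; the roles and model names are illustrative:

```python
# Role-to-model access policy; roles and model tiers are illustrative.
MODEL_ACL = {
    "llama3-8b": {"employee", "researcher", "admin"},
    "uncensored-70b": {"researcher", "admin"},  # high-capability tier
}

def can_query(role: str, model: str) -> bool:
    """Deny by default: unknown models or roles get no access."""
    return role in MODEL_ACL.get(model, set())
```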
Knowledge & RAG
Vector stores for semantic document retrieval. RAG pipelines ingest internal knowledge bases, code repositories, and enterprise docs — keeping sensitive data entirely within the perimeter.
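The retrieval step at the heart of such a pipeline is cosine-similarity ranking over embeddings. A stdlib-only sketch of the ranking logic (a real deployment would delegate this to the vector store):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return document ids ranked by similarity to the query embedding."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]
```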
Safety & Guardrails
Input/output validation layer intercepts harmful prompts, PII leakage, and policy violations before they reach end users. Output filtering prevents hallucinated credentials or confidential data exposure.
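One building block of such a layer is PII redaction over model output. A minimal sketch with two illustrative patterns; a production guardrail stack would use a much fuller pattern set plus semantic checks:

```python
import re

# Illustrative patterns only; real deployments need a broader set.
PII_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def redact(text: str) -> str:
    """Replace PII-looking spans before the response leaves the platform."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```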
Observability
Full-stack visibility into model performance, token throughput, error rates, and latency percentiles. Experiment tracking ties every inference call back to the model version and prompt template used.
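Latency percentiles like the p99 referenced in the cold-start SLA above reduce to a rank computation over recorded samples. A nearest-rank sketch, sufficient for dashboard-style p50/p95/p99:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over latency samples (p in 0..100)."""
    if not samples:
        raise ValueError("no samples")
    s = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[idx]
```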
Data & Integration Layer
How the platform connects to live enterprise data without exporting sensitive content to external systems.
Enterprise Connectors via MCP
Model Context Protocol bridges the LLM to live enterprise data without copying it. Agents call tools in real time — no data duplication, no stale exports.
- Confluence — knowledge base articles, technical docs
- Jira — tickets, epics, sprints, status updates
- SharePoint — document libraries, intranet pages
- SQL Databases — structured queries via read-only connectors
- REST APIs — any internal service with OpenAPI spec
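Under the hood, MCP messages are JSON-RPC 2.0; a tool invocation uses the `tools/call` method. A message-building sketch; the tool name and arguments below are hypothetical and would in practice come from the connector's advertised schema:

```python
import json
from itertools import count

_request_ids = count(1)

def mcp_tool_call(tool: str, arguments: dict) -> str:
    """Build an MCP `tools/call` JSON-RPC 2.0 request.

    The tool name and arguments are illustrative; a real client discovers
    them from the server's tool listing."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })
```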
Knowledge Store Options
| Type | When to use | Examples |
|---|---|---|
| Vector Store | Semantic similarity, document Q&A | FAISS, Qdrant, Weaviate, Chroma |
| Knowledge Graph | Entity relationships, multi-hop reasoning | Neo4j, Neptune, GraphDB |
| Hybrid | Complex enterprise RAG | Neo4j + Qdrant, LangChain graph transformers |
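The hybrid row above combines the two retrieval modes: vector search finds semantically similar documents, and the graph expands each hit with related entities. A merging sketch under that assumption, with data-shape choices (doc ids, neighbour map) purely illustrative:

```python
def hybrid_retrieve(vector_hits: list[str],
                    graph_neighbors: dict[str, list[str]],
                    k: int = 3) -> list[str]:
    """Hybrid RAG sketch: expand ranked vector hits with graph neighbours
    (multi-hop context), preserving rank order and dropping duplicates."""
    out: list[str] = []
    for doc in vector_hits:
        for candidate in [doc, *graph_neighbors.get(doc, [])]:
            if candidate not in out:
                out.append(candidate)
    return out[:k]
```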
MLOps & Testing Pipeline
A four-phase lifecycle for managing, testing, deploying, and monitoring on-premise LLMs in production.
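Reading the four phases as the four activities named above (manage, test, deploy, monitor — an assumption, since the source does not enumerate them), the lifecycle can be sketched as a gated loop:

```python
from enum import Enum

class Phase(Enum):
    # Phase names are an assumption based on the lifecycle description.
    MANAGE = 1   # model registry, versioning
    TEST = 2     # offline eval, regression suites
    DEPLOY = 3   # staged rollout to inference nodes
    MONITOR = 4  # live metrics, drift detection

def next_phase(current: Phase, gates_passed: bool) -> Phase:
    """Advance only when the phase's quality gates pass; MONITOR loops
    back to MANAGE when drift or regressions trigger rework."""
    if not gates_passed:
        return current
    order = list(Phase)
    return order[(order.index(current) + 1) % len(order)]
```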
Key Open-Source Tooling
The primary open-source projects that make a production-grade on-premise LLM platform possible.
Ollama
Run LLMs locally with GGUF model management, a REST API compatible with OpenAI clients, and GPU acceleration on NVIDIA and Apple Silicon.
ollama.ai →
Keycloak
Open source identity and access management. SSO, OIDC, LDAP/AD bridge, RBAC, all fully self-hosted. No external IdP dependency.
keycloak.org →
Qdrant
High-performance vector search engine with filtering, payload indexing, and on-disk storage support. Fully self-hostable via Docker or Kubernetes.
qdrant.tech →
MCP Protocol
Open standard for connecting AI models to external tools and data sources. Enables agents to call enterprise systems without data duplication.
modelcontextprotocol.io →
MLflow
End-to-end ML lifecycle management — experiment tracking, model registry, prompt versioning, and deployment. Integrates with any training framework.
mlflow.org →
Grafana
Observability dashboards for metrics, logs, and traces. Pair with Prometheus to monitor LLM latency, token throughput, and error rates.
grafana.com →
Guardrails AI
Output validation and safety rails for LLM responses. Schema enforcement, PII redaction, toxicity detection, and custom validators via a declarative RAIL spec.
guardrailsai.com →
Neo4j
Graph database for entity relationships and multi-hop reasoning. Used in hybrid RAG pipelines to capture structured knowledge that vector search alone cannot represent.
neo4j.com →