Platform Architecture
Five layers form a complete semantic search system — from raw document storage up to the query API serving results.
Retrieval Modes
The platform automatically selects and blends retrieval strategies based on query characteristics and tenant configuration.
Semantic Vector Search
- Sentence-Transformers bi-encoder embeddings
- Cosine similarity over HNSW index in Qdrant
- Multilingual support via multilingual-e5-base
- Handles paraphrases, synonyms, and vocabulary mismatch
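In production this ranking runs over Qdrant's HNSW index; conceptually it is cosine similarity between the query embedding and each chunk embedding. A minimal brute-force sketch of the same scoring (illustrative only — `dense_search` and the vector shapes here are stand-ins, not the platform's code):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dense_search(query_vec: list[float],
                 doc_vecs: dict[str, list[float]],
                 top_k: int = 3) -> list[tuple[str, float]]:
    """Exact nearest neighbours by cosine score; HNSW approximates this at scale."""
    scored = [(doc_id, cosine_similarity(query_vec, v))
              for doc_id, v in doc_vecs.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]
```

The HNSW graph trades a small recall loss for sub-linear query time, which is what keeps this fast at 10M+ vectors.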
BM25 Keyword Search
- BM25Okapi inverted index with term-frequency saturation and IDF weighting
- Exact token matching — no embedding overhead
- Per-tenant stop-word and stemming configuration
- Incremental index updates on document ingest
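The rank_bm25 library implements this scoring; the formula itself is compact enough to sketch. The snippet below is a self-contained illustration of BM25Okapi scoring with the library's default parameters (k1=1.5, b=0.75) — not the platform's indexing code:

```python
import math
from collections import Counter

def bm25_score(query_terms: list[str], doc_terms: list[str],
               corpus: list[list[str]], k1: float = 1.5, b: float = 0.75) -> float:
    """Score one tokenized document against a query with the BM25Okapi formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)        # document frequency
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # Okapi IDF, floored at 0
        f = tf[term]
        # Saturating TF: extra occurrences add less and less; b normalizes by length.
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
```

Because only term frequencies and document lengths are needed, adding or deleting a document updates a handful of counters — which is why incremental ingest avoids a full rebuild.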
RRF Fusion & Cross-Encoder
- Reciprocal Rank Fusion merges dense + sparse sets
- Cross-encoder re-ranker scores top-50 candidates
- Full query–passage interaction for precision
- Re-ranker threshold configurable per tenant
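Reciprocal Rank Fusion needs no score calibration between the dense and sparse lists — only ranks. A minimal sketch of the standard formula, score(d) = Σ 1/(k + rank(d)), with the conventional k=60 (the platform's exact k is not stated here):

```python
def rrf_fuse(dense_ranking: list[str], sparse_ranking: list[str],
             k: int = 60) -> list[str]:
    """Merge two ranked doc-id lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both lists accumulate the largest fused score; the top-50 of this merged list then go to the cross-encoder.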
Core Components
Six specialized subsystems that together deliver production-grade semantic search at scale.
Document Ingestion
Multi-format parsing pipeline supporting PDF, HTML, and DOCX. Sentence-level chunking with configurable overlap. Content-hash deduplication prevents re-embedding unchanged documents.
Embedding Pipeline
Sentence-Transformers models run batch GPU inference on document chunks. Incremental re-embedding triggers automatically when the active model checkpoint is updated in the model registry.
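A common trick for batch GPU inference is grouping chunks of similar length so each batch wastes less padding. A hedged sketch of that batching step (the real pipeline's batch size and ordering are not specified here):

```python
def batches_by_length(chunks: list[str], batch_size: int = 64):
    """Yield batches of similar-length chunks to minimize padding waste on the GPU."""
    ordered = sorted(chunks, key=len)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]
```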
Vector Store
Qdrant with HNSW index delivers sub-100ms approximate nearest-neighbour search at 10M+ scale. Per-tenant collection namespaces ensure data isolation. Payload filters enable faceted search without post-processing.
BM25 Index
BM25Okapi inverted index built per tenant. Term-frequency store supports incremental document addition and deletion without full rebuild. Stop-word lists and stemming rules are configurable per corpus language.
Re-ranking Layer
Cross-encoder (ms-marco-MiniLM-L-6) re-scores the top-50 candidates from RRF fusion with full query–passage interaction. Score threshold is configurable — tenants can trade latency for precision.
Search Frontend
React UI with real-time token highlighting powered by SSE streaming. Faceted sidebar filters on metadata payload fields. Infinite scroll pagination and keyboard-first navigation for power users.
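The token-highlighting stream rides on the Server-Sent Events wire format, which the browser's `EventSource` parses natively. A minimal sketch of serializing one event frame (the event name and payload shape here are hypothetical):

```python
import json

def sse_event(event: str, payload: dict) -> str:
    """Serialize one Server-Sent Events frame for the streaming search UI."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"
```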
Query Pipeline
Every search request fans out to parallel dense and sparse retrieval, then converges through fusion and re-ranking before results are streamed to the UI.
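The fan-out/converge shape maps naturally onto `asyncio.gather`. A minimal sketch with stub retrievers standing in for the real dense and sparse backends (the fusion step reuses the standard RRF formula; none of this is the platform's actual code):

```python
import asyncio

async def dense_search(query: str) -> list[str]:
    return ["d2", "d1"]          # stub: would query Qdrant

async def sparse_search(query: str) -> list[str]:
    return ["d1", "d3"]          # stub: would query the BM25 index

def fuse(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over the two ranked lists."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

async def search(query: str) -> list[str]:
    # Fan out to both retrievers in parallel, then converge through fusion.
    dense, sparse = await asyncio.gather(dense_search(query), sparse_search(query))
    return fuse(dense, sparse)
```

Re-ranking and streaming would follow the `fuse` step; running the two retrievers concurrently means total retrieval latency is the slower leg, not the sum.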
Multi-Tenancy & Scale
Hard tenant isolation and horizontal scalability without compromise on search quality.
Tenant Isolation
- Qdrant collection-per-tenant namespaces — no shared index
- Per-tenant query quotas enforced by Redis rate limiter
- API key scoped to tenant ID — zero cross-tenant data bleed
- Tenant-level index config: chunk size, re-ranker on/off, model selection
- Tenant onboarding via admin API — no manual infra provisioning
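The quota bullet describes a token-bucket limiter. A self-contained sketch of the algorithm — in production the bucket state lives in Redis so all API replicas share it, but the refill logic is the same (rate and capacity values here are placeholders):

```python
import time

class TokenBucket:
    """Per-tenant token bucket; in production this state would live in Redis."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Usage: one bucket per tenant ID, checked before the query fans out; a denied request returns 429 without touching the retrieval backends.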
Scale & Performance
Key Tools & Resources
The primary libraries, databases, and frameworks powering this platform.
Qdrant
High-performance vector DB with payload filtering, multi-tenant namespaces, and HNSW indexing. Rust core, Python client.
qdrant.tech →
Sentence-Transformers
Pre-trained bi-encoder models for dense embeddings. Multilingual support, fast batch inference, easy HuggingFace Hub integration.
sbert.net →
BM25 (rank_bm25)
Python BM25Okapi implementation. Lightweight inverted index for sparse keyword retrieval with configurable k1/b parameters.
GitHub →
FastAPI
Async search API layer. Native SSE streaming for progressive result delivery. OpenAPI schema auto-generated for client SDKs.
fastapi.tiangolo.com →
React
Search UI with real-time token highlight rendering, faceted filters, and infinite scroll. SSE client for streaming search snippets.
react.dev →
Redis
Query result cache with TTL-based expiry, per-tenant rate limiting via Redis token-bucket, and session store for search history.
redis.io →
HuggingFace Hub
Model registry for embedding and re-ranker checkpoints with versioned deploys. Supports model pinning per tenant for reproducibility.
huggingface.co →
ms-marco-MiniLM
Cross-encoder re-ranker fine-tuned on MS MARCO. Scores query–passage pairs for precise re-ranking of the top-50 fusion candidates.
Model card →