Embedding service: Focus on Design

What This Covers

Designing an embedding service that supports both real-time query traffic and large-scale indexing, with emphasis on deployment safety, compatibility, and measurement gaps.

The System

Current State

  • Embedding service runs as a stateless Kubernetes deployment (24 pods, 8 vCPU / 16 GB each) behind a single API endpoint.
  • Handles ~1,800 QPS online (P99 latency target: 120 ms) and nightly batch indexing of ~60M text chunks (about 9 hours).
  • Vectors are stored in a managed vector DB with a fixed schema per index/namespace.
  • Online path calls POST /embed synchronously; batch path calls the same API with larger payloads (up to 256 chunks/request).
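Since both paths share the same POST /embed endpoint, the batch path has to respect the 256-chunk payload cap. A minimal splitter can sketch this; the payload fields (`model_id`, `inputs`) are assumptions, since only the endpoint and the cap are specified here:

```python
MAX_CHUNKS_PER_REQUEST = 256  # batch-path payload cap from the service description

def to_embed_requests(chunks, model_id="current"):
    """Split a chunk list into /embed request payloads of <= 256 chunks each.

    The payload shape is hypothetical; the real schema is not given.
    """
    for i in range(0, len(chunks), MAX_CHUNKS_PER_REQUEST):
        yield {
            "model_id": model_id,  # assumed field, not confirmed by the doc
            "inputs": chunks[i:i + MAX_CHUNKS_PER_REQUEST],
        }

payloads = list(to_embed_requests([f"chunk-{n}" for n in range(600)]))
# 600 chunks split as 256 + 256 + 88 across three requests
```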

Proposed Change

  • Move to weekly embedding model refreshes to improve relevance, deployed via canary (10% → 50% → 100% over 6 hours).
  • Keep indexing “continuous” (near-real-time) rather than nightly, to reduce freshness lag.
  • Reduce infra cost by consolidating worker pools and relying on autoscaling, rather than maintaining separate online and batch fleets.
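A deterministic, sticky traffic splitter is one common way to implement the staged canary; the hash-based bucketing below is an illustrative sketch, not the team's actual router:

```python
import hashlib

CANARY_STAGES = [0.10, 0.50, 1.00]  # 10% -> 50% -> 100% over ~6 hours

def routes_to_canary(request_id: str, fraction: float) -> bool:
    """Bucket a request deterministically so the same caller stays on the
    same model variant for the whole canary stage (stable comparisons)."""
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000 < fraction
```

Sticky bucketing matters here because flapping a caller between embedding models mid-session would itself look like a relevance regression.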

Worked Example: Design Tradeoff

A previous iteration used a single shared worker pool for both batch re-indexing and online query embeddings to simplify operations and maximize utilization. The risk was that large batch jobs could monopolize embedding throughput because there was no separation or admission control between workloads. The issue surfaced as a pattern of 10x online latency spikes during re-indexing windows, with timeouts only in the query path while batch throughput looked healthy.

The design was adjusted by adding explicit workload isolation: separate request queues and concurrency limits for batch vs online, plus per-source metrics (queue depth and latency percentiles). This preserved high batch throughput while keeping online P99 within target during indexing windows.
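The isolation described above can be approximated with per-source admission control. This is a minimal sketch (class and field names are hypothetical): separate concurrency limits for each workload, plus a per-source rejection counter of the kind that would feed the queue-depth and latency metrics:

```python
import threading

class AdmissionController:
    """Per-workload concurrency caps so batch traffic cannot starve online."""

    def __init__(self, online_slots=64, batch_slots=16):
        self.limits = {
            "online": threading.Semaphore(online_slots),
            "batch": threading.Semaphore(batch_slots),
        }
        self.rejections = {"online": 0, "batch": 0}  # per-source signal

    def try_acquire(self, source: str) -> bool:
        ok = self.limits[source].acquire(blocking=False)
        if not ok:
            self.rejections[source] += 1  # surfaces as queue pressure
        return ok

    def release(self, source: str) -> None:
        self.limits[source].release()
```

The key property is that exhausting the batch pool rejects (or queues) batch work while leaving online slots untouched, which is exactly the failure mode the shared pool lacked.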

The Design Question

With weekly model refreshes and continuous indexing, the team wants to avoid any “big bang” re-index while still improving relevance quickly. How would you design the embedding service + indexing workflow so that online queries remain compatible with what’s already in the vector DB during and after a model deployment? Consider operational complexity, data correctness, and how you’d prove the system is behaving as intended over time.
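One compatibility pattern worth weighing here, sketched with hypothetical namespace and model names: pin each index namespace to the model_id that produced its vectors, and embed queries with that pinned model rather than with whatever model is newest. Vectors from different embedding models are not comparable, so this invariant is what "compatible during and after a deployment" reduces to:

```python
# Hypothetical registry: each namespace records the model that built it.
INDEX_MODEL = {
    "docs-v1": "embed-2026-03-11",  # illustrative model ids
    "docs-v2": "embed-2026-03-18",
}
ACTIVE_NAMESPACE = "docs-v1"  # flipped to docs-v2 only after its backfill completes

def query_model_for(namespace: str) -> str:
    """A query must be embedded with the same model as the index it searches."""
    return INDEX_MODEL[namespace]
```

Under this scheme a model refresh is a backfill into a new namespace plus an atomic flip, never an in-place mix of old and new vectors.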

Anchor Data

Canary results from last week (new embedding model candidate vs current):

Metric (online)                            Baseline (current model)   Canary (10% traffic)
P50 embed latency                          34 ms                      36 ms
P99 embed latency                          112 ms                     115 ms
Timeout rate                               0.08%                      0.09%
Cost per 1M embeddings (est.)              $1.92                      $1.87
Offline retrieval recall@10 (frozen set)   0.412                      0.447
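Quick arithmetic over the canary numbers above, to make the tradeoff explicit: the recall gain is large relative to the latency and cost deltas:

```python
# Relative recall lift vs. marginal latency/cost movement (figures from the table)
baseline_recall, canary_recall = 0.412, 0.447
recall_lift = (canary_recall - baseline_recall) / baseline_recall  # ~8.5% relative

p99_delta_ms = 115 - 112        # +3 ms on a 112 ms baseline (~2.7%)
cost_delta = 1.87 - 1.92        # about -$0.05 per 1M embeddings
```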

Indexing job summary (same week):

  • Chunks sent to embedding: 18,400,000
  • Vectors acknowledged stored: 18,398,600
  • Index build metadata recorded: build_time=2026-03-18T02:14Z, model_id=<redacted>

Current Observations

  • Search relevance complaints increased “a bit” within a day of the canary reaching 50%, but latency and error dashboards stayed flat.
  • Offline benchmarks re-run after deployment look consistently improved, even when repeated on the same query set.
  • Some teams suspect the reranker deploy from the same day; others point to a content pipeline change that increased short/boilerplate chunks.
  • Vector DB shows stable query latency; cache hit rate is unchanged week-over-week.

Constraints

  • Product requires weekly model updates; pausing indexing for full rebuild is not acceptable.
  • SRE wants minimal new moving parts (no new datastore), but will accept modest schema changes in the vector DB.
  • Migration must be reversible within 30 minutes if relevance regresses.
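The 30-minute reversibility constraint is easiest to satisfy when rollback is a pointer flip rather than a re-embed. A minimal sketch, assuming versioned namespaces (`docs-v1` and `docs-v2` are hypothetical names) that are both kept alive through the rollback window:

```python
import time

class NamespaceAlias:
    """Routing alias over vector-DB namespaces. Flipping it back is O(1),
    which is what makes a sub-30-minute rollback realistic: the old
    model's vectors must simply still exist behind the previous target."""

    def __init__(self, target: str):
        self.target = target
        self.history = [(time.time(), target)]

    def flip(self, new_target: str) -> None:
        self.history.append((time.time(), new_target))
        self.target = new_target

    def rollback(self) -> None:
        # Revert to the previous target; no data movement is required.
        if len(self.history) >= 2:
            self.history.pop()
            self.target = self.history[-1][1]
```

This also satisfies the SRE constraint: it needs only a modest schema/metadata addition in the vector DB, not a new datastore.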

What You’ll Be Evaluated On

  • Tradeoffs: Identifying competing concerns (freshness vs compatibility vs complexity) and justifying choices.
  • Gaps: Spotting what’s missing or unproven in the proposal/data (metadata, measurement, invariants).
  • Prevention: Safeguards to avoid subtle correctness regressions across deploys (compatibility strategy, rollback plan).
  • Clarity: Communicating a coherent design with crisp assumptions.
  • Prioritization: Sequencing what to build/measure first under constraints.
  • Reasoning quality: Sound technical logic, not guesswork.