
Latency vs Cost in RAG Pipelines


I built RAGX at FSMK as a production-grade RAG service - not a demo, but something with real evaluation and cost controls. The tension between model quality on one side and latency and cost on the other was the main engineering problem.

The obvious lever is the model. GPT-4 gives better answers than GPT-3.5. But at FSMK's budget (roughly zero dollars), GPT-4 would exhaust the free tier in an afternoon of testing. We picked GPT-3.5-turbo with a strict context window budget: keep retrieved chunks under 2000 tokens, don't over-retrieve.
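A minimal sketch of that context budget, assuming chunks arrive in ranked order. The whitespace token count is an approximation; the real service would count with the model's tokenizer (e.g. tiktoken), and `trim_to_budget` is an illustrative name, not the actual function.

```python
def trim_to_budget(chunks, max_tokens=2000):
    """Keep retrieved chunks, in ranked order, until the token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())  # rough token count via whitespace split
        if used + n > max_tokens:
            break  # dropping lower-ranked chunks enforces "don't over-retrieve"
        kept.append(chunk)
        used += n
    return kept
```

Stopping at the first chunk that overflows (rather than skipping it and continuing) keeps the context in rank order, which matters more than squeezing in every last token.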

Streaming made the latency problem manageable without actually solving it. Time-to-first-token on a typical query was around 800ms. With streaming, users see text appearing at 800ms instead of staring at a blank screen for 4 seconds while the full response buffers. Same actual latency, completely different perceived experience. That's not a trick - it's just good UX.
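The perceived-latency point can be made concrete with a small sketch: wrap any token stream and record time-to-first-token separately from total time. The `fake_llm_stream` generator below is a stand-in for a real streaming API.

```python
import time

def fake_llm_stream(text):
    """Stand-in for a streaming LLM API: yields one token at a time."""
    for tok in text.split():
        yield tok

def stream_with_ttft(stream):
    """Forward tokens from a stream, recording time-to-first-token.

    TTFT is what the user perceives; total time is what the server pays.
    """
    start = time.monotonic()
    ttft = None
    tokens = []
    for tok in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token has arrived
        tokens.append(tok)
    return " ".join(tokens), ttft
```

With a real API the loop body would flush each token to the client immediately; buffering them in a list here just keeps the sketch testable.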

The most impactful optimization was caching embeddings. Every query that hits the same document chunk doesn't need to re-embed it. We stored embeddings in the vector DB with their source text hashes as keys. Cache hit rate was around 70% on repeated queries over the same document set. That's 70% fewer embedding API calls, which is where most of the cost was.
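A sketch of that cache, assuming a `embed_fn` callable standing in for the embedding API. The SHA-256 of the chunk text serves as the key, as described above; the class name and hit/miss counters are illustrative.

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by a stable hash of the chunk text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # the (paid) embedding API call
        self.store = {}           # hash -> embedding vector
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(text):
        # Content-addressed: identical text always maps to the same key.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text):
        k = self.key(text)
        if k in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[k] = self.embed_fn(text)  # only misses cost money
        return self.store[k]
```

Hashing the text rather than using the text itself as the key keeps keys fixed-length, which plays nicely with vector-DB metadata fields.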

Rate limiting wasn't optional. The first week without it, a single test loop burned through the daily quota in 20 minutes. We added per-user and per-IP rate limits and that problem went away.
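The limiter can be sketched as a per-key token bucket; keying the same structure on user id and on client IP gives both limits from one mechanism. This is an illustrative in-memory version, not the production implementation (which would need shared state across workers).

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-key token-bucket rate limiter (in-memory sketch)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec   # refill rate
        self.burst = burst         # max tokens a key can accumulate
        self.tokens = defaultdict(lambda: burst)   # start each key full
        self.last = defaultdict(time.monotonic)    # last refill timestamp

    def allow(self, key):
        now = time.monotonic()
        elapsed = now - self.last[key]
        self.last[key] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[key] = min(self.burst, self.tokens[key] + elapsed * self.rate)
        if self.tokens[key] >= 1:
            self.tokens[key] -= 1
            return True  # request permitted
        return False     # over quota; a runaway test loop gets cut off here
```

The burst parameter is what saves a runaway loop from the daily quota: it bounds how many requests any single key can fire before refill becomes the limiting factor.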

On evaluation: we tracked two metrics - relevance and faithfulness. Relevance is whether the retrieved chunks actually relate to the query. Faithfulness is whether the answer sticks to what those chunks say, or whether the model goes off-script and makes things up. You can have high relevance and low faithfulness if the retrieval is good but the LLM ignores the context. That distinction matters when you're debugging why answers are wrong.
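The post doesn't say how the two metrics were scored, so here is a toy word-overlap proxy purely to illustrate the distinction: relevance compares query to retrieved chunks, faithfulness compares answer to retrieved chunks, and the two can diverge.

```python
def _words(text):
    return set(text.lower().split())

def relevance(query, chunks):
    """Fraction of query words covered by the retrieved chunks (toy proxy)."""
    q = _words(query)
    retrieved = set().union(*(_words(c) for c in chunks)) if chunks else set()
    return len(q & retrieved) / len(q) if q else 0.0

def faithfulness(answer, chunks):
    """Fraction of answer words grounded in the retrieved chunks (toy proxy)."""
    a = _words(answer)
    context = set().union(*(_words(c) for c in chunks)) if chunks else set()
    return len(a & context) / len(a) if a else 0.0
```

The failure mode described above falls out directly: perfect retrieval with an off-script answer scores relevance 1.0 and faithfulness near 0.

```python
chunks = ["paris is the capital of france"]
relevance("capital of france", chunks)      # 1.0 - retrieval was good
faithfulness("berlin probably", chunks)     # 0.0 - the answer ignored it
```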

The thing that surprised me: better chunking improved both metrics more than switching models. Overlapping 200-token chunks with a 50-token stride beat fixed 500-token chunks cleanly. I expected the model to matter more.
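The winning chunker above is a sliding window; a minimal sketch over a pre-tokenized document, with size and stride as parameters so the 200/50 configuration is just one setting:

```python
def sliding_chunks(tokens, size=200, stride=50):
    """Overlapping chunks: each window starts `stride` tokens after the last.

    With size=200 and stride=50, consecutive chunks share 150 tokens, so a
    sentence split by one chunk boundary lands intact in a neighboring chunk.
    """
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already reaches the end of the document
    return chunks
```

The overlap is why this beats fixed 500-token chunks: no fact sits only at a chunk boundary, so retrieval stops missing answers that straddle a cut.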