Fine-tuning vs RAG vs prompt engineering: when each actually wins

The three are not in opposition; they form a stack

Vendor blog posts often pose 'fine-tuning vs RAG' as a binary. It is not. A real production system typically uses a frontier model with carefully engineered prompts as the reasoning core, RAG to inject domain knowledge the model does not have, and fine-tuning where domain behavior cannot be stabilized with prompting alone. Each layer does what it is good at.

The framing problem is that each technique has its evangelists, and each evangelist tends to dismiss the other two. The useful question is not which approach is the right religion but which of the three is the highest-leverage move for the next thing you need to ship.

Prompt engineering: the highest-leverage starting point, almost always

Start with prompts. Frontier models can be steered remarkably far with good prompt design — explicit instructions, few-shot examples, chain-of-thought structure, structured output schemas. Iteration is fast, the cost is engineer time, and the result is reproducible because the prompt is version-controlled code. Most teams underestimate how far prompt engineering goes; the ones who get good at it ship faster than the ones who jump to fine-tuning.
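
To make "the prompt is version-controlled code" concrete, here is a minimal sketch in Python. The task, schema, and few-shot examples are invented for illustration; the point is that the whole artifact diffs and reviews like any other module.

```python
# prompt_v3.py -- prompts live in version control like any other code.
# The task, output schema, and few-shot examples are illustrative.

SYSTEM_PROMPT = """\
You are a claims-triage assistant. Classify each claim summary.
Respond with JSON only, matching this schema:
  {"category": "auto" | "property" | "injury", "confidence": "high" | "low"}
If the summary is ambiguous, set confidence to "low" rather than guessing."""

# Few-shot examples pin down the output format and edge-case handling.
FEW_SHOT = [
    {"role": "user", "content": "Rear-ended at a stoplight, bumper damage."},
    {"role": "assistant", "content": '{"category": "auto", "confidence": "high"}'},
    {"role": "user", "content": "Water damage, cause unclear."},
    {"role": "assistant", "content": '{"category": "property", "confidence": "low"}'},
]

def build_messages(claim_summary: str) -> list[dict]:
    """Assemble the full message list for one inference call."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + FEW_SHOT
        + [{"role": "user", "content": claim_summary}]
    )
```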

The limit of prompt engineering is reliability under distribution shift. The prompt that works on 95% of inputs may fail on the 5% that look subtly different. When that 5% matters and prompt iteration is not closing the gap, the next move is RAG or fine-tuning depending on the failure mode.

At a glance:

- Prompt iteration cycle: hours, given an engineer and an eval set
- RAG quality dependence: more than 60% rides on retrieval, not the model
- Fine-tune training data: 500–5,000 examples for typical SFT
- Production stack: all three techniques in most deployments

RAG: the right move when the model does not know your corpus

Retrieval-augmented generation injects relevant documents into the model's context at inference time. The model does not need to know your corpus during training; it just needs the right chunks at the right moment. RAG wins when the corpus is large, frequently updated, access-controlled, or proprietary — all of which describe most enterprise corpora.
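
A minimal sketch of that inference-time loop, in Python. The `search` and `llm` callables stand in for whatever retrieval backend and model client you use; their names and signatures here are assumptions, not a real library API.

```python
# A minimal RAG inference loop: retrieve, assemble context, generate.

def answer_with_rag(question: str, search, llm, k: int = 5) -> str:
    chunks = search(question, top_k=k)  # retrieval happens at inference time
    context = "\n\n".join(
        f"[{c['doc_id']}] {c['text']}" for c in chunks  # keep ids for citations
    )
    prompt = (
        "Answer using ONLY the context below. Cite doc ids in brackets.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```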

RAG quality is overwhelmingly a function of retrieval, not the model. Sixty to seventy percent of RAG quality is determined before the model gets called: chunking, hybrid search, reranking, freshness, access control. Teams that focus on the model component and skimp on retrieval ship RAG systems that hallucinate against retrieved chunks because the chunks were wrong.
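
As one concrete example of the hybrid-search piece, here is a sketch of reciprocal rank fusion (RRF), a common way to merge lexical and vector rankings into one hybrid result. The inputs are assumed to be doc ids ordered best-first by each retriever.

```python
# Reciprocal rank fusion: combine two rankings without tuning score scales.

def rrf_fuse(lexical_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (lexical_ids, vector_ids):
        for rank, doc_id in enumerate(ranked):
            # Standard RRF term: 1 / (k + rank); k damps the head of each list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```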

Fine-tuning: the right move when behavior cannot be prompted

Fine-tuning teaches a model behavior that prompt engineering cannot reliably stabilize: style, format, domain reasoning shortcuts, refusal patterns. It does not teach knowledge as efficiently as RAG; the rule of thumb is that fine-tuning changes how the model thinks, while RAG changes what the model knows.

Production fine-tuning lands in two regimes. Supervised fine-tuning (SFT) on 500–5,000 high-quality examples shifts behavior reliably for narrow tasks. Reinforcement learning from human feedback or model-generated preferences refines behavior further. Both require eval discipline because fine-tuning regressions are silent — the model behaves differently in subtle ways that surface as production issues, not training metrics.
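
For concreteness, a sketch of the chat-style JSONL record most SFT pipelines accept, with a trivial sanity check. Field names follow the common messages/role/content convention; your trainer's schema may differ.

```python
import json

# One SFT training record in the common chat JSONL layout (illustrative).
example = {
    "messages": [
        {"role": "system", "content": "You are a claims-triage assistant..."},
        {"role": "user", "content": "Rear-ended at a stoplight, bumper damage."},
        {"role": "assistant", "content": '{"category": "auto", "confidence": "high"}'},
    ]
}

def validate(record: dict) -> None:
    roles = [m["role"] for m in record["messages"]]
    assert roles[-1] == "assistant", "each example must end with the target output"
    assert all(m["content"].strip() for m in record["messages"]), "no empty turns"

validate(example)
with open("sft_train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```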

Open-weights fine-tuning vs hosted fine-tuning is a sovereignty question

Hosted fine-tuning APIs (OpenAI, Anthropic, Google) are convenient but produce a fine-tuned model that lives on the vendor's infrastructure under the vendor's terms. Open-weights fine-tuning (Llama, Mistral, Qwen) produces a model the customer owns end-to-end, hostable wherever they choose, with full control over inference cost and data flow.
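
A minimal sketch of the open-weights path using Hugging Face transformers and peft (LoRA adapters). The model name and hyperparameters are illustrative, not recommendations.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load an open-weights base model (name is illustrative).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora = LoraConfig(
    r=16,                                  # adapter rank: capacity vs. size trade-off
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # a fraction of a percent of the base model

# ...train on your SFT dataset, then save adapters you own end-to-end:
model.save_pretrained("out/lora-adapters")
```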

The right choice depends on capability requirements, cost ceilings, and sovereignty posture. Hosted fine-tuning is faster to ship and benefits from frontier base capability. Open-weights is more flexible and more sovereign but requires self-managed infrastructure. We see organizations deploying both, sometimes for the same use case at different stages.

Evaluation discipline matters more than method choice

Whichever method you pick, the eval harness is what makes it trustworthy. Prompt changes need eval regression detection. RAG changes need retrieval eval (hit rate, MRR, citation accuracy) plus end-to-end eval. Fine-tuned models need eval on capability, refusal, and regression against the base model. Without eval discipline, every change is a guess and every regression is a customer report.
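
Hit rate and MRR in particular are cheap to compute. A sketch, assuming each eval case pairs the retrieved doc ids for a query with the id of the gold document:

```python
# Retrieval eval: hit rate @ k and mean reciprocal rank (MRR) @ k.

def hit_rate_and_mrr(cases: list[tuple[list[str], str]], k: int = 5):
    hits, rr_sum = 0, 0.0
    for retrieved, gold in cases:
        top_k = retrieved[:k]
        if gold in top_k:
            hits += 1
            rr_sum += 1.0 / (top_k.index(gold) + 1)  # reciprocal rank
    n = len(cases)
    return hits / n, rr_sum / n

# Wire this into CI and fail the build when either metric regresses.
```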

The teams that ship reliably are not the ones with the cleverest method choice. They are the ones with eval harnesses running in CI and a refusal to ship a release that regresses the eval. Method choice is interesting; eval discipline is decisive.

Cost: prompt is cheapest to iterate, RAG is cheapest to update, fine-tuning is cheapest to serve

Prompt iteration is engineer hours. RAG iteration is corpus updates and retrieval tuning. Fine-tuning iteration is dataset curation, training time, eval cycles. Inference cost runs the other direction: prompts on a frontier model are most expensive per token; fine-tuned smaller models can be 10x cheaper per token at the cost of being less generally capable. The right blend depends on volume and capability needs.
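
A back-of-envelope way to reason about that trade-off is to compute the volume at which serving savings repay the fine-tuning project. Every number below is a placeholder, not a real vendor rate.

```python
# Break-even volume for a fine-tuned model (all figures assumed).
frontier_per_mtok = 10.00        # $ per million tokens on a frontier model
finetuned_per_mtok = 1.00        # ~10x cheaper to serve
finetune_project_cost = 30_000   # data curation + training + eval cycles

savings_per_mtok = frontier_per_mtok - finetuned_per_mtok
breakeven_mtok = finetune_project_cost / savings_per_mtok
print(f"break-even at ~{breakeven_mtok:,.0f}M tokens")  # ~3,333M tokens
```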

We spent six weeks fine-tuning a model when two days of prompt engineering would have closed 80% of the gap. We learned that the hard way. The next project we did the prompt work first, found we needed RAG, added RAG, and only fine-tuned the last 5% of behavior the prompt-and-RAG stack could not stabilize. That sequence is the playbook now.

— Tech Lead, healthcare AI client

Frequently asked

When should I fine-tune a model?

Fine-tune when prompt engineering and RAG cannot reliably produce the behavior you need — style, format, domain reasoning shortcuts, refusal patterns. Fine-tuning is for changing how the model thinks, not what it knows; the latter is RAG's job. Prerequisites are 500–5,000 high-quality training examples and an eval harness that catches regression. Without those, fine-tuning produces silent behavior shifts that surface as production issues.

When does RAG win over fine-tuning?

RAG wins when you need the model to answer questions over a corpus the model does not know: large, frequently updated, access-controlled, or proprietary content. RAG injects the relevant context at inference time without retraining. Fine-tuning is inefficient at storing factual knowledge and goes stale the moment the corpus updates. RAG handles knowledge; fine-tuning handles behavior. They are complements, not substitutes.

How do I know if prompt engineering will be enough?

Try it first, time-box at one to two weeks, measure against an eval set. If prompt iteration closes the gap, ship the prompt and move on. If it plateaus with persistent failure modes that cluster around 'model does not know X' (try RAG) or 'model cannot stably do X' (consider fine-tuning), escalate. Skipping prompt engineering and jumping to fine-tuning or RAG is the most expensive sequence error we see.

Should I use hosted fine-tuning or open-weights?

Hosted fine-tuning is faster to ship and inherits frontier base capability; the model lives on the vendor's infrastructure under their terms. Open-weights fine-tuning produces a model you own end-to-end, deployable wherever you choose, with full control over inference cost and data flow. Choose hosted for speed and capability; choose open-weights for sovereignty, cost control, and infrastructure flexibility. Many organizations end up deploying both.

How much training data does fine-tuning need?

Production supervised fine-tuning typically lands at 500–5,000 high-quality examples for narrow behavior tasks. Quality matters more than quantity: a curated 800 examples beats a noisy 8,000. RLHF or preference tuning runs on top of that SFT base. The actual blocker is rarely volume; it is the curation discipline to produce clean, representative examples that capture the behavior you want.

Why does retrieval matter more than the model in RAG?

Because the model can only reason over what retrieval returns. If retrieval brings back wrong, stale, or irrelevant chunks, the model produces confident answers grounded in those wrong chunks — the worst kind of failure. Sixty to seventy percent of RAG quality is determined before the model is called. Hybrid search, chunking strategy, reranking, freshness windows, and access control are where the engineering effort earns its keep.