Here is a sentence that sounds like a lie: A Jakarta-based marketplace built a production AI system covering 47 languages without fine-tuning a single model. It took nine person-days. And it worked.
The team was not staffed with dozens of PhDs. They did not have a multi-million dollar compute budget. They did not spend six months curating training data. They simply did what most Western teams rarely consider: they solved the problem without touching model weights.
The instinct to fine-tune is understandable. For the past few years, the dominant narrative in the AI engineering community has been that fine-tuning is the path to domain expertise. You take a base model, train it on your data, and it magically becomes better at your specific task. This narrative is now so embedded that many teams reach for fine-tuning as the default solution for any moderately complex problem.
But here is the uncomfortable truth that the fine-tuning industry does not advertise: for the vast majority of production LLM tasks (classification, extraction, summarization, routing, translation, structured output generation), fine-tuning delivers marginal improvements that rarely justify its costs. The enterprise deployment analyses cited at the end of this piece report that carefully engineered prompting consistently matches or exceeds fine-tuned performance on routine tasks, at a fraction of the development time and ongoing maintenance overhead.
The Jakarta marketplace case is instructive precisely because it violates every Western assumption about how AI systems should be built. The requirement was straightforward: a vendor-facing interface that could handle product listings, customer inquiries, and transactional communications across 47 languages — many of them low-resource languages with limited representation in standard training corpora.
The Western approach would have been to collect data for each language, fine-tune a model per language or per region, build a routing layer to direct traffic to the appropriate fine-tune, and then maintain 47 separate model variants over time. The estimated effort: 6 to 12 months, plus ongoing maintenance costs that would scale linearly with language count.
The Jakarta team did the opposite. They invested nine person-days in a three-layer architecture:
Layer 1: Prompt engineering. They built a single system prompt, written in English, that specified the task for every target language and relied on the model’s cross-lingual transfer to apply those instructions to non-English input. The prompt explicitly instructed the model to preserve semantic meaning and structural formatting across language boundaries. No per-language variants. No language detection logic. Just a prompt that let the model’s existing multilingual training do the work.
Layer 2: Retrieval augmentation. For categories with language-specific edge cases — idioms, local address formats, region-specific product attributes — they built a lightweight retrieval system that injected examples into the prompt context. Not fine-tuned embeddings. Not domain-specific model variants. Just a few hundred curated examples served from a simple key-value store.
Layer 3: Rule-based validation. After the model generated output, a set of deterministic rules validated its structural integrity: that required fields exist, that dates are in the correct format, that numeric values fall within acceptable bounds. If validation failed, the system retried with a more explicit prompt. If validation passed, the output was accepted.
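The three layers can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the Jakarta team's code: the prompt wording, the example store contents, the field names (`title`, `price`, `listed_on`), the price bound, and the `call_model` callable (a stand-in for whatever LLM API is in use) are all hypothetical.

```python
import json
import re
from typing import Callable, Optional

# Layer 1: one English system prompt for every language (hypothetical wording).
SYSTEM_PROMPT = (
    "You normalize marketplace product listings. The input may be in any "
    "language. Answer in the language of the input, preserve its meaning and "
    "formatting, and return only a JSON object with the keys "
    "title, price, listed_on (YYYY-MM-DD)."
)

# Layer 2: curated edge-case examples in a plain key-value store.
# Keys and example text are invented for illustration.
EXAMPLES = {
    "address_format": ["Keep street abbreviations such as 'Jl.' exactly as written."],
    "local_units": ["Treat '2 lusin' as a quantity of 24."],
}

def build_prompt(listing_text: str, categories: list[str]) -> str:
    """Assemble the system prompt, any retrieved examples, and the listing."""
    retrieved = [ex for cat in categories for ex in EXAMPLES.get(cat, [])]
    parts = [SYSTEM_PROMPT]
    if retrieved:
        parts.append("Examples:\n" + "\n".join(retrieved))
    parts.append("Listing:\n" + listing_text)
    return "\n\n".join(parts)

# Layer 3: deterministic validation rules.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
REQUIRED_FIELDS = ("title", "price", "listed_on")

def validate(raw: str) -> list[str]:
    """Return the list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in data]
    if "price" in data:
        price = data["price"]
        is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
        if not (is_number and 0 < price < 1_000_000_000):  # assumed bound
            errors.append("price out of range")
    if "listed_on" in data:
        date = data["listed_on"]
        if not (isinstance(date, str) and DATE_RE.match(date)):
            errors.append("listed_on is not YYYY-MM-DD")
    return errors

def process_listing(
    listing_text: str,
    categories: list[str],
    call_model: Callable[[str], str],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Run prompt -> model -> validation, retrying with a more explicit prompt."""
    base_prompt = build_prompt(listing_text, categories)
    prompt = base_prompt
    for _ in range(max_attempts):
        raw = call_model(prompt)
        errors = validate(raw)
        if not errors:
            return json.loads(raw)
        # Name the failed rules so the retry prompt is more explicit.
        prompt = (
            base_prompt
            + "\n\nYour previous answer was rejected: "
            + "; ".join(errors)
            + ". Return corrected JSON only."
        )
    return None  # Caller decides what to do after repeated failures.
```

Passing the model call in as a plain function keeps the sketch independent of any particular provider and makes the retry path easy to exercise with a stub. Note that nothing here touches model weights: every adjustment is a change to a string, a dictionary entry, or a rule.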
The result? The system went live in less than two weeks. It processed real production traffic across 47 languages from day one. And over nine months of operation, it did not require a single fine-tuning iteration. The team adjusted prompts occasionally, added retrieval examples as new edge cases appeared, and refined validation rules. But the model weights never changed.
This is not an isolated anecdote. A growing body of production evidence shows that fine-tuning is consistently overprescribed. The typical Western workflow — identify problem, collect training data, fine-tune, evaluate, deploy — treats fine-tuning as the default second step after base model selection. But that ordering gets the economics exactly backwards.
Here is the decision framework the Jakarta team used, which more teams should adopt:
Step 1: Prompt engineering. Can you describe the task clearly in natural language? Can you provide a few examples in the prompt? For classification, extraction, structured generation, and routing, the answer is almost always yes. Prompt engineering takes minutes to hours. It costs nothing beyond the API calls you are already making. And it is immediately deployable.
Step 2: Retrieval augmentation. Does the task require domain-specific knowledge that is too large to fit in the prompt? Does the model consistently struggle with certain edge cases that a few examples could fix? Build a simple retrieval system — a few hundred curated examples, a vector store if you need fuzzy matching, nothing more expensive. Retrieval augmentation adds hours to days of engineering. It requires no model retraining. And it is generally sufficient for 80% of knowledge-intensive tasks.
Step 3: Fine-tuning. Has the model consistently failed on a well-defined task category after you have exhausted prompt engineering and retrieval augmentation? Do you have at least several thousand high-quality labeled examples? Is the improvement worth the maintenance cost of hosting and updating a custom model variant? Only then should you consider fine-tuning.
In the Jakarta team’s experience, fine-tuning was never necessary. Step 1 handled 90% of the use cases. Step 2 handled the remaining edge cases. Step 3 was never invoked.
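The ordering of the three steps can be written down as a small decision function. This is a sketch of the framework as described above, not a tool the team is reported to have used: the `TaskAssessment` fields, the `recommend` function, and the threshold of 3,000 labeled examples (one reading of "several thousand") are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed reading of "at least several thousand high-quality labeled examples".
MIN_LABELED_EXAMPLES = 3000

@dataclass
class TaskAssessment:
    prompt_meets_target: bool         # Step 1: clear instructions plus a few examples suffice
    retrieval_meets_target: bool      # Step 2: injected examples or knowledge close the gap
    labeled_examples: int             # High-quality labeled examples available
    gain_justifies_maintenance: bool  # Improvement outweighs hosting and retraining cost

def recommend(task: TaskAssessment) -> str:
    """Walk the steps in order and stop at the cheapest one that works."""
    if task.prompt_meets_target:
        return "prompt engineering"
    if task.retrieval_meets_target:
        return "retrieval augmentation"
    if task.labeled_examples >= MIN_LABELED_EXAMPLES and task.gain_justifies_maintenance:
        return "consider fine-tuning"
    return "not ready for fine-tuning: revisit prompts, retrieval, or data"
```

For example, a task where prompting falls short but retrieval closes the gap yields `"retrieval augmentation"` regardless of how much labeled data is on hand, which is the point of the ordering: fine-tuning is only considered once the cheaper options have failed.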
The Western default — fine-tune first, ask questions later — is not a technical necessity. It is a cultural artifact. It reflects a belief that more control means better outcomes. But in production, more control often means more complexity, more latency, more maintenance, and more cost. The Jakarta team understood something that many Western teams have not yet internalized: the model you already have is probably good enough. Your task is not to make the model better. Your task is to prompt it better.
The fine-tuning-first approach has real costs beyond development time. Every fine-tuned model is a maintenance liability. When the base model updates, your fine-tune may break. When the task definition shifts, you may need to recollect data and retrain. When you add a new language or a new domain, you may need to build a whole new fine-tune variant. The Jakarta team’s system, by contrast, adapted to new languages and new edge cases with nothing more than prompt edits, a few added retrieval examples, and refined validation rules.
There is a deeper lesson here about where engineering effort actually belongs. The Jakarta team did not spend their time training models. They spent their time understanding the problem: what constitutes a valid product listing across 47 different regulatory environments, what kinds of errors customers actually care about, what validation rules would catch the worst failures without adding unacceptable latency. Those are domain problems. And domain problems are not solved by better models. They are solved by better understanding.
If your team’s default response to any AI task is "let’s fine-tune something," you are spending time on the wrong part of the stack. The Jakarta team built a production system covering 47 languages in nine person-days without fine-tuning. That is not an anomaly. That is a roadmap.
*Data sources: Jakarta marketplace production deployment case study (internal industry documentation, 2025–2026). Production evaluation data on prompting vs fine-tuning is drawn from multiple enterprise deployment analyses aggregated across 2025–2026 industry white papers. See also "Prompt Engineering vs Fine-Tuning: A Production Decision Framework" (Industry benchmarks, 2026).*