47 Languages, Zero Fine-Tuning, Nine Person-Days

A Jakarta-based marketplace needed to support 47 languages. Their vendor-facing interface had to handle product listings, customer inquiries, and transactional communications across Indonesia's archipelago and beyond — Bahasa Indonesia, Javanese, Sundanese, Minangkabau, Balinese, plus regional dialects of English, Chinese, and dozens of others.

A Western team would have done the obvious thing: fine-tune.

They would have collected data for each language, fine-tuned a model per language or per region, built a routing layer, and then maintained 47 separate model variants. Estimated effort: 6 to 12 months, plus ongoing maintenance costs scaling linearly with every new language.

The Jakarta team did not do that.

They built the system in nine person-days. No fine-tuning. No GPU bills. No 47 model variants to maintain. Just three layers: prompt design, retrieval augmentation, and rule-based validation.

And it worked.

The default that costs you months

The instinct to fine-tune is so embedded in Western AI engineering that most teams don't even question it. You have a domain-specific problem? You fine-tune. You have a new language? You fine-tune. You have an edge case? You fine-tune.

This instinct is expensive.

Industry estimates suggest that fine-tuning requires 10,000 to 100,000 examples and can be the most costly, time-consuming, and least flexible customization method. Upfront costs can reach tens of thousands of dollars. Iteration cycles take weeks. And once you deploy, you're locked into maintaining that model variant forever.

The Jakarta team understood something that most Western teams haven't internalized: fine-tuning is the last step, not the first.

What they actually built

The system used a three-layer architecture. None of it required changing model weights.

Layer 1: Prompt engineering. They built a single multilingual system prompt that specified the task in a neutral language format. The prompt explicitly instructed the model to preserve semantic meaning and structural formatting across language boundaries. No per-language variants. No language detection logic. Just a prompt that let the model's existing multilingual training do the work.

Layer 2: Retrieval augmentation. For categories with language-specific edge cases — idioms, local address formats, region-specific product attributes — they built a lightweight retrieval system that injected examples into the prompt context. Not fine-tuned embeddings. Not domain-specific model variants. Just a few hundred curated examples served from a simple store.

Layer 3: Rule-based validation. After the model generated output, deterministic rules validated structural integrity — required fields exist, dates are in correct format, numeric ranges fall within acceptable bounds. If validation failed, the system retried with a more explicit prompt. If validation passed, the output was accepted.

The entire system cost nothing in GPU time. The only recurring cost was API inference — and with prompt caching on the shared system prompts, even that was minimized. For repeated context in multilingual workflows, effective prices can be 60–80% cheaper after caching is applied.

The numbers that matter

The Jakarta team's results are not theoretical. They are production numbers.

47 languages supported from day one. No phased rollout. No "we'll add more later."
Nine person-days of engineering time. That's less than two work weeks for a single person — or one week for two people.
Zero fine-tuning runs. No GPU provisioning. No data labeling. No model version management.
Production traffic from day one. The system handled real vendor interactions immediately.

Now compare that to the fine-tuning path.

A team that fine-tunes for each language would need:

Data collection and labeling for each language: weeks to months
Training runs: days to weeks per language
Evaluation and iteration: weeks per language
Ongoing maintenance: every base model update risks breaking the fine-tune
New language addition: repeat the entire process

The cost differential is not 2x or 3x. It's 10x to 50x.

Why this works

The Jakarta team's approach works because modern multilingual LLMs are already capable of cross-lingual transfer without additional training. Zero-shot cross-lingual generation — using a model trained primarily on one language to make predictions in others — is well-documented in the literature. Multilingual pretrained models can zero-shot learn for unseen languages with above-chance performance, which can be further improved via model adaptation with target-language text.

The key insight is that you don't need to teach the model new languages. You need to teach it how to apply what it already knows to your specific task.

Prompt engineering tells the model what to do. Retrieval tells it how to handle edge cases. Rules tell it when to stop and ask for help. None of these require changing the model.

When fine-tuning actually makes sense

The Jakarta team's approach doesn't mean fine-tuning is never useful. It means fine-tuning is overprescribed.

Here's the decision framework they used — and that more teams should adopt:

Step 1: Prompt engineering. Can you describe the task clearly in natural language? Can you provide a few examples in the prompt? For classification, extraction, structured generation, and routing, the answer is almost always yes. Prompt engineering takes minutes to hours. It costs nothing beyond the API calls you're already making. And it is immediately deployable.

Step 2: Retrieval augmentation. Does the task require domain-specific knowledge that is too large to fit in the prompt? Does the model consistently struggle with certain edge cases that a few examples could fix? Build a simple retrieval system — a few hundred curated examples, nothing more expensive. Retrieval augmentation adds hours to days of engineering. It requires no model retraining. And it is generally sufficient for 80% of knowledge-intensive tasks.

Step 3: Fine-tuning. Has the model consistently failed on a well-defined task category after you have exhausted prompt engineering and retrieval augmentation? Do you have at least several thousand high-quality labeled examples? Is the improvement worth the maintenance cost of hosting and updating a custom model variant? Only then should you consider fine-tuning.

In the Jakarta team's experience, Step 1 handled 90% of the use cases. Step 2 handled the remaining edge cases. Step 3 was never invoked.

What your team should do tomorrow

If your default response to any AI task is "let's fine-tune something," you are spending time on the wrong part of the stack.

Here is what the Jakarta team's approach looks like in practice:

Write a prompt that describes your task. Not a vague prompt. A specific prompt with clear instructions, output format specifications, and examples. Test it on 50–100 real examples. Measure the failure rate.
For the failures, ask: can we fix this with a better prompt? Often, the answer is yes.
For the remaining failures, ask: can we fix this with a retrieval? Add a few examples to the context. No model changes required.
Only then, if the gap is still unacceptable, consider fine-tuning. And even then, start with LoRA or parameter-efficient methods — not full fine-tuning. Full fine-tuning of a 7-billion parameter model requires 100–120 GB of VRAM — translating to significant GPU costs for even a single training run.

The Jakarta team did not fine-tune because they did not need to.

They asked: "What is the simplest thing that could work?" The answer was prompt engineering, retrieval, and rules. Nine person-days. Forty-seven languages. Production from day one.

The Western default is: "What is the most thorough thing we could do?" The answer is fine-tuning. Months. Thousands of dollars. And often, no better results.

The Jakarta team's approach is not a compromise. It is a better engineering decision. It is faster, cheaper, more maintainable, and more flexible. And it works.

If your team is still fine-tuning as the default, you are not optimizing. You are following a habit that stopped making sense the day multilingual models became capable of zero-shot cross-lingual transfer.

Data sources: Jakarta marketplace production deployment case study (internal industry documentation, 2025–2026); arXiv 2402.12279; arXiv 2605.24556; arXiv 2605.28710; and other industry reports and analyses (2025–2026).