TRANSLIA

We Added Three AI Models to Our Translation QA Layer. Here's What We Actually Found.

This is about quality review, not AI translation — and that distinction matters.

I want to be upfront about what this piece is and isn't about.

It's not about using AI to translate. It's about what happens when you add AI into the quality review stage of a professional translation workflow — after human translators have done their work, and after the TMS has run its built-in QA checks.

That distinction matters, because the problems each layer is designed to catch are fundamentally different.

What Standard TMS QA Already Handles Well

Modern translation management systems have robust built-in quality assurance for specific error categories: number and date format inconsistencies, missing translations, punctuation mismatches, and terminology flags. When a glossary term appears in the source, the system highlights it in the editor and suggests the approved target-language equivalent.

For formatting, number consistency, and approved terminology, TMS-native QA is reliable.
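Checks at this level are mechanical enough to sketch in a few lines. Here's a simplified illustration of a number-consistency check (not any specific TMS's implementation, and it ignores locale-dependent decimal separators):

```python
import re

def number_mismatch(source: str, target: str) -> bool:
    """Flag a segment when the numbers in the source do not
    match the numbers in the target (order-insensitive)."""
    nums = lambda text: sorted(re.findall(r"\d+(?:[.,]\d+)?", text))
    return nums(source) != nums(target)

# A segment where "30" was dropped in translation gets flagged:
print(number_mismatch("Deliver within 30 days.",
                      "Lieferung innerhalb von Tagen."))  # True
print(number_mismatch("30 days", "30 Tage"))              # False
```

Rules like this are deterministic and cheap, which is exactly why TMS vendors ship them built in.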

So what's the gap?

The layer that standard QA tools don't cover well: semantic quality judgment.

  • Is this sentence grammatically correct but awkward in the target language?
  • Does the terminology fit in this specific context, not just in isolation?
  • Are the tone and register of the source preserved?
  • Is the writing style consistent across segments worked on at different times or by different linguists?

These are judgment calls — and they're the errors that most directly affect how end readers experience the content. They're also what human reviewers most often miss when fatigued or working across multiple language pairs.

What We Built and Why

After projects complete the standard TMS workflow, we export the bilingual file and run it through a second-pass quality review layer:

Gemini, Kimi, and DeepSeek receive the same file and audit it independently, in parallel, against the same evaluation criteria. No model sees what the others flag.

A fourth model — which we call the Synthesis AI — aggregates the three reports and arbitrates. Its job: confirm issues where two or more models agree, and dismiss flags that represent stylistic preference rather than objective error. Only consensus-confirmed issues make it through to the final report.

The report goes to the project manager and, where relevant, back to the reviewing linguist for final sign-off.

The human role in this process is arbitration and decision — not elimination.

The Finding That Changed How We Read the Reports

After several months, one pattern emerged that we hadn't anticipated.

When all three models flag the same segment → nearly always a real problem worth addressing.

When the models disagree — one flags an issue, another ignores it, the third flags something different entirely — this turned out to be the most useful signal in the whole report.

Those disagreement zones consistently map to the grey areas in the translation: segments that are technically defensible but linguistically uncertain. No clear right answer. Exactly where an experienced reviewer's judgment adds the most value.

The disagreements tell us where to direct human attention — more reliably than any single model's confidence score.
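In practice, that means splitting flagged segments into two queues instead of one. A sketch, using the same hypothetical flag format as before (segment ID plus issue type):

```python
def triage(reports: dict) -> dict:
    """Split flagged segments into unanimous issues (fix) and
    disagreement zones (route to a human reviewer)."""
    flagged_by = {}
    for model, flags in reports.items():
        for seg_id, _issue in flags:
            flagged_by.setdefault(seg_id, set()).add(model)
    n = len(reports)
    return {
        "fix":          sorted(s for s, m in flagged_by.items() if len(m) == n),
        "human_review": sorted(s for s, m in flagged_by.items() if 0 < len(m) < n),
    }

reports = {
    "gemini":   [(12, "awkward phrasing"), (40, "register shift")],
    "kimi":     [(12, "awkward phrasing")],
    "deepseek": [(12, "awkward phrasing"), (77, "terminology")],
}
print(triage(reports))
# {'fix': [12], 'human_review': [40, 77]}
```

Segment 12 goes straight into the fix list; segments 40 and 77 are the grey-area queue where reviewer judgment is spent.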

The Cross-Document Consistency Problem

Segment-level QA — whether human or AI — reviews each unit in relative isolation. It rarely catches the case where the same source string is translated two different ways across a 400-segment file, even when each instance is individually acceptable.

Brand consistency at scale requires looking at the whole document, not just the segments. We run a separate consistency sweep across the full file after the segment-level audit — surfacing every case where the same source appears with divergent target translations.
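The core of that sweep is a grouping pass over the bilingual file. A minimal sketch, assuming the file has already been parsed into (source, target) pairs:

```python
from collections import defaultdict

def consistency_sweep(segments):
    """Surface source strings that appear with more than one
    distinct target translation in the same file."""
    targets = defaultdict(set)
    for source, target in segments:
        targets[source].add(target)
    return {src: sorted(tgts) for src, tgts in targets.items() if len(tgts) > 1}

bilingual = [
    ("Sign in", "Anmelden"),
    ("Cancel",  "Abbrechen"),
    ("Sign in", "Einloggen"),   # individually fine, divergent across the file
]
print(consistency_sweep(bilingual))
# {'Sign in': ['Anmelden', 'Einloggen']}
```

Both renderings of "Sign in" are defensible on their own; only the whole-file view reveals the divergence.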

For clients whose content reaches end users across multiple markets and channels, this layer matters more than most initially expect.

The Takeaway

Several months in, the conclusion is straightforward: AI doesn't replace human review. It makes systematic coverage achievable at a scale that human review alone cannot reliably sustain.

Individual reviewers have finite attention and language-specific expertise. Parallel AI auditing with arbitration catches the patterns that fatigue and familiarity cause humans to miss. But the final call is always human — because AI suggestions require judgment to evaluate, and that's not a workaround. It's the design.

If you're working through multilingual content quality at scale — marketing materials, product UI, technical documentation, regulatory content — I'm happy to compare notes.

—
Translia — ISO 17100 & ISO 18587 certified language services, 100+ languages.
translia.com | hi@translia.com
