The AI voice you talk to was trained by humans. That part is getting harder, not easier.

In the last 18 months, AI speech systems have crossed thresholds that surprised even the people building them.

Amazon overhauled Alexa’s speech stack, including its automatic speech recognition. Alexa’s voice now runs on a large text-to-speech model trained on thousands of hours of multispeaker, multilingual, multiaccent audio. It can switch between languages mid-sentence. It picks up emotional prosody from the speaker. It laughs when the speaker laughs.

OpenAI launched ChatGPT Translate as a standalone product. Google released TranslateGemma, a family of open-weight translation-optimized models. RWS partnered with Cohere to build Language Weaver Pro. SAP unveiled an AI-driven localization strategy that integrates translation, risk-based language planning, and process simplification across the enterprise stack.

Each of these is a real capability shift. AI speech and translation have moved from research curiosities to systems you can deploy in production.

But underneath every one of these systems is a story that does not make the press release.

The hidden workforce behind every multilingual AI

The models are not trained from nothing. They are trained on speech. Real speech. Recorded by real people, in real languages, in real environments, having real conversations.

And here is what becomes obvious once you start working in this part of the industry: the closer AI gets to the edge of human language, the harder and more specific the data requirements get.

Six years ago, training a basic English speech recognition system required hours of clean recordings in studio conditions. Today, training a system that can handle a customer service call in Hong Kong — where a speaker might switch between Cantonese and English three times in a single sentence, in an office environment with background noise, with regional pronunciation that differs from mainland Mandarin and from standardized Cantonese alike — requires a fundamentally different kind of data.

That data does not exist on the open web. It cannot be scraped from YouTube. It cannot be synthesized reliably, even with the latest TTS models, because models trained repeatedly on synthetic output tend to lose the long tail of real speech, a failure mode known as model collapse.

It has to be recorded. By people. Who actually talk that way. In the right environments. To strict technical specifications. With clear consent and provenance.

Why code-switching is the current frontier

In speech AI research right now, one of the most active areas is “code-switching” — the phenomenon where bilingual speakers alternate between two or more languages within a single utterance.

It is everywhere in the real world. A Hong Kong office worker explaining a project switches between Cantonese and English several times per sentence. A Singaporean ordering food moves between English, Malay, Mandarin, and Hokkien depending on who is at the table. A Spanish-English bilingual professional in Miami flips between languages depending on the emotional register of what they are saying.
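
To make the idea concrete, a code-switched utterance can be treated as a sequence of words, each tagged with its language, and the number of switch points counted. The following is a minimal sketch, not a production method: the example sentence, its hand-assigned tags ("yue" for Cantonese, "en" for English), and the function name are all illustrative assumptions.

```python
from typing import List, Tuple

# A token is a (word, language tag) pair. The tags are assigned by hand here;
# a real pipeline would need a language identification step, not shown.
Token = Tuple[str, str]


def count_switch_points(tokens: List[Token]) -> int:
    """Count positions where the language tag differs from the previous token's."""
    return sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev[1] != cur[1])


# Hypothetical, hand-tagged Cantonese-English office sentence.
utterance: List[Token] = [
    ("我哋", "yue"), ("個", "yue"), ("project", "en"), ("下星期", "yue"),
    ("要", "yue"), ("submit", "en"), ("俾", "yue"), ("client", "en"),
]

print(count_switch_points(utterance))  # prints 5
```

A system trained only on monolingual audio never sees a boundary like these during training, which is one way to read why such utterances broke earlier recognizers.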

For decades, automatic speech recognition systems handled code-switching badly or not at all. The systems were trained on monolingual data, and code-switched utterances broke them.

That is changing in 2026. Researchers at NTU Singapore are publishing on TTS-augmented code-switching ASR. Hugging Face released FineTranslations, a trillion-token multilingual parallel dataset. Microsoft’s LINGUA program is funding 11 projects to build datasets for European low-resource languages. The global research community is converging on the conclusion that the next jump in AI speech capability runs through code-switching and low-resource language data.

But for any of this work to translate from research papers into shipping products like Alexa, Google Assistant, or enterprise call center automation, someone has to record the actual speech. Native speakers, bilingually fluent, doing realistic dual-role conversations. In specific dialects. In specific environments. To specific quality standards.

This is part of the work we do.

What it actually looks like on the ground

Recording speech data for AI training is not what most people imagine. It is not a casual phone recording. It is a tightly specified production process.

A typical batch we deliver involves dozens of scripts written for dual-speaker conversations — one person in the role of a customer, one in the role of a service agent — with specific instructions for sampling rate, audio channels, recording environment, speaker gender balance, and turn structure. The speakers must be native in the target dialect. They must be fluent enough in English to switch naturally between the two, without sounding like they are reading. The environment must produce clean audio with appropriate background characteristics.

We screen speakers. We orient them on the script structure and the conversational style required. We supervise the recording sessions to ensure technical specifications are met. We handle the metadata — speaker IDs, batch numbers, file naming conventions, channel separation — that downstream pipelines depend on. We deliver against deadlines that sit inside larger AI training timelines where any delay cascades.
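
The kind of check a downstream pipeline depends on can be sketched in a few lines. This is a hedged illustration only: the spec values (16 kHz, two channels, 16-bit PCM), the file naming pattern, and every name in the code are assumptions made for the example, not an actual client specification.

```python
import io
import re
import wave
from dataclasses import dataclass
from typing import BinaryIO, List


@dataclass(frozen=True)
class RecordingSpec:
    """Assumed technical spec; real projects define their own values."""
    sample_rate_hz: int = 16000
    channels: int = 2            # assumed: one channel per speaker
    sample_width_bytes: int = 2  # assumed: 16-bit PCM


# Hypothetical convention: <batch>_<agent id>_<customer id>_<script id>.wav
FILENAME_PATTERN = re.compile(r"^B\d{3}_S\d{4}_S\d{4}_SC\d{3}\.wav$")


def check_recording(filename: str, wav_file: BinaryIO,
                    spec: RecordingSpec = RecordingSpec()) -> List[str]:
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    if FILENAME_PATTERN.match(filename) is None:
        problems.append("filename does not match the naming convention")
    with wave.open(wav_file, "rb") as wav:
        if wav.getframerate() != spec.sample_rate_hz:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != spec.channels:
            problems.append(f"channel count is {wav.getnchannels()}")
        if wav.getsampwidth() != spec.sample_width_bytes:
            problems.append(f"sample width is {wav.getsampwidth()} bytes")
    return problems


def make_silent_wav(sample_rate_hz: int, channels: int) -> io.BytesIO:
    """Build one second of 16-bit silence in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00" * (2 * channels * sample_rate_hz))
    buffer.seek(0)
    return buffer


print(check_recording("B001_S0001_S0002_SC010.wav", make_silent_wav(16000, 2)))
print(check_recording("call1.wav", make_silent_wav(8000, 1)))
```

Automated checks like this only catch mechanical problems. Whether a speaker sounds natural or is audibly reading from the script still has to be judged by a person in the session.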

None of this is glamorous. All of it requires judgment that AI cannot supply.

The AI being trained downstream may one day handle Hong Kong customer service calls fluently. But for it to get there, real humans had to model that fluency first.

The implications for brands going global

There are two takeaways here, one specific to the language industry and one broader.

For the language industry, the takeaway is that the so-called “AI replacing translators” narrative misses where the real work is moving. Yes, AI is taking over routine translation. At the same time, AI is creating massive demand for highly specific human language work — recording, annotation, evaluation, judgment — that did not exist a decade ago. The frontier of language services is moving from producing translations to producing the data that makes AI translations possible.

For brands going global, the takeaway is more subtle. The AI translation tools you can subscribe to today are powerful because they were trained on someone’s data. Whose data they were trained on shapes what they handle well and what they handle poorly. A general-purpose AI translation engine trained primarily on English-Spanish parallel text from European Union documents will perform very differently when faced with a customer service conversation in Hong Kong than when faced with a press release in Madrid.

This means that for any market your brand actually cares about — not just the top three or four languages your AI vendor advertises — you should be asking what data the system was trained on, how that data was collected, and what kinds of conversations are still handled badly. The answer often surprises buyers.

It also means that the most defensible localization partnerships in the next phase of the industry will be the ones that combine AI workflow expertise with deep access to the human language workforce that AI still depends on. Not because human linguists translate better than AI in most cases. But because the people who can record, annotate, judge, and refine specific language varieties are the same people who can ensure that AI output for those varieties is actually trustworthy when it ships.

Where Translia sits in this picture

We support both sides of this story.

On one side, we run AI-driven translation and localization workflows for global brands and BPO partners. Our work involves AI throughout — for initial translation, terminology enforcement, cross-file consistency, first-pass quality checks. The human team focuses on judgment, brand alignment, cultural register, and decisions AI cannot defensibly make alone. This is the orchestration layer we wrote about in our last piece.

On the other side, we provide language data services to companies building the next generation of AI speech and translation systems. We work with native speakers in specific dialects, manage recording and annotation production, and deliver against the strict specifications that downstream AI training pipelines require. This is the work that makes future AI possible.

These two sides connect. The same regional language expertise that lets us deliver clean, well-specified Cantonese-English code-switching data for AI training is what lets us deliver multilingual content with cultural register and consistency for global brand operations. The same operational discipline that handles AI workflow orchestration is what handles batch handoffs to AI training pipelines.

What we sell, on both sides, is the layer between AI and the messy reality of human language. AI generates. AI translates. AI listens. But the data that makes AI work, and the judgment that makes AI outputs trustworthy, still come from people. We organize that part.

The next phase

The narrative that AI will eliminate human language work is wrong. The narrative that AI will leave human language work untouched is also wrong.

What is actually happening is more interesting. Routine translation is being automated. Routine quality checks are being automated. Routine consistency enforcement is being automated. At the same time, the demand for specialized human language work — code-switching speech recording, low-resource language data, cultural register judgment, brand voice alignment, AI output evaluation — is rising sharply.

The companies that thrive in the next phase of localization will be the ones that can operate across both sides of this divide. That means production fluency with AI tools and operational access to the human language workforce that AI still requires.

That is the position we have been building toward.

This is the work behind our AI language data services: speech collection, code-switching audio, transcription validation, and evaluation, carried out by native speakers at production scale. You can also explore how we structure multilingual workflows and the services that support brands operating across languages.