When AI Translates Chinese Prompts into Korean: The Surprising Role of Code in Language Models

From Moocchen, the free encyclopedia of technology

Introduction: An Unexpected Language Switch

Imagine typing a query in Chinese to your AI coding assistant, only to receive a response in Korean. This isn't a glitch — it's a fascinating glimpse into how modern language models interpret and organize linguistic information. The phenomenon arises from the way these models embed words and phrases from different languages into a shared semantic space, where code vocabulary can unexpectedly reshape the linguistic landscape. In this article, we’ll dive deep into the mechanics of embedding spaces, explore why code snippets can nudge a model toward a different language, and discuss what this tells us about the future of multilingual AI.

Source: towardsdatascience.com

How Language Models “Understand” Words

From Tokens to Vectors

At their core, large language models (LLMs) like GPT-4 or Claude don’t process words directly. Instead, they break text into tokens (subword units) and convert each token into a high-dimensional vector — a numeric representation that captures meaning, syntax, and context. These vectors live in what’s called an embedding space, where similar concepts cluster together. For example, the vectors for “dog” and “cat” will be close to each other, while “computer” and “keyboard” will form another cluster.
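The clustering idea can be sketched with a toy embedding table and cosine similarity. The vectors below are hand-picked illustrative values, not taken from any real model:

```python
import math

# Toy 4-dimensional embedding table. The vectors are illustrative values
# chosen by hand, not taken from any real model.
EMBEDDINGS = {
    "dog":      [0.9, 0.8, 0.1, 0.0],
    "cat":      [0.8, 0.9, 0.2, 0.1],
    "computer": [0.1, 0.0, 0.9, 0.8],
    "keyboard": [0.0, 0.1, 0.8, 0.9],
}

def cosine(u, v):
    """Cosine similarity: near 1.0 for similar directions, near 0.0 otherwise."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Related concepts cluster: dog/cat are far more similar than dog/computer.
print(cosine(EMBEDDINGS["dog"], EMBEDDINGS["cat"]))       # high
print(cosine(EMBEDDINGS["dog"], EMBEDDINGS["computer"]))  # low
```

Real models use thousands of dimensions and learn the values from data, but the geometry — nearby vectors mean related concepts — is the same.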

Multilingual Embedding Spaces

Modern LLMs are multilingual: they’re trained on text in dozens of languages, from Chinese and English to Korean and Arabic. Instead of keeping each language isolated, the model learns to map similar meanings across languages into nearby regions of the same embedding space. The English word “apple” and its Chinese counterpart “苹果” should have vectors that sit close together. This cross-lingual alignment is what enables the model to translate and transfer knowledge between languages.
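In a well-aligned space, translation amounts to a nearest-neighbor lookup in the target language. A minimal sketch, with made-up vectors for two English/Chinese word pairs:

```python
import math

# Toy shared multilingual embedding space (made-up vectors): translation
# pairs are mapped to nearby points, as cross-lingual alignment predicts.
SHARED = {
    ("en", "apple"): [0.90, 0.10, 0.05],
    ("zh", "苹果"):  [0.85, 0.15, 0.05],
    ("en", "river"): [0.05, 0.90, 0.10],
    ("zh", "河流"):  [0.10, 0.85, 0.15],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_translation(key, target_lang):
    """Return the target-language word whose vector is closest to key's."""
    vec = SHARED[key]
    candidates = [(k, v) for k, v in SHARED.items() if k[0] == target_lang]
    best = max(candidates, key=lambda kv: cosine(vec, kv[1]))
    return best[0][1]

print(nearest_translation(("en", "apple"), "zh"))  # 苹果
```

Actual multilingual models never run an explicit lookup like this; the alignment emerges implicitly during training, but the geometric intuition carries over.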

The Code Catalyst: How Programming Languages Disrupt Natural Language

Code as a Universal Pidgin

Here’s where things get interesting. Many AI coding assistants are fine-tuned on massive repositories of source code, often written in English-like syntax (Python, JavaScript, etc.). Code contains a mixture of English keywords (if, for, class), variable names in various languages, and comments that can be in any natural language. When a user types a Chinese prompt that includes code or code‑like tokens, the model’s embedding space can be pulled toward the region where code-related vectors reside — and that region may be heavily influenced by Korean.
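This "pull" can be pictured by treating a prompt's overall embedding as the mean of its token vectors. In the toy 2-D sketch below (all coordinates invented for illustration), adding a single code-like token shifts that mean toward the code region:

```python
import math

# Toy 2-D sketch: the x-axis loosely stands for "code-likeness", the
# y-axis for "Chinese natural language". All values are made up.
TOKENS = {
    "请":     [0.05, 0.95],
    "写":     [0.10, 0.90],
    "函数":   [0.20, 0.80],
    "Python": [0.95, 0.10],  # code-like token, far from the Chinese region
}
CODE_REGION = [1.0, 0.0]

def mean_vector(tokens):
    """Average the token vectors dimension by dimension."""
    vecs = [TOKENS[t] for t in tokens]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

plain = mean_vector(["请", "写", "函数"])            # pure Chinese prompt
mixed = mean_vector(["请", "写", "函数", "Python"])  # same prompt plus a code token

# Adding one code token pulls the prompt's mean embedding toward the code region.
print(dist(mixed, CODE_REGION) < dist(plain, CODE_REGION))  # True
```

Transformers aggregate context far more elaborately than a plain mean, but the direction of the effect — code tokens dragging the prompt representation toward code-associated regions — is the point being illustrated.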

Why Korean, Specifically?

The answer lies in the training data. Korean‑language code repositories (e.g., from Korean developers on GitHub) often contain a high proportion of Korean identifiers and comments alongside code syntax. Because code vocabulary (like class, def, return) is shared across many languages, the model learns to associate that code region with Korean linguistic patterns. When the Chinese prompt contains similar code tokens, the model “thinks” it’s in that bilingual coding context and prefers Korean for the response. It’s a kind of semantic magnet — the embedded code vocabulary skews the language preference.

A Closer Look at the Embedding-Space Investigation

Case Study: Chinese Prompt with Mixed Tokens

Consider a user who types: “请写一个Python函数来计算斐波那契数列” (Please write a Python function to compute the Fibonacci sequence). This prompt mixes Chinese natural language with the code token “Python”. The model’s encoder projects “Python” into the English/code region. In that region, the Korean embedding cluster is closer than the Chinese cluster due to the training-data imbalance. As a result, the model generates a Korean response: “피보나치 수열을 계산하는 파이썬 함수를 작성하겠습니다.” (I will write a Python function that computes the Fibonacci sequence.)
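The article's claimed geometry can be mocked up as a nearest-centroid language choice. In this toy 2-D layout (all coordinates invented), the code region sits closer to the Korean cluster than to the Chinese one, so a code-shifted prompt flips the decision:

```python
import math

# Toy 2-D sketch of the article's claim: the code region lies nearer the
# Korean cluster than the Chinese cluster. All coordinates are invented.
CENTROIDS = {"zh": (0.0, 1.0), "ko": (0.7, 0.7), "code": (1.0, 0.0)}

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pick_output_language(prompt_vec):
    """Choose the nearest *natural-language* centroid (zh or ko)."""
    return min(("zh", "ko"), key=lambda lang: dist(prompt_vec, CENTROIDS[lang]))

plain_prompt = (0.1, 0.95)  # pure Chinese text, near the zh cluster
mixed_prompt = (0.6, 0.5)   # Chinese text pulled toward the code region

print(pick_output_language(plain_prompt))  # zh
print(pick_output_language(mixed_prompt))  # ko
```

No production model literally runs a centroid classifier; the output language falls out of next-token probabilities. The sketch only shows how proximity in the embedding space can tip that choice.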


Visualizing the Shift

Researchers have used techniques like PCA (Principal Component Analysis) to reduce embedding vectors to 2D and plot them. In such visualizations, you can see Chinese and Korean clusters that are normally separated. But when a code token like “Python” appears, the Chinese prompt’s vector moves toward the English/code region, which happens to be adjacent to the Korean cluster. The model then selects Korean as the output language, even though the user never wrote in Korean.
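PCA-to-2D itself is a few lines of linear algebra. The sketch below runs it (via SVD) on synthetic stand-in "embeddings" — two randomly generated clusters in 8 dimensions, not real model activations — and checks that the clusters remain separated after projection:

```python
import numpy as np

# Synthetic stand-ins for embeddings: two language clusters in 8-D,
# generated at random purely for illustration.
rng = np.random.default_rng(0)
dim = 8
zh_center = np.zeros(dim); zh_center[0] = 1.0
ko_center = np.zeros(dim); ko_center[1] = 1.0
zh = zh_center + rng.normal(scale=0.05, size=(30, dim))
ko = ko_center + rng.normal(scale=0.05, size=(30, dim))

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                            # center the data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)  # rows of vt = principal axes
    return Xc @ vt[:2].T

proj = pca_2d(np.vstack([zh, ko]))
zh2d, ko2d = proj[:30], proj[30:]

# The two clusters stay well separated in the 2-D plot.
gap = np.linalg.norm(zh2d.mean(axis=0) - ko2d.mean(axis=0))
print(gap > 1.0)  # True: centroids are ~sqrt(2) apart, noise is ~0.05
```

Plotting `zh2d` and `ko2d` as scatter points reproduces the kind of cluster map described above; with real embeddings, one would additionally plot the same prompt with and without the code token to see the shift.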

Implications for Multilingual AI Development

Language Detection and Control

This behavior highlights a weakness in current LLMs: they don’t have a robust internal “language switch.” Instead, they infer the output language from the embedding context. Developers are now designing language detection heads or explicit language tokens to give users more control. For example, prepending “Korean:” or “Chinese:” to a prompt can help nudge the model back on track.
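The "explicit language token" mitigation can be as simple as a prompt wrapper. The helper and tag strings below are hypothetical, made up for this sketch, and not part of any assistant's API:

```python
# Hypothetical helper illustrating the "explicit language token" idea:
# prepend an unambiguous instruction in the desired output language,
# before any code tokens appear in the prompt.
LANGUAGE_TAGS = {
    "zh": "请用中文回答。",           # "Please reply in Chinese."
    "ko": "한국어로 답변해 주세요.",  # "Please reply in Korean."
    "en": "Please reply in English.",
}

def with_language_tag(prompt: str, lang: str) -> str:
    """Anchor the model's output language at the start of the prompt."""
    return f"{LANGUAGE_TAGS[lang]}\n{prompt}"

print(with_language_tag("请写一个Python函数来计算斐波那契数列", "zh"))
```

Because models weight early instructions heavily when inferring context, front-loading the language request counteracts the pull of code tokens that appear later in the prompt.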

Data Imbalance in Code Repositories

The Korean bias in code‑adjacent regions stems from skew in the open‑source training data: Korean developers contribute a large volume of code with Korean comments and identifiers, while comparable Chinese‑language code is relatively underrepresented. To fix this, training datasets need better representation of all languages at every level — not just natural language but also code comments and variable names in Chinese, Japanese, and beyond.

Practical Tips for Users

  • Explicitly state your desired output language at the beginning of the prompt (e.g., “Please reply in Chinese.”).
  • Avoid mixing natural language with code tokens unless necessary. Instead, describe the programming task in plain Chinese.
  • Use language‑specific prompts when working with non‑English codebases: include a short phrase in your target language to anchor the model.

Conclusion: A Window into the Model’s Mind

The case of a Chinese prompt yielding a Korean response is not a bug — it’s a feature of how contextual embeddings work. Code vocabulary acts as a bridge between language clusters, and the model’s language output is a probabilistic choice based on the nearest embedding neighborhood. As AI assistants become more multilingual, understanding these embedding dynamics will be crucial for designing systems that understand, and respect, a user’s language choice. Next time your coding assistant speaks a different tongue, you’ll know it’s just following the vectors.