AI Outputs ‘用户’ ‘调用’? One Command to Convert to Taiwan Chinese



⚡ 3-Second Summary

  • LLM training data comes from Chinese-speaking regions worldwide, naturally mixing terminology
  • Taiwan readers stumble over Mainland Chinese terms like “用户” and “调用”
  • zhtw fix converts to Taiwan-style terminology with one command
  • Supports CLI, Python integration, LangChain, and batch processing

Your README Just Confused Your Colleague

You used AI to generate a project README.

The content is complete—installation steps, API docs, code examples. You’re satisfied and ready to commit.

Then a Taiwan colleague Slacks you: “Hey, what does ‘用户需要先调用这个接口’ mean? Can you change it to Taiwan terminology?”

You open the file:

## 安装说明

用户需要先安装依赖:

npm install

然后调用 API 获取数据...

And it’s not just this file. The 400 files you generated with AI all use this mixed terminology.

Fix them manually? You’ll be at it forever.


Why Do LLMs Output Mixed Terminology?

This isn’t an AI bug—it’s a feature of training data.

Key Insight: LLM Chinese training data comes from everywhere—Mainland China, Taiwan, Hong Kong, Singapore. The model learns “Chinese” but not “which region’s Chinese.”

Even when your prompt says “Please respond in Traditional Chinese,” the model understands character form (Traditional characters), not terminology conventions (Taiwan expressions).

So you get:

  • Traditional characters ✅
  • But terminology like “用户” “调用” “软件” ❌

| Your prompt | AI understands | You expected |
| --- | --- | --- |
| Use Traditional Chinese | Use Traditional character forms | Use Taiwan terminology |
| Use Taiwan Traditional | Might help, but inconsistent | Use Taiwan terminology |

Different LLMs behave differently. Based on my observations:

| LLM | Mixed terminology level | Notes |
| --- | --- | --- |
| ChatGPT | Medium | Sometimes auto-adjusts |
| Claude | Lower | Still occurs |
| Gemini | Higher | More Mainland terms |

Key Insight: Rather than expecting AI to perfectly output Taiwan terminology, accept that “post-processing” is a necessary step. Just like code needs linting, AI output needs localization.


Solution: One Command for Localization

zhtw is a CLI tool I developed specifically for this problem.

pip install zhtw

# Convert single file
zhtw fix README.md

# Convert entire directory
zhtw fix docs/

# Read from stdin (for piping)
echo "用户需要调用接口" | zhtw fix -
# Output: 使用者需要呼叫介面

Conversion examples:

| Original | Converted |
| --- | --- |
| 用户 | 使用者 |
| 调用 | 呼叫 |
| 软件 | 軟體 |
| 服务器 | 伺服器 |
| 数据库 | 資料庫 |
| 代码 | 程式碼 |
| 视频 | 影片 |
| 信息 | 資訊 |

This isn’t just Simplified-to-Traditional conversion—it’s terminology convention conversion.
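To make the distinction concrete, here is a toy sketch of my own (the two mappings below are illustrative stand-ins, not zhtw's actual dictionary or algorithm): character-level conversion maps 用户 to its traditional glyph form 用戶, while terminology conversion maps it to the word Taiwan readers actually use, 使用者.

```python
# Character-level conversion: same word, traditional glyphs.
CHAR_MAP = {"户": "戶", "调": "調", "软": "軟"}

# Terminology conversion: the word itself changes.
TERM_MAP = {"用户": "使用者", "调用": "呼叫", "软件": "軟體"}

def to_traditional_chars(text: str) -> str:
    """Swap each character for its traditional form, keeping the wording."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)

def to_taiwan_terms(text: str) -> str:
    """Replace whole Mainland terms with their Taiwan equivalents."""
    for cn, tw in TERM_MAP.items():
        text = text.replace(cn, tw)
    return text

print(to_traditional_chars("用户调用软件"))  # 用戶調用軟件 — still Mainland wording
print(to_taiwan_terms("用户调用软件"))       # 使用者呼叫軟體
```

The first output is valid Traditional Chinese yet still reads like Mainland documentation; only the second matches Taiwan conventions, which is the gap zhtw closes.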


Integrate Into Your Workflow

Method 1: CLI One-time Processing

The simplest approach, suitable for one-time processing:

# Process single file
zhtw fix ai-generated-doc.md

# Process entire folder
zhtw fix ./docs/

# Preview changes (don't modify files)
zhtw check ./docs/

Method 2: Python Integration

In your Python program, post-process LLM output:

import subprocess

def localize_to_taiwan(text: str) -> str:
    """Convert text to Taiwan terminology"""
    result = subprocess.run(
        ['zhtw', 'fix', '-'],
        input=text,
        capture_output=True,
        text=True
    )
    return result.stdout if result.returncode == 0 else text
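Because this shells out, it is worth guarding against zhtw being absent from PATH or hanging on very large input. A slightly more defensive variant (the timeout value and the fail-open behavior are my own choices, not part of zhtw):

```python
import shutil
import subprocess

def localize_to_taiwan_safe(text: str, timeout: float = 10.0) -> str:
    """Convert via zhtw if available; otherwise return the text unchanged."""
    if shutil.which("zhtw") is None:
        # zhtw not installed: fail open rather than crash the pipeline
        return text
    try:
        result = subprocess.run(
            ["zhtw", "fix", "-"],
            input=text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return text
    return result.stdout if result.returncode == 0 else text
```

Failing open means an AI pipeline degrades to unconverted text instead of erroring out when the tool is missing.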

With OpenAI

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "解釋什麼是 API Gateway"}
    ]
)

# LLM output → Taiwan localization
content = response.choices[0].message.content
localized = localize_to_taiwan(content)
print(localized)

With Anthropic

from anthropic import Anthropic

client = Anthropic()
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "解釋什麼是 API Gateway"}
    ]
)

# LLM output → Taiwan localization
content = message.content[0].text
localized = localize_to_taiwan(content)
print(localized)

Method 3: LangChain OutputParser

If you use LangChain, wrap it as an OutputParser:

from langchain.schema import BaseOutputParser
import subprocess

class TaiwanLocalizer(BaseOutputParser):
    """Convert LLM output to Taiwan terminology"""

    def parse(self, text: str) -> str:
        result = subprocess.run(
            ['zhtw', 'fix', '-'],
            input=text,
            capture_output=True,
            text=True
        )
        return result.stdout.strip() if result.returncode == 0 else text

    @property
    def _type(self) -> str:
        return "taiwan_localizer"

# Usage
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

llm = ChatAnthropic(model="claude-sonnet-4-20250514")
localizer = TaiwanLocalizer()

prompt = ChatPromptTemplate.from_template("解釋 {topic}")
chain = prompt | llm | localizer

result = chain.invoke({"topic": "微服務架構"})
print(result)

Method 4: Batch Processing Script

Process large numbers of AI-generated files:

# Process all Markdown files
find ./generated-docs -name "*.md" -exec zhtw fix {} \;

# Process all Python source files, docstrings included
# (the ** glob requires bash with globstar enabled)
zhtw fix ./src/**/*.py

# With git, only process changed files
git diff --name-only | grep '\.md$' | xargs zhtw fix
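If you want to convert only docstrings while leaving identifiers and code-level string literals untouched, Python's ast module can do the extraction. The sketch below uses a stand-in term map rather than zhtw itself, and note that ast.unparse regenerates the source, discarding comments and original formatting:

```python
import ast

# Stand-in mapping for illustration; a real pipeline would pipe through zhtw.
TERMS = {"用户": "使用者", "调用": "呼叫", "数据": "資料"}

def localize(text: str) -> str:
    for cn, tw in TERMS.items():
        text = text.replace(cn, tw)
    return text

def localize_docstrings(source: str) -> str:
    """Return source with only docstrings converted."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.FunctionDef,
                             ast.AsyncFunctionDef, ast.ClassDef)):
            body = node.body
            # A docstring is a bare string constant as the first statement.
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                body[0].value.value = localize(body[0].value.value)
    return ast.unparse(tree)

src = 'def f():\n    """调用前请检查用户权限"""\n    pass\n'
print(localize_docstrings(src))
```

For production files where formatting matters, running `zhtw fix` on the whole file is the simpler option; this sketch is for cases where only documentation strings should change.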

Real-world Comparison

Example 1: README.md

Before:

## 安装说明

用户需要先安装依赖,然后调用初始化函数。

### 配置数据库

在 `config.js` 中设置数据库连接信息:

After:

## 安裝說明

使用者需要先安裝依賴,然後呼叫初始化函數。

### 設定資料庫

在 `config.js` 中設定資料庫連線資訊:

Example 2: API Documentation

Before:

该接口返回用户信息。调用时需要传递 token 参数。
如果请求失败,服务器会返回错误代码。

After:

該介面回傳使用者資訊。呼叫時需要傳遞 token 參數。
如果請求失敗,伺服器會回傳錯誤代碼。

Example 3: Code Comments

Before:

def get_user_data(user_id):
    """获取用户数据

    调用此函数前,请确保数据库连接正常。
    """
    pass

After:

def get_user_data(user_id):
    """取得使用者資料

    呼叫此函數前,請確保資料庫連線正常。
    """
    pass

Advanced Configuration

Check Mode (No File Modification)

Want to see what would be converted first?

zhtw check ./docs/

Output shows:

  • Which files have terms needing conversion
  • Which specific terms
  • Doesn’t modify any files

Conservative Mode

Worried about false conversions? Use conservative mode for high-confidence terms only:

zhtw fix --conservative ./docs/

When Should You Use This Tool?

| Scenario | Recommendation |
| --- | --- |
| AI-generated README, documentation | ✅ Recommended |
| AI-generated code comments | ✅ Recommended |
| AI conversation content to share with Taiwan readers | ✅ Recommended |
| Translating Mainland tech articles | ✅ Recommended |
| Content originally written by Taiwan authors | ❌ Not needed |
| Target audience is Mainland users | ❌ Not needed |

Key Insight: This isn’t a “Simplified vs Traditional” issue—it’s a localization need to “match content to target audience conventions.” Just like English content considers US English vs UK English.


Conclusion

LLM training data comes from Chinese-speaking regions worldwide—mixed terminology in output is normal, not a bug.

Rather than manually fixing every time, make localization part of your workflow:

# Install
pip install zhtw

# Use
zhtw fix your-ai-generated-file.md

Three seconds to make your AI output speak Taiwan Chinese.


Sources

zhtw Tool

  • zhtw GitHub Repository | Archive

    CLI tool that converts Simplified Chinese terminology to Taiwan Traditional Chinese. Supports CLI, Python API, and batch processing.
