## ⚡ 3-Second Summary

- LLM training data comes from Chinese-speaking regions worldwide, naturally mixing terminology
- Taiwan readers stumble over Mainland Chinese terms like “用户” and “调用”
- `zhtw fix` converts to Taiwan-style terminology with one command
- Supports CLI, Python integration, LangChain, and batch processing
## Your README Just Confused Your Colleague
You used AI to generate a project README.
The content is complete—installation steps, API docs, code examples. You’re satisfied and ready to commit.
Then a Taiwan colleague Slacks you: “Hey, what does ‘用户需要先调用这个接口’ mean? Can you change it to Taiwan terminology?”
You open the file:
```markdown
## 安装说明

用户需要先安装依赖:

npm install

然后调用 API 获取数据...
```
And it’s not just this file. The 400 files you generated with AI all use this mixed terminology.
Fix them manually? You’ll be at it forever.
## Why Do LLMs Output Mixed Terminology?
This isn’t an AI bug—it’s a feature of training data.
> **Key Insight:** LLM Chinese training data comes from everywhere—Mainland China, Taiwan, Hong Kong, Singapore. The model learns “Chinese” but not “which region’s Chinese.”
Even when your prompt says “Please respond in Traditional Chinese,” the model understands character form (Traditional characters), not terminology conventions (Taiwan expressions).
So you get:
- Traditional characters ✅
- But terminology like “用户” “调用” “软件” ❌
| Your prompt | AI understands | You expected |
|---|---|---|
| Use Traditional Chinese | Use Traditional character forms | Use Taiwan terminology |
| Use Taiwan Traditional | Might help, but inconsistent | Use Taiwan terminology |
Different LLMs behave differently. Based on my observations:
| LLM | Mixed terminology level | Notes |
|---|---|---|
| ChatGPT | Medium | Sometimes auto-adjusts |
| Claude | Lower | Still occurs |
| Gemini | Higher | More Mainland terms |
> **Key Insight:** Rather than expecting AI to perfectly output Taiwan terminology, accept that post-processing is a necessary step. Just as code needs linting, AI output needs localization.
## Solution: One Command for Localization
zhtw is a CLI tool I developed specifically for this problem.
```bash
pip install zhtw

# Convert single file
zhtw fix README.md

# Convert entire directory
zhtw fix docs/

# Read from stdin (for piping)
echo "用户需要调用接口" | zhtw fix -
# Output: 使用者需要呼叫介面
```
Conversion examples:
| Original | Converted |
|---|---|
| 用户 | 使用者 |
| 调用 | 呼叫 |
| 软件 | 軟體 |
| 服务器 | 伺服器 |
| 数据库 | 資料庫 |
| 代码 | 程式碼 |
| 视频 | 影片 |
| 信息 | 資訊 |
This isn’t just Simplified-to-Traditional conversion—it’s terminology convention conversion.
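The difference is easy to see in code. The toy sketch below is my own illustration, not zhtw's actual implementation: it applies a handful of term mappings from the table above. A character-form converter alone would turn 用户 into 用戶 and 调用 into 調用—Traditional characters, but still Mainland vocabulary—whereas terminology conversion swaps the words themselves.

```python
# Toy illustration only — zhtw's real dictionary is far larger and
# context-aware; this just demonstrates the concept.
TAIWAN_TERMS = {
    "用户": "使用者",
    "调用": "呼叫",
    "接口": "介面",
    "软件": "軟體",
    "服务器": "伺服器",
}

def naive_term_fix(text: str) -> str:
    """Replace Mainland terms with Taiwan equivalents, longest match first."""
    for src in sorted(TAIWAN_TERMS, key=len, reverse=True):
        text = text.replace(src, TAIWAN_TERMS[src])
    return text

print(naive_term_fix("用户需要调用接口"))  # 使用者需要呼叫介面
```

Matching longer terms first matters in a real dictionary, where a short entry could otherwise clobber part of a longer one.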
## Integrate Into Your Workflow

### Method 1: CLI One-time Processing
The simplest approach, suitable for one-time processing:
```bash
# Process single file
zhtw fix ai-generated-doc.md

# Process entire folder
zhtw fix ./docs/

# Preview changes (don't modify files)
zhtw check ./docs/
```
### Method 2: Python Integration
In your Python program, post-process LLM output:
```python
import subprocess

def localize_to_taiwan(text: str) -> str:
    """Convert text to Taiwan terminology."""
    result = subprocess.run(
        ['zhtw', 'fix', '-'],
        input=text,
        capture_output=True,
        text=True
    )
    return result.stdout if result.returncode == 0 else text
```
#### With OpenAI
```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "解釋什麼是 API Gateway"}
    ]
)

# LLM output → Taiwan localization
content = response.choices[0].message.content
localized = localize_to_taiwan(content)
print(localized)
```
#### With Anthropic
```python
from anthropic import Anthropic

client = Anthropic()
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "解釋什麼是 API Gateway"}
    ]
)

# LLM output → Taiwan localization
content = message.content[0].text
localized = localize_to_taiwan(content)
print(localized)
```
### Method 3: LangChain OutputParser
If you use LangChain, wrap it as an OutputParser:
```python
from langchain.schema import BaseOutputParser
import subprocess


class TaiwanLocalizer(BaseOutputParser):
    """Convert LLM output to Taiwan terminology."""

    def parse(self, text: str) -> str:
        result = subprocess.run(
            ['zhtw', 'fix', '-'],
            input=text,
            capture_output=True,
            text=True
        )
        return result.stdout.strip() if result.returncode == 0 else text

    @property
    def _type(self) -> str:
        return "taiwan_localizer"


# Usage
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

llm = ChatAnthropic(model="claude-sonnet-4-20250514")
localizer = TaiwanLocalizer()

prompt = ChatPromptTemplate.from_template("解釋 {topic}")
chain = prompt | llm | localizer

result = chain.invoke({"topic": "微服務架構"})
print(result)
```
### Method 4: Batch Processing Script
Process large numbers of AI-generated files:
```bash
# Process all Markdown files
find ./generated-docs -name "*.md" -exec zhtw fix {} \;

# Process all Python files (docstrings included)
zhtw fix ./src/**/*.py

# With git, only process changed files
git diff --name-only | grep '\.md$' | xargs zhtw fix
```
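To run this automatically, the batch command can live in a Git pre-commit hook. A minimal sketch of such a hook file, assuming `zhtw` is on your PATH (`zhtw fix` is the same subcommand used throughout this post; the hook itself is my own workflow suggestion, not part of zhtw):

```sh
#!/bin/sh
# .git/hooks/pre-commit — localize staged Markdown before each commit
staged_md=$(git diff --cached --name-only --diff-filter=ACM | grep '\.md$')
[ -z "$staged_md" ] && exit 0

echo "$staged_md" | xargs zhtw fix
echo "$staged_md" | xargs git add  # re-stage the converted files
```

Remember to make the hook executable with `chmod +x .git/hooks/pre-commit`.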
## Real-world Comparison

### Example 1: README.md

Before:

```markdown
## 安装说明

用户需要先安装依赖,然后调用初始化函数。

### 配置数据库

在 `config.js` 中设置数据库连接信息:
```

After:

```markdown
## 安裝說明

使用者需要先安裝依賴,然後呼叫初始化函數。

### 設定資料庫

在 `config.js` 中設定資料庫連線資訊:
```
### Example 2: API Documentation

Before:

```
该接口返回用户信息。调用时需要传递 token 参数。
如果请求失败,服务器会返回错误代码。
```

After:

```
該介面回傳使用者資訊。呼叫時需要傳遞 token 參數。
如果請求失敗,伺服器會回傳錯誤代碼。
```
### Example 3: Code Comments

Before:

```python
def get_user_data(user_id):
    """获取用户数据

    调用此函数前,请确保数据库连接正常。
    """
    pass
```

After:

```python
def get_user_data(user_id):
    """取得使用者資料

    呼叫此函數前,請確保資料庫連線正常。
    """
    pass
```
## Advanced Configuration

### Check Mode (No File Modification)

Want to see what would be converted first?

```bash
zhtw check ./docs/
```
Output shows:
- Which files have terms needing conversion
- Which specific terms
- Doesn’t modify any files
### Conservative Mode

Worried about false conversions? Use conservative mode to convert only high-confidence terms:

```bash
zhtw fix --conservative ./docs/
```
## When Should You Use This Tool?
| Scenario | Recommendation |
|---|---|
| AI-generated README, documentation | ✅ Recommended |
| AI-generated code comments | ✅ Recommended |
| AI conversation content to share with Taiwan readers | ✅ Recommended |
| Translating Mainland tech articles | ✅ Recommended |
| Content originally written by Taiwan authors | ❌ Not needed |
| Target audience is Mainland users | ❌ Not needed |
> **Key Insight:** This isn’t a “Simplified vs Traditional” issue—it’s a localization need: matching content to your target audience’s conventions, just as English content distinguishes US English from UK English.
## Conclusion
LLM training data comes from Chinese-speaking regions worldwide—mixed terminology in output is normal, not a bug.
Rather than manually fixing every time, make localization part of your workflow:
```bash
# Install
pip install zhtw

# Use
zhtw fix your-ai-generated-file.md
```
Three seconds to make your AI output speak Taiwan Chinese.
## Sources

### zhtw Tool

- zhtw GitHub Repository | Archive
  CLI tool that converts Simplified Chinese terminology to Taiwan Traditional Chinese. Supports CLI, Python API, and batch processing.

### LLM Chinese Output Research

- Anthropic Claude Documentation | Archive
  Claude’s multilingual support documentation, including Chinese output characteristics.