Exploring the Frontier of Chinese LLM Writing
Zhiyin is an LLM-as-a-judge benchmark for evaluating Chinese writing. This V1 release features 280 test cases spanning 18 diverse writing tasks.
Our method relies on pairwise comparison. A powerful language model (O3) acts as the judge, scoring a model's response relative to a fixed baseline (GPT-4.1), which is anchored at a score of 5.
The judge assigns the model's response an integer score from 0 to 10: a score above 5 means the response is preferred over the baseline, 5 means parity with the baseline, and a score below 5 means the baseline is preferred.
To ensure a comprehensive analysis, the judge evaluates the response across six key criteria, and this multi-dimensional assessment informs the final score.
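For illustration, here is a minimal sketch of the pairwise protocol described above. This is not the benchmark's actual implementation: the `judge_score` helper, the `cases` dictionary keys, and the mean-score aggregation are all assumptions made for the example.

```python
from statistics import mean

BASELINE_SCORE = 5  # the fixed GPT-4.1 baseline is anchored at 5


def judge_score(task_prompt: str, candidate: str, baseline: str) -> int:
    """Hypothetical judge call returning an integer in [0, 10].

    In a real pipeline this would prompt the judge model (O3) with the
    task, the candidate response, and the baseline response, then parse
    an integer score from its output. Here it just returns parity.
    """
    return BASELINE_SCORE  # placeholder for an actual judge-model call


def evaluate(cases: list[dict]) -> float:
    """Average the per-case judge scores for one candidate model."""
    scores = []
    for case in cases:
        s = judge_score(case["prompt"], case["candidate"], case["baseline"])
        scores.append(max(0, min(10, s)))  # clamp to the valid 0-10 range
    return mean(scores)
```

Under these assumptions, an average above 5 across the 280 test cases would indicate that the candidate model is preferred over the GPT-4.1 baseline overall.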
If you use these results, please cite our paper:
"Zhiyin: Exploring the Frontier of Chinese LLM Writing, 2025. https://github.com/zake7749/Chinese-Writing-Bench"