Zhiyin

Exploring the Frontier of Chinese LLM Writing


Zhiyin is an LLM-as-a-judge benchmark for Chinese writing evaluation. This V1 release features 280 test cases spanning 18 diverse writing tasks.

Our method relies on pairwise comparison: a powerful language model (o3) acts as the judge, scoring each model's response relative to a fixed baseline (GPT-4.1), whose response is anchored at a score of 5.
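
To make the protocol concrete, here is a minimal sketch of one judging call in Python. The `call_judge` helper and the prompt wording are placeholders of ours, not the benchmark's actual rubric or API plumbing; only the baseline anchor of 5 and the 0 to 10 scale (defined in the next section) come from this page.

```python
import re

def call_judge(prompt: str) -> str:
    """Placeholder for the judge model call (e.g. o3); swap in your own API client."""
    raise NotImplementedError

def judge_pair(instruction: str, baseline: str, candidate: str) -> int:
    """Score `candidate` against the fixed baseline, which is anchored at 5."""
    prompt = (
        "You are grading a Chinese writing task.\n"
        f"Task: {instruction}\n\n"
        f"Baseline response (fixed score: 5):\n{baseline}\n\n"
        f"Candidate response:\n{candidate}\n\n"
        "Return a single integer from 0 to 10, where >5 means the candidate is "
        "better than the baseline, 5 means on par, and <5 means worse."
    )
    reply = call_judge(prompt)
    match = re.search(r"\b(10|[0-9])\b", reply)
    if match is None:
        raise ValueError(f"could not parse a score from {reply!r}")
    return int(match.group(1))
```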

Scoring System

The judge assigns the model's response an integer score from 0 to 10 (an aggregation sketch follows this list), where:

  • A score > 5 indicates the response is superior to the baseline.
  • A score = 5 indicates the response is on par with the baseline.
  • A score < 5 indicates the response is inferior to the baseline.
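
Given per-case scores on this scale, a natural summary is the mean score plus win/tie/loss counts against the anchored baseline. This aggregation is our assumption for illustration; the repository defines the official reporting.

```python
from statistics import mean

def summarize(scores: list[int]) -> dict:
    """Aggregate per-case judge scores against the baseline anchor of 5."""
    return {
        "mean_score": mean(scores),            # 5.0 means overall parity with GPT-4.1
        "wins":   sum(s > 5 for s in scores),  # candidate beat the baseline
        "ties":   sum(s == 5 for s in scores), # on par with the baseline
        "losses": sum(s < 5 for s in scores),  # baseline was better
    }

# Toy example; a full run over the benchmark would yield 280 scores.
print(summarize([7, 5, 4, 6, 5]))
# {'mean_score': 5.4, 'wins': 2, 'ties': 2, 'losses': 1}
```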

Evaluation Dimensions

To ensure a comprehensive analysis, the final score is informed by a multi-dimensional assessment. The judge evaluates the response across six key criteria (a prompt sketch follows this list):

  1. Comprehension & Relevance: How well the response understands the prompt's intent and stays on topic.
  2. Structure & Coherence: How clear, logical, and well-organized the writing is.
  3. Prose & Style: The quality of the language, grammar, and adherence to the requested tone.
  4. Creativity & Originality: The novelty of the ideas and the uniqueness of the perspective.
  5. Depth & Insight: The level of detail, analysis, and substance provided.
  6. Helpfulness: How effectively the response fulfills the user's overall goal.
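
For illustration, a judge prompt can enumerate these dimensions as an explicit rubric. The wording below is a hypothetical sketch of ours, not the benchmark's actual rubric text; note that the judge still returns one holistic 0 to 10 score informed by these dimensions, not a sum of per-dimension sub-scores.

```python
CRITERIA = [
    "Comprehension & Relevance",
    "Structure & Coherence",
    "Prose & Style",
    "Creativity & Originality",
    "Depth & Insight",
    "Helpfulness",
]

def build_rubric() -> str:
    """Render the six evaluation dimensions as a numbered rubric block
    for embedding in the judge prompt."""
    lines = [f"{i}. {name}" for i, name in enumerate(CRITERIA, start=1)]
    return "Weigh the following dimensions when scoring:\n" + "\n".join(lines)
```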


Citation

If you use these results, please cite our paper:
"Zhiyin: Exploring the Frontier of Chinese LLM Writing, 2025. https://github.com/zake7749/Chinese-Writing-Bench"
