Zhiyin

Exploring the Frontier of Chinese LLM Writing


Zhiyin is an LLM-as-a-judge benchmark for Chinese writing evaluation. This V1 release features 280 test cases spanning 18 diverse writing tasks.

Our method relies on pairwise comparison: a powerful language model (o3) acts as the judge, scoring each model's response relative to a fixed baseline (GPT-4.1), whose response is anchored at a score of 5.
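
To make the protocol concrete, here is a minimal sketch of one judging call in Python. The `call_judge` helper and the prompt wording are placeholders of ours, not the benchmark's actual rubric or API plumbing; only the baseline anchor of 5 and the 0 to 10 scale (defined in the next section) come from this page.

```python
import re

def call_judge(prompt: str) -> str:
    """Placeholder for the judge model call (e.g. o3); swap in your own API client."""
    raise NotImplementedError

def judge_pair(instruction: str, baseline: str, candidate: str) -> int:
    """Score `candidate` against the fixed baseline, which is anchored at 5."""
    prompt = (
        "You are grading a Chinese writing task.\n"
        f"Task: {instruction}\n\n"
        f"Baseline response (fixed score: 5):\n{baseline}\n\n"
        f"Candidate response:\n{candidate}\n\n"
        "Return a single integer from 0 to 10, where >5 means the candidate is "
        "better than the baseline, 5 means on par, and <5 means worse."
    )
    reply = call_judge(prompt)
    match = re.search(r"\b(10|[0-9])\b", reply)
    if match is None:
        raise ValueError(f"could not parse a score from {reply!r}")
    return int(match.group(1))
```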

Scoring System

The judge assigns the model's response an integer score from 0 to 10 (an aggregation sketch follows this list), where:

  • A score > 5 indicates the response is superior to the baseline.
  • A score = 5 indicates the response is on par with the baseline.
  • A score < 5 indicates the response is inferior to the baseline.
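
Given per-case scores on this scale, a natural summary is the mean score plus win/tie/loss counts against the anchored baseline. This aggregation is our assumption for illustration; the repository defines the official reporting.

```python
from statistics import mean

def summarize(scores: list[int]) -> dict:
    """Aggregate per-case judge scores against the baseline anchor of 5."""
    return {
        "mean_score": mean(scores),            # 5.0 means overall parity with GPT-4.1
        "wins":   sum(s > 5 for s in scores),  # candidate beat the baseline
        "ties":   sum(s == 5 for s in scores), # on par with the baseline
        "losses": sum(s < 5 for s in scores),  # baseline was better
    }

# Toy example; a full run over the benchmark would yield 280 scores.
print(summarize([7, 5, 4, 6, 5]))
# {'mean_score': 5.4, 'wins': 2, 'ties': 2, 'losses': 1}
```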

Evaluation Dimensions

To ensure a comprehensive analysis, the final score is informed by a multi-dimensional assessment. The judge evaluates the response across six key criteria (a prompt sketch follows this list):

  1. Comprehension & Relevance: How well the response understands the prompt's intent and stays on topic.
  2. Structure & Coherence: How clear, logical, and well-organized the writing is.
  3. Prose & Style: The quality of the language, grammar, and adherence to the requested tone.
  4. Creativity & Originality: The novelty of the ideas and the uniqueness of the perspective.
  5. Depth & Insight: The level of detail, analysis, and substance provided.
  6. Helpfulness: How effectively the response fulfills the user's overall goal.
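
For illustration, a judge prompt can enumerate these dimensions as an explicit rubric. The wording below is a hypothetical sketch of ours, not the benchmark's actual rubric text; note that the judge still returns one holistic 0 to 10 score informed by these dimensions, not a sum of per-dimension sub-scores.

```python
CRITERIA = [
    "Comprehension & Relevance",
    "Structure & Coherence",
    "Prose & Style",
    "Creativity & Originality",
    "Depth & Insight",
    "Helpfulness",
]

def build_rubric() -> str:
    """Render the six evaluation dimensions as a numbered rubric block
    for embedding in the judge prompt."""
    lines = [f"{i}. {name}" for i, name in enumerate(CRITERIA, start=1)]
    return "Weigh the following dimensions when scoring:\n" + "\n".join(lines)
```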


Citation

If you use these results, please cite our paper:
"Zhiyin: Exploring the Frontier of Chinese LLM Writing, 2025. https://github.com/zake7749/Chinese-Writing-Bench"
