CoSER Models

CoSER models are state-of-the-art models for role-playing language agents (RPLAs), built upon LLaMA-3.1 base models (8B and 70B). These models are trained on the CoSER dataset, which contains authentic multi-turn, multi-character dialogues extracted from 771 renowned novels.

CoSER models exhibit excellent role-playing capabilities. They produce highly human-like responses across a wide range of personas, including both established fictional characters and original ones. They excel at capturing nuanced personalities, maintaining consistent character traits, and adapting to diverse role-playing scenarios. Extensive experiments demonstrate that CoSER models achieve state-of-the-art role-playing performance across multiple benchmarks.

Model Variants

  • CoSER-8B: Fine-tuned from LLaMA-3.1-8B
  • CoSER-70B: Fine-tuned from LLaMA-3.1-70B
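Since both variants are fine-tuned from LLaMA-3.1, they can be prompted with a standard chat-style message list. Below is a minimal sketch of assembling such a prompt; the system-prompt wording and helper function are illustrative assumptions, not the official prompt format used during training.

```python
def build_roleplay_messages(character, profile, setting, user_line):
    """Assemble a chat-style prompt for character role-play.

    NOTE: the system-prompt wording here is an illustrative assumption;
    consult the CoSER repository for the exact format the models expect.
    """
    system = (
        f"You are {character}. {profile} "
        f"Current scene: {setting} "
        "Stay in character; you may include inner thoughts and actions."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_line},
    ]

messages = build_roleplay_messages(
    character="Sherlock Holmes",
    profile="A brilliant, observant consulting detective.",
    setting="221B Baker Street, evening.",
    user_line="Holmes, a client is waiting downstairs.",
)
```

Such a message list can then be formatted with the LLaMA-3.1 tokenizer's `apply_chat_template` before generation.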

Training Data

The models are trained on the CoSER dataset, which differs from existing RPLA datasets in two fundamental ways:

  1. It extracts authentic multi-turn, multi-character dialogues from acclaimed literary works, maintaining high source fidelity while exhibiting greater quality and complexity.

  2. It incorporates comprehensive types of data:

    • Character profiles, dialogues, plot summaries, character experiences, and conversation backgrounds.
    • Conversations that capture characters' internal thoughts and physical actions, beyond surface-level speech.
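To illustrate how these data types fit together, here is the rough shape of a single record; the field names below are assumptions for illustration, not the dataset's actual schema.

```python
# Illustrative shape of one CoSER-style record. Field names are
# assumptions for illustration, not the dataset's actual schema.
record = {
    "book": "Pride and Prejudice",
    "plot_summary": "Elizabeth visits Pemberley and unexpectedly meets Darcy.",
    "conversation_background": "The Gardiners tour the estate with Elizabeth.",
    "characters": {
        "Elizabeth Bennet": {
            "profile": "Witty, independent-minded, quick to judge.",
            "experiences": ["Rejected Darcy's first proposal at Hunsford."],
        },
    },
    "dialogue": [
        {
            "speaker": "Elizabeth Bennet",
            # Inner thoughts and physical actions beyond surface speech:
            "thought": "Of all places, why must he appear here?",
            "action": "lowers her eyes, cheeks colouring",
            "speech": "Mr. Darcy! I had understood the family to be away.",
        },
    ],
}
```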

Training Methodology

Our training approach is based on "given-circumstance acting" (GCA):

Given a conversation with messages M, characters C, and setting S, the actor LLM is required to sequentially portray each character c ∈ C to recreate the conversation. During training, for each character c, we optimize the language modeling loss on that character's messages only.

Performance and Evaluation

We evaluate our models via GCA evaluation, a comprehensive approach that combines multi-agent simulation with penalty-based LLM assessment:

  1. We generate conversations via multi-agent simulation, where the actor LLM portrays each character within a given setting, coordinated by a next-actor-prediction model to manage turn-taking.

  2. We assess the generated conversations using penalty-based LLM judges, which are provided with detailed rubrics and the original conversations for reference.
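The two stages above can be sketched as the following loop. Here `predict_next_actor`, `actor_respond`, and the penalty rubric are stubbed placeholders standing in for the corresponding LLM calls, not the paper's actual interfaces.

```python
def predict_next_actor(history, characters):
    # Placeholder for the next-actor-prediction model; here: round-robin.
    return characters[len(history) % len(characters)]

def actor_respond(character, setting, history):
    # Placeholder for the actor LLM portraying `character`.
    return f"{character} responds in setting '{setting}'."

def simulate(characters, setting, max_turns=4):
    """Stage 1: multi-agent simulation with managed turn-taking."""
    history = []
    for _ in range(max_turns):
        actor = predict_next_actor(history, characters)
        history.append((actor, actor_respond(actor, setting, history)))
    return history

def judge(history, penalties):
    """Stage 2: penalty-based assessment. Each flagged rubric item
    deducts points from a full score (scale assumed, for illustration)."""
    return max(0, 100 - sum(penalties))

conv = simulate(["Alice", "Bob"], "a moonlit garden")
score = judge(conv, penalties=[5, 10])  # two flagged rubric violations
```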

Performance on Given-Circumstance Acting

CoSER models outperform existing open-source LLMs on multiple RPLA benchmarks and are comparable to state-of-the-art closed-source models like GPT-4o.

| Model | Storyline Consistency | Anthropomorphism | Character Fidelity | Storyline Quality | Average Score | BLEU | ROUGE-L |
|---|---|---|---|---|---|---|---|
| *Closed-source Models* | | | | | | | |
| Abab7-preview | 56.81 | 44.23 | 43.83 | 74.83 | 54.92 | 4.96 | 11.50 |
| Doubao-pro | 60.95 | 49.72 | 47.02 | 79.28 | 59.24 | 6.38 | 12.95 |
| Step-1-Flash | 57.75 | 48.12 | 44.48 | 75.93 | 56.57 | 5.95 | 12.71 |
| Step-2 | 61.43 | 49.06 | 47.33 | 77.96 | 58.94 | 5.75 | 12.50 |
| GPT-3.5 | 57.22 | 43.30 | 42.29 | 73.91 | 54.18 | 4.58 | 11.80 |
| GPT-4o | **61.59** | 48.93 | **48.95** | **80.33** | **59.95** | 5.90 | 12.11 |
| GPT-4o Mini | 60.09 | 48.21 | 44.88 | 78.55 | 57.93 | 3.90 | 10.81 |
| Gemini Pro | 59.11 | 52.41 | 47.83 | 77.59 | 59.24 | 5.39 | 11.65 |
| Claude-3-Haiku | 58.18 | 44.66 | 41.88 | 74.14 | 54.71 | 4.80 | 12.02 |
| Claude-3.5-Sonnet | 57.45 | 48.50 | 45.69 | 77.23 | 57.22 | 5.17 | 11.45 |
| *Open-source Models* | | | | | | | |
| Mistral-7B | 59.90 | 40.00 | 44.75 | 61.93 | 51.64 | 2.71 | 9.28 |
| Qwen-2-7B | 51.96 | 35.48 | 31.51 | 63.18 | 45.53 | 4.21 | 10.71 |
| LLaMA-3.1-8B | 54.10 | 45.36 | 40.22 | 72.29 | 52.99 | 4.59 | 10.18 |
| CoSER-8B | 58.61 | 47.23 | 46.90 | 73.04 | 56.45 | 9.40 | 14.21 |
| Vicuna-13B-1.5 | 52.75 | 39.12 | 38.04 | 60.43 | 47.58 | 1.67 | 5.59 |
| Mixtral-8x7B | 51.25 | 38.44 | 36.92 | 67.69 | 48.58 | 5.28 | 11.66 |
| Qwen-2-72B | 57.75 | 47.28 | 46.62 | 76.60 | 57.06 | 5.38 | 11.85 |
| LLaMA-3.1-70B | 57.46 | 45.95 | 43.72 | 74.84 | 55.49 | 4.82 | 10.98 |
| Higgs-Llama-3-70B | 57.10 | 43.82 | 42.41 | 75.62 | 54.74 | 3.99 | 10.92 |
| CoSER-70B | 58.66 | **53.33** | 48.75 | 75.49 | 59.06 | **10.10** | **14.78** |
| DeepSeek-V3 | 56.40 | 47.87 | 44.02 | 76.66 | 56.24 | 4.54 | 11.02 |

Note: Bold values indicate best performance across all models.

Performance on Existing RPLA Benchmarks

| Model | InCharacter Dim | InCharacter Full | Life Choice | CroSS MR |
|---|---|---|---|---|
| LLaMA-3.1-8B | 64.97 | 15.62 | 61.10 | 30.15 |
| CoSER-8B | 75.80 | 21.88 | 69.54 | 44.94 |
| CoSER-8B trained w/o I.T. | 70.70 | 15.62 | 59.92 | 43.14 |
| LLaMA-3.1-70B | 72.16 | 31.25 | 86.48 | 61.30 |
| Higgs-Llama-3-70B | 74.52 | 28.12 | 74.03 | 60.12 |
| CoSER-70B | 75.80 | **34.38** | **93.47** | **64.49** |
| CoSER-70B trained w/o I.T. | 73.12 | 32.14 | 93.18 | 63.14 |
| Qwen-2-72B | 74.52 | 31.25 | 81.14 | 62.57 |
| GPT-3.5 | 71.20 | 21.88 | 78.07 | 30.09 |
| GPT-4o | **76.54** | 32.62 | 75.96 | **64.49** |
| Claude-3.5-Sonnet | 72.61 | 21.88 | 86.07 | 30.59 |

Note: Bold values indicate best performance. I.T. denotes inner thoughts. For InCharacter, we report accuracy for individual (Dim) and full (Full) dimensions on BFI.

Ethical Considerations

We have conducted safety checks on the training dataset and removed potentially problematic content. However, users should be aware that:

  • The models may still generate content that reflects biases present in the literary works they were trained on.
  • Role-playing as certain characters might involve generating content that includes negative traits or behaviors.
  • Users should implement appropriate safeguards when deploying these models in applications.

Citation

If you use CoSER models in your research, please cite our paper:

@misc{wang2025cosercoordinatingllmbasedpersona,
      title={CoSER: Coordinating LLM-Based Persona Simulation of Established Roles}, 
      author={Xintao Wang and Heng Wang and Yifei Zhang and Xinfeng Yuan and Rui Xu and Jen-tse Huang and Siyu Yuan and Haoran Guo and Jiangjie Chen and Wei Wang and Yanghua Xiao and Shuchang Zhou},
      year={2025},
      eprint={2502.09082},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.09082}, 
}