# CoSER Models
CoSER models are state-of-the-art models for role-playing language agents (RPLAs), built upon LLaMA-3.1 base models (8B and 70B). These models are trained on the CoSER dataset, which contains authentic multi-turn, multi-character dialogues extracted from 771 renowned novels.
CoSER models exhibit excellent role-playing capabilities. They produce highly human-like responses across a wide range of personas, including both established fictional characters and original characters. They excel at capturing nuanced personalities, maintaining consistent character traits, and adapting to diverse role-playing scenarios. Extensive experiments demonstrate that CoSER models achieve state-of-the-art role-playing performance across multiple benchmarks.
## Model Variants
- CoSER-8B: Fine-tuned from LLaMA-3.1-8B
- CoSER-70B: Fine-tuned from LLaMA-3.1-70B
## Training Data
The models are trained on the CoSER dataset, which differs from existing RPLA datasets in two fundamental ways:
1. It extracts authentic multi-turn, multi-character dialogues from acclaimed literary works, maintaining high source fidelity while exhibiting greater quality and complexity.
2. It incorporates comprehensive data types:
- Character profiles, dialogues, plot summaries, character experiences, and conversation backgrounds.
- Conversations that capture characters' internal thoughts and physical actions beyond surface-level speech
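To make the second point concrete, a single message in such a dataset might interleave speech with inner thoughts and physical actions. The snippet below is a hypothetical illustration, assuming thoughts are bracketed and actions parenthesized within the message text (field names and markers are our assumptions, not the dataset's exact schema):

```python
import re

# Hypothetical sketch of a conversation message that records inner
# thoughts [in brackets] and physical actions (in parentheses)
# alongside surface-level speech. Field names are assumptions.
message = {
    "character": "Character A",
    "content": (
        "[I must not let him see my surprise.] "
        "(folds the letter and sets it aside) "
        "I was not expecting you so soon."
    ),
}

def extract_thoughts(content: str) -> list[str]:
    """Pull bracketed inner thoughts out of a message's content."""
    return re.findall(r"\[(.*?)\]", content)
```

Separating thoughts and actions from speech like this lets training and evaluation treat them as distinct signals rather than flattening everything into dialogue text.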
## Training Methodology
Our training approach is based on "given-circumstance acting" (GCA):
Given a conversation with messages M, characters C, and setting S, the actor LLM is required to sequentially portray each character c ∈ C to recreate the conversation. During training, for each character c, we optimize the language-modeling loss only on that character's messages.
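The per-character loss masking can be sketched as follows. `build_loss_mask` is a hypothetical helper, not the authors' code; a real implementation would expand the mask to token positions rather than whole messages:

```python
def build_loss_mask(messages, target_character):
    """Given-circumstance acting (GCA) training sketch: when the
    actor LLM portrays character c, only c's messages contribute to
    the language-modeling loss. `messages` is a list of
    (speaker, text) pairs; returns a per-message 0/1 mask."""
    return [1 if speaker == target_character else 0
            for speaker, _ in messages]
```

Training then repeats this for every character in the conversation, so each message in the dialogue is eventually used as a target exactly once, from its own speaker's perspective.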
## Performance and Evaluation
We evaluate our models via GCA evaluation, a comprehensive approach comprising multi-agent simulation and penalty-based LLM assessment:
1. We generate conversations via multi-agent simulation, where the actor LLM portrays each character within a given setting, coordinated by a next-actor-prediction model that manages turn-taking.
2. We assess the generated conversations using penalty-based LLM judges, which are provided with detailed rubrics and the original conversations for reference.
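The simulation loop can be sketched as below. `next_actor` and `respond_as` are placeholders for the next-actor-prediction model and the actor LLM (names and signatures are our assumptions, not the authors' API):

```python
def simulate_conversation(characters, setting, next_actor, respond_as,
                          max_turns=20):
    """Sketch of GCA evaluation's multi-agent simulation:
    `next_actor` predicts who speaks next (or None to end the
    scene); `respond_as` generates that character's message in
    character. Returns the simulated (speaker, message) history."""
    history = []
    for _ in range(max_turns):
        speaker = next_actor(history, characters, setting)
        if speaker is None:  # predictor signals the scene has ended
            break
        history.append((speaker, respond_as(speaker, history, setting)))
    return history
```

Delegating turn-taking to a separate predictor lets the simulation handle multi-character scenes where speaking order is not a simple round-robin.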
### Performance on Given-Circumstance Acting
CoSER models outperform existing open-source LLMs on multiple RPLA benchmarks and are comparable to state-of-the-art closed-source models like GPT-4o.
| Model | Storyline Consistency | Anthropomorphism | Character Fidelity | Storyline Quality | Average Score | BLEU | ROUGE-L |
|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | |
| Abab7-preview | 56.81 | 44.23 | 43.83 | 74.83 | 54.92 | 4.96 | 11.50 |
| Doubao-pro | 60.95 | 49.72 | 47.02 | 79.28 | 59.24 | 6.38 | 12.95 |
| Step-1-Flash | 57.75 | 48.12 | 44.48 | 75.93 | 56.57 | 5.95 | 12.71 |
| Step-2 | 61.43 | 49.06 | 47.33 | 77.96 | 58.94 | 5.75 | 12.50 |
| GPT-3.5 | 57.22 | 43.30 | 42.29 | 73.91 | 54.18 | 4.58 | 11.80 |
| GPT-4o | **61.59** | 48.93 | **48.95** | **80.33** | **59.95** | 5.90 | 12.11 |
| GPT-4o Mini | 60.09 | 48.21 | 44.88 | 78.55 | 57.93 | 3.90 | 10.81 |
| Gemini Pro | 59.11 | 52.41 | 47.83 | 77.59 | 59.24 | 5.39 | 11.65 |
| Claude-3-Haiku | 58.18 | 44.66 | 41.88 | 74.14 | 54.71 | 4.80 | 12.02 |
| Claude-3.5-Sonnet | 57.45 | 48.50 | 45.69 | 77.23 | 57.22 | 5.17 | 11.45 |
| **Open-source Models** | | | | | | | |
| Mistral-7B | 59.90 | 40.00 | 44.75 | 61.93 | 51.64 | 2.71 | 9.28 |
| Qwen-2-7B | 51.96 | 35.48 | 31.51 | 63.18 | 45.53 | 4.21 | 10.71 |
| LLaMA-3.1-8B | 54.10 | 45.36 | 40.22 | 72.29 | 52.99 | 4.59 | 10.18 |
| CoSER-8B | 58.61 | 47.23 | 46.90 | 73.04 | 56.45 | 9.40 | 14.21 |
| Vicuna-13B-1.5 | 52.75 | 39.12 | 38.04 | 60.43 | 47.58 | 1.67 | 5.59 |
| Mixtral-8x7B | 51.25 | 38.44 | 36.92 | 67.69 | 48.58 | 5.28 | 11.66 |
| Qwen-2-72B | 57.75 | 47.28 | 46.62 | 76.60 | 57.06 | 5.38 | 11.85 |
| LLaMA-3.1-70B | 57.46 | 45.95 | 43.72 | 74.84 | 55.49 | 4.82 | 10.98 |
| Higgs-Llama-3-70B | 57.10 | 43.82 | 42.41 | 75.62 | 54.74 | 3.99 | 10.92 |
| CoSER-70B | 58.66 | **53.33** | 48.75 | 75.49 | 59.06 | **10.10** | **14.78** |
| DeepSeek-V3 | 56.40 | 47.87 | 44.02 | 76.66 | 56.24 | 4.54 | 11.02 |
Note: Bold values indicate best performance across all models.
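The Average Score column is consistent with an unweighted mean of the four judged dimensions (for GPT-4o: (61.59 + 48.93 + 48.95 + 80.33) / 4 = 59.95). A quick check, as a sketch rather than the authors' evaluation code:

```python
def average_score(storyline_consistency, anthropomorphism,
                  character_fidelity, storyline_quality):
    """Unweighted mean of the four LLM-judged dimensions,
    rounded to two decimals as in the table above."""
    total = (storyline_consistency + anthropomorphism
             + character_fidelity + storyline_quality)
    return round(total / 4, 2)
```

Note that BLEU and ROUGE-L measure n-gram overlap with the original conversations and are reported separately, outside the average.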
### Performance on Existing RPLA Benchmarks
| Model | InCharacter Dim | InCharacter Full | Life Choice | CroSS MR |
|---|---|---|---|---|
| LLaMA-3.1-8B | 64.97 | 15.62 | 61.10 | 30.15 |
| CoSER-8B | 75.80 | 21.88 | 69.54 | 44.94 |
| CoSER-8B trained w/o I.T. | 70.70 | 15.62 | 59.92 | 43.14 |
| LLaMA-3.1-70B | 72.16 | 31.25 | 86.48 | 61.30 |
| Higgs-Llama-3-70B | 74.52 | 28.12 | 74.03 | 60.12 |
| CoSER-70B | 75.80 | **34.38** | **93.47** | **64.49** |
| CoSER-70B trained w/o I.T. | 73.12 | 32.14 | 93.18 | 63.14 |
| Qwen-2-72B | 74.52 | 31.25 | 81.14 | 62.57 |
| GPT-3.5 | 71.20 | 21.88 | 78.07 | 30.09 |
| GPT-4o | **76.54** | 32.62 | 75.96 | **64.49** |
| Claude-3.5-Sonnet | 72.61 | 21.88 | 86.07 | 30.59 |
Note: Bold values indicate best performance. I.T. denotes inner thoughts. For InCharacter, we report accuracy on individual dimensions (Dim) and the full profile (Full) of the BFI (Big Five Inventory).
## Ethical Considerations
We have conducted safety checks on the training dataset and removed potentially problematic content. However, users should be aware that:
- The models may still generate content that reflects biases present in the literary works they were trained on.
- Role-playing as certain characters might involve generating content that includes negative traits or behaviors.
- Users should implement appropriate safeguards when deploying these models in applications.
## Citation
If you use CoSER models in your research, please cite our paper:
    @misc{wang2025cosercoordinatingllmbasedpersona,
          title={CoSER: Coordinating LLM-Based Persona Simulation of Established Roles},
          author={Xintao Wang and Heng Wang and Yifei Zhang and Xinfeng Yuan and Rui Xu and Jen-tse Huang and Siyu Yuan and Haoran Guo and Jiangjie Chen and Wei Wang and Yanghua Xiao and Shuchang Zhou},
          year={2025},
          eprint={2502.09082},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2502.09082},
    }