CoSER Models

CoSER models are state-of-the-art models for role-playing language agents (RPLAs), built upon LLaMA-3.1 base models (8B and 70B). These models are trained on the CoSER dataset, which contains authentic multi-turn, multi-character dialogues extracted from 771 renowned novels.

CoSER models exhibit excellent role-playing capabilities. They produce highly human-like responses across a wide range of personas, including both established fictional characters and original ones. They excel at capturing nuanced personalities, maintaining consistent character traits, and adapting to diverse role-playing scenarios. Extensive experiments demonstrate that CoSER models achieve state-of-the-art role-playing performance across multiple benchmarks.

Model Variants

  • CoSER-8B: Fine-tuned from LLaMA-3.1-8B
  • CoSER-70B: Fine-tuned from LLaMA-3.1-70B
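Since both variants are fine-tuned from LLaMA-3.1, they can be prompted with a standard chat-style message list. Below is a minimal sketch of assembling such a prompt; the system-prompt wording and helper function are illustrative assumptions, not the official prompt format used during training.

```python
def build_roleplay_messages(character, profile, setting, user_line):
    """Assemble a chat-style prompt for character role-play.

    NOTE: the system-prompt wording here is an illustrative assumption;
    consult the CoSER repository for the exact format the models expect.
    """
    system = (
        f"You are {character}. {profile} "
        f"Current scene: {setting} "
        "Stay in character; you may include inner thoughts and actions."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_line},
    ]

messages = build_roleplay_messages(
    character="Sherlock Holmes",
    profile="A brilliant, observant consulting detective.",
    setting="221B Baker Street, evening.",
    user_line="Holmes, a client is waiting downstairs.",
)
```

Such a message list can then be formatted with the LLaMA-3.1 tokenizer's `apply_chat_template` before generation.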

Training Data

The models are trained on the CoSER dataset, which differs from existing RPLA datasets in two fundamental ways:

  1. It extracts authentic multi-turn, multi-character dialogues from acclaimed literary works, maintaining high source fidelity while exhibiting greater quality and complexity.

  2. It incorporates comprehensive types of data:

    • Character profiles, dialogues, plot summaries, character experiences, and conversation backgrounds.
    • Conversations that capture characters' internal thoughts and physical actions, beyond surface-level speech.
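To illustrate how these data types fit together, here is the rough shape of a single record; the field names below are assumptions for illustration, not the dataset's actual schema.

```python
# Illustrative shape of one CoSER-style record. Field names are
# assumptions for illustration, not the dataset's actual schema.
record = {
    "book": "Pride and Prejudice",
    "plot_summary": "Elizabeth visits Pemberley and unexpectedly meets Darcy.",
    "conversation_background": "The Gardiners tour the estate with Elizabeth.",
    "characters": {
        "Elizabeth Bennet": {
            "profile": "Witty, independent-minded, quick to judge.",
            "experiences": ["Rejected Darcy's first proposal at Hunsford."],
        },
    },
    "dialogue": [
        {
            "speaker": "Elizabeth Bennet",
            # Inner thoughts and physical actions beyond surface speech:
            "thought": "Of all places, why must he appear here?",
            "action": "lowers her eyes, cheeks colouring",
            "speech": "Mr. Darcy! I had understood the family to be away.",
        },
    ],
}
```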

Training Methodology

Our training approach is based on "given-circumstance acting" (GCA):

Given a conversation with messages M, characters C, and setting S, the actor LLM is required to sequentially portray each character c ∈ C to recreate the conversation. During training, for each character c, we optimize the language modeling loss on that character's messages only.

Performance and Evaluation

We evaluate our models via GCA evaluation, a comprehensive approach that combines multi-agent simulation with penalty-based LLM assessment:

  1. We generate conversations via multi-agent simulation, where the actor LLM portrays each character within a given setting, coordinated by a next-actor-prediction model to manage turn-taking.

  2. We assess the generated conversations using penalty-based LLM judges, which are provided with detailed rubrics and the original conversations for reference.
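The two stages above can be sketched as the following loop. Here `predict_next_actor`, `actor_respond`, and the penalty rubric are stubbed placeholders standing in for the corresponding LLM calls, not the paper's actual interfaces.

```python
def predict_next_actor(history, characters):
    # Placeholder for the next-actor-prediction model; here: round-robin.
    return characters[len(history) % len(characters)]

def actor_respond(character, setting, history):
    # Placeholder for the actor LLM portraying `character`.
    return f"{character} responds in setting '{setting}'."

def simulate(characters, setting, max_turns=4):
    """Stage 1: multi-agent simulation with managed turn-taking."""
    history = []
    for _ in range(max_turns):
        actor = predict_next_actor(history, characters)
        history.append((actor, actor_respond(actor, setting, history)))
    return history

def judge(history, penalties):
    """Stage 2: penalty-based assessment. Each flagged rubric item
    deducts points from a full score (scale assumed, for illustration)."""
    return max(0, 100 - sum(penalties))

conv = simulate(["Alice", "Bob"], "a moonlit garden")
score = judge(conv, penalties=[5, 10])  # two flagged rubric violations
```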

Performance on Given-Circumstance Acting

CoSER models outperform existing open-source LLMs on multiple RPLA benchmarks and are comparable to state-of-the-art closed-source models like GPT-4o.

| Model | Storyline Consistency | Anthropomorphism | Character Fidelity | Storyline Quality | Average Score | BLEU | ROUGE-L |
|---|---|---|---|---|---|---|---|
| *Closed-source Models* | | | | | | | |
| Abab7-preview | 56.81 | 44.23 | 43.83 | 74.83 | 54.92 | 4.96 | 11.50 |
| Doubao-pro | 60.95 | 49.72 | 47.02 | 79.28 | 59.24 | 6.38 | 12.95 |
| Step-1-Flash | 57.75 | 48.12 | 44.48 | 75.93 | 56.57 | 5.95 | 12.71 |
| Step-2 | 61.43 | 49.06 | 47.33 | 77.96 | 58.94 | 5.75 | 12.50 |
| GPT-3.5 | 57.22 | 43.30 | 42.29 | 73.91 | 54.18 | 4.58 | 11.80 |
| GPT-4o | **61.59** | 48.93 | **48.95** | **80.33** | **59.95** | 5.90 | 12.11 |
| GPT-4o Mini | 60.09 | 48.21 | 44.88 | 78.55 | 57.93 | 3.90 | 10.81 |
| Gemini Pro | 59.11 | 52.41 | 47.83 | 77.59 | 59.24 | 5.39 | 11.65 |
| Claude-3-Haiku | 58.18 | 44.66 | 41.88 | 74.14 | 54.71 | 4.80 | 12.02 |
| Claude-3.5-Sonnet | 57.45 | 48.50 | 45.69 | 77.23 | 57.22 | 5.17 | 11.45 |
| *Open-source Models* | | | | | | | |
| Mistral-7B | 59.90 | 40.00 | 44.75 | 61.93 | 51.64 | 2.71 | 9.28 |
| Qwen-2-7B | 51.96 | 35.48 | 31.51 | 63.18 | 45.53 | 4.21 | 10.71 |
| LLaMA-3.1-8B | 54.10 | 45.36 | 40.22 | 72.29 | 52.99 | 4.59 | 10.18 |
| CoSER-8B | 58.61 | 47.23 | 46.90 | 73.04 | 56.45 | 9.40 | 14.21 |
| Vicuna-13B-1.5 | 52.75 | 39.12 | 38.04 | 60.43 | 47.58 | 1.67 | 5.59 |
| Mixtral-8x7B | 51.25 | 38.44 | 36.92 | 67.69 | 48.58 | 5.28 | 11.66 |
| Qwen-2-72B | 57.75 | 47.28 | 46.62 | 76.60 | 57.06 | 5.38 | 11.85 |
| LLaMA-3.1-70B | 57.46 | 45.95 | 43.72 | 74.84 | 55.49 | 4.82 | 10.98 |
| Higgs-Llama-3-70B | 57.10 | 43.82 | 42.41 | 75.62 | 54.74 | 3.99 | 10.92 |
| CoSER-70B | 58.66 | **53.33** | 48.75 | 75.49 | 59.06 | **10.10** | **14.78** |
| DeepSeek-V3 | 56.40 | 47.87 | 44.02 | 76.66 | 56.24 | 4.54 | 11.02 |

Note: Bold values indicate best performance across all models.

Performance on Existing RPLA Benchmarks

| Model | InCharacter Dim | InCharacter Full | Life Choice | CroSS MR |
|---|---|---|---|---|
| LLaMA-3.1-8B | 64.97 | 15.62 | 61.10 | 30.15 |
| CoSER-8B | 75.80 | 21.88 | 69.54 | 44.94 |
| CoSER-8B trained w/o I.T. | 70.70 | 15.62 | 59.92 | 43.14 |
| LLaMA-3.1-70B | 72.16 | 31.25 | 86.48 | 61.30 |
| Higgs-Llama-3-70B | 74.52 | 28.12 | 74.03 | 60.12 |
| CoSER-70B | 75.80 | **34.38** | **93.47** | **64.49** |
| CoSER-70B trained w/o I.T. | 73.12 | 32.14 | 93.18 | 63.14 |
| Qwen-2-72B | 74.52 | 31.25 | 81.14 | 62.57 |
| GPT-3.5 | 71.20 | 21.88 | 78.07 | 30.09 |
| GPT-4o | **76.54** | 32.62 | 75.96 | **64.49** |
| Claude-3.5-Sonnet | 72.61 | 21.88 | 86.07 | 30.59 |

Note: Bold values indicate best performance. I.T. denotes inner thoughts. For InCharacter, we report accuracy for individual (Dim) and full (Full) dimensions on BFI.

Ethical Considerations

We have conducted safety checks on the training dataset and removed potentially problematic content. However, users should be aware that:

  • The models may still generate content that reflects biases present in the literary works they were trained on.
  • Role-playing as certain characters might involve generating content that includes negative traits or behaviors.
  • Users should implement appropriate safeguards when deploying these models in applications.

Citation

If you use CoSER models in your research, please cite our paper:

@misc{wang2025cosercoordinatingllmbasedpersona,
      title={CoSER: Coordinating LLM-Based Persona Simulation of Established Roles}, 
      author={Xintao Wang and Heng Wang and Yifei Zhang and Xinfeng Yuan and Rui Xu and Jen-tse Huang and Siyu Yuan and Haoran Guo and Jiangjie Chen and Wei Wang and Yanghua Xiao and Shuchang Zhou},
      year={2025},
      eprint={2502.09082},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.09082}, 
}