Model Card for Llama-3.1-8B-Instruct-NLRL-TicTacToe-Policy

Model Details

Model Description

  • Developed by: NLRL Team
  • Model type: Language Policy Model for TicTacToe
  • Language(s): English
  • License: MIT
  • Finetuned from model: Llama-3.1-8B-Instruct

This model serves as a language policy in the Natural Language Reinforcement Learning (NLRL) framework, trained specifically for the game of TicTacToe. Given a board state, it generates chain-of-thought reasoning and then outputs a move decision.

Uses

Direct Use

This model can be used as a TicTacToe player that explains its strategic thinking in natural language before each move: it generates a reasoning chain followed by a final move decision.
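
A minimal inference sketch with the Hugging Face transformers library is shown below. The prompt layout (the board rendering and instruction wording) and the cell-numbering convention are assumptions for illustration; the exact format used during training is not documented in this card.

```python
# Hedged sketch: load the policy and ask it for a move.
# The prompt format (board string + instruction) is an assumption,
# not the documented training format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Benjamin-eecs/Llama-3.1-8B-Instruct-NLRL-TicTacToe-Policy"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Example position, X to move; cells numbered 1-9 row by row (assumed convention).
board_text = (
    "Current board (X to move):\n"
    "X | O | 3\n"
    "4 | X | O\n"
    "7 | 8 | 9\n"
    "Think step by step about the best move, then state it."
)
messages = [
    {"role": "system", "content": "You are an expert TicTacToe player."},
    {"role": "user", "content": board_text},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```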

Out-of-Scope Use

This model is specifically trained for TicTacToe and should not be used for other games or tasks.

Training Details

Training Data

Training data consists of state-action pairs collected through the NLRL actor-critic learning process, with language-based Monte Carlo value estimates used to drive policy improvement.
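
The exact record schema is not published with this card; the example below is a hypothetical illustration of what a single state-action pair might look like (field names, board encoding, and reasoning text are assumptions):

```python
# Hypothetical shape of one state-action training example.
# Field names and the board encoding are illustrative assumptions,
# not the released data format.
example = {
    "state": (
        "Current board (X to move):\n"
        "X | O | 3\n"
        "4 | X | O\n"
        "7 | 8 | 9"
    ),
    "target": (
        "X already occupies cells 1 and 5. Playing cell 9 completes the "
        "main diagonal (1-5-9) and wins immediately. Move: 9"
    ),
}
```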

Training Procedure

  • Trained using FSDP (Fully Sharded Data Parallel) across 4 H100 GPUs
  • Learning rate: 1e-5
  • Training epochs per iteration: 2
  • Batch size: 8
  • Max sequence length: 1024
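
As a rough illustration, the hyperparameters above could be expressed with a standard Hugging Face fine-tuning configuration as in the sketch below. Whether the listed batch size is per device or global, and the exact FSDP wrapping policy, are assumptions; the actual NLRL training scripts are not reproduced here.

```python
# Hedged sketch: mapping the listed hyperparameters onto
# transformers.TrainingArguments. FSDP details and the per-device
# batch split (2 x 4 H100s = global batch of 8) are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="nlrl-tictactoe-policy",
    learning_rate=1e-5,             # listed learning rate
    num_train_epochs=2,             # training epochs per NLRL iteration
    per_device_train_batch_size=2,  # assumed split of the batch size of 8 over 4 GPUs
    bf16=True,
    fsdp="full_shard auto_wrap",    # Fully Sharded Data Parallel
    fsdp_config={"transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"]},
    logging_steps=10,
)
# The max sequence length (1024) would be applied when tokenizing the
# state-action pairs, e.g. tokenizer(..., truncation=True, max_length=1024).
```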

Evaluation

  • Tested against deterministic (first-move) and random opponent strategies
  • Achieves >90% win rate against both opponent types after convergence
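
As a hedged illustration of this protocol, the sketch below plays repeated games against a random opponent and reports the policy's win rate; the policy_move stub stands in for querying the model (see the inference example above) and is not the released evaluation harness.

```python
# Hedged sketch of the evaluation loop: many games vs. a random opponent.
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def policy_move(board):
    # Placeholder: in practice, render `board` as text, prompt the model,
    # and parse the chosen cell from its chain-of-thought answer.
    return random.choice([i for i, cell in enumerate(board) if cell == " "])

def random_move(board):
    return random.choice([i for i, cell in enumerate(board) if cell == " "])

def play_game():
    board, turn = [" "] * 9, "X"  # policy plays X and moves first (assumed)
    movers = {"X": policy_move, "O": random_move}
    while winner(board) is None and " " in board:
        board[movers[turn](board)] = turn
        turn = "O" if turn == "X" else "X"
    return winner(board)  # "X", "O", or None (draw)

results = [play_game() for _ in range(200)]
print("policy win rate:", results.count("X") / len(results))
```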

Model Architecture

  • Base model: Llama-3.1-8B-Instruct
  • Input: Text description of TicTacToe board state
  • Output: Chain-of-thought reasoning followed by move decision
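
Because the final move is embedded in free-form reasoning text, downstream code typically needs to extract it. The snippet below is a hypothetical parser that assumes the answer ends with a "Move: <cell>" statement; the actual output convention may differ.

```python
# Hedged sketch: pull the chosen cell out of a chain-of-thought completion.
# The "Move: <cell>" convention is an assumption about the output format.
import re

def parse_move(completion: str):
    match = re.search(r"[Mm]ove\s*[:\-]?\s*([1-9])", completion)
    return int(match.group(1)) if match else None

print(parse_move("Cell 9 completes the main diagonal and wins. Move: 9"))  # -> 9
```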

Citation

@misc{nlrl,
      title={Natural Language Reinforcement Learning}, 
      author={Xidong Feng and Ziyu Wan and Haotian Fu and Bo Liu and Mengyue Yang and Girish A. Koushik and Zhiyuan Hu and Ying Wen and Jun Wang},
      year={2024},
      eprint={2411.14251},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2411.14251}, 
}

Model Card Contact

[email protected]
