Jarvis-K committed on
Commit 2a33798 · 1 Parent(s): 64eba11
Files changed (50)
  1. README.md +97 -0
  2. RL_based/test_RL.sh +39 -0
  3. RL_based/train_PPO.py +251 -0
  4. RL_based/train_RL.sh +39 -0
  5. RL_based/utils.py +621 -0
  6. deciders/__init__.py +26 -0
  7. deciders/act.py +248 -0
  8. deciders/cot.py +147 -0
  9. deciders/jarvis.py +177 -0
  10. deciders/jarvis_without_insights.py +179 -0
  11. deciders/jarvis_without_shortmem.py +182 -0
  12. deciders/jarvis_without_suggestions.py +180 -0
  13. deciders/jarvis_without_summary.py +179 -0
  14. deciders/misc.py +21 -0
  15. deciders/pal.py +149 -0
  16. deciders/parser.py +53 -0
  17. deciders/reflexion.py +179 -0
  18. deciders/self_consistency.py +170 -0
  19. deciders/selfask.py +150 -0
  20. deciders/spp.py +142 -0
  21. deciders/utils.py +65 -0
  22. distillers/__init__.py +10 -0
  23. distillers/guidance_summary_few_shot_examples.txt +85 -0
  24. distillers/guider.py +144 -0
  25. distillers/raw_prompt_generator.py +16 -0
  26. distillers/reflexion_few_shot_examples.txt +75 -0
  27. distillers/self_reflection.py +53 -0
  28. distillers/traj_prompt_summarizer.py +46 -0
  29. distillers/traj_summary_few_shot_examples.txt +76 -0
  30. draw_overall_performance.py +59 -0
  31. environment.yml +193 -0
  32. envs/__init__.py +51 -0
  33. envs/base_env.py +97 -0
  34. envs/box2d/LunarLander_policies.py +36 -0
  35. envs/box2d/LunarLander_translator.py +67 -0
  36. envs/box2d/__init__.py +0 -0
  37. envs/box2d/few_shot_examples/lunarlander_l2.json +0 -0
  38. envs/box2d/few_shot_examples/lunarlander_l4.json +0 -0
  39. envs/classic_control/__init__.py +0 -0
  40. envs/classic_control/acrobot_policies.py +36 -0
  41. envs/classic_control/acrobot_translator.py +58 -0
  42. envs/classic_control/cartpole_policies.py +25 -0
  43. envs/classic_control/cartpole_translator.py +57 -0
  44. envs/classic_control/few_shot_examples/acrobot_l2.json +0 -0
  45. envs/classic_control/few_shot_examples/acrobot_l4.json +0 -0
  46. envs/classic_control/few_shot_examples/cartpole_l2.json +0 -0
  47. envs/classic_control/few_shot_examples/cartpole_l4.json +0 -0
  48. envs/classic_control/few_shot_examples/mountaincarContinuous_l2.json +0 -0
  49. envs/classic_control/few_shot_examples/mountaincarContinuous_l4.json +0 -0
  50. envs/classic_control/few_shot_examples/mountaincar_l2.json +0 -0
README.md ADDED
@@ -0,0 +1,97 @@
+ # Bench LLM Deciders with gym translators
+ This project provides a set of translators that convert OpenAI Gym environments into text-based environments. It is designed to investigate the capabilities of large language models in decision-making tasks within these text-based environments.
+
+ ## Summarizer Levels
+ We translate the game with basic-level descriptions: each translator gives a simple description of the current state of the game, suitable for beginners who are just getting familiar with the game.
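+
+ As a purely illustrative sketch (the exact wording is produced by each environment's translator and is not quoted here), a basic-level CartPole description might read:
+
+ ```
+ The cart is at position 0.02, moving right. The pole is tilted slightly to the left and rotating counter-clockwise.
+ ```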
+ ## Environment Categories
+ The environments are categorized by how much information is revealed to agents. We propose *5 levels* of scenarios.
+
+ **L1**: No external information is given, only the abstract game description. (zero-shot)
+
+ **L2**: Agents can use a sampled trajectory of a random policy as external knowledge. (few-shot, off-policy info)
+
+ **L3**: Agents sample trajectories themselves and update with feedback. (few-shot, on-policy info)
+
+ **L4**: Agents can use sampled trajectories of an expert policy. (few-shot, expert info)
+
+ **L5**: Expert teaching. (few-shot, expert info with guidance)
+
+ These five levels mainly concern decision-making from the current perception. Making use of a predicted future world is left to a stage-2 investigation.
+
+ **Perception and Future World**: These environments provide a perception of the current state and also predict future information. The future info is given in the info dict returned by `step` and `reset`.
+
+ Note that the past-memory part should be implemented as a component of the deciders.
+
+ ## Fewshot Examples Generation
+ For the `L1` level, an empty example list `[]` is given.
+ For the `L2` and `L4` levels, we use `gen_few_shots_examples.py` to generate the corresponding examples in JSON format and place them in `envs/*/few_shot_examples/`.
+ For the `L3` level, agents should collect the examples on their own, and only a few methods support this, so we leave it to the agent design.
+ For the `L5` level, we handcraft the few-shot examples with domain knowledge in `prompts/task_relevant`.
+
+ ## Usage
+
+ 1. Create `./deciders/gpt.py` to provide your `gpt` agent:
+ ```python
+ import openai
+
+ class gpt:
+     def __init__(self):
+         openai.api_type = "azure"
+         openai.api_version = "2023-05-15"
+         # Your Azure OpenAI resource's endpoint value.
+         openai.api_base = "https://js-partner.openai.azure.com/"
+         openai.api_key = "your azure openai key"
+ ```
+
+ 2. Install the requirements:
+
+ ```
+ conda env create --file environment.yml
+ ```
+
+ 3. Testing
+ The project can be run using the provided test.sh script. This script runs a series of commands, each of which initializes a Gym environment and applies a different translator to it.
+
+ Here is an example of how to run the script:
+
+ ```
+ ./test.sh
+ ```
+ The commands in test.sh are structured as follows (a concrete example is sketched after the parameter list below):
+
+ ```
+ python main.py --env_name ENV_NAME --init_summarizer INIT_SUMMARIZER --curr_summarizer CURR_SUMMARIZER [--future_summarizer FUTURE_SUMMARIZER --future_horizon FUTURE_HORIZON]
+ ```
+ Where:
+
+ * ENV_NAME: The name of the Gym environment to be used (e.g., CartPole-v0).
+ * INIT_SUMMARIZER: The initial summarizer to be used (e.g., cart_init_translator).
+ * CURR_SUMMARIZER: The current summarizer to be used (e.g., cart_basic_translator).
+ * FUTURE_SUMMARIZER (optional): The future summarizer to be used (e.g., cart_basic_translator).
+ * FUTURE_HORIZON (optional): The number of future steps each policy looks ahead (e.g., 3).
+
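+ For instance, combining the example values above into a concrete run (only the parameters documented above are used; check main.py for the full set of flags it accepts):
+
+ ```
+ python main.py --env_name CartPole-v0 --init_summarizer cart_init_translator --curr_summarizer cart_basic_translator
+ ```
+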
+ ## Supported Environment Translators and LLM Deciders
+
+ | | Acrobot | Cart Pole | Mountain Car | Pendulum | Lunar Lander | Blackjack | Taxi | Cliff Walking | Frozen Lake |
+ |------------------------------|:------------------------:|:----------------------------------:|:------------------------:|:------------------------:|:------------------------:|:------------------------:|:------------------------:|:------------------------:|:------------------------:|
+ | Translator | :heavy_multiplication_x: | :white_check_mark: | :heavy_multiplication_x: | :heavy_multiplication_x: | :white_check_mark: | :heavy_multiplication_x: | :heavy_multiplication_x: | :heavy_multiplication_x: | :heavy_multiplication_x: |
+ | Chain-of-Thought | :heavy_minus_sign: | :white_check_mark:(L1)<br>:gift:<sup>[1]</sup>(~30) | :heavy_minus_sign: | :heavy_minus_sign: | :white_check_mark:(L1)<br/>:gift:<sup>[1]</sup>(-367) | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: |
+ | Program-aided Language Model | :heavy_minus_sign: | :white_check_mark:(L1)<br>:gift:(168) | :heavy_minus_sign: | :heavy_minus_sign: | :white_check_mark:(L1)<br/>:gift:(-68) | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: |
+ | Self-ask Prompting | :heavy_minus_sign: | :white_check_mark:(L1)<br>:gift:(~10) | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_multiplication_x: | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: |
+ | Self-consistency Prompting | :heavy_minus_sign: | :white_check_mark:(L1)<br>:gift:(~30) | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_multiplication_x: | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: |
+ | Reflexion | :heavy_minus_sign: | :heavy_multiplication_x: | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_multiplication_x: | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: |
+ | Solo Performance Prompting | :heavy_minus_sign: | :white_check_mark:(L1)<br/>:gift:(43) | :heavy_minus_sign: | :heavy_minus_sign: | :white_check_mark:(L1)<br/>:gift:(-583) | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: | :heavy_minus_sign: |
+
+ <sup>[1]: Cumulative reward.</sup>
+ ![Image text](https://github.com/mail-ecnu/LLM-Decider-Bench/blob/master/vis/Classic%20Control.png)
+ ![Image text](https://github.com/mail-ecnu/LLM-Decider-Bench/blob/master/vis/Box%202D.png)
+ ![Image text](https://github.com/mail-ecnu/LLM-Decider-Bench/blob/master/vis/Toy%20Text.png)
+
+ >
+ > 1. Except for the Reflexion L3 decider, all other L3 deciders in this task do not have memory.
+ > 2. Reflexion L1 and L3 both have memory.
+ > 3. Reflexion L1 runs 5 trials.
+ > 4. Blackjack, MountainCar, CliffWalking (PAL), CartPole (PAL), Taxi (SPP, PAL), and Frozen Lake use deciders modified at 15:29 on 09.18.
+ > 5. The Frozen Lake translator was updated to add prior knowledge.
+
+ ## Remarks
+ 1. How to use future info
+ We provide future info in the `env_info` part. It is a dict, and you can further convert it into text to make your agent aware of the world model.
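+
+ A minimal sketch of that conversion (illustrative only: the exact keys depend on what your translator puts in the info dict, and `future_info_to_text` is a hypothetical helper, not part of this repository):
+
+ ```python
+ def future_info_to_text(info: dict) -> str:
+     """Render the entries of an env info dict as a prompt-friendly string."""
+     lines = [f"{key}: {value}" for key, value in info.items()]
+     return "Predicted future information:\n" + "\n".join(lines)
+
+ # Example use inside an agent loop:
+ # observation, reward, terminated, truncated, info = env.step(action)
+ # state_text = state_description + "\n" + future_info_to_text(info)
+ ```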
RL_based/test_RL.sh ADDED
@@ -0,0 +1,39 @@
1
+ # # ppo for cartpole-v0
2
+ # CUDA_VISIBLE_DEVICES=1 python RL_based/train_PPO.py --env_name CartPole-v0 --init_summarizer cart_init_translator --curr_summarizer cart_basic_translator\
3
+ # --trans_model_name distilbert-base-uncased --model_name nn_embedding --eval --policy-path RL_based/checkpoints/CartPole-v0/expert/policy.pth --collect_one_episode
4
+
5
+ # # ppo for lunarlander: treasured-music-91 score: 164.66
6
+ # TRANSFORMERS_OFFLINE=1 \
7
+ # CUDA_VISIBLE_DEVICES=2 python RL_based/train_PPO.py --env_name LunarLander-v2 --init_summarizer lunarLander_init_translator --curr_summarizer lunarLander_basic_translator \
8
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 128 --eval --collect_one_episode --policy-path /home/ubuntu/LLM-Decider-Bench/RL_based/checkpoints/LunarLander-v2/expert/policy.pth
9
+
10
+ # ppo for Acrobot-v1: charmed-salad-93 score: -85.8
11
+ # TRANSFORMERS_OFFLINE=1 \
12
+ # CUDA_VISIBLE_DEVICES=0 python RL_based/train_PPO.py --env_name Acrobot-v1 --init_summarizer acrobot_init_translator --curr_summarizer acrobot_basic_translator --decider naive_actor --prompt_level 1\
13
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 128 --eval --collect_one_episode --policy-path /home/ubuntu/LLM-Decider-Bench/RL_based/checkpoints/Acrobot-v1/expert/policy.pth
14
+
15
+ # # # # ppo for MountainCar-v0:
16
+ # TRANSFORMERS_OFFLINE=1 \
17
+ # CUDA_VISIBLE_DEVICES=1 python RL_based/train_PPO.py --env_name MountainCar-v0 --init_summarizer mountaincar_init_translator --curr_summarizer mountaincar_basic_translator --decider naive_actor --prompt_level 1\
18
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 128 --eval --collect_one_episode --policy-path /home/ubuntu/LLM-Decider-Bench/RL_based/checkpoints/MountainCar-v0/expert/policy.pth
19
+
20
+ # # ppo for Blackjack-v1
21
+ # TRANSFORMERS_OFFLINE=1 \
22
+ # CUDA_VISIBLE_DEVICES=2 python RL_based/train_PPO.py --env_name Blackjack-v1 --init_summarizer blackjack_init_translator --curr_summarizer blackjack_basic_translator --decider naive_actor --prompt_level 1\
23
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 128 --eval --collect_one_episode --policy-path /home/ubuntu/LLM-Decider-Bench/RL_based/checkpoints/Blackjack-v1/expert/policy.pth
24
+
25
+ # # # ppo for Taxi-v3
26
+ TRANSFORMERS_OFFLINE=1 \
27
+ CUDA_VISIBLE_DEVICES=6 python RL_based/train_PPO.py --env_name Taxi-v3 --init_summarizer taxi_init_translator --curr_summarizer taxi_basic_translator --decider naive_actor --prompt_level 1\
28
+ --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 128 --eval --collect_one_episode --policy-path /home/ubuntu/LLM-Decider-Bench/RL_based/checkpoints/Taxi-v3/expert/policy.pth
29
+
30
+ # # # ppo for CliffWalking-v0
31
+ # TRANSFORMERS_OFFLINE=1 \
32
+ # CUDA_VISIBLE_DEVICES=4 python RL_based/train_PPO.py --env_name CliffWalking-v0 --init_summarizer cliffwalking_init_translator --curr_summarizer cliffwalking_basic_translator --decider naive_actor --prompt_level 1\
33
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 128 --eval --collect_one_episode --policy-path /home/ubuntu/LLM-Decider-Bench/RL_based/checkpoints/CliffWalking-v0/expert/policy.pth
34
+
35
+ # # # ppo for FrozenLake-v1
36
+ # TRANSFORMERS_OFFLINE=1 \
37
+ # CUDA_VISIBLE_DEVICES=5 python RL_based/train_PPO.py --env_name FrozenLake-v1 --init_summarizer frozenlake_init_translator --curr_summarizer frozenlake_basic_translator --decider naive_actor --prompt_level 1\
38
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 128 --eval --collect_one_episode --policy-path /home/ubuntu/LLM-Decider-Bench/RL_based/checkpoints/FrozenLake-v1/expert/policy.pth
39
+
RL_based/train_PPO.py ADDED
@@ -0,0 +1,251 @@
1
+ import argparse
2
+ import sys
3
+ sys.path.insert(0, sys.path[0]+"/../")
4
+ import prompts as task_prompts
5
+ import envs
6
+ import os
7
+ from envs.translator import InitSummarizer, CurrSummarizer, FutureSummarizer, Translator
8
+ import gym
9
+ from torch.optim.lr_scheduler import LambdaLR
10
+ import torch
11
+ from tianshou.data import Collector, VectorReplayBuffer, ReplayBuffer
12
+ from tianshou.env import DummyVectorEnv, SubprocVectorEnv
13
+ from tianshou.policy import PPOPolicy, ICMPolicy
14
+ from tianshou.trainer import onpolicy_trainer
15
+ from tianshou.utils.net.common import ActorCritic
16
+ from tianshou.utils.net.discrete import Actor, Critic, IntrinsicCuriosityModule
17
+ from RL_based.utils import Net_GRU_Bert_tianshou, Net_Bert_CLS_tianshou, Net_Bert_CNN_tianshou, Net_GRU_nn_emb_tianshou
18
+ from tianshou.utils import WandbLogger
19
+ from torch.utils.tensorboard import SummaryWriter
20
+ from tianshou.trainer.utils import test_episode
21
+
22
+ import warnings
23
+ warnings.filterwarnings('ignore')
24
+
25
+ class MaxStepLimitWrapper(gym.Wrapper):
26
+ def __init__(self, env, max_steps=200):
27
+ super(MaxStepLimitWrapper, self).__init__(env)
28
+ self.max_steps = max_steps
29
+ self.current_step = 0
30
+
31
+ def reset(self, **kwargs):
32
+ self.current_step = 0
33
+ return self.env.reset(**kwargs)
34
+
35
+ def step(self, action):
36
+ observation, reward, terminated, truncated, info = self.env.step(action)
37
+ self.current_step += 1
38
+
39
+ if self.current_step >= self.max_steps:
40
+ terminated = True
41
+ info['episode_step_limit'] = self.max_steps
42
+
43
+ return observation, reward, terminated, truncated, info
44
+
45
+ class SimpleTextWrapper(gym.Wrapper):
46
+ def __init__(self, env):
47
+ super(SimpleTextWrapper, self).__init__(env)
48
+ self.env = env
49
+
50
+ def reset(self, **kwargs):
51
+ observation, _ = self.env.reset(**kwargs)
52
+ return str(observation), {}
53
+
54
+ def step(self, action):
55
+ observation, reward, terminated, truncated, info = self.env.step(action)
56
+ return str(observation), reward, terminated, truncated, info
57
+
58
+ if __name__ == "__main__":
59
+ parser = argparse.ArgumentParser(description='Evaluate a translator in a gym environment with a ChatGPT model.')
60
+ parser.add_argument('--init_summarizer', type=str, required=True, help='The name of the init summarizer to use.')
61
+ parser.add_argument('--curr_summarizer', type=str, required=True, help='The name of the curr summarizer to use.')
62
+ parser.add_argument('--future_summarizer', type=str, help='The name of the future summarizer to use.')
63
+ parser.add_argument('--env', type=str, default='base_env', help='The name of the environment wrapper class (from envs.REGISTRY) to use.')
64
+ parser.add_argument('--env_name', type=str, default='CartPole-v1', help='The name of the gym environment to use.')
65
+ parser.add_argument('--decider', type=str, default="naive_actor", help='The actor used to select action')
66
+ parser.add_argument('--render', type=str, default="rgb_array", help='The render mode')
67
+ parser.add_argument('--future_horizon', type=int, help='The horizon of looking to future')
68
+ parser.add_argument(
69
+ "--prompt_level",
70
+ type=int,
71
+ default=1,
72
+ help="The level of prompts",
73
+ )
74
+ parser.add_argument(
75
+ "--past_horizon", type=int, help="The horizon of looking back"
76
+ )
77
+ parser.add_argument(
78
+ "--max_episode_len", type=int, default=200, help="The max length of an episode"
79
+ )
80
+
81
+ ### for RL training
82
+ parser.add_argument('--max_length', type=int, default=128, help='The token length of the observation')
83
+ # trans_model_name
84
+ parser.add_argument('--trans_model_name', type=str, default='bert-base-uncased', help='The name of the pretrained transformer to use.')
85
+ parser.add_argument('--model_name', type=str, default='bert-embedding', help='The name of the model to use.')
86
+ parser.add_argument('--vector_env', type=str, default='dummy', help='The name of the vector env to use.')
87
+ parser.add_argument('--eval', action='store_true', default=False, help='Whether to only eval the model')
88
+ parser.add_argument('--policy-path', type=str, default=None, help='The path to the policy to be evaluated')
89
+ parser.add_argument('--collect_one_episode', action='store_true', default=False, help='Whether to only collect one episode')
90
+ parser.add_argument('--lr', type=float, default=0.0003, help='The learning rate of the model')
91
+ parser.add_argument('--step_per_epoch', type=int, default=10000, help='The number of steps per epoch')
92
+ parser.add_argument('--step_per_collect', type=int, default=2000, help='The number of steps per collect')
93
+ parser.add_argument('--lr_decay', action='store_true', default=False, help='Whether to decay the learning rate')
94
+ parser.add_argument('--epoch', type=int, default=400, help='The number of epochs to train')
95
+ parser.add_argument('--resume_path', type=str, default=None, help='The path to the policy to be resumed')
96
+ parser.add_argument('--taxi_specific_env', action='store_true', default=False, help='Whether to use taxi specific env')
97
+ args = parser.parse_args()
98
+ args_dict = vars(args)
99
+
100
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
101
+ # Get the specified translator, environment, and ChatGPT model
102
+ env_class = envs.REGISTRY[args.env]
103
+ init_summarizer = InitSummarizer(envs.REGISTRY[args.init_summarizer])
104
+ curr_summarizer = CurrSummarizer(envs.REGISTRY[args.curr_summarizer])
105
+ if args.future_summarizer:
106
+ future_summarizer = FutureSummarizer(
107
+ envs.REGISTRY[args.future_summarizer],
108
+ envs.REGISTRY["cart_policies"],
109
+ future_horizon=args.future_horizon,
110
+ )
111
+ else:
112
+ future_summarizer = None
113
+
114
+ wandb_log_config = {
115
+ "env": args.env_name,
116
+ "init_summarizer": args.init_summarizer,
117
+ "curr_summarizer": args.curr_summarizer,
118
+ "future_summarizer": args.future_summarizer,
119
+ }
120
+ wandb_log_config.update(args_dict)
121
+
122
+ if not args.eval:
123
+ logger = WandbLogger(
124
+ project="LLM-decider-bench-RL",
125
+ entity="llm-bench-team",
126
+ config=wandb_log_config,
127
+ )
128
+ random_name = logger.wandb_run.name
129
+ log_path = os.path.join('/home/ubuntu/LLM-Decider-Bench/RL_based/results', args.env_name, random_name)
130
+ writer = SummaryWriter(log_dir=log_path)
131
+ writer.add_text("args", str(args))
132
+ logger.load(writer)
133
+ def save_best_fn(policy):
134
+ torch.save(policy.state_dict(), os.path.join(log_path, 'policy.pth'))
135
+
136
+ sampling_env = envs.REGISTRY["sampling_wrapper"](gym.make(args.env_name))
137
+ if args.prompt_level == 5:
138
+ prompts_class = task_prompts.REGISTRY[(args.env_name,args.decider)]()
139
+ else:
140
+ prompts_class = task_prompts.REGISTRY[(args.decider)]()
141
+ translator = Translator(
142
+ init_summarizer, curr_summarizer, future_summarizer, env=sampling_env
143
+ )
144
+ if args.taxi_specific_env:
145
+ environment = gym.make(args.env_name, render_mode=args.render)
146
+ else:
147
+ environment = env_class(
148
+ gym.make(args.env_name, render_mode=args.render), translator
149
+ )
150
+
151
+ # Set the translation level
152
+ translate_level = 1
153
+ if args.past_horizon is None and args.future_horizon is None:
154
+ translate_level = 1
155
+ if args.past_horizon and args.future_horizon is None:
156
+ raise NotImplementedError
157
+ # translate_level = 2
158
+ if args.past_horizon is None and args.future_horizon:
159
+ raise NotImplementedError
160
+ # translate_level = 3
161
+ if args.past_horizon and args.future_horizon:
162
+ raise NotImplementedError
163
+ # translate_level = 3.5
164
+
165
+
166
+ if args.vector_env == 'dummy':
167
+ ThisEnv = DummyVectorEnv
168
+ elif args.vector_env == 'subproc':
169
+ ThisEnv = SubprocVectorEnv
170
+ def make_env():
171
+ if args.taxi_specific_env:
172
+ env = MaxStepLimitWrapper(SimpleTextWrapper(gym.make(args.env_name, render_mode=args.render)), max_steps=200)
173
+ env._max_episode_steps = args.max_episode_len
174
+ else:
175
+ env = env_class(MaxStepLimitWrapper(gym.make(args.env_name, render_mode=args.render), max_steps=200), translator)
176
+ env._max_episode_steps = args.max_episode_len
177
+
178
+ return env
179
+ train_envs = ThisEnv([make_env for _ in range(20)])
180
+ test_envs = ThisEnv([make_env for _ in range(10)])
181
+ # model & optimizer
182
+ def get_net():
183
+ if args.model_name == "bert-embedding":
184
+ net = Net_GRU_Bert_tianshou(state_shape=environment.observation_space.shape, hidden_sizes=[64, 64], device=device, max_length=args.max_length, trans_model_name=args.trans_model_name)
185
+ elif args.model_name == "bert-CLS-embedding":
186
+ net = Net_Bert_CLS_tianshou(state_shape=environment.observation_space.shape, hidden_sizes=[256, 128], device=device, max_length=args.max_length, trans_model_name=args.trans_model_name)
187
+ elif args.model_name == "bert-CNN-embedding":
188
+ net = Net_Bert_CNN_tianshou(state_shape=environment.observation_space.shape, hidden_sizes=[256, 128], device=device, max_length=args.max_length, trans_model_name=args.trans_model_name)
189
+ elif args.model_name == "nn_embedding":
190
+ net = Net_GRU_nn_emb_tianshou(hidden_sizes=[256, 128], device=device, max_length=args.max_length, trans_model_name=args.trans_model_name)
191
+ return net
192
+ net = get_net()
193
+ actor = Actor(net, environment.action_space.n, device=device).to(device)
194
+ critic = Critic(net, device=device).to(device)
195
+ actor_critic = ActorCritic(actor, critic)
196
+ optim = torch.optim.Adam(actor_critic.parameters(), lr=args.lr)
197
+
198
+ # PPO policy
199
+ dist = torch.distributions.Categorical
200
+ lr_scheduler = None
201
+ if args.lr_decay:
202
+ max_update_num = args.step_per_epoch // args.step_per_collect * args.epoch
203
+
204
+ lr_scheduler = LambdaLR(optim, lr_lambda=lambda epoch: 1 - epoch / max_update_num)
205
+ policy = PPOPolicy(actor, critic, optim, dist, action_space=environment.action_space, lr_scheduler=lr_scheduler).to(device)
206
+ # collector
207
+ train_collector = Collector(policy, train_envs, VectorReplayBuffer(20000, len(train_envs)), exploration_noise=True)
208
+ test_collector = Collector(policy, test_envs, exploration_noise=True)
209
+
210
+ if not args.eval:
211
+ # trainer
212
+ # test train_collector and start filling replay buffer
213
+
214
+ if args.resume_path:
215
+ policy.load_state_dict(torch.load(args.resume_path, map_location='cuda'))
216
+ print("Loaded agent from: ", args.resume_path)
217
+
218
+ train_collector.collect(256 * 20)
219
+ result = onpolicy_trainer(
220
+ policy,
221
+ train_collector,
222
+ test_collector,
223
+ max_epoch=args.epoch,
224
+ step_per_epoch=50000, # the number of transitions collected per epoch
225
+ repeat_per_collect=4,
226
+ episode_per_test=10,
227
+ batch_size=256,
228
+ logger=logger,
229
+ step_per_collect=1000, # the number of transitions the collector would collect before the network update
230
+ save_best_fn=save_best_fn,
231
+ # stop_fn=lambda mean_reward: mean_reward >= environment.spec.reward_threshold,
232
+ )
233
+ print(result)
234
+ else:
235
+ assert args.policy_path is not None
236
+ policy.load_state_dict(torch.load(args.policy_path))
237
+ test_collector = Collector(policy, test_envs)
238
+ result = test_episode(policy, test_collector, None, None, n_episode=10)
239
+ print(result)
240
+ if args.collect_one_episode:
241
+ replaybuffer = ReplayBuffer(size=1000)
242
+ test_collector_1 = Collector(policy, environment, replaybuffer)
243
+ test_collector_1.reset_env()
244
+ test_collector_1.reset_buffer()
245
+ policy.eval()
246
+ result = test_collector_1.collect(n_episode=1)
247
+ print('sample results', f"/home/ubuntu/LLM-Decider-Bench/RL_based/checkpoints/{args.env_name}/output.txt")
248
+ sample_result = replaybuffer.sample(0)
249
+ f = open(f"/home/ubuntu/LLM-Decider-Bench/RL_based/checkpoints/{args.env_name}/output.txt", "w")
250
+ print(sample_result, file=f)
251
+ f.close()
RL_based/train_RL.sh ADDED
@@ -0,0 +1,39 @@
1
+ # # ppo for cartpole
2
+ # CUDA_VISIBLE_DEVICES=1 python RL_based/train_PPO.py --env_name CartPole-v0 --init_summarizer cart_init_translator --curr_summarizer cart_basic_translator\
3
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding
4
+
5
+ # # ppo for lunarlander
6
+ # TRANSFORMERS_OFFLINE=1 \
7
+ # CUDA_VISIBLE_DEVICES=3 python RL_based/train_PPO.py --env_name LunarLander-v2 --init_summarizer lunarLander_init_translator --curr_summarizer lunarLander_basic_translator \
8
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 128 --lr 0.0003 --lr_decay --epoch 500
9
+
10
+ # ppo for Acrobot-v1
11
+ # TRANSFORMERS_OFFLINE=1 \
12
+ # CUDA_VISIBLE_DEVICES=0 python RL_based/train_PPO.py --env_name Acrobot-v1 --init_summarizer acrobot_init_translator --curr_summarizer acrobot_basic_translator --decider naive_actor --prompt_level 1\
13
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 128 --lr 0.0003 --lr_decay --epoch 500 &
14
+
15
+ # # # ppo for MountainCar-v0
16
+ # TRANSFORMERS_OFFLINE=1 \
17
+ # CUDA_VISIBLE_DEVICES=1 python RL_based/train_PPO.py --env_name MountainCar-v0 --init_summarizer mountaincar_init_translator --curr_summarizer mountaincar_basic_translator --decider naive_actor --prompt_level 1\
18
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 300 --lr 0.0003 --lr_decay --epoch 500 &
19
+
20
+ # ppo for Blackjack-v1
21
+ # TRANSFORMERS_OFFLINE=1 \
22
+ # CUDA_VISIBLE_DEVICES=2 python RL_based/train_PPO.py --env_name Blackjack-v1 --init_summarizer blackjack_init_translator --curr_summarizer blackjack_basic_translator --decider naive_actor --prompt_level 1\
23
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 300 --lr 0.0003 --lr_decay --epoch 500 &
24
+
25
+ # # ppo for Taxi-v3
26
+ TRANSFORMERS_OFFLINE=1 \
27
+ CUDA_VISIBLE_DEVICES=6 python RL_based/train_PPO.py --env_name Taxi-v3 --init_summarizer taxi_init_translator --curr_summarizer taxi_basic_translator --decider naive_actor --prompt_level 1\
28
+ --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 300 --lr 0.0003 --lr_decay --epoch 500 --taxi_specific_env
29
+
30
+ # # ppo for CliffWalking-v0
31
+ # TRANSFORMERS_OFFLINE=1 \
32
+ # CUDA_VISIBLE_DEVICES=4 python RL_based/train_PPO.py --env_name CliffWalking-v0 --init_summarizer cliffwalking_init_translator --curr_summarizer cliffwalking_basic_translator --decider naive_actor --prompt_level 1\
33
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 300 --lr 0.0003 --lr_decay --epoch 500 &
34
+
35
+ # # ppo for FrozenLake-v1
36
+ # TRANSFORMERS_OFFLINE=1 \
37
+ # CUDA_VISIBLE_DEVICES=5 python RL_based/train_PPO.py --env_name FrozenLake-v1 --init_summarizer frozenlake_init_translator --curr_summarizer frozenlake_basic_translator --decider naive_actor --prompt_level 1\
38
+ # --trans_model_name /home/ubuntu/LLM-Decider-Bench/RL_based/transformer_offline_distilbert --model_name nn_embedding --max_length 300 --lr 0.0003 --lr_decay --epoch 500 &
39
+
RL_based/utils.py ADDED
@@ -0,0 +1,621 @@
1
+ import sys
2
+ import numpy as np
3
+ import torch
4
+ from torch import nn
5
+ sys.path.insert(0, sys.path[0]+"/../")
6
+ from typing import (
7
+ Any,
8
+ Dict,
9
+ List,
10
+ Optional,
11
+ Sequence,
12
+ Tuple,
13
+ Type,
14
+ Union,
15
+ no_type_check,
16
+ )
17
+ import torch.nn as nn
18
+ from tianshou.utils.net.discrete import NoisyLinear
19
+ ModuleType = Type[nn.Module]
20
+ import random
21
+ from collections import namedtuple, deque
22
+ from itertools import count
23
+ import math
24
+ import torch
25
+ import torch.optim as optim
26
+ from transformers import AutoModel, AutoTokenizer
27
+ import torch.nn.functional as F
28
+ from tianshou.utils.net.common import ModuleType, Net, MLP
29
+
30
+
31
+ def bert_embedding(x, max_length=512, device='cuda'):
32
+ from transformers import logging
33
+ logging.set_verbosity_error()
34
+ model_name = 'bert-base-uncased'
35
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
36
+ bert_model = AutoModel.from_pretrained(model_name)
37
+ text = x
38
+ if isinstance(text, np.ndarray):
39
+ text = list(text)
40
+ tokens = tokenizer(text, max_length=max_length, padding='max_length', truncation=True, return_tensors='pt')
41
+ input_ids = tokens['input_ids']
42
+ attention_mask = tokens['attention_mask']
43
+ with torch.no_grad():
44
+ outputs = bert_model(input_ids, attention_mask=attention_mask)
45
+ embeddings = outputs.last_hidden_state
46
+ return embeddings
47
+
48
+ class Net_GRU(nn.Module):
49
+
50
+ def __init__(self, input_size, n_actions, hidden_dim, n_layers, dropout, bidirectional):
51
+ super(Net_GRU, self).__init__()
52
+ self.input_size = input_size
53
+ self.hidden_dim = hidden_dim
54
+ self.num_classes = n_actions
55
+ self.n_layers = n_layers
56
+ self.dropout = dropout
57
+ self.bidirectional = bidirectional
58
+
59
+ # Layers
60
+ self.gru = nn.GRU(self.input_size, self.hidden_dim, self.n_layers,
61
+ batch_first=True, dropout=self.dropout, bidirectional=self.bidirectional)
62
+ self.final_layer = nn.Linear(self.hidden_dim*(1 + int(self.bidirectional)), self.num_classes)
63
+
64
+ def forward(self, x):
65
+ # Input shape: (batch_size, seq_length)
66
+ batch_size, seq_length, emb_size = x.size()
67
+
68
+ gru_out, hidden = self.gru(x)
69
+
70
+ # Use the final state
71
+ # hidden -> (num_direction, batch, hidden_size)
72
+ if self.bidirectional:
73
+ hidden = hidden.view(self.n_layers, 2, batch_size, self.hidden_dim)
74
+ final_hidden = torch.cat((hidden[:, -1, :, :].squeeze(0), hidden[:, 0, :, :].squeeze(0)), 1)
75
+ else:
76
+ final_hidden = hidden.squeeze(0)
77
+
78
+ # final_hidden -> (batch_size, num_classes)
79
+ logits = self.final_layer(final_hidden)
80
+
81
+ return logits
82
+
83
+ class MyGRU(nn.Module):
84
+ def __init__(self, input_size, hidden_dim, n_layers, dropout, bidirectional, output_dim):
85
+ super(MyGRU, self).__init__()
86
+ self.input_size = input_size
87
+ self.hidden_dim = hidden_dim
88
+ self.n_layers = n_layers
89
+ self.dropout = dropout
90
+ self.bidirectional = bidirectional
91
+
92
+ # Layers
93
+ self.gru = nn.GRU(self.input_size, self.hidden_dim, self.n_layers,
94
+ batch_first=True, dropout=self.dropout, bidirectional=self.bidirectional)
95
+ self.final_layer = nn.Linear(self.hidden_dim*(1 + int(self.bidirectional)), output_dim)
96
+
97
+ def forward(self, x):
98
+ batch_size, seq_length, emb_size = x.size()
99
+
100
+ gru_out, hidden = self.gru(x)
101
+
102
+ # Use the final state
103
+ # hidden -> (num_direction, batch, hidden_size)
104
+ if self.bidirectional:
105
+ hidden = hidden.view(self.n_layers, 2, batch_size, self.hidden_dim)
106
+ final_hidden = torch.cat((hidden[:, -1, :, :].squeeze(0), hidden[:, 0, :, :].squeeze(0)), 1)
107
+ else:
108
+ final_hidden = hidden.squeeze(0)
109
+
110
+ # final_hidden -> (batch_size, num_classes)
111
+ logits = self.final_layer(final_hidden)
112
+
113
+ return logits
114
+
115
+ class MyCNN(nn.Module):
116
+ def __init__(self,
117
+ input_dim: int,
118
+ output_dim: int = 0,
119
+ hidden_sizes: Sequence[int] = (),
120
+ norm_layer: Optional[Union[ModuleType, Sequence[ModuleType]]] = None,
121
+ activation: ModuleType = nn.ReLU,
122
+ device: Optional[Union[str, int, torch.device]] = None,
123
+ linear_layer: Type[nn.Linear] = nn.Linear,
124
+ flatten_input: bool = True,) -> None:
125
+ super().__init__()
126
+ self.model = []
127
+ input_dim_temp = input_dim
128
+ for h in hidden_sizes:
129
+ self.model.append(nn.Conv1d(in_channels=input_dim_temp, out_channels=h, kernel_size=3, padding=1))
130
+ self.model.append(activation())
131
+ self.model.append(nn.MaxPool1d(kernel_size=2))
132
+ input_dim_temp = h
133
+ self.model = nn.Sequential(*self.model)
134
+ self.fc = nn.Linear(in_features=input_dim_temp, out_features=output_dim)
135
+
136
+ def forward(self, x):
137
+ x = self.model(x.transpose(1, 2))
138
+ x.transpose_(1, 2)
139
+ x = self.fc(x)
140
+ return x
141
+
142
+ class Net_GRU_Bert_tianshou(Net):
143
+ def __init__(
144
+ self,
145
+ state_shape: Union[int, Sequence[int]],
146
+ action_shape: Union[int, Sequence[int]] = 0,
147
+ hidden_sizes: Sequence[int] = (),
148
+ norm_layer: Optional[ModuleType] = None,
149
+ activation: Optional[ModuleType] = nn.ReLU,
150
+ device: Union[str, int, torch.device] = "cpu",
151
+ softmax: bool = False,
152
+ concat: bool = False,
153
+ num_atoms: int = 1,
154
+ dueling_param: Optional[Tuple[Dict[str, Any], Dict[str, Any]]] = None,
155
+ linear_layer: Type[nn.Linear] = nn.Linear,
156
+ hidden_dim: int = 128,
157
+ bidirectional: bool = True,
158
+ dropout: float = 0.,
159
+ n_layers: int = 1,
160
+ max_length: int = 512,
161
+ trans_model_name: str = 'bert-base-uncased',
162
+ ) -> None:
163
+ nn.Module.__init__(self)
164
+ self.device = device
165
+ self.softmax = softmax
166
+ self.num_atoms = num_atoms
167
+ self.hidden_dim = hidden_dim
168
+ self.bidirectional = bidirectional
169
+ self.dropout = dropout
170
+ self.n_layers = n_layers
171
+ self.trans_model_name = trans_model_name
172
+ self.max_length = max_length
173
+
174
+ input_dim = int(np.prod(state_shape))
175
+ action_dim = int(np.prod(action_shape)) * num_atoms
176
+ if concat:
177
+ input_dim += action_dim
178
+ self.use_dueling = dueling_param is not None
179
+ output_dim = action_dim if not self.use_dueling and not concat else 0
180
+ self.output_dim = output_dim or hidden_dim
181
+ self.model = MyGRU(768, self.hidden_dim, self.n_layers,
182
+ self.dropout, self.bidirectional, self.output_dim)
183
+ if self.use_dueling: # dueling DQN
184
+ q_kwargs, v_kwargs = dueling_param # type: ignore
185
+ q_output_dim, v_output_dim = 0, 0
186
+ if not concat:
187
+ q_output_dim, v_output_dim = action_dim, num_atoms
188
+ q_kwargs: Dict[str, Any] = {
189
+ **q_kwargs, "input_dim": self.output_dim,
190
+ "output_dim": q_output_dim,
191
+ "device": self.device
192
+ }
193
+ v_kwargs: Dict[str, Any] = {
194
+ **v_kwargs, "input_dim": self.output_dim,
195
+ "output_dim": v_output_dim,
196
+ "device": self.device
197
+ }
198
+ self.Q, self.V = MLP(**q_kwargs), MLP(**v_kwargs)
199
+ self.output_dim = self.Q.output_dim
200
+ self.bert_model = AutoModel.from_pretrained(self.trans_model_name).to(self.device)
201
+ self.tokenizer = AutoTokenizer.from_pretrained(trans_model_name)
202
+ from transformers import logging
203
+ logging.set_verbosity_error()
204
+
205
+ def bert_embedding(self, x, max_length=512):
206
+ text = x
207
+ if isinstance(text, np.ndarray):
208
+ text = list(text)
209
+ tokens = self.tokenizer(text, max_length=max_length, padding='max_length', truncation=True, return_tensors='pt')
210
+ input_ids = tokens['input_ids'].to(self.device)
211
+ attention_mask = tokens['attention_mask'].to(self.device)
212
+ with torch.no_grad():
213
+ outputs = self.bert_model(input_ids, attention_mask=attention_mask)
214
+ embeddings = outputs.last_hidden_state
215
+ return embeddings
216
+
217
+ def forward(
218
+ self,
219
+ obs: Union[np.ndarray, torch.Tensor],
220
+ state: Any = None,
221
+ info: Dict[str, Any] = {},
222
+ ) -> Tuple[torch.Tensor, Any]:
223
+ """Mapping: obs -> flatten (inside MLP)-> logits."""
224
+ embedding = self.bert_embedding(obs, max_length=self.max_length)
225
+ logits = self.model(embedding)
226
+ bsz = logits.shape[0]
227
+ if self.use_dueling: # Dueling DQN
228
+ q, v = self.Q(logits), self.V(logits)
229
+ if self.num_atoms > 1:
230
+ q = q.view(bsz, -1, self.num_atoms)
231
+ v = v.view(bsz, -1, self.num_atoms)
232
+ logits = q - q.mean(dim=1, keepdim=True) + v
233
+ elif self.num_atoms > 1:
234
+ logits = logits.view(bsz, -1, self.num_atoms)
235
+ if self.softmax:
236
+ logits = torch.softmax(logits, dim=-1)
237
+ return logits, state
238
+
239
+ class Net_Bert_CLS_tianshou(Net):
240
+ def __init__(
241
+ self,
242
+ state_shape: Union[int, Sequence[int]],
243
+ action_shape: Union[int, Sequence[int]] = 0,
244
+ hidden_sizes: Sequence[int] = (),
245
+ norm_layer: Optional[ModuleType] = None,
246
+ activation: Optional[ModuleType] = nn.ReLU,
247
+ device: Union[str, int, torch.device] = "cpu",
248
+ softmax: bool = False,
249
+ concat: bool = False,
250
+ num_atoms: int = 1,
251
+ dueling_param: Optional[Tuple[Dict[str, Any], Dict[str, Any]]] = None,
252
+ linear_layer: Type[nn.Linear] = nn.Linear,
253
+ hidden_dim: int = 128,
254
+ bidirectional: bool = True,
255
+ dropout: float = 0.,
256
+ n_layers: int = 1,
257
+ max_length: int = 512,
258
+ trans_model_name: str = 'bert-base-uncased',
259
+ ) -> None:
260
+ nn.Module.__init__(self)
261
+ self.device = device
262
+ self.softmax = softmax
263
+ self.num_atoms = num_atoms
264
+ self.hidden_dim = hidden_dim
265
+ self.bidirectional = bidirectional
266
+ self.dropout = dropout
267
+ self.n_layers = n_layers
268
+ self.trans_model_name = trans_model_name
269
+ self.max_length = max_length
270
+
271
+ input_dim = int(np.prod(state_shape))
272
+ action_dim = int(np.prod(action_shape)) * num_atoms
273
+ if concat:
274
+ input_dim += action_dim
275
+ self.use_dueling = dueling_param is not None
276
+ output_dim = action_dim if not self.use_dueling and not concat else 0
277
+ self.output_dim = output_dim or hidden_dim
278
+ self.model = MLP(768, output_dim, hidden_sizes, norm_layer, activation, device, linear_layer)
279
+ if self.use_dueling: # dueling DQN
280
+ q_kwargs, v_kwargs = dueling_param # type: ignore
281
+ q_output_dim, v_output_dim = 0, 0
282
+ if not concat:
283
+ q_output_dim, v_output_dim = action_dim, num_atoms
284
+ q_kwargs: Dict[str, Any] = {
285
+ **q_kwargs, "input_dim": self.output_dim,
286
+ "output_dim": q_output_dim,
287
+ "device": self.device
288
+ }
289
+ v_kwargs: Dict[str, Any] = {
290
+ **v_kwargs, "input_dim": self.output_dim,
291
+ "output_dim": v_output_dim,
292
+ "device": self.device
293
+ }
294
+ self.Q, self.V = MLP(**q_kwargs), MLP(**v_kwargs)
295
+ self.output_dim = self.Q.output_dim
296
+ self.bert_model = AutoModel.from_pretrained(self.trans_model_name).to(self.device)
297
+ self.tokenizer = AutoTokenizer.from_pretrained(trans_model_name)
298
+ from transformers import logging
299
+ logging.set_verbosity_error()
300
+
301
+ def bert_CLS_embedding(self, x, max_length=512):
302
+ text = x
303
+ if isinstance(text, np.ndarray):
304
+ text = list(text)
305
+ tokens = self.tokenizer(text, max_length=max_length, padding='max_length', truncation=True, return_tensors='pt')
306
+ input_ids = tokens['input_ids'].to(self.device)
307
+ attention_mask = tokens['attention_mask'].to(self.device)
308
+ with torch.no_grad():
309
+ outputs = self.bert_model(input_ids, attention_mask=attention_mask)
310
+ embeddings = outputs[0][:, 0, :]
311
+ return embeddings
312
+
313
+ def forward(
314
+ self,
315
+ obs: Union[np.ndarray, torch.Tensor],
316
+ state: Any = None,
317
+ info: Dict[str, Any] = {},
318
+ ) -> Tuple[torch.Tensor, Any]:
319
+ """Mapping: obs -> flatten (inside MLP)-> logits."""
320
+ embedding = self.bert_CLS_embedding(obs, max_length=self.max_length)
321
+ logits = self.model(embedding)
322
+ bsz = logits.shape[0]
323
+ if self.use_dueling: # Dueling DQN
324
+ q, v = self.Q(logits), self.V(logits)
325
+ if self.num_atoms > 1:
326
+ q = q.view(bsz, -1, self.num_atoms)
327
+ v = v.view(bsz, -1, self.num_atoms)
328
+ logits = q - q.mean(dim=1, keepdim=True) + v
329
+ elif self.num_atoms > 1:
330
+ logits = logits.view(bsz, -1, self.num_atoms)
331
+ if self.softmax:
332
+ logits = torch.softmax(logits, dim=-1)
333
+ return logits, state
334
+
335
+
336
+ class Net_Bert_CNN_tianshou(Net_GRU_Bert_tianshou):
337
+ def __init__(
338
+ self,
339
+ state_shape: Union[int, Sequence[int]],
340
+ action_shape: Union[int, Sequence[int]] = 0,
341
+ hidden_sizes: Sequence[int] = (),
342
+ norm_layer: Optional[ModuleType] = None,
343
+ activation: Optional[ModuleType] = nn.ReLU,
344
+ device: Union[str, int, torch.device] = "cpu",
345
+ softmax: bool = False,
346
+ concat: bool = False,
347
+ num_atoms: int = 1,
348
+ dueling_param: Optional[Tuple[Dict[str, Any], Dict[str, Any]]] = None,
349
+ linear_layer: Type[nn.Linear] = nn.Linear,
350
+ hidden_dim: int = 128,
351
+ bidirectional: bool = True,
352
+ dropout: float = 0.,
353
+ n_layers: int = 1,
354
+ max_length: int = 512,
355
+ trans_model_name: str = 'bert-base-uncased',
356
+ ) -> None:
357
+ nn.Module.__init__(self)
358
+ self.device = device
359
+ self.softmax = softmax
360
+ self.num_atoms = num_atoms
361
+ self.hidden_dim = hidden_dim
362
+ self.bidirectional = bidirectional
363
+ self.dropout = dropout
364
+ self.n_layers = n_layers
365
+ self.trans_model_name = trans_model_name
366
+ self.max_length = max_length
367
+
368
+ input_dim = int(np.prod(state_shape))
369
+ action_dim = int(np.prod(action_shape)) * num_atoms
370
+ if concat:
371
+ input_dim += action_dim
372
+ self.use_dueling = dueling_param is not None
373
+ output_dim = action_dim if not self.use_dueling and not concat else 0
374
+ self.output_dim = output_dim or hidden_dim
375
+ self.model = MyCNN(768, output_dim, hidden_sizes, norm_layer, activation, device, linear_layer, flatten_input=False)
376
+ if self.use_dueling: # dueling DQN
377
+ q_kwargs, v_kwargs = dueling_param # type: ignore
378
+ q_output_dim, v_output_dim = 0, 0
379
+ if not concat:
380
+ q_output_dim, v_output_dim = action_dim, num_atoms
381
+ q_kwargs: Dict[str, Any] = {
382
+ **q_kwargs, "input_dim": self.output_dim,
383
+ "output_dim": q_output_dim,
384
+ "device": self.device
385
+ }
386
+ v_kwargs: Dict[str, Any] = {
387
+ **v_kwargs, "input_dim": self.output_dim,
388
+ "output_dim": v_output_dim,
389
+ "device": self.device
390
+ }
391
+ self.Q, self.V = MLP(**q_kwargs), MLP(**v_kwargs)
392
+ self.output_dim = self.Q.output_dim
393
+ self.bert_model = AutoModel.from_pretrained(self.trans_model_name).to(self.device)
394
+ self.tokenizer = AutoTokenizer.from_pretrained(trans_model_name)
395
+ from transformers import logging
396
+ logging.set_verbosity_error()
397
+
398
+ class DQN_GRU(nn.Module):
399
+ """Reference: Human-level control through deep reinforcement learning.
400
+ """
401
+
402
+ def __init__(
403
+ self,
404
+ state_shape: Union[int, Sequence[int]],
405
+ action_shape: Sequence[int],
406
+ device: Union[str, int, torch.device] = "cpu",
407
+ features_only: bool = False,
408
+ output_dim: Optional[int] = None,
409
+ hidden_dim: int = 128,
410
+ n_layers: int = 1,
411
+ dropout: float = 0.,
412
+ bidirectional: bool = True,
413
+ trans_model_name: str = 'bert-base-uncased',
414
+ max_length: int = 512,
415
+ ) -> None:
416
+ super().__init__()
417
+ self.device = device
418
+ self.max_length = max_length
419
+ action_dim = int(np.prod(action_shape))
420
+ self.net = MyGRU(768, hidden_dim, n_layers, dropout, bidirectional,
421
+ hidden_dim)
422
+ if not features_only:
423
+ self.net = MyGRU(768, hidden_dim, n_layers, dropout, bidirectional,
424
+ action_dim)
425
+ self.output_dim = action_dim
426
+ elif output_dim is not None:
427
+ self.net = MyGRU(768, hidden_dim, n_layers, dropout, bidirectional,
428
+ output_dim)
429
+ self.output_dim = output_dim
430
+ else:
431
+ self.net = MyGRU(768, hidden_dim, n_layers, dropout, bidirectional,
432
+ hidden_dim)
433
+ self.output_dim = hidden_dim
434
+ self.trans_model_name = trans_model_name
435
+ self.bert_model = AutoModel.from_pretrained(self.trans_model_name).to(self.device)
436
+ self.tokenizer = AutoTokenizer.from_pretrained(trans_model_name)
437
+ from transformers import logging
438
+ logging.set_verbosity_error()
439
+
440
+ def bert_embedding(self, x, max_length=512):
441
+ text = x
442
+ if isinstance(text, np.ndarray):
443
+ text = list(text)
444
+ tokens = self.tokenizer(text, max_length=max_length, padding='max_length', truncation=True, return_tensors='pt')
445
+ input_ids = tokens['input_ids'].to(self.device)
446
+ attention_mask = tokens['attention_mask'].to(self.device)
447
+ with torch.no_grad():
448
+ outputs = self.bert_model(input_ids, attention_mask=attention_mask)
449
+ embeddings = outputs.last_hidden_state
450
+ return embeddings
451
+
452
+ def forward(
453
+ self,
454
+ obs: Union[np.ndarray, torch.Tensor],
455
+ state: Optional[Any] = None,
456
+ info: Dict[str, Any] = {},
457
+ ) -> Tuple[torch.Tensor, Any]:
458
+ r"""Mapping: s -> Q(s, \*)."""
459
+ embedding = self.bert_embedding(obs, max_length=self.max_length)
460
+ return self.net(embedding), state
461
+
462
+ class Rainbow_GRU(DQN_GRU):
463
+ """Reference: Rainbow: Combining Improvements in Deep Reinforcement Learning.
464
+ """
465
+
466
+ def __init__(
467
+ self,
468
+ state_shape: Union[int, Sequence[int]],
469
+ action_shape: Sequence[int],
470
+ num_atoms: int = 51,
471
+ noisy_std: float = 0.5,
472
+ device: Union[str, int, torch.device] = "cpu",
473
+ is_dueling: bool = True,
474
+ is_noisy: bool = True,
475
+ output_dim: Optional[int] = None,
476
+ hidden_dim: int = 128,
477
+ n_layers: int = 1,
478
+ dropout: float = 0.,
479
+ bidirectional: bool = True,
480
+ trans_model_name: str = 'bert-base-uncased',
481
+ max_length: int = 512,
482
+ ) -> None:
483
+ super().__init__(state_shape, action_shape, device, features_only=True,
484
+ output_dim=output_dim, hidden_dim=hidden_dim, n_layers=n_layers,
485
+ dropout=dropout, bidirectional=bidirectional, trans_model_name=trans_model_name)
486
+ self.action_num = np.prod(action_shape)
487
+ self.num_atoms = num_atoms
488
+
489
+ def linear(x, y):
490
+ if is_noisy:
491
+ return NoisyLinear(x, y, noisy_std)
492
+ else:
493
+ return nn.Linear(x, y)
494
+
495
+ self.Q = nn.Sequential(
496
+ linear(self.output_dim, 512), nn.ReLU(inplace=True),
497
+ linear(512, self.action_num * self.num_atoms)
498
+ )
499
+ self._is_dueling = is_dueling
500
+ if self._is_dueling:
501
+ self.V = nn.Sequential(
502
+ linear(self.output_dim, 512), nn.ReLU(inplace=True),
503
+ linear(512, self.num_atoms)
504
+ )
505
+ self.output_dim = self.action_num * self.num_atoms
506
+
507
+ def forward(
508
+ self,
509
+ obs: Union[np.ndarray, torch.Tensor],
510
+ state: Optional[Any] = None,
511
+ info: Dict[str, Any] = {},
512
+ ) -> Tuple[torch.Tensor, Any]:
513
+ r"""Mapping: x -> Z(x, \*)."""
514
+ obs, state = super().forward(obs)
515
+ q = self.Q(obs)
516
+ q = q.view(-1, self.action_num, self.num_atoms)
517
+ if self._is_dueling:
518
+ v = self.V(obs)
519
+ v = v.view(-1, 1, self.num_atoms)
520
+ logits = q - q.mean(dim=1, keepdim=True) + v
521
+ else:
522
+ logits = q
523
+ probs = logits.softmax(dim=2)
524
+ return probs, state
525
+
526
+ class Net_GRU_nn_emb_tianshou(Net):
527
+
528
+ def __init__(
529
+ self,
530
+ action_shape: Union[int, Sequence[int]] = 0,
531
+ hidden_sizes: Sequence[int] = (),
532
+ norm_layer: Optional[ModuleType] = None,
533
+ activation: Optional[ModuleType] = nn.ReLU,
534
+ device: Union[str, int, torch.device] = "cpu",
535
+ softmax: bool = False,
536
+ concat: bool = False,
537
+ num_atoms: int = 1,
538
+ dueling_param: Optional[Tuple[Dict[str, Any], Dict[str, Any]]] = None,
539
+ linear_layer: Type[nn.Linear] = nn.Linear,
540
+ hidden_dim: int = 128,
541
+ bidirectional: bool = True,
542
+ dropout: float = 0.,
543
+ n_layers: int = 1,
544
+ max_length: int = 512,
545
+ trans_model_name: str = 'bert-base-uncased',
546
+ word_emb_dim: int = 128,
547
+ ) -> None:
548
+ nn.Module.__init__(self)
549
+ self.device = device
550
+ self.softmax = softmax
551
+ self.num_atoms = num_atoms
552
+ self.hidden_dim = hidden_dim
553
+ self.bidirectional = bidirectional
554
+ self.dropout = dropout
555
+ self.n_layers = n_layers
556
+ self.trans_model_name = trans_model_name
557
+ self.max_length = max_length
558
+
559
+ action_dim = int(np.prod(action_shape)) * num_atoms
560
+ self.use_dueling = dueling_param is not None
561
+ output_dim = action_dim if not self.use_dueling and not concat else 0
562
+ self.output_dim = output_dim or hidden_dim
563
+
564
+ self.tokenizer = AutoTokenizer.from_pretrained(trans_model_name)
565
+ from transformers import logging
566
+ logging.set_verbosity_error()
567
+ self.vocab_size = self.tokenizer.vocab_size
568
+ self.embedding = nn.Embedding(self.vocab_size, word_emb_dim)
569
+ self.model = MyGRU(word_emb_dim, self.hidden_dim, self.n_layers,
570
+ self.dropout, self.bidirectional, self.output_dim)
571
+ if self.use_dueling: # dueling DQN
572
+ q_kwargs, v_kwargs = dueling_param # type: ignore
573
+ q_output_dim, v_output_dim = 0, 0
574
+ if not concat:
575
+ q_output_dim, v_output_dim = action_dim, num_atoms
576
+ q_kwargs: Dict[str, Any] = {
577
+ **q_kwargs, "input_dim": self.output_dim,
578
+ "output_dim": q_output_dim,
579
+ "device": self.device
580
+ }
581
+ v_kwargs: Dict[str, Any] = {
582
+ **v_kwargs, "input_dim": self.output_dim,
583
+ "output_dim": v_output_dim,
584
+ "device": self.device
585
+ }
586
+ self.Q, self.V = MLP(**q_kwargs), MLP(**v_kwargs)
587
+ self.output_dim = self.Q.output_dim
588
+
589
+
590
+ def forward(
591
+ self,
592
+ obs: Union[np.ndarray, torch.Tensor],
593
+ state: Any = None,
594
+ info: Dict[str, Any] = {},
595
+ ) -> Tuple[torch.Tensor, Any]:
596
+ """Mapping: obs -> flatten (inside MLP)-> logits."""
597
+ if isinstance(obs, np.ndarray):
598
+ text = list(obs)
599
+ else:
600
+ text = obs
601
+ tokens = self.tokenizer(text, max_length=self.max_length, padding='max_length', truncation=True, return_tensors='pt')
602
+ input_ids = tokens['input_ids'].to(self.device)
603
+ attention_mask = tokens['attention_mask'].to(self.device)
604
+ embedding = self.embedding(input_ids)
605
+ mask = attention_mask.unsqueeze(-1).expand(embedding.size()).float()
606
+ embedding = embedding * mask
607
+ logits = self.model(embedding)
608
+ bsz = logits.shape[0]
609
+ if self.use_dueling: # Dueling DQN
610
+ q, v = self.Q(logits), self.V(logits)
611
+ if self.num_atoms > 1:
612
+ q = q.view(bsz, -1, self.num_atoms)
613
+ v = v.view(bsz, -1, self.num_atoms)
614
+ logits = q - q.mean(dim=1, keepdim=True) + v
615
+ elif self.num_atoms > 1:
616
+ logits = logits.view(bsz, -1, self.num_atoms)
617
+ if self.softmax:
618
+ logits = torch.softmax(logits, dim=-1)
619
+ return logits, state
620
+
621
+
deciders/__init__.py ADDED
@@ -0,0 +1,26 @@
1
+
2
+ from .act import NaiveAct, RandomAct
3
+ from .selfask import SelfAskAct
4
+ from .pal import PAL
5
+ from .cot import ChainOfThought
6
+ from .self_consistency import SelfConsistency
7
+ from .spp import SPP
8
+ from .reflexion import Reflexion
9
+ from .jarvis import Jarvis
10
+ from .jarvis_without_insights import JarvisWithoutInsight
11
+ from .jarvis_without_suggestions import JarvisWithoutSuggestions
12
+ from .jarvis_without_shortmem import JarvisWithoutShortMem
13
+
14
+ REGISTRY = {}
15
+ REGISTRY['random_actor'] = RandomAct
16
+ REGISTRY['naive_actor'] = NaiveAct
17
+ REGISTRY['selfask_actor'] = SelfAskAct
18
+ REGISTRY['pal_actor'] = PAL
19
+ REGISTRY['cot_actor'] = ChainOfThought
20
+ REGISTRY['self_consistency_actor'] = SelfConsistency
21
+ REGISTRY['spp_actor'] = SPP
22
+ REGISTRY['reflexion_actor'] = Reflexion
23
+ REGISTRY['jarvis_actor'] = Jarvis
24
+ REGISTRY['jarvis_actor_woi'] = JarvisWithoutInsight
25
+ REGISTRY['jarvis_actor_wosug'] = JarvisWithoutSuggestions
26
+ REGISTRY['jarvis_actor_wosh'] = JarvisWithoutShortMem
deciders/act.py ADDED
@@ -0,0 +1,248 @@
1
+ # This file contains functions for interacting with the ChatGPT model
2
+
3
+ import openai
4
+ from .gpt import gpt
5
+ from loguru import logger
6
+ from .parser import PARSERS
7
+ from langchain.output_parsers import PydanticOutputParser
8
+ from langchain.output_parsers import OutputFixingParser
9
+ from langchain.chat_models import AzureChatOpenAI, ChatOpenAI
10
+ from memory.env_history import EnvironmentHistory
11
+ import tiktoken
12
+ import json
13
+ import re
14
+ from .utils import run_chain
15
+
16
+ class RandomAct():
17
+ def __init__(self, action_space):
18
+ self.action_space = action_space
19
+
20
+ def act(self, state_description, action_description, env_info, game_description=None, goal_description=None):
21
+ return self.action_space.sample()+1, '', '', '', 0, 0
22
+
23
+ class NaiveAct(gpt):
24
+ def __init__(self, action_space, args, prompts, distiller, temperature=0.0, max_tokens=512, logger=None):
25
+ self.action_space = action_space
26
+ self.temperature = temperature
27
+ self.action_desc_dict = args.action_desc_dict
28
+ self.args = args
29
+ self.prompts = prompts
30
+ self.max_tokens = max_tokens
31
+ self.prompt_level = args.prompt_level
32
+ if args.gpt_version == "gpt-35-turbo":
33
+ model = "gpt-3.5-turbo"
34
+ else:
35
+ model = args.gpt_version
36
+ self.encoding = tiktoken.encoding_for_model(model)
37
+ super().__init__()
38
+ self.distiller = distiller
39
+ self.fewshot_example_initialization(args.prompt_level, args.prompt_path, distiller = self.distiller)
40
+ self.default_action = 1
41
+ self.parser = self._parser_initialization()
42
+ self.irr_game_description = ''
43
+ self.memory = []
44
+ self.env_history = EnvironmentHistory()
45
+ self.first_call = True
46
+ self.logger = logger
47
+ if self.prompt_level in [2, 4]:
48
+ self.memory = self.summarized_fewshot_example
49
+ if args.use_short_mem == 1:
50
+ self.use_short_mem = True
51
+ self.mem_num = self.args.trajectories_num
52
+ else:
53
+ self.use_short_mem = False
54
+ self.mem_num = 0
55
+
56
+ def num_tokens_from_string(self,string: str) -> int:
57
+ """Returns the number of tokens in a text string."""
58
+ num_tokens = len(self.encoding.encode(string))
59
+ return num_tokens
60
+
61
+ def update_mem(self,):
62
+ traj = "Firstly, the description and the goal of the task will be provided. Please pay close attention to comprehend the information presented below.\n"
63
+ traj += "Task Description: " + self.game_description + '\n'
64
+ traj += "Goal Description: " + self.goal_description + '\n'
65
+ traj += self.action_description
66
+ traj += "Below is the historical data for this round of the game, which includes the state and corresponding action for each step.\n"
67
+ traj += str(self.env_history)
68
+ # print(traj)
69
+ self._update_mem(traj)
70
+
71
+ def _update_mem(self, traj):
72
+ my_reflection = self.distiller.generate(traj, self.memory)
73
+ self.memory.append(my_reflection)
74
+ self.env_history.reset()
75
+
76
+ def clear_mem(self):
77
+ self.pre_memory = []
78
+ self.post_memory = []
79
+ self.is_first = True
80
+ self._update_mem(None)
81
+
82
+
83
+ def _parser_initialization(self):
84
+ if hasattr(self.action_space, 'n'):
85
+ assert self.action_space.n in PARSERS.keys(), f'Action space {self.action_space} is not supported.'
86
+ num_action = self.action_space.n
87
+ else:
88
+ num_action = 1
89
+
90
+ # autofixing_chat = AzureChatOpenAI(
91
+ # openai_api_type=openai.api_type,
92
+ # openai_api_version=openai.api_version,
93
+ # openai_api_base=openai.api_base,
94
+ # openai_api_key=openai.api_key,
95
+ # deployment_name="gpt-35-turbo",
96
+ # temperature=self.temperature,
97
+ # max_tokens=self.max_tokens
98
+ # )
99
+ autofixing_chat = ChatOpenAI(temperature=0, openai_api_key=openai.api_key)
100
+
101
+ parser = PydanticOutputParser(pydantic_object=PARSERS[num_action])
102
+ autofixing_parser = OutputFixingParser.from_llm(
103
+ llm=autofixing_chat, parser=parser)
104
+
105
+ return autofixing_parser
106
+
107
+ def fewshot_example_initialization(self, level, path=None, distiller=None):
108
+ self.fewshot_example = []
109
+ self.irr_few_shot_examples = []
110
+ self.prompt_level = level
111
+ self.expert_knowledge = None
112
+ if level in [1,3]:
113
+ self.irr_few_shot_examples = self.prompts.TASK_IRRELEVANT_PROMPTS
114
+ elif level == 5:
115
+ if hasattr(self.prompts, "expert_prompt"):
116
+ self.expert_knowledge = self.prompts.expert_prompt
117
+ self.fewshot_example = self.prompts.PERCEPTRON_BASIC_FS_EXAMPLES
118
+ else:
119
+ self.irr_few_shot_examples = self.prompts.TASK_IRRELEVANT_PROMPTS
120
+ json_file = f'{path}_l{level}.json'
121
+ with open(json_file, 'r') as infile:
122
+ data = json.load(infile)
123
+ max_step_num = 0
124
+ for traj in data:
125
+ traj_text = traj[0]['game_description']
126
+ traj_text += traj[0]['goal_description']
127
+ for i, transition in enumerate(traj):
128
+ traj_text += transition['observation']
129
+ traj_text += f"> {transition['action']}"
130
+ one_traj_token = self.num_tokens_from_string(traj_text)
131
+ if one_traj_token > 5000:
132
+ max_step_num = i+1
133
+ break
134
+ traj_text += f"Your performance is: {transition['cum_reward']}"
135
+ if not max_step_num:
136
+ max_step_num = 200
137
+ self.summarized_fewshot_example = self.distiller.generate_from_file(json_file,max_step_num=max_step_num)
138
+
139
+ def response(self, state_description, action_description, env_info, game_description=None, goal_description=None, fewshot_examples=None):
140
+ if env_info['future_summary']:
141
+ prompt = f"{game_description}\n{goal_description}\n{fewshot_examples}\n{state_description}\n{env_info['future_summary']}\n{action_description} "
142
+ else:
143
+ prompt = f"{game_description}\n{goal_description}\n{fewshot_examples}\nCurrent {state_description}\n{action_description} "
144
+ prompt += "Please select an action based on the current game state and the information you get. You must select the appropriate action from the given action descriptions and cannot refrain from taking action or performing any prohibited actions. Your Action is: "
145
+ print(f"prompt is {prompt}")
146
+ res = openai.Completion.create(
147
+ engine=self.args.gpt_version,
148
+ prompt=prompt,
149
+ temperature=self.temperature,
150
+ max_tokens=self.max_tokens,
151
+ )
152
+ return prompt, res
153
+
154
+ def _add_history_before_action(self, game_description, goal_description, state_description):
155
+ self.game_description = game_description
156
+ self.goal_description = goal_description
157
+ self.env_history.add("observation", state_description)
158
+ # print(self.env_history)
159
+ if len(self.env_history) >= 2:
160
+ one_history_token = self.num_tokens_from_string(self.env_history.get_one_history())
161
+ self.env_history.set_history(6000 // one_history_token)
162
+
163
+ def act(self, state_description, action_description, env_info, game_description=None, goal_description=None, logfile=None):
164
+ self._add_history_before_action(game_description, goal_description, state_description)
165
+ asking_round = 0
166
+ res = None
167
+ action = None
168
+ prompt = None
169
+ if not self.logger:
170
+ logger.remove()
171
+ self.logger = logger.add(logfile, colorize=True, enqueue=True)
172
+
173
+ if self.args.prompt_level == 5:
174
+ my_mem = ""
175
+ if self.fewshot_example:
176
+ my_mem += "Here are some examples of how you should complete a task."
177
+ for examples in self.fewshot_example:
178
+ my_mem += "\nQuestion: \n" + examples['question'] + "Answer: \n" + examples['answer']
179
+ my_mem += '\nNow you are in the task.\n'
180
+ elif self.args.prompt_level in [2,3,4]:
181
+ my_mem = ""
182
+ if self.prompt_level == 2:
183
+ my_mem += 'I have collected a few trajectories from a random policy, and the summaries are listed below.'
184
+ elif self.prompt_level == 3:
185
+ my_mem += 'I have collected a few trajectories before, and the summaries are listed below.'
186
+ elif self.prompt_level == 4:
187
+ my_mem += 'I have collected a few trajectories from an expert policy, and the summaries are listed below.'
188
+ my_mem += self._read_mem()
189
+ else:
190
+ my_mem = ""
191
+
192
+ if self.use_short_mem:
193
+ if len(self.env_history) > 1:
194
+ my_mem += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.'
195
+ my_mem += f"\nBelow are the latest {min(self.args.short_mem_num,len(self.env_history)//2)} historical data entries:\n"
196
+ my_mem += f"{self.env_history.get_histories(self.mem_num)}"
197
+
198
+ while asking_round < 3:
199
+ prompt, res = self.response(state_description, action_description, env_info, game_description, goal_description, my_mem)
200
+ action_str = res.choices[0].text.strip()
201
+ print(f'my answer is {action_str}')
202
+ # import pdb; pdb.set_trace()
203
+ try:
204
+ if "Continuous" in self.args.env_name:
205
+ action = float(re.findall(r"[-+]?\d*\.\d+", action_str)[0])
206
+
207
+ else:
208
+ action = int(re.findall(r"\d+", action_str)[0])
209
+ except:
210
+ action = None
211
+ asking_round += 1
212
+ continue
213
+
214
+ if "Continuous" not in self.args.env_name:
215
+ if (action-1) in self.action_space:
216
+ break
217
+ else:
218
+ asking_round += 1
219
+ action = None
220
+ else:
221
+ if action >= self.action_space.low and action <= self.action_space.high:
222
+ break
223
+ else:
224
+ asking_round += 1
225
+ action = None
226
+
227
+ if action is None:
228
+ print('error selecting action; falling back to the default action')
229
+ action = self.default_action
230
+ self._add_history_after_action(action)
231
+ self.logger.info(f'\n{prompt}')
232
+ self.logger.info(f'The GPT response is: {res}.')
233
+ self.logger.info(f'The optimal action is: {action}.')
234
+ return action, prompt, res, 0, 0
235
+
236
+ def _read_mem(self, ):
237
+ memory = self.memory
238
+ mem_str = ""
239
+ if len(memory) > 5:
240
+ memory = memory[-5:]
241
+ if len(memory) > 0:
242
+ mem_str += '\nYour memory for the task below:'
243
+ for i, m in enumerate(memory):
244
+ mem_str += f'\nTrial {i}:\n{m.strip()}'
245
+ return mem_str
246
+
247
+ def _add_history_after_action(self, action):
248
+ self.env_history.add('action', action)
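The retry loop in NaiveAct.act extracts an action from the model's free-text reply with regular expressions and validates it against the action space (discrete replies are 1-indexed, the environment itself is 0-indexed). A standalone sketch of that extraction step, not the repository's exact code:

import re

def parse_action(action_str: str, n_actions: int, continuous: bool = False):
    if continuous:
        match = re.findall(r"[-+]?\d*\.\d+", action_str)
        return float(match[0]) if match else None
    match = re.findall(r"\d+", action_str)
    if not match:
        return None
    action = int(match[0])
    # Valid discrete replies are 1..n_actions.
    return action if 1 <= action <= n_actions else None

print(parse_action("The optimal action is 2.", n_actions=4))  # 2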
deciders/cot.py ADDED
@@ -0,0 +1,147 @@
1
+ import openai
2
+ from .misc import history_to_str
3
+ from langchain.chat_models import AzureChatOpenAI
4
+ from langchain.prompts.chat import (
5
+ PromptTemplate,
6
+ ChatPromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ HumanMessagePromptTemplate,
9
+ )
10
+ from langchain.prompts.few_shot import FewShotPromptTemplate
11
+ from langchain import LLMChain
12
+ from loguru import logger
13
+ from langchain.callbacks import FileCallbackHandler
14
+ from langchain.callbacks import get_openai_callback
15
+ from .act import NaiveAct
16
+ from .utils import run_chain
17
+
18
+
19
+ class ChainOfThought(NaiveAct):
20
+ def __init__(self, action_space, args, prompts, distiller, temperature=0.1, max_tokens=None, logger=None):
21
+ super().__init__(action_space, args, prompts, distiller, temperature, max_tokens,logger)
22
+
23
+ def act(
24
+ self,
25
+ state_description,
26
+ action_description,
27
+ env_info,
28
+ game_description,
29
+ goal_description,
30
+ logfile=None,
31
+ ):
32
+ self.action_description = action_description
33
+ self._add_history_before_action(game_description, goal_description, state_description)
34
+ chat = AzureChatOpenAI(
35
+ openai_api_type=openai.api_type,
36
+ openai_api_version=openai.api_version,
37
+ openai_api_base=openai.api_base,
38
+ openai_api_key=openai.api_key,
39
+ deployment_name=self.args.gpt_version,
40
+ temperature=self.temperature,
41
+ max_tokens=self.max_tokens
42
+ )
43
+
44
+ suffix_flag = False
45
+ reply_format_description = \
46
+ "Your response should choose an optimal action from a valid action list and terminate with the following format: "
47
+
48
+ # System Message
49
+ human_template = "Now, you are completing a challenging task. You must carefully understand the Chain-of-Thought method you will use and apply it to the following task.\n"
50
+
51
+ # task-irrelevant SystemMessage
52
+ if self.irr_few_shot_examples:
53
+ human_template += 'In the following examples, I shall present sets of questions and answers with the Chain-of-Thought method. Please adhere to the format and reasoning of the provided responses when addressing the subsequent task.\n'
54
+ for i, examples in enumerate(self.irr_few_shot_examples):
55
+ human_template += f"\nExample {i+1}:\n"
56
+ human_template += "Question: \n" + examples['question'] + "\nAnswer: \n" + examples['answer']
57
+
58
+ # task-irrelevant few shot if have
59
+ if self.irr_few_shot_examples:
60
+ human_template += "\nMoving forward, I will describe the task, the goal, and the actions you may execute. Please pay close attention to comprehend the information presented below.\n"
61
+
62
+ if self.fewshot_example:
63
+ human_template += "I will describe the task, the goal, and the actions you may execute. Please pay close attention to comprehend the information presented below."
64
+ # print(fewshot_example_prompt.format(**fewshot_examples[0]))
65
+ human_template += '\nTask Description: {game_description} \n'
66
+ human_template += 'Goal Description: {goal_description}\n'
67
+ human_template += 'Actions Description: {action_description}\n'
68
+
69
+ if self.fewshot_example:
70
+ human_template += "Here, I will provide you with some guidance to help you better understand the rules of the task. Next are some examples: "
71
+ for i, examples in enumerate(self.fewshot_example):
72
+ human_template += f"\nExample {i+1}:\n"
73
+ human_template += "Question: \n" + examples['question'] + "\nAnswer: \n" + examples['answer']
74
+
75
+ if self.prompt_level in [2, 3, 4]:
76
+ if self.memory:
77
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.\n'
78
+ suffix_flag = True
79
+ if self.prompt_level == 2:
80
+ human_template += 'I have collected a few trajectories from a random policy, and the summaries are listed below.'
81
+ elif self.prompt_level == 3:
82
+ human_template += 'I have collected a few trajectories before, and the summaries are listed below.'
83
+ elif self.prompt_level == 4:
84
+ human_template += 'I have collected a few trajectories from an expert policy, and the summaries are listed below.'
85
+ human_template += self._read_mem() + "\n"
86
+
87
+ if self.use_short_mem:
88
+ if len(self.env_history) > 1:
89
+ if not suffix_flag:
90
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.'
91
+ human_template += f"\nBelow are the latest {self.args.short_mem_num} historical data entries:\n"
92
+ human_template += f"{self.env_history.get_histories(self.mem_num)}"
93
+ human_template += '\nNext is the observation that the agent gets:\nCurrent {state_description}\n'
94
+ human_template += 'Please select an action based on the current game state and the information you get. You must select the appropriate action from the given action descriptions and cannot refrain from taking action or performing any prohibited actions. Here is the action description below:\n{action_description}\n'
95
+ human_template += 'Please note that you need to carefully lay out your thought process on the question, not just give an answer. You need to write the corresponding logic of your thinking following the example above. Also, please keep in mind not to answer with any redundant and irrelevant content.\n'
96
+ human_template += "Finally, you also need to normalize your output according to the reply format description.\n"
97
+ human_template += 'Reply format description: {reply_format_description}{format_instructions}\n'
98
+
99
+ human_message_prompt = PromptTemplate(
100
+ template=human_template,
101
+ input_variables=[
102
+ 'state_description', 'goal_description', 'game_description',
103
+ 'action_description', 'reply_format_description'],
104
+ partial_variables={'format_instructions': self.parser.get_format_instructions()}
105
+ )
106
+
107
+ human_message_prompt = HumanMessagePromptTemplate(prompt=human_message_prompt)
108
+
109
+ chat_prompt = ChatPromptTemplate.from_messages([human_message_prompt])
110
+
111
+ if not self.logger:
112
+ logger.remove()
113
+ self.logger = logger.add(logfile, colorize=True, enqueue=True)
114
+ handler = FileCallbackHandler(logfile)
115
+
116
+ chain = LLMChain(llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=False)
117
+
118
+ text_prompt = chat_prompt.format_messages(
119
+ game_description=game_description,
120
+ state_description=state_description,
121
+ goal_description=goal_description,
122
+ action_description=action_description,
123
+ reply_format_description=reply_format_description
124
+ )
125
+ texts = ""
126
+ for text in text_prompt:
127
+ texts += text.content + "\n"
128
+
129
+ with get_openai_callback() as cb:
130
+ response = run_chain(
131
+ chain,
132
+ game_description=game_description,
133
+ state_description=state_description,
134
+ goal_description=goal_description,
135
+ action_description=action_description,
136
+ reply_format_description=reply_format_description
137
+ )
138
+ total_tokens = cb.total_tokens
139
+ total_cost = cb.total_cost
140
+ action = self.parser.parse(response).action
141
+ self._add_history_after_action(action)
142
+ self.logger.info(f'The GPT response is: {response}.')
143
+ self.logger.info(f'The optimal action is: {action}.')
144
+ if env_info.get('history'):
145
+ self.logger.info(f'History: {history_to_str(env_info["history"])}')
146
+
147
+ return action, texts, response, total_tokens, total_cost
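ChainOfThought, like the other structured deciders, relies on the parser built in NaiveAct._parser_initialization: a Pydantic schema describes the expected reply, and an OutputFixingParser retries malformed replies through a chat model. A sketch of that stack with a hypothetical one-field schema; the real schemas live in deciders/parser.py, and constructing ChatOpenAI assumes an OpenAI key is configured:

from pydantic import BaseModel, Field
from langchain.output_parsers import PydanticOutputParser, OutputFixingParser
from langchain.chat_models import ChatOpenAI

class ActionChoice(BaseModel):
    action: int = Field(description="the chosen action index, starting from 1")

parser = PydanticOutputParser(pydantic_object=ActionChoice)
fixing_parser = OutputFixingParser.from_llm(llm=ChatOpenAI(temperature=0), parser=parser)

print(parser.get_format_instructions())             # injected as {format_instructions}
print(fixing_parser.parse('{"action": 2}').action)  # 2 (no fix-up call needed here)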
deciders/jarvis.py ADDED
@@ -0,0 +1,177 @@
1
+ import openai
2
+ from .misc import history_to_str
3
+ from langchain.chat_models import AzureChatOpenAI, ChatOpenAI
4
+ from langchain.prompts.chat import (
5
+ PromptTemplate,
6
+ ChatPromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ HumanMessagePromptTemplate,
9
+ )
10
+ from langchain.prompts.few_shot import FewShotPromptTemplate
11
+ from langchain import LLMChain
12
+ from langchain.callbacks import FileCallbackHandler
13
+ from langchain.callbacks import get_openai_callback
14
+ from .act import NaiveAct
15
+ from memory.env_history import EnvironmentHistory
16
+ import tiktoken
17
+ from .utils import run_chain
18
+ from loguru import logger
19
+
20
+
21
+
22
+ class Jarvis(NaiveAct):
23
+ def __init__(self, action_space, args, prompts, distiller, temperature=0., max_tokens=None, logger=None, fixed_suggestion=None, fixed_insight=None):
24
+ super().__init__(action_space, args, prompts, distiller, temperature, max_tokens, logger)
25
+ self.pre_memory = []
26
+ self.post_memory = []
27
+ self.is_first = True
28
+ self.num_trails = args.num_trails
29
+ self.game_description = args.game_description
30
+ self.goal_description = args.goal_description
31
+ self.action_description = args.action_description
32
+ self.action_desc_dict = args.action_desc_dict
33
+ self.mem_num = args.trajectories_num
34
+ self.temperature = temperature
35
+ self.fixed_suggestion = fixed_suggestion
36
+ self.fixed_insight = fixed_insight
37
+ self._update_mem(None)
38
+ self.insight = ""
39
+
40
+ def num_tokens_from_string(self,string: str) -> int:
41
+ """Returns the number of tokens in a text string."""
42
+ num_tokens = len(self.encoding.encode(string))
43
+ return num_tokens
44
+
45
+ def update_mem(self,):
46
+ traj = self.game_description
47
+ traj += self.goal_description
48
+ traj += self.action_description
49
+ traj += str(self.env_history)
50
+ self._update_mem(traj)
51
+
52
+ def clear_mem(self):
53
+ self.pre_memory = []
54
+ self.post_memory = []
55
+ self.is_first = True
56
+ self._update_mem(None)
57
+
58
+ def _update_mem(self, traj):
59
+ if self.memory:
60
+ self.post_memory = self.memory
61
+ self.insight = self.distiller.generate_insight(self.post_memory)
62
+ else:
63
+ if not self.is_first:
64
+ summary = self.distiller.generate_summary(traj, self.post_memory)
65
+ self.post_memory.append(summary)
66
+ self.insight = self.distiller.generate_insight(self.post_memory)
67
+ else:
68
+ self.is_first = False
69
+ self.insight = ""
70
+ suggestion = self.distiller.generate_suggestion(self.game_description, self.goal_description, self.action_description, self.pre_memory, self.post_memory, self.insight, self.num_trails)
71
+ if self.fixed_suggestion:
72
+ suggestion = self.fixed_suggestion
73
+ if self.fixed_insight:
74
+ self.insight = self.fixed_insight
75
+ self.pre_memory.append(suggestion)
76
+ self.env_history.reset()
77
+
78
+ def _read_mem(self, ):
79
+ insight_str = ""
80
+ if self.insight:
81
+ insight_str += "The insights of the game are listed below: "
82
+ insight_str += f"{self.insight}\n"
83
+ suggestion_str = "The suggestions are listed below:" + self.pre_memory[-1]
84
+ return insight_str + suggestion_str
85
+ def act(
86
+ self,
87
+ state_description,
88
+ action_description,
89
+ env_info,
90
+ game_description,
91
+ goal_description,
92
+ logfile=None,
93
+ ):
94
+ self.game_description = game_description
95
+ self.goal_description = goal_description
96
+ self.env_history.add("observation", state_description)
97
+ chat = ChatOpenAI(temperature=0.5, openai_api_key=openai.api_key, model=self.args.gpt_version)
98
+ # print(self.logger)
99
+ reply_format_description = \
100
+ "Your response should choose an optimal action from valid action list, and terminated with following format: "
101
+ # only task relevant examplesA
102
+ template = "Now you are completing a task."
103
+ template += "You need to carefully understand the description of the game. "
104
+ # TODO: few shot example handle
105
+ if self.irr_few_shot_examples:
106
+ template += "Here are some examples of how you should completing a task."
107
+ for examples in self.irr_few_shot_examples:
108
+ template += "\nQuestion: \n" + examples['question'] + "Answer: \n" + examples['answer']
109
+
110
+ template += "\n\nNow you are in the task."
111
+ template += " {game_description} {action_description} {goal_description}"
112
+ template += "You are observing something and " \
113
+ "you need to choose the optimal action acoordingly."
114
+ template += 'Response and interact using the format: {reply_format_description}{format_instructions}\n'
115
+
116
+ template += self._read_mem()
117
+ system_message_prompt = SystemMessagePromptTemplate.from_template(template)
118
+
119
+ short_memory_template = HumanMessagePromptTemplate.from_template("{history}")
120
+ chat_prompt = ChatPromptTemplate.from_messages(
121
+ [system_message_prompt, short_memory_template])
122
+ if self.logger:
123
+ pass
124
+ else:
125
+ if logfile:
126
+ # logger.remove()
127
+ if self.first_call:
128
+ self.logger = logger.add(logfile, colorize=True, enqueue=True, filter=lambda x: '[Reflexion Memory]' not in x['message'])
129
+ self.first_call = False
130
+ handler = FileCallbackHandler(logfile)
131
+ total_tokens, total_cost = 0, 0
132
+ max_think_times = 1
133
+ # TODO: ADD REACT Support
134
+ # print(str(self.env_history))
135
+ if self.use_short_mem:
136
+ my_history = str(self.env_history)
137
+ else:
138
+ my_history = ""
139
+ for i_think in range(max_think_times):
140
+ # chain = LLMChain(llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=True)
141
+ chain = LLMChain(llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=False)
142
+ with get_openai_callback() as cb:
143
+ response = run_chain(
144
+ chain,
145
+ game_description=game_description,
146
+ goal_description=goal_description,
147
+ action_description=action_description,
148
+ # state_description = self.env_history.get_last_history(),
149
+ history=self.env_history.get_histories_with_last(self.mem_num),
150
+ format_instructions=self.parser.get_format_instructions(),
151
+ reply_format_description=reply_format_description,
152
+ max_token=3000
153
+ )
154
+
155
+ total_tokens += cb.total_tokens
156
+ total_cost += cb.total_cost
157
+ action = self.parser.parse(response).action
158
+ self._add_history_after_action(action)
159
+ self.logger.info(f'The GPT response is: {response}.')
160
+ self.logger.info(f'The optimal action is: {action}.')
161
+ if self.pre_memory:
162
+ self.logger.info(f'The suggestion is: {self.pre_memory[-1]}.')
163
+ if self.post_memory:
164
+ self.logger.info(f'The summary is: {self.post_memory[-1]}.')
165
+ if env_info.get('history'):
166
+ self.logger.info(f'History: {history_to_str(env_info["history"])}')
167
+ text_prompt = chat_prompt.format_messages(
168
+ game_description=game_description,
169
+ goal_description=goal_description,
170
+ action_description=action_description,
171
+ # state_description = self.env_history.get_last_history(),
172
+ history=self.env_history.get_histories_with_last(self.mem_num),
173
+ format_instructions=self.parser.get_format_instructions(),
174
+ reply_format_description=reply_format_description,
175
+ )
176
+ text_prompt = f'{text_prompt[0].content}\n{text_prompt[1].content}'
177
+ return action, text_prompt, response, total_tokens, total_cost
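Jarvis maintains two memory streams: post_memory (episode summaries plus a distilled insight) and pre_memory (per-trial suggestions). A condensed sketch of the per-episode cycle implemented by _update_mem above, written as a standalone function; distiller is the summarizer object injected into the decider (see the distillers/ package in this commit):

def refresh_memory(distiller, traj, pre_memory, post_memory,
                   game_desc, goal_desc, action_desc, num_trails):
    # Summarize the finished trajectory, refresh the cross-episode insight,
    # then ask for a fresh suggestion to steer the next episode.
    summary = distiller.generate_summary(traj, post_memory)
    post_memory.append(summary)
    insight = distiller.generate_insight(post_memory)
    suggestion = distiller.generate_suggestion(
        game_desc, goal_desc, action_desc,
        pre_memory, post_memory, insight, num_trails)
    pre_memory.append(suggestion)
    return insight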
deciders/jarvis_without_insights.py ADDED
@@ -0,0 +1,179 @@
1
+ import openai
2
+ from .misc import history_to_str
3
+ from langchain.chat_models import AzureChatOpenAI
4
+ from langchain.prompts.chat import (
5
+ PromptTemplate,
6
+ ChatPromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ HumanMessagePromptTemplate,
9
+ )
10
+ from langchain.prompts.few_shot import FewShotPromptTemplate
11
+ from langchain import LLMChain
12
+ from loguru import logger
13
+ from langchain.callbacks import FileCallbackHandler
14
+ from langchain.callbacks import get_openai_callback
15
+ from .act import NaiveAct
16
+ from memory.env_history import EnvironmentHistory
17
+ import tiktoken
18
+ from .utils import run_chain
19
+
20
+
21
+ class JarvisWithoutInsight(NaiveAct):
22
+ def __init__(self, action_space, args, prompts, distiller, temperature=0.1, max_tokens=None):
23
+ super().__init__(action_space, args, prompts, distiller, temperature, max_tokens)
24
+ self.pre_memory = []
25
+ self.post_memory = []
26
+ self.is_first = True
27
+ self.num_trails = args.num_trails
28
+ self.game_description = args.game_description
29
+ self.goal_description = args.goal_description
30
+ self.action_description = args.action_description
31
+ self._update_mem(None)
32
+
33
+ def update_mem(self,):
34
+ traj = self.game_description
35
+ traj += self.goal_description
36
+ max_step_num = min(14000 // self.num_tokens_from_string(self.env_history.get_one_history()),200)
37
+ traj += self.env_history.get_histories(max_step_num)
38
+ self._update_mem(traj)
39
+
40
+ def _update_mem(self, traj):
41
+ if not self.is_first:
42
+ summary = self.distiller.generate_summary(traj, self.post_memory)
43
+ self.post_memory.append(summary)
44
+ self.insight = self.distiller.generate_insight(self.post_memory)
45
+ else:
46
+ self.is_first = False
47
+ suggestion = self.distiller.generate_suggestion(self.game_description, self.goal_description, self.action_description, self.pre_memory, self.post_memory, self.num_trails)
48
+ self.pre_memory.append(suggestion)
49
+ self.env_history.reset()
50
+
51
+ def _read_mem(self, ):
52
+ insight_str = ""
53
+ suggestion_str = "The suggestions are listed below:" + self.pre_memory[-1]
54
+ return insight_str + suggestion_str
55
+
56
+ def act(
57
+ self,
58
+ state_description,
59
+ action_description,
60
+ env_info,
61
+ game_description,
62
+ goal_description,
63
+ logfile=None,
64
+ ):
65
+ self.game_description = game_description
66
+ self.goal_description = goal_description
67
+ self.env_history.add("observation", state_description)
68
+ chat = AzureChatOpenAI(
69
+ openai_api_type=openai.api_type,
70
+ openai_api_version=openai.api_version,
71
+ openai_api_base=openai.api_base,
72
+ openai_api_key=openai.api_key,
73
+ deployment_name=self.args.gpt_version,
74
+ temperature=self.temperature,
75
+ max_tokens=self.max_tokens,
76
+ )
77
+ reply_format_description = \
78
+ "Your response should choose an optimal action from valid action list, and terminated with following format: "
79
+ # only task relevant examplesA
80
+ template = "Now you are completing a task. "
81
+ template += "You need to carefully understand the description of the game. "
82
+ # TODO: few shot example handle
83
+ if self.irr_few_shot_examples:
84
+ template += "Here are some examples of how you should completing a task."
85
+ for examples in self.irr_few_shot_examples:
86
+ template += "\nQuestion: \n" + examples['question'] + "Answer: \n" + examples['answer']
87
+
88
+ if self.fewshot_example:
89
+ if self.expert_knowledge:
90
+ template += "Here, I will provide you with some expert knowledge to help you better understand the rules of the task."
91
+ template += self.expert_knowledge + '\n'
92
+ template += "Next are some examples: "
93
+ system_message_prompt = SystemMessagePromptTemplate.from_template(template)
94
+
95
+ human_template = ""
96
+ human_template += "\n\nNow you are in the task.\n"
97
+ human_template += "{game_description}\n{action_description}\n{goal_description}\n"
98
+ human_template += "You are observing something and " \
99
+ "you need to choose the optimal action acoordingly. "
100
+ human_template += 'Response and interact using the format: {reply_format_description}{format_instructions}\n'
101
+ human_template += self._read_mem()
102
+ human_template += "\n\nHere are some history states listed below:\n"
103
+
104
+ fewshot_example_prompt = PromptTemplate(
105
+ input_variables=["question", "answer"],
106
+ template="Question: \n{question}\n{answer}"
107
+ )
108
+ human_message_prompt = FewShotPromptTemplate(
109
+ examples=self.fewshot_example,
110
+ example_prompt=fewshot_example_prompt,
111
+ suffix=human_template,
112
+ input_variables=[
113
+ 'game_description', 'goal_description',
114
+ 'action_description', 'reply_format_description'],
115
+ partial_variables={'format_instructions': self.parser.get_format_instructions()}
116
+ )
117
+ human_message_prompt = HumanMessagePromptTemplate(prompt=human_message_prompt)
118
+
119
+ short_memory_template = HumanMessagePromptTemplate.from_template("{history} Please select an action based on the current game state:")
120
+
121
+ chat_prompt = ChatPromptTemplate.from_messages(
122
+ [system_message_prompt, human_message_prompt, short_memory_template])
123
+
124
+
125
+ if logfile:
126
+ # logger.remove()
127
+ if self.first_call:
128
+ logger.add(logfile, colorize=True, enqueue=True, filter=lambda x: '[Reflexion Memory]' not in x['message'])
129
+ self.first_call = False
130
+ handler = FileCallbackHandler(logfile)
131
+ total_tokens, total_cost = 0, 0
132
+ max_think_times = 1
133
+ # TODO: ADD REACT Support
134
+ # print(str(self.env_history))
135
+ if self.use_short_mem:
136
+ my_history = str(self.env_history)
137
+ else:
138
+ my_history = ""
139
+ for i_think in range(max_think_times):
140
+ chain = LLMChain(llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=False)
141
+ with get_openai_callback() as cb:
142
+ response = run_chain(
143
+ chain,
144
+ game_description=game_description,
145
+ goal_description=goal_description,
146
+ action_description=action_description,
147
+ history=str(self.env_history),
148
+ format_instructions=self.parser.get_format_instructions(),
149
+ reply_format_description=reply_format_description,
150
+ max_token = 3000
151
+ )
152
+
153
+ total_tokens += cb.total_tokens
154
+ total_cost += cb.total_cost
155
+ action = self.parser.parse(response).action
156
+
157
+ text_prompt = chat_prompt.format_messages(
158
+ game_description=game_description,
159
+ goal_description=goal_description,
160
+ action_description=action_description,
161
+ history=str(self.env_history),
162
+ format_instructions=self.parser.get_format_instructions(),
163
+ reply_format_description=reply_format_description,
164
+ )
165
+ texts = ""
166
+ for text in text_prompt:
167
+ texts += text.content + "\n"
168
+
169
+ self._add_history_after_action(action)
170
+ logger.info(f'The GPT response is: {response}.')
171
+ logger.info(f'The optimal action is: {action}.')
172
+ if self.pre_memory:
173
+ logger.info(f'The suggestion is: {self.pre_memory[-1]}.')
174
+ if self.post_memory:
175
+ logger.info(f'The summary is: {self.post_memory[-1]}.')
176
+ if env_info.get('history'):
177
+ logger.info(f'History: {history_to_str(env_info["history"])}')
178
+
179
+ return action, texts, response, logger, total_tokens, total_cost
deciders/jarvis_without_shortmem.py ADDED
@@ -0,0 +1,182 @@
1
+ import openai
2
+ from .misc import history_to_str
3
+ from langchain.chat_models import AzureChatOpenAI
4
+ from langchain.prompts.chat import (
5
+ PromptTemplate,
6
+ ChatPromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ HumanMessagePromptTemplate,
9
+ )
10
+ from langchain.prompts.few_shot import FewShotPromptTemplate
11
+ from langchain import LLMChain
12
+ from loguru import logger
13
+ from langchain.callbacks import FileCallbackHandler
14
+ from langchain.callbacks import get_openai_callback
15
+ from .act import NaiveAct
16
+ from memory.env_history import EnvironmentHistory
17
+ import tiktoken
18
+ from .utils import run_chain
19
+
20
+
21
+ class JarvisWithoutShortMem(NaiveAct):
22
+ def __init__(self, action_space, args, prompts, distiller, temperature=0.1, max_tokens=None):
23
+ super().__init__(action_space, args, prompts, distiller, temperature, max_tokens)
24
+ self.pre_memory = []
25
+ self.post_memory = []
26
+ self.is_first = True
27
+ self.num_trails = args.num_trails
28
+ self.game_description = args.game_description
29
+ self.goal_description = args.goal_description
30
+ self.action_description = args.action_description
31
+ self._update_mem(None)
32
+
33
+ def update_mem(self,):
34
+ traj = self.game_description
35
+ traj += self.goal_description
36
+ max_step_num = min(14000 // self.num_tokens_from_string(self.env_history.get_one_history()),200)
37
+ traj += self.env_history.get_histories(max_step_num)
38
+ self._update_mem(traj)
39
+
40
+ def _update_mem(self, traj):
41
+ if not self.is_first:
42
+ summary = self.distiller.generate_summary(traj, self.post_memory)
43
+ self.post_memory.append(summary)
44
+ self.insight = self.distiller.generate_insight(self.post_memory)
45
+ else:
46
+ self.is_first = False
47
+ suggestion = self.distiller.generate_suggestion(self.game_description, self.goal_description, self.action_description, self.pre_memory, self.post_memory, self.num_trails)
48
+ self.pre_memory.append(suggestion)
49
+ self.env_history.reset()
50
+
51
+ def _read_mem(self, ):
52
+ insight_str = ""
53
+ if len(self.post_memory) > 0:
54
+ insight_str += "The insights of the game are listed below: "
55
+ insight_str += f"{self.insight}\n"
56
+ suggestion_str = "The suggestions are listed below:" + self.pre_memory[-1]
57
+ return insight_str + suggestion_str
58
+
59
+ def act(
60
+ self,
61
+ state_description,
62
+ action_description,
63
+ env_info,
64
+ game_description,
65
+ goal_description,
66
+ logfile=None,
67
+ ):
68
+ self.game_description = game_description
69
+ self.goal_description = goal_description
70
+ self.env_history.add("observation", state_description)
71
+ chat = AzureChatOpenAI(
72
+ openai_api_type=openai.api_type,
73
+ openai_api_version=openai.api_version,
74
+ openai_api_base=openai.api_base,
75
+ openai_api_key=openai.api_key,
76
+ deployment_name=self.args.gpt_version,
77
+ temperature=self.temperature,
78
+ max_tokens=self.max_tokens,
79
+ )
80
+ reply_format_description = \
81
+ "Your response should choose an optimal action from valid action list, and terminated with following format: "
82
+ # only task relevant examplesA
83
+ template = "Now you are completing a task. "
84
+ template += "You need to carefully understand the description of the game. "
85
+ # TODO: few shot example handle
86
+ if self.irr_few_shot_examples:
87
+ template += "Here are some examples of how you should completing a task."
88
+ for examples in self.irr_few_shot_examples:
89
+ template += "\nQuestion: \n" + examples['question'] + "Answer: \n" + examples['answer']
90
+
91
+ if self.fewshot_example:
92
+ if self.expert_knowledge:
93
+ template += "Here, I will provide you with some expert knowledge to help you better understand the rules of the task."
94
+ template += self.expert_knowledge + '\n'
95
+ template += "Next are some examples: "
96
+ system_message_prompt = SystemMessagePromptTemplate.from_template(template)
97
+
98
+ human_template = ""
99
+ human_template += "\n\nNow you are in the task.\n"
100
+ human_template += "{game_description}\n{action_description}\n{goal_description}\n"
101
+ human_template += "You are observing something and " \
102
+ "you need to choose the optimal action acoordingly. "
103
+ human_template += 'Response and interact using the format: {reply_format_description}{format_instructions}\n'
104
+ human_template += self._read_mem()
105
+ human_template += "\n\nHere are some history states listed below:\n"
106
+
107
+ fewshot_example_prompt = PromptTemplate(
108
+ input_variables=["question", "answer"],
109
+ template="Question: \n{question}\n{answer}"
110
+ )
111
+ human_message_prompt = FewShotPromptTemplate(
112
+ examples=self.fewshot_example,
113
+ example_prompt=fewshot_example_prompt,
114
+ suffix=human_template,
115
+ input_variables=[
116
+ 'game_description', 'goal_description',
117
+ 'action_description', 'reply_format_description'],
118
+ partial_variables={'format_instructions': self.parser.get_format_instructions()}
119
+ )
120
+ human_message_prompt = HumanMessagePromptTemplate(prompt=human_message_prompt)
121
+
122
+ short_memory_template = HumanMessagePromptTemplate.from_template("{history} Please select an action based on the current game state:")
123
+
124
+ chat_prompt = ChatPromptTemplate.from_messages(
125
+ [system_message_prompt, human_message_prompt, short_memory_template])
126
+
127
+
128
+ if logfile:
129
+ # logger.remove()
130
+ if self.first_call:
131
+ logger.add(logfile, colorize=True, enqueue=True, filter=lambda x: '[Reflexion Memory]' not in x['message'])
132
+ self.first_call = False
133
+ handler = FileCallbackHandler(logfile)
134
+ total_tokens, total_cost = 0, 0
135
+ max_think_times = 1
136
+ # TODO: ADD REACT Support
137
+ # print(str(self.env_history))
138
+ if self.use_short_mem:
139
+ my_history = str(self.env_history)
140
+ else:
141
+ my_history = ""
142
+ for i_think in range(max_think_times):
143
+ chain = LLMChain(llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=False)
144
+ with get_openai_callback() as cb:
145
+ response = run_chain(
146
+ chain,
147
+ game_description=game_description,
148
+ goal_description=goal_description,
149
+ action_description=action_description,
150
+ history=self.env_history.get_last_history(),
151
+ format_instructions=self.parser.get_format_instructions(),
152
+ reply_format_description=reply_format_description,
153
+ max_token = 3000
154
+ )
155
+
156
+ total_tokens += cb.total_tokens
157
+ total_cost += cb.total_cost
158
+ action = self.parser.parse(response).action
159
+
160
+ text_prompt = chat_prompt.format_messages(
161
+ game_description=game_description,
162
+ goal_description=goal_description,
163
+ action_description=action_description,
164
+ history=self.env_history.get_last_history(),
165
+ format_instructions=self.parser.get_format_instructions(),
166
+ reply_format_description=reply_format_description,
167
+ )
168
+ texts = ""
169
+ for text in text_prompt:
170
+ texts += text.content + "\n"
171
+
172
+ self._add_history_after_action(action)
173
+ logger.info(f'The GPT response is: {response}.')
174
+ logger.info(f'The optimal action is: {action}.')
175
+ if self.pre_memory:
176
+ logger.info(f'The suggestion is: {self.pre_memory[-1]}.')
177
+ if self.post_memory:
178
+ logger.info(f'The summary is: {self.post_memory[-1]}.')
179
+ if env_info.get('history'):
180
+ logger.info(f'History: {history_to_str(env_info["history"])}')
181
+
182
+ return action, texts, response, logger, total_tokens, total_cost
deciders/jarvis_without_suggestions.py ADDED
@@ -0,0 +1,180 @@
1
+ import openai
2
+ from .misc import history_to_str
3
+ from langchain.chat_models import AzureChatOpenAI
4
+ from langchain.prompts.chat import (
5
+ PromptTemplate,
6
+ ChatPromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ HumanMessagePromptTemplate,
9
+ )
10
+ from langchain.prompts.few_shot import FewShotPromptTemplate
11
+ from langchain import LLMChain
12
+ from loguru import logger
13
+ from langchain.callbacks import FileCallbackHandler
14
+ from langchain.callbacks import get_openai_callback
15
+ from .act import NaiveAct
16
+ from memory.env_history import EnvironmentHistory
17
+ import tiktoken
18
+ from .utils import run_chain
19
+
20
+
21
+ class JarvisWithoutSuggestions(NaiveAct):
22
+ def __init__(self, action_space, args, prompts, distiller, temperature=0.1, max_tokens=None):
23
+ super().__init__(action_space, args, prompts, distiller, temperature, max_tokens)
24
+ self.pre_memory = []
25
+ self.post_memory = []
26
+ self.is_first = True
27
+ self.num_trails = args.num_trails
28
+ self.game_description = args.game_description
29
+ self.goal_description = args.goal_description
30
+ self.action_description = args.action_description
31
+ self._update_mem(None)
32
+
33
+ def update_mem(self,):
34
+ traj = self.game_description
35
+ traj += self.goal_description
36
+ max_step_num = min(14000 // self.num_tokens_from_string(self.env_history.get_one_history()),200)
37
+ traj += self.env_history.get_histories(max_step_num)
38
+ self._update_mem(traj)
39
+
40
+ def _update_mem(self, traj):
41
+ if not self.is_first:
42
+ summary = self.distiller.generate_summary(traj, self.post_memory)
43
+ self.post_memory.append(summary)
44
+ self.insight = self.distiller.generate_insight(self.post_memory)
45
+ else:
46
+ self.is_first = False
47
+ suggestion = self.distiller.generate_suggestion(self.game_description, self.goal_description, self.action_description, self.pre_memory, self.post_memory, self.num_trails)
48
+ self.pre_memory.append(suggestion)
49
+ self.env_history.reset()
50
+
51
+ def _read_mem(self, ):
52
+ insight_str = ""
53
+ if len(self.post_memory) > 0:
54
+ insight_str += "The insights of the game are listed below: "
55
+ insight_str += f"{self.insight}\n"
56
+ suggestion_str = "\n"
57
+ return insight_str + suggestion_str
58
+
59
+ def act(
60
+ self,
61
+ state_description,
62
+ action_description,
63
+ env_info,
64
+ game_description,
65
+ goal_description,
66
+ logfile=None,
67
+ ):
68
+ self.game_description = game_description
69
+ self.goal_description = goal_description
70
+ self.env_history.add("observation", state_description)
71
+ chat = AzureChatOpenAI(
72
+ openai_api_type=openai.api_type,
73
+ openai_api_version=openai.api_version,
74
+ openai_api_base=openai.api_base,
75
+ openai_api_key=openai.api_key,
76
+ deployment_name=self.args.gpt_version,
77
+ temperature=self.temperature,
78
+ max_tokens=self.max_tokens,
79
+ )
80
+ reply_format_description = \
81
+ "Your response should choose an optimal action from valid action list, and terminated with following format: "
82
+ # only task relevant examplesA
83
+ template = "Now you are completing a task. "
84
+ template += "You need to carefully understand the description of the game. "
85
+ # TODO: few shot example handle
86
+ if self.irr_few_shot_examples:
87
+ template += "Here are some examples of how you should completing a task."
88
+ for examples in self.irr_few_shot_examples:
89
+ template += "\nQuestion: \n" + examples['question'] + "Answer: \n" + examples['answer']
90
+
91
+ if self.fewshot_example:
92
+ if self.expert_knowledge:
93
+ template += "Here, I will provide you with some expert knowledge to help you better understand the rules of the task."
94
+ template += self.expert_knowledge + '\n'
95
+ template += "Next are some examples: "
96
+ system_message_prompt = SystemMessagePromptTemplate.from_template(template)
97
+
98
+ human_template = ""
99
+ human_template += "\n\nNow you are in the task.\n"
100
+ human_template += "{game_description}\n{action_description}\n{goal_description}\n"
101
+ human_template += "You are observing something and " \
102
+ "you need to choose the optimal action acoordingly. "
103
+ human_template += 'Response and interact using the format: {reply_format_description}{format_instructions}\n'
104
+ human_template += self._read_mem()
105
+ human_template += "\n\nHere are some history states listed below:\n"
106
+
107
+ fewshot_example_prompt = PromptTemplate(
108
+ input_variables=["question", "answer"],
109
+ template="Question: \n{question}\n{answer}"
110
+ )
111
+ human_message_prompt = FewShotPromptTemplate(
112
+ examples=self.fewshot_example,
113
+ example_prompt=fewshot_example_prompt,
114
+ suffix=human_template,
115
+ input_variables=[
116
+ 'game_description', 'goal_description',
117
+ 'action_description', 'reply_format_description'],
118
+ partial_variables={'format_instructions': self.parser.get_format_instructions()}
119
+ )
120
+ human_message_prompt = HumanMessagePromptTemplate(prompt=human_message_prompt)
121
+
122
+ short_memory_template = HumanMessagePromptTemplate.from_template("{history} Please select an action based on the current game state:")
123
+
124
+ chat_prompt = ChatPromptTemplate.from_messages(
125
+ [system_message_prompt, human_message_prompt, short_memory_template])
126
+
127
+
128
+ if logfile:
129
+ # logger.remove()
130
+ if self.first_call:
131
+ logger.add(logfile, colorize=True, enqueue=True, filter=lambda x: '[Reflexion Memory]' not in x['message'])
132
+ self.first_call = False
133
+ handler = FileCallbackHandler(logfile)
134
+ total_tokens, total_cost = 0, 0
135
+ max_think_times = 1
136
+ # TODO: ADD REACT Support
137
+ # print(str(self.env_history))
138
+ if self.use_short_mem:
139
+ my_history = str(self.env_history)
140
+ else:
141
+ my_history = ""
142
+ for i_think in range(max_think_times):
143
+ chain = LLMChain(llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=False)
144
+ with get_openai_callback() as cb:
145
+ response = run_chain(
146
+ chain,
147
+ game_description=game_description,
148
+ goal_description=goal_description,
149
+ action_description=action_description,
150
+ history=str(self.env_history),
151
+ format_instructions=self.parser.get_format_instructions(),
152
+ reply_format_description=reply_format_description,
153
+ max_token = 3000
154
+ )
155
+
156
+ total_tokens += cb.total_tokens
157
+ total_cost += cb.total_cost
158
+ action = self.parser.parse(response).action
159
+
160
+ text_prompt = chat_prompt.format_messages(
161
+ game_description=game_description,
162
+ goal_description=goal_description,
163
+ action_description=action_description,
164
+ history=str(self.env_history),
165
+ format_instructions=self.parser.get_format_instructions(),
166
+ reply_format_description=reply_format_description,
167
+ )
168
+ texts = ""
169
+ for text in text_prompt:
170
+ texts += text.content + "\n"
171
+
172
+ self._add_history_after_action(action)
173
+ logger.info(f'The GPT response is: {response}.')
174
+ logger.info(f'The optimal action is: {action}.')
175
+ if self.post_memory:
176
+ logger.info(f'The summary is: {self.post_memory[-1]}.')
177
+ if env_info.get('history'):
178
+ logger.info(f'History: {history_to_str(env_info["history"])}')
179
+
180
+ return action, texts, response, logger, total_tokens, total_cost
deciders/jarvis_without_summary.py ADDED
@@ -0,0 +1,179 @@
1
+ import openai
2
+ from .misc import history_to_str
3
+ from langchain.chat_models import AzureChatOpenAI
4
+ from langchain.prompts.chat import (
5
+ PromptTemplate,
6
+ ChatPromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ HumanMessagePromptTemplate,
9
+ )
10
+ from langchain.prompts.few_shot import FewShotPromptTemplate
11
+ from langchain import LLMChain
12
+ from loguru import logger
13
+ from langchain.callbacks import FileCallbackHandler
14
+ from langchain.callbacks import get_openai_callback
15
+ from .act import NaiveAct
16
+ from memory.env_history import EnvironmentHistory
17
+ import tiktoken
18
+
19
+
20
+ class Jarvis(NaiveAct):
21
+ def __init__(self, action_space, args, prompts, distiller, temperature=0.1, max_tokens=None):
22
+ super().__init__(action_space, args, prompts, distiller, temperature, max_tokens)
23
+ self.pre_memory = []
24
+ self.post_memory = []
25
+ self.is_first = True
26
+ self.num_trails = args.num_trails
27
+ self.game_description = args.game_description
28
+ self.goal_description = args.goal_description
29
+ self.action_description = args.action_description
30
+ self._update_mem(None)
31
+
32
+ def update_mem(self,):
33
+ traj = self.game_description
34
+ traj += self.goal_description
35
+ max_step_num = min(14000 // self.num_tokens_from_string(self.env_history.get_one_history()),200)
36
+ traj += self.env_history.get_histories(max_step_num)
37
+ self._update_mem(traj)
38
+
39
+ def _update_mem(self, traj):
40
+ if not self.is_first:
41
+ summary = self.distiller.generate_summary(traj, self.post_memory)
42
+ self.post_memory.append(summary)
43
+ self.insight = self.distiller.generate_insight(self.post_memory)
44
+ else:
45
+ self.is_first = False
46
+ suggestion = self.distiller.generate_suggestion(self.game_description, self.goal_description, self.action_description, self.pre_memory, self.post_memory, self.num_trails)
47
+ self.pre_memory.append(suggestion)
48
+ self.env_history.reset()
49
+
50
+ def _read_mem(self, ):
51
+ insight_str = ""
52
+ if len(self.post_memory) > 0:
53
+ insight_str += "The insights of the game are listed below: "
54
+ insight_str += f"{self.insight}\n"
55
+ suggestion_str = "The suggestions are listed below:" + self.pre_memory[-1]
56
+ return insight_str + suggestion_str
57
+
58
+ def act(
59
+ self,
60
+ state_description,
61
+ action_description,
62
+ env_info,
63
+ game_description,
64
+ goal_description,
65
+ logfile=None,
66
+ ):
67
+ self.game_description = game_description
68
+ self.goal_description = goal_description
69
+ self.env_history.add("observation", state_description)
70
+ chat = AzureChatOpenAI(
71
+ openai_api_type=openai.api_type,
72
+ openai_api_version=openai.api_version,
73
+ openai_api_base=openai.api_base,
74
+ openai_api_key=openai.api_key,
75
+ deployment_name=self.args.gpt_version,
76
+ temperature=self.temperature,
77
+ max_tokens=self.max_tokens,
78
+ )
79
+ reply_format_description = \
80
+ "Your response should choose an optimal action from valid action list, and terminated with following format: "
81
+ # only task relevant examplesA
82
+ template = "Now you are completing a task. "
83
+ template += "You need to carefully understand the description of the game. "
84
+ # TODO: few shot example handle
85
+ if self.irr_few_shot_examples:
86
+ template += "Here are some examples of how you should completing a task."
87
+ for examples in self.irr_few_shot_examples:
88
+ template += "\nQuestion: \n" + examples['question'] + "Answer: \n" + examples['answer']
89
+
90
+ if self.fewshot_example:
91
+ if self.expert_knowledge:
92
+ template += "Here, I will provide you with some expert knowledge to help you better understand the rules of the task."
93
+ template += self.expert_knowledge + '\n'
94
+ template += "Next are some examples: "
95
+ system_message_prompt = SystemMessagePromptTemplate.from_template(template)
96
+
97
+ human_template = ""
98
+ human_template += "\n"
99
+ human_template += "{game_description}\n{action_description}\n{goal_description}\n"
100
+ human_template += "You are observing something and " \
101
+ "you need to choose the optimal action acoordingly. "
102
+ human_template += 'Response and interact using the format: {reply_format_description}{format_instructions}\n'
103
+ human_template += self._read_mem()
104
+ human_template += "\n\nHere are some history states listed below:\n"
105
+
106
+ fewshot_example_prompt = PromptTemplate(
107
+ input_variables=["question", "answer"],
108
+ template="Question: \n{question}\n{answer}"
109
+ )
110
+ human_message_prompt = FewShotPromptTemplate(
111
+ examples=self.fewshot_example,
112
+ example_prompt=fewshot_example_prompt,
113
+ suffix=human_template,
114
+ input_variables=[
115
+ 'game_description', 'goal_description',
116
+ 'action_description', 'reply_format_description'],
117
+ partial_variables={'format_instructions': self.parser.get_format_instructions()}
118
+ )
119
+ human_message_prompt = HumanMessagePromptTemplate(prompt=human_message_prompt)
120
+
121
+ short_memory_template = HumanMessagePromptTemplate.from_template("{history} Please select an action based on the current game state. You must select the appropriate action from the given action descriptions and cannot refrain from taking action or perform any prohibited actions. Here's the action description below: \n {action_description}\n")
122
+
123
+ chat_prompt = ChatPromptTemplate.from_messages(
124
+ [system_message_prompt, human_message_prompt, short_memory_template])
125
+
126
+ if logfile:
127
+ # logger.remove()
128
+ if self.first_call:
129
+ logger.add(logfile, colorize=True, enqueue=True, filter=lambda x: '[Reflexion Memory]' not in x['message'])
130
+ self.first_call = False
131
+ handler = FileCallbackHandler(logfile)
132
+ total_tokens, total_cost = 0, 0
133
+ max_think_times = 1
134
+ # TODO: ADD REACT Support
135
+ # print(str(self.env_history))
136
+ if self.use_short_mem:
137
+ my_history = str(self.env_history)
138
+ else:
139
+ my_history = ""
140
+ for i_think in range(max_think_times):
141
+ chain = LLMChain(llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=False)
142
+ with get_openai_callback() as cb:
143
+ response = chain.run(
144
+ game_description=game_description,
145
+ goal_description=goal_description,
146
+ action_description=action_description,
147
+ history=self.env_history.get_histories(11),
148
+ format_instructions=self.parser.get_format_instructions(),
149
+ reply_format_description=reply_format_description,
150
+ max_token = 3000
151
+ )
152
+
153
+ total_tokens += cb.total_tokens
154
+ total_cost += cb.total_cost
155
+ action = self.parser.parse(response).action
156
+
157
+ text_prompt = chat_prompt.format_messages(
158
+ game_description=game_description,
159
+ goal_description=goal_description,
160
+ action_description=action_description,
161
+ history=self.env_history.get_histories(11),
162
+ format_instructions=self.parser.get_format_instructions(),
163
+ reply_format_description=reply_format_description,
164
+ )
165
+ texts = ""
166
+ for text in text_prompt:
167
+ texts += text.content + "\n"
168
+
169
+ self._add_history_after_action(action)
170
+ logger.info(f'The GPT response is: {response}.')
171
+ logger.info(f'The optimal action is: {action}.')
172
+ if self.pre_memory:
173
+ logger.info(f'The suggestion is: {self.pre_memory[-1]}.')
174
+ if self.post_memory:
175
+ logger.info(f'The summary is: {self.post_memory[-1]}.')
176
+ if env_info.get('history'):
177
+ logger.info(f'History: {history_to_str(env_info["history"])}')
178
+
179
+ return action, texts, response, logger, total_tokens, total_cost
deciders/misc.py ADDED
@@ -0,0 +1,21 @@
1
+ def history_to_str(history):
2
+ history_str = ""
3
+ for d in history:
4
+ history_str += f"state: {d['state']}, action: {d['action']}, reward: {d['reward']}\n"
5
+ return history_str
6
+
7
+ def get_majority_vote(actions):
8
+ return max(set(actions), key=actions.count)
9
+
10
+ def test_get_majority_vote():
11
+ assert get_majority_vote([1, 1, 1, 2, 2]) == 1
12
+ assert get_majority_vote([1, 1, 2, 2, 2]) == 2
13
+ assert get_majority_vote([1, 1, 2, 2, 3]) == 1
14
+ assert get_majority_vote([1, 2, 3, 4, 5]) == 1
15
+ assert get_majority_vote([1, 2, 3, 4, 5, 1, 1, 1, 1, 1]) == 1
16
+ assert get_majority_vote([1, 2, 3, 4, 5, 1, 1, 1, 1, 2]) == 1
17
+ assert get_majority_vote([1, 2, 3, 4, 5, 1, 1, 1, 2, 2]) == 1
18
+ assert get_majority_vote([1, 2, 3, 4, 5, 1, 1, 2, 2, 2]) == 2
19
+
20
+ if __name__ == "__main__":
21
+ test_get_majority_vote()
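A minimal usage sketch for the two helpers above (illustrative only, not part of this commit; it assumes the repo root is on `PYTHONPATH` so `deciders.misc` is importable):

```python
# Hypothetical history entries in the shape history_to_str expects.
from deciders.misc import history_to_str, get_majority_vote

history = [
    {"state": "cart position 0.1, pole upright", "action": 1, "reward": 1.0},
    {"state": "cart position 0.2, pole tilting right", "action": 2, "reward": 1.0},
]
print(history_to_str(history))              # one "state: ..., action: ..., reward: ..." line per step
print(get_majority_vote([1, 2, 2, 1, 2]))   # -> 2, the most frequent action
```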
deciders/pal.py ADDED
@@ -0,0 +1,149 @@
1
+ import openai
2
+ from .misc import history_to_str
3
+ from langchain.chat_models import AzureChatOpenAI
4
+ from langchain.prompts.chat import (
5
+ PromptTemplate,
6
+ ChatPromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ HumanMessagePromptTemplate,
9
+ )
10
+ from langchain.prompts.few_shot import FewShotPromptTemplate
11
+ from langchain import LLMChain
12
+ from loguru import logger
13
+ from langchain.callbacks import FileCallbackHandler
14
+ from langchain.callbacks import get_openai_callback
15
+ from .act import NaiveAct
16
+ from .utils import run_chain
17
+
18
+ def get_last_n_lines(text, n):
19
+ lines = text.splitlines()
20
+ return '\n'.join(lines[-n:])
21
+
22
+ class PAL(NaiveAct):
23
+ def __init__(self, action_space, args, prompts, distiller, temperature=0.1, max_tokens=None, logger=None):
24
+ super().__init__(action_space, args, prompts, distiller, temperature, max_tokens, logger)
25
+
26
+ def act(
27
+ self,
28
+ state_description,
29
+ action_description,
30
+ env_info,
31
+ game_description,
32
+ goal_description,
33
+ logfile=None,
34
+ ):
35
+ self._add_history_before_action(game_description, goal_description, state_description)
36
+ chat = AzureChatOpenAI(
37
+ openai_api_type=openai.api_type,
38
+ openai_api_version=openai.api_version,
39
+ openai_api_base=openai.api_base,
40
+ openai_api_key=openai.api_key,
41
+ deployment_name=self.args.gpt_version,
42
+ temperature=self.temperature,
43
+ max_tokens=self.max_tokens
44
+ )
45
+
46
+ suffix_flag = False
47
+ reply_format_description = \
48
+ "Your response should choose an optimal action from a valid action list and terminate with the following format: "
49
+
50
+ # System Message
51
+ human_template = "Now, you are completing a challenging task. You must carefully understand the Program-aided Language method you will use and apply it to the following task.\n"
52
+
53
+ # task-irrelevant SystemMessage
54
+ if self.irr_few_shot_examples:
55
+ human_template += 'In the following example, I shall present a set of question and answer with the Program-aided Language method. Please adhere to the format and reasoning of the provided response when addressing the subsequent task.\n'
56
+ for i, examples in enumerate(self.irr_few_shot_examples):
57
+ human_template += f"\nExample {i+1}:\n"
58
+ human_template += "Question: \n" + examples['question'] + "\nAnswer: \n" + examples['answer']
59
+
60
+ # task-irrelevant few shot if have
61
+ if self.irr_few_shot_examples:
62
+ human_template += "\nMoving forward, I will describe the task, the goal, and the actions you may execute. Please pay close attention to comprehend the information presented below.\n"
63
+
64
+ if self.fewshot_example:
65
+ human_template += "I will describe the task, the goal, and the actions you may execute. Please pay close attention to comprehend the information presented below."
66
+ # print(fewshot_example_prompt.format(**fewshot_examples[0]))
67
+ human_template += '\nTask Description: {game_description} \n'
68
+ human_template += 'Goal Description: {goal_description}\n'
69
+ human_template += 'Actions Description: {action_description}\n'
70
+
71
+ if self.fewshot_example:
72
+ human_template += "Here, I will provide you with some guidance to help you better understand the rules of the task. Next are some examples: "
73
+ for i, examples in enumerate(self.fewshot_example):
74
+ human_template += f"\nExample {i+1}:\n"
75
+ human_template += "Question: \n" + examples['question'] + "\nAnswer: \n" + examples['answer']
76
+
77
+ if self.prompt_level in [2, 3, 4]:
78
+ if self.memory:
79
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.\n'
80
+ suffix_flag = True
81
+ if self.prompt_level == 2:
82
+ human_template += 'I have collected a few trajectories from a random policy, and the summaries are listed below.'
83
+ elif self.prompt_level == 3:
84
+ human_template += 'I have collected a few trajectories before, and the summaries are listed below.'
85
+ elif self.prompt_level == 4:
86
+ human_template += 'I have collected a few trajectories from an expert policy, and the summaries are listed below.'
87
+ human_template += self._read_mem() + "\n"
88
+
89
+ if self.use_short_mem:
90
+ if len(self.env_history) > 1:
91
+ if not suffix_flag:
92
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.'
93
+ human_template += f"\nBelow are the latest {min(self.args.short_mem_num,len(self.env_history)//2)} historical data entries:\n"
94
+ human_template += f"{self.env_history.get_histories(self.mem_num)}"
95
+ human_template += '\nNext is the observation that the agent gets:\nCurrent {state_description}\n'
96
+ human_template += 'Please select an action based on the current game state and the information you get. You must select the appropriate action from the given action descriptions and cannot refrain from taking action or performing any prohibited actions. Here is the action description below:\n{action_description}\n'
97
+ human_template += 'Please generate Python program as answers to given questions, similar to the provided examples.\n'
98
+ human_template += 'You should calculate the final result based on the program, not just give a code script alone!\n'
99
+
100
+ human_message_prompt = PromptTemplate(
101
+ template=human_template,
102
+ input_variables=[
103
+ 'state_description', 'goal_description', 'game_description',
104
+ 'action_description'],
105
+ )
106
+
107
+ human_message_prompt = HumanMessagePromptTemplate(prompt=human_message_prompt)
108
+
109
+ chat_prompt = ChatPromptTemplate.from_messages([human_message_prompt])
110
+
111
+ if not self.logger:
112
+ logger.remove()
113
+ self.logger = logger.add(logfile, colorize=True, enqueue=True)
114
+ handler = FileCallbackHandler(logfile)
115
+
116
+ chain = LLMChain(llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=False)
117
+
118
+ with get_openai_callback() as cb:
119
+ response = run_chain(
120
+ chain,
121
+ game_description=game_description,
122
+ state_description=state_description,
123
+ goal_description=goal_description,
124
+ action_description=action_description,
125
+ )
126
+ total_tokens = cb.total_tokens
127
+ total_cost = cb.total_cost
128
+ _response = get_last_n_lines(response, 2)
129
+
130
+
131
+ action = self.parser.parse(_response).action
132
+
133
+ text_prompt = chat_prompt.format_messages(
134
+ game_description=game_description,
135
+ state_description=state_description,
136
+ goal_description=goal_description,
137
+ action_description=action_description,
138
+ )
139
+ texts = ""
140
+ for text in text_prompt:
141
+ texts += text.content + "\n"
142
+
143
+ self._add_history_after_action(action)
144
+ self.logger.info(f'The GPT response is: {response}.')
145
+ self.logger.info(f'The optimal action is: {action}.')
146
+ if env_info.get('history'):
147
+ self.logger.info(f'History: {history_to_str(env_info["history"])}')
148
+
149
+ return action, texts, response, total_tokens, total_cost
deciders/parser.py ADDED
@@ -0,0 +1,53 @@
1
+ from pydantic import BaseModel, Field, validator
2
+
3
+ # Define your desired data structure.
4
+ class TwoAction(BaseModel):
5
+ action: int = Field(description="the choosed action to perform")
6
+
7
+ # You can add custom validation logic easily with Pydantic.
8
+ @validator('action')
9
+ def action_is_valid(cls, field):
10
+ if field not in [1, 2]:
11
+ raise ValueError("Action is not valid ([1, 2])!")
12
+ return field
13
+
14
+ class ThreeAction(BaseModel):
15
+ action: int = Field(description="the choosed action to perform")
16
+
17
+ # You can add custom validation logic easily with Pydantic.
18
+ @validator('action')
19
+ def action_is_valid(cls, field):
20
+ if field not in [1, 2, 3]:
21
+ raise ValueError("Action is not valid ([1, 2, 3])!")
22
+ return field
23
+
24
+ class FourAction(BaseModel):
25
+ action: int = Field(description="the choosed action to perform")
26
+
27
+ # You can add custom validation logic easily with Pydantic.
28
+ @validator('action')
29
+ def action_is_valid(cls, field):
30
+ if field not in [1, 2, 3, 4]:
31
+ raise ValueError("Action is not valid ([1, 2, 3, 4])!")
32
+ return field
33
+
34
+ class SixAction(BaseModel):
35
+ action: int = Field(description="the choosed action to perform")
36
+
37
+ # You can add custom validation logic easily with Pydantic.
38
+ @validator('action')
39
+ def action_is_valid(cls, field):
40
+ if field not in [1, 2, 3, 4, 5, 6]:
41
+ raise ValueError("Action is not valid ([1, 2, 3, 4, 5, 6])!")
42
+ return field
43
+
44
+ class ContinuousAction(BaseModel):
45
+ action: float = Field(description="the choosed action to perform")
46
+ # You can add custom validation logic easily with Pydantic.
47
+ @validator('action')
48
+ def action_is_valid(cls, field):
49
+ if not (field >= -1 and field <= 1):
50
+ raise ValueError("Action is not valid ([-1,1])!")
51
+ return field
52
+
53
+ PARSERS = {1:ContinuousAction, 2: TwoAction, 3: ThreeAction, 4: FourAction, 6: SixAction}
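The deciders call `self.parser.get_format_instructions()` and `self.parser.parse(response).action`, so these Pydantic models are presumably wrapped in LangChain's `PydanticOutputParser`; the wiring below is an assumption for illustration, not code from this commit:

```python
from langchain.output_parsers import PydanticOutputParser
from deciders.parser import PARSERS

action_space_n = 2                                  # e.g. a two-action environment such as CartPole
parser = PydanticOutputParser(pydantic_object=PARSERS[action_space_n])

print(parser.get_format_instructions())            # schema text appended to the prompt
parsed = parser.parse('{"action": 1}')              # validated by TwoAction.action_is_valid
print(parsed.action)                                # -> 1
```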
deciders/reflexion.py ADDED
@@ -0,0 +1,179 @@
1
+ import openai
2
+ from .misc import history_to_str
3
+ from langchain.chat_models import AzureChatOpenAI
4
+ from langchain.prompts.chat import (
5
+ PromptTemplate,
6
+ ChatPromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ HumanMessagePromptTemplate,
9
+ )
10
+ from langchain.prompts.few_shot import FewShotPromptTemplate
11
+ from langchain import LLMChain
12
+ from loguru import logger
13
+ from langchain.callbacks import FileCallbackHandler
14
+ from langchain.callbacks import get_openai_callback
15
+ from .act import NaiveAct
16
+ from memory.env_history import EnvironmentHistory
17
+ import tiktoken
18
+ from .utils import run_chain
19
+
20
+
21
+ class Reflexion(NaiveAct):
22
+ def __init__(self, action_space, args, prompts, distiller, temperature=0.1, max_tokens=None, logger=None):
23
+ super().__init__(action_space, args, prompts, distiller, temperature, max_tokens, logger)
24
+
25
+ def num_tokens_from_string(self,string: str) -> int:
26
+ """Returns the number of tokens in a text string."""
27
+ num_tokens = len(self.encoding.encode(string))
28
+ return num_tokens
29
+
30
+ def update_mem(self,):
31
+ traj = self.game_description
32
+ traj += self.goal_description
33
+ one_history_token = self.num_tokens_from_string(self.env_history.get_one_history())
34
+ history_num = 4000 // one_history_token
35
+ traj += self.env_history.get_histories_with_last(history_num)
36
+ self._update_mem(traj)
37
+
38
+ def _update_mem(self, traj):
39
+ my_reflection = self.distiller.generate(traj, self.memory)
40
+ self.memory.append(my_reflection)
41
+ self.env_history.reset()
42
+
43
+ def act(
44
+ self,
45
+ state_description,
46
+ action_description,
47
+ env_info,
48
+ game_description,
49
+ goal_description,
50
+ logfile=None,
51
+ ):
52
+ self.action_description = action_description
53
+ self.game_description = game_description
54
+ self.goal_description = goal_description
55
+ self.env_history.add("observation", state_description)
56
+ chat = AzureChatOpenAI(
57
+ openai_api_type=openai.api_type,
58
+ openai_api_version=openai.api_version,
59
+ openai_api_base=openai.api_base,
60
+ openai_api_key=openai.api_key,
61
+ deployment_name=self.args.gpt_version,
62
+ temperature=self.temperature,
63
+ max_tokens=self.max_tokens,
64
+ )
65
+ suffix_flag = False
66
+ reply_format_description = \
67
+ "Your response should choose an optimal action from a valid action list and terminate with the following format: "
68
+
69
+ # System Message
70
+ human_template = "Now, you are completing a challenging task. You must carefully understand the Reflexion method you will use and apply it to the following task.\n"
71
+
72
+ # task-irrelevant SystemMessage
73
+ if self.irr_few_shot_examples:
74
+ human_template += 'In the following example, I shall present a set of question and answer about the Reflexion method. Please adhere to the format and reasoning of the provided response when addressing the subsequent task.\n'
75
+ for i, examples in enumerate(self.irr_few_shot_examples):
76
+ human_template += f"\nExample {i+1}:\n"
77
+ human_template += "Question: \n" + examples['question'] + "\nAnswer: \n" + examples['answer']
78
+
79
+ # task-irrelevant few shot if have
80
+ if self.irr_few_shot_examples:
81
+ human_template += "\nMoving forward, I will describe the task, the goal, and the actions you may execute. Please pay close attention to comprehend the information presented below.\n"
82
+
83
+ if self.fewshot_example:
84
+ human_template += "I will describe the task, the goal, and the actions you may execute. Please pay close attention to comprehend the information presented below."
85
+ # print(fewshot_example_prompt.format(**fewshot_examples[0]))
86
+ human_template += '\nTask Description: {game_description} \n'
87
+ human_template += 'Goal Description: {goal_description}\n'
88
+ human_template += 'Actions Description: {action_description}\n'
89
+
90
+ if self.fewshot_example:
91
+ human_template += "Here, I will provide you with some guidance to help you better understand the rules of the task. Next are some examples: "
92
+ for i, examples in enumerate(self.fewshot_example):
93
+ human_template += f"\nExample {i+1}:\n"
94
+ human_template += "Question: \n" + examples['question'] + "\nAnswer: \n" + examples['answer']
95
+
96
+ if self.prompt_level in [2, 3, 4]:
97
+ if self.memory:
98
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.\n'
99
+ suffix_flag = True
100
+ if self.prompt_level == 2:
101
+ human_template += 'I have collected a few trajectories from a random policy, and the summaries are listed below.'
102
+ elif self.prompt_level == 3:
103
+ human_template += 'I have collected a few trajectories before, and the summaries are listed below.'
104
+ elif self.prompt_level == 4:
105
+ human_template += 'I have collected a few trajectories from an expert policy, and the summaries are listed below.'
106
+ human_template += self._read_mem() + "\n"
107
+
108
+ if self.use_short_mem:
109
+ if len(self.env_history) > 1:
110
+ if not suffix_flag:
111
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.'
112
+ human_template += f"\nBelow are the latest {self.mem_num} historical data entries:\n"
113
+ human_template += f"{self.env_history.get_histories(self.mem_num)}"
114
+ human_template += '\nNext is the observation that the agent gets:\nCurrent {state_description}\n'
115
+ human_template += 'Please select an action based on the current game state and the information you get. You must select the appropriate action from the given action descriptions and cannot refrain from taking action or performing any prohibited actions. Here is the action description below:\n{action_description}\n'
116
+ human_template += 'Also, please keep in mind not to answer with any redundant and irrelevant content.\n'
117
+ human_template += "Finally, you also need to normalize your output according to the reply format description.\n"
118
+ human_template += 'Reply format description: {reply_format_description}{format_instructions}\n'
119
+
120
+ human_message_prompt = PromptTemplate(
121
+ template=human_template,
122
+ input_variables=[
123
+ 'state_description', 'goal_description', 'game_description',
124
+ 'action_description', 'reply_format_description'],
125
+ partial_variables={'format_instructions': self.parser.get_format_instructions()}
126
+ )
127
+
128
+ human_message_prompt = HumanMessagePromptTemplate(prompt=human_message_prompt)
129
+
130
+ chat_prompt = ChatPromptTemplate.from_messages([human_message_prompt])
131
+ if not self.logger:
132
+ # logger.remove()
133
+ if self.first_call:
134
+ self.logger = logger.add(logfile, colorize=True, enqueue=True, filter=lambda x: '[Reflexion Memory]' not in x['message'])
135
+ self.first_call = False
136
+ handler = FileCallbackHandler(logfile)
137
+ total_tokens, total_cost = 0, 0
138
+ max_think_times = 1
139
+ # TODO: ADD REACT Support
140
+ # print(str(self.env_history))
141
+
142
+ for i_think in range(max_think_times):
143
+ chain = LLMChain(llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=False)
144
+ with get_openai_callback() as cb:
145
+ response = run_chain(
146
+ chain,
147
+ state_description=self.env_history.get_last_history(),
148
+ game_description=game_description,
149
+ goal_description=goal_description,
150
+ action_description=action_description,
151
+ format_instructions=self.parser.get_format_instructions(),
152
+ reply_format_description=reply_format_description,
153
+ max_token = 3000
154
+ )
155
+
156
+ total_tokens += cb.total_tokens
157
+ total_cost += cb.total_cost
158
+ action = self.parser.parse(response).action
159
+ text_prompt = chat_prompt.format_messages(
160
+ state_description=self.env_history.get_last_history(),
161
+ game_description=game_description,
162
+ goal_description=goal_description,
163
+ action_description=action_description,
164
+ format_instructions=self.parser.get_format_instructions(),
165
+ reply_format_description=reply_format_description,
166
+ )
167
+ texts = ""
168
+ for text in text_prompt:
169
+ texts += text.content + "\n"
170
+
171
+ self._add_history_after_action(action)
172
+ self.logger.info(f'The GPT response is: {response}.')
173
+ self.logger.info(f'The optimal action is: {action}.')
174
+ if self.memory:
175
+ self.logger.info(f'The memory is: {self.memory[-1]}.')
176
+ if env_info.get('history'):
177
+ self.logger.info(f'History: {history_to_str(env_info["history"])}')
178
+
179
+ return action, texts, response, total_tokens, total_cost
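A small sketch of the token budgeting used in `update_mem` above: one transition is measured with `tiktoken`, and the number of recent transitions kept is capped so the reflection prompt stays within roughly 4000 tokens. The encoding name and the example string are assumptions for illustration, not values taken from this commit:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")      # the encoding NaiveAct presumably sets as self.encoding
one_history = "Current game state: ...\nAction: 1\nReward: 1.0\n"
one_history_token = len(encoding.encode(one_history))
history_num = 4000 // max(one_history_token, 1)      # how many recent transitions fit the budget
print(one_history_token, history_num)
```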
deciders/self_consistency.py ADDED
@@ -0,0 +1,170 @@
1
+ import openai
2
+ from .misc import history_to_str
3
+ from langchain.chat_models import AzureChatOpenAI
4
+ from langchain.prompts.chat import (
5
+ PromptTemplate,
6
+ ChatPromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ HumanMessagePromptTemplate,
9
+ )
10
+ from langchain.prompts.few_shot import FewShotPromptTemplate
11
+ from langchain import LLMChain
12
+ from loguru import logger
13
+ from langchain.callbacks import FileCallbackHandler
14
+ from langchain.callbacks import get_openai_callback
15
+ from .act import NaiveAct
16
+ from .utils import run_chain
17
+
18
+
19
+ class SelfConsistency(NaiveAct):
20
+ def __init__(self, action_space, args, prompts, distiller, temperature=0.1, max_tokens=None, logger=None):
21
+ temperature = 0.7
22
+ super().__init__(action_space, args, prompts, distiller, temperature, max_tokens, logger)
23
+ self.temperature = temperature
24
+
25
+ def act(
26
+ self,
27
+ state_description,
28
+ action_description,
29
+ env_info,
30
+ game_description,
31
+ goal_description,
32
+ logfile=None,
33
+ ):
34
+ # print(self.temperature)
35
+ self.action_description = action_description
36
+ self._add_history_before_action(game_description, goal_description, state_description)
37
+ chat = AzureChatOpenAI(
38
+ openai_api_type=openai.api_type,
39
+ openai_api_version=openai.api_version,
40
+ openai_api_base=openai.api_base,
41
+ openai_api_key=openai.api_key,
42
+ deployment_name=self.args.gpt_version,
43
+ temperature=self.temperature,
44
+ max_tokens=self.max_tokens
45
+ )
46
+
47
+ suffix_flag = False
48
+ reply_format_description = \
49
+ "Your response should choose an optimal action from a valid action list and terminate with the following format: "
50
+
51
+ # System Message
52
+ human_template = "Now, you are completing a challenging task. You must carefully understand the Self-Consistency method you will use and apply it to the following task.\n"
53
+
54
+ # task-irrelevant SystemMessage
55
+ if self.irr_few_shot_examples:
56
+ human_template += 'In the following example, I shall present a set of question and answer with the Self-Consistency method. Please adhere to the format and reasoning of the provided response when addressing the subsequent task.\n'
57
+ for i, examples in enumerate(self.irr_few_shot_examples):
58
+ human_template += f"\nExample {i+1}:\n"
59
+ human_template += "Question: \n" + examples['question'] + "\nAnswer: \n" + examples['answer']
60
+
61
+ # task-irrelevant few shot if have
62
+ if self.irr_few_shot_examples:
63
+ human_template += "\nMoving forward, I will describe the task, the goal, and the actions you may execute. Please pay close attention to comprehend the information presented below.\n"
64
+
65
+ if self.fewshot_example:
66
+ human_template += "I will describe the task, the goal, and the actions you may execute. Please pay close attention to comprehend the information presented below."
67
+ # print(fewshot_example_prompt.format(**fewshot_examples[0]))
68
+ human_template += '\nTask Description: {game_description} \n'
69
+ human_template += 'Goal Description: {goal_description}\n'
70
+ human_template += 'Actions Description: {action_description}\n'
71
+
72
+ if self.fewshot_example:
73
+ human_template += "Here, I will provide you with some guidance to help you better understand the rules of the task. Next are some examples: "
74
+ for i, examples in enumerate(self.fewshot_example):
75
+ human_template += f"\nExample {i+1}:\n"
76
+ human_template += "Question: \n" + examples['question'] + "\nAnswer: \n" + examples['answer']
77
+
78
+ if self.prompt_level in [2, 3, 4]:
79
+ if self.memory:
80
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.\n'
81
+ suffix_flag = True
82
+ if self.prompt_level == 2:
83
+ human_template += 'I have collected a few trajectories from a random policy, and the summaries are listed below.'
84
+ elif self.prompt_level == 3:
85
+ human_template += 'I have collected a few trajectories before, and the summaries are listed below.'
86
+ elif self.prompt_level == 4:
87
+ human_template += 'I have collected a few trajectories from an expert policy, and the summaries are listed below.'
88
+ human_template += self._read_mem() + "\n"
89
+
90
+ if self.use_short_mem:
91
+ if len(self.env_history) > 1:
92
+ if not suffix_flag:
93
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.'
94
+ human_template += f"\nBelow are the latest {self.args.short_mem_num} historical data entries:\n"
95
+ human_template += f"{self.env_history.get_histories(self.mem_num)}"
96
+ human_template += '\nNext is the observation that the agent gets:\nCurrent {state_description}\n'
97
+ human_template += 'Please select an action based on the current game state and the information you get. You must select the appropriate action from the given action descriptions and cannot refrain from taking action or performing any prohibited actions. Here is the action description below:\n{action_description}\n'
98
+ human_template += 'Please note that you need to carefully lay out your thought process on the question, not just give an answer. You need to write the corresponding logic of your thinking following the example above. Also, please keep in mind not to answer with any redundant and irrelevant content.\n'
99
+ human_template += "Finally, you also need to normalize your output according to the reply format description.\n"
100
+ human_template += 'Reply format description: {reply_format_description}{format_instructions}\n'
101
+
102
+ human_message_prompt = PromptTemplate(
103
+ template=human_template,
104
+ input_variables=[
105
+ 'state_description', 'goal_description', 'game_description',
106
+ 'action_description', 'reply_format_description'],
107
+ partial_variables={'format_instructions': self.parser.get_format_instructions()}
108
+ )
109
+
110
+ human_message_prompt = HumanMessagePromptTemplate(prompt=human_message_prompt)
111
+
112
+ chat_prompt = ChatPromptTemplate.from_messages([human_message_prompt])
113
+
114
+ if not self.logger:
115
+ logger.remove()
116
+ self.logger = logger.add(logfile, colorize=True, enqueue=True)
117
+ handler = FileCallbackHandler(logfile)
118
+
119
+ chain = LLMChain(llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=False)
120
+
121
+ text_prompt = chat_prompt.format_messages(
122
+ game_description=game_description,
123
+ state_description=state_description,
124
+ goal_description=goal_description,
125
+ action_description=action_description,
126
+ reply_format_description=reply_format_description
127
+ )
128
+ texts = ""
129
+ for text in text_prompt:
130
+ texts += text.content + "\n"
131
+
132
+ actions = []
133
+ response_dict = {}
134
+ error_flag = True
135
+ for i in range(5):
136
+ try:
137
+ with get_openai_callback() as cb:
138
+ response = run_chain(
139
+ chain,
140
+ game_description=game_description,
141
+ state_description=state_description,
142
+ goal_description=goal_description,
143
+ action_description=action_description,
144
+ reply_format_description=reply_format_description
145
+ )
146
+ total_tokens = cb.total_tokens
147
+ total_cost = cb.total_cost
148
+ action = self.parser.parse(response).action
149
+ actions.append(action)
150
+ response_dict[action] = response
151
+
152
+ self.logger.info(f'The GPT response is: {response}.')
153
+ self.logger.info(f'The optimal action is: {action}.\n')
154
+ except Exception:
155
+ continue
156
+
157
+ action = max(set(actions), key=actions.count)
158
+ # print(actions)
159
+ # print(action)
160
+ if actions:
161
+ self._add_history_after_action(action)
162
+ self.logger.info(f'The action list is: {actions}.')
163
+ self.logger.info(f'The GPT response is: {response_dict[action]}.')
164
+ self.logger.info(f'The optimal action is: {action}.')
165
+ if env_info.get('history'):
166
+ self.logger.info(f'History: {history_to_str(env_info["history"])}')
167
+ else:
168
+ raise Exception("No valid Actions!")
169
+
170
+ return action, texts, response, total_tokens, total_cost
deciders/selfask.py ADDED
@@ -0,0 +1,150 @@
1
+ import openai
2
+ from .misc import history_to_str
3
+ from langchain.chat_models import AzureChatOpenAI
4
+ from langchain.prompts.chat import (
5
+ PromptTemplate,
6
+ ChatPromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ HumanMessagePromptTemplate,
9
+ )
10
+ from langchain.prompts.few_shot import FewShotPromptTemplate
11
+ from langchain import LLMChain
12
+ from loguru import logger
13
+ from langchain.callbacks import FileCallbackHandler
14
+ from langchain.callbacks import get_openai_callback
15
+ from .act import NaiveAct
16
+ from .utils import run_chain
17
+
18
+
19
+ class SelfAskAct(NaiveAct):
20
+ def __init__(self, action_space, args, prompts, distiller, temperature=0.1, max_tokens=None, logger=None):
21
+ super().__init__(action_space, args, prompts, distiller, temperature, max_tokens,logger)
22
+
23
+ def act(
24
+ self,
25
+ state_description,
26
+ action_description,
27
+ env_info,
28
+ game_description,
29
+ goal_description,
30
+ logfile=None,
31
+ ):
32
+ self.action_description = action_description
33
+ self._add_history_before_action(game_description, goal_description, state_description)
34
+ chat = AzureChatOpenAI(
35
+ openai_api_type=openai.api_type,
36
+ openai_api_version=openai.api_version,
37
+ openai_api_base=openai.api_base,
38
+ openai_api_key=openai.api_key,
39
+ deployment_name=self.args.gpt_version,
40
+ temperature=self.temperature,
41
+ max_tokens=self.max_tokens
42
+ )
43
+
44
+ suffix_flag = False
45
+ reply_format_description = \
46
+ "Your response should choose an optimal action from a valid action list and terminate with the following format: "
47
+
48
+ # System Message
49
+ human_template = "Now, you are completing a challenging task. You must carefully understand the self-ask method you will use and apply it to the following task.\n"
50
+
51
+ # task-irrelevant SystemMessage
52
+ if self.irr_few_shot_examples:
53
+ human_template += 'In the following example, I shall present a set of question and answer with the self-ask method. Please adhere to the format and reasoning of the provided response when addressing the subsequent task.\n'
54
+ for i, examples in enumerate(self.irr_few_shot_examples):
55
+ human_template += f"\nExample {i+1}:\n"
56
+ human_template += "Question: \n" + examples['question'] + "\nAnswer: \n" + examples['answer']
57
+
58
+ # task-irrelevant few shot if have
59
+ if self.irr_few_shot_examples:
60
+ human_template += "\nMoving forward, I will describe the task, the goal, and the actions you may execute. Please pay close attention to comprehend the information presented below.\n"
61
+
62
+ if self.fewshot_example:
63
+ human_template += "I will describe the task, the goal, and the actions you may execute. Please pay close attention to comprehend the information presented below."
64
+ # print(fewshot_example_prompt.format(**fewshot_examples[0]))
65
+ human_template += '\nTask Description: {game_description} \n'
66
+ human_template += 'Goal Description: {goal_description}\n'
67
+ human_template += 'Actions Description: {action_description}\n'
68
+
69
+ if self.fewshot_example:
70
+ human_template += "Here, I will provide you with some guidance to help you better understand the rules of the task. Next are some examples: "
71
+ for i, examples in enumerate(self.fewshot_example):
72
+ human_template += f"\nExample {i+1}:\n"
73
+ human_template += "Question: \n" + examples['question'] + "\nAnswer: \n" + examples['answer']
74
+
75
+ if self.prompt_level in [2, 3, 4]:
76
+ if self.memory:
77
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.\n'
78
+ suffix_flag = True
79
+ if self.prompt_level == 2:
80
+ human_template += 'I have collected a few trajectories from a random policy, and the summaries are listed below.'
81
+ elif self.prompt_level == 3:
82
+ human_template += 'I have collected a few trajectories before, and the summaries are listed below.'
83
+ elif self.prompt_level == 4:
84
+ human_template += 'I have collected a few trajectories from an expert policy, and the summaries are listed below.'
85
+ human_template += self._read_mem() + "\n"
86
+
87
+ if self.use_short_mem:
88
+ if len(self.env_history) > 1:
89
+ if not suffix_flag:
90
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.'
91
+ human_template += f"\nBelow are the latest {self.args.short_mem_num} historical data entries:\n"
92
+ human_template += f"{self.env_history.get_histories(self.mem_num)}"
93
+ human_template += '\nNext is the observation that the agent gets:\nCurrent {state_description}\n'
94
+ human_template += 'Please select an action based on the current game state and the information you get. You must select the appropriate action from the given action descriptions and cannot refrain from taking action or performing any prohibited actions. Here is the action description below:\n{action_description}\n'
95
+ human_template += 'You must utilize a multi-turn dialogue approach, just as the format illustrated in the example above(like "Follow up" and "Intermediate answer"). And you need to write down the thought process during the self-ask process. Also, please keep in mind not to answer with any redundant and irrelevant content.\n'
96
+ human_template += "Finally, you also need to normalize your output according to the reply format description.\n"
97
+ human_template += 'Reply format description: {reply_format_description}{format_instructions}\n'
98
+
99
+ human_message_prompt = PromptTemplate(
100
+ template=human_template,
101
+ input_variables=[
102
+ 'state_description', 'goal_description', 'game_description',
103
+ 'action_description', 'reply_format_description'],
104
+ partial_variables={'format_instructions': self.parser.get_format_instructions()}
105
+ )
106
+
107
+ human_message_prompt = HumanMessagePromptTemplate(prompt=human_message_prompt)
108
+
109
+ chat_prompt = ChatPromptTemplate.from_messages([human_message_prompt])
110
+
111
+ if not self.logger:
112
+ logger.remove()
113
+ self.logger = logger.add(logfile, colorize=True, enqueue=True)
114
+ handler = FileCallbackHandler(logfile)
115
+
116
+ chain = LLMChain(
117
+ llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=False)
118
+
119
+ with get_openai_callback() as cb:
120
+ response = run_chain(
121
+ chain,
122
+ game_description=game_description,
123
+ state_description=state_description,
124
+ goal_description=goal_description,
125
+ action_description=action_description,
126
+ reply_format_description=reply_format_description
127
+ )
128
+ total_tokens = cb.total_tokens
129
+ total_cost = cb.total_cost
130
+ action = self.parser.parse(response).action
131
+
132
+ text_prompt = chat_prompt.format_messages(
133
+ game_description=game_description,
134
+ state_description=state_description,
135
+ goal_description=goal_description,
136
+ action_description=action_description,
137
+ reply_format_description=reply_format_description
138
+ )
139
+ texts = ""
140
+ for text in text_prompt:
141
+ texts += text.content + "\n"
142
+
143
+ self._add_history_after_action(action)
144
+
145
+ self.logger.info(f'The GPT response is: {response}.')
146
+ self.logger.info(f'The optimal action is: {action}.')
147
+ if env_info.get('history'):
148
+ self.logger.info(f'History: {history_to_str(env_info["history"])}')
149
+
150
+ return action, texts, response, total_tokens, total_cost
deciders/spp.py ADDED
@@ -0,0 +1,142 @@
1
+ import openai
2
+ from .misc import history_to_str
3
+ from langchain.chat_models import AzureChatOpenAI
4
+ from langchain.prompts.chat import (
5
+ PromptTemplate,
6
+ ChatPromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ HumanMessagePromptTemplate,
9
+ )
10
+ from langchain.prompts.few_shot import FewShotPromptTemplate
11
+ from langchain import LLMChain
12
+ from loguru import logger
13
+ from langchain.callbacks import FileCallbackHandler
14
+ from langchain.callbacks import get_openai_callback
15
+ from .act import NaiveAct
16
+ from .utils import run_chain
17
+
18
+ class SPP(NaiveAct):
19
+ def __init__(self, action_space, args, prompts, distiller, temperature=0.1, max_tokens=None, logger=None):
20
+ super().__init__(action_space, args, prompts, distiller, temperature, max_tokens, logger)
21
+
22
+ def act(
23
+ self,
24
+ state_description,
25
+ action_description,
26
+ env_info,
27
+ game_description,
28
+ goal_description,
29
+ logfile=None,
30
+ ):
31
+ self.action_description = action_description
32
+ self._add_history_before_action(game_description, goal_description, state_description)
33
+ chat = AzureChatOpenAI(
34
+ openai_api_type=openai.api_type,
35
+ openai_api_version=openai.api_version,
36
+ openai_api_base=openai.api_base,
37
+ openai_api_key=openai.api_key,
38
+ deployment_name=self.args.gpt_version,
39
+ temperature=self.temperature,
40
+ max_tokens=self.max_tokens
41
+ )
42
+
43
+ self.fewshot_example = self.irr_few_shot_examples if not self.fewshot_example else self.fewshot_example
44
+ self.irr_few_shot_examples = self.irr_few_shot_examples if not self.fewshot_example else self.fewshot_example
45
+ suffix_flag = False
46
+ reply_format_description = \
47
+ "Your response should choose an optimal action from a valid action list and terminate with the following format: "
48
+
49
+ # System Message
50
+ human_template = "When faced with a task, begin by identifying the participants who will contribute to solving the task. Then, initiate a multi-round collaboration process until a final solution is reached. The participants will give critical comments and detailed suggestions whenever necessary.\n"
51
+ human_template += "Now, you are completing a challenging task. You must carefully understand the Solo-Performance-Prompting method you will use and apply it to the following task.\n"
52
+
53
+ # task-irrelevant SystemMessage
54
+ if self.irr_few_shot_examples:
55
+ human_template += 'In the following example, I shall present a set of question and answer with the Solo-Performance-Prompting method. Please adhere to the format and reasoning of the provided response when addressing the subsequent task.\n'
56
+ for i, examples in enumerate(self.irr_few_shot_examples):
57
+ human_template += f"\nExample {i+1}:\n"
58
+ human_template += "Question: \n" + examples['question'] + "\nAnswer: \n" + examples['answer']
59
+
60
+ # task-irrelevant few shot if have
61
+ if self.irr_few_shot_examples:
62
+ human_template += "\nMoving forward, I will describe the task, the goal, and the actions you may execute. Please pay close attention to comprehend the information presented below.\n"
63
+
64
+ human_template += '\nTask Description: {game_description} \n'
65
+ human_template += 'Goal Description: {goal_description}\n'
66
+ human_template += 'Actions Description: {action_description}\n'
67
+
68
+ if self.prompt_level in [2, 3, 4]:
69
+ if self.memory:
70
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.\n'
71
+ suffix_flag = True
72
+ if self.prompt_level == 2:
73
+ human_template += 'I have collected a few trajectories from a random policy, and the summaries are listed below.'
74
+ elif self.prompt_level == 3:
75
+ human_template += 'I have collected a few trajectories before, and the summaries are listed below.'
76
+ elif self.prompt_level == 4:
77
+ human_template += 'I have collected a few trajectories from an expert policy, and the summaries are listed below.'
78
+ human_template += self._read_mem() + "\n"
79
+
80
+ if self.use_short_mem:
81
+ if len(self.env_history) > 1:
82
+ if not suffix_flag:
83
+ human_template += '\nSubsequently, I will offer pertinent guidance or information about the task. Please utilize this instruction to accomplish the given task effectively.'
84
+ human_template += f"\nBelow are the latest {self.args.short_mem_num} historical data entries:\n"
85
+ human_template += f"{self.env_history.get_histories(self.mem_num)}"
86
+ human_template += '\nNext is the observation that the agent gets:\nCurrent {state_description}\n'
87
+ human_template += 'Please select an action based on the current game state and the information you get. You must select the appropriate action from the given action descriptions and cannot refrain from taking action or performing any prohibited actions. Here is the action description below:\n{action_description}\n'
88
+ human_template += 'Please note that you need to carefully lay out the participants who will contribute to solving the task and initiate a multi-round collaboration process until a final solution is reached. Now, identify the participants and collaboratively solve the following task step by step. Also, please keep in mind not to answer with any redundant and irrelevant content.\n'
89
+ human_template += "Finally, you also need to normalize your output according to the reply format description.\n"
90
+ human_template += 'Reply format description: {reply_format_description}{format_instructions}\n'
91
+
92
+ human_message_prompt = PromptTemplate(
93
+ template=human_template,
94
+ input_variables=[
95
+ 'state_description', 'goal_description', 'game_description',
96
+ 'action_description', 'reply_format_description'],
97
+ partial_variables={'format_instructions': self.parser.get_format_instructions()}
98
+ )
99
+
100
+ human_message_prompt = HumanMessagePromptTemplate(prompt=human_message_prompt)
101
+
102
+ chat_prompt = ChatPromptTemplate.from_messages([human_message_prompt])
103
+
104
+ if not self.logger:
105
+ logger.remove()
106
+ self.logger = logger.add(logfile, colorize=True, enqueue=True)
107
+ handler = FileCallbackHandler(logfile)
108
+
109
+ chain = LLMChain(llm=chat, prompt=chat_prompt, callbacks=[handler], verbose=False)
110
+
111
+ with get_openai_callback() as cb:
112
+ response = run_chain(
113
+ chain,
114
+ game_description=game_description,
115
+ state_description=state_description,
116
+ goal_description=goal_description,
117
+ action_description=action_description,
118
+ reply_format_description=reply_format_description
119
+ )
120
+ total_tokens = cb.total_tokens
121
+ total_cost = cb.total_cost
122
+ action = self.parser.parse(response).action
123
+
124
+ text_prompt = chat_prompt.format_messages(
125
+ game_description=game_description,
126
+ state_description=state_description,
127
+ goal_description=goal_description,
128
+ action_description=action_description,
129
+ reply_format_description=reply_format_description
130
+ )
131
+ texts = ""
132
+ for text in text_prompt:
133
+ texts += text.content + "\n"
134
+
135
+ self._add_history_after_action(action)
136
+
137
+ self.logger.info(f'The GPT response is: {response}.')
138
+ self.logger.info(f'The optimal action is: {action}.')
139
+ if env_info.get('history'):
140
+ self.logger.info(f'History: {history_to_str(env_info["history"])}')
141
+
142
+ return action, texts, response, total_tokens, total_cost
deciders/utils.py ADDED
@@ -0,0 +1,65 @@
1
+ import os
2
+ import sys
3
+ import openai
4
+ from openai import OpenAI
5
+ from tenacity import (
6
+ retry,
7
+ stop_after_attempt, # type: ignore
8
+ wait_random_exponential, # type: ignore
9
+ )
10
+
11
+ from typing import Optional, List
12
+ if sys.version_info >= (3, 8):
13
+ from typing import Literal
14
+ else:
15
+ from typing_extensions import Literal
16
+
17
+
18
+ Model = Literal["gpt-4", "gpt-35-turbo", "text-davinci-003"]
19
+
20
+ from .gpt import gpt
21
+ gpt().__init__()
22
+
23
+ import timeout_decorator
24
+ @timeout_decorator.timeout(30)
25
+ def run_chain(chain, *args, **kwargs):
26
+ return chain.run(*args, **kwargs)
27
+
28
+ @retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
29
+ def get_completion(prompt: str, engine: str = "gpt-35-turbo", temperature: float = 0.0, max_tokens: int = 256, stop_strs: Optional[List[str]] = None) -> str:
30
+
31
+ client = OpenAI(api_key=openai.api_key)
32
+     # This call builds a plain-text completion (a single prompt string, read back
+     # via response.choices[0].text), so it must use the completions endpoint
+     # rather than chat.completions.
+     response = client.completions.create(
+         model=engine,
+         prompt=prompt,
+         temperature=temperature,
+         max_tokens=max_tokens,
+         top_p=1,
+         frequency_penalty=0.0,
+         presence_penalty=0.0,
+         stop=stop_strs,
+         # request_timeout = 1
+     )
+     return response.choices[0].text
44
+
45
+ # @retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
46
+ def get_chat(prompt: str, model: str = "gpt-35-turbo", engine: str = "gpt-35-turbo", temperature: float = 0.0, max_tokens: int = 256, stop_strs: Optional[List[str]] = None, is_batched: bool = False) -> str:
47
+ assert model != "text-davinci-003"
48
+ messages = [
49
+ {
50
+ "role": "user",
51
+ "content": prompt
52
+ }
53
+ ]
54
+ # import pdb;pdb.set_trace()
55
+ client = OpenAI(api_key=openai.api_key)
56
+
57
+ response = client.chat.completions.create(
58
+ model=model,
59
+ messages=messages,
60
+ max_tokens=max_tokens,
61
+ stop=stop_strs,
62
+ temperature=temperature,
63
+ # request_timeout = 1
64
+ )
65
+ return response.choices[0].message.content
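`run_chain` above raises `timeout_decorator.TimeoutError` after 30 seconds, so callers need a guard around it. The retry-then-fallback policy sketched below is an assumption for illustration, not behavior defined in this commit:

```python
import random
import timeout_decorator
from deciders.utils import run_chain

def safe_run_chain(chain, action_space_n, n_retries=2, **kwargs):
    """Run the chain, retrying on timeout and falling back to a random action."""
    for _ in range(n_retries):
        try:
            return run_chain(chain, **kwargs)
        except timeout_decorator.TimeoutError:
            continue
    # Degraded fallback: a reply the Pydantic action parsers can still parse.
    return f'{{"action": {random.randint(1, action_space_n)}}}'
```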
distillers/__init__.py ADDED
@@ -0,0 +1,10 @@
1
+ from .raw_prompt_generator import RawPromptGenerator
2
+ from .self_reflection import RefletionGenerator
3
+ from .traj_prompt_summarizer import TrajPromptSummarizer
4
+ from .guider import Guidance_Generator
5
+
6
+ REGISTRY = {}
7
+ REGISTRY['raw_distiller'] = RawPromptGenerator
8
+ REGISTRY['reflect_distiller'] = RefletionGenerator
9
+ REGISTRY['traj_distiller'] = TrajPromptSummarizer
10
+ REGISTRY['guide_generator'] = Guidance_Generator
distillers/guidance_summary_few_shot_examples.txt ADDED
@@ -0,0 +1,85 @@
1
+ You are in the middle of a room. Looking quickly around you, you see a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 2, a countertop 1, a diningtable 2, a diningtable 1, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.
2
+ Your task is to: heat some mug and put it in coffeemachine.
3
+ Action: think: To solve the task, I need to find and take a mug, then heat it with stoveburner, then put it in coffeemachine.
4
+ Ok.
5
+ Action: look
6
+ You are in the middle of a room. Looking quickly around you, you see nothing.
7
+ Action: think: First I need to find a mug. A mug is more likely to appear in countertop (1-2), diningtable (1-2), drawer (1-3), fridge (1), sinkbasin (1), stoveburner (1-4), toaster (1). I can check one by one, starting with countertop 1.
8
+ Ok.
9
+ Action: look
10
+ You are in the middle of a room. Looking quickly around you, you see nothing.
11
+ Action: go to countertop 1
12
+ On the countertop 1, you see a butterknife 1, a cup 1, a fork 2, and a mug 1.
13
+ Action: think: Now I find a mug (1). Next, I need to take it.
14
+ Ok.
15
+ Action: take mug 1 from countertop 1
16
+ You pick up the mug 1 from the countertop 1.
17
+ Action: think: Now I take a mug (1). Next, I need to heat it with stoveburner (1-4).
18
+ Ok.
19
+ Action: go to stoveburner 1
20
+ On the stoveburner 1, you see a pan 2.
21
+ Action: examine stoveburner 1
22
+ On the stoveburner 1, you see a pan 2.
23
+ Action: examine stoveburner 1
24
+ On the stoveburner 1, you see a pan 2.
25
+ Action: examine stoveburner 1
26
+ On the stoveburner 1, you see a pan 2.
27
+ Action: examine stoveburner 1
28
+ On the stoveburner 1, you see a pan 2.
29
+ STATUS: FAIL
30
+ Summary and Key Findings: Summary of Trajectory and Outcomes:
31
+ The task was to heat a mug and put it in the coffeemachine. The initial plan was to find and take a mug, heat it with a stoveburner, and then put it in the coffeemachine. The user successfully found and took a mug from countertop 1. However, upon reaching stoveburner 1, the user got stuck in a loop of examining the stoveburner without making any progress in heating the mug. The task ultimately ended in failure.
32
+ Key Findings:
33
+ 1. The user was able to identify potential locations for finding a mug and successfully located one on countertop 1.
34
+ 2. The user got stuck in a loop of examining stoveburner 1 without taking any action to heat the mug.
35
+ 3. The task was not completed successfully due to the user's inability to progress beyond examining the stoveburner.
36
+
37
+ You are in the middle of a room. Looking quickly around you, you see a bed 1, a desk 2, a desk 1, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a garbagecan 1, a laundryhamper 1, a safe 1, a shelf 6, a shelf 5, a shelf 4, a shelf 3, a shelf 2, and a shelf 1.
38
+ Your task is to: look at bowl under the desklamp.
39
+ Action: think: To solve the task, I need to find and take a bowl, then find and use a desklamp.
40
+ OK.
41
+ Action: think: First I need to find a bowl. A bowl is more likely to appear in drawer (1-6), desk (1-2), shelf (1-6), garbagecan (1), laundryhamper (1). I can check one by one, starting with drawer 1.
42
+ OK.
43
+ Action: go to drawer 1
44
+ The drawer 1 is closed.
45
+ Action: open drawer 1
46
+ You open the drawer 1. The drawer 1 is open. In it, you see a cellphone 2.
47
+ Action: go to drawer 2
48
+ On the drawer 2, you see a keychain 1.
49
+ Action: go to drawer 3
50
+ Nothing happens.
51
+ Action: go to drawer 4
52
+ The drawer 4 is closed.
53
+ Action: open drawer 4
54
+ You open the drawer 4. The drawer 4 is open. In it, you see nothing.
55
+ Action: go to drawer 5
56
+ On the drawer 5, you see nothing.
57
+ Action: go to drawer 6
58
+ The drawer 6 is closed.
59
+ Action: open drawer 6
60
+ You open the drawer 6. The drawer 6 is open. In it, you see a book 2, and a pen 2.
61
+ Action: go to desk 1
62
+ On the desk 1, you see a alarmclock 1, a desklamp 1, a mug 1, and a pencil 1.
63
+ Action: think: Now I find a desklamp (1). Next, I need to use it.
64
+ OK.
65
+ Action: use desklamp 1
66
+ You turn on the desklamp 1.
67
+ Action:
68
+ Nothing happens.
69
+ Action: go to desk 2
70
+ On the desk 2, you see a bowl 2, a bowl 1, a cd 2, a cd 1, and a creditcard 3.
71
+ Action: think: Now I find a bowl (2). Next, I need to take it.
72
+ OK.
73
+ Action: take bowl 2 from desk 2
74
+ You pick up the bowl 2 from the desk 2.
75
+ Action:
76
+ Nothing happens.
77
+ Action:
78
+ Nothing happens.
79
+ STATUS: FAIL
80
+ Summary: Summary of Trajectory and Outcomes:
81
+ The task was to look at a bowl under the desklamp. The initial plan was to find and take a bowl, then find and use a desklamp. The user successfully found a desklamp on desk 1 and turned it on. They also found a bowl on desk 2 and picked it up. However, the user did not complete the task of looking at the bowl under the desklamp and instead got stuck in a state of inaction. The task ultimately ended in failure.
82
+ Key Findings:
83
+ 1. The user was able to identify potential locations for finding a bowl and a desklamp and successfully located them on desk 2 and desk 1, respectively.
84
+ 2. The user turned on the desklamp but did not proceed to look at the bowl under it.
85
+ 3. The task was not completed successfully due to the user's inability to progress beyond taking the bowl and turning on the desklamp.
distillers/guider.py ADDED
@@ -0,0 +1,144 @@
1
+ from deciders.utils import get_completion, get_chat
2
+
3
+ from typing import List, Dict, Any
4
+ from loguru import logger
5
+ import random
6
+ import json
7
+ class Guidance_Generator():
8
+ def __init__(self,logfile="",args=None):
9
+ self.args = args
10
+ with open("./distillers/guidance_summary_few_shot_examples.txt", 'r') as f:
11
+ self.SUMMARY_FEW_SHOT_EXAMPLES = f.read()
12
+ # with open("./distillers/exploration_few_shot_examples.txt", 'r') as f:
13
+ # self.SUGGEST_FEW_SHOT_EXAMPLES = f.read()
14
+ self.insight = ""
15
+ self.suggestion = ""
16
+ if logfile:
17
+ # logger.remove()
18
+ logger.add(logfile, colorize=True, enqueue=True, filter=lambda x: '[Reflexion Memory]' in x['message'])
19
+
20
+ def generate_from_file(self, file_path,max_step_num=200):
21
+ mem = []
22
+ with open(file_path, 'r') as infile:
23
+ data = json.load(infile)
24
+ for traj in data:
25
+ traj_text = traj[0]['game_description']
26
+ traj_text += traj[0]['goal_description']
27
+ for transition in traj[-max_step_num:]:
28
+ traj_text += transition['observation']
29
+ traj_text += f"Action: {transition['action']}"
30
+ summary = self.generate_summary(traj_text, mem)
31
+ mem.append(summary)
32
+ return mem
33
+
34
+ def _generate_summary_query(self, traj, post_memory):
35
+ """
36
+ Generates an exploration guidance query for GPT-3.5 based on given trajectory and memory.
37
+
38
+ Parameters:
39
+ - traj: Trajectory of the new experience.
40
+ - post_memory: List of memory items to summarize.
41
+
42
+ Returns:
43
+ - query: Formulated query string for GPT-3.5.
44
+ """
45
+ segments = []
46
+
47
+ # Summarization memory
48
+ # if post_memory:
49
+ # segments.append('Your summarization memory is as below:')
50
+ # segments.extend([f'Episode #{i}: {m}' for i, m in enumerate(post_memory)])
51
+
52
+ # Trajectory
53
+ segments.append(f"Your new collected trajectory is as below:\n {traj}")
54
+ segments.append(f"The suggestion to guide the trajectory is:\n{self.suggestion}")
55
+ # Questions
56
+ questions = """
57
+ Please answer the following questions directly, without additional explanation:
58
+ 1. Based on the collected trajectory, infer the specific values of game-relevant knowledge proposed in the suggestion with json format.
59
+ 2. Summarize the policy behavior and its performance.
60
+ Provide concise responses.
61
+ """
62
+ segments.append(questions)
63
+
64
+ # Construct the final query
65
+ query = '\n'.join(segments)
66
+ return query
67
+
68
+ # def _generate_summary_query(self, traj, post_memory):
69
+ # """Allows the Agent to generate exploration guidance."""
70
+ # query = ""
71
+ # if len(post_memory) > 0:
72
+ # query += '\Your summarization memory is as below:\n'
73
+ # for i, m in enumerate(post_memory):
74
+ # query += f'Episode #{i}: {m}\n'
75
+ # query += f"""
76
+ # {traj}
77
+ # Above is the trajectory of the new experience.
78
+ # """
79
+ # query += '\n Anwser the following questions.\n 1. What is the performance of this policy and does it improve the performance compared to before? 2. Summarize the main reason that makes the policy improve or reduce the performance; 3. What new information of the task can be inferred compared to the memory?'
80
+ # return query
81
+
82
+ def generate_summary(self, traj, post_memory):
83
+ query = self._generate_summary_query(traj, post_memory)
84
+ summary = get_chat(query,model=self.args.gpt_version, engine=self.args.gpt_version)
85
+ logger.info(f'[Reflexion Memory]The summary prompt is: {query}.')
86
+ logger.info(f'[Reflexion Memory]The summary response is: {summary}.')
87
+ return summary
88
+
89
+ def generate_insight(self, post_memory):
90
+ query: str = f"""As an AI assistant, you are helping a six-year-old player who has never played this game before. The experiences you have are as follows:"""
91
+ if len(post_memory) > 0:
92
+ for i, m in enumerate(post_memory):
93
+ query += f'Episode #{i}: {m}\n'
94
+ query += '\n Identify and summarize the key information that can be exploited to improve performance of the player.'
95
+ insight = get_chat(query,model=self.args.gpt_version, engine=self.args.gpt_version)
96
+ logger.info(f'[Reflexion Memory]The insight prompt is: {query}.')
97
+ logger.info(f'[Reflexion Memory]The insight response is: {insight}.')
98
+ return insight
99
+
100
+ def generate_suggestion(self, game_description, goal_description, action_description, pre_memory, post_memory, insight, max_num_trials):
101
+ query: str = f"""You are an AI assistant that helps a human player win the following game.
102
+ The game is \n"{game_description}" \n, the action space is described as {action_description},\n the player's goal is \n "{goal_description}".\n
103
+ The player can play for {max_num_trials} episodes. The main aim for you is to help the player win the game in the last episode. """
104
+ if len(post_memory) > 0:
105
+ query += f"""You have obtained the following experience: """
106
+ for i, m in enumerate(post_memory):
107
+ query += f'Episode #{i}: {m}\n'
108
+ # if max_num_trials - len(post_memory) == 1:
109
+ # query = (f"\n The main goal is to aid the human player in winning the game in the next episode. "
110
+ # f"This is his {len(post_memory) + 1} try out of {max(max_num_trials, 1)} episodes. "
111
+ # "Your suggestions should be simple, executable with heuristic policy, and suitable for an LLM agent. "
112
+ # "Reply in an item list format. Specifically, focus on:"
113
+ # "\n1. How to achieve optimal performance (exploitation) using the obtained knowledge?"
114
+ # "\nNote: Stress the importance of prioritizing performance without exploration.")
115
+ # suggestion = get_chat(query) + "\n Remember, in this attempt, aim solely for high performance without exploration."
116
+ # else:
117
+ # if max_num_trials-len(post_memory) == 1:
118
+ # query += f"\n The main aim for you is to help the human player win the game in the last episode. The next episode is the last episode. You can give suggestions before each episode. Then what is your suggestion for his next episode? Note that this is the last try and he should not explore which may decrease his performance. The suggestions should be simple to follow, executable with heuristic policy, easy to use for an llm agent,and reply in item list format. The answer should instruct him to exploit all the knowlegde to gain the highest performance (exploitation) in the next episode. "
119
+ # else:
120
+ query += f"\n The main aim for you is to help the human player win the game in the last episode. He has only {max(max_num_trials-len(post_memory), 1)} episodes left to try. You can give suggestions before each episode. Then what is your suggestion for his next episode? Please provide simple, concise answers suitable for a six-year-old child, focusing on the following in item list format: 1. What game-relevant knowledge is critical to determining the optimal policy? Note that the knowledge should be obtainable by interacting with the environment and helpful for the decisions.\n 2. How should the player conduct exploration in the next episode to acquire this information?\n3. How can the player exploit the information obtained to achieve higher performance in subsequent episodes?\n 4. How should exploration and exploitation be balanced to improve performance in the next episode?\n"
121
+ # query += (f"\n The primary goal is to assist the human player in winning the game in the final episode. "
122
+ # f"This is his {len(post_memory) + 1} try out of {max(max_num_trials, 1)} episodes. "
123
+ # "Provide suggestions for the next episode that balance both exploration and exploitation. "
124
+ # "The suggestions should be in item list format, easy to follow, aligned with heuristic policy, and usable for an LLM agent. Address:"
125
+ # "\n1. Which information the player should gather via exploration and the best ways to explore?"
126
+ # "\n2. Strategies to refine the policy for enhanced performance (exploitation)?"
127
+ # "\n3. How should exploration and exploitation be weighted in the next episode?")
128
+
129
+ # TODO: consider the inconsistency between past suggestion and past memory.
130
+ suggestion = get_chat(query,model=self.args.gpt_version, engine=self.args.gpt_version)
131
+ self.suggestion = suggestion
132
+ logger.info(f'[Reflexion Memory]The suggestion prompt is: {query}.')
133
+ logger.info(f'[Reflexion Memory]The suggestion response is: {suggestion}.')
134
+ return suggestion
135
+
136
+ def generate(self, traj, memory, max_len_mem=5):
137
+ if len(memory)> max_len_mem:
138
+ reflection_query = self._generate_summary_query(traj, memory[-max_len_mem:])
139
+ else:
140
+ reflection_query = self._generate_summary_query(traj, memory)
141
+ reflection = get_completion(reflection_query,engine=self.args.gpt_version)
142
+ logger.info(f'[Reflexion Memory]The reflexion prompt is: {reflection_query}.')
143
+ logger.info(f'[Reflexion Memory]The reflexion response is: {reflection}.')
144
+ return reflection
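A rough usage sketch of the guider above (hedged: the instance name `guider`, the `env` description fields, and the `run_episode` helper are illustrative assumptions, not part of this commit; the real wiring lives in the decider/experiment code):

# Hypothetical sketch of the summarize -> insight -> suggestion loop implemented above.
mem, insight, suggestion = [], "", ""
for episode in range(5):
    suggestion = guider.generate_suggestion(env.game_description, env.goal_description,
                                            env.action_description, pre_memory=[],
                                            post_memory=mem, insight=insight,
                                            max_num_trials=5)
    traj_text = run_episode(env, suggestion)              # assumed rollout helper returning the trajectory text
    mem.append(guider.generate_summary(traj_text, mem))   # per-episode summary appended to memory
    insight = guider.generate_insight(mem)                # key information distilled from all summaries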
distillers/raw_prompt_generator.py ADDED
@@ -0,0 +1,16 @@
1
+ import random
2
+ import json
3
+ class RawPromptGenerator():
4
+ def __init__(self,args=None):
5
+ self.args = args
6
+ pass
7
+
8
+ def generate_from_file(self, file_path, choice_num=1):
9
+ with open(file_path, 'r') as infile:
10
+ data = json.load(infile)
11
+ result = []
12
+ for my_data in data[0]:
13
+ result.append({'question': my_data['question'], 'answer': my_data['answer']})
14
+ selected_index = random.sample(range(len(result)), choice_num)
15
+ selected_result = [result[index] for index in selected_index]
16
+ return selected_result
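A minimal usage sketch for RawPromptGenerator (the file path is a placeholder; the expected layout, inferred from the loop above, is a JSON list whose first element is a list of question/answer dicts):

gen = RawPromptGenerator()
examples = gen.generate_from_file("path/to/few_shot_examples.json", choice_num=2)
for ex in examples:
    print(ex["question"], "->", ex["answer"])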
distillers/reflexion_few_shot_examples.txt ADDED
@@ -0,0 +1,75 @@
1
+ You are in the middle of a room. Looking quickly around you, you see a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 2, a countertop 1, a diningtable 2, a diningtable 1, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.
2
+ Your task is to: heat some mug and put it in coffeemachine.
3
+ > think: To solve the task, I need to find and take a mug, then heat it with stoveburner, then put it in coffeemachine.
4
+ Ok.
5
+ > look
6
+ You are in the middle of a room. Looking quickly around you, you see nothing.
7
+ > think: First I need to find a mug. A mug is more likely to appear in countertop (1-2), diningtable (1-2), drawer (1-3), fridge (1), sinkbasin (1), stoveburner (1-4), toaster (1). I can check one by one, starting with countertop 1.
8
+ Ok.
9
+ > look
10
+ You are in the middle of a room. Looking quickly around you, you see nothing.
11
+ > go to countertop 1
12
+ On the countertop 1, you see a butterknife 1, a cup 1, a fork 2, and a mug 1.
13
+ > think: Now I find a mug (1). Next, I need to take it.
14
+ Ok.
15
+ > take mug 1 from countertop 1
16
+ You pick up the mug 1 from the countertop 1.
17
+ > think: Now I take a mug (1). Next, I need to heat it with stoveburner (1-4).
18
+ Ok.
19
+ > go to stoveburner 1
20
+ On the stoveburner 1, you see a pan 2.
21
+ > examine stoveburner 1
22
+ On the stoveburner 1, you see a pan 2.
23
+ > examine stoveburner 1
24
+ On the stoveburner 1, you see a pan 2.
25
+ > examine stoveburner 1
26
+ On the stoveburner 1, you see a pan 2.
27
+ > examine stoveburner 1
28
+ On the stoveburner 1, you see a pan 2.
29
+ STATUS: FAIL
30
+ New plan: I was stuck in a loop in which I continually examined stoveburner 1 instead of heating mug 1 with stoveburner 1. I should have taken mug 1 from countertop 1, then heated it with stoveburner 1, then put it in coffeemachine 1. It did not help to execute two identical actions in a row. I will try to execute a different action if I am stuck in a loop again.
31
+
32
+ You are in the middle of a room. Looking quickly around you, you see a bed 1, a desk 2, a desk 1, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a garbagecan 1, a laundryhamper 1, a safe 1, a shelf 6, a shelf 5, a shelf 4, a shelf 3, a shelf 2, and a shelf 1.
33
+ Your task is to: look at bowl under the desklamp.
34
+ > think: To solve the task, I need to find and take a bowl, then find and use a desklamp.
35
+ OK.
36
+ > think: First I need to find a bowl. A bowl is more likely to appear in drawer (1-6), desk (1-2), shelf (1-6), garbagecan (1), laundryhamper (1). I can check one by one, starting with drawer 1.
37
+ OK.
38
+ > go to drawer 1
39
+ The drawer 1 is closed.
40
+ > open drawer 1
41
+ You open the drawer 1. The drawer 1 is open. In it, you see a cellphone 2.
42
+ > go to drawer 2
43
+ On the drawer 2, you see a keychain 1.
44
+ > go to drawer 3
45
+ Nothing happens.
46
+ > go to drawer 4
47
+ The drawer 4 is closed.
48
+ > open drawer 4
49
+ You open the drawer 4. The drawer 4 is open. In it, you see nothing.
50
+ > go to drawer 5
51
+ On the drawer 5, you see nothing.
52
+ > go to drawer 6
53
+ The drawer 6 is closed.
54
+ > open drawer 6
55
+ You open the drawer 6. The drawer 6 is open. In it, you see a book 2, and a pen 2.
56
+ > go to desk 1
57
+ On the desk 1, you see a alarmclock 1, a desklamp 1, a mug 1, and a pencil 1.
58
+ > think: Now I find a desklamp (1). Next, I need to use it.
59
+ OK.
60
+ > use desklamp 1
61
+ You turn on the desklamp 1.
62
+ >
63
+ Nothing happens.
64
+ > go to desk 2
65
+ On the desk 2, you see a bowl 2, a bowl 1, a cd 2, a cd 1, and a creditcard 3.
66
+ > think: Now I find a bowl (2). Next, I need to take it.
67
+ OK.
68
+ > take bowl 2 from desk 2
69
+ You pick up the bowl 2 from the desk 2.
70
+ >
71
+ Nothing happens.
72
+ >
73
+ Nothing happens.
74
+ STATUS: FAIL
75
+ New plan: In this environment, my plan was to find a bowl then find and use a desklamp. However, the task says to look at bowl under the desklamp. I should have looked for the desklamp then looked for the bowl. I noticed that the desklamp was found on desk 1. In the next trial, I will go to desk 1, find the lamp, then look for the bowl under the desklamp.
distillers/self_reflection.py ADDED
@@ -0,0 +1,53 @@
1
+ from deciders.utils import get_completion
2
+
3
+ from typing import List, Dict, Any
4
+ from loguru import logger
5
+ import random
6
+ import json
7
+ class RefletionGenerator():
8
+ def __init__(self,logfile="",args=None):
9
+ self.args = args
10
+ with open("./distillers/reflexion_few_shot_examples.txt", 'r') as f:
11
+ self.FEW_SHOT_EXAMPLES = f.read()
12
+ if logfile:
13
+ # logger.remove()
14
+ logger.add(logfile, colorize=True, enqueue=True, filter=lambda x: '[Reflexion Memory]' in x['message'])
15
+
16
+ def generate_from_file(self, file_path,max_step_num=200):
17
+ mem = []
18
+ with open(file_path, 'r') as infile:
19
+ data = json.load(infile)
20
+ for traj in data:
21
+ traj_text = traj[0]['game_description']
22
+ traj_text += traj[0]['goal_description']
23
+ for transition in traj[-max_step_num:]:
24
+ traj_text += transition['observation']
25
+ traj_text += f"Action: {transition['action']}"
26
+ reflection = self.generate(traj_text, mem, max_len_mem=5)
27
+ mem.append(reflection)
28
+ return mem
29
+
30
+ def _generate_reflection_query(self, traj, memory):
31
+ """Allows the Agent to reflect upon a past experience."""
32
+ query: str = f"""You will be given the history of a past experience in which you were placed in an environment and given a task to complete. You were unsuccessful in completing the task. Do not summarize your environment, but rather think about the strategy and path you took to attempt to complete the task. Think step by step about the mistakes you made that led to the failure. Then devise a concise new plan of action that accounts for your mistakes, with reference to the specific actions you should have taken. For example, if you tried A and B but forgot C, you should reason that forgetting C was the key mistake. After that, devise a plan to achieve C with environment-specific actions. Remind yourself of the plan you will take in the next trial and give your plan after "Plan". Here are two examples:
33
+
34
+ {self.FEW_SHOT_EXAMPLES}
35
+
36
+ {traj}"""
37
+ if len(memory) > 0:
38
+ query += '\n\nPlans from past attempts:\n'
39
+ for i, m in enumerate(memory):
40
+ query += f'Trial #{i}: {m}\n'
41
+
42
+ query += '\n\nNew plan:'
43
+ return query
44
+
45
+ def generate(self, traj, memory, max_len_mem=5):
46
+ if len(memory)> max_len_mem:
47
+ reflection_query = self._generate_reflection_query(traj, memory[-max_len_mem:])
48
+ else:
49
+ reflection_query = self._generate_reflection_query(traj, memory)
50
+ reflection = get_completion(reflection_query, engine=self.args.gpt_version)
51
+ logger.info(f'[Reflexion Memory]The reflexion prompt is: {reflection_query}.')
52
+ logger.info(f'[Reflexion Memory]The reflexion response is: {reflection}.')
53
+ return reflection
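A minimal usage sketch for RefletionGenerator (the gpt_version value, log path, and trajectory file are placeholders; the trajectory JSON layout is inferred from generate_from_file above):

from argparse import Namespace

args = Namespace(gpt_version="gpt-3.5-turbo")        # assumed run configuration
reflector = RefletionGenerator(logfile="logs/reflexion.log", args=args)
plans = reflector.generate_from_file("trajs/cartpole_trajs.json", max_step_num=100)
print(plans[-1])                                     # the most recent reflection/plan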
distillers/traj_prompt_summarizer.py ADDED
@@ -0,0 +1,46 @@
1
+ import random
2
+ from deciders.utils import get_completion
3
+ import json
4
+ class TrajPromptSummarizer():
5
+ def __init__(self,args=None):
6
+ self.args = args
7
+ with open("./distillers/traj_summary_few_shot_examples.txt", 'r') as f:
8
+ self.FEW_SHOT_EXAMPLES = f.read()
9
+
10
+ def generate_from_file(self, file_path,max_step_num=200):
11
+ mem = []
12
+ with open(file_path, 'r') as infile:
13
+ data = json.load(infile)
14
+ for traj in data:
15
+ traj_text = traj[0]['game_description']
16
+ traj_text += traj[0]['goal_description']
17
+ for transition in traj[-max_step_num:]:
18
+ traj_text += transition['observation']
19
+ traj_text += f"> {transition['action']}"
20
+ traj_text += f"Your performance is: {transition['cum_reward']}"
21
+ reflection = self.generate(traj_text, mem, max_len_mem=5)
22
+ mem.append(reflection)
23
+ return mem
24
+
25
+ def _generate_summary_query(self, traj, memory):
26
+ """Allows the Agent to reflect upon a past experience."""
27
+ query: str = f"""You will be given the history of a past experience in which you were placed in an environment and given a task to complete. Summarize your trajectory and reason about the relation between your policy and the obtained result. Here are two examples:
28
+
29
+ {self.FEW_SHOT_EXAMPLES}
30
+
31
+ {traj}"""
32
+ if len(memory) > 0:
33
+ query += '\n\nPlans from past attempts:\n'
34
+ for i, m in enumerate(memory):
35
+ query += f'Trial #{i}: {m}\n'
36
+
37
+ query += '\n\nSummary:'
38
+ return query
39
+
40
+ def generate(self, traj, memory, max_len_mem=5):
41
+ if len(memory)> max_len_mem:
42
+ reflection_query = self._generate_summary_query(traj, memory[-max_len_mem:])
43
+ else:
44
+ reflection_query = self._generate_summary_query(traj, memory)
45
+ reflection = get_completion(reflection_query, engine=self.args.gpt_version)
46
+ return reflection
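For reference, generate_from_file above expects a JSON file shaped roughly like the Python literal below (field names are taken from the code; the concrete values are invented):

example_trajs = [
    [   # one trajectory: a list of transitions, oldest first
        {"game_description": "In the CartPole game, ...",
         "goal_description": "The goal is to keep the pole balanced upright ...",
         "observation": "The cart is positioned at 0.012, ...",
         "action": 1,
         "cum_reward": 1.0},
        # ... more transitions ...
    ]
]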
distillers/traj_summary_few_shot_examples.txt ADDED
@@ -0,0 +1,76 @@
1
+ You are in the middle of a room. Looking quickly around you, you see a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 2, a countertop 1, a diningtable 2, a diningtable 1, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.
2
+ Your task is to: heat some mug and put it in coffeemachine.
3
+ > think: To solve the task, I need to find and take a mug, then heat it with stoveburner, then put it in coffeemachine.
4
+ Ok.
5
+ > look
6
+ You are in the middle of a room. Looking quickly around you, you see nothing.
7
+ > think: First I need to find a mug. A mug is more likely to appear in countertop (1-2), diningtable (1-2), drawer (1-3), fridge (1), sinkbasin (1), stoveburner (1-4), toaster (1). I can check one by one, starting with countertop 1.
8
+ Ok.
9
+ > look
10
+ You are in the middle of a room. Looking quickly around you, you see nothing.
11
+ > go to countertop 1
12
+ On the countertop 1, you see a butterknife 1, a cup 1, a fork 2, and a mug 1.
13
+ > think: Now I find a mug (1). Next, I need to take it.
14
+ Ok.
15
+ > take mug 1 from countertop 1
16
+ You pick up the mug 1 from the countertop 1.
17
+ > think: Now I take a mug (1). Next, I need to heat it with stoveburner (1-4).
18
+ Ok.
19
+ > go to stoveburner 1
20
+ On the stoveburner 1, you see a pan 2.
21
+ > examine stoveburner 1
22
+ On the stoveburner 1, you see a pan 2.
23
+ > examine stoveburner 1
24
+ On the stoveburner 1, you see a pan 2.
25
+ > examine stoveburner 1
26
+ On the stoveburner 1, you see a pan 2.
27
+ > examine stoveburner 1
28
+ On the stoveburner 1, you see a pan 2.
29
+ STATUS: FAIL
30
+ Summary: I continually examined stoveburner 1 instead of heating mug 1 with stoveburner 1. This got me stuck in a loop and caused the failure.
31
+
32
+ You are in the middle of a room. Looking quickly around you, you see a bed 1, a desk 2, a desk 1, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a garbagecan 1, a laundryhamper 1, a safe 1, a shelf 6, a shelf 5, a shelf 4, a shelf 3, a shelf 2, and a shelf 1.
33
+ Your task is to: look at bowl under the desklamp.
34
+ > think: To solve the task, I need to find and take a bowl, then find and use a desklamp.
35
+ OK.
36
+ > think: First I need to find a bowl. A bowl is more likely to appear in drawer (1-6), desk (1-2), shelf (1-6), garbagecan (1), laundryhamper (1). I can check one by one, starting with drawer 1.
37
+ OK.
38
+ > go to drawer 1
39
+ The drawer 1 is closed.
40
+ > open drawer 1
41
+ You open the drawer 1. The drawer 1 is open. In it, you see a cellphone 2.
42
+ > go to drawer 2
43
+ On the drawer 2, you see a keychain 1.
44
+ > go to drawer 3
45
+ Nothing happens.
46
+ > go to drawer 4
47
+ The drawer 4 is closed.
48
+ > open drawer 4
49
+ You open the drawer 4. The drawer 4 is open. In it, you see nothing.
50
+ > go to drawer 5
51
+ On the drawer 5, you see nothing.
52
+ > go to drawer 6
53
+ The drawer 6 is closed.
54
+ > open drawer 6
55
+ You open the drawer 6. The drawer 6 is open. In it, you see a book 2, and a pen 2.
56
+ > go to desk 1
57
+ On the desk 1, you see a alarmclock 1, a desklamp 1, a mug 1, and a pencil 1.
58
+ > think: Now I find a desklamp (1). Next, I need to use it.
59
+ OK.
60
+ > use desklamp 1
61
+ You turn on the desklamp 1.
62
+ >
63
+ Nothing happens.
64
+ > go to desk 2
65
+ On the desk 2, you see a bowl 2, a bowl 1, a cd 2, a cd 1, and a creditcard 3.
66
+ > think: Now I find a bowl (2). Next, I need to take it.
67
+ OK.
68
+ > take bowl 2 from desk 2
69
+ You pick up the bowl 2 from the desk 2.
70
+ >
71
+ Nothing happens.
72
+ >
73
+ Nothing happens.
74
+ STATUS: FAIL
75
+ Summary: I tried to find a bowl and then find and use a desklamp. This is inconsistent with the task, which requires looking at the
76
+ bowl under the desklamp. Thus I failed.
draw_overall_performance.py ADDED
@@ -0,0 +1,59 @@
1
+ import pandas as pd
2
+ import matplotlib.pyplot as plt
3
+
4
+ # Load the CSV data
5
+ data = pd.read_csv("performance_data.csv")
6
+
7
+ # Group games by type
8
+ game_types = {
9
+ "Classic Control": ["Acrobot-v1", "CartPole-v0", "MountainCar-v0"],
10
+ "Box 2D": ["LunarLander-v2"],
11
+ "Toy Text": ["Taxi-v3", "CliffWalking-v0", "Blackjack-v1"]
12
+ }
13
+
14
+ for game_type, games in game_types.items():
15
+ fig, axs = plt.subplots(1, len(games), figsize=(12 * len(games), 6))
16
+ fig.suptitle(f"Performance Plot: {game_type}", fontsize=28, fontname="Times New Roman")
17
+
18
+ if len(games) == 1:
19
+ axs = [axs]
20
+
21
+ handles, labels = [], []
22
+
23
+ for idx, game in enumerate(games):
24
+ # Filter data to get information for the current game (in the loop)
25
+ game_data = data[data["game"] == game]
26
+
27
+ axs[idx].set_title(game, fontsize=20, fontname="Times New Roman")
28
+ axs[idx].set_xlabel("Levels", fontsize=16, fontname="Times New Roman")
29
+ if idx == 0:
30
+ axs[idx].set_ylabel("Scores", fontsize=16, fontname="Times New Roman")
31
+
32
+ for index, row in game_data.iterrows():
33
+ decider_name = row["decider_name"]
34
+ levels = ["l1", "l2", "l3", "l4", "l5"]
35
+ scores = row[levels].values.tolist()
36
+ lines = axs[idx].plot(levels, scores, "-o", label=decider_name)
37
+ # Grab the handle and label for creating a global legend
38
+ handles.append(lines[0])
39
+ labels.append(decider_name)
40
+
41
+ # Eliminate duplicate labels and handles
42
+ unique_labels = []
43
+ unique_handles = []
44
+ for handle, label in zip(handles, labels):
45
+ if label not in unique_labels:
46
+ unique_labels.append(label)
47
+ unique_handles.append(handle)
48
+
49
+ # Add a legend at the bottom middle of the figure
50
+ fig.legend(
51
+ unique_handles,
52
+ unique_labels,
53
+ loc="lower center",
54
+ ncol=4, prop={'size': 18}
55
+ )
56
+
57
+ # Adjust layout to accommodate the legend and prevent cropping
58
+
59
+ plt.savefig("./vis/" + game_type + ".png", dpi=300)
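The comment above announces a layout adjustment, but no call follows; one way to reserve room for the suptitle and the bottom legend (a suggestion, not part of the original script) is:

fig.tight_layout(rect=[0, 0.08, 1, 0.92])   # keep the axes clear of the bottom legend and the suptitle
# or, more coarsely: fig.subplots_adjust(bottom=0.25, top=0.85)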
environment.yml ADDED
@@ -0,0 +1,193 @@
1
+ name: llm-gym
2
+ channels:
3
+ - conda-forge
4
+ - defaults
5
+ dependencies:
6
+ - _libgcc_mutex=0.1=main
7
+ - _openmp_mutex=5.1=1_gnu
8
+ - aiosignal=1.2.0=pyhd3eb1b0_0
9
+ - asttokens=2.0.5=pyhd3eb1b0_0
10
+ - async-timeout=4.0.2=py38h06a4308_0
11
+ - attrs=22.1.0=py38h06a4308_0
12
+ - backcall=0.2.0=pyhd3eb1b0_0
13
+ - blas=1.0=mkl
14
+ - brotlipy=0.7.0=py38h27cfd23_1003
15
+ - ca-certificates=2023.08.22=h06a4308_0
16
+ - cached-property=1.5.2=py_0
17
+ - certifi=2023.7.22=py38h06a4308_0
18
+ - cffi=1.15.1=py38h5eee18b_3
19
+ - chardet=4.0.0=py38h06a4308_1003
20
+ - comm=0.1.2=py38h06a4308_0
21
+ - cryptography=39.0.1=py38h9ce1e76_2
22
+ - cudatoolkit=11.3.1=h2bc3f7f_2
23
+ - debugpy=1.5.1=py38h295c915_0
24
+ - executing=0.8.3=pyhd3eb1b0_0
25
+ - frozenlist=1.3.3=py38h5eee18b_0
26
+ - hdf5=1.10.6=h3ffc7dd_1
27
+ - idna=3.4=py38h06a4308_0
28
+ - importlib_metadata=6.0.0=hd3eb1b0_0
29
+ - intel-openmp=2023.1.0=hdb19cb5_46305
30
+ - ipykernel=6.19.2=py38hb070fc8_0
31
+ - ipython=8.12.0=py38h06a4308_0
32
+ - jedi=0.18.1=py38h06a4308_1
33
+ - jupyter_client=8.1.0=py38h06a4308_0
34
+ - jupyter_core=5.3.0=py38h06a4308_0
35
+ - ld_impl_linux-64=2.38=h1181459_1
36
+ - libffi=3.4.4=h6a678d5_0
37
+ - libgcc-ng=11.2.0=h1234567_1
38
+ - libgfortran-ng=11.2.0=h00389a5_1
39
+ - libgfortran5=11.2.0=h1234567_1
40
+ - libgomp=11.2.0=h1234567_1
41
+ - libllvm14=14.0.6=hdb19cb5_3
42
+ - libprotobuf=3.20.3=he621ea3_0
43
+ - libsodium=1.0.18=h7b6447c_0
44
+ - libstdcxx-ng=11.2.0=h1234567_1
45
+ - loguru=0.7.1=py38h578d9bd_0
46
+ - matplotlib-inline=0.1.6=py38h06a4308_0
47
+ - mkl=2023.1.0=h6d00ec8_46342
48
+ - mkl-service=2.4.0=py38h5eee18b_1
49
+ - mkl_fft=1.3.6=py38h417a72b_1
50
+ - mkl_random=1.2.2=py38h417a72b_1
51
+ - ncurses=6.4=h6a678d5_0
52
+ - nest-asyncio=1.5.6=py38h06a4308_0
53
+ - numpy-base=1.24.3=py38h060ed82_1
54
+ - openssl=3.0.10=h7f8727e_2
55
+ - packaging=23.0=py38h06a4308_0
56
+ - parso=0.8.3=pyhd3eb1b0_0
57
+ - pcre=8.45=h295c915_0
58
+ - pexpect=4.8.0=pyhd3eb1b0_3
59
+ - pickleshare=0.7.5=pyhd3eb1b0_1003
60
+ - pip=23.2.1=py38h06a4308_0
61
+ - platformdirs=2.5.2=py38h06a4308_0
62
+ - prompt-toolkit=3.0.36=py38h06a4308_0
63
+ - psutil=5.9.0=py38h5eee18b_0
64
+ - ptyprocess=0.7.0=pyhd3eb1b0_2
65
+ - pure_eval=0.2.2=pyhd3eb1b0_0
66
+ - pycparser=2.21=pyhd3eb1b0_0
67
+ - pygments=2.15.1=py38h06a4308_1
68
+ - pyopenssl=23.0.0=py38h06a4308_0
69
+ - pysocks=1.7.1=py38h06a4308_0
70
+ - python=3.8.16=h955ad1f_4
71
+ - python-dateutil=2.8.2=pyhd3eb1b0_0
72
+ - python_abi=3.8=2_cp38
73
+ - pyyaml=6.0=py38h0a891b7_4
74
+ - pyzmq=25.1.0=py38h6a678d5_0
75
+ - readline=8.2=h5eee18b_0
76
+ - setuptools=67.8.0=py38h06a4308_0
77
+ - six=1.16.0=pyhd3eb1b0_1
78
+ - sqlite=3.41.2=h5eee18b_0
79
+ - stack_data=0.2.0=pyhd3eb1b0_0
80
+ - tbb=2021.8.0=hdb19cb5_0
81
+ - tk=8.6.12=h1ccaba5_0
82
+ - tornado=6.2=py38h5eee18b_0
83
+ - traitlets=5.7.1=py38h06a4308_0
84
+ - typing_extensions=4.6.3=py38h06a4308_0
85
+ - wcwidth=0.2.5=pyhd3eb1b0_0
86
+ - wheel=0.38.4=py38h06a4308_0
87
+ - xz=5.4.2=h5eee18b_0
88
+ - yaml=0.2.5=h7b6447c_0
89
+ - zeromq=4.3.4=h2531618_0
90
+ - zlib=1.2.13=h5eee18b_0
91
+ - pip:
92
+ - absl-py==1.4.0
93
+ - aiohttp==3.8.4
94
+ - ale-py==0.8.1
95
+ - annotated-types==0.5.0
96
+ - appdirs==1.4.4
97
+ - beautifulsoup4==4.12.2
98
+ - box2d-py==2.3.5
99
+ - cachetools==5.3.1
100
+ - cchardet==2.1.7
101
+ - charset-normalizer==3.1.0
102
+ - click==8.1.3
103
+ - cloudpickle==2.2.1
104
+ - contourpy==1.1.0
105
+ - cycler==0.11.0
106
+ - cython==3.0.1
107
+ - dataclasses-json==0.5.14
108
+ - decorator==4.4.2
109
+ - docker-pycreds==0.4.0
110
+ - fasteners==0.18
111
+ - filelock==3.12.2
112
+ - fonttools==4.40.0
113
+ - fsspec==2023.6.0
114
+ - gitdb==4.0.10
115
+ - gitpython==3.1.31
116
+ - glfw==2.6.2
117
+ - google-auth==2.21.0
118
+ - google-auth-oauthlib==1.0.0
119
+ - greenlet==2.0.2
120
+ - grpcio==1.56.0
121
+ - gym==0.26.2
122
+ - gym-notices==0.0.8
123
+ - h5py==3.9.0
124
+ - huggingface-hub==0.15.1
125
+ - imageio==2.31.2
126
+ - imageio-ffmpeg==0.4.8
127
+ - importlib-metadata==6.6.0
128
+ - importlib-resources==5.12.0
129
+ - iniconfig==2.0.0
130
+ - kiwisolver==1.4.4
131
+ - langchain==0.0.284
132
+ - langsmith==0.0.33
133
+ - llvmlite==0.40.1
134
+ - lz4==4.3.2
135
+ - markdown==3.4.3
136
+ - markupsafe==2.1.1
137
+ - marshmallow==3.20.1
138
+ - matplotlib==3.7.1
139
+ - moviepy==1.0.3
140
+ - mujoco==2.2.0
141
+ - mujoco-py==2.1.2.14
142
+ - multidict==6.0.4
143
+ - numba==0.57.1
144
+ - numexpr==2.8.5
145
+ - numpy==1.24.4
146
+ - oauthlib==3.2.2
147
+ - openai==0.27.8
148
+ - opencv-python==4.8.0.76
149
+ - pathtools==0.1.2
150
+ - pillow==9.5.0
151
+ - pluggy==1.2.0
152
+ - proglog==0.1.10
153
+ - protobuf==3.19.6
154
+ - py==1.11.0
155
+ - pyasn1==0.5.0
156
+ - pyasn1-modules==0.3.0
157
+ - pydantic==2.3.0
158
+ - pydantic-core==2.6.3
159
+ - pygame==2.1.0
160
+ - pyopengl==3.1.7
161
+ - pyparsing==3.0.9
162
+ - pytest==7.0.1
163
+ - regex==2023.6.3
164
+ - requests==2.31.0
165
+ - requests-oauthlib==1.3.1
166
+ - rsa==4.9
167
+ - safetensors==0.3.1
168
+ - sentry-sdk==1.26.0
169
+ - setproctitle==1.3.2
170
+ - smmap==5.0.0
171
+ - soupsieve==2.4.1
172
+ - sqlalchemy==2.0.20
173
+ - swig==4.1.1
174
+ - tenacity==8.2.3
175
+ - tensorboard==2.14.0
176
+ - tensorboard-data-server==0.7.1
177
+ - tianshou==0.4.10
178
+ - tokenizers==0.13.3
179
+ # - torch==1.12.0+cu113
180
+ # - torchaudio==0.12.0+cu113
181
+ # - torchvision==0.13.0+cu113
182
+ - tqdm==4.65.0
183
+ - transformers==4.30.2
184
+ - typing==3.7.4.3
185
+ - typing-extensions==4.7.1
186
+ - typing-inspect==0.9.0
187
+ - urllib3
188
+ - v==1
189
+ - wandb==0.15.4
190
+ - werkzeug==2.3.6
191
+ - yarl==1.9.2
192
+ - zipp==3.15.0
193
+ - aquarel==0.0.5
envs/__init__.py ADDED
@@ -0,0 +1,51 @@
1
+ from .base_env import BaseEnv, SettableStateEnv
2
+ from .classic_control import cartpole_translator, cartpole_policies
3
+ from .classic_control import acrobot_translator, acrobot_policies
4
+ from .classic_control import mountaincar_translator, mountaincar_policies
5
+ from .classic_control import mountaincarContinuous_translator,mountaincarContinuous_policies
6
+
7
+ from .box2d import LunarLander_translator, LunarLander_policies
8
+
9
+ from .toy_text import blackjack_translator, blackjack_policies
10
+ from .toy_text import taxi_translator, taxi_policies
11
+ from .toy_text import cliffwalking_translator, cliffwalking_policies
12
+ from .toy_text import frozenlake_translator, frozenlake_policies
13
+
14
+ REGISTRY = {}
15
+ REGISTRY["sampling_wrapper"] = SettableStateEnv
16
+ REGISTRY["base_env"] = BaseEnv
17
+ REGISTRY["cart_init_translator"] = cartpole_translator.GameDescriber
18
+ REGISTRY["cart_basic_translator"] = cartpole_translator.BasicStateSequenceTranslator
19
+ REGISTRY["acrobot_init_translator"] = acrobot_translator.GameDescriber
20
+ REGISTRY["acrobot_basic_translator"] = acrobot_translator.BasicStateSequenceTranslator
21
+ REGISTRY["mountaincar_init_translator"] = mountaincar_translator.GameDescriber
22
+ REGISTRY["mountaincar_basic_translator"] = mountaincar_translator.BasicStateSequenceTranslator
23
+
24
+ REGISTRY["cart_policies"] = [cartpole_policies.dedicated_1_policy, cartpole_policies.dedicated_2_policy, cartpole_policies.pseudo_random_policy, cartpole_policies.real_random_policy]
25
+ REGISTRY["acrobot_policies"] = [acrobot_policies.dedicated_1_policy, acrobot_policies.dedicated_2_policy, acrobot_policies.dedicated_3_policy, acrobot_policies.pseudo_random_policy, acrobot_policies.real_random_policy]
26
+ REGISTRY["mountaincar_policies"] = [mountaincar_policies.dedicated_1_policy, mountaincar_policies.dedicated_2_policy, mountaincar_policies.dedicated_3_policy, mountaincar_policies.pseudo_random_policy, mountaincar_policies.real_random_policy]
27
+
28
+ REGISTRY["lunarLander_init_translator"] = LunarLander_translator.GameDescriber
29
+ REGISTRY["lunarLander_basic_translator"] = LunarLander_translator.BasicStateSequenceTranslator
30
+ REGISTRY["lunarLander_policies"] = [LunarLander_policies.dedicated_1_policy, LunarLander_policies.dedicated_2_policy, LunarLander_policies.dedicated_3_policy,LunarLander_policies.dedicated_4_policy, LunarLander_policies.pseudo_random_policy, LunarLander_policies.real_random_policy]
31
+
32
+ REGISTRY["blackjack_init_translator"] = blackjack_translator.GameDescriber
33
+ REGISTRY["blackjack_basic_translator"] = blackjack_translator.BasicStateSequenceTranslator
34
+ REGISTRY["blackjack_policies"] = [blackjack_policies.dedicated_1_policy, blackjack_policies.dedicated_2_policy, blackjack_policies.pseudo_random_policy, blackjack_policies.real_random_policy]
35
+
36
+ REGISTRY["taxi_init_translator"] = taxi_translator.GameDescriber
37
+ REGISTRY["taxi_basic_translator"] = taxi_translator.BasicStateSequenceTranslator
38
+ REGISTRY["taxi_policies"] = [taxi_policies.dedicated_1_policy, taxi_policies.dedicated_2_policy, taxi_policies.dedicated_3_policy, taxi_policies.dedicated_4_policy, taxi_policies.dedicated_5_policy, taxi_policies.dedicated_6_policy, taxi_policies.pseudo_random_policy, taxi_policies.real_random_policy]
39
+
40
+ REGISTRY["cliffwalking_init_translator"] = cliffwalking_translator.GameDescriber
41
+ REGISTRY["cliffwalking_basic_translator"] = cliffwalking_translator.BasicStateSequenceTranslator
42
+ REGISTRY["cliffwalking_policies"] = [cliffwalking_policies.dedicated_1_policy, cliffwalking_policies.dedicated_2_policy, cliffwalking_policies.dedicated_3_policy, cliffwalking_policies.dedicated_4_policy, cliffwalking_policies.pseudo_random_policy, cliffwalking_policies.real_random_policy]
43
+
44
+ REGISTRY["frozenlake_init_translator"] = frozenlake_translator.GameDescriber
45
+ REGISTRY["frozenlake_basic_translator"] = frozenlake_translator.BasicStateSequenceTranslator
46
+ REGISTRY["frozenlake_policies"] = [frozenlake_policies.dedicated_1_policy, frozenlake_policies.dedicated_2_policy, frozenlake_policies.dedicated_3_policy, frozenlake_policies.dedicated_4_policy, frozenlake_policies.pseudo_random_policy, frozenlake_policies.real_random_policy]
47
+
48
+
49
+ REGISTRY["mountaincarContinuous_init_translator"] = mountaincarContinuous_translator.GameDescriber
50
+ REGISTRY["mountaincarContinuous_basic_translator"] = mountaincarContinuous_translator.BasicStateSequenceTranslator
51
+ REGISTRY["mountaincarContinuous_policies"] = [mountaincarContinuous_policies.pseudo_random_policy, mountaincarContinuous_policies.real_random_policy]
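A minimal sketch of pulling pieces out of the registry (the args fields are inferred from the translator constructors added in this commit; combining the init/basic translators with REGISTRY["base_env"] happens in the experiment entry point, which is not shown here):

from argparse import Namespace
from envs import REGISTRY

args = Namespace(is_only_local_obs=1, max_episode_len=200)   # assumed run configuration
describer = REGISTRY["cart_init_translator"](args)
print(describer.describe_game())
print(describer.describe_action())
random_policy = REGISTRY["cart_policies"][-1]                # real_random_policy
print(random_policy(None, pre_action=1))                     # prints 1 or 2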
envs/base_env.py ADDED
@@ -0,0 +1,97 @@
1
+ # This file contains wrapper classes for interacting with gym environments through text translators
2
+
3
+ import gym
4
+
5
+ class SettableStateEnv(gym.Wrapper):
6
+ def __init__(self, env):
7
+ super().__init__(env)
8
+ self.env = env
9
+
10
+ def set_state(self, state):
11
+ self.env.state = state
12
+ self.env.steps_beyond_terminated = None
13
+
14
+ class BaseEnv(gym.Wrapper):
15
+ def __init__(self, env, translator):
16
+ super().__init__(env)
17
+ self.translator = translator
18
+ self.env_name = super().spec.id
19
+ self.transition_data = {}
20
+ self.game_description = self.get_game_description()
21
+ self.goal_description = self.get_goal_description()
22
+ self.action_description = self.get_action_description()
23
+ self.action_desc_dict = self.get_action_desc_dict()
24
+ self.reward_desc_dict = self.get_reward_desc_dict()
25
+
26
+ def reset(self, **kwargs):
27
+ state, _ = super().reset(**kwargs)
28
+ self.transition_data['state'] = state
29
+ self.translator.obtain(self.transition_data)
30
+ summary, future_summary = self.translator.translate()
31
+ info = {
32
+ 'future_summary': future_summary
33
+ }
34
+ self.state = state
35
+ return summary, info
36
+
37
+ def step(self, action):
38
+ potential_next_state = self.get_potential_next_state(action)
39
+ state, reward, terminated, _, info = super().step(action)
40
+ self.transition_data['action'] = action
41
+ self.transition_data['next_state'] = state
42
+ self.transition_data['reward'] = reward
43
+ self.transition_data['terminated'] = terminated
44
+ self.translator.update(self.transition_data)
45
+ self.transition_data = {}
46
+ self.transition_data['state'] = state
47
+ self.translator.obtain(self.transition_data)
48
+ summary, future_summary = self.translator.translate()
49
+ info = {
50
+ 'future_summary': future_summary,
51
+ 'potential_state': potential_next_state
52
+ }
53
+ return summary, reward, terminated, _, info
54
+
55
+
56
+ def step_llm(self, action):
57
+ potential_next_state = self.get_potential_next_state(action)
58
+ if "Continuous" in self.env_name:
59
+ state, reward, terminated, _, info = super().step(action)
60
+ else:
61
+ state, reward, terminated, _, info = super().step(action-1)
62
+ self.transition_data['action'] = action
63
+ self.transition_data['next_state'] = state
64
+ self.transition_data['reward'] = reward
65
+ self.transition_data['terminated'] = terminated
66
+ self.translator.update(self.transition_data)
67
+ self.transition_data = {}
68
+ self.transition_data['state'] = state
69
+ self.translator.obtain(self.transition_data)
70
+ self.state = state
71
+ summary, future_summary = self.translator.translate()
72
+ info = {
73
+ 'future_summary': future_summary,
74
+ 'potential_state': potential_next_state,
75
+ }
76
+ return summary, reward, terminated, _, info
77
+
78
+ def get_terminate_state(self, episode_len, max_episode_len):
79
+ return self.translator.translate_terminate_state(self.state, episode_len, max_episode_len)
80
+
81
+ def get_game_description(self,):
82
+ return self.translator.describe_game()
83
+
84
+ def get_goal_description(self,):
85
+ return self.translator.describe_goal()
86
+
87
+ def get_action_description(self,):
88
+ return self.translator.describe_action()
89
+
90
+ def get_action_desc_dict(self,):
91
+ return self.translator.get_action_desc_dict()
92
+
93
+ def get_reward_desc_dict(self,):
94
+ return self.translator.get_reward_desc_dict()
95
+
96
+ def get_potential_next_state(self, action):
97
+ return self.translator.translate_potential_next_state(self.state, action)
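A note on the action convention visible in step_llm above: LLM-facing actions are numbered from 1 (matching the policy functions added in this commit), so for discrete environments step_llm shifts the action down by one before calling the wrapped gym env. A tiny sketch, assuming env is a BaseEnv around a discrete-action task:

summary, info = env.reset()                                     # summary is the translated state text
summary, reward, terminated, truncated, info = env.step_llm(2)  # LLM action '2' becomes gym action 1
print(summary)
print(info["future_summary"], info["potential_state"])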
envs/box2d/LunarLander_policies.py ADDED
@@ -0,0 +1,36 @@
1
+ import numpy as np
2
+ def dedicated_1_policy(state, pre_action=1):
3
+ def get_description():
4
+ return "Always select action 1, which does nothing"
5
+ dedicated_1_policy.description = get_description()
6
+ return 1
7
+
8
+ def dedicated_2_policy(state, pre_action=1):
9
+ def get_description():
10
+ return "Always select action 2, which fires the left engine"
11
+ dedicated_2_policy.description = get_description()
12
+ return 2
13
+
14
+ def dedicated_3_policy(state, pre_action=1):
15
+ def get_description():
16
+ return "Always select action 3, which fires the main engine"
17
+ dedicated_3_policy.description = get_description()
18
+ return 3
19
+
20
+ def dedicated_4_policy(state, pre_action=1):
21
+ def get_description():
22
+ return "Always select action 4, which fires the right engine"
23
+ dedicated_4_policy.description = get_description()
24
+ return 4
25
+
26
+ def pseudo_random_policy(state, pre_action):
27
+ def get_description():
28
+ return "Select actions 1, 2, 3, and 4 in turn, which do nothing, fire the left engine, fire the main engine, and fire the right engine, respectively"
29
+ pseudo_random_policy.description = get_description()
30
+ return pre_action%4+1
31
+
32
+ def real_random_policy(state,pre_action=0):
33
+ def get_description():
34
+ return "Select action with a random policy"
35
+ real_random_policy.description = get_description()
36
+ return np.random.choice([1, 2, 3, 4])
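A small illustration of the description-attribute pattern shared by the policy functions above (the state argument is ignored by these scripted policies, so None is passed purely for illustration):

action = pseudo_random_policy(None, pre_action=1)   # cycles 1 -> 2 -> 3 -> 4 -> 1 ...
print(action)                                       # 2
print(pseudo_random_policy.description)             # set as a side effect of the call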
envs/box2d/LunarLander_translator.py ADDED
@@ -0,0 +1,67 @@
1
+ # [Translator classes and functions for Lunar Lander environment]
2
+
3
+ class BasicLevelTranslator:
4
+ def __init__(self,):
5
+ pass
6
+
7
+ def translate(self, state):
8
+ x, y, x_dot, y_dot, angle, angular_velocity, left_leg_contact, right_leg_contact = state
9
+ left_contact_info = "in contact" if left_leg_contact else "not in contact"
10
+ right_contact_info = "in contact" if right_leg_contact else "not in contact"
11
+ return f"The lander is at position ({x:.2f}, {y:.2f}), the horizontal speed of movement is {x_dot:.2f}, " \
12
+ f"the vertical speed of movement is {y_dot:.2f}. The angle is {angle:.2f} radians, and it's rotating at {angular_velocity:.2f} radians per second. The left leg is {left_contact_info} with the ground. The right leg is {right_contact_info} with the ground."
13
+
14
+ class GameDescriber:
15
+ def __init__(self, args):
16
+ self.is_only_local_obs = args.is_only_local_obs == 1
17
+ self.max_episode_len = args.max_episode_len
18
+ self.action_desc_dict = {
19
+ }
20
+ self.reward_desc_dict = {
21
+ }
22
+
23
+ def describe_goal(self):
24
+ return "The goal is to successfully land the lander on the landing pad, which is at position (0, 0), while avoiding a crash."
25
+
26
+ def translate_terminate_state(self, state, episode_len, max_episode_len):
27
+ return ""
28
+
29
+ def translate_potential_next_state(self, state, action):
30
+ return ""
31
+
32
+ def describe_game(self):
33
+ return "In the Lunar Lander game, you control a lander that is descending towards " \
34
+ "the landing pad. The goal is to successfully land the lander on the landing pad " \
35
+ "while avoiding a crash. Please note that the lander is affected by gravity, and the lander starts at the " \
36
+ "top center of the viewport with a random initial force applied to its center of mass. " \
37
+ "Be careful to balance the engine to slow down your descent " \
38
+ "and land gently. If you land too quickly or crash into the landing pad, the game will " \
39
+ "end, and you will be punished."
40
+
41
+ def describe_action(self):
42
+ return "Your Next Move: \n Please choose an action. Type '1' to do nothing, '2' to fire the left engine and push the lander to the right, '3' to fire the main engine and push the lander up, " \
43
+ "or '4' to fire the right engine and push the lander to the left. Ensure you only provide the action number from the valid action list, i.e., [1, 2, 3, 4]."
44
+
45
+
46
+ class BasicStateSequenceTranslator(BasicLevelTranslator):
47
+ def translate(self, infos, is_current=False):
48
+ descriptions = []
49
+ if is_current:
50
+ state_desc = BasicLevelTranslator().translate(infos[-1]['state'])
51
+ return state_desc
52
+ for i, info in enumerate(infos):
53
+ assert 'state' in info, "info should contain state information"
54
+
55
+ state_desc = BasicLevelTranslator().translate(info['state'])
56
+ if info['action'] == 1:
57
+ action_desc = f"Take Action: 'Do Nothing'"
58
+ elif info['action'] == 2:
59
+ action_desc = f"Take Action: 'Fire left engine'"
60
+ elif info['action'] == 3:
61
+ action_desc = f"Take Action: 'Fire main engine'"
62
+ else:
63
+ action_desc = f"Take Action: 'Fire right engine'"
64
+ reward_desc = f"Result: Reward of {info['reward']}, "
65
+ next_state_desc = BasicLevelTranslator().translate(info['next_state'])
66
+ descriptions.append(f"{state_desc}.\n {action_desc} \n {reward_desc} \n Transit to {next_state_desc}")
67
+ return descriptions
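A quick illustration of BasicLevelTranslator on a hand-made eight-dimensional state (the numbers are invented; the field order follows the unpacking in translate above):

state = (0.10, 1.20, -0.05, -0.60, 0.02, 0.10, 0, 0)
print(BasicLevelTranslator().translate(state))
# -> "The lander is at position (0.10, 1.20), the horizontal speed of movement is -0.05, ..."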
envs/box2d/__init__.py ADDED
File without changes
envs/box2d/few_shot_examples/lunarlander_l2.json ADDED
The diff for this file is too large to render. See raw diff
 
envs/box2d/few_shot_examples/lunarlander_l4.json ADDED
The diff for this file is too large to render. See raw diff
 
envs/classic_control/__init__.py ADDED
File without changes
envs/classic_control/acrobot_policies.py ADDED
@@ -0,0 +1,36 @@
1
+ import numpy as np
2
+
3
+ # https://colab.research.google.com/drive/1DdWsGi10232orUv-reY4wsTmT0VMoHaX?usp=sharing#scrollTo=4OfVmDKk7XvG
4
+ # LLMs bias on 0 so make the actions 1, 2 and 3 instead.
5
+
6
+ def dedicated_1_policy(state, pre_action=1):
7
+ def get_description():
8
+ return "Always select action 1"
9
+ dedicated_1_policy.description = get_description()
10
+ return 1
11
+
12
+ def dedicated_2_policy(state, pre_action=1):
13
+ def get_description():
14
+ return "Always select action 2"
15
+ dedicated_2_policy.description = get_description()
16
+ return 2
17
+
18
+ def dedicated_3_policy(state, pre_action=1):
19
+ def get_description():
20
+ return "Always select action 3"
21
+ dedicated_3_policy.description = get_description()
22
+ return 3
23
+
24
+ def pseudo_random_policy(state, pre_action):
25
+ def get_description():
26
+ return "Select action 1, 2, and 3 alternatively"
27
+ pseudo_random_policy.description = get_description()
28
+ return pre_action % 3 + 1
29
+
30
+
31
+ def real_random_policy(state, pre_action=1):
32
+ def get_description():
33
+ return "Select action with a random policy"
34
+ real_random_policy.description = get_description()
35
+ return np.random.choice([1, 2, 3])
36
+
envs/classic_control/acrobot_translator.py ADDED
@@ -0,0 +1,58 @@
1
+ import math
2
+
3
+ class BasicLevelTranslator:
4
+ def __init__(self):
5
+ pass
6
+
7
+ def translate(self, state):
8
+ cos_theta1, sin_theta1, cos_theta2, sin_theta2, theta1_dot, theta2_dot = state
9
+ theta1_direction = "clockwise" if theta1_dot > 0 else "counterclockwise"
10
+ theta2_direction = "clockwise" if theta2_dot > 0 else "counterclockwise"
11
+ theta1 = math.atan(sin_theta1 / (cos_theta1+1e-6))
12
+ theta2 = math.atan(sin_theta2 / (cos_theta2+1e-6))
13
+ res = (f"Link1: angle theta1 {theta1:.2f} radians, rotating {abs(theta1_dot):.2f} radians per second {theta1_direction}. "
14
+ f"Link2: angle theta2 {theta2:.2f} radians relative to Link1, rotating {abs(theta2_dot):.2f} radians per second {theta2_direction}.")
15
+ return res
16
+
17
+ class GameDescriber:
18
+ def __init__(self, args):
19
+ self.is_only_local_obs = args.is_only_local_obs == 1
20
+ self.max_episode_len = args.max_episode_len
21
+ self.action_desc_dict = {
22
+ }
23
+ self.reward_desc_dict = {
24
+ }
25
+
26
+ def describe_goal(self):
27
+ return "The goal is to apply torque on the actuator to swing the free end of the linear chain above the target height, which is constructed as: -cos(theta1) - cos(theta2 + theta1) > 1.0."
28
+
29
+ def translate_terminate_state(self, state, episode_len, max_episode_len):
30
+ return ""
31
+
32
+ def translate_potential_next_state(self, state, action):
33
+ return ""
34
+
35
+ def describe_game(self):
36
+ return ('''In the Acrobot game, there are two links connected by two joints. The first link is connected to a base, and your goal is to swing the free end of the second link above the target height by applying torques on the actuated joint. The task ends if one of the following occurs: 1. The free end reaches the target height, which is constructed as: -cos(theta1) - cos(theta2 + theta1) > 1.0; or 2. The number of decision steps exceeds 200.''')
37
+
38
+ # https://colab.research.google.com/drive/1DdWsGi10232orUv-reY4wsTmT0VMoHaX?usp=sharing#scrollTo=4OfVmDKk7XvG
39
+ # LLMs bias on 0 so make the actions 1, 2 and 3 instead.
40
+ def describe_action(self):
41
+ return ("Your Next Move: \n Please choose an action. Type '1' to apply -1 torque, '2' to apply 0 torque, or '3' to apply 1 torque. "
42
+ "Ensure you provide the action number from the valid action list, i.e., [1, 2, 3].")
43
+
44
+ class BasicStateSequenceTranslator(BasicLevelTranslator):
45
+ def translate(self, infos, is_current=False):
46
+ descriptions = []
47
+ if is_current:
48
+ state_desc = BasicLevelTranslator().translate(infos[-1]['state'])
49
+ return state_desc
50
+ for i, info in enumerate(infos):
51
+ assert 'state' in info, "info should contain state information"
52
+
53
+ state_desc = BasicLevelTranslator().translate(info['state'])
54
+ action_desc = f"Take Action: Apply {info['action'] - 2} torque on the actuated joint."
55
+ reward_desc = f"Result: Reward of {info['reward']}."
56
+ next_state_desc = BasicLevelTranslator().translate(info['next_state'])
57
+ descriptions.append(f"{state_desc}.\n {action_desc} \n {reward_desc} \n Transit to {next_state_desc}")
58
+ return descriptions
envs/classic_control/cartpole_policies.py ADDED
@@ -0,0 +1,25 @@
1
+ import numpy as np
2
+ def dedicated_1_policy(state, pre_action=1):
3
+ def get_description():
4
+ return "Always select action 1"
5
+ dedicated_1_policy.description = get_description()
6
+ return 1
7
+
8
+ def dedicated_2_policy(state, pre_action=1):
9
+ def get_description():
10
+ return "Always select action 2"
11
+ dedicated_2_policy.description = get_description()
12
+ return 2
13
+
14
+ def pseudo_random_policy(state, pre_action):
15
+ def get_description():
16
+ return "Select action 1 and 2 alternatively"
17
+ pseudo_random_policy.description = get_description()
18
+ return pre_action%2 +1
19
+
20
+ def real_random_policy(state,pre_action=1):
21
+ def get_description():
22
+ return "Select action with a random policy"
23
+ real_random_policy.description = get_description()
24
+ return np.random.choice([1, 2])
25
+
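A short rollout sketch using the 1-based policies above (gym's CartPole expects actions 0/1, hence the "- 1"; the env id and horizon are placeholders):

import gym

env = gym.make("CartPole-v0")
state, _ = env.reset()
pre_action = 1
for _ in range(20):
    pre_action = pseudo_random_policy(state, pre_action)               # alternates between 1 and 2
    state, reward, terminated, truncated, _ = env.step(pre_action - 1)
    if terminated or truncated:
        break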
envs/classic_control/cartpole_translator.py ADDED
@@ -0,0 +1,57 @@
1
+
2
+ class BasicLevelTranslator:
3
+ def __init__(self,):
4
+ pass
5
+
6
+ def translate(self, state):
7
+ cart_position, cart_velocity, pole_angle, pole_angular_velocity = state
8
+ cart_direction = "right" if cart_velocity > 0 else "left"
9
+ pole_direction = "right" if pole_angular_velocity > 0 else "left"
10
+ res = (f"The cart is positioned at {cart_position:.3f}, with a velocity of {abs(cart_velocity):.2f} towards the {cart_direction}. "
11
+ f"The pole is tilted at {abs(pole_angle):.2f} radians, rotating at {abs(pole_angular_velocity):.2f} radians per second towards the {pole_direction}.")
12
+ return res
13
+
14
+ class GameDescriber:
15
+ def __init__(self, args):
16
+ self.is_only_local_obs = args.is_only_local_obs == 1
17
+ self.max_episode_len = args.max_episode_len
18
+ self.action_desc_dict = {
19
+ }
20
+ self.reward_desc_dict = {
21
+ }
22
+
23
+ def describe_goal(self):
24
+ return "The goal is to keep the pole balanced upright for as long as possible."
25
+
26
+ def translate_terminate_state(self, state, episode_len, max_episode_len):
27
+ return ""
28
+
29
+ def translate_potential_next_state(self, state, action):
30
+ return ""
31
+
32
+ def describe_game(self):
33
+ return "In the CartPole game, you control a cart that moves along a horizontal track. There is a pole " \
34
+ "standing upright on the cart. The goal of the game is to keep the pole balanced upright by moving the " \
35
+ "cart left or right. The game ends if the pole tilts too far from the vertical position or if the cart " \
36
+ "moves too far from the center of the track. The longer you can keep the pole balanced, the higher " \
37
+ "your score. Note that when the Cart Position is out of the (-2.4, 2.4) zone or the Pole Angle is out " \
38
+ "of the zone (-.2095, .2095), the round ends and the game is lost. "
39
+
40
+ def describe_action(self):
41
+ return "Your Next Move: \n Please choose an action. Type '1' to push the cart to the left or '2' to push the cart to the right. Ensure you only provide the action number from the valid action list, i.e., [1, 2]."
42
+
43
+ class BasicStateSequenceTranslator(BasicLevelTranslator):
44
+ def translate(self, infos, is_current=False):
45
+ descriptions = []
46
+ if is_current:
47
+ state_desc = BasicLevelTranslator().translate(infos[-1]['state'])
48
+ return state_desc
49
+ for i, info in enumerate(infos):
50
+ assert 'state' in info, "info should contain state information"
51
+
52
+ state_desc = BasicLevelTranslator().translate(info['state'])
53
+ action_desc = f"Take Action: Push {'right' if info['action'] == 2 else 'left'} ({info['action']})."
54
+ reward_desc = f"Result: Reward of {info['reward']}, "
55
+ next_state_desc = BasicLevelTranslator().translate(info['next_state'])
56
+ descriptions.append(f"{state_desc}.\n {action_desc} \n {reward_desc} \n Transit to {next_state_desc}")
57
+ return descriptions
envs/classic_control/few_shot_examples/acrobot_l2.json ADDED
The diff for this file is too large to render. See raw diff
 
envs/classic_control/few_shot_examples/acrobot_l4.json ADDED
The diff for this file is too large to render. See raw diff
 
envs/classic_control/few_shot_examples/cartpole_l2.json ADDED
The diff for this file is too large to render. See raw diff
 
envs/classic_control/few_shot_examples/cartpole_l4.json ADDED
The diff for this file is too large to render. See raw diff
 
envs/classic_control/few_shot_examples/mountaincarContinuous_l2.json ADDED
The diff for this file is too large to render. See raw diff
 
envs/classic_control/few_shot_examples/mountaincarContinuous_l4.json ADDED
The diff for this file is too large to render. See raw diff
 
envs/classic_control/few_shot_examples/mountaincar_l2.json ADDED
The diff for this file is too large to render. See raw diff