|
import streamlit as st |
|
|
|
st.set_page_config(page_title="Multi Agent Systems", page_icon=":robot_face:", layout="wide") |
|
|
|
hide_streamlit_style = """ |
|
<style> |
|
#MainMenu {visibility: hidden;} |
|
footer {visibility: hidden;} |
|
</style> |
|
""" |
|
st.markdown(hide_streamlit_style, unsafe_allow_html=True) |
|
|
|
col1, col2 = st.columns(2) |
|
|
|
with col1: |
|
st.markdown("## **Autonomous agents interacting** :robot_face: :robot_face:**") |
|
st.markdown("### **Key Aspects** :bulb:") |
|
st.markdown(""" |
|
1. **Interaction Protocol** 🤝 \n |
|
- Define rules for communication and cooperation \n |
|
2. **Decentralized Decision Making** 🎯 \n |
|
- Autonomous agents make independent decisions \n |
|
3. **Collaboration and Competition** 🤼 \n |
|
- Agents work together or against each other \n |
|
""") |
|
|
|
with col2: |
|
st.markdown("### **Entities** :guards:") |
|
st.markdown(""" |
|
1. **Autonomous Agents** 🤖 \n |
|
- Independent entities with decision-making capabilities \n |
|
2. **Environment** 🌐 \n |
|
- Shared space where agents interact \n |
|
3. **Ruleset** 📜 \n |
|
- Defines interaction protocol and decision-making processes \n |
|
""") |
|
|
|
st.markdown("---") |
|
|
|
st.markdown("## **Interaction Protocol** 🤝 :bulb:**") |
|
st.markdown("### **Key Elements** :guards:") |
|
st.markdown(""" |
|
1. **Communication** 🗣 \n |
|
- Agents exchange information \n |
|
2. **Cooperation** 🤝 \n |
|
-# 🩺🔍 Search Results |
|
### 04 Dec 2023 | [AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents](https://arxiv.org/abs/2311.17465) | [⬇️](https://arxiv.org/pdf/2311.17465) |
|
*Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang* |
|
|
|
In this study, our goal is to create interactive avatar agents that can |
|
autonomously plan and animate nuanced facial movements realistically, from both |
|
visual and behavioral perspectives. Given high-level inputs about the |
|
environment and agent profile, our framework harnesses LLMs to produce a series |
|
of detailed text descriptions of the avatar agents' facial motions. These |
|
descriptions are then processed by our task-agnostic driving engine into motion |
|
token sequences, which are subsequently converted into continuous motion |
|
embeddings that are further consumed by our standalone neural-based renderer to |
|
generate the final photorealistic avatar animations. These streamlined |
|
processes allow our framework to adapt to a variety of non-verbal avatar |
|
interactions, both monadic and dyadic. Our extensive study, which includes |
|
experiments on both newly compiled and existing datasets featuring two types of |
|
agents -- one capable of monadic interaction with the environment, and the |
|
other designed for dyadic conversation -- validates the effectiveness and |
|
versatility of our approach. To our knowledge, we advanced a leap step by |
|
combining LLMs and neural rendering for generalized non-verbal prediction and |
|
photo-realistic rendering of avatar agents. |
|
|
|
--------------- |
|
|
|
### 06 Jul 2023 | [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) | [⬇️](https://arxiv.org/pdf/2305.02677) |
|
*Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao* |
|
|
|
Controllable image captioning is an emerging multimodal topic that aims to |
|
describe the image with natural language following human purpose, |
|
$\textit{e.g.}$, looking at the specified regions or telling in a particular |
|
text style. State-of-the-art methods are trained on annotated pairs of input |
|
controls and output captions. However, the scarcity of such well-annotated |
|
multimodal data largely limits their usability and scalability for interactive |
|
AI systems. Leveraging unimodal instruction-following foundation models is a |
|
promising alternative that benefits from broader sources of data. In this |
|
paper, we present Caption AnyThing (CAT), a foundation model augmented image |
|
captioning framework supporting a wide range of multimodel controls: 1) visual |
|
controls, including points, boxes, and trajectories; 2) language controls, such |
|
as sentiment, length, language, and factuality. Powered by Segment Anything |
|
Model (SAM) and ChatGPT, we unify the visual and language prompts into a |
|
modularized framework, enabling the flexible combination between different |
|
controls. Extensive case studies demonstrate the user intention alignment |
|
capabilities of our framework, shedding light on effective user interaction |
|
modeling in vision-language applications. Our code is publicly available at |
|
https://github.com/ttengwang/Caption-Anything. |
|
|
|
--------------- |
|
|
|
### 13 Jul 2023 | [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) | [⬇️](https://arxiv.org/pdf/2306.14824) |
|
*Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei* |
|
|
|
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new |
|
capabilities of perceiving object descriptions (e.g., bounding boxes) and |
|
grounding text to the visual world. Specifically, we represent refer |
|
expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where |
|
object descriptions are sequences of location tokens. Together with multimodal |
|
corpora, we construct large-scale data of grounded image-text pairs (called |
|
GrIT) to train the model. In addition to the existing capabilities of MLLMs |
|
(e.g., perceiving general modalities, following instructions, and performing |
|
in-context learning), Kosmos-2 integrates the grounding capability into |
|
downstream applications. We evaluate Kosmos-2 on a wide range of tasks, |
|
including (i) multimodal grounding, such as referring expression comprehension, |
|
and phrase grounding, (ii) multimodal referring, such as referring expression |
|
generation, (iii) perception-language tasks, and (iv) language understanding |
|
and generation. This work lays out the foundation for the development of |
|
Embodiment AI and sheds light on the big convergence of language, multimodal |
|
perception, action, and world modeling, which is a key step toward artificial |
|
general intelligence. Code and pretrained models are available at |
|
https://aka.ms/kosmos-2. |
|
|
|
--------------- |
|
|
|
### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615) |
|
*Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Crbune, Jason Lin, Jindong Chen, Abhanshu Sharma* |
|
|
|
Screen user interfaces (UIs) and infographics, sharing similar visual |
|
language and design principles, play important roles in human communication and |
|
human-machine interaction. We introduce ScreenAI, a vision-language model that |
|
specializes in UI and infographics understanding. Our model improves upon the |
|
PaLI architecture with the flexible patching strategy of pix2struct and is |
|
trained on a unique mixture of datasets. At the heart of this mixture is a |
|
novel screen annotation task in which the model has to identify the type and |
|
location of UI elements. We use these text annotations to describe screens to |
|
Large Language Models and automatically generate question-answering (QA), UI |
|
navigation, and summarization training datasets at scale. We run ablation |
|
studies to demonstrate the impact of these design choices. At only 5B |
|
parameters, ScreenAI achieves new state-of-the-artresults on UI- and |
|
infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget |
|
Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and |
|
InfographicVQA) compared to models of similar size. Finally, we release three |
|
new datasets: one focused on the screen annotation task and two others focused |
|
on question answering. |
|
|
|
--------------- |
|
|
|
### 23 Mar 2022 | [ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues](https://arxiv.org/abs/2203.12751) | [⬇️](https://arxiv.org/pdf/2203.12751) |
|
*Monica S. Lam, Giovanni Campagna, Mehrad Moradshahi, Sina J. Semnani, Silei Xu* |
|
|
|
Task-oriented conversational agents rely on semantic parsers to translate |
|
natural language to formal representations. In this paper, we propose the |
|
design and rationale of the ThingTalk formal representation, and how the design |
|
improves the development of transactional task-oriented agents. |
|
ThingTalk is built on four core principles: (1) representing user requests |
|
directly as executable statements, covering all the functionality of the agent, |
|
(2) representing dialogues formally and succinctly to support accurate |
|
contextual semantic parsing, (3) standardizing types and interfaces to maximize |
|
reuse between agents, and (4) allowing multiple, independently-developed agents |
|
to be composed in a single virtual assistant. ThingTalk is developed as part of |
|
the Genie Framework that allows developers to quickly build transactional |
|
agents given a database and APIs. |
|
We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST. |
|
Compared to the others, the ThingTalk design is both more general and more |
|
cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and |
|
associated tools yields a new state of the art accuracy of 79% turn-by-turn. |
|
|
|
--------------- |
|
|
|
### 19 Oct 2023 | [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/abs/2310.12945) | [⬇️](https://arxiv.org/pdf/2310.12945) |
|
*Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould* |
|
|
|
In the pursuit of efficient automated content creation, procedural |
|
generation, leveraging modifiable parameters and rule-based systems, emerges as |
|
a promising approach. Nonetheless, it could be a demanding endeavor, given its |
|
intricate nature necessitating a deep understanding of rules, algorithms, and |
|
parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing |
|
large language models~(LLMs) for instruction-driven 3D modeling. 3D-GPT |
|
positions LLMs as proficient problem solvers, dissecting the procedural 3D |
|
modeling tasks into accessible segments and appointing the apt agent for each |
|
task. 3D-GPT integrates three core agents: the task dispatch agent, the |
|
conceptualization agent, and the modeling agent. They collaboratively achieve |
|
two objectives. First, it enhances concise initial scene descriptions, evolving |
|
them into detailed forms while dynamically adapting the text based on |
|
subsequent instructions. Second, it integrates procedural generation, |
|
extracting parameter values from enriched text to effortlessly interface with |
|
3D software for asset creation. Our empirical investigations confirm that |
|
3D-GPT not only interprets and executes instructions, delivering reliable |
|
results but also collaborates effectively with human designers. Furthermore, it |
|
seamlessly integrates with Blender, unlocking expanded manipulation |
|
possibilities. Our work highlights the potential of LLMs in 3D modeling, |
|
offering a basic framework for future advancements in scene generation and |
|
animation. |
|
|
|
--------------- |
|
|
|
### 04 Jul 2023 | [Embodied Task Planning with Large Language Models](https://arxiv.org/abs/2307.01848) | [⬇️](https://arxiv.org/pdf/2307.01848) |
|
*Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan* |
|
|
|
Equipping embodied agents with commonsense is important for robots to |
|
successfully complete complex human instructions in general environments. |
|
Recent large language models (LLM) can embed rich semantic knowledge for agents |
|
in plan generation of complex tasks, while they lack the information about the |
|
realistic world and usually yield infeasible action sequences. In this paper, |
|
we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning |
|
with physical scene constraint, where the agent generates executable plans |
|
according to the existed objects in the scene by aligning LLMs with the visual |
|
perception models. Specifically, we first construct a multimodal dataset |
|
containing triplets of indoor scenes, instructions and action plans, where we |
|
provide the designed prompts and the list of existing objects in the scene for |
|
GPT-3.5 to generate a large number of instructions and corresponding planned |
|
actions. The generated data is leveraged for grounded plan tuning of |
|
pre-trained LLMs. During inference, we discover the objects in the scene by |
|
extending open-vocabulary object detectors to multi-view RGB images collected |
|
in different achievable locations. Experimental results show that the generated |
|
plan from our TaPA framework can achieve higher success rate than LLaVA and |
|
GPT-3.5 by a sizable margin, which indicates the practicality of embodied task |
|
planning in general and complex environments. |
|
|
|
--------------- |
|
|
|
### 18 Jan 2023 | [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) | [⬇️](https://arxiv.org/pdf/2301.07584) |
|
*Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang* |
|
|
|
Recent advancements in vision-language pre-training (e.g. CLIP) have shown |
|
that vision models can benefit from language supervision. While many models |
|
using language modality have achieved great success on 2D vision tasks, the |
|
joint representation learning of 3D point cloud with text remains |
|
under-explored due to the difficulty of 3D-Text data pair acquisition and the |
|
irregularity of 3D data structure. In this paper, we propose a novel Text4Point |
|
framework to construct language-guided 3D point cloud models. The key idea is |
|
utilizing 2D images as a bridge to connect the point cloud and the language |
|
modalities. The proposed Text4Point follows the pre-training and fine-tuning |
|
paradigm. During the pre-training stage, we establish the correspondence of |
|
images and point clouds based on the readily available RGB-D data and use |
|
contrastive learning to align the image and point cloud representations. |
|
Together with the well-aligned image and text features achieved by CLIP, the |
|
point cloud features are implicitly aligned with the text embeddings. Further, |
|
we propose a Text Querying Module to integrate language information into 3D |
|
representation learning by querying text embeddings with point cloud features. |
|
For fine-tuning, the model learns task-specific 3D representations under |
|
informative language guidance from the label set without 2D images. Extensive |
|
experiments demonstrate that our model shows consistent improvement on various |
|
downstream tasks, such as point cloud semantic segmentation, instance |
|
segmentation, and object detection. The code will be available here: |
|
https://github.com/LeapLabTHU/Text4Point |
|
|
|
--------------- |
|
|
|
### 01 Feb 2024 | [Executable Code Actions Elicit Better LLM Agents](https://arxiv.org/abs/2402.01030) | [⬇️](https://arxiv.org/pdf/2402.01030) |
|
*Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji* |
|
|
|
Large Language Model (LLM) agents, capable of performing a broad range of |
|
actions, such as invoking tools and controlling robots, show great potential in |
|
tackling real-world challenges. LLM agents are typically prompted to produce |
|
actions by generating JSON or text in a pre-defined format, which is usually |
|
limited by constrained action space (e.g., the scope of pre-defined tools) and |
|
restricted flexibility (e.g., inability to compose multiple tools). This work |
|
proposes to use executable Python code to consolidate LLM agents' actions into |
|
a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct |
|
can execute code actions and dynamically revise prior actions or emit new |
|
actions upon new observations through multi-turn interactions. Our extensive |
|
analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that |
|
CodeAct outperforms widely used alternatives (up to 20% higher success rate). |
|
The encouraging performance of CodeAct motivates us to build an open-source LLM |
|
agent that interacts with environments by executing interpretable code and |
|
collaborates with users using natural language. To this end, we collect an |
|
instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn |
|
interactions using CodeAct. We show that it can be used with existing data to |
|
improve models in agent-oriented tasks without compromising their general |
|
capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with |
|
Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., |
|
model training) using existing libraries and autonomously self-debug. |
|
|
|
--------------- |
|
""") |