import streamlit as st
st.set_page_config(page_title="Multi Agent Systems", page_icon=":robot_face:", layout="wide")
hide_streamlit_style = """
<style>
#MainMenu {visibility: hidden;}
footer {visibility: hidden;}
</style>
"""
st.markdown(hide_streamlit_style, unsafe_allow_html=True)
col1, col2 = st.columns(2)
with col1:
st.markdown("## **Autonomous agents interacting** :robot_face: :robot_face:**")
st.markdown("### **Key Aspects** :bulb:")
st.markdown("""
1. **Interaction Protocol** 🤝 \n
- Define rules for communication and cooperation \n
2. **Decentralized Decision Making** 🎯 \n
- Autonomous agents make independent decisions \n
3. **Collaboration and Competition** 🤼 \n
- Agents work together or against each other \n
""")
with col2:
st.markdown("### **Entities** :guards:")
st.markdown("""
1. **Autonomous Agents** 🤖 \n
- Independent entities with decision-making capabilities \n
2. **Environment** 🌐 \n
- Shared space where agents interact \n
3. **Ruleset** 📜 \n
- Defines interaction protocol and decision-making processes \n
""")
st.markdown("---")
st.markdown("## **Interaction Protocol** 🤝 :bulb:**")
st.markdown("### **Key Elements** :guards:")
st.markdown("""
1. **Communication** 🗣 \n
- Agents exchange information \n
2. **Cooperation** 🤝 \n
- Agents coordinate toward shared goals \n

# 🩺🔍 Search Results
### 04 Dec 2023 | [AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents](https://arxiv.org/abs/2311.17465) | [⬇️](https://arxiv.org/pdf/2311.17465)
*Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang*
In this study, our goal is to create interactive avatar agents that can
autonomously plan and animate nuanced facial movements realistically, from both
visual and behavioral perspectives. Given high-level inputs about the
environment and agent profile, our framework harnesses LLMs to produce a series
of detailed text descriptions of the avatar agents' facial motions. These
descriptions are then processed by our task-agnostic driving engine into motion
token sequences, which are subsequently converted into continuous motion
embeddings that are further consumed by our standalone neural-based renderer to
generate the final photorealistic avatar animations. These streamlined
processes allow our framework to adapt to a variety of non-verbal avatar
interactions, both monadic and dyadic. Our extensive study, which includes
experiments on both newly compiled and existing datasets featuring two types of
agents -- one capable of monadic interaction with the environment, and the
other designed for dyadic conversation -- validates the effectiveness and
versatility of our approach. To our knowledge, we advanced a leap step by
combining LLMs and neural rendering for generalized non-verbal prediction and
photo-realistic rendering of avatar agents.
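
A rough sketch of the staged pipeline described above (LLM description, motion tokens, motion embeddings, rendering), with every stage stubbed; the function names and return values are assumptions for illustration, not the paper's interfaces.

```python
def describe_motion(profile, context):
    # LLM step: high-level inputs -> detailed facial-motion text.
    return "slow nod, brief smile while listening"

def drive(description):
    # Task-agnostic driving engine: text -> discrete motion tokens.
    return [hash(word) % 512 for word in description.split()]

def embed(tokens):
    # Motion tokens -> continuous motion embeddings.
    return [[t / 512.0] for t in tokens]

def render(embeddings):
    # Standalone neural renderer: embeddings -> frames (stubbed as a count).
    return f"{len(embeddings)} frames"

print(render(embed(drive(describe_motion("avatar-A", "dyadic chat")))))
```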
---------------
### 06 Jul 2023 | [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) | [⬇️](https://arxiv.org/pdf/2305.02677)
*Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao*
Controllable image captioning is an emerging multimodal topic that aims to
describe the image with natural language following human purpose,
*e.g.*, looking at the specified regions or telling in a particular
text style. State-of-the-art methods are trained on annotated pairs of input
controls and output captions. However, the scarcity of such well-annotated
multimodal data largely limits their usability and scalability for interactive
AI systems. Leveraging unimodal instruction-following foundation models is a
promising alternative that benefits from broader sources of data. In this
paper, we present Caption AnyThing (CAT), a foundation model augmented image
captioning framework supporting a wide range of multimodal controls: 1) visual
controls, including points, boxes, and trajectories; 2) language controls, such
as sentiment, length, language, and factuality. Powered by Segment Anything
Model (SAM) and ChatGPT, we unify the visual and language prompts into a
modularized framework, enabling the flexible combination between different
controls. Extensive case studies demonstrate the user intention alignment
capabilities of our framework, shedding light on effective user interaction
modeling in vision-language applications. Our code is publicly available at
https://github.com/ttengwang/Caption-Anything.
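
The modular combination of visual and language controls can be pictured with two stubs, one standing in for a segmenter such as SAM and one for an LLM-based style rewriter; the functions below are illustrative assumptions, not the project's actual API.

```python
def segment(image, point):
    # Visual control: a point prompt selects a region and a raw description.
    return {"box": (10, 20, 110, 140), "label": "a brown dog"}

def rewrite(raw_caption, sentiment="neutral"):
    # Language control: adjust the style of the region-level caption.
    prefix = {"positive": "A lovely", "neutral": "A", "negative": "A scruffy"}
    return prefix[sentiment] + " " + raw_caption.removeprefix("a ")

region = segment(image=None, point=(64, 80))
print(rewrite(region["label"], sentiment="positive"))  # A lovely brown dog
```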
---------------
### 13 Jul 2023 | [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) | [⬇️](https://arxiv.org/pdf/2306.14824)
*Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei*
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and
grounding text to the visual world. Specifically, we represent refer
expressions as links in Markdown, i.e., `[text span](bounding boxes)`, where
object descriptions are sequences of location tokens. Together with multimodal
corpora, we construct large-scale data of grounded image-text pairs (called
GrIT) to train the model. In addition to the existing capabilities of MLLMs
(e.g., perceiving general modalities, following instructions, and performing
in-context learning), Kosmos-2 integrates the grounding capability into
downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
including (i) multimodal grounding, such as referring expression comprehension,
and phrase grounding, (ii) multimodal referring, such as referring expression
generation, (iii) perception-language tasks, and (iv) language understanding
and generation. This work lays out the foundation for the development of
Embodiment AI and sheds light on the big convergence of language, multimodal
perception, action, and world modeling, which is a key step toward artificial
general intelligence. Code and pretrained models are available at
https://aka.ms/kosmos-2.
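
The grounded-Markdown idea can be illustrated with a toy string in the `[text span](bounding boxes)` style and a small parser; real Kosmos-2 encodes boxes as sequences of discrete location tokens, which is simplified to plain coordinates here.

```python
import re

grounded = "A [snowman](0.32,0.18,0.61,0.79) next to a [campfire](0.05,0.55,0.30,0.90)"

# Pull out (text span, box) pairs from the link-style markup.
pairs = re.findall(r"[[](.+?)[]][(](.+?)[)]", grounded)
for span, box in pairs:
    coords = tuple(float(v) for v in box.split(","))
    print(span, coords)
# snowman (0.32, 0.18, 0.61, 0.79)
# campfire (0.05, 0.55, 0.3, 0.9)
```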
---------------
### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
*Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma*
Screen user interfaces (UIs) and infographics, sharing similar visual
language and design principles, play important roles in human communication and
human-machine interaction. We introduce ScreenAI, a vision-language model that
specializes in UI and infographics understanding. Our model improves upon the
PaLI architecture with the flexible patching strategy of pix2struct and is
trained on a unique mixture of datasets. At the heart of this mixture is a
novel screen annotation task in which the model has to identify the type and
location of UI elements. We use these text annotations to describe screens to
Large Language Models and automatically generate question-answering (QA), UI
navigation, and summarization training datasets at scale. We run ablation
studies to demonstrate the impact of these design choices. At only 5B
parameters, ScreenAI achieves new state-of-the-art results on UI- and
infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget
Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and
InfographicVQA) compared to models of similar size. Finally, we release three
new datasets: one focused on the screen annotation task and two others focused
on question answering.
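
The data-generation step can be sketched as turning a screen annotation (element types, text, and locations) into a prompt for an LLM that writes QA pairs; the annotation schema and prompt wording below are assumptions, not the paper's actual format.

```python
annotation = [
    {"type": "IMAGE", "text": "product photo", "box": (0.05, 0.10, 0.60, 0.55)},
    {"type": "TEXT", "text": "Total: $42.99", "box": (0.70, 0.80, 0.95, 0.86)},
    {"type": "BUTTON", "text": "Checkout", "box": (0.70, 0.88, 0.95, 0.94)},
]

# Describe the screen in plain text, then ask an LLM for grounded QA pairs.
elements = "; ".join(f"{e['type']} {e['box']} {e['text']}" for e in annotation)
prompt = ("Screen elements: " + elements +
          ". Write three question-answer pairs grounded in these elements.")
print(prompt)
```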
---------------
### 23 Mar 2022 | [ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues](https://arxiv.org/abs/2203.12751) | [⬇️](https://arxiv.org/pdf/2203.12751)
*Monica S. Lam, Giovanni Campagna, Mehrad Moradshahi, Sina J. Semnani, Silei Xu*
Task-oriented conversational agents rely on semantic parsers to translate
natural language to formal representations. In this paper, we propose the
design and rationale of the ThingTalk formal representation, and how the design
improves the development of transactional task-oriented agents.
ThingTalk is built on four core principles: (1) representing user requests
directly as executable statements, covering all the functionality of the agent,
(2) representing dialogues formally and succinctly to support accurate
contextual semantic parsing, (3) standardizing types and interfaces to maximize
reuse between agents, and (4) allowing multiple, independently-developed agents
to be composed in a single virtual assistant. ThingTalk is developed as part of
the Genie Framework that allows developers to quickly build transactional
agents given a database and APIs.
We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST.
Compared to the others, the ThingTalk design is both more general and more
cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and
associated tools yields a new state-of-the-art accuracy of 79% turn-by-turn.
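
The first principle, representing user requests directly as executable statements, can be pictured with a small Python analogue; this is not ThingTalk syntax, just a hypothetical structure to convey the idea.

```python
from dataclasses import dataclass, field

@dataclass
class Statement:
    query: str                        # skill to call, e.g. "restaurants.search"
    filters: dict = field(default_factory=dict)
    action: str = ""                  # optional follow-up transaction

    def execute(self, apis):
        results = apis[self.query](**self.filters)
        if self.action and results:
            return apis[self.action](results[0])
        return results

apis = {
    "restaurants.search": lambda cuisine: [cuisine + " place #1"],
    "restaurants.book": lambda r: "booked " + r,
}
stmt = Statement("restaurants.search", {"cuisine": "thai"}, "restaurants.book")
print(stmt.execute(apis))  # booked thai place #1
```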
---------------
### 19 Oct 2023 | [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/abs/2310.12945) | [⬇️](https://arxiv.org/pdf/2310.12945)
*Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould*
In the pursuit of efficient automated content creation, procedural
generation, leveraging modifiable parameters and rule-based systems, emerges as
a promising approach. Nonetheless, it could be a demanding endeavor, given its
intricate nature necessitating a deep understanding of rules, algorithms, and
parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing
large language models (LLMs) for instruction-driven 3D modeling. 3D-GPT
positions LLMs as proficient problem solvers, dissecting the procedural 3D
modeling tasks into accessible segments and appointing the apt agent for each
task. 3D-GPT integrates three core agents: the task dispatch agent, the
conceptualization agent, and the modeling agent. They collaboratively achieve
two objectives. First, it enhances concise initial scene descriptions, evolving
them into detailed forms while dynamically adapting the text based on
subsequent instructions. Second, it integrates procedural generation,
extracting parameter values from enriched text to effortlessly interface with
3D software for asset creation. Our empirical investigations confirm that
3D-GPT not only interprets and executes instructions, delivering reliable
results but also collaborates effectively with human designers. Furthermore, it
seamlessly integrates with Blender, unlocking expanded manipulation
possibilities. Our work highlights the potential of LLMs in 3D modeling,
offering a basic framework for future advancements in scene generation and
animation.
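
The three-agent split can be sketched as plain functions that hand work to each other; the routing and the extracted parameters below are illustrative assumptions, with the real system backing each role with an LLM and emitting inputs for Blender's procedural generators.

```python
def conceptualization_agent(brief):
    # Enrich a terse scene description into a detailed one.
    return brief + ", at dusk, with scattered rocks and tall dry grass"

def modeling_agent(description):
    # Extract procedural parameters from the enriched text (stubbed heuristics).
    return {"lighting": "dusk", "rock_count": 12, "grass_height": 0.4}

def task_dispatch_agent(brief):
    # Route the request through the other two agents.
    description = conceptualization_agent(brief)
    return description, modeling_agent(description)

desc, params = task_dispatch_agent("a desert scene")
print(desc)
print(params)  # values a procedural 3D tool could consume
```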
---------------
### 04 Jul 2023 | [Embodied Task Planning with Large Language Models](https://arxiv.org/abs/2307.01848) | [⬇️](https://arxiv.org/pdf/2307.01848)
*Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan*
Equipping embodied agents with commonsense is important for robots to
successfully complete complex human instructions in general environments.
Recent large language models (LLM) can embed rich semantic knowledge for agents
in plan generation of complex tasks, while they lack the information about the
realistic world and usually yield infeasible action sequences. In this paper,
we propose a TAsk Planning Agent (TaPA) in embodied tasks for grounded planning
with physical scene constraints, where the agent generates executable plans
according to the existing objects in the scene by aligning LLMs with the visual
perception models. Specifically, we first construct a multimodal dataset
containing triplets of indoor scenes, instructions and action plans, where we
provide the designed prompts and the list of existing objects in the scene for
GPT-3.5 to generate a large number of instructions and corresponding planned
actions. The generated data is leveraged for grounded plan tuning of
pre-trained LLMs. During inference, we discover the objects in the scene by
extending open-vocabulary object detectors to multi-view RGB images collected
in different achievable locations. Experimental results show that the generated
plan from our TaPA framework can achieve higher success rate than LLaVA and
GPT-3.5 by a sizable margin, which indicates the practicality of embodied task
planning in general and complex environments.
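
The grounding step, keeping only plan actions whose target object was actually detected in the scene, can be sketched with stubbed detector and planner functions; names and outputs are assumptions for illustration.

```python
def detect_objects(multi_view_images):
    # Open-vocabulary detection across views (stubbed).
    return {"mug", "kettle", "sink", "table"}

def propose_plan(instruction):
    # LLM-style plan proposal (stubbed); one step mentions an absent object.
    return [("pick_up", "mug"), ("fill", "coffee_machine"), ("place", "table")]

def ground_plan(instruction, images):
    scene_objects = detect_objects(images)
    return [step for step in propose_plan(instruction) if step[1] in scene_objects]

print(ground_plan("make coffee", images=[]))
# [('pick_up', 'mug'), ('place', 'table')]
```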
---------------
### 18 Jan 2023 | [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) | [⬇️](https://arxiv.org/pdf/2301.07584)
*Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang*
Recent advancements in vision-language pre-training (e.g. CLIP) have shown
that vision models can benefit from language supervision. While many models
using language modality have achieved great success on 2D vision tasks, the
joint representation learning of 3D point cloud with text remains
under-explored due to the difficulty of 3D-Text data pair acquisition and the
irregularity of 3D data structure. In this paper, we propose a novel Text4Point
framework to construct language-guided 3D point cloud models. The key idea is
utilizing 2D images as a bridge to connect the point cloud and the language
modalities. The proposed Text4Point follows the pre-training and fine-tuning
paradigm. During the pre-training stage, we establish the correspondence of
images and point clouds based on the readily available RGB-D data and use
contrastive learning to align the image and point cloud representations.
Together with the well-aligned image and text features achieved by CLIP, the
point cloud features are implicitly aligned with the text embeddings. Further,
we propose a Text Querying Module to integrate language information into 3D
representation learning by querying text embeddings with point cloud features.
For fine-tuning, the model learns task-specific 3D representations under
informative language guidance from the label set without 2D images. Extensive
experiments demonstrate that our model shows consistent improvement on various
downstream tasks, such as point cloud semantic segmentation, instance
segmentation, and object detection. The code will be available here:
https://github.com/LeapLabTHU/Text4Point
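
The contrastive alignment between paired image and point-cloud embeddings can be sketched with a symmetric InfoNCE-style loss in PyTorch; batch size, embedding dimension, temperature, and the random stand-in encoders are all assumptions.

```python
import torch
import torch.nn.functional as F

img = F.normalize(torch.randn(8, 256), dim=-1)   # image encoder output (stub)
pts = F.normalize(torch.randn(8, 256), dim=-1)   # point-cloud encoder output (stub)

logits = img @ pts.t() / 0.07                    # cosine similarity / temperature
targets = torch.arange(8)                        # matching pairs on the diagonal
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```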
---------------
### 01 Feb 2024 | [Executable Code Actions Elicit Better LLM Agents](https://arxiv.org/abs/2402.01030) | [⬇️](https://arxiv.org/pdf/2402.01030)
*Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji*
Large Language Model (LLM) agents, capable of performing a broad range of
actions, such as invoking tools and controlling robots, show great potential in
tackling real-world challenges. LLM agents are typically prompted to produce
actions by generating JSON or text in a pre-defined format, which is usually
limited by constrained action space (e.g., the scope of pre-defined tools) and
restricted flexibility (e.g., inability to compose multiple tools). This work
proposes to use executable Python code to consolidate LLM agents' actions into
a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct
can execute code actions and dynamically revise prior actions or emit new
actions upon new observations through multi-turn interactions. Our extensive
analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that
CodeAct outperforms widely used alternatives (up to 20% higher success rate).
The encouraging performance of CodeAct motivates us to build an open-source LLM
agent that interacts with environments by executing interpretable code and
collaborates with users using natural language. To this end, we collect an
instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn
interactions using CodeAct. We show that it can be used with existing data to
improve models in agent-oriented tasks without compromising their general
capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with
Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
model training) using existing libraries and autonomously self-debug.
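
A toy CodeAct-style loop can make the idea concrete: a stubbed policy emits Python source as its action, the interpreter executes it, and the captured output becomes the next observation. There is no sandboxing here; a real agent must isolate execution.

```python
import contextlib
import io

def llm_policy(observation, turn):
    # Stub for the LLM: returns Python code as the agent's action.
    if turn == 0:
        return "nums = [3, 1, 4, 1, 5]; print(sorted(nums))"
    return f"print('done, previous output was {observation.strip()}')"

observation, state = "", {}
for turn in range(2):
    code_action = llm_policy(observation, turn)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code_action, state)       # run the code action
    observation = buffer.getvalue()    # feed printed output back as observation
    print("turn", turn, "->", observation.strip())
```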
---------------
""")