Spaces:

awacke1
/

Multi-Agent-Systems-MAS-Autonomous-Agents-Interacting

Runtime error

App Files Files Community

Multi-Agent-Systems-MAS-Autonomous-Agents-Interacting / app.py

awacke1

Update app.py

ca4aa68 verified 10 months ago

raw

history blame

16.5 kB

	import streamlit as st

	st.set_page_config(page_title="Multi Agent Systems", page_icon=":robot_face:", layout="wide")

	hide_streamlit_style = """
	<style>
	#MainMenu {visibility: hidden;}
	footer {visibility: hidden;}
	</style>
	"""
	st.markdown(hide_streamlit_style, unsafe_allow_html=True)

	col1, col2 = st.columns(2)

	with col1:
	st.markdown("## Autonomous agents interacting :robot_face: :robot_face:**")
	st.markdown("### Key Aspects :bulb:")
	st.markdown("""
	1. Interaction Protocol 🤝 \n
	- Define rules for communication and cooperation \n
	2. Decentralized Decision Making 🎯 \n
	- Autonomous agents make independent decisions \n
	3. Collaboration and Competition 🤼 \n
	- Agents work together or against each other \n
	""")

	with col2:
	st.markdown("### Entities :guards:")
	st.markdown("""
	1. Autonomous Agents 🤖 \n
	- Independent entities with decision-making capabilities \n
	2. Environment 🌐 \n
	- Shared space where agents interact \n
	3. Ruleset 📜 \n
	- Defines interaction protocol and decision-making processes \n
	""")

	st.markdown("---")

	st.markdown("## Interaction Protocol 🤝 :bulb:**")
	st.markdown("### Key Elements :guards:")
	st.markdown("""
	1. Communication 🗣 \n
	- Agents exchange information \n
	2. Cooperation 🤝 \n
	-# 🩺🔍 Search Results
	### 04 Dec 2023 \| [AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents](https://arxiv.org/abs/2311.17465) \| [⬇️](https://arxiv.org/pdf/2311.17465)
	Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang

	In this study, our goal is to create interactive avatar agents that can
	autonomously plan and animate nuanced facial movements realistically, from both
	visual and behavioral perspectives. Given high-level inputs about the
	environment and agent profile, our framework harnesses LLMs to produce a series
	of detailed text descriptions of the avatar agents' facial motions. These
	descriptions are then processed by our task-agnostic driving engine into motion
	token sequences, which are subsequently converted into continuous motion
	embeddings that are further consumed by our standalone neural-based renderer to
	generate the final photorealistic avatar animations. These streamlined
	processes allow our framework to adapt to a variety of non-verbal avatar
	interactions, both monadic and dyadic. Our extensive study, which includes
	experiments on both newly compiled and existing datasets featuring two types of
	agents -- one capable of monadic interaction with the environment, and the
	other designed for dyadic conversation -- validates the effectiveness and
	versatility of our approach. To our knowledge, we advanced a leap step by
	combining LLMs and neural rendering for generalized non-verbal prediction and
	photo-realistic rendering of avatar agents.

	---------------

	### 06 Jul 2023 \| [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) \| [⬇️](https://arxiv.org/pdf/2305.02677)
	Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao

	Controllable image captioning is an emerging multimodal topic that aims to
	describe the image with natural language following human purpose,
	$\textit{e.g.}$, looking at the specified regions or telling in a particular
	text style. State-of-the-art methods are trained on annotated pairs of input
	controls and output captions. However, the scarcity of such well-annotated
	multimodal data largely limits their usability and scalability for interactive
	AI systems. Leveraging unimodal instruction-following foundation models is a
	promising alternative that benefits from broader sources of data. In this
	paper, we present Caption AnyThing (CAT), a foundation model augmented image
	captioning framework supporting a wide range of multimodel controls: 1) visual
	controls, including points, boxes, and trajectories; 2) language controls, such
	as sentiment, length, language, and factuality. Powered by Segment Anything
	Model (SAM) and ChatGPT, we unify the visual and language prompts into a
	modularized framework, enabling the flexible combination between different
	controls. Extensive case studies demonstrate the user intention alignment
	capabilities of our framework, shedding light on effective user interaction
	modeling in vision-language applications. Our code is publicly available at
	https://github.com/ttengwang/Caption-Anything.

	---------------

	### 13 Jul 2023 \| [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) \| [⬇️](https://arxiv.org/pdf/2306.14824)
	Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei

	We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
	capabilities of perceiving object descriptions (e.g., bounding boxes) and
	grounding text to the visual world. Specifically, we represent refer
	expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where
	object descriptions are sequences of location tokens. Together with multimodal
	corpora, we construct large-scale data of grounded image-text pairs (called
	GrIT) to train the model. In addition to the existing capabilities of MLLMs
	(e.g., perceiving general modalities, following instructions, and performing
	in-context learning), Kosmos-2 integrates the grounding capability into
	downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
	including (i) multimodal grounding, such as referring expression comprehension,
	and phrase grounding, (ii) multimodal referring, such as referring expression
	generation, (iii) perception-language tasks, and (iv) language understanding
	and generation. This work lays out the foundation for the development of
	Embodiment AI and sheds light on the big convergence of language, multimodal
	perception, action, and world modeling, which is a key step toward artificial
	general intelligence. Code and pretrained models are available at
	https://aka.ms/kosmos-2.

	---------------

	### 19 Feb 2024 \| [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) \| [⬇️](https://arxiv.org/pdf/2402.04615)
	Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Crbune, Jason Lin, Jindong Chen, Abhanshu Sharma

	Screen user interfaces (UIs) and infographics, sharing similar visual
	language and design principles, play important roles in human communication and
	human-machine interaction. We introduce ScreenAI, a vision-language model that
	specializes in UI and infographics understanding. Our model improves upon the
	PaLI architecture with the flexible patching strategy of pix2struct and is
	trained on a unique mixture of datasets. At the heart of this mixture is a
	novel screen annotation task in which the model has to identify the type and
	location of UI elements. We use these text annotations to describe screens to
	Large Language Models and automatically generate question-answering (QA), UI
	navigation, and summarization training datasets at scale. We run ablation
	studies to demonstrate the impact of these design choices. At only 5B
	parameters, ScreenAI achieves new state-of-the-artresults on UI- and
	infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget
	Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and
	InfographicVQA) compared to models of similar size. Finally, we release three
	new datasets: one focused on the screen annotation task and two others focused
	on question answering.

	---------------

	### 23 Mar 2022 \| [ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues](https://arxiv.org/abs/2203.12751) \| [⬇️](https://arxiv.org/pdf/2203.12751)
	Monica S. Lam, Giovanni Campagna, Mehrad Moradshahi, Sina J. Semnani, Silei Xu

	Task-oriented conversational agents rely on semantic parsers to translate
	natural language to formal representations. In this paper, we propose the
	design and rationale of the ThingTalk formal representation, and how the design
	improves the development of transactional task-oriented agents.
	ThingTalk is built on four core principles: (1) representing user requests
	directly as executable statements, covering all the functionality of the agent,
	(2) representing dialogues formally and succinctly to support accurate
	contextual semantic parsing, (3) standardizing types and interfaces to maximize
	reuse between agents, and (4) allowing multiple, independently-developed agents
	to be composed in a single virtual assistant. ThingTalk is developed as part of
	the Genie Framework that allows developers to quickly build transactional
	agents given a database and APIs.
	We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST.
	Compared to the others, the ThingTalk design is both more general and more
	cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and
	associated tools yields a new state of the art accuracy of 79% turn-by-turn.

	---------------

	### 19 Oct 2023 \| [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/abs/2310.12945) \| [⬇️](https://arxiv.org/pdf/2310.12945)
	Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould

	In the pursuit of efficient automated content creation, procedural
	generation, leveraging modifiable parameters and rule-based systems, emerges as
	a promising approach. Nonetheless, it could be a demanding endeavor, given its
	intricate nature necessitating a deep understanding of rules, algorithms, and
	parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing
	large language models~(LLMs) for instruction-driven 3D modeling. 3D-GPT
	positions LLMs as proficient problem solvers, dissecting the procedural 3D
	modeling tasks into accessible segments and appointing the apt agent for each
	task. 3D-GPT integrates three core agents: the task dispatch agent, the
	conceptualization agent, and the modeling agent. They collaboratively achieve
	two objectives. First, it enhances concise initial scene descriptions, evolving
	them into detailed forms while dynamically adapting the text based on
	subsequent instructions. Second, it integrates procedural generation,
	extracting parameter values from enriched text to effortlessly interface with
	3D software for asset creation. Our empirical investigations confirm that
	3D-GPT not only interprets and executes instructions, delivering reliable
	results but also collaborates effectively with human designers. Furthermore, it
	seamlessly integrates with Blender, unlocking expanded manipulation
	possibilities. Our work highlights the potential of LLMs in 3D modeling,
	offering a basic framework for future advancements in scene generation and
	animation.

	---------------

	### 04 Jul 2023 \| [Embodied Task Planning with Large Language Models](https://arxiv.org/abs/2307.01848) \| [⬇️](https://arxiv.org/pdf/2307.01848)
	Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan

	Equipping embodied agents with commonsense is important for robots to
	successfully complete complex human instructions in general environments.
	Recent large language models (LLM) can embed rich semantic knowledge for agents
	in plan generation of complex tasks, while they lack the information about the
	realistic world and usually yield infeasible action sequences. In this paper,
	we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning
	with physical scene constraint, where the agent generates executable plans
	according to the existed objects in the scene by aligning LLMs with the visual
	perception models. Specifically, we first construct a multimodal dataset
	containing triplets of indoor scenes, instructions and action plans, where we
	provide the designed prompts and the list of existing objects in the scene for
	GPT-3.5 to generate a large number of instructions and corresponding planned
	actions. The generated data is leveraged for grounded plan tuning of
	pre-trained LLMs. During inference, we discover the objects in the scene by
	extending open-vocabulary object detectors to multi-view RGB images collected
	in different achievable locations. Experimental results show that the generated
	plan from our TaPA framework can achieve higher success rate than LLaVA and
	GPT-3.5 by a sizable margin, which indicates the practicality of embodied task
	planning in general and complex environments.

	---------------

	### 18 Jan 2023 \| [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) \| [⬇️](https://arxiv.org/pdf/2301.07584)
	Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang

	Recent advancements in vision-language pre-training (e.g. CLIP) have shown
	that vision models can benefit from language supervision. While many models
	using language modality have achieved great success on 2D vision tasks, the
	joint representation learning of 3D point cloud with text remains
	under-explored due to the difficulty of 3D-Text data pair acquisition and the
	irregularity of 3D data structure. In this paper, we propose a novel Text4Point
	framework to construct language-guided 3D point cloud models. The key idea is
	utilizing 2D images as a bridge to connect the point cloud and the language
	modalities. The proposed Text4Point follows the pre-training and fine-tuning
	paradigm. During the pre-training stage, we establish the correspondence of
	images and point clouds based on the readily available RGB-D data and use
	contrastive learning to align the image and point cloud representations.
	Together with the well-aligned image and text features achieved by CLIP, the
	point cloud features are implicitly aligned with the text embeddings. Further,
	we propose a Text Querying Module to integrate language information into 3D
	representation learning by querying text embeddings with point cloud features.
	For fine-tuning, the model learns task-specific 3D representations under
	informative language guidance from the label set without 2D images. Extensive
	experiments demonstrate that our model shows consistent improvement on various
	downstream tasks, such as point cloud semantic segmentation, instance
	segmentation, and object detection. The code will be available here:
	https://github.com/LeapLabTHU/Text4Point

	---------------

	### 01 Feb 2024 \| [Executable Code Actions Elicit Better LLM Agents](https://arxiv.org/abs/2402.01030) \| [⬇️](https://arxiv.org/pdf/2402.01030)
	Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji

	Large Language Model (LLM) agents, capable of performing a broad range of
	actions, such as invoking tools and controlling robots, show great potential in
	tackling real-world challenges. LLM agents are typically prompted to produce
	actions by generating JSON or text in a pre-defined format, which is usually
	limited by constrained action space (e.g., the scope of pre-defined tools) and
	restricted flexibility (e.g., inability to compose multiple tools). This work
	proposes to use executable Python code to consolidate LLM agents' actions into
	a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct
	can execute code actions and dynamically revise prior actions or emit new
	actions upon new observations through multi-turn interactions. Our extensive
	analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that
	CodeAct outperforms widely used alternatives (up to 20% higher success rate).
	The encouraging performance of CodeAct motivates us to build an open-source LLM
	agent that interacts with environments by executing interpretable code and
	collaborates with users using natural language. To this end, we collect an
	instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn
	interactions using CodeAct. We show that it can be used with existing data to
	improve models in agent-oriented tasks without compromising their general
	capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with
	Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
	model training) using existing libraries and autonomously self-debug.

	---------------
	""")